# Stable Conformal Prediction Sets

Eugene Ndiaye¹

**Abstract.** When one observes a sequence of variables $(x_1, y_1), \ldots, (x_n, y_n)$, Conformal Prediction (CP) is a methodology that allows one to estimate a confidence set for $y_{n+1}$ given $x_{n+1}$ by merely assuming that the distribution of the data is exchangeable. CP sets have guaranteed coverage for any finite population size $n$. While appealing, the computation of such a set turns out to be infeasible in general, e.g., when the unknown variable $y_{n+1}$ is continuous. The bottleneck is that it is based on a procedure that refits a prediction model on data where we replace the unknown target by all its possible values, in order to select the most probable ones. This requires computing an infinite number of models, which often makes it intractable. In this paper, we combine CP techniques with classical algorithmic stability bounds to derive a prediction set computable with a single model fit. We demonstrate that our proposed confidence set does not lose any coverage guarantee while avoiding the need for data splitting as currently done in the literature. We provide some numerical experiments to illustrate the tightness of our estimation when the sample size is sufficiently large, on both synthetic and real datasets.

¹H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. Correspondence to: Eugene Ndiaye.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## 1. Introduction

Modern machine learning algorithms can predict the label of an object based on its observed characteristics with impressive accuracy. They are often trained on historical datasets sampled from the same distribution, and it is important to quantify the uncertainty of their predictions. Conformal prediction is a versatile and simple method introduced in (Vovk et al., 2005; Shafer & Vovk, 2008) that provides a finite-sample and distribution-free $100(1 - \alpha)\%$ confidence region on the predicted object based on past observations. The main idea can be summarized as a hypothesis test between

$$H_0 : y_{n+1} = z \quad \text{versus} \quad H_1 : y_{n+1} \neq z,$$

where $z$ is any replacement candidate for the unknown response $y_{n+1}$. The conformal prediction set consists of the collection of candidates whose tests are not rejected. The construction of a p-value function is simple. We start by fitting a model on the training set $\{(x_1, y_1), \ldots, (x_n, y_n), (x_{n+1}, z)\}$ and sort the prediction scores/errors for each instance in ascending order. A candidate $z$ will be considered conformal or typical if the rank of its score is sufficiently small compared to the others. The key assumption is that the predictive model and the joint probability distribution of the sequence $\{(x_i, y_i)\}_{i=1}^{n+1}$ are invariant w.r.t. permutations of the data. As a consequence, the ranks of the scores are equally likely and thus follow a uniform distribution, which allows one to calibrate a threshold on the rank statistic, leading to a valid confidence set. This method has a strong coverage guarantee without any further assumptions on the distribution and is valid for any finite sample size $n$; see more details in Section 2.
Conformal prediction techniques have been applied for designing uncertainty sets in active learning (Ho & Wechsler, 2008), anomaly detection (Laxhammar & Falkman, 2015; Bates et al., 2021), few-shot learning (Fisch et al., 2021), time series (Chernozhukov et al., 2018; Xu & Xie, 2021; Chernozhukov et al., 2021), robust optimization (Johnstone & Cox, 2021), or to infer performance guarantees for statistical learning algorithms (Holland, 2020; Cella & Ryan, 2020). Currently, we are seeing a growing interest in these approaches due to their flexibility and ease of deployment, even for very complex problems where classical approaches offer limited performance (Efron, 2021). We refer to (Balasubramanian et al., 2014) for other AI applications.

Despite its nice properties, the computation of conformal prediction sets requires fitting a model on a new augmented dataset where the unknown quantity $y_{n+1}$ is replaced by a set of candidates. In a regression setting where the target can take uncountably many values, the set of candidates is infinite. Therefore, computing the conformal prediction set is infeasible without additional structural assumptions on the underlying model fit, and even so, the current computational costs remain very high. Hence the prevailing recommendation to use less efficient data-splitting methods.

**Contribution.** We leverage algorithmic stability to bound the variation of the predictive model w.r.t. changes in the input data. This circumvents the computational bottleneck induced by the refit of the model each time we want to assess the typicalness of a candidate replacement of the target variable. As such, we can provide a tight estimation of the confidence sets without loss in the coverage guarantee. Our method is computationally and statistically efficient since it requires only a single model fit and does not involve any data splitting.

**Notation.** For a nonzero integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. The dataset of size $n$ is denoted $D_n = (x_i, y_i)_{i \in [n]}$, and the row-wise input feature matrix is $X = [x_1, \ldots, x_n, x_{n+1}]^\top$. Given a set $\{u_1, \ldots, u_n\}$, the rank of $u_j$ for $j \in [n]$ is defined as

$$\mathrm{Rank}(u_j) = \sum_{i=1}^{n} \mathbb{1}_{u_i \leq u_j}.$$

We denote by $u_{(i)}$ the $i$-th order statistic.

## 2. Conformal Prediction

Conformal prediction (Vovk et al., 2005) is a framework for constructing online confidence sets, with the remarkable properties of being distribution-free, having a finite-sample coverage guarantee, and being adaptable to any estimator under mild assumptions. We recall the arguments in (Shafer & Vovk, 2008; Lei et al., 2018) to construct a conformity/typicalness function based on rank statistics that yields distribution-free inference methods. The main tool is that the rank of one variable among an exchangeable and identically distributed sequence follows a (sub)-uniform distribution (Bröcker & Kantz, 2011).

**Lemma 2.1.** Let $U_1, \ldots, U_n, U_{n+1}$ be an exchangeable and identically distributed sequence of random variables. Then for any $\alpha \in (0, 1)$, we have

$$\mathbb{P}\big(\mathrm{Rank}(U_{n+1}) \leq \lceil (n + 1)(1 - \alpha) \rceil\big) \geq 1 - \alpha.$$

We recall that $y_{n+1}$ is the unknown target variable. We introduce a learning problem with the augmented training data $D_{n+1}(z) := D_n \cup \{(x_{n+1}, z)\}$ for $z \in \mathbb{R}$ and with the augmented vector of labels $y(z) = (y_1, \ldots, y_n, z)$:

$$\beta(z) \in \arg\min_{\beta \in \mathbb{R}^p} L(y(z), \Phi(X, \beta)) + \Omega(\beta), \tag{1}$$

where $\Phi$ is a feature map and, for any parameter $\beta \in \mathbb{R}^p$,

$$\Phi(X, \beta) = [\Phi(x_1, \beta), \ldots, \Phi(x_{n+1}, \beta)]^\top \in \mathbb{R}^{n+1}.$$
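To make this concrete, here is a minimal sketch of the augmented fit in Equation (1), using ridge regression as an illustrative instance; scikit-learn is assumed, and the data and all variable names are illustrative, not part of the method itself.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Minimal sketch of the augmented fit in Equation (1): for each candidate z,
# the unknown target y_{n+1} is replaced by z and the model is refitted on
# the augmented dataset D_{n+1}(z). Ridge is an illustrative choice of model.
rng = np.random.default_rng(0)
n, p = 50, 5
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
x_new = rng.standard_normal(p)  # the feature vector x_{n+1}

def fit_augmented(z, lam=0.5):
    """Return the model beta(z) fitted on D_{n+1}(z)."""
    X_aug = np.vstack([X, x_new])   # append x_{n+1} to the design matrix
    y_aug = np.append(y, z)         # augmented labels y(z) = (y_1, ..., y_n, z)
    return Ridge(alpha=lam).fit(X_aug, y_aug)

# The fitted model, and hence the scores E_i(z), vary with the candidate z.
for z in (-1.0, 0.0, 1.0):
    print(z, fit_augmented(z).predict(x_new.reshape(1, -1))[0])
```

Full conformal prediction repeats exactly this refit for every candidate $z$, which is the computational bottleneck discussed below.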
Given an input feature vector $x$, the prediction of its output/label adjusted on the augmented data can be defined as

$$\mu_z(x) := \Phi(x, \beta(z)).$$

For example, in the case of empirical risk minimization, we have

$$L(y(z), \Phi(X, \beta)) = \sum_{i=1}^{n} \ell(y_i, \Phi(x_i, \beta)) + \ell(z, \Phi(x_{n+1}, \beta)).$$

There are many examples of cost functions in the literature. A popular example is power norm regression, where $\ell(a, b) = |a - b|^q$. When $q = 2$, this corresponds to classical linear regression. The cases where $q \in [1, 2)$ are frequent in robust statistics, where the case $q = 1$ is known as least absolute deviation. The log-cosh loss $\ell(a, b) = \gamma \log(\cosh((a - b)/\gamma))$ is a differentiable alternative to the $\ell_\infty$ norm (Chebyshev approximation). One can also use the Linex loss (Gruber, 2010; Chang & Hung, 2007), which provides an asymmetric loss function $\ell(a, b) = \exp(\gamma(a - b)) - \gamma(a - b) - 1$, for $\gamma \neq 0$. Any convex regularization function $\Omega$, e.g., Ridge (Hoerl & Kennard, 1970) or sparsity-inducing norms (Bach et al., 2012), can be considered. Also, the feature map $\Phi$ can be parameterized and learned à la neural network. Equation (1) includes many modern formulations of statistical learning estimators. The only requirement on these is to be invariant with respect to permutations of the data; this leaves a very large degree of freedom in their choice. For example, $\beta(z)$ can be the output of an iterative algorithm, e.g., proximal gradient descent with early stopping.

Let us define the conformity measure for $D_{n+1}(z)$ as

$$\forall i \in [n], \quad E_i(z) = S(y_i, \mu_z(x_i)), \tag{2}$$
$$E_{n+1}(z) = S(z, \mu_z(x_{n+1})), \tag{3}$$

where $S$ is a real-valued score function, e.g., in a linear regression problem one can take $S(a, b) = |a - b|$. The main idea for constructing a conformal confidence set is to consider the typicalness/conformity of a candidate point $z$, measured as

$$\pi(z) := 1 - \frac{1}{n + 1} \mathrm{Rank}(E_{n+1}(z)). \tag{4}$$

The conformal set gathers all the real values $z$ such that $\pi(z) \geq \alpha$, which holds if and only if the score $E_{n+1}(z)$ is ranked no higher than $\lceil (n + 1)(1 - \alpha) \rceil$ among $\{E_i(z)\}_{i \in [n+1]}$, i.e.,

$$\Gamma^{(\alpha)}(x_{n+1}) := \{z \in \mathbb{R} : \pi(z) \geq \alpha\}. \tag{5}$$

A direct application of Lemma 2.1 to $U_i = E_i(y_{n+1})$ reads $\mathbb{P}(\pi(y_{n+1}) \leq \alpha) \leq \alpha$, i.e., the random variable $\pi(y_{n+1})$ takes small values with small probability, and this yields the coverage guarantee

$$\mathbb{P}\big(y_{n+1} \in \Gamma^{(\alpha)}(x_{n+1})\big) \geq 1 - \alpha.$$

Figure 1 illustrates the candidates selected for inclusion in the confidence set as the most likely variables.

[Figure 1. Illustration of a conformal prediction set with a confidence level of 0.9, i.e., $\alpha = 0.1$. The plot shows the conformity $\pi(z)$ against the candidate $z$, together with the target $y_{n+1}$ and the level $\alpha$; the ends of the CP set are indicated by the red cross.]
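For intuition, the following brute-force sketch evaluates the conformity test on a finite grid of candidates; the grid discretization only approximates the set in Equation (5), and the ridge model and grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Brute-force sketch of the conformal set in Equation (5) on a finite grid:
# a candidate z is kept iff Rank(E_{n+1}(z)) <= ceil((n + 1)(1 - alpha)).
rng = np.random.default_rng(0)
n, p, alpha = 50, 5, 0.1
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
x_new = rng.standard_normal(p)

def keep(z):
    X_aug, y_aug = np.vstack([X, x_new]), np.append(y, z)
    model = Ridge(alpha=0.5).fit(X_aug, y_aug)     # one refit per candidate!
    scores = np.abs(y_aug - model.predict(X_aug))  # E_i(z), absolute score
    rank = np.sum(scores <= scores[-1])            # rank of E_{n+1}(z)
    return rank <= np.ceil((n + 1) * (1 - alpha))

z_grid = np.linspace(y.min(), y.max(), 200)
kept = z_grid[[keep(z) for z in z_grid]]
print("approximate conformal set:", (kept.min(), kept.max()))
```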
### 2.1. Computational Limitations and Previous Works

For regression problems where $y_{n+1}$ lies in a subset of $\mathbb{R}$, obtaining the conformal set $\Gamma^{(\alpha)}(x_{n+1})$ in Equation (5) is computationally challenging. It requires refitting the prediction model $\beta(z)$ for infinitely many candidates $z$ in order to compute the map of conformity measures, such as $z \mapsto E_i(z) = |y_i - x_i^\top \beta(z)|$. Except for a few examples, the computation of a conformal prediction set is infeasible in general. We describe below some successful computational strategies while pointing out their potential shortcomings.

In Ridge regression, for any $x \in \mathbb{R}^p$, $z \mapsto x^\top \beta(z)$ is a linear function of $z$, implying that $E_i(z)$ is piecewise linear. Exploiting this fact, an exact conformal set $\Gamma^{(\alpha)}(x_{n+1})$ for Ridge regression was efficiently constructed in (Nouretdinov et al., 2001). Similarly, using the piecewise linearity w.r.t. the sparsity level of the Lasso path provided by the Lars algorithm (Efron et al., 2004), (Hebiri, 2010) builds a sequence of conformal sets for the Lasso associated to the transition points of the Lars on the observed data $D_n$. Nevertheless, such a procedure breaks the proof technique for the coverage guarantee, as the exchangeability of the sequence $(E_i(y_{n+1}))_{i \in [n+1]}$ is not necessarily maintained. However, a slight adaptation can fix this problem. Indeed, using the piecewise linearity in $z$ of the Lasso solution, (Lei, 2019) proposed a piecewise linear homotopy, under mild assumptions, when a single input sample point is perturbed. This allows one to compute the whole solution path $z \mapsto \beta(z)$ and successfully provides a conformal set for the Lasso and the Elastic Net. These approaches are however limited to the quadratic loss function. Later, (Ndiaye & Takeuchi, 2019) proposed an adaptation using an approximate solution path (Ndiaye et al., 2019) instead of the exact one. This results in a careful discretization of the set of candidates, restricted to a preselected compact interval $[z_{\min}, z_{\max}]$. Assuming that the optimization problem in Equation (1) is convex and that the loss function is smooth, this leads to a computational complexity of $O(1/\sqrt{\epsilon})$, where $\epsilon > 0$ is a prescribed optimization error. All these previous methods are at best restricted to convex optimization formulations.

A different road consists in merely assuming that the conformal set $\Gamma^{(\alpha)}(x_{n+1})$ in Equation (5) is itself a bounded interval. As such, its endpoints can be estimated by approximating the roots of the function $z \mapsto \pi(z) - \alpha$. Under slight additional assumptions, a direct bisection search can then compute a conformal set with a complexity of $O(\log_2(1/\epsilon_r))$ (Ndiaye & Takeuchi, 2021), where $\epsilon_r > 0$ is the tolerance error w.r.t. the exact root.

Cross-conformal predictors were initially introduced in their one-split version in (Papadopoulos et al., 2002). The idea is to separate the data into two independent parts, fit the model on one part, and rank the scores on the other part, where Lemma 2.1 remains applicable and thus preserves the coverage guarantee. Although this approach avoids the computational bottleneck by requiring only one model fit, the statistical efficiency may be reduced due to the much smaller sample sizes available during the training and calibration phases. In general, the proportion of the training set to the calibration set is a hyperparameter that requires appropriate tuning: a small calibration set leads to highly variable conformity scores, and a small training set leads to a poor model fit. Such a trade-off is very common in machine learning and often appears in the debate between bias reduction and variance reduction. It is often decided by the cross-validation method with several folds (Arlot & Celisse, 2010).
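A minimal sketch of this one-split strategy follows; the ridge model, the split proportion, and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of split conformal prediction: fit on the first m points, calibrate
# on the remaining n - m, then return a closed-form interval at x_{n+1}.
rng = np.random.default_rng(0)
n, p, alpha = 200, 5, 0.1
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
x_new = rng.standard_normal(p)

m = n // 2                                   # train/calibration trade-off
model = Ridge(alpha=0.5).fit(X[:m], y[:m])   # single model fit
cal_scores = np.abs(y[m:] - model.predict(X[m:]))
k = int(np.ceil((n - m + 1) * (1 - alpha)))  # rank threshold from Lemma 2.1
q = np.sort(cal_scores)[k - 1]               # k-th smallest calibration score
center = model.predict(x_new.reshape(1, -1))[0]
print("split CP interval:", (center - q, center + q))
```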
Cross-conformal predictors (Vovk, 2015) follow the same ideas and exploit the full dataset for calibration and significant proportions of it for training the model. The dataset is partitioned into $K$ folds, and one computes a split conformal set by sequentially taking the $k$-th fold as calibration set and the remaining folds as training set, for $k \in \{1, \ldots, K\}$. The leave-one-out, aka Jackknife, CP set requires $K = n$ model fits, which is prohibitive even when $n$ is moderately large. On the other hand, the $K$-fold version requires only $K$ model fits, but comes at the cost of fitting on a smaller sample size, leads to an additional excess coverage of $O(\sqrt{2/n})$, and requires a subtle aggregation of the different p-values obtained; see (Carlsson et al., 2014; Linusson et al., 2017). (Barber et al., 2021) showed that the confidence level attained is $1 - 2\alpha$ instead of $1 - \alpha$, and that the target coverage $1 - \alpha$ can only be reached approximately under an additional stability assumption.

Although these recent advances have drastically improved the tractability of the computations, in practice multiple refits of the model are required. This remains very expensive, especially for complex models: imagine having to retrain a neural network from scratch ten or twenty times to get a reasonable estimate. In this paper, we show that a single model fit is enough to tightly approximate the conformal set when the underlying model fitting procedure is stable.

## 3. Approximation via Algorithmic Stability

Section 2 guarantees that $\pi(y_{n+1}) \geq \alpha$ with high probability. Therefore, since $y_{n+1}$ is unknown, the conformal set simply selects all $z$ that satisfy the same inequality, i.e., $\Gamma^{(\alpha)}(x_{n+1}) = \{z : \pi(z) \geq \alpha\}$. This leads to fitting a new model for every $z$. Here, we take a different strategy. The main remark is that only one element of the dataset changes at a time; hence, under mild stability assumptions, one can expect that the model prediction will not change drastically. Instead of inverting $\pi(\cdot)$, we will bound it with quantities that do not depend on the model fit $\mu_z$ for any $z$.

**Definition 3.1 (Algorithmic Stability).** A prediction function $\mu$ is stable if, for any observed feature $x_i$, $i \in [n + 1]$, we have

$$|S(q, \mu_z(x_i)) - S(q, \mu_{z_0}(x_i))| \leq \tau_i \quad \forall z, z_0, q \in \mathbb{R}. \tag{6}$$

In the literature, it is common to make assumptions about the stability of a predictive model to obtain upper bounds on its generalization error and thus ensure that it does not overfit the training data, e.g., (Bousquet & Elisseeff, 2002). Although the conformal prediction framework applies even when the underlying model is not stable, we show that this additional assumption allows for efficient evaluation of confidence sets. We also add that in cases where the generalization capabilities of the model are poor, the size of the confidence intervals can become very large, even unbounded, and not at all informative.

**Proposition 3.2.** Assume that the model fit $\mu$ is stable as in Definition 3.1. Then, for all $z, \hat{z}$, we have

$$\pi_{\mathrm{lo}}(z, \hat{z}) \leq \pi(z) \leq \pi_{\mathrm{up}}(z, \hat{z}),$$

$$\pi_{\mathrm{lo}}(z, \hat{z}) := 1 - \frac{1}{n + 1} \sum_{i=1}^{n+1} \mathbb{1}_{L_i(z, \hat{z}) \leq U_{n+1}(z, \hat{z})}, \qquad \pi_{\mathrm{up}}(z, \hat{z}) := 1 - \frac{1}{n + 1} \sum_{i=1}^{n+1} \mathbb{1}_{U_i(z, \hat{z}) \leq L_{n+1}(z, \hat{z})},$$

where we define, for any index $i$ in $[n]$,

$$L_i(z, \hat{z}) = E_i(\hat{z}) - \tau_i, \qquad U_i(z, \hat{z}) = E_i(\hat{z}) + \tau_i,$$
$$L_{n+1}(z, \hat{z}) = S(z, \mu_{\hat{z}}(x_{n+1})) - \tau_{n+1}, \qquad U_{n+1}(z, \hat{z}) = S(z, \mu_{\hat{z}}(x_{n+1})) + \tau_{n+1}.$$

*Proof.* By stability, for any $q$, we have

$$|S(q, \mu_z(x_i)) - S(q, \mu_{\hat{z}}(x_i))| \leq \tau_i.$$

Applying this inequality to $q = y_i$ for any index $i$ in $[n]$, and to $q = z$ for $i = n + 1$, we have $L_i(z, \hat{z}) \leq E_i(z) \leq U_i(z, \hat{z})$, and it holds that

$$U_i(z, \hat{z}) \leq L_{n+1}(z, \hat{z}) \implies E_i(z) \leq E_{n+1}(z) \implies L_i(z, \hat{z}) \leq U_{n+1}(z, \hat{z}).$$

Taking the indicators of the corresponding sets, we obtain the result.

A direct consequence of Proposition 3.2 is that the exact conformal set can be sandwiched as follows.

**Corollary 3.3 (Stable Conformal Sets).** Under the assumptions of Proposition 3.2, the conformal prediction set is lower and upper approximated as

$$\Gamma^{(\alpha)}_{\mathrm{lo}}(x_{n+1}) \subseteq \Gamma^{(\alpha)}(x_{n+1}) \subseteq \Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}),$$

$$\Gamma^{(\alpha)}_{\mathrm{lo}}(x_{n+1}) = \{z : \pi_{\mathrm{lo}}(z, \hat{z}) \geq \alpha\}, \qquad \Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}) = \{z : \pi_{\mathrm{up}}(z, \hat{z}) \geq \alpha\}.$$
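The bounds of Proposition 3.2 involve only quantities available from the single fit at $\hat{z}$, as the following sketch for the absolute score illustrates; the inputs `E_hat`, `mu_hat`, and `tau` are assumed precomputed, and all names are illustrative.

```python
import numpy as np

# Sketch of Proposition 3.2 for the absolute score S(a, b) = |a - b|.
# E_hat[i] = E_i(z_hat) for i in [n], mu_hat = mu_{z_hat}(x_{n+1}), and
# tau[:n], tau[n] are the stability bounds; all come from one fit at z_hat.
def pi_bounds(z, E_hat, mu_hat, tau):
    """Return (pi_lo(z, z_hat), pi_up(z, z_hat)), assuming tau[n] > 0."""
    n = len(E_hat)
    L, U = E_hat - tau[:n], E_hat + tau[:n]   # L_i, U_i: independent of z
    L_new = abs(z - mu_hat) - tau[n]          # L_{n+1}(z, z_hat)
    U_new = abs(z - mu_hat) + tau[n]          # U_{n+1}(z, z_hat)
    # The (n+1)-th term contributes 1 to pi_lo's sum and 0 to pi_up's sum.
    pi_lo = 1 - (1 + np.sum(L <= U_new)) / (n + 1)
    pi_up = 1 - np.sum(U <= L_new) / (n + 1)
    return pi_lo, pi_up
```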
Since our proposal arises from a combination of the conformal prediction sets with a correction from the stability bounds, we call the resulting (upper) confidence set $\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1})$ stab CP, for stable conformal set. By construction, it contains the exact confidence set $\Gamma^{(\alpha)}(x_{n+1})$ and therefore enjoys at least the same statistical benefits, as displayed in the following result.

**Proposition 3.4 (Coverage guarantee).** Assume that the model fit $\mu$ is stable as in Definition 3.1. Then the stab CP set is an upper envelope of the exact conformal prediction set in Equation (5) and is thus valid, i.e.,

$$\mathbb{P}\big(y_{n+1} \in \Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1})\big) \geq 1 - \alpha.$$

As promised in the abstract, our proposed method suffers no loss of statistical coverage, requires only one model fit on the data augmented with an arbitrary candidate point $\hat{z}$, and fully uses all the data (no splitting). Thus we benefit both from statistical efficiency, with a smaller confidence interval as in the case of exact computation, and we completely remove the computational difficulty, as in the case of splitting methods. To our knowledge, there is no equivalent method that can benefit from such a double performance.

### 3.1. Practical Computation of stab CP Sets

By construction, the computation of stable conformal sets is equivalent to collecting all $z$ such that $\pi_{\mathrm{up}}(z, \hat{z}) \geq \alpha$. Let us begin by noting that $U_{n+1}(z, \hat{z}) > L_{n+1}(z, \hat{z})$ when $\tau_{n+1} > 0$, which we will assume for simplicity. We have

$$\pi_{\mathrm{up}}(z, \hat{z}) \geq \alpha \iff \sum_{i=1}^{n} \mathbb{1}_{U_i(z, \hat{z}) \leq L_{n+1}(z, \hat{z})} \leq (1 - \alpha)(n + 1).$$

This means that a candidate $z$ is selected if at most $(1 - \alpha)(n + 1)$ elements of $\{U_i(z, \hat{z})\}_{i \in [n]}$ are smaller than $L_{n+1}(z, \hat{z})$, which is equivalent to¹

$$L_{n+1}(z, \hat{z}) \leq U_{(\lceil (1 - \alpha)(n + 1) \rceil)}(z, \hat{z}) =: Q_{1-\alpha}(\hat{z}).$$

Hence, we can conclude that $\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}) = \{z : S(z, \mu_{\hat{z}}(x_{n+1})) \leq Q_{1-\alpha}(\hat{z}) + \tau_{n+1}\}$. For the absolute value score, it reduces to the interval

$$\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}) = \big[\mu_{\hat{z}}(x_{n+1}) \pm (Q_{1-\alpha}(\hat{z}) + \tau_{n+1})\big].$$

For the sake of clarity, we summarize the computations for this simplest case in Algorithm 1 and discuss the generalization in the appendix. In general terms, stab CP sets are convex sets when the score function $z \mapsto S(z, \mu_{\hat{z}}(x_{n+1}))$ has convex level sets. This suggests that our strategy will also facilitate the computations in cases where the target $y_{n+1}$ is multi-dimensional.

¹For $i \in [n]$, $U_i(z, \hat{z})$ and $L_i(z, \hat{z})$ are independent of $z$.

[Figure 2. Illustration of the evolution of the conformity function as a function of the sample size: panels (a)-(c) show the conformity bounds of a candidate $z$ for growing $n$ (up to $n = 300$); panel (d) shows the convergence of the approximation gap, comparing $\sup_{z \in Z} \mathrm{Gap}(z, 0)$ and its average over $Z$ with $\max_{i \in [n+1]} \tau_i$ as $n$ grows up to 10000; panels (e) and (f) show the batch approximation $\pi_{\mathrm{up}}(z, \hat{z})$ for $\hat{z} \in \{y_{\min}, (y_{\min} + y_{\max})/2, \mu_0(x_{n+1}), y_{n+1}, y_{\max}\}$ on Diabetes (442, 10) and a synthetic dataset (30, 100). The underlying model fit is $\beta(z) \in \arg\min_{\beta \in \mathbb{R}^p} \|y(z) - X\beta\|_1 / (n + 1) + \lambda \|\beta\|^2$ where $y(z) = (y_1, \ldots, y_n, z)$, and we use the sklearn synthetic dataset make_regression(n, p=100, noise=1). We fixed $\hat{z} = 0$ and $\lambda = 0.5$. The set $Z$ is a linear grid in the interval $[y_{(1)}, y_{(n)}]$.]

**Algorithm 1: Stable Conformal Prediction Set**
- **Input:** data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and $x_{n+1}$; coverage level $\alpha \in (0, 1)$; any estimate $\hat{z} \in \mathbb{R}$; stability bounds $\tau_1, \ldots, \tau_{n+1}$ of the learning algorithm.
- **Output:** prediction interval at $x_{n+1}$.
- Fit a model $\mu_{\hat{z}}$ on the training data $D_{n+1}(\hat{z})$.
- Compute the quantile $Q_{1-\alpha}(\hat{z}) = U_{(\lceil (1 - \alpha)(n + 1) \rceil)}(z, \hat{z})$, where the $U_i$ are defined in Proposition 3.2.
- **Return:** $\big[\mu_{\hat{z}}(x_{n+1}) \pm (Q_{1-\alpha}(\hat{z}) + \tau_{n+1})\big]$.
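A minimal end-to-end sketch of Algorithm 1 for the absolute score is given below, with ridge as an illustrative learner and the stability bounds `tau` assumed given (e.g., from Section 3.3).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of Algorithm 1: one fit at an arbitrary z_hat, then a closed-form
# interval. The array tau holds the stability bounds tau_1, ..., tau_{n+1}.
def stab_cp_interval(X, y, x_new, tau, alpha=0.1, z_hat=0.0, lam=0.5):
    n = len(y)
    X_aug, y_aug = np.vstack([X, x_new]), np.append(y, z_hat)
    model = Ridge(alpha=lam).fit(X_aug, y_aug)   # the single model fit
    E_hat = np.abs(y - model.predict(X))         # E_i(z_hat), i in [n]
    k = int(np.ceil((1 - alpha) * (n + 1)))
    # Q_{1-alpha}(z_hat): k-th order statistic of the U_i
    # (clipped at n for the edge case k = n + 1, where the set is unbounded).
    q = np.sort(E_hat + tau[:n])[min(k, n) - 1]
    center = model.predict(x_new.reshape(1, -1))[0]   # mu_{z_hat}(x_{n+1})
    return center - (q + tau[n]), center + (q + tau[n])
```

With $\tau \equiv 0$, the returned interval reduces to a plain plug-in interval without the guarantee of Proposition 3.4; the stability terms are what restore validity.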
### 3.2. Batch Approximation

The stable conformal set requires a single model fit $\mu_{\hat{z}}$ for an arbitrary candidate $\hat{z}$. The approximation gaps are computable as

$$\max\{\pi(z) - \pi_{\mathrm{lo}}(z, \hat{z}), \; \pi_{\mathrm{up}}(z, \hat{z}) - \pi(z)\} \leq \mathrm{Gap}(z, \hat{z}), \quad \text{where } \mathrm{Gap}(z, \hat{z}) := \pi_{\mathrm{up}}(z, \hat{z}) - \pi_{\mathrm{lo}}(z, \hat{z}).$$

Since the above upper and lower bounds hold for any $\hat{z}$, tighter approximations are obtained with a batch of candidates $Z = \{\hat{z}_1, \ldots, \hat{z}_d\}$ as

$$\pi_{\mathrm{up}}(z, Z) = \inf_{\hat{z} \in Z} \pi_{\mathrm{up}}(z, \hat{z}) \quad \text{and} \quad \pi_{\mathrm{lo}}(z, Z) = \sup_{\hat{z} \in Z} \pi_{\mathrm{lo}}(z, \hat{z});$$

a minimal sketch of this batch refinement is given at the end of this subsection. Another possibility is to build an interpolation of $z \mapsto \mu_z(\cdot)$ based on query points $\hat{z}_1, \ldots, \hat{z}_d \in (z_{\min}, z_{\max}) \subset \mathbb{R}$. For example, one can consider as predictive model the following piecewise linear interpolation:

$$\bar{\mu}_z = \begin{cases} \dfrac{\hat{z}_1 - z}{\hat{z}_1 - z_{\min}} \mu_{z_{\min}} + \dfrac{z - z_{\min}}{\hat{z}_1 - z_{\min}} \mu_{\hat{z}_1} & \text{if } z \in [z_{\min}, \hat{z}_1], \\[4pt] \dfrac{\hat{z}_{t+1} - z}{\hat{z}_{t+1} - \hat{z}_t} \mu_{\hat{z}_t} + \dfrac{z - \hat{z}_t}{\hat{z}_{t+1} - \hat{z}_t} \mu_{\hat{z}_{t+1}} & \text{if } z \in [\hat{z}_t, \hat{z}_{t+1}], \\[4pt] \dfrac{z_{\max} - z}{z_{\max} - \hat{z}_d} \mu_{\hat{z}_d} + \dfrac{z - \hat{z}_d}{z_{\max} - \hat{z}_d} \mu_{z_{\max}} & \text{if } z \in [\hat{z}_d, z_{\max}]. \end{cases}$$

An important point is that, by using the stability bound, the coverage guarantee of the interpolated conformal set is preserved without the need for the expensive symmetrization proposed in (Ndiaye & Takeuchi, 2021). Such techniques are most relevant when the sample size is small or when precise estimates of the stability bounds are not available. The corresponding conformity function is defined in the same way as the previous versions, where we simply plug in the interpolated model. We refer to the appendix for more details.

**Remark 3.5 (Categorical Variables).** In this article, we have essentially limited ourselves to regression problems, which in general pose intractable computational difficulties. However, the methods remain applicable for classification problems where the set of candidates can only take a finite number of values in $C := \{c_1, \ldots, c_m\}$. In this case, an additional precaution of encoding the categories as real numbers is necessary. Considering the leave-one-out score function, our proposal is therefore an alternative to the approximations via influence functions used in (Alaa & Schaar, 2020; Abad et al., 2022) when an exact computation (Cherubin et al., 2021) would be unusable or too costly.
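Here is the announced sketch of the batch refinement; the helper `pi_bounds_at` is a stand-in for the single-anchor bounds of Proposition 3.2 (e.g., the `pi_bounds` sketch above, evaluated after one fit per anchor).

```python
# Batch refinement of Section 3.2: each anchor z_hat in Z yields valid bounds,
# so the pointwise infimum/supremum over the batch is tighter and still valid.
def pi_up_batch(z, anchors, pi_bounds_at):
    return min(pi_bounds_at(z, z_hat)[1] for z_hat in anchors)

def pi_lo_batch(z, anchors, pi_bounds_at):
    return max(pi_bounds_at(z, z_hat)[0] for z_hat in anchors)
```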
### 3.3. Stability Bounds

In this section, we recall some stability bounds. The proof techniques rely on regularity assumptions on the function to be minimized and are relatively standard in optimization (Shalev-Shwartz & Ben-David, 2014, Chapter 13). Stability is a widely used assumption to provide generalization bounds for machine learning algorithms (Bousquet & Elisseeff, 2002; Hardt et al., 2016). We emphasize that the notion of stability we require here concerns the variation of the score, not of the loss function in the optimization objective. However, the ideas for establishing the stability bounds are essentially the same, and we recall the core strategies here for the sake of completeness. Let us start with the unregularized model where $\Omega = 0$, i.e.,

$$\beta(z) \in \arg\min_{\beta \in \mathbb{R}^p} L(y(z), \Phi(X, \beta)) = F_z(\Phi(X, \beta)). \tag{7}$$

**Definition 3.6.** A function $f$ is $\lambda$-strongly convex if for any $w_0, w$ and $\varsigma \in (0, 1)$,

$$f(\varsigma w_0 + (1 - \varsigma) w) \leq \varsigma f(w_0) + (1 - \varsigma) f(w) - \frac{\lambda}{2} \varsigma (1 - \varsigma) \|w_0 - w\|^2.$$

**Proposition 3.7.** Assume that for any $z$, $F_z$ is $\lambda$-strongly convex and $\rho$-Lipschitz. It holds that

$$\|\mu_z(X) - \mu_{z_0}(X)\| \leq \frac{2\rho}{\lambda}.$$

*Proof.* By optimality of $\beta(z)$, we have

$$F_z(\Phi(X, \beta(z))) \leq F_z(\Phi(X, \beta)) \quad \forall \beta. \tag{8}$$

We apply the optimality condition and the strong convexity of the function $F_z$ to the vectors $w_0 = \Phi(X, \beta(z_0)) = \mu_{z_0}(X)$ and $w = \Phi(X, \beta(z)) = \mu_z(X)$; it holds that

$$0 \overset{(8)}{\leq} \frac{F_z(\varsigma w_0 + (1 - \varsigma) w) - F_z(w)}{\varsigma} \overset{(3.6)}{\leq} F_z(w_0) - F_z(w) - \frac{\lambda}{2} (1 - \varsigma) \|w_0 - w\|^2.$$

Letting $\varsigma \to 0$ and using that $F_z$ is $\rho$-Lipschitz, we have

$$\frac{\lambda}{2} \|w_0 - w\|^2 \leq F_z(w_0) - F_z(w) \leq \rho \|w - w_0\|.$$

Therefore, $\frac{\lambda}{2} \|w_0 - w\| \leq \rho$, hence the result.

Proposition 3.7 does not assume that the optimization problem in Equation (7) is convex in the model parameter $\beta$. We can now easily deduce a stability bound in the sense of Definition 3.1.

**Corollary 3.8.** If the score function $S(q, \cdot)$ is $\gamma$-Lipschitz for any $q$, then

$$\tau_i = \frac{2\gamma\rho}{\lambda}, \quad i \in [n + 1].$$

When the loss function is not strongly convex, it is known that adding a strongly convex regularizer can stabilize the algorithm (Shalev-Shwartz & Ben-David, 2014, Chapter 13). The proof technique is similar to the previous one, with the difference that now the bound is on the arg min of the optimization problem and not on the predictions of the model. This requires stronger assumptions.

**Proposition 3.9.** Assume that the optimization problem in Equation (1) is convex and that $\Omega$ is $\lambda$-strongly convex. If the loss $L$ is convex and $\rho$-Lipschitz, then

$$\|\beta(z) - \beta(z_0)\| \leq \frac{2\rho}{\lambda}.$$

When the loss function $L$ is convex and $\nu$-smooth with $\nu < \lambda$, and $L(y(z), \mu_z(X)) \leq C$ for any $z$, then

$$\|\beta(z) - \beta(z_0)\| \leq \frac{2\sqrt{2\nu C}}{\lambda - \nu}.$$

These optimization error bounds also imply the following stability bounds.

**Corollary 3.10.** Assume that the score function $S(q, \cdot)$ is $\gamma$-Lipschitz for any $q$, and that the prediction model $\mu_\cdot(x) := \Phi(x, \beta(\cdot))$ satisfies, for any $x \in \mathbb{R}^p$ and $z, z_0 \in \mathbb{R}$,

$$|\mu_z(x) - \mu_{z_0}(x)| \leq L_\Phi |x^\top \beta(z) - x^\top \beta(z_0)|.$$

If the loss is $\rho$-Lipschitz, then

$$\tau_i = \frac{2\gamma\rho L_\Phi \|x_i\|}{\lambda}, \quad i \in [n + 1].$$

If the loss is $\nu$-smooth with $\nu < \lambda$ and bounded by $C$, then

$$\tau_i = \frac{2\gamma L_\Phi \|x_i\| \sqrt{2\nu C}}{\lambda - \nu}, \quad i \in [n + 1].$$

A one-line sketch of the first bound is given at the end of this subsection. Another way to obtain such regularized bounds is to leverage duality. A smoothness assumption in the primal space translates into a strong-concavity assumption in the dual space (Hiriart-Urruty & Lemaréchal, 1993, Theorem 4.2.2, p. 83). The dual formulation (Rockafellar, 1997, Chapter 31) of Equation (1) reads

$$\theta(z) \in \arg\max_{\theta \in \mathbb{R}^{n+1}} -L^*(y(z), \theta) - \Omega^*(X^\top \theta), \tag{9}$$

where, given a proper, closed and convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, we denote its Fenchel-Legendre transform by $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, defined by $f^*(x^*) = \sup_{x \in \mathrm{dom} f} \langle x^*, x \rangle - f(x)$ with $\mathrm{dom} f = \{x \in \mathbb{R}^n : f(x) < +\infty\}$. Let $P_z$ and $D_z$ denote the primal and dual objective functions. We have the following classical error bounds for the dual optimization problem. If the loss function $L$ is $\nu$-smooth, then $L^*$ is $1/\nu$-strongly convex, and for $(\beta, \theta) \in \mathrm{dom} P_z \times \mathrm{dom} D_z$ we have

$$\frac{1}{2\nu} \|\theta(z) - \theta\|^2 \leq D_z(\theta(z)) - D_z(\theta) = P_z(\beta(z)) - D_z(\theta) \leq \mathrm{DualityGap}_z(\beta, \theta),$$

where the equality follows from strong duality, and we recall from weak duality that the duality gap upper bounds the optimization error as follows:

$$\mathrm{DualityGap}_z(\beta, \theta) := P_z(\beta) - D_z(\theta) \geq P_z(\beta) - P_z(\beta(z)).$$

This readily leads to several possible bounds: if the dual function $D_z(\cdot)$ is $\rho^*$-Lipschitz for any $z$, then $\|\theta(z) - \theta(z_0)\| \leq 2\nu\rho^*$; if the duality gap can be assumed to be bounded by $C$ for any $z \in [z_{\min}, z_{\max}]$, then $\|\theta(z) - \theta(z_0)\| \leq \sqrt{2\nu C}$. We obtain stability bounds when one uses the dual solution (which is a function of the residual) as a conformity score, $S(y(z), \mu_z(X)) = |\theta(z)|$, where the absolute value is taken coordinate-wise. For example, such dual-based score functions were used in (Ndiaye & Takeuchi, 2019).
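As announced, the first bound of Corollary 3.10 can be computed in one line. The constants `gamma`, `rho`, and `L_phi` below are placeholders that must be derived for the score, loss, and feature map at hand (for the absolute score and a linear feature map, one may take $\gamma = L_\Phi = 1$).

```python
import numpy as np

# Sketch of the first bound in Corollary 3.10:
# tau_i = 2 * gamma * rho * L_phi * ||x_i|| / lambda for i in [n + 1].
# gamma: Lipschitz constant of the score, rho: of the loss, L_phi: of the
# feature map, lam: strong convexity of the regularizer. All placeholders.
def stability_bounds(X_aug, lam, gamma=1.0, rho=1.0, L_phi=1.0):
    return 2 * gamma * rho * L_phi * np.linalg.norm(X_aug, axis=1) / lam
```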
**Remark 3.11 (Bound on the loss).** The boundedness assumption that we make on the loss function is not rigorously satisfied in general, and some adaptations are necessary. For simplicity, let us consider that $\Phi(x, 0) = 0$ and $\Omega(0) = 0$. Using the optimality of $\beta(z)$, we obtain, for any candidate $z$,

$$L(y(z), \mu_z(X)) \leq L(y(z), \mu_z(X)) + \Omega(\beta(z)) \leq L(y(z), 0).$$

Unfortunately, for common examples such as least squares, the right-hand side is unbounded. Nevertheless, since the data are assumed to be exchangeable, we have

$$\mathbb{P}\big(y_{n+1} \in [y_{(1)}, y_{(n)}]\big) \geq 1 - \frac{2}{n + 1}.$$

Hence it is reasonable to restrict the range of candidates to $z \in [y_{(1)}, y_{(n)}]$, which implies

$$L(y(z), \mu_z(X)) \leq \sup_{z \in [y_{(1)}, y_{(n)}]} L(y(z), 0) =: C.$$

## 4. Numerical Experiments

We conduct all the experiments with a coverage level of 0.9, i.e., $\alpha = 0.1$. For comparisons, we run the evaluations on 100 repetitions of the examples and display the average of the following performance statistics for the different methods: the empirical coverage, i.e., the percentage of times the prediction set contains the held-out target $y_{n+1}$; the length of the confidence intervals; and the execution time. We compare our proposed method stab CP with the conformal prediction set computed with an oracle method defined below, with a splitting strategy split CP (Papadopoulos et al., 2002; Lei et al., 2018), and finally with an estimation of the $\alpha$-level set of the conformity function, root CP (Ndiaye & Takeuchi, 2021), obtained by root-finding solvers. Note that, when the conformal set is a bounded interval, stab CP approximates root CP, as in Figure 2. In all the experiments conducted, we observed that the exact conformal prediction set is indeed an interval. Although this is often the case, we recall that it might not be in general. For the sake of comparison, we therefore estimated the stab CP sets with a root-finding solver as well, as if a closed-form solution were not available. A Python package with our implementation is available at https://github.com/EugeneNdiaye/stable_conformal_prediction, where additional numerical experiments (e.g., using large pre-trained neural nets) and benchmarks will be provided.

[Figure 3. Benchmarking conformal sets for least absolute deviation regression models with a ridge regularization on real datasets: (a) Boston (506, 13), (b) Diabetes (442, 10), (c) Housingcalifornia (20640, 8), (d) Friedman1 (500, 100). We display the lengths of the confidence sets over 100 random permutations of the data. We denote by cov the average coverage and by T the average computational time normalized by the average time for computing oracle CP, which requires a single full-data model fit. All four methods (oracle CP, split CP, root CP, stab CP) attain essentially the same average coverage on each dataset (0.86 to 0.9); on Housingcalifornia, root CP has a normalized time of T = 19.23.]

The full and exact CP set can always be approximated with a fine (costly) grid discretization of the output space and can then be used as a default baseline. Here, it is represented by root CP since, in the examples displayed, the full CP set turns out to be an interval; then root CP equals the full CP set up to $\epsilon_r$ digit precision on the decimals. We used a default value of $\epsilon_r = 10^{-4}$.

**oracle CP.** To define an oracle prediction set as a reference, we follow (Ndiaye & Takeuchi, 2019; 2021) and assume that the unavailable target variable $y_{n+1}$ is observed by the algorithm.
Hence, we define the oracle scores

$$\forall i \in [n], \quad E_i^{\mathrm{or}} = S(y_i, \mu_{y_{n+1}}(x_i)), \qquad E_{n+1}^{\mathrm{or}}(z) = S(z, \mu_{y_{n+1}}(x_{n+1})),$$

and the oracle conformal set as

$$\Gamma^{(\alpha)}_{\mathrm{oracle}}(x_{n+1}) := \{z : \pi_{\mathrm{oracle}}(z) \geq \alpha\}, \qquad \pi_{\mathrm{oracle}}(z) = 1 - \frac{1}{n + 1} \sum_{i=1}^{n+1} \mathbb{1}_{E_i^{\mathrm{or}} \leq E_{n+1}^{\mathrm{or}}(z)}.$$

**split CP.** A popular and classical estimation of conformal prediction sets relies on splitting the dataset. The split conformal prediction set introduced in (Papadopoulos et al., 2002) separates the model fitting and calibration steps. Let us define the training set $D_{\mathrm{tr}} = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $m < n$, and the calibration set $D_{\mathrm{cal}} = \{(x_{m+1}, y_{m+1}), \ldots, (x_n, y_n)\}$. The model is fitted on the training set $D_{\mathrm{tr}}$ to get $\mu_{\mathrm{tr}}(\cdot)$, and the score function is defined on the calibration set $D_{\mathrm{cal}}$:

$$\forall i \in [m + 1, n], \quad E_i^{\mathrm{cal}} = S(y_i, \mu_{\mathrm{tr}}(x_i)), \qquad E_{n+1}^{\mathrm{cal}}(z) = S(z, \mu_{\mathrm{tr}}(x_{n+1})).$$

Thus, we obtain the split conformal set as

$$\Gamma^{(\alpha)}_{\mathrm{split}}(x_{n+1}) = \{z : \pi_{\mathrm{split}}(z) \geq \alpha\}, \qquad \pi_{\mathrm{split}}(z) = 1 - \frac{1}{n - m + 1} \sum_{i=m+1}^{n+1} \mathbb{1}_{E_i^{\mathrm{cal}} \leq E_{n+1}^{\mathrm{cal}}(z)}.$$

## 5. Discussion

The data-splitting approach does not use all the data in the training phase. It is often less statistically efficient, and its interval length can vary greatly depending on the additional randomness of the split. On the contrary, our approach does not use any splitting and provides an approximation of the exact conformal set whose accuracy depends on the stability of the model, as can be observed in Figure 3. All this requires one and only one fit of the underlying learning model. Note that split CP and stab CP have the same structure and are simple intervals whenever the score function is reasonably simple: the data splitting in the former is replaced by an additional stability term in the latter. So if the predictive model is very stable, stab CP benefits from all the data and needs very little regularization to get close to the oracle version that includes the unknown target $y_{n+1}$. To date, we are not aware of any other method that can obtain a full conformal prediction set with such computational efficiency while ensuring no loss on the coverage guarantee.

We observe on the benchmarks with real data in Figure 3 that stab CP is often very similar to root CP, which approximates the exact set with very fine precision (under the assumption that the latter is a bounded interval). Our proposal has the net advantage of being twenty to thirty times faster and can often be computed in closed form. However, as can be seen in Figure 2, our proposed method loses precision when the sample size is small. This reflects the difficulty of estimating a reliable confidence set in the absence of algorithmic stability; at the same time, it is difficult to have an algorithm that generalizes well with so little training data. Otherwise, when the dataset is large, the influence of the stability bounds is barely felt, since they are often of the order of magnitude $O(1/n)$.

Finally, a notable limitation is that one needs to know the stability bounds explicitly. These can be difficult to estimate for some models. The bounds we presented in Section 3.3 cover a wide range of examples and can be complemented by the bounds displayed in (Hardt et al., 2016; Bassily et al., 2020; Lei et al., 2021; Klochkov & Zhivotovskiy, 2021) for stochastic gradient descent. Even if the notion of stability required here is slightly different, any error bound on the estimator can be naturally converted into a stability bound for conformal prediction sets.
So we do not lose much generality as long as we assume that the score function is sufficiently regular, e.g., Lipschitz. This is precisely what allowed us to obtain the bounds presented in this article. Yet, if the parameter of the predictive model is defined iteratively by a gradient descent process on a non-convex objective function, obtaining stability bounds becomes quite delicate. Moreover, the Lipschitz constants of neural network objectives can be poorly estimated. In this case, our approach could not be applied safely, or could lead to uninformative confidence intervals. The splitting strategy remains more flexible. It would be interesting to study fine combinations of data splitting and stability bounds to reduce the size and variance of the confidence intervals while being less dependent on explicit stability bounds.

**Acknowledgements.** We warmly thank the reviewers for their insightful comments and contributions to improve the presentation of this paper. We also thank Elvis Dohmatob and Xiaoming Huo for proofreading and for pointing out mistakes in notations.

## References

Abad, J., Bhatt, U., Weller, A., and Cherubin, G. Approximating full conformal prediction at scale via influence functions. arXiv preprint arXiv:2202.01315, 2022.

Alaa, A. and Schaar, M. V. D. Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. International Conference on Machine Learning, 2020.

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 2010.

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012.

Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Elsevier, 2014.

Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. Predictive inference with the jackknife+. The Annals of Statistics, 2021.

Bassily, R., Feldman, V., Guzmán, C., and Talwar, K. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 2020.

Bates, S., Candès, E., Lei, L., Romano, Y., and Sesia, M. Testing for outliers with conformal p-values. arXiv preprint arXiv:2104.08279, 2021.

Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2002.

Bröcker, J. and Kantz, H. The concept of exchangeability in ensemble forecasting. Nonlinear Processes in Geophysics, 2011.

Carlsson, L., Eklund, M., and Norinder, U. Aggregated conformal prediction. IFIP International Conference on Artificial Intelligence Applications and Innovations, 2014.

Cella, L. and Ryan, R. Valid distribution-free inferential models for prediction. arXiv preprint arXiv:2001.09225, 2020.

Chang, Y.-C. and Hung, W.-L. Linex loss functions with applications to determining the optimum process parameters. Quality & Quantity, 2007.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. Exact and robust conformal inference methods for predictive machine learning with dependent data. Conference On Learning Theory, 2018.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 2021.

Cherubin, G., Chatzikokolakis, K., and Jaggi, M. Exact optimization of conformal predictors via incremental and decremental learning. International Conference on Machine Learning, 2021.

Efron, B. Resampling plans and the estimation of prediction error.
Stats, 2021.

Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. Least angle regression. The Annals of Statistics, 2004.

Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Few-shot conformal prediction with auxiliary tasks. ICML, 2021.

Gruber, M. Regression estimators: A comparative study. JHU Press, 2010.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. International Conference on Machine Learning, 2016.

Hebiri, M. Sparse conformal predictors. Statistics and Computing, 2010.

Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex analysis and minimization algorithms. II. Springer-Verlag, 1993.

Ho, S.-S. and Wechsler, H. Query by transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.

Holland, M. J. Making learning more transparent using conformalized performance prediction. arXiv preprint arXiv:2007.04486, 2020.

Johnstone, C. and Cox, B. Conformal uncertainty sets for robust optimization. Conformal and Probabilistic Prediction and Applications, 2021.

Klochkov, Y. and Zhivotovskiy, N. Stability and deviation optimal risk bounds with convergence rate o(1/n). Advances in Neural Information Processing Systems, 2021.

Laxhammar, R. and Falkman, G. Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories. Annals of Mathematics and Artificial Intelligence, 2015.

Lei, J. Fast exact conformalization of lasso using piecewise linear homotopy. Biometrika, 2019.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 2018.

Lei, Y., Yang, Z., Yang, T., and Ying, Y. Stability and generalization of stochastic gradient methods for minimax problems. 2021.

Linusson, H., Norinder, U., Boström, H., Johansson, U., and Löfström, T. On the calibration of aggregated conformal predictors. Conformal and Probabilistic Prediction and Applications, 2017.

Ndiaye, E. and Takeuchi, I. Computing full conformal prediction set with approximate homotopy. NeurIPS, 2019.

Ndiaye, E. and Takeuchi, I. Root-finding approaches for computing conformal prediction set. arXiv preprint arXiv:2104.06648, 2021.

Ndiaye, E., Le, T., Fercoq, O., Salmon, J., and Takeuchi, I. Safe grid search with optimal complexity. ICML, 2019.

Nouretdinov, I., Melluish, T., and Vovk, V. Ridge regression confidence machine. ICML, 2001.

Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. Inductive confidence machines for regression. European Conference on Machine Learning, 2002.

Rockafellar, R. T. Convex analysis. Princeton University Press, 1997. Reprint of the 1970 original, Princeton Paperbacks.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 2008.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Vovk, V. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 2015.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer, 2005.

Xu, C. and Xie, Y. Conformal prediction interval for dynamic time-series. ICML, 2021.
## 6. Appendix

In these supplementary notes, we complete some proofs and provide algorithmic details of our approach, as well as additional numerical experiments.

### 6.1. Stab CP Set with General Score Function

We explain a simple procedure to approximate the prediction set with an arbitrary pre-defined accuracy. We recall that

$$\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}) = \{z : \pi_{\mathrm{up}}(z, \hat{z}) \geq \alpha\} = \{z : S(z, \mu_{\hat{z}}(x_{n+1})) \leq Q_{1-\alpha}(\hat{z}) + \tau_{n+1}\},$$

which is a convex set when the level sets of the score function are convex. For simplicity, we assume that the score function is such that $\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1})$ is a bounded interval. Algorithm 2 summarizes the process.

**Algorithm 2: Stable conformal prediction set for a score function with convex level sets**
- **Input:** data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and $x_{n+1}$; coverage level $\alpha \in (0, 1)$; any estimate $\hat{z} \in \mathbb{R}$; stability bounds $\tau_1, \ldots, \tau_{n+1}$ of the learning algorithm.
- **Output:** prediction interval at $x_{n+1}$.
- Fit a model $\mu_{\hat{z}}$ on the training data $D_{n+1}(\hat{z})$.
- Compute the quantile $Q_{1-\alpha}(\hat{z}) = U_{(\lceil (1 - \alpha)(n + 1) \rceil)}(z, \hat{z})$, where the $U_i$ are defined in Proposition 3.2.
- Compute $\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1}) = [\ell_\alpha(x_{n+1}), u_\alpha(x_{n+1})]$ up to a tolerance error $\epsilon_r > 0$ as follows:
  1. Find $z_{\min} < z_0 < z_{\max}$ such that $\pi_{\mathrm{up}}(z_{\min}, \hat{z}) < \alpha < \pi_{\mathrm{up}}(z_0, \hat{z})$ and $\alpha > \pi_{\mathrm{up}}(z_{\max}, \hat{z})$. (10)
  2. Perform a bisection search in $[z_{\min}, z_0]$. It outputs a point $\hat{\ell}$ such that $\ell_\alpha(x_{n+1})$ belongs to $[\hat{\ell} \pm \epsilon_r]$ after at most $\log_2\big(\frac{z_0 - z_{\min}}{\epsilon_r}\big)$ iterations.
  3. Perform a bisection search in $[z_0, z_{\max}]$. It outputs a point $\hat{u}$ such that $u_\alpha(x_{n+1})$ belongs to $[\hat{u} \pm \epsilon_r]$ after at most $\log_2\big(\frac{z_{\max} - z_0}{\epsilon_r}\big)$ iterations.
- **Return:** $\Gamma^{(\alpha)}_{\mathrm{up}}(x_{n+1})$.

### 6.2. Stability of the Linear Interpolation

We discussed in Section 3.2 the potential gain in accuracy when approximating the conformity function using not a single point but a batch of points. Here we justify the interpolation approach when the score function $S$ is sufficiently regular.

**Proposition 6.1.** Let us assume that the score function $S(q, \cdot)$ is $\gamma$-Lipschitz for any $q$, and consider the interpolated prediction model defined as

$$\bar{\mu}_z = \begin{cases} \dfrac{\hat{z}_1 - z}{\hat{z}_1 - z_{\min}} \mu_{z_{\min}} + \dfrac{z - z_{\min}}{\hat{z}_1 - z_{\min}} \mu_{\hat{z}_1} & \text{if } z \in [z_{\min}, \hat{z}_1], \\[4pt] \dfrac{\hat{z}_{t+1} - z}{\hat{z}_{t+1} - \hat{z}_t} \mu_{\hat{z}_t} + \dfrac{z - \hat{z}_t}{\hat{z}_{t+1} - \hat{z}_t} \mu_{\hat{z}_{t+1}} & \text{if } z \in [\hat{z}_t, \hat{z}_{t+1}], \\[4pt] \dfrac{z_{\max} - z}{z_{\max} - \hat{z}_d} \mu_{\hat{z}_d} + \dfrac{z - \hat{z}_d}{z_{\max} - \hat{z}_d} \mu_{z_{\max}} & \text{if } z \in [\hat{z}_d, z_{\max}], \end{cases}$$

where $\mu$ is stable according to Definition 3.1 (at the prediction level, i.e., $|\mu_z(x_i) - \mu_{z_0}(x_i)| \leq \tau_i$). It holds that

$$|S(q, \bar{\mu}_z(x_i)) - S(q, \bar{\mu}_{z_0}(x_i))| \leq 3\gamma\tau_i. \tag{12}$$

*Proof.* Using the triangle inequality, we have

$$\mathrm{stab} := |S(q, \bar{\mu}_z(x_i)) - S(q, \bar{\mu}_{z_0}(x_i))| \leq |S(q, \bar{\mu}_z(x_i)) - S(q, \mu_z(x_i))| + |S(q, \mu_z(x_i)) - S(q, \mu_{z_0}(x_i))| + |S(q, \mu_{z_0}(x_i)) - S(q, \bar{\mu}_{z_0}(x_i))|.$$

If $\mu$ is stable, then the second term on the right-hand side of the previous inequality is bounded by $\gamma\tau_i$. Now, assuming that $S$ is $\gamma$-Lipschitz in its second argument, for any $q$ we have

$$|S(q, \bar{\mu}_z(x_i)) - S(q, \mu_z(x_i))| \leq \gamma \bar{E}_i^z,$$

where, for $z$ in the segment $[\hat{z}_t, \hat{z}_{t+1}]$,

$$\bar{E}_i^z = |\bar{\mu}_z(x_i) - \mu_z(x_i)| = |\alpha_t \mu_{\hat{z}_t}(x_i) + (1 - \alpha_t) \mu_{\hat{z}_{t+1}}(x_i) - \mu_z(x_i)| \leq \alpha_t |\mu_z(x_i) - \mu_{\hat{z}_t}(x_i)| + (1 - \alpha_t) |\mu_z(x_i) - \mu_{\hat{z}_{t+1}}(x_i)| \leq \alpha_t \tau_i + (1 - \alpha_t) \tau_i = \tau_i,$$

with $\alpha_t \in \big\{\frac{\hat{z}_1 - z}{\hat{z}_1 - z_{\min}}, \frac{\hat{z}_{t+1} - z}{\hat{z}_{t+1} - \hat{z}_t}, \frac{z_{\max} - z}{z_{\max} - \hat{z}_d}\big\}$ the scaling of the interpolation points (the endpoints $z_{\min}, z_{\max}$ play the role of the neighboring query points in the first and last segments). Thus, we obtain

$$\mathrm{stab} \leq \gamma(\bar{E}_i^z + \tau_i + \bar{E}_i^{z_0}) \leq 3\gamma\tau_i.$$

The upper and lower approximations of the conformity function obtained with the interpolated model fit, along with the stability bounds, are defined as

$$\bar{\pi}_{\mathrm{lo}}(z) = 1 - \frac{1}{n + 1} \sum_{i=1}^{n+1} \mathbb{1}_{\bar{L}_i(z) \leq \bar{U}_{n+1}(z)}, \qquad \bar{\pi}_{\mathrm{up}}(z) = 1 - \frac{1}{n + 1} \sum_{i=1}^{n+1} \mathbb{1}_{\bar{U}_i(z) \leq \bar{L}_{n+1}(z)},$$

where, for any index $i$ in $[n + 1]$, using the stability bound in Equation (12), we define

$$\bar{L}_i(z) = \bar{E}_i(z) - 3\gamma\tau_i \quad \text{and} \quad \bar{U}_i(z) = \bar{E}_i(z) + 3\gamma\tau_i,$$
$$\bar{E}_i(z) = S(y_i, \bar{\mu}_z(x_i)) \quad \text{and} \quad \bar{E}_{n+1}(z) = S(z, \bar{\mu}_z(x_{n+1})).$$

In general, approximating the entire model path with respect to output/label changes using finitely many grid points is not safe for computing the conformal prediction set, because it breaks the exchangeability of the dataset. Incorporating the stability bound regularizes the conformity function and restores the validity of the method. However, the procedure proves to be quite robust to a wrong estimation of the stability bounds. The experiments in (Ndiaye & Takeuchi, 2021) are conducted with estimates $\tau_i = 0$, and the prediction sets obtained are essentially the same as the exact ones. More detailed experiments will be proposed in our GitHub implementation. A minimal sketch of the interpolated model is given below.
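Here is the announced sketch of the interpolated model of Proposition 6.1, reusing numpy's one-dimensional linear interpolation coordinate by coordinate; the ridge fits at the query points are an illustrative choice, and all names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of the piecewise-linear interpolation z -> mu_bar_z of Section 6.2:
# fit the model at a few query points (including z_min and z_max) and
# interpolate the predictions linearly in between, coordinate by coordinate.
rng = np.random.default_rng(0)
n, p = 50, 5
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
x_new = rng.standard_normal(p)
X_aug = np.vstack([X, x_new])

z_queries = np.linspace(y.min(), y.max(), 5)   # z_min, z_hat_t, ..., z_max
preds = np.array([Ridge(alpha=0.5).fit(X_aug, np.append(y, z)).predict(X_aug)
                  for z in z_queries])          # shape (d, n + 1)

def mu_bar(z):
    """Interpolated predictions mu_bar_z(x_i), i in [n + 1], linear in z."""
    return np.array([np.interp(z, z_queries, preds[:, j]) for j in range(n + 1)])
```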
### 6.3. Additional Experiments

In this appendix, we add some numerical experiments to illustrate how stab CP behaves when the estimator is not defined as an arg min but rather as the output of an iterative process. In this case, we use a Multi-Layer Perceptron regressor trained with $T = \texttt{n\_iter}$ gradient descent iterations. Recent analyses (Hardt et al., 2016) have shown that any model trained with the stochastic gradient method in a reasonable amount of time achieves a low generalization error. The proof of these results consists in showing that the estimator verifies a stability condition when the input data are slightly perturbed. The bounds on the iterates of stochastic gradient methods are often proportional to $T/n$. They also depend on Lipschitz regularity constants, which unfortunately can be hard to estimate in practice. Here, we will be satisfied with the order of magnitude and evaluate the behavior of the conformity function as a function of the number of iterations performed. We run the experiments on two different datasets, with sample sizes 442 and 20640; a minimal sketch of this setup is given after Figures 4 and 5.

[Figure 4. Illustration of different conformity functions $\pi_{\mathrm{up}}(z, 0)$ with respect to a sequence of stability bounds, $\tau_i = T \|x_i\| / n$ and $\tau_i = \|x_i\| / n$, for $T = \texttt{n\_iter} \in \{0.1, 0.5, 0.9, 2.0\} \times \texttt{n\_samples}$. We observe that merely taking an order of magnitude $O(1/n)$ as stability bound gives a good estimate of the conformal prediction set, even if the bound is not safe. These experiments are conducted with a Multi-Layer Perceptron regressor on the Diabetes (442, 10) dataset, trained with $T$ iterations of Stochastic Gradient Descent.]

[Figure 5. Same setting as Figure 4 on the Housingcalifornia (20640, 8) dataset: illustration of different conformity functions with respect to a sequence of stability bounds for a Multi-Layer Perceptron regressor trained with $T = \texttt{n\_iter}$ iterations of Stochastic Gradient Descent. Merely taking an order of magnitude $O(1/n)$ as stability bound gives a good estimate of the conformal prediction set, even if the bound is not safe.]
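The following sketch mirrors this experiment: an MLP trained for $T$ epochs of SGD, with the heuristic, non-certified bounds $\tau_i = T \|x_i\| / n$ plugged into the closed-form stab CP interval. All hyperparameters and names are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Heuristic stability bounds in the spirit of Figures 4-5: for an MLP trained
# with T epochs of SGD, take tau_i of order T * ||x_i|| / n. These bounds are
# not certified; they only encode the O(1/n) order of magnitude.
rng = np.random.default_rng(0)
n, p, alpha, z_hat = 400, 10, 0.1, 0.0
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
x_new = rng.standard_normal(p)
X_aug, y_aug = np.vstack([X, x_new]), np.append(y, z_hat)

T = n // 10
model = MLPRegressor(solver="sgd", max_iter=T, random_state=0).fit(X_aug, y_aug)
tau = T * np.linalg.norm(X_aug, axis=1) / n      # heuristic tau_1, ..., tau_{n+1}

E_hat = np.abs(y - model.predict(X))             # E_i(z_hat)
q = np.sort(E_hat + tau[:n])[int(np.ceil((1 - alpha) * (n + 1))) - 1]
center = model.predict(x_new.reshape(1, -1))[0]
print("stab CP interval:", (center - (q + tau[n]), center + (q + tau[n])))
```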
[Figure 6. Benchmarking conformal sets for MLP regression models with a ridge regularization on real datasets: (a) Boston (506, 13), (b) Diabetes (442, 10), (c) Housingcalifornia (20640, 8), (d) Friedman1 (500, 100). The parameter of the model is obtained after $T = n / 10$ iterations of stochastic gradient descent. For stab CP, we use a stability bound estimate $\tau_i = T \|x_i\| / (n + 1)$. We display the lengths of the confidence sets over 100 random permutations of the data, denoting by cov the average coverage and by T the average computational time normalized by the average time for computing oracle CP, which requires a single full-data model fit. All methods attain essentially the same coverage: 0.95 on Boston, 0.9 on Diabetes (where root CP has T = 34.16), 0.94 on Housingcalifornia, and 0.87 for oracle/split/root CP versus 0.9 for stab CP on Friedman1.]

[Figure 7. Benchmarking conformal sets for Gradient Boosting regression models with a ridge regularization on real datasets, with a stability bound estimate $\tau_i = \|x_i\| / (n + 1)$ for stab CP; same display conventions as Figure 6. The coverages of the four methods remain close on all datasets: about 0.92 on Boston, 0.91 to 0.92 on Diabetes, 0.87 to 0.88 on Housingcalifornia, and 0.84 to 0.85 on Friedman1.]

[Figure 8. Benchmarking conformal sets for Gradient Boosting regression models with a ridge regularization on real datasets, now with a rough stability bound estimate $\tau_i \propto \|x_i\| / 10$ for stab CP; same display conventions as Figure 6. Here stab CP over-covers: 0.93 versus 0.89 to 0.9 on Boston, 0.97 versus 0.93 on Diabetes, 0.85 for all methods on Housingcalifornia, and 0.99 versus 0.87 to 0.89 on Friedman1. This example shows that for unstable models such as decision trees, a coarse estimation of the stability bound can result in an overestimation of the confidence interval, which is a notable limitation of the proposed method.]