# leaveoneout_stable_conformal_prediction__9aed6647.pdf Published as a conference paper at ICLR 2025 LEAVE-ONE-OUT STABLE CONFORMAL PREDICTION Kiljae Lee, Yuan Zhang The Ohio State University lee.10428@osu.edu, yzhanghf@stat.osu.edu Conformal prediction (CP) is an important tool for distribution-free predictive uncertainty quantification. Yet, a major challenge is to balance computational efficiency and prediction accuracy, particularly for multiple predictions. We propose Leave-One-Out Stable Conformal Prediction (LOO-Stab CP), a novel method to speed up full conformal using algorithmic stability without sample splitting. By leveraging leave-one-out stability, our method is much faster in handling a large number of prediction requests compared to existing method RO-Stab CP based on replace-one stability. We derived stability bounds for several popular machine learning tools: regularized loss minimization (RLM) and stochastic gradient descent (SGD), as well as kernel method, neural networks and bagging. Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data. We applied our method to a screening problem, where its effective exploitation of training data led to improved test power compared to state-of-the-art method based on split conformal. 1 INTRODUCTION Conformal prediction (CP) offers a powerful framework for distribution-free prediction. It is useful for a variety of machine learning tasks and applications, including computer vision (Angelopoulos et al., 2020), medicine (Vazquez & Facelli, 2022; Lu et al., 2022), and finance (Wisniewski et al., 2020), where reliable uncertainty quantification for complex and potentially mis-specified models is in much need. Initially introduced by Vovk et al. (2005), conformal prediction has gained renewed interest. Numerous studies significantly enriched the conformal prediction toolbox and deepened theoretical understandings (Lei et al., 2018; Gibbs & Candes, 2021; Barber et al., 2023). Given data D = {(Xi, Yi)}n i=1, where (Xi, Yi) (X, Y) i.i.d. PX,Y , the goal is to predict the unob- served responses {Yn+j}m j=1 on the test data Dtest = {(Xn+j, Yn+j=?)}m j=1 i.i.d. PX,Y . Conformal prediction constructs prediction intervals Cα(Xn+j) at any given level α (0, 1), such that P(Yn+j Cα(Xn+j)) 1 α, for all j = 1, . . . , m. (1) A highlighted feature of conformal prediction is distribution-free: even when a wrong model is used for prediction, the coverage validity (1) still holds (but the prediction interval will become wider). A primary challenge in conformal prediction lies in balancing computation cost with prediction accuracy. Among the variants of conformal prediction, full conformal is the most accurate (i.e., shortest predictive intervals) but also the slowest; split conformal greatly speeds up by a data-splitting scheme, but decreases accuracy and introduces variability that heavily depends on the particular split (Angelopoulos & Bates, 2021; Vovk, 2015; Barber et al., 2021). Derandomization (Solari & Djordjilovi c, 2022; Gasparin & Ramdas, 2024) addresses the latter issue but increases computational cost and may make the method more conservative (Ren & Barber, 2024). Algorithmic stability is an important concept in machine learning theory (Bousquet & Elisseeff, 2002). It measures the sensitivity of a learning algorithm to small changes in the input data. Numerous studies have focused on techniques that induce algorithmic stability, such as regularized loss minimization (Shalev-Shwartz et al., 2010; Shalev-Shwartz & Ben-David, 2014) and stochastic gradient descent (Hardt et al., 2016; Bassily et al., 2020). Recent research has applied the concept of algorithmic stability to the field of conformal prediction (Ndiaye, 2022; Liang & Barber, 2023). Published as a conference paper at ICLR 2025 Ndiaye (2022) proposed replace-one stable conformal prediction (RO-Stab CP) that effectively accelerates full conformal by leveraging algorithmic stability. While it accelerates full conformal without introducing data splitting, thus preserving prediction accuracy, the prediction model needs to be refit for each Xn+j, lowering its computational efficiency for multiple predictions. In this paper, we introduce Leave-One-Out Stable Conformal Prediction (LOO-Stab CP), which represents is a distinct form of algorithmic stability type than that in RO-Stab CP. The key innovation is that our stability correction no longer depends on the predictor at the test point Xn+j. As a result, our method only needs one model fitting using the training data D. This enables our method to effectively handle a large number of prediction requests. Theoretical and numerical studies demonstrate that our method achieves competitive prediction accuracy compared to existing method, while preserving the finite-sample coverage validity guarantee. 2 PRIOR WORKS ON CONFORMAL PREDICTION (CP) To set up notation and introduce our method, we begin with a brief review of classical CP methods. Full conformal. We begin by considering the prediction of a single Yn+j. Fix j [m] = {1, . . . , m} and let y denote a guessed value of the unobserved Yn+j. We call Dy j = D {(Xn+j, y)} the augmented data and train a learning algorithm f (such as linear regression) on Dy j . To emphasize that the outcome of the training depends on both Xn+j and y, we denote the fitted model by bf y j . Here we require that the training algorithm is permutation-invariant, meaning that bf y j remains unchanged if any two data points (Xi, Yi) and (Xi , Yi ) are swapped for i, i [n] {n + j}. Next, for each i = 1, . . . , n, n + j, we define a non-conformity score Sy i,j = S(Yi, bf y j (Xi)) = |Yi bf y j (Xi)| to measure bf y i s goodness of prediction on the ith data point. Notice that Sy i,j also depends on Xn+j through bf y j (Xi), thus its subscript j. For simplicity, we set j = 1 and write Sy i,j as Sy i only for this part. Now, plugging in y = Yn+1, by symmetry, all non-conformity scores {SYn+1 i }n i=1 {SYn+1 n+1 } are exchangeable, and then the rank of SYn+1 n+1 (in ascending order) is uniformly distributed over {1, . . . , n + 1}, implying that P(SYn+1 n+1 Q1 α({SYn+1 i }n i=1 { })) 1 α, 1 where Qp(S) := inf{x|FS(x) p} denotes the lower-p sample quantile of data S. This implies coverage validity P(Yn+1 Cfull α (Xn+1)) 1 α of the prediction set Cfull α (Xn+1) defined as Cfull α (Xn+1) = y Y : Sy n+1 Q1 α({Sy i }n i=1 { }) . (2) This leads to the full conformal (Full CP) method: compute Cfull α (Xn+1) by a grid search over Y. Split conformal. The grid search required by full conformal is expensive. The key to acceleration is to decouple the prediction function bf, thus most non-conformity scores {Sy i }n i=1, from both j and y: if Sy n+1 is the only term that depends on y, then the prediction set can be analytically solved from (2). Split conformal (Split CP) (Papadopoulos et al., 2002; Vovk, 2015) randomly splits D into the training data Dtrain and the calibration data Dcalib, train bf only on Dtrain, and compute and rank non-conformity scores only on Dcalib {(Xn+j, y)}. While split conformal effectively speeds up computation, the flip side is the reduced amount of data used for both training and calibration, leading to wider predictive intervals and less stable output. Replace-one stable conformal. Ndiaye (2022) accelerated Full CP by leveraging algorithmic stability. From now on, we will switch back to the full notation for S and no longer abbreviate Sy i,j as Sy i . To decouple the non-conformity scores from y, Ndiaye (2022) evaluate these scores using ey, an arbitrary guess of y. Therefore, we call his method replace-one stable conformal prediction (RO-Stab CP). To bound the impact of guessing y, he introduced the replace-one (RO) stability. Definition 1 (Replace-One Algorithmic Stability). A prediction method bf is replace-one stable, if for all j [m] and i [n] {n + j}, there exists τ RO i,j < , such that sup y,ey,z Y |S(z, bf y j (Xi)) S(z, bf ey j (Xi))| τ RO i,j , 1Here we replaced S n+1 in {S Yn+1 i }n i=1 {S n+1} by . This does not change the quantile. Published as a conference paper at ICLR 2025 where bf y j is trained on D {(Xn+j, y)}, for y = y or ey. Recall that Sey i,j = |Yi bf ey j (Xi)| denote the non-conformity score computed using ey. Then it suffices to build a predictive interval that contains Cfull j,α (Xn+j) in (2). By Definition 1, the following inequality |y bf ey j (Xn+j)| τ RO n+j,j Sy n+j,j Q1 α({Sy i,j}n i=1 { }) Q1 α({Sey i,j + τ RO i,j }n i=1 { }) holds true for any y Cfull j,α (Xn+j). Consequently, the RO stable prediction set CRO j,α (Xn+j) = n y Y : |y bf ey j (Xn+j)| τ RO n+j,j Q1 α({Sey i,j + τ RO i,j }n i=1 { }) o (3) contains Cfull j,α (Xn+j) as a subset, thus also has valid coverage. The numerical studies in Ndiaye (2022) demonstrated that RO-Stab CP computes as fast as Split CP while stably producing more narrower predictive intervals (i.e., higher prediction accuracy). 3 LOO-STABCP: LEAVE-ONE-OUT STABLE CONFORMAL PREDICTION 3.1 LEAVE-ONE-OUT (LOO) STABILITY AND GENERAL FRAMEWORK When predicting one Yn+j, RO-Stab CP has accelerated full conformal to the speed comparable to split conformal. However, its non-conformity scores Sey i,j s still depend on Xn+j. Consequently, in order to predict {Yn+j}m j=1, RO-Stab CP would refit the model m times, once for each Xn+j. This naturally motivates our approach: can we let all predictions be based off a common model bf, which only depends on D = {(Xi, Yi)}n i=1, but not any of {Xn+j}m j=1? Interestingly, the idea might appear similar to a beginner s mistake when learning Full CP, overlooking that the model fitting in Full CP should also include (Xn+j, y), not just D. Clearly, to ensure a valid method, we must correct for errors inflicted by using bf in lieu of bf y j . Since these two model fits (ours vs Full CP) are computed on similar sets of data, with the only difference of whether to consider (Xn+j, y), our strategy is to study the leave-one-out (LOO) stability of the prediction method. Definition 2 (Leave-One-Out Algorithmic Stability). A prediction method is leave-one-out stable, if for all j [m] and i [n] {n + j}, there exists τ LOO i,j < , such that sup y,z Y |S(z, bf y j (Xi)) S(z, bf(Xi))| τ LOO i,j . The τ LOO i,j s appearing in Definition 2 are called LOO stability bounds. Their values can often be determined by analysis of the specific learning algorithm f. For each j, we used a different set of LOO stability bounds {τ LOO i,j }i [n] {n+j}. This approach is adopted to achieve sharper bounds compared to using a uniformly bound for all j. We clarify that the concept of algorithmic stability is well-defined for parametric or non-parametric f s alike. For an f parameterized by some θ, the stability bound does not focus on the whereabout of the optimal θ, but on how much impact leaving out (n + j)th data point will have on the trained f, possibly via quantifying its impact on the estimated θ. We will elaborate using concrete examples in Section 3.2. For now, we assume that τ LOO i,j s are known and present the general framework of our method, called leave-one-out stable conformal prediction (LOO-Stab CP), as Algorithm 1. The implementation requires computation of O(mn) many τ LOO i,j values. However, these computations are relatively inexpensive and do not significantly impact the overall time cost. In many examples (such as SGD, see Section 3.2.2), the main computational cost comes from model fitting, especially for complex models. We empirically confirm this in Section 4 through various numerical experiments. Table 1 presents a comparison of the computational complexity for conformal prediction methods. The concept of leave-one-out perturbations in conformal prediction has been studied in Liang & Barber (2023), but their angle is very different from ours. They focused on studying the LOO as a part of Jackknife+ (Barber et al., 2021), which fits n models, one for each D \ {(Xi, Yi)}. Then all these n models are used simultaneously for each prediction. In contrast, we use LOO technique in a very different way, developing a fast algorithm that fit only one model to D (without deletion). The one in our leave-one-out refers to (Xn+j, y) in D {(Xn+j, y)}, for each j. Published as a conference paper at ICLR 2025 Algorithm 1: (LOO-Stab CP) Leave-One-Out Stable Conformal Prediction Set Input : Training set D = {(Xi, Yi)}n i=1, test points {Xn+j}m j=1, desired coverage 1 α. Output: Prediction interval CLOO j,α (Xn+j) for each j [m] 1. Fit the prediction function bf on D; 2. Compute (uncorrected) non-conformity scores on D: Si = |Yi bf(Xi)| for i [n]; for j [m] do 3. Compute stability bounds τ LOO i,j for i [n] {n + j}; 4. Compute prediction interval: CLOO j,α (Xn+j) = h bf(Xn+j) n Q1 α {Si + τ LOO i,j }n i=1 { } + τ LOO n+j,j oi ; Full CP Split CP RO-Stab CP LOO-Stab CP # of model fits |Y| m 1 m 1 # of prediction evaluations (n + 1) |Y| m n + m (n + 1) m n + m # of stability bounds Not applicable Not applicable (n + 1) m (n + 1) m Table 1: Computational complexities of our method and benchmarks, where |Y| is the size of the search grid used in Full CP. We emphasize that in many examples, such as SGD (Section 3.2.2), one model fitting is much more costly than one prediction or computation of one stability bound. Next, we provide the theoretical guarantee of our algorithm s coverage validity. Theorem 1. If the prediction method is leave-one-out stable as in Definition 2. Then for each j [m], the prediction set CLOO j,α (Xn+j) constructed by Algorithm 1 satisfies P(Yn+j CLOO j,α (Xn+j)) 1 α. 3.2 LOO STABLE ALGORITHMS So far, we have been treating the stability bounds τ LOO i,j as given without showing how to obtain them. In this section, we derive these bounds for two important machine learning tools: Regularized Loss Minimization (RLM) and Stochastic Gradient Descent (SGD). Many machine learning tasks aim to minimize a loss function ℓ(y, fθ(x)) over training data. Empirical Risk Minimization (ERM) is a common approach, which seeks to minimize 1 n Pn i=1 ℓ(Yi, fθ(Xi)) with respect to θ. However, the objective function is often highly nonconvex, making the optimization challenging. RLM alleviates nonconvexity by adding an explicit penalty (e.g., ridge and LASSO) to the objective function (Hoerl & Kennard, 1970; Tibshirani, 1996). Alternatively, SGD implicitly regularizes the optimization procedure (Robbins & Monro, 1951) by iteratively updating model parameters using one data point at a time. Its computational efficiency makes it a preferred method in deep learning (Le Cun et al., 2015; He et al., 2016). 3.2.1 EXAMPLE 1: REGULARIZED LOSS MINIMIZATION (RLM) To derive the LOO stability bound, we compare two versions of RLM, only differing by their training data. The first is trained on D, producing bθ = arg minθ Θ[ 1 n Pn i=1 ℓ(Yi, fθ(Xi)} + Ω(θ)], where Θ is the parameter space and Ω(θ) is the explicit penalty term; while the second is trained on the augmented data Dy j (recall Dy j = D {(Xn+j, y)}), producing bθy j = arg minθ Θ[ 1 n+1{Pn i=1 ℓ(Yi, fθ(Xi)) + ℓ(y, fθ(Xn+j)} + Ω(θ)]. The LOO stability for RLM is described by Definition 2, with bf( ) = fbθ( ) and bf y j ( ) = fbθy j ( ). To state our main result, we need some concepts from optimization. Definition 3 (ρ-Lipschitz). A continuous function g : Rp Rq is ρ-Lipschitz, if g(x) g(y) ρ x y , for any x, y Rp. Published as a conference paper at ICLR 2025 Definition 4 (Strong Convexity). A function g : Rp R is λ-strongly convex, if g(tx + (1 t)y) tg(x) + (1 t)g(y) λ 2 x y 2, for any x, y Rp and t (0, 1). In addition, a function g is convex if it is 0-strongly convex. Now we are ready to formulate the LOO stability bounds for RLM. Theorem 2. Suppose: 1) for each i [n + m] and given any y, the loss function ℓ(y, fθ(Xi)) is convex and ρi-Lipschitz in θ; 2) the penalty term Ωis λ-strongly convex; 3) for each i [n + m], the prediction function fθ(Xi) is νi-Lipschitz in θ; and 4) given any y, the non-conformity score S(y, z) is γ-Lipschitz in z.2 Then, RLM has the following LOO and RO stability bounds. τ LOO i,j = 2γνi(ρn+j + ρ) λ(n + 1) , and τ RO i,j = 4γνiρn+j λ(n + 1) , (4) where i ranges in [n] {n + j} for each 1 j m, and ρ = n 1 Pn i=1 ρi. As a side remark, a uniform RO stability bound has been established in Ndiaye (2022) (Corollary 3.10). Our RO bound in (4) is non-uniform (not maximizing over Xi) and potentially sharper. 3.2.2 EXAMPLE 2: STOCHASTIC GRADIENT DESCENT (SGD) For simplicity, we recap how SGD operates when trained on D. It starts with an initial parameter value θ0 and runs for R epochs. In each epoch, generate a random permutation π = (π1, . . . , πn) of [n]. Then for each i [n], update the model parameter by θ = θ η θℓ(Yπi, fθ(Xπi)), where η > 0 is a user-selected learning rate. After a total of Rn updates, the output bθ is used for prediction. Like in RLM, our LOO stability bound compares two versions of SGD, trained on D and Dy j , respectively. Theorem 3. Suppose: 1) for each i [n+m], the loss function ℓ(y, fθ(Xi)) is convex, ρi-Lipschitz in θ, and its gradient θℓ(y, fθ(Xi)) is φi-Lipschitz in θ, for any y; 2) for each i [n + m], the prediction function fθ(Xi) is νi-Lipschitz in θ; and 3) the non-conformity score S(y, z) is γLipschitz in z, for any y. Then, with learning rate η 2 max{φi}, SGD has the following LOO and RO stability bounds. τ LOO i,j = Rη γνiρn+j, and τ RO i,j = 2Rη γνiρn+j, (5) where i ranges in [n] {n + j} for each 1 j m. Readers may have noticed that for SGD, τ LOO i,j is only half of τ RO i,j , which is very different from the case for RLM (c.f. Theorem 2). The gap here stems from the iterative nature of SGD. Recall that in each epoch, SGD performs n (or n + 1) gradient descent (GD) updates, with each update depending on a single data point. Consequently, leaving out one data point results in one fewer GD update. In contrast, replacing one data point means performing one GD update differently in the worst-case scenario, this update may move in opposite directions before and after the replacement, doubling the stability bound. SGD s iterative nature makes it an excellent example where the number of model fits is the main bottleneck in scaling a crucial learning technique. For SGD, each model fit requires O(Rn) gradient updates, while each prediction costs O(1) time, and evaluating stability bounds for each prediction costs O(n) time. Combining this understanding with Table 1, we see that our method provides significantly faster stable conformal prediction than RO-Stab CP for performing a large number of predictions. 3.2.3 TOWARDS BROADER APPLICABILITY OF LOO-STABCP Kernel method: The kernel method (or kernel trick ) (Sch olkopf, 2002) is a commonly used technique in statistical learning. It implicitly transforms data into complex spaces through a kernel function k(x, x ). This leads to the reformulated optimization problem: bθ = arg minθ Rn 1 n Pn i=1 ℓ(Yi, k T i θ) + λθT Kθ, where K is a positive-definite kernel matrix Ki,j = k(Xi, Xj), and ki denotes its i-th row. It is not difficult to verify that the kernel method is a special case of RLM, thus Theorem 2 applies to the kernel method. 2Here, z represents the prediction output, and therefore, this is an assumption independent of the model. Published as a conference paper at ICLR 2025 Neural networks: The stability bounds for RLM and SGD rely on convexity assumptions that might not always hold in practice, such as in (deep) neural networks. Here, we analyze the LOO stability of SGD as a popular optimizer for neural networks, without assuming convexity. Theorem 4. Assume the conditions of Theorem 3, except that the loss function ℓ(y, fθ(Xi)) is not required to be convex in θ. Then, for the same range of (i, j), SGD has the following LOO and RO stability bounds: τ LOO i,j = R+η γνiρn+j, and τ RO i,j = 2R+η γνiρn+j, where R+ = PR r=1 κr and κ = Qn i=1(1 + ηφi). While Theorem 4 does provide a rigorous theoretical justification for neural networks, in practice, the term κ may be large if the learning rate η is not sufficiently small or the activation function is not very smooth, leading to large Lipschitz constants φi s. Therefore, this stability bound may turn out to be conservative. Similar to Hardt et al. (2016), we also observed that the empirical stability of SGD for training neural networks is often far better than the worst-case bound described by Theorem 4, see our numerical results in Section 5. This suggests practitioners to still apply the stability bound in Theorem 3, dismissing non-convexity. It is an intriguing but challenging future work to narrow the gap between theory and practice here. Bagging: Bagging (bootstrap aggregating) (Soloff et al., 2024) is a general framework that averages over B models trained on resamples of size m from D. Random forest (Wang et al., 2023) is a popular special case of bagging, in which each f (b)(x) is a regression tree. Therefore, we focus on studying bagging. It predicts by f B(x) = 1 B PB b=1 f (b)(x), where f (b)(x) indicates the individual model trained on the bth resample. For simplicity, here we analyze a derandomized bagging (Soloff et al., 2024), i.e., setting B . The prediction function becomes f (x) = E[f (b)(x)]. Below is the LOO stability of derandomized bagging. Here, we denote f y,(b) j (x) and f (b)(x) as the individual models obtained from Dy j and D, respectively. Theorem 5. Assume that 1) for any j [m] and (x, y) X Y, all individual prediction functions f y,(b) j (x) and f (b)(x) are bounded within a range of width wj; 2) the nonconformity score S(y, z) is γ-Lipschitz in z for any y. Then, derandomized bagging achieves the following LOO stability bound: τ LOO i,j = γwj where p = 1 1 1 n m and i [n] {n + j} for j = 1, . . . , m. From above, note that the only assumption about the prediction model is bounded output. For example, regression trees satisfy this assumption. Due to page limit, we relegate more results and discussion to Appendix A.3. 4 SIMULATION In this simulation, we compare several CP methods serving RLM and SGD. We set n = m = 100, α = 0.1 and generated synthetic data using Xi i.i.d. N(0, 1 dΣ) with d = 100, where Σi,j = ρ|i j| (i.e., AR(1)). In particular, we chose ρ = 0.5 in this experiment. For the response variable we set Yi = µ(Xi; β) + ϵi, where ϵi i.i.d. N(0, 1). We considered two models for µ( ; β): linear µ(x; β) = Pd j=1 βjxj and nonlinear µ(x; β) = Pd j=1 βjexj/10. In both models, set βj (1 j/d)5 for j [d], and normalize: β 2 2 = d. To fit the model, we used robust linear regression, equipped with Huber loss: ℓ(y, fθ(x)) = 1 2(y fθ(x))2, if |y fθ(x)| ϵ, ϵ|y fθ(x)| 1 2ϵ2, if |y fθ(x)| > ϵ, where fθ(x) = x T θ and we set ϵ = 1 throughout. We used absolute residual as non-conformity scores. In RLM, we set Ω(θ) = θ 2 and solved it using gradient descent (Diamond & Boyd, Published as a conference paper at ICLR 2025 2016). Throughout, we ran SGD for R = 15 epochs for all methods, except R = 5 for the very slow Full CP. For both RLM and SGD, we set the learning rate to be η = 0.001. For more implementation details, see Appendix G.1. Each experiment was repeated 100 times. We compare our method to the following benchmarks: 1) Oracle CP (Appendix C.1, Algorithm 4): an impractical method that uses the true {Yn+j}m j=1 to predict; 2) Full CP; 3) Split CP: 70% training + 30% calibration; and 4) RO-Stab CP. The performance of each method is evaluated by three measures: 1) coverage probability (method validity); 2) length of predictive interval (prediction accuracy); and 3) computation time (speed). Figure 1: Comparison of CP methods. Our method (LOO-Stab CP) achieves competitive prediction accuracy and computes at the speed comparable to Split CP, while maintaining coverage validity. Figure 1 presents the results of our simulation. In the plots for coverage and length, the horizontal dashed lines represent the desired coverage level (1 α) and the length of the tightest possible prediction band obtained from the true distribution of the data3, respectively. As expected, all methods maintain valid coverage. Our method shows competitive prediction accuracy, comparable to those of Oracle CP, Full CP, and RO-Stab CP. These four methods exhibit more consistent and overall superior accuracy compared to Split CP. In terms of computational efficiency, our method performs on par with Split CP and is clearly faster than other methods. Notably, our method significantly outperforms RO-Stab CP in handling a large number of prediction requests. 5 DATA EXAMPLES We showcase the use of our method on the two real-world data examples analyzed in Ndiaye (2022). The Boston Housing data (Harrison Jr & Rubinfeld, 1978) contain 506 different areas in Boston, each area has 13 features as predictors, such as the local crime rate and the average number of rooms. The goal is to predict the median house value in that area. The Diabetes data (Efron et al., 2004) measured 442 individuals at their baseline time points for 10 variables, including age, BMI, and blood pressure, aiming to predict diabetes progression one year after baseline. Both datasets are complete, with no missing entries. All continuous variables have been normalized, and no outliers were identified. For each data set, we randomly held out m data points (as the test data) for performance evaluation and released the rest to all methods for training/calibration. We tested two settings: m = 1 and m = 100, two model fitting algorithms: RLM and SGD; and repeated each experiment 100 times. 3In our setup, it is 3.290 since P(|ϵi| 1.645) = 0.9. Published as a conference paper at ICLR 2025 The other configurations, including the model fitted to data (ℓ, ϵ, Ω, etc.), the list of compared benchmarks and the performance measures, all remained unchanged from Section 4. Figure 2: Comparison of prediction interval lengths, under choices of m = 1 and m = 100. Figure 2 shows the result. While most interpretations are consistent with that of the simulation, we observe two significant differences between the settings m = 1 and m = 100. First, under m = 1, RO-Stab CP and our method take comparable time, but when m increases to 100, our method exhibits remarkable speed advantage, as expected. Second, with an increased m, the amount of available data for prediction/calibration also decreases. This leads to wider prediction intervals for all methods. Also, Split CP continues to produce more variable and lengthier prediction intervals compared to most other methods for m {1, 100}. In summary, our method LOO-Stab CP exhibits advantageous performances in all aspects across different settings. The empirical coverage rates are consistent with those in the previous experiments and are provided in Appendix G.3. Figure 3: Comparison of CP methods with neural networks with single hidden layer under choice of m = 100. LOO-Stab CP continues to closely achieve the target coverage while exhibiting lower variability in prediction intervals. To further evaluate the performance of LOO-Stab CP with non-convex learning methods, we conducted experiments with a neural network of a single hidden layer of 20 nodes and a sigmoid activation function. We set η = 0.001 and R = 30. For stability bounds, we borrowed from the practical guidance in Hardt et al. (2016) and Ndiaye (2022) and set τ LOO i,j Rη γ Xi Xn+j for LOO-Stab CP and τ RO i,j 2Rη γ Xi Xn+j for RO-Stab CP, respectively. This choice is elaborated in Appendix A.2, see (8). Figure 3 presents the results. LOO-Stab CP maintained valid Published as a conference paper at ICLR 2025 coverage across all scenarios. These findings highlight the robustness of LOO-Stab CP in handling complex models like neural networks. Finally, since one could consider derandomization by aggregating results across multiple different splits to reduce variation of Split CP (Solari & Djordjilovi c, 2022; Gasparin & Ramdas, 2024), we also numerically compared our method to this approach. The result demonstrated that our LOO-Stab CP is computationally faster and less conservative than two popular derandomized Split CP methods Solari & Djordjilovi c (2022); Gasparin & Ramdas (2024). Due to page limit, we relegate all details of this study to Appendix B. 6 APPLICATION: CONFORMALIZED SCREENING Many decision-making processes, such as drug discovery and hiring, often involve a screening stage to filter among a large number of candidates, prior to more resource-intensive stages like clinical trials and on-site interviews. The data structure is what we have been studying in this paper: training data D = {(Xi, Yi)}n i=1 and a large number of test points {Xn+j}m j=1 without observing Yn+j s. Suppose higher values of Y are of interests. Jin & Cand es (2023) formulated this as the following randomized hypothesis testing problem: H0j : Yn+j cj vs H1j : Yn+j > cj, for j [m], where cj s are user-selected thresholds (e.g., qualifying score for phone interviews). Then screening candidates means simultaneously testing these m randomized hypotheses. To control for error in multiple testing, one popular criterion is the false discovery rate (FDR), defined as the expected false discovery proportion (FDP) among all rejections. FDR = E[FDP], where FDP = Pm j=1 1{H0j is incorrectly rejected} 1 Pm j=1 1{H0j is rejected} . Jin & Cand es (2023) proposed a method called cf BH based on Split CP. Our narration will build upon non-conformity scores without repeating details about model fitting. In this context, the nonconformity score should be defined differently, for instance: S(y, z) = y z without the absolute value, where y is the observed response and z is the fitted value. On the calibration data, Si = S(Yi, bf(Xi)), whereas on the test data, we would consider Scj n+j = S(cj, bf(Xn+j)). Jin & Cand es (2023) s cf BH method computes the following conformal p-value: i Icalib 1{Si < Scj n+j} + 1 |Icalib| + 1 , (6) where Icalib denotes the index set corresponding to the calibration data. To intuitively understand (6), notice that psplit j < α if and only if cj falls outside the level-(1 α) (one-sided) split conformal prediction interval for Yn+j. Finally, plugging {psplit j }m j=1 into a Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995) controls the FDR at a desired level q: compute k = max{k : Pm j=1 1{psplit j qk/m} k}, and reject all H0j s with psplit j < qk /m. While Jin & Cand es (2023) s method effectively controls FDR and computes fast, the data splitting mechanism leaves space for more thoroughly exploiting available information for model fitting. To this end, we propose a new approach, called LOO-cf BH built upon our main method LOO-Stab CP. We compute stability-adjusted p-values as follows: Pn i=1 1{Si τ LOO i,j < Scj n+j + τ LOO n+j,j} + 1 n + 1 . (7) Algorithm 2 describes the full details of our method. To numerically compare our method to existing approaches, we used the recruitment data set Ganatara (2020) that was also analyzed in Jin & Cand es (2023). It contains 215 individuals, each measured on 12 features such as education, work experience, and specialization. The binary response indicates whether the candidate receives a job offer. We import the robust regression from Section 5 as the prediction method, optimized by SGD. Since the task is classification, we use the Published as a conference paper at ICLR 2025 Algorithm 2: (LOO-cf BH) Conformal Selection by Prediction with Leave-One-Out p-values Input : Training set D, test points {Xn+j}m j=1, thresholds {cj}m j=1, FDR level q. Output: Set of rejected null hypotheses 1. Fit the prediction function bf on D; 2. Compute (uncorrected) non-conformity scores on D: Si = S(Yi, bf(Xi)) for i [n]; for j [m] do 3. Compute stability bounds τ LOO i,j for i [n] {n + j}; 4. Compute LOO conformal p-values p LOO j as in (7); end 5. Implement BH Procedure: k = max{k : Pm j=1 1{p LOO j qk/m} k} and reject all H0j s satisfying p LOO j < qk /m. clip function in Jin & Cand es (2023) as the non-conformity score: S(y, bf) = 100y bf. For illustration, we also formulated a benchmark RO-cf BH, following the spirit of Ndiaye (2022) and using replace-one stability. RO-cf BH replaces all τ LOO terms in (7) by the corresponding τ RO terms; it is otherwise identical to LOO-cf BH. We repeated the experiment 1000 times, each time leaving out 20% data points as the test data. In cf BH, the data was split into 70% for training and 30% for calibration. We tested three target FDR levels q {0.1, 0.2, 0.3} and consider three performance measures: 1) FDP; 2) test power, defined as Pm j=1 1{H0j is rejected} Pm j=1 1{H1j is true} ; and 3) time cost. Figure 4: Comparison of screening methods on recruitment data. Time cost does not vary with q. Figure 4 shows the result. Our method achieves valid FDP control for all tested q. Compared to cf BH, our method is more powerful, due to improved exploitation of available data for prediction. The performance measures also reflect that our method LOO-cf BH produces more stable prediction intervals, while sample splitting introduces additional artificial random variations to the result of cf BH. Compared to RO-cf BH, we highlight our method s significant speed advantage. Moreover, as we showed in Theorem 3, for SGD, our LOO approach achieves a tighter stability bound than RO. As a result, our method is less conservative and more powerful compared to RO-cf BH. 7 CONCLUSION AND FUTURE WORK In this paper, we propose a novel approach to stable conformal prediction. Our method significantly improves computational efficiency for multiple prediction requests, compared to the classical stable conformal prediction (Ndiaye, 2022). Here, we mention three directions for future work. First, while we have derived stability bounds for RLM, SGD, neural networks and bagging, improving the tightness of bounds for complex methods remains an important avenue for future research. Second, we have been focusing on continuous responses. It would be an intriguing future work to expand our theory to classification. A third direction is to investigate algorithmic stability for adaptive conformal prediction. Published as a conference paper at ICLR 2025 SUPPLEMENTAL MATERIALS The Supplemental Materials contains all proofs and additional numerical results. The code for reproducing numerical results is available at: https://github.com/Kiljae L/LOO-Stab CP. ACKNOWLEDGMENT The authors thank Eugene Ndiaye for helpful discussion and the anonymous reviewers for constructive criticism. This research was supported by NSF grant DMS-2311109. Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I Jordan. Uncertainty sets for image classifiers using conformal prediction. ar Xiv preprint ar Xiv:2009.14193, 2020. Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. ar Xiv preprint ar Xiv:2107.07511, 2021. Rina Foygel Barber, Emmanuel J. Cand es, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486 507, 2021. doi: 10.1214/ 20-AOS1965. URL https://doi.org/10.1214/20-AOS1965. Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 845, 2023. Raef Bassily, Vitaly Feldman, Crist obal Guzm an, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33: 4381 4391, 2020. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289 300, 1995. Olivier Bousquet and Andr e Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499 526, 2002. Steven Diamond and Stephen Boyd. Cvxpy: A python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016. Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407 499, 2004. Dhimant Ganatara. Campus recruitment. https://www.kaggle.com/datasets/ benroshan/factors-affecting-campus-placement, 2020. Matteo Gasparin and Aaditya Ramdas. Merging uncertainty sets via majority vote. ar Xiv preprint ar Xiv:2401.09379, 2024. Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660 1672, 2021. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pp. 1225 1234. PMLR, 2016. David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81 102, 1978. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970. Published as a conference paper at ICLR 2025 Ying Jin and Emmanuel J Cand es. Selection by prediction with conformal p-values. Journal of Machine Learning Research, 24(244):1 41, 2023. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436 444, 2015. Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distributionfree predictive inference for regression. Journal of the American Statistical Association, 113 (523):1094 1111, 2018. Ruiting Liang and Rina Foygel Barber. Algorithmic stability implies training-conditional coverage for distribution-free prediction methods. ar Xiv preprint ar Xiv:2311.04295, 2023. Charles Lu, Andr eanne Lemay, Ken Chang, Katharina H obel, and Jayashree Kalpathy-Cramer. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008 12016, 2022. Eugene Ndiaye. Stable conformal prediction sets. In International Conference on Machine Learning, pp. 16462 16479. PMLR, 2022. Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alexander Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, 2002. URL https://api.semanticscholar.org/Corpus ID:42084298. Zhimei Ren and Rina Foygel Barber. Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86 (1):122 154, 2024. Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951. B Sch olkopf. Learning with kernels: support vector machines, regularization, optimization, and beyond, 2002. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635 2670, 2010. Aldo Solari and Vera Djordjilovi c. Multi split conformal prediction. Statistics & Probability Letters, 184:109395, 2022. Jake A Soloff, Rina Foygel Barber, and Rebecca Willett. Bagging provides assumption-free stability. Journal of Machine Learning Research, 25(131):1 35, 2024. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267 288, 1996. Janette Vazquez and Julio C Facelli. Conformal prediction in clinical medical sciences. Journal of Healthcare Informatics Research, 6(3):241 252, 2022. Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74: 9 28, 2015. Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world, volume 29. Springer, 2005. Yan Wang, Huaiqing Wu, and Dan Nettleton. Stability of random forests and coverage of randomforest prediction intervals. Advances in Neural Information Processing Systems, 36:31558 31569, 2023. Wojciech Wisniewski, David Lindsay, and Sian Lindsay. Application of conformal prediction interval estimations to market makers net positions. In Conformal and probabilistic prediction and applications, pp. 285 301. PMLR, 2020. Published as a conference paper at ICLR 2025 Supplemental Materials for Leave-one-out Stable Conformal Prediction Kiljae Lee and Yuan Zhang A DETAILED INSIGHTS ON PRACTICAL EXTENSIONS OF LOO-STABCP In this appendix, we provide further numerical and theoretical analysis to support the approaches discussed in Section 3.2.3. A.1 NUMERICAL EXPERIMENTS USING KERNEL TRICK The key insight of the kernel trick is that by transforming the data into a higher-dimensional feature space using a kernel function, the original optimization problem bβ = arg min β Rd 1 n i=1 ℓ(Yi, XT i β) + λ β 2, can be reformulated as bθ = arg min θ Rn 1 n i=1 ℓ(Yi, k T i θ) + λθT Kθ. This reformulation using the kernel trick does not violate the assumptions required for RLM and SGD, as the transformation maintains the core structure of the optimization problem. Specifically, the kernel matrix K implicitly defines the high-dimensional feature space through the kernel function k(Xi, Xj), without requiring explicit computation of the transformed features. This ensures that the problem remains computationally tractable. For RLM, the regularization term β 2 in the original formulation translates directly to θT Kθ in the kernelized version, preserving the strong convexity of the optimization problem as long as we use a positive definite kernel (e.g. radial basis kernel, polynomial kernel, etc.). Similarly, for SGD, the smoothness and Lipschitz continuity of the loss function are preserved, as the transformation affects only the inner product computations, which is linear, and does not alter the fundamental properties of the objective function. Thus, if our original problem satisfies the conditions of LOO stability of RLM and SGD, the kernel trick enables the model to capture nonlinear patterns in the data while ensuring that the theoretical guarantees remain intact. By integrating the kernel trick, we revisit the scenarios in Section 4, where we initially considered synthetic data examples using standard robust linear regression methods. For our experiments, we employed the radial basis function (RBF) kernel k RBF(x, x ) = exp x x 2 2σ2 and the polyno- mial kernel k Poly(x, x ) = (x T x + c)d, both chosen for their ability to model complex nonlinear relationships effectively. For hyperparameters, we chose σ = 0.1, c = 1, and d = 2. We compared these results to the outcomes in Section 4 and this can be theoretically viewed as a special case of kernel robust regression using a linear kernel. As shown in Figure 5, LOO-Stab CP continues to perform reliably under both settings settings without any loss of coverage, validating its adaptability to more sophisticated model structures. Moreover, compared to the linear setting, the use of the kernel trick in nonlinear settings leads to a notable reduction in prediction interval length. This reduction highlights the ability of LOO-Stab CP with kernel trick to provide more precise predictions while capturing the complex patterns inherent in data, thereby enhancing its practical utility. A.2 DETAILED INSIGHTS INTO NONCONVEX OPTIMIZATION In Section 3.2.3, we derived stability bounds for SGD under nonconvex settings. Here, we provide additional details on the derivation and implications of these bounds. Published as a conference paper at ICLR 2025 Figure 5: Comparison of CP methods using kernelized technique. In the convex case (Theorem 3), the stability bounds for SGD are given by: τ LOO i,j = Rη γνiρn+j, τ RO i,j = 2Rη γνiρn+j. On the other hand, for the nonconvex case (Theorem 4), these bounds are modified to include the term R+: τ LOO i,j = R+η γνiρn+j, τ RO i,j = 2R+η γνiρn+j, where R+ = PR r=1 κr and κ = Qn i=1(1 + ηφi). Note that the only distinction in the nonconvex case is that R+ replaces R. Hence, R+ can be interpreted as representing the cumulative effect of nonconvexity. This term is influenced by the learning rate η and the Lipschitz constants of the gradients φi. Specifically, if κ 1, R+ approximates R, aligning with the convex optimization scenario. However, when κ is significantly greater than 1, R+ can grow exponentially with R, resulting in overly loose bounds. The practical implication of this result is that η and φi significantly influence the tightness of stability bounds. While smaller values of η can mitigate this issue, they may also slow down convergence, creating a trade-off between theoretical stability and computational efficiency. As described in Section 5, we conducted experiments with a neural network featuring a single hidden layer and employed approximated stability bounds: τ LOO i,j Rη γ Xi Xn+j and τ RO i,j 2Rη γ Xi Xn+j . (8) Although these terms approximate our problem as if it were convex, they still capture the interaction between the training and test points in our dataset, providing a practical measure of stability. By using these approximations, we adapted our conformal prediction framework without relying on overly conservative worst-case bounds. Alongside the results in Section 5, Figure 6 shows outcomes for a two-hidden-layer network with 10 and 5 nodes, respectively, under the same settings. Our empirical results, shown in Figure 3 and Figure 6, demonstrate that these approximations, despite their theoretical looseness, do not compromise the validity of LOO-Stab CP. These findings are consistent with prior observations (Hardt et al., 2016; Ndiaye, 2022), where theoretical stability bounds in nonconvex settings are often pessimistic, yet empirical results tend to outperform these expectations. Published as a conference paper at ICLR 2025 Figure 6: Comparison of CP methods with neural networks with two hidden layers under choice of m = 100. A.3 ADDITIONAL RESULTS AND DISCUSSION ON BAGGING While derandomized bagging discussed in 3.2.3 provides conceptual insights into stability, practical bagging methods have internal randomness induced by the resampling scheme. To account for this randomness, here, we provide theoretical results on the LOO stability of bagging in practice. The randomness in bagging is two-fold. One source is the resampling process, where datasets are created by sampling with replacement from the original dataset. Another arises from the base learning algorithm itself, as seen in random forest, where random feature subsets are selected in each bag. The latter can be characterized by a random variable ξ U(0, 1). Algorithm 3 illustrates the implementation of bagging. Algorithm 3: Bagging Input : Training set D = {(Xi, Yi)}n i=1, Number of bags B, Number of samples in each bag m Output: Prediction function bf( ) = f B( ) := 1 B PB b=1 f (b)( ). for b [B] do 1. Sample bag r(b) = (i(b) 1 , . . . , i(b) m ) where i(b) j i.i.d. U([n]) for j [m]; 2. Sample seed ξ(b) U([0, 1]); 3. Fit model f (b) with r(b) and ξ(b); end The prediction function of bagging is inherently random, making it challenging to derive a deterministic stability bound. Nonetheless, based on Theorem 5, we can deduce with high probability that bagging is LOO stable. In the context of bagging, bf y j (x) = 1 B PB b=1 f y,(b) j (x) and bf(x) = 1 B PB b=1 f (b)(x), where the sample average replaces the expectation compared to derandomized bagging. The following theorem provides a probabilistic guarantee on the LOO stability bounds for bagging. Theorem 6. Suppose the conditions of Theorem 5 hold. Then, for any δ (0, 1), bagging has the following LOO stability bounds with probability at least 1 δ. τ LOO i,j (δ) = γwj with p = 1 (1 1 n)m where i ranges in [n] {n + j} for each 1 j m. The implications of Theorem 5 and Theorem 6 are as follows. Note that the above theorem requires only the minimal assumption that the base model fitting algorithm used in bagging has bounded output. This suggests that LOO-Stab CP can be applied to a wide range of algorithms. For example, building on the stability of bagging, Wang et al. (2023) extended the results to the stability of random forest. Their key insight was that random forest utilize weak decision trees as their base model fitting algorithm, and the final output of a decision tree is always determined as the average of the responses Published as a conference paper at ICLR 2025 in the training data it uses. As a result, it is straightforward to see that the output of a decision tree cannot exceed the range of the responses in the training set. Moving forward, these insights can serve as a foundation for exploring stability guarantees in other complex learning algorithms. B COMPARISON WITH DERANDOMIZATION APPROACHES In this appendix, we extend our numerical experiments to include comparisons with derandomization approaches (Solari & Djordjilovi c, 2022; Gasparin & Ramdas, 2024), which are potential alternatives to LOO-Stab CP in terms of reducing the variability of Split CP. Specifically, these methods differ from Split CP, which relies on a single data split, by merging multiple prediction intervals constructed from various splits into one final prediction interval. Figure 7: Comparison of CP methods including derandomization approaches on synthetic datasets. Among these, two notable methods have been proposed in the literature. Solari & Djordjilovi c (2022) were the first to propose this approach. They generated split conformal prediction intervals with 1 α 2 validity from multiple data splits and then derived a final prediction interval through majority voting (i.e., the range covered by more than half of these intervals). They showed that this final interval maintains 1 α validity. We refer to their method as MM-Split CP (Majority Multi-Split Conformal Prediction). Meanwhile, Gasparin & Ramdas (2024) focused on the exchangeability of each prediction interval derived through MM-Split CP. Building on this property, they proposed an alternative aggregation technique that produces tighter yet still valid prediction intervals by applying a majority vote correction. We denote this method as EM-Split CP (Exchangeable Multi-Split Conformal Prediction). For further details on these methods, we refer readers to Solari & Djordjilovi c (2022); Gasparin & Ramdas (2024). We compare the performance of these two derandomization techniques with our proposed method, LOO-Stab CP. To this end, we applied the methods to the settings described in Section 4 and Section 5. For MM-Split CP and EM-Split CP, we merged 30 splits. Figures 7 and 8 present the results on synthetic and real datasets, respectively. From the results, we observe that the variability in coverage and interval length produced by MM-Split CP and EM-Split CP is noticeably lower than that of Split CP, indicating that these derandomization techniques effectively reduce the internal variability of data-splitting approaches. However, we also find that the average coverage of MM-Split CP and EM-Split CP is generally higher than the predetermined level, suggesting that these derandomization techniques tend to produce conservative intervals. Furthermore, both methods require significantly more computational time, which can be attributed to their reliance on multiple model fits, unlike Split CP and Published as a conference paper at ICLR 2025 Figure 8: Comparison of CP methods including derandomization approaches on real datasets. LOO-Stab CP, which rely on a single model fit. In contrast, LOO-Stab CP produces tight and stable intervals while maintaining reasonable coverage. These results underscore the computational efficiency and precision of LOO-Stab CP. These findings are consistent across both synthetic and real data scenarios, showcasing the adaptability and efficiency of LOO-Stab CP compared to other derandomization methods. C IMPLEMENTATIONS OF CONFORMAL PREDICTION METHODS C.1 ORACLE CONFORMAL PREDICTION Algorithm 4: (Oracle CP) Oracle Conformal Prediction Set Input : Training set D = {(Xi, Yi)}n i=1, test sets Dtest = {(Xn+j, Yn+j)}m j=1, desired coverage 1 α. Output: Prediction interval Coracle j,α (Xn+j) for each j [m]. for j [m] do 1. Fit the prediction function bfj on DYn+j j ; 2. Compute non-conformity scores on DYn+j j : Si,j = |Yi bfj(Xi)| for i [n] {n + j}; 3. Compute prediction interval: Coracle j,α (Xn+j) = [ bfj(Xn+j) Q1 α({Si,j}n i=1 {Sn+j,j})]; end Published as a conference paper at ICLR 2025 C.2 FULL CONFORMAL PREDICTION Algorithm 5: (Full CP) Full Conformal Prediction Set Input : Training set D = {(Xi, Yi)}n i=1, test points {Xn+j}m j=1, search grid G, desired coverage 1 α. Output: Prediction interval Cfull j,α (Xn+j) for each j [m]. for j [m] do 1. Fit the prediction function bf y j on Dy j ; 2. Compute non-conformity scores on D: Sy i,j = |Yi bf y j (Xi)| for i [n]; 3. Compute quantile value Q1 α({Sy i,j}n i=1 { }); end 4. Compute prediction interval Cfull j,α (Xn+j) = y G : Sy n+j,j Q1 α({Sy i,j}n i=1 { }) ; end C.3 SPLIT CONFORMAL PREDICTION Algorithm 6: (Split CP) Split Conformal Prediction Set Input : Training set Dtrain = {(Xi, Yi)}i Itrain, calibration set Dcalib = {(Xi, Yi)}i Icalib, test points {Xn+j}m j=1, desired coverage 1 α. Output: Prediction interval Csplit j,α (Xn+j) for each j [m]. 1. Fit the prediction function bftrain on Dtrain; 2. Compute non-conformity scores on Dcalib: Si = |Yi bftrain(Xi)| for i Icalib; for j [m] do 3. Compute prediction interval Csplit j,α (Xn+j) = [ bftrain(Xn+j) Q1 α({Si}i Icalib { })]; end C.4 REPLACE-ONE STABLE CONFORMAL PREDICTION Algorithm 7: (RO-Stab CP) Replace-One Stable Conformal Prediction Set Input : Training set D = {(Xi, Yi)}n i=1, test points {Xn+j}m j=1, initial guesses {eyn+j}m j=1, desired coverage 1 α. Output: Prediction interval CRO j,α (Xn+j) for each j [m]. for j [m] do 1. Fit the prediction function bfj on Deyj j ; 2. Compute (guessed) non-conformity scores on D: Si = |Yi bfj(Xi)| for i [n]; 3. Compute stability bounds τ RO i,j for i [n] {n + j}; 4. Compute prediction interval: CRO j,α (Xn+j) = [ bfj(Xn+j) (Q1 α({Si + τ RO i,j }n i=1 { }) + τ RO n+j,j)]; end Published as a conference paper at ICLR 2025 D IMPLEMENTATIONS OF LOO STABLE ALGORITHMS D.1 REGULARIZED LOSS MINIMIZATION Algorithm 8: (RLM) Regularized Loss Minimization Input : Training set D = {(Xi, Yi)}n i=1. Output: Prediction function bf( ) = fbθ( ). 1. Compute optimal parameter bθ := arg minθ Θ{ 1 n Pn i=1 ℓ(Yi, fθ(Xi)) + Ω(θ)}; D.2 STOCHASTIC GRADIENT DESCENT Algorithm 9: (SGD) Stochastic Gradient Descent (with Random Reshuffling) Input : Training set D = {(Xi, Yi)}n i=1, number of epochs R, step size η, initial value θ0. Output: Prediction function bf( ) = fbθ( ). 1. Initialize parameter θ := θ0; for r [R] do 2. Sample a permutation π of [n] uniformly at random; for i [n] do 3. Update parameter θ := θ η θℓ(Yπi, fθ(Xπi)); end end 4. Set the final parameter bθ := θ; E USEFUL LEMMAS Lemma 1 (Lemma 13.5 in Shalev-Shwartz & Ben-David (2014)). 1. Let g, h : Rp R be convex function and λ-strongly convex function, respectively. Then, g + h is λ-strongly convex. 2. Let g : Rp R be λ-strongly convex and y minimize g, then, for any x, g(x) g(y) λ Lemma 2 (Lemma 3.6 in Hardt et al. (2016)). Let g : Rp R be a function such that g : Rp Rp is a φ-Lipschitz. Define h : Rp Rp such that h(θ) = θ α g(θ) with α 2/φ. Then, h is (1 + ηφ)-Lipschitz. If g is in addition convex, h is 1-Lipschitz. F PROOFS OF THEOREMS F.1 PROOF OF THEOREM 1 Proof. By Definition 2, we have Sy i,j Si + τ LOO i,j , for i [n] and j = [m]. Similarly, for any j, we have |y bf(Xn+j)| τ LOO n+j,j Sy n+j,j. Therefore, for any j, the following holds for all y contained in Cfull j,α (Xn+j): |y bf(Xn+j)| τ LOO n+j,j Q1 α({Si + τ LOO i,j }n i=1 { }), which is equivalent to y [ bf(Xn+j) (Q1 α({Si + τ LOO i,j }n i=1 { }) + τ LOO n+j,j)]. This directly implies CLOO j,α (Xn+j) Cfull j,α (Xn+j) and hence P(Yn+j CLOO j,α (Xn+j)) P(Yn+j Cfull j,α (Xn+j)) 1 α, for any choice of α. Published as a conference paper at ICLR 2025 F.2 PROOF OF THEOREM 2 Proof. For y Y and j [m] define F y j (θ) = 1 n+1{Pn i=1 ℓ(Yi, fθ(Xi)) + ℓ(y, fθ(Xn+j)} + Ω(θ) and F(θ) = 1 n{Pn i=1 ℓ(Yi, fθ(Xi))} + Ω(θ). Then, bθy j = arg minθ Θ F y j (θ) and bθ = arg minθ Θ F(θ). We begin by proving the LOO algorithmic stability. Fix y and j and suppose that bθy j bθ 2(ρn+j+ ρ) λ(n+1) . Then, for all i [n] {n + j} and z Y, |S(z, bf y j (Xi)) S(z, bf(Xi))| γ| bf y j (Xi) bf(Xi)| = γ|fbθy j (Xi) fbθ(Xi)| γνi bθy j bθ 2γνi(ρn+j + ρ) The first and the second inequalities follow from the Lipschitz property of non-conformity score function and the prediction function, respectively. Therefore, it suffices to obtain the bound of bθy bθ as assumed above. By the first part of Lemma 1, F y j and F are λ-strongly convex functions of θ. Using the second part of the Lemma 1, we have: λ 2 bθ bθy j 2 F y j (bθ) F y j (bθy j ) = n n + 1{F(bθ) F(bθy j )} + 1 n + 1{ℓ(y, fbθ(Xn+j)) ℓ(y, fbθy j (Xn+j)) + Ω(bθ) Ω(bθy j )} 1 n + 1{ℓ(y, fbθ(Xn+j)) ℓ(y, fbθy j (Xn+j)) + Ω(bθ) Ω(bθy j )}. The last inequality follows from the optimality of bθ. Now, by the Lipschitz property of the loss function, we have: ℓ(y, fbθ(Xn+j)) ℓ(y, fbθy j (Xn+j)) ρn+j bθ bθy j . (10) On the other hand, again by the optimality of bθ, it holds that 0 F(bθy j ) F(bθ) = 1 i=1 {ℓ(Yi, fbθy j (Xi)) ℓ(Yi, fbθ(Xi)) + Ω(bθy) Ω(bθ), which implies Ω(bθ) Ω(bθy j ) ρ bθy j bθ , (11) by the Lipschitz property of the loss function. Finally, combining (10) and (11) to (9), we get bθy j bθ 2(ρn+j+ ρ) For the RO algorithmic stability, fix y, ey, and j. By the similar arguments as for (9), we have λ 2 bθey j bθy j 2 1 n + 1{ℓ(y, fbθ ey j (Xn+j)) ℓ(y, fbθy j (Xn+j))} + 1 n + 1{ℓ(ey, fbθy j (Xn+j)) ℓ(ey, fbθ ey j (Xn+j))} n + 1 bθy j bθey j , and this implies bθy j bθey j 4ρn+j λ(n+1). The rest of the proof is similar to the previous case. Published as a conference paper at ICLR 2025 F.3 PROOF OF THEOREM 3 Proof. We start with proving the case of RO algorithmic stability first, for clearer presentation. Also, we only prove the case of j = 1 and R = 1 since extending to the case of j > 1 or R > 1 is straightforward. Let π be an arbitrary permutation of [n + 1] and k be such that πk = n + 1. Fix y and ey. Let (θy 0, θy 1, . . . , θy n+1) and (θey 0, θey 1, . . . , θey n+1) be the updating sequences of SGD sharing π for Dy j and Dey j , respectively. Note that bf y j = bf y 1 = fθy n+1 and bf ey j = bf ey 1 = fθ ey n+1. As in the proof of Theorem 2, we first bound the distance between the two terminal parameters, θy n+1 θey n+1 . Let us first consider the case of k = n + 1. Then, by the SGD update rule, we can see that θy i = θey i for all i = [n] since for SGD update, the two sequences share the first n data points as well as the initial parameter. Therefore, we have θy n+1 θey n+1 = {θy n η θℓ(y, fθy n(Xn+1))} {θey n η θℓ(ey, fθ ey n(Xn+1))} η θℓ(y, fθy n(Xn+1)) + η θℓ(ey, fθ ey n(Xn+1)) Here, we used triangle inequality, and then the Lipschitz property of loss function. Now, consider the case of k < n + 1. If i = k, by the similar argument, we can show that θy i θey i θy i 1 θey i 1 + 2ηρn+1. Otherwise, if i = k, by Lemma 2 with the choice of α := η, φ := φπi 2 η, and g(θ) := ℓ(Yπi, fθ(Xπi)), we have: θy i θey i = {θy i 1 η θℓ(Yπi, fθy i (Xπi))} {θey i 1 η θℓ(Yπi, fθ ey i (Xπi))} θy i θey i , (12) since ℓ(Yπi, fθ(Xπi)) is convex. Unraveling the recursion from the above two inequality, we get θy n+1 θey n+1 θy 0 θey 0 + 2ηρn+1 = 2ηρn+1. The last equality holds since the two updating sequences share the common initial value. The remaining parts follow similarly to the proof of Theorem 2. Next, to prove the LOO algorithmic stability, fix y and let π = (π 1, . . . , π n) be the sequence obtained from π by excluding the kth entry. For example, if we choose n = 4 and π = (3, 2, 5, 1, 4), then k = 3 and π = (3, 2, 1, 4). Then, it can be shown that π is an arbitrary permutation of [n]. Define an updating sequence of SGD, (θ0, θ1, . . . , θn) for D induced by π , i.e, bf = fθn. Note that θ0 = θy 0. As the case of the RO algorithmic stability, it suffices to show that θy n+1 θn ηρn+1. If k = n + 1, then we have θy i = θi for i = [n]. Therefore, it follows that θy n+1 θn = {θy n η θℓ(y, fθy n(Xn+1))} θn η θℓ(y, fθy n(Xn+1)) ηρn+j. For the case of k < n + 1 and further remaining parts, we can follow the same procedure used in the RO algorithmic stability. F.4 PROOF OF THEOREM 4 Proof. The overall structure of the proof is almost identical to that of Theorem 3. Again, let us focus on the proof of RO stability with R = 1 first. Recall that in that proof, the convexity assumption was used only in (12). Since we have discarded the convexity assumption of ℓ(Yπi, fθ(Xπi)) by Lemma 2 again, the Lipschitz constant of h(θ) = θ η θℓ(Yπi, fθ(Xπi)) is replaced from 1 to 1 + ηφπi. That is, we obtain the following recursive inequalities: ( θy i 1 θey i 1 + 2ηρn+1 if i = k, (1 + ηφπi) θy i 1 θey i 1 if i < k. (13) Published as a conference paper at ICLR 2025 Considering θy 0 θey 0 = 0, unraveling these inequalities yields θy n+1 θey n+1 i=k+1 (1 + ηφπi) i=1 (1 + ηφπi) i=1 (1 + ηφi) since ηφi > 0 by definitions. Extending this to the case of R > 1 is not as straightforward as the proof of Theorem 3, hence we also present the corresponding proof. In this case, we can use induction. Set R > 1 and let r [R]. Suppose that up to (r 1)th epoch, the difference of parameter is bounded by 2 Pr 1 s=1 κs ηρn+j. Then, rth iteration can be treated as the case of R = 1 with θy 0 θey 0 2 Pr 1 s=1 κs ηρn+j. In this case, unraveling (13) yileds θy n+1 θey n+1 i=k+1 (1 + ηφπi) i=1 (1 + ηφπi) θy 0 θey 0 + 2ηρn+j i=k+1 (1 + ηφπi) i=1 (1 + ηφπi) ηρn+j + 2ηρn+j i =k (1 + ηφπi) i=k+1 (1 + ηφπi) Since we already proved the case of r = 1, this completes proof for RO stability. For the LOO stability, we can use the same reasoning. F.5 PROOF OF THEOREM 5 Proof. Fix (x, y) X Y and j [m]. Due to the symmetry of the resampling scheme, i.e., sampling uniformly with replacement, we have bf(x) = E h f y,(b) j (x) n + 1 / r i . Therefore, using the above facts along with the Lipschitz property of the non-conformity score function, we get |S(z, bf y j (x)) S(z, bf(x))| γ bf y j (x) bf(x) = γ E h f y,(b) j (x) i E h f y,(b) j (x) n + 1 / r i = γ E h f y,(b) j (x) E h f y,(b) j (x) i n + 1 / r i . Next, by the definitions of conditional expectation and covariance, E h f y,(b) j (x) E h f y,(b) j (x) i n + 1 / r i = 1 P(n + 1 / r)E hn f y,(b) j (x) E h f y,(b) j (x) io 1{n + 1 / r} i = 1 P(n + 1 / r)Cov f y,(b) j (x), 1{n + 1 / r} . Published as a conference paper at ICLR 2025 Combining the above results, we have |S(z, bf y j (x)) S(z, bf(x))| 1 1 p Cov f y,(b) j (x), 1{n + 1 / r} , (14) where p = P(n + 1 r) = 1 (1 1 n)m. Furthermore, it holds that Cov f y,(b) j (x), 1{n + 1 / r} h Var f y,(b) j (x) Var (1{n + 1 / r}) i 1 w2 j 4 p(1 p) Here, the first inequality follows from the Cauchy-Schwarz inequality. For the second inequality, we apply Popoviciu s inequality for variance and the properties of the Bernoulli distribution. Substituting this bound into (14) completes the proof. F.6 PROOF OF THEOREM 6 Proof. Fix (x, y) X Y and j [m]. Let bf y j (x) and bf(x) denote the predictions correspond- ing to bagging, and let bf y, j (x) and bf (x) denote the predictions corresponding to derandomized bagging. Then, |S(z, bf y j (x)) S(z, bf(x))| γ bf y j (x) bf(x) = γ n bf y j (x) bf y, j (x) o + n bf y, j (x) bf (x) o + n bf (x) bf(x) o γ h bf y j (x) bf y, j (x) + bf y, j (x) bf (x) + bf (x) bf(x) i . (15) Consider each term on the last line of (15). For the first term, note that bf y j (x) bf y, j (x) = b=1 f y,(b) j (x) E h f y,(b) j (x) i . Since each single prediction f y,(b) j (x) is almost surely bounded within an interval of range wj, by Hoeffding s inequality, we have P bf y j (x) bf y, j (x) t 1 2 exp 2Bt2/w2 j , for any t > 0. Setting δ 2 = 2 exp( 2Bt2/w2 j) yields bf y j (x) bf y, j (x) w2 j 2B log 4 Similarly, for the third term, we obtain an identical bound: bf(x) bf (x) w2 j 2B log 4 For the second term, a direct application of Theorem 5 gives the following deterministic bound: bf y, j (x) bf (x) γwj Published as a conference paper at ICLR 2025 Combining all the bounds for the three terms in (15) using the union bound, we have that, with probability at least 1 δ, |S(z, bf y j (x)) S(z, bf(x))| γ w2 j 2B log 4 w2 j 2B log 4 G DETAILS OF NUMERICAL EXPERIMENTS G.1 DETAILS OF ALGORITHMS The configurations satisfy the assumptions of Theorem 2 and Theorem 3, allowing us to compute the stability bounds concretely. First, for RLM, the following stability bounds were used. For i [n] {n + j} for each j [m], τ LOO i,j = 2ϵ Xi , τ RO i,j = 4ϵ Xi Xn+j Next, the stability bounds for SGD are as follows: τ LOO i,j = Rηϵ Xi Xn+j , τ RO i,j = 2Rηϵ Xi Xn+j . G.2 ADDITIONAL RESULTS FROM SECTION 4 Oracle CP Full CP Split CP RO-Stab CP LOO-Stab CP RLM Coverage 0.903 (0.040) 0.896 (0.043) 0.903 (0.060) 0.910 (0.039) 0.910 (0.039) Length 3.272 (0.250) 3.300 (0.257) 3.455 (0.514) 3.442 (0.257) 3.442 (0.257) Time 3.201 (0.172) 176.783 (15.704) 0.017 (0.006) 3.190 (0.176) 0.035 (0.008) SGD Coverage 0.903 (0.041) 0.896 (0.043) 0.900 (0.059) 0.911 (0.040) 0.906 (0.040) Length 3.252 (0.250) 3.300 (0.257) 3.420 (0.557) 3.464 (0.259) 3.405 (0.259) Time 0.720 (0.087) 19.320 (2.018) 0.005 (0.003) 0.720 (0.075) 0.009 (0.005) RLM Coverage 0.892 (0.045) 0.886 (0.047) 0.893 (0.059) 0.897 (0.044) 0.897 (0.044) Length 3.659 (0.317) 3.690 (0.340) 3.812 (0.554) 3.828 (0.344) 3.827 (0.344) Time 3.289 (0.219) 163.101 (22.120) 0.017 (0.005) 3.275 (0.221) 0.038 (0.010) SGD Coverage 0.892 (0.045) 0.886 (0.047) 0.895 (0.062) 0.900 (0.043) 0.894 (0.044) Length 3.641 (0.318) 3.690 (0.340) 3.855 (0.612) 3.849 (0.345) 3.789 (0.345) Time 0.732 (0.074) 17.425 (2.206) 0.005 (0.003) 0.746 (0.093) 0.009 (0.005) Table 2: Mean (and standard deviation) of empirical coverage, average prediction interval length, and execution time across 100 iterations for each scenario in simulation. Published as a conference paper at ICLR 2025 G.3 ADDITIONAL RESULTS FROM SECTION 5 Oracle CP Full CP Split CP RO-Stab CP LOO-Stab CP m = 1 Boston RLM 0.920 (0.273) 0.910 (0.288) 0.910 (0.288) 0.920 (0.273) 0.920 (0.273) SGD 0.900 (0.302) 0.920 (0.273) 0.910 (0.288) 0.900 (0.302) 0.900 (0.302) Diabetes RLM 0.910 (0.288) 0.900 (0.302) 0.890 (0.314) 0.910 (0.288) 0.910 (0.288) SGD 0.920 (0.273) 0.910 (0.288) 0.920 (0.273) 0.930 (0.256) 0.920 (0.273) m = 100 Boston RLM 0.906 (0.031) 0.897 (0.031) 0.898 (0.040) 0.905 (0.030) 0.905 (0.030) SGD 0.905 (0.029) 0.901 (0.032) 0.901 (0.036) 0.910 (0.028) 0.906 (0.029) Diabetes RLM 0.900 (0.035) 0.889 (0.037) 0.894 (0.043) 0.900 (0.035) 0.900 (0.035) SGD 0.902 (0.031) 0.890 (0.036) 0.902 (0.037) 0.914 (0.030) 0.906 (0.031) Table 3: The mean (and the standard deviation) of empirical coverage over 100 iterations for each scenario on Boston Housing and Diabetes datasets. G.4 ADDITIONAL RESULTS FROM SECTION 6 cf BH RO-cf BH LOO-cf BH q = 0.1 FDP 0.0928 (0.0713) 0.0038 (0.0115) 0.0657 (0.0617) Power 0.6319 (0.2053) 0.3041 (0.1452) 0.6744 (0.1295) Time 0.0037 (0.0006) 0.2976 (0.0114) 0.0060 (0.0011) q = 0.2 FDP 0.2000 (0.0807) 0.0602 (0.0813) 0.1836 (0.0560) Power 0.9277 (0.0838) 0.6522 (0.1450) 0.9430 (0.0486) Time 0.0037 (0.0005) 0.2971 (0.0100) 0.0060 (0.0002) q = 0.3 FDP 0.2882 (0.0806) 0.2483 (0.1212) 0.2837 (0.0934) Power 0.9923 (0.0198) 0.9627 (0.0347) 0.9917 (0.0136) Time 0.0037 (0.0002) 0.2970 (0.0101) 0.0060 (0.0003) Table 4: Mean (and standard deviation) of FDP, power, and execution time for three conformal selection methods.