# CONFORMAL PREDICTION VIA REGRESSION-AS-CLASSIFICATION

Published as a conference paper at ICLR 2024

Etash Guha, RIKEN Center for AI Project, SambaNova Systems (etash.guha@sambanovasystems.com)
Shlok Natarajan, Salesforce (shloknatarajan@salesforce.com)
Thomas Möllenhoff, RIKEN Center for AI Project (thomas.moellenhoff@riken.jp)
Mohammad Emtiyaz Khan, RIKEN Center for AI Project (emtiyaz.khan@riken.jp)
Eugene Ndiaye, Apple (e_ndiaye@apple.com)

ABSTRACT

Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent the challenges by converting regression to a classification problem and then using CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and make necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.

## 1 INTRODUCTION

Quantifying and estimating the uncertainty of machine-learning models is an important task for many problems, especially mission-critical applications where reliable predictions are required. Conformal Prediction (CP) (Vovk et al., 2005) has recently gained popularity and has been used successfully in applications such as breast cancer detection (Lambrou et al., 2009), stroke risk prediction (Lambrou et al., 2010), and drug discovery (Cortés-Ciriano & Bender, 2020). Under mild conditions, CP techniques construct a prediction set that, for a given test input, is guaranteed to contain the true (unknown) output with high probability. The set is built using a conformity score, which, roughly speaking, indicates the similarity between a new test example and the training examples. The conformal set merely gathers examples that have large conformity scores.

Despite its popularity, CP for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed (Lei & Wasserman, 2014). The main challenge lies in the design of the conformity score. It is common to use a simple choice of score function, such as the distance to a mean regressor, but such choices may ignore subtle features of the shape of the output distribution. For instance, this could lead to symmetric intervals or to ignoring heteroscedasticity. In theory, it is better to estimate the (conditional) distribution over the output, for example by using kernel density estimation, and to use it directly to build a confidence interval. However, such estimation approaches are also challenging, and estimates can be sensitive to the choice of kernel and hyperparameters, which can yield unstable results. We circumvent the challenges by exploiting the existing CP techniques for classification. We proceed by first converting regression to a classification problem and then using CP techniques for classification to obtain a conformal set.
Regression-as-classification approaches are popular for various applications in computer vision and have led to more accurate training than regression-only training (Stewart et al., 2023; Zhang et al., 2016; Fu et al., 2018; Rothe et al., 2015; Van Den Oord et al., 2016; Diaz & Marathe, 2019).

Figure 1: We show two examples where the output distribution is (a) heteroskedastic and (b) bimodal. In both cases, our method is able to change the interval (shaded gray region) adaptively as the input values x are increased. Examples outside the gray regions (white dots) are deemed different from those inside them (black dots).

We leverage these approaches to construct a distribution-based conformal set that can flexibly capture the shape of the output distribution while preserving the simplicity and efficiency of CP for classification. First, we discretize the output space into bins, treating each bin as a distinct class. Second, to preserve the ordering of the continuous output space, we design an alternative loss function that penalizes the density on bins far from the true output value but also facilitates variability through an entropy regularization. The loss design is similar in spirit to Weigend & Srivastava (1995) and Diaz & Marathe (2019). The resulting method can adapt to heteroscedasticity, bimodality, or both in the label distribution. We verify this on synthetic and real datasets, where we achieve the shortest intervals compared to other CP baselines. See examples in Figure 1.

## 2 BACKGROUND ON CONFORMAL PREDICTION

Given a new input $x_{\text{new}}$, CP techniques aim to construct a set that contains the true but unknown output $y_{\text{new}}$ with high probability. Assuming that a pair of input-output variables $(x, y)$ has a joint density $p(x, y)$ and a conditional density $p(y \mid x)$, oracle prediction sets (with joint and conditional coverage) for the output $y$ can be constructed as

$$\{z \in \mathbb{R} : p(x, z) \geq \tau_{\alpha}\} \quad \text{or} \quad \{z \in \mathbb{R} : p(z \mid x) \geq \tau_{\alpha, x}\}, \tag{1}$$

where the thresholds $\tau_{\alpha}$ and $\tau_{\alpha, x}$ are selected to ensure that the corresponding sets have a probability mass that meets or exceeds a prescribed confidence level $1 - \alpha \in (0, 1)$. As the ground-truth distribution is unknown, we rely solely on estimating these uncertainty sets using the density estimators $\hat{p}(x, y)$ and $\hat{p}(y \mid x)$. The latter can be inaccurate due to numerous sources of error such as model misspecification, small sample size, high optimization errors during training, and overfitting. Without a stronger distributional assumption, the finite-sample guarantee is typically not upheld. Conformal Prediction has arisen as a method for yielding sets that do hold finite-sample guarantees.

Given a partially observed instance $(x_{\text{new}}, y_{\text{new}})$ where $y_{\text{new}}$ is unknown, Conformal Prediction (CP) (Vovk et al., 2005) constructs a set of values that contains $y_{\text{new}}$ with high probability without knowing the underlying data distribution. Under conformal prediction, this property is guaranteed under the mild assumption that the data satisfies exchangeability. The set is called the conformal set and is built using a conformity score, denoted by $\sigma(x, y)$, which measures how appropriate an output value is for a given input example. There are many ways to build the conformity score, but they all involve splitting the data into a training set $D_{\text{tr}}$ and a calibration set $D_{\text{cal}}$.
Often, a prediction model $\mu_{\text{tr}}(x)$ is built using the training set, and then a conformity score is obtained using this model along with the calibration set (we will shortly give an example). The conformal set merely gathers the points with larger conformity scores:

$$\{z \in \mathbb{R} : \sigma(x_{\text{new}}, z) \geq Q_{1-\alpha}(D_{\text{cal}})\},$$

where $Q_{1-\alpha}(D_{\text{cal}})$ is the $(1-\alpha)$-quantile of the conformity scores on the calibration data. This set provably contains $y_{\text{new}}$ with probability larger than $1 - \alpha$ for any finite sample size and without assumptions on the ground-truth distribution.

There are many design choices for this conformity score. For example, one can choose a prediction model $\mu_{\text{tr}}(x)$ as an estimate of the conditional expectation and measure conformity as the absolute value of the residual, i.e., $\sigma(x, y) = |y - \mu_{\text{tr}}(x)|$. The corresponding conformal set is a single interval centered around the prediction $\mu_{\text{tr}}(x)$ and of constant length $Q_{1-\alpha}(D_{\text{cal}})$ for any example $x_{\text{new}}$, without taking into account its variability. However, in situations where the underlying data distribution exhibits skewness or heteroscedasticity, we may desire a more flexible conformity score. We want an approach for approximating this density function that is versatile in the distributions it can represent and avoids the difficulties seen in prior methods. To that end, we build on density estimation techniques from classification conformal prediction that are already performing effectively.
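To make the split-conformal recipe above concrete, here is a minimal NumPy sketch of the absolute-residual variant. The helper name and the finite-sample quantile correction are our own illustrative choices, assuming `mu` is a regressor already fit on $D_{\text{tr}}$; this is a sketch, not the paper's implementation.

```python
import numpy as np

def split_conformal_interval(mu, X_cal, y_cal, x_new, alpha=0.1):
    """Split conformal prediction with the absolute-residual score
    sigma(x, y) = |y - mu(x)|; mu is any regressor fit on D_tr."""
    scores = np.abs(y_cal - mu(X_cal))  # residual (non-conformity) scores on D_cal
    n = len(y_cal)
    # Finite-sample-corrected (1 - alpha) empirical quantile of the scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    pred = mu(x_new)
    # Constant-width interval centered at the point prediction.
    return pred - q, pred + q
```

Note that this score yields the constant-width interval criticized above; the rest of the paper replaces it with a density-based score.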
**Related Works.** Since their introduction, a lot of work has been done to improve conformal prediction sets. As a simple score function, the distance to the conditional mean, i.e., $\sigma(x, y) = |y - \mu(x)|$ where $\mu(x)$ is an estimate of $\mathbb{E}(y \mid x)$, was prominently used (Papadopoulos et al., 2002; Lei et al., 2018; Guha et al., 2023). Instead, Romano et al. (2019) suggest estimating conditional quantiles and a conformity score based on the distance from a trained quantile regressor, i.e., $\sigma(x, y) = \max(\mu_{\alpha/2}(x) - y,\; y - \mu_{1-\alpha/2}(x))$, where $\mu_{\alpha}$ is the $\alpha$-th quantile regressor. Within the literature, our strategy is more closely related to the distribution-based methods (Lei et al., 2013). Following a similar line of work, Chernozhukov et al. (2021) argued that the conformal quantile regression score function might be less adaptive, since the distance to the quantile behaves similarly to the distance to the mean estimate. Instead, they suggested estimating the cumulative (conditional) distribution function and directly outputting $\{y : F(x, y) \in [\alpha/2,\; 1 - \alpha/2]\}$. However, their method cannot account for bimodality since it can only output a single interval by design. An equally interesting approach for distribution-based prediction sets is based on learning a Bayesian estimator, which, however, may require a well-specified prior and can be computationally expensive (Fong & Holmes, 2021). Variants based on estimating the conditional density of the response using histogram regressors (Sesia & Romano, 2021) can detect a possible asymmetry of the ground-truth distribution. However, our densities are estimated through classification techniques, whereas their densities are learned through a histogram of many regressors. Smoothness of the distribution is encoded in our loss function, whereas they use a post hoc subinterval-finding algorithm based on linear programming to prevent many disjoint intervals. Our techniques have some overlap with ordinal regression and ordinal classification. Xu et al. (2023) discuss various risk categories (similar to coverage) for ordinal classification, while our work considers different score functions where coverage serves as the loss function. Lu et al. (2022) discuss the adaptation of APS-style score functions to accommodate the ordinal structure of classes, which was subsequently utilized in Sesia & Romano (2021).

## 3 CP VIA REGRESSION-AS-CLASSIFICATION

### 3.1 CLASSIFICATION CONFORMAL PREDICTION

We aim to compute a conformity function that accurately predicts the appropriateness of a label for a specific data point. Given that the distribution function across labels can adopt diverse forms, such as being bimodal, heavy-tailed, or heteroscedastic, our approach must effectively factor in such shapes while upholding its coverage guarantee. A frequently employed technique is to use the conditional label density as a conformity function, which leads to reliable results in the classification context. Typically, practitioners perform conformal prediction for classification with probability estimates from a softmax neural network with $K$ output logits trained using the cross-entropy loss. Let us denote the parametrized discrete probability distribution over the labels of input $x$ by $q_{\theta}(\cdot \mid x) = \mathrm{softmax}(f_{\theta}(x))$, where

$$\mathrm{softmax}(v)_j = \frac{\exp(v_j)}{\sum_{k=1}^{K} \exp(v_k)}.$$

Traditionally, we fit the neural network by minimizing the cross-entropy loss on the training set:

$$\hat{\theta} \in \arg\min_{\theta} \sum_{i=1}^{n_{\text{tr}}} \mathrm{KL}\big(\delta_{y_i} \,\|\, q_{\theta}(\cdot \mid x_i)\big).$$

Here, $\delta_{y_i}$ is the Dirac distribution with all of the probability mass on $y_i$. Let us assume that we have trained and acquired a $\hat{\theta}$ that minimizes the traditional cross-entropy loss on the training dataset. A natural conformity score is simply the probability of a label according to the learned conditional distribution, i.e., $\sigma(x, y) = q_{\hat{\theta}}(y \mid x)$. This approach is both straightforward and efficient. The neural network's flexibility allows it to learn numerous label distributions with adaptivity across examples and little explicit design of prior structure.

### 3.2 REGRESSION TO CLASSIFICATION APPROACH

Naturally, we strive to carry this practicality and effectiveness over to regression settings. However, the distribution of labels in the regression scenario is continuous, and learning a continuous distribution directly with a neural network is challenging (Rothfuss et al., 2019). Often, Bayesian or kernel density estimators are employed to estimate this distribution. Other techniques learn this distribution by training numerous regressors and aggregating them, for example, Conformal Histogram Regression (Sesia & Romano, 2021). However, these methods look different from the conformal prediction approaches for classification; it would be desirable to use similar methods for both classification and regression conformal prediction. One method of unifying classification and regression problems outside the conformal prediction literature is known as Regression-as-Classification. We simply turn a regression problem into a classification problem by binning the range space. Specifically, we generate $K$ bins with $K$ equally spaced numbers covering the interval $\mathcal{Y} = [y_{\min}, y_{\max}]$, where $y_{\min}$ (or $y_{\max}$) is the minimum (or maximum) value of the labels observed in the training set. More explicitly, we define our discretization of the label space as $\hat{\mathcal{Y}} = \{\hat{y}_1, \ldots, \hat{y}_K\}$, where

$$\hat{y}_{k+1} = \hat{y}_k + \frac{\hat{y}_K - \hat{y}_1}{K - 1}, \quad \text{with } \hat{y}_1 = y_{\min} \text{ and } \hat{y}_K = y_{\max}.$$

These values $\hat{y} \in \hat{\mathcal{Y}}$ form the midpoints of the bins of our discretization: the $k$-th bin consists of all labels in the range space $\mathcal{Y}$ closest to $\hat{y}_k$. Intuitively, we can treat each bin as a class. Thus, we have turned a regression problem into a classification problem.
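A minimal sketch of this discretization step in NumPy; the helper names `bin_midpoints` and `nearest_bin` are our own:

```python
import numpy as np

def bin_midpoints(y_train, K=50):
    """K equally spaced midpoints covering [y_min, y_max] of the training labels."""
    return np.linspace(y_train.min(), y_train.max(), K)

def nearest_bin(y, midpoints):
    """Index of the bin midpoint closest to each label y (the class label)."""
    return np.abs(np.asarray(y)[:, None] - midpoints[None, :]).argmin(axis=1)

# Example with K = 5 midpoints over labels spanning [0, 1].
mids = bin_midpoints(np.array([0.0, 0.3, 1.0]), K=5)  # [0., 0.25, 0.5, 0.75, 1.]
print(nearest_bin([0.02, 0.48, 0.97], mids))           # [0 2 4]
```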
This method is simple but has yielded surprising results. Some work has suggested that this form of binning results in more stable training (Stewart et al., 2023) and gives significantly better results for learning conditional expectations. To unify classification and regression conformal prediction, a simple solution is to employ the classification conformal prediction model with discrete labels $\bar{y}_i = \arg\min_{\hat{y} \in \hat{\mathcal{Y}}} |y_i - \hat{y}|$. We then train the neural network on these modified labels with the cross-entropy loss, obtaining a discrete distribution $q_{\theta}(\cdot \mid x)$ as outlined in the previous section. To compute conformity scores for all labels, we linearly interpolate the discrete probability function $q_{\theta}(\cdot \mid x)$ to obtain a continuous function over the whole label space. Nevertheless, this approach encounters a critical issue when employed for regression.

### 3.3 DATA FITTING

A critical problem with employing the cross-entropy loss in the classification conformal prediction context is that any structural relationship between classes is disregarded. This is not surprising, given that in the classification context no structure exists between classes, and each class is independent. Therefore, the cross-entropy loss does not differentiate between whether $q_{\theta}$ allocates probability mass far away from or close to the actual label; it only incentivizes allocating probability mass to the correct label. However, in the regression setting, despite the multitude of labels, the labels adhere to an ordinal structure. Thus, to enhance accuracy, it is imperative to devise a loss function that incentivizes allocating probability mass not only to the correct bin but also to neighboring bins. Formally, given an input-output pair $(x, y)$, our goal is to determine a density estimate $q_{\theta}(\cdot \mid x)$ that assigns low (resp. high) probability to points that are far from (resp. close to) the true label $y$, i.e., $q_{\theta}(\hat{y} \mid x)$ should be high (resp. small) when the loss $\ell(\hat{y}, y)$ is small (resp. high). Hence, a natural desideratum for learning the probability density function $q_{\theta}$ is that the product $\ell(y, \hat{y})\, q(\hat{y} \mid x)$ is small in expectation. We propose to find a distribution $q_{\theta}$ minimizing the loss

$$\mathbb{E}_{\hat{y} \sim q(\cdot \mid x)}[\ell(y, \hat{y})] = \sum_{k=1}^{K} \ell(y, \hat{y}_k)\, q(\hat{y}_k \mid x).$$

Minimizing this loss in the space of all possible distributions $q(\cdot \mid x)$ is equivalent to minimizing the original loss $\ell(y, \cdot)$, where the minimizing distribution is a Dirac delta $\delta_{\hat{y}^\star}$ at the minimizer $\hat{y}^\star$ of $\ell(y, \cdot)$. Therefore, we expect an unregularized version of this loss to share similarities with typical empirical risk minimization on $\ell(y, \cdot)$. However, a key difference is that when minimizing in a restricted family of distributions (for example, those representable by a neural network with a fixed architecture), the distributional output can represent multimodal or heavy-tailed label distributions. Minimizing the original loss $\ell(y, \cdot)$, one would always be confined to a point estimate.
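As an illustration, the unregularized expected loss can be written in a few lines of PyTorch; this is a sketch under our own naming, not the reference implementation:

```python
import torch

def expected_distance_loss(logits, y, midpoints, p=0.5):
    """Unregularized objective: E_{yhat ~ q_theta(.|x)} |y - yhat|^p,
    averaged over a batch. logits: (B, K); y: (B,); midpoints: (K,)."""
    q = torch.softmax(logits, dim=-1)                     # q_theta(yhat_k | x_i)
    dist = (y[:, None] - midpoints[None, :]).abs() ** p   # ell(y_i, yhat_k)
    return (q * dist).sum(dim=-1).mean()
```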
### 3.4 ENTROPY REGULARIZATION

Although the proposed loss function better encodes the connection between bins, it tends towards outputting Dirac distributions, as outlined in the previous paragraph. Overconfidence in neural networks is a commonly reported problem in the literature (Wei et al., 2022), and smoothness has been a traditional requirement in density learning. We therefore rely on a classical entropy-regularization technique for learning density estimators (Wainwright & Jordan, 2008). Among the density estimators that match the training label distribution well and put high probability mass on the best bins, we prefer the one that maximizes the entropy, since this intuitively makes fewer assumptions on the structure of the data distribution. Formally, we penalize with the Shannon entropy $H$ of the produced probability distribution $q(\cdot \mid x)$:

$$H(q_{\theta}(\cdot \mid x)) = -\sum_{k=1}^{K} q_{\theta}(\hat{y}_k \mid x) \log q_{\theta}(\hat{y}_k \mid x).$$

**Summary.** We learn a distribution by minimizing the following expected loss over a given training dataset $D_{\text{tr}} = \{(x_1, y_1), \ldots, (x_{n_{\text{tr}}}, y_{n_{\text{tr}}})\}$ of size $n_{\text{tr}}$:

$$\sum_{i=1}^{n_{\text{tr}}} \left[ \sum_{k=1}^{K} \ell(y_i, \hat{y}_k)\, q_{\theta}(\hat{y}_k \mid x_i) - \tau H\big(q_{\theta}(\cdot \mid x_i)\big) \right]. \tag{2}$$

In particular, we choose $\ell(y_i, \hat{y}_k) = |y_i - \hat{y}_k|^p$, where $p > 0$ is a hyperparameter. This choice functions as a natural distance metric that meets all required objectives. As a result, we can employ a technique similar to the aforementioned classification conformal prediction methods. Initially, we train a softmax neural network $f_{\theta}$ with $K$ logits based on the loss function in Equation (2) on the training dataset. To calculate conformity scores for the calibration set, we use the linearly interpolated $\sigma(x, y) = q_{\theta}(y \mid x)$ for a specific data point $(x, y)$. Namely, for any $y$ between $\hat{y}_k$ and $\hat{y}_{k+1}$, $q_{\theta}(y \mid x)$ is defined as

$$q_{\theta}(y \mid x) = \gamma_k\, q_{\theta}(\hat{y}_k \mid x) + (1 - \gamma_k)\, q_{\theta}(\hat{y}_{k+1} \mid x), \quad \text{where } \gamma_k = \frac{\hat{y}_{k+1} - y}{\hat{y}_{k+1} - \hat{y}_k}.$$

Complete details of this procedure are outlined in Algorithm 1.

**Algorithm 1** Regression-to-Classification Conformal Prediction (R2CCP)

1. Input: dataset $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, new input $x_{n+1}$, and desired confidence level $\alpha \in (0, 1)$
2. Hyperparameters: temperature $\tau > 0$, exponent $p > 0$, number of bins $K > 1$
3. Discretize the output space $[y_{\min}, y_{\max}]$ into $K$ equidistant bins with midpoints $\{\hat{y}_1, \ldots, \hat{y}_K\}$
4. Randomly split the dataset $D_n$ into training $D_{\text{tr}}$ and calibration $D_{\text{cal}}$
5. Find a distribution $q_{\hat{\theta}}(\cdot \mid x)$ by (approximately) optimizing on the training set $D_{\text{tr}}$:
$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \sum_{(x_i, y_i) \in D_{\text{tr}}} \left[ \sum_{k=1}^{K} |y_i - \hat{y}_k|^p\, q_{\theta}(\hat{y}_k \mid x_i) - \tau H\big(q_{\theta}(\cdot \mid x_i)\big) \right],$$
where $q_{\theta}(\hat{y}_k \mid x) = \mathrm{softmax}(f_{\theta}(x))_k$ for a model (e.g., a neural net) $f_{\theta} : \mathbb{R}^d \to \mathbb{R}^K$
6. $S \leftarrow \{q_{\hat{\theta}}(y \mid x) : (x, y) \in D_{\text{cal}}\}$, where $q_{\hat{\theta}}(\cdot \mid x)$ is the linear interpolation of the softmax probabilities
7. $Q_{\alpha}(D_{\text{cal}}) \leftarrow \mathrm{quantile}(S, \alpha)$
8. Return the conformal set $\Gamma^{(\alpha)}(x_{n+1}) = \{z \in \mathbb{R} : q_{\hat{\theta}}(z \mid x_{n+1}) \geq Q_{\alpha}(D_{\text{cal}})\}$
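A hedged PyTorch sketch of steps 5 and 6 of Algorithm 1, i.e., the loss of Equation (2) and the linear interpolation used as the conformity score; the function names and numerical details (such as the `1e-12` stabilizer and the boundary clamping) are our own choices, not confirmed by the paper:

```python
import torch

def r2ccp_loss(logits, y, midpoints, p=0.5, tau=0.2):
    """Training objective of Eq. (2): expected |y - yhat_k|^p under q_theta
    minus tau times its Shannon entropy, averaged over the batch."""
    q = torch.softmax(logits, dim=-1)
    dist = (y[:, None] - midpoints[None, :]).abs() ** p
    entropy = -(q * torch.log(q + 1e-12)).sum(dim=-1)     # H(q_theta(.|x_i))
    return ((q * dist).sum(dim=-1) - tau * entropy).mean()

def interpolated_density(q, y, midpoints):
    """Conformity score sigma(x, y): the discrete q_theta(.|x) linearly
    interpolated at a continuous label y. q: (K,); y: float; midpoints sorted."""
    k = int(torch.searchsorted(midpoints, torch.tensor([y]))) - 1
    k = max(0, min(k, len(midpoints) - 2))                # clamp to outer bins
    gamma = (midpoints[k + 1] - y) / (midpoints[k + 1] - midpoints[k])
    return gamma * q[k] + (1.0 - gamma) * q[k + 1]
```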
## 4 EXPERIMENTS

All code to run our method can be installed via `pip install r2ccp`. We investigate the empirical behavior of our R2CCP (Regression-to-Classification Conformal Prediction) method, which is detailed in Algorithm 1. We have three sets of experiments. The first, described in Section 4.1, presents empirical evidence of the algorithm's ability to produce narrower intervals by exploiting label density characteristics, including heteroscedasticity, bimodality, or a combination of the two. Section 4.2 demonstrates the effectiveness of our algorithm on synthetic and real data by comparing it with various benchmarks from the conformal prediction literature in terms of length and coverage. Section 4.3 evaluates the effect of different loss functions on the final learned densities and their impact on the intervals produced. All experiments were run over 5 different seeds at a coverage level of 90%, and the standard error over the experiments is reported in parentheses. We do not tune the hyperparameters and keep K = 50, p = 0.5, and τ = 0.2 constant across all experiments. For all experiments, we report length, meaning the average length of the predicted sets, and coverage, the percentage of instances where the true label is contained in the predicted interval.

### 4.1 SPECIFIC CHARACTERISTICS OF THE LABEL DISTRIBUTION

**Heteroscedasticity.** We generate a toy dataset with one-dimensional inputs, containing samples from the distribution $y \sim \mathcal{N}(0, |x|)$. The label distribution is heteroscedastic, meaning the variance of the labels changes as the input changes. In the traditional conformal prediction literature, many existing algorithms fail to capture heteroscedasticity, resulting in wide intervals for inputs $x$ where the label distribution of $y$ has low variance. However, our learned algorithm can directly learn this relation and adjust the outputted probability distribution accordingly. Thus, the lengths of the intervals increase as the variance of the label distribution increases, which is desirable; we see this relation in Figure 1a. Moreover, we also use the dataset generated from Lei & Wasserman (2014), as discussed in Appendix D. This dataset exhibits heteroskedasticity and bimodality as $X$ passes $-0.5$. We see that our learned method can adjust intervals accordingly to maintain coverage and length for all $X$; we plot this in Figure 1b, showing how the intervals (gray) change as the data distribution (black) changes. As the variance of the labels increases with $x$, the produced intervals adaptively get wider, taking advantage of the heteroscedasticity.

**Bimodality.** We showcase our algorithm's capability to handle labels with a bimodal distribution. Our bimodal dataset is generated by repeatedly (1) sampling two sets of random features that are geometrically close and (2) giving one set of features a label of $-1$ plus some Gaussian noise and the other a label of $1$ plus some Gaussian noise. Therefore, our dataset comprises many similar data points with bimodally distributed labels. This bimodal distribution is particularly hard for many existing CP algorithms, since achieving low length requires outputting two disjoint conformal sets. CQR, for instance, cannot deal with this circumstance and will generate a conformal set that covers the entire range space. However, our method is flexible enough to assign a low probability to labels between $-1$ and $1$, and our resulting conformal set will not include these intermediate labels. We see this in Figure 6d, where our outputted probability distribution has two modes around the labels $-1$ and $1$ and assigns a low probability to the valley between the modes.
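For reference, the two toy datasets can be generated along the following lines; this is a sketch, and the sample sizes and the bimodal noise level 0.1 are our assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heteroscedastic toy data: y ~ N(0, |x|), treating |x| as the variance,
# so the label spread grows with |x|.
x_het = rng.uniform(-1.5, 1.5, size=2000)
y_het = rng.normal(0.0, np.sqrt(np.abs(x_het)))

# Bimodal toy data: pairs of nearly identical features whose labels sit
# near -1 and +1 respectively, plus Gaussian noise.
x_bim = np.repeat(rng.uniform(-1.0, 1.0, size=1000), 2)
y_bim = np.tile([-1.0, 1.0], 1000) + 0.1 * rng.normal(size=2000)
```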
### 4.2 COMPARISON TO OTHER CONFORMAL PREDICTION ALGORITHMS

The crucial criteria for assessing a conformal prediction algorithm are (1) coverage, the percentage of generated conformal sets that contain the true label, and (2) the length of the generated conformal sets. Our baselines include the Kernel Density Estimator (KDE) proposed by Lei & Wasserman (2014), which uses the estimated probability density as the conformity score. Furthermore, we consider Fong & Holmes (2021) (CB), which employs the likelihood of a posterior distribution in its conformity function, and Sesia & Romano (2021) (CHR), which uses quantile regressors on every bin of a histogram density estimator. We include Conformal Quantile Regression (Romano et al., 2019) (CQR), whose conformity is based on the label's distance from quantile regressors, and Distributional Conformal Prediction (Chernozhukov et al., 2021) (DCP), which uses a cumulative distribution function to form its intervals. Additionally, we use a Lasso conformal predictor with a distance-to-mean conformity score, the most straightforward option (LASSO). We use synthetic data with bimodally and log-normally distributed labels to illustrate particular weaknesses of existing methods, as well as the real datasets used in Romano et al. (2019): several datasets from the UCI Machine Learning Repository (Bio, Blog, Concrete, Community, Energy, Forest, Stock, Cancer, Solar, Parkinsons, Pendulum) (Nottingham et al., 2023) and the Medical Expenditure Panel Survey panels 19-21 (MEPS-19 to MEPS-21) (Cohen et al., 2009). These regression datasets are commonly used to benchmark regression models. Our approach yields tighter intervals on real datasets than some of the strongest baselines.

**Results.** We report length results in Table 1 and coverage results in Table 3 (Appendix E). We also include figures depicting example probability distributions on these datasets in Figure 6 in Appendix H. The intervals produced by our method are the shortest on 10 of the 16 datasets. From Figure 6, we see that our method can learn many differently shaped distributions well, which accounts for this significant improvement in interval length. Overall, the Kernel Density Estimator and our method accurately predict the best intervals on datasets where the connection between the data and the labels is simple, such as Bimodal or Log-Normal. While the Kernel Density Estimator can fail when the labels are complexly related to the inputs, no such connection exists in these datasets since the labels are independent of the features; the Kernel Density Method's simplicity thus allows it to learn the label density quickly. Our method also handles the case where there is no connection between the data and the labels, as seen in Figure 6d in Appendix H. Moreover, on datasets where the label density is smooth and close to a Laplace prior, as in Figure 6c and Figure 6i, Conformal Bayes and our method can accurately learn the distribution, since both methods can output smooth distributions. On very sharp datasets such as Concrete (Figure 6j) and Energy (Figure 6h), both our method and CQR can capture the sharp, low-noise distribution needed to achieve strong length. On complex distributions such as Pendulum (Figure 6e) and Bio (Figure 6f), both CHR and our method have the flexibility to portray complex distributions, resulting in the most accurate intervals. Thus, the flexibility of our algorithm to smoothly learn sharp, wide, complex, and simple conditional label densities results in our method achieving the best length most consistently over all datasets.

| Method | BIMODAL | LOGNORM | CONCRETE | MEPS-19 | MEPS-20 | MEPS-21 | BIO | COMMUNITY |
|---|---|---|---|---|---|---|---|---|
| CQR | 2.14 (0.01) | 1.58 (0.03) | 0.39 (0.01) | 1.87 (0.05) | 2.00 (0.07) | 1.99 (0.03) | 1.34 (0.01) | 1.44 (0.03) |
| KDE | 0.35 (0.01) | 1.40 (0.04) | 1.54 (0.03) | 2.16 (0.02) | 2.51 (0.05) | 2.39 (0.03) | 2.27 (0.00) | 2.23 (0.09) |
| LASSO | 2.14 (0.00) | 3.30 (0.06) | 2.74 (0.03) | 4.64 (0.06) | 4.79 (0.04) | 4.92 (0.04) | 3.89 (0.00) | 3.26 (0.05) |
| CB | 2.16 (0.00) | 1.45 (0.01) | 0.96 (0.01) | 4.47 (0.02) | 4.50 (0.02) | 4.51 (0.01) | 2.09 (0.00) | 1.80 (0.00) |
| CHR | 2.14 (0.00) | 1.52 (0.05) | 0.47 (0.02) | 2.60 (0.08) | 2.53 (0.03) | 2.75 (0.03) | 1.59 (0.01) | 1.49 (0.05) |
| DCP | 2.14 (0.01) | 1.74 (0.07) | 0.47 (0.01) | 68.64 (0.15) | 66.71 (0.40) | 67.56 (0.33) | 1.74 (0.01) | 1.59 (0.03) |
| R2CCP (OURS) | 0.46 (0.01) | 1.96 (0.03) | 0.38 (0.01) | 1.60 (0.01) | 1.70 (0.03) | 1.72 (0.03) | 1.11 (0.01) | 1.47 (0.03) |

| Method | DIABETES | SOLAR | PARKINSONS | STOCK | CANCER | PENDULUM | ENERGY | FOREST |
|---|---|---|---|---|---|---|---|---|
| CQR | 1.30 (0.05) | 1.98 (0.22) | 0.42 (0.01) | 1.85 (0.23) | 3.09 (0.13) | 2.25 (0.31) | 0.19 (0.01) | 3.18 (0.19) |
| KDE | 1.34 (0.05) | 0.50 (0.01) | 3.79 (0.02) | 4.72 (0.29) | 3.82 (0.09) | 3.96 (0.09) | 2.72 (0.07) | 2.90 (0.17) |
| LASSO | 3.01 (0.04) | 3.54 (0.12) | 3.46 (0.03) | 1.39 (0.04) | 3.55 (0.14) | 3.99 (0.07) | 1.29 (0.03) | 3.97 (0.29) |
| CB | 1.19 (0.01) | 3.78 (0.02) | 3.42 (0.01) | 1.32 (0.01) | 3.14 (0.04) | 3.71 (0.03) | 1.26 (0.01) | 3.75 (0.03) |
| CHR | 1.40 (0.02) | 1.49 (0.23) | 0.68 (0.02) | 1.59 (0.07) | 3.42 (0.12) | 1.69 (0.11) | 0.23 (0.01) | 3.03 (0.15) |
| DCP | 1.29 (0.02) | 15.69 (0.06) | 0.83 (0.04) | 1.69 (0.10) | 3.57 (0.07) | 1.76 (0.10) | 0.23 (0.01) | 6.00 (0.02) |
| R2CCP (OURS) | 1.34 (0.02) | 3.80 (2.61) | 0.50 (0.00) | 0.92 (0.02) | 3.21 (0.08) | 1.60 (0.07) | 0.20 (0.02) | 3.80 (0.26) |

Table 1: Length results over all datasets. Our method achieves the best length on 10 of the 16 datasets; meanwhile, CQR is best on 5, CHR on 3, CB on 1, and KDE on 3 (counts include ties). Our method achieves the shortest intervals most consistently across these datasets.
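The two evaluation criteria can be computed as follows for conformal sets represented as unions of sub-intervals; this sketch (with our own helper name) sums the lengths of disjoint pieces, so bimodal sets are scored by their total length:

```python
import numpy as np

def coverage_and_length(conformal_sets, y_test):
    """Empirical coverage and mean total length for conformal sets given as
    lists of (lo, hi) sub-intervals per test point."""
    covered = [any(lo <= y <= hi for lo, hi in segs)
               for segs, y in zip(conformal_sets, y_test)]
    lengths = [sum(hi - lo for lo, hi in segs) for segs in conformal_sets]
    return float(np.mean(covered)), float(np.mean(lengths))
```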
### 4.3 ABLATION STUDIES

Our loss function consists of an error term and an entropy term. The error term penalizes distributions that put weight far away from the true label, whereas the entropy term acts as a regularizer in the space of probability distributions. We perform ablations on both terms to illustrate the importance of each part. For the error term, there are several notable alternatives that a practitioner may use. One alternative is the log maximum-likelihood, or cross-entropy, formulation of the error term, which we denote by

$$\sum_{i=1}^{n_{\text{tr}}} -\log\big(q_{\theta}(\bar{y}_i \mid x_i)\big),$$

where we remind the reader that $\bar{y}_i = \arg\min_{\hat{y} \in \hat{\mathcal{Y}}} |\hat{y} - y_i|$. We note that the MLE and cross-entropy (CE) formulations are equivalent in this setting; these are standard error terms often used in practice. We train our models with this loss function over all datasets to see how the change of error term affects interval length and the learned density, and we show that our chosen error term is empirically better for producing optimal intervals.
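For concreteness, the four ablated objectives can be written side by side in PyTorch; this is our own sketch of the variants named below, not the paper's code:

```python
import torch

def ablation_losses(logits, y_bin, dist, tau=0.2):
    """The four ablated objectives. y_bin: (B,) nearest-bin indices of y;
    dist: (B, K) matrix of |y_i - yhat_k|^p values."""
    q = torch.softmax(logits, dim=-1)
    H = -(q * torch.log(q + 1e-12)).sum(-1)                      # Shannon entropy
    nll = -torch.log(q.gather(1, y_bin[:, None]).squeeze(1) + 1e-12)
    L_NE    = (q * dist).sum(-1).mean()                          # ours, no entropy
    L       = L_NE - tau * H.mean()                              # ours, Eq. (2)
    L_MLE   = nll.mean()                                         # cross-entropy / MLE
    L_MLE_E = L_MLE - tau * H.mean()                             # MLE plus entropy
    return {"L": L, "L_NE": L_NE, "L_MLE": L_MLE, "L_MLE_E": L_MLE_E}
```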
Moreover, we also demonstrate the importance of our entropy term by retraining our models with the entropy part omitted, and we additionally test combining the MLE error term with entropy regularization. Therefore, we perform four ablation experiments on the loss function, retraining on the different variants and reporting the final length and coverage. Overall, the four loss functions are the original loss function $L$, the original loss function without entropy $L_{\text{NE}}$, the MLE loss function $L_{\text{MLE}}$, and the MLE loss function with entropy added $L_{\text{MLE+E}}$. We demonstrate that our chosen loss function delivers the best results across all the loss functions. We report our results in Table 2.

| Loss | BIMODAL | LOGNORMAL | CONCRETE | MEPS-19 | MEPS-20 | MEPS-21 | BIO | COMMUNITY |
|---|---|---|---|---|---|---|---|---|
| L_NE | 0.43 (0.001) | 3.25 (0.058) | 1.89 (0.037) | 3.73 (0.057) | 3.79 (0.061) | 8.78 (0.163) | 2.01 (0.006) | 3.64 (0.054) |
| L_MLE | 0.35 (0.001) | 1.76 (0.020) | 0.74 (0.016) | 1.58 (0.008) | 1.59 (0.009) | 1.71 (0.034) | 2.71 (0.001) | 2.32 (0.047) |
| L_MLE+E | 0.36 (0.001) | 3.52 (0.008) | 1.91 (0.017) | 1.60 (0.008) | 1.63 (0.027) | 1.70 (0.008) | 2.32 (0.003) | 3.77 (0.013) |
| L | 0.44 (0.002) | 1.82 (0.034) | 0.37 (0.004) | 1.60 (0.009) | 1.59 (0.011) | 1.69 (0.013) | 1.10 (0.004) | 1.50 (0.025) |

| Loss | DIABETES | SOLAR | PARKINSONS | STOCK | CANCER | PENDULUM | ENERGY | FOREST |
|---|---|---|---|---|---|---|---|---|
| L_NE | 1.92 (0.052) | 2.20 (0.233) | 4.76 (0.033) | 10.17 (0.165) | 3.26 (0.207) | 12.82 (0.479) | 3.30 (0.093) | 3.33 (0.254) |
| L_MLE | 1.56 (0.031) | 0.14 (0.005) | 0.34 (0.006) | 4.48 (0.214) | 3.26 (0.117) | 3.00 (0.203) | 0.25 (0.011) | 3.04 (0.110) |
| L_MLE+E | 1.96 (0.012) | 0.15 (0.005) | 0.23 (0.001) | 9.73 (0.074) | 3.95 (0.022) | 13.40 (0.221) | 0.25 (0.009) | 5.24 (0.055) |
| L | 1.37 (0.037) | 0.66 (0.092) | 0.46 (0.036) | 1.95 (0.039) | 3.04 (0.142) | 1.62 (0.042) | 0.21 (0.025) | 2.92 (0.127) |

Table 2: Length results for all the variant loss functions. Our loss function delivers the best length on 12 datasets, demonstrating that it often generates the best intervals. For datasets where it does not deliver the best results, tuning the weight on the entropy τ and the smoothing term p would likely have improved the results, but we do not do this for the sake of fair evaluation.

Figure 2: The resulting density estimates with different loss functions: (a) original, (b) no entropy, (c) MLE, and (d) MLE with entropy. Removing entropy from our loss function or using MLE as the error term causes sharp density estimates. Moreover, adding entropy regularization to MLE does not smooth the density estimate but instead raises the entire distribution uniformly, which does not provide valuable information for CP.

**Results.** We see in Table 2 that our chosen loss function achieves the best length on most datasets. The only datasets where it does not are Energy, Solar, Bimodal, and Log-Normal. We now discuss the individual differences between our loss function and each variant. When comparing our loss $L$ with the variant without entropy $L_{\text{NE}}$, the lengths consistently increase except on the bimodal distribution. Without an entropy-regularizing term, the outputted probability distributions are not smooth, placing much mass on a single bin. This overconfidence leaves the true label with low probability relative to the α-th quantile of the calibration scores, as illustrated in Figure 2. When changing the error term to maximum log-likelihood, we see similar overconfidence. While the MLE loss works on several datasets, such as Energy, Solar, Bimodal, and Log-Normal, it performs poorly on the others: MLE works well for datasets with simple label distributions but fails otherwise due to similar overconfidence. Even when adding entropy as a regularizer to the MLE loss, as in $L_{\text{MLE+E}}$, the addition of entropy does not improve the length. Since the MLE loss does not treat nearby bins differently from faraway bins, the regularization decreases the overconfidence, but does so uniformly across all bins. The result is a roughly uniform distribution with a single spike.
This does not utilize the structure of regression, where bins near the best bin are preferable; prioritizing nearby bins is thus crucial for achieving strong length. Moreover, when adding entropy regularization to our error term, we do not see a uniform increase in probability across all bins, but instead a smoothing of the Dirac-like spike. Judging by the intervals produced, the combination of entropy regularization and our distance-based loss function is a powerful synergy. Thus, the error term utilizing the regression structure by prioritizing nearby bins and the regularizing entropy term preventing overconfidence both seem crucial for producing good intervals. Another important insight is that the datasets where other loss functions excel are the same ones where other CP methods perform better than ours. In particular, on the Bimodal, Log-Normal, Parkinson's, and Solar datasets, our CP method is not superior, and other loss functions outperform ours. This indicates that the MLE's sharp distributions result in greater accuracy on these specific datasets. Hence, adjusting the weighting terms for the entropy τ and the smoothing p could enhance the smoothness of the learned distribution and thereby the performance of our algorithm on these datasets, leading to the best outcomes. Nevertheless, to maintain fairness, we avoid this approach.

## REFERENCES

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118, 2021.

Joel W. Cohen, Steven B. Cohen, and Jessica S. Banthin. The Medical Expenditure Panel Survey: a national information resource to support healthcare cost research and inform policy and practice. Medical Care, pp. 44-50, 2009.

Isidro Cortés-Ciriano and Andreas Bender. Concepts and applications of conformal prediction in computational drug discovery. Artificial Intelligence in Drug Discovery, pp. 63-101, 2020.

Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Edwin Fong and Chris C. Holmes. Conformal Bayesian computation. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Subhankar Ghosh, Taha Belkhouja, Yan Yan, and Janardhan Rao Doppa. Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. arXiv preprint arXiv:2303.10694, 2023.

Etash Kumar Guha, Eugene Ndiaye, and Xiaoming Huo. Conformalization of sparse generalized linear models. In International Conference on Machine Learning (ICML), 2023. URL https://proceedings.mlr.press/v202/guha23b.html.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 25, 2012.

Antonis Lambrou, Harris Papadopoulos, and Alex Gammerman. Evolutionary conformal prediction for breast cancer diagnosis. In 2009 9th International Conference on Information Technology and Applications in Biomedicine, pp. 1-4. IEEE, 2009.
Antonis Lambrou, Harris Papadopoulos, Efthyvoulos Kyriacou, Constantinos S. Pattichis, Marios S. Pattichis, Alexander Gammerman, and Andrew Nicolaides. Assessment of stroke risk based on morphological ultrasound image analysis with conformal prediction. In Artificial Intelligence Applications and Innovations: 6th IFIP WG 12.5 International Conference, AIAI 2010, Larnaca, Cyprus, October 6-7, 2010, Proceedings 6, pp. 146-153. Springer, 2010.

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71-96, 2014.

Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278-287, 2013.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094-1111, 2018.

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Locally valid and discriminative prediction intervals for deep learning models. Advances in Neural Information Processing Systems, 34:8378-8391, 2021.

Charles Lu, Anastasios N. Angelopoulos, and Stuart Pomerantz. Improving trustworthiness of AI disease severity rating in medical imaging with ordinal conformal prediction sets. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 545-554. Springer, 2022.

Eugene Ndiaye. Stable conformal prediction sets. In International Conference on Machine Learning, pp. 16462-16479. PMLR, 2022.

Eugene Ndiaye and Ichiro Takeuchi. Root-finding approaches for computing conformal prediction sets. Machine Learning, 112(1):151-176, 2023.

Kolby Nottingham, Rachel Longjohn, and Markelle Kelly. UCI Machine Learning Repository, 2023. URL https://archive.ics.uci.edu/datasets. Accessed: September 2023.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning, Helsinki, Finland, August 19-23, 2002, Proceedings 13, pp. 345-356. Springer, 2002.

Y. Romano, E. Patterson, and E. J. Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Rasmus Rothe, Radu Timofte, and Luc Van Gool. DEX: Deep expectation of apparent age from a single image. In IEEE International Conference on Computer Vision Workshops (ICCVW), 2015.

Jonas Rothfuss, Fabio Ferreira, Simon Walther, and Maxim Ulrich. Conditional density estimation with neural networks: Best practices and benchmarks. arXiv preprint arXiv:1903.00954, 2019.

Matteo Sesia and Yaniv Romano. Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Lawrence Stewart, Francis Bach, Quentin Berthet, and Jean-Philippe Vert. Regression as classification: Influence of task formulation on neural network features. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.

Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning (ICML), 2016.

V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
Hongxin Wei, Renchunzi Xie, Hao Cheng, Lei Feng, Bo An, and Yixuan Li. Mitigating neural network overconfidence with logit normalization. In International Conference on Machine Learning, pp. 23631-23644. PMLR, 2022.

Andreas S. Weigend and Ashok N. Srivastava. Predicting conditional probability distributions: A connectionist approach. International Journal of Neural Systems, 6(02):109-118, 1995.

Yunpeng Xu, Wenge Guo, and Zhi Wei. Conformal risk control for ordinal classification. In Uncertainty in Artificial Intelligence, pp. 2346-2355. PMLR, 2023.

Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.

## A AUTHOR CONTRIBUTION STATEMENT

Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Eugene Ndiaye, and Mohammad Emtiyaz Khan were all responsible for the development of the main idea and the writing of the paper. Etash Guha and Shlok Natarajan were responsible for the experiments and the implementation of the algorithm. Etash Guha, Thomas Möllenhoff, and Eugene Ndiaye were responsible for the design of the loss function, the binning, and the entropic regularization. Etash Guha and Eugene Ndiaye were responsible for writing the related works.

## B LIMITATIONS

There are several limitations to our algorithm. The loss function we optimize requires larger representational power due to the increased complexity of the output. Therefore, the neural networks we used in practice to minimize this loss function are larger than the ones used for CQR. Moreover, to achieve strong training, we had to slightly vary the hyperparameters per dataset depending on its size. For example, larger datasets needed models trained with a smaller initial learning rate to avoid divergence; this was not needed for quantile regressors. If the network does not train well or achieve good training or validation loss, the intervals produced will be suboptimal, so finding a training setup that minimizes the loss effectively is crucial for our algorithm to produce meaningful intervals. Moreover, we do not directly learn the label distribution, and our produced probability distributions do not directly mirror the ground-truth distribution. One reason for this is the use of entropy regularization: while it introduces bias, we found it necessary in practice to prevent the neural network from being overconfident on a single bin.

## C MORE DETAILS ON CONFORMAL PREDICTION AND VARIANTS

Given an input observation $x$ without its label $y$, the Conformal Prediction method of Vovk et al. (2005) allows us to select a set of values that the unobserved response $y$ is likely to take. For an arbitrary value $z$, the rationale is to score how typical/conformal the example $(x_{n+1}, z)$ is given a sequence of i.i.d. data $D_n = \{(x_i, y_i)\}_{i \in [n]}$. Intuitively, the scoring function, which we denote $\sigma$, measures how different the new example is compared to examples that we have already observed. We compute the score function $\sigma(x, y)$ for every example in the augmented dataset $D_{n+1}(z) = D_n \cup \{(x_{n+1}, z)\}$ and report the fraction of previous examples that score at least as well as the new one. This defines the so-called conformity function $\pi(\cdot) \in [0, 1]$ as

$$\pi(z) = 1 - \frac{\mathrm{Rank}(z)}{n + 1}, \quad \text{where } \mathrm{Rank}(z) = \sum_{(x, y) \in D_{n+1}(z)} \mathbb{I}\{\sigma(x, y) \geq \sigma(x_{n+1}, z)\}. \tag{3}$$

If $\pi(z)$ is small (resp. large), then $z$ is non-conformal (resp. conformal). Therefore, for a threshold level $\alpha \in (0, 1)$, a simple way to find our conformal set $\Gamma^{(\alpha)}(x_{n+1})$ for the response $y_{n+1}$ of the input $x_{n+1}$ is to select all the values $z$ whose conformity exceeds the threshold, i.e.,

$$\Gamma^{(\alpha)}(x_{n+1}) = \{z \in \mathbb{R} : \pi(z) \geq \alpha\}.$$
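A small NumPy sketch of this full-conformal construction over a grid of candidate values; `score_fn` and the grid are illustrative, the rank convention follows Equation (3) as reconstructed above, and in practice the refitting per candidate is what motivates the split-based variants discussed below:

```python
import numpy as np

def pi_value(aug_scores, new_score):
    """pi(z) = 1 - Rank(z) / (n + 1), where Rank counts augmented examples
    scoring at least as high as the candidate (ties included)."""
    rank = int(np.sum(aug_scores >= new_score))
    return 1.0 - rank / len(aug_scores)

def full_conformal_set(score_fn, D_n, x_new, z_grid, alpha=0.1):
    """Collect candidate values z with pi(z) >= alpha. score_fn(D, x, y) may
    refit on the augmented data D_{n+1}(z) for every candidate, which is
    computationally expensive."""
    accepted = []
    for z in z_grid:
        aug = D_n + [(x_new, z)]
        scores = np.array([score_fn(aug, x, y) for (x, y) in aug])
        if pi_value(scores, score_fn(aug, x_new, z)) >= alpha:
            accepted.append(z)
    return accepted
```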
The validity of the method relies on the fact that if the order in which the observations arrive is irrelevant, and the score function $\sigma$ is invariant w.r.t. permutations of the data, then all orderings of the scores $\{\sigma(x_1, y_1), \ldots, \sigma(x_n, y_n), \sigma(x_{n+1}, y_{n+1})\}$ are equally likely, and thus $\mathrm{Rank}(y_{n+1})$ is uniformly distributed on $\{1, \ldots, n+1\}$. As such, for any confidence level $\alpha \in (0, 1)$,

$$\mathbb{P}\big(y_{n+1} \in \Gamma^{(\alpha)}(x_{n+1})\big) = \mathbb{P}(\pi(y_{n+1}) \geq \alpha) = \mathbb{P}\big(\mathrm{Rank}(y_{n+1}) \leq (n+1)(1 - \alpha)\big) \geq 1 - \alpha.$$

**Conditional validity.** The previous coverage guarantee may not always be ideal because it is not conditional on the particular input $x_{n+1}$. Unfortunately, it is not possible to have such conditional validity when the distribution of the inputs is continuous. Nevertheless, it can be approximated using localization strategies based on a weighted conformity function; see Lin et al. (2021) and Ghosh et al. (2023). This also allows a further reduction of the size of the conformal prediction sets. Validity can be conditional on a discrete attribute of the input $x_{n+1}$ through the use of Mondrian Conformal Prediction, which involves a slight modification of the conformity function. Let a taxonomy $\kappa$ be a function that partitions the data into a finite number of categories, e.g., $\kappa(x, \cdot)$ is a classification of the example $x$. This allows us to modify the $\pi$-value function in order to achieve validity conditional on the category of the test example:

$$\pi(z) = 1 - \frac{\sum_{(x, y) \in D_{n+1}(z)} \mathbb{I}\{\sigma(x, y) \geq \sigma(x_{n+1}, z),\; \kappa(x, y) = \kappa(x_{n+1}, z)\}}{\sum_{(x, y) \in D_{n+1}(z)} \mathbb{I}\{\kappa(x, y) = \kappa(x_{n+1}, z)\}}.$$

**Computational issues.** As mentioned in the introduction, the selection of the score function affects the size and shape of the prediction sets. There are also computational issues that arise when the full dataset is used to fit the score function. For example, consider using the estimated distance to the conditional mean, $\sigma(x, y) = |y - \hat{\mathbb{E}}[y \mid x]|$. The estimator $\hat{\mathbb{E}}[\cdot \mid x]$ must be learned on the extended dataset $D_{n+1}(z)$ for all possible values $z$, which is computationally infeasible for most methods without stronger assumptions; see Ndiaye & Takeuchi (2023) and Ndiaye (2022). As such, most of the popular strategies are based on fitting the prediction model on a left-out independent set of data, as we did in this paper.

## D DETAILS ON THE DATASET FROM LEI & WASSERMAN (2014)

For the dataset referenced in Figure 1b, we sample many $(X, Y)$ pairs from the following distribution:

$$X \sim \mathrm{Unif}[-1.5, 1.5], \qquad Y \sim \tfrac{1}{2}\,\mathcal{N}\big(f(X) - g(X),\, \sigma^2(X)\big) + \tfrac{1}{2}\,\mathcal{N}\big(f(X) + g(X),\, \sigma^2(X)\big),$$

$$f(X) = (X - 1)^2 (X + 1), \qquad g(X) = 4\sqrt{X + 1/2}\;\mathbb{I}(X \geq -1/2), \qquad \sigma^2(X) = 1/4 + |X|.$$

This dataset exhibits bimodality once $X$ passes the threshold of $-0.5$. The same dataset was used by Lei & Wasserman (2014).

## E COVERAGE DATA

We present the coverage data for the conformal prediction comparison and for the ablation experiments in Table 3 and Table 4, respectively. Since all methods obey the classical conformal prediction framework, we expect coverage at the guaranteed level of $1 - \alpha$. This is indeed what we see across all experiments, confirming that our coverage guarantee holds in practice.
| Method | BIMODAL | LOGNORM | CONCRETE | MEPS-19 | MEPS-20 | MEPS-21 | BIO | COMMUNITY |
|---|---|---|---|---|---|---|---|---|
| CQR | 0.90 (0.00) | 0.91 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| KDE | 0.93 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 0.92 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| LASSO | 0.90 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| CB | 0.90 (0.00) | 0.91 (0.01) | 0.88 (0.01) | 0.90 (0.00) | 0.90 (0.00) | 0.89 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| CHR | 0.90 (0.00) | 0.91 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| DCP | 0.90 (0.00) | 0.91 (0.00) | 0.91 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.90 (0.00) |
| R2CCP (OURS) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) |

| Method | DIABETES | SOLAR | PARKINSONS | STOCK | CANCER | PENDULUM | ENERGY | FOREST |
|---|---|---|---|---|---|---|---|---|
| CQR | 0.91 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 0.92 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.91 (0.00) |
| KDE | 0.91 (0.00) | 0.71 (0.15) | 0.89 (0.00) | 0.90 (0.00) | 0.92 (0.01) | 0.90 (0.01) | 0.91 (0.00) | 0.81 (0.08) |
| LASSO | 0.91 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 0.92 (0.00) | 0.90 (0.00) | 0.90 (0.00) | 0.90 (0.00) |
| CB | 0.89 (0.00) | 0.92 (0.01) | 0.90 (0.00) | 0.88 (0.01) | 0.87 (0.02) | 0.88 (0.01) | 0.89 (0.01) | 0.90 (0.01) |
| CHR | 0.91 (0.00) | 0.91 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 0.92 (0.00) | 0.91 (0.00) | 0.91 (0.00) | 0.91 (0.00) |
| DCP | 0.91 (0.00) | 1.00 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 0.92 (0.00) | 0.90 (0.00) | 0.91 (0.00) | 1.00 (0.00) |
| R2CCP (OURS) | 0.90 (0.00) | 0.92 (0.02) | 0.96 (0.00) | 0.92 (0.02) | 0.90 (0.00) | 0.90 (0.00) | 0.91 (0.01) | 0.89 (0.00) |

Table 3: Coverage data over all datasets. We see that all methods achieve roughly the 1 − α coverage guaranteed by conformal prediction.

| Loss | BIMODAL | LOGNORM | CONCRETE | MEPS-19 | MEPS-20 | MEPS-21 | BIO | COMMUNITY |
|---|---|---|---|---|---|---|---|---|
| L | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.00) | 0.90 (0.01) |
| L_NE | 0.90 (0.02) | 0.81 (0.03) | 0.84 (0.03) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.01) | 0.68 (0.01) | 0.88 (0.02) |
| L_MLE | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.00) | 0.90 (0.01) |
| L_MLE+E | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.02) | 0.90 (0.01) | 0.90 (0.01) | 0.90 (0.01) | 0.92 (0.00) | 0.90 (0.01) |

| Loss | DIABETES | SOLAR | PARKINSONS | STOCK | CANCER | PENDULUM | ENERGY | FOREST |
|---|---|---|---|---|---|---|---|---|
| L | 0.90 (0.03) | 0.90 (0.02) | 0.95 (0.01) | 0.90 (0.03) | 0.90 (0.05) | 0.90 (0.03) | 0.90 (0.03) | 0.89 (0.03) |
| L_NE | 0.94 (0.02) | 0.90 (0.02) | 0.73 (0.01) | 0.89 (0.03) | 0.90 (0.05) | 0.90 (0.03) | 0.90 (0.02) | 0.77 (0.04) |
| L_MLE | 0.90 (0.03) | 0.90 (0.02) | 0.90 (0.01) | 0.91 (0.03) | 0.90 (0.05) | 0.90 (0.03) | 0.90 (0.02) | 0.89 (0.03) |
| L_MLE+E | 0.90 (0.03) | 0.90 (0.02) | 0.90 (0.01) | 0.90 (0.03) | 0.90 (0.05) | 0.90 (0.03) | 0.90 (0.03) | 0.90 (0.03) |

Table 4: Coverage data for the different loss functions from the ablation.

## F MORE EXPERIMENTAL DETAILS

To maintain fairness across all baselines, we use the same size of neural network for our method across all experiments. Specifically, we discretize the range space into K = 50 points, weight the entropy term by τ = 0.2, use a hidden dimension of 1000, use 4 layers, use a weight decay of 1e-4, use p = 0.5, and use AdamW as the optimizer. For most of the experiments, we use a learning rate of 1e-4 and a batch size of 32. However, for certain datasets, namely the MEPS datasets, we used a larger batch size of 256 to improve training time and a smaller learning rate to prevent training divergence. We did not change any other parameters between runs. For the CQR and CHR baselines, we use the neural network configurations mentioned in their papers; we found that the parameterizations reported by the authors achieved the best performance, and changing them weakened their results.
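Putting these hyperparameters together, the model and optimizer could be instantiated as follows; this is a sketch, with the input dimension `d_in` and the ReLU activation as placeholders rather than details confirmed by the paper:

```python
import torch
import torch.nn as nn

K = 50      # number of bins
d_in = 8    # input dimension (dataset-specific; 8 is a placeholder)

# A 4-layer MLP with hidden dimension 1000 and K output logits, matching the
# hyperparameters reported above.
f_theta = nn.Sequential(
    nn.Linear(d_in, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, K),
)
optimizer = torch.optim.AdamW(f_theta.parameters(), lr=1e-4, weight_decay=1e-4)
```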
## G ADDITIONAL EXPERIMENTS

We add several additional experiments to help contextualize our results. First, we track how our method performs over different coverage levels relative to CHR and CQR across several datasets in Appendix G.2. Second, in Appendix G.3, we analyze what percentage of the time the predicted intervals are singletons, meaning there is only one continuous interval. Third, we analyze the correlation between the residual prediction error and the predicted interval's length in Appendix G.4.

### G.1 PERFORMANCE WITH DIFFERENT NUMBERS OF BINS

We set the number of bins at a constant K = 50. A natural question is how the number of bins affects the performance of our conformal prediction methodology. To test this, we retrain several models with varying numbers of bins on the Concrete dataset and report the size of the predicted intervals for each number of bins in Figure 3. We observe that as the number of bins increases, the average length of the predicted intervals decreases and stabilizes, suggesting that after a sufficient number of bins, the length of the intervals becomes relatively constant. In practice, classification problems with 1000 classes or more have been routinely solved since 2012 (Krizhevsky et al., 2012), which suggests that increasing the number of bins does not make the learning task more difficult and explains how performance remains relatively unaffected as the number of bins is increased.

Figure 3: An ablation of how the number of bins (20 to 100) affects the average length of the generated intervals.

### G.2 DIFFERENT COVERAGE LEVELS

While the coverage level of α = 0.1 is relatively standard across the conformal prediction literature, we want to see how our method compares to other methods across different coverage levels. We take the trained models from the original experiments and compute the intervals across different coverage levels for our method, CHR, and CQR. As expected, for all methods, the length of the predicted intervals increases as the required coverage level increases. We report our results in Figure 4.

### G.3 NUMBER OF SINGLETONS

We also analyze the percentage of times our predicted intervals are singletons, meaning only one interval is predicted. We present our results in Table 5. For a majority of the datasets, our model tends to predict singleton intervals. However, on both the Bimodal and Bio datasets, our method predicts far fewer singletons. This observation is intuitive for the Bimodal dataset, as two intervals are expected to account for the different modes.

### G.4 CORRELATION BETWEEN PREDICTION ERROR AND LENGTH OF PREDICTED INTERVALS

We want to see how the prediction error correlates with the predicted interval lengths. We define prediction error as the absolute residual between the predicted label and the true label. Given that our model is tailored to generating intervals, there are many ways to predict a single label from a given set of features. For example, we can select the midpoint of the bin with the largest probability predicted by our model, $\hat{y} = \hat{y}_{k^\star}$ where $k^\star = \arg\max_{k \in [K]} q_{\theta}(\hat{y}_k \mid x)$. We can also use the expected label according to our distribution as the predicted label, $\hat{y} = \sum_{k=1}^{K} \hat{y}_k\, q_{\theta}(\hat{y}_k \mid x)$. We choose the second method for these experiments and compute the absolute residual error as $|\hat{y} - y|$.
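Both point predictors can be read directly off the discrete distribution; a small PyTorch sketch with our own function name:

```python
import torch

def point_predictions(q, midpoints):
    """Two ways to read a point prediction off q_theta(.|x): the most
    probable bin midpoint, and the mean of the distribution (used here).
    q: (B, K) softmax probabilities; midpoints: (K,)."""
    mode = midpoints[q.argmax(dim=-1)]             # yhat_{k*}, k* = argmax_k q
    mean = (q * midpoints[None, :]).sum(dim=-1)    # sum_k yhat_k q(yhat_k|x)
    return mode, mean
```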
We plot our results in Figure 5 and see that the prediction error is not strongly correlated with the length of the predicted interval.

Figure 4: The width of the predicted intervals for our method, CHR, and CQR across different coverage levels, with one panel per dataset (MEPS-20, MEPS-21, Diabetes, Bimodal, Pendulum, Log Normal, Breast Cancer, Concrete, Community, Parkinson's, and others).

| DATASET | BIMODAL | LOGNORMAL | CONCRETE | MEPS-19 | MEPS-20 | MEPS-21 | BIO | COMMUNITY |
|---|---|---|---|---|---|---|---|---|
| NUMBER OF SINGLETONS | 5 | 197 | 194 | 3157 | 3509 | 3132 | 6725 | 382 |
| NUMBER OF EXAMPLES | 400 | 200 | 206 | 3157 | 3509 | 3132 | 9146 | 399 |

| DATASET | DIABETES | SOLAR | PARKINSONS | STOCK | CANCER | PENDULUM | ENERGY | FOREST |
|---|---|---|---|---|---|---|---|---|
| NUMBER OF SINGLETONS | 81 | 174 | 1103 | 105 | 37 | 115 | 153 | 96 |
| NUMBER OF EXAMPLES | 89 | 214 | 1175 | 108 | 39 | 126 | 154 | 104 |

Table 5: The number of singleton sets predicted by our method over all datasets. For most datasets, our method tends to predict singletons.

Figure 5: The correlation between absolute residual error and the length of our predicted intervals over several datasets (one panel per dataset). We see that there is not a strong correlation.

## H OUTPUT DISTRIBUTION

Here, we plot the distribution of lengths of all conformal prediction methods over all datasets. We also plot example density functions learned by our method, CHR, and KDE on all datasets. We note that KDE's learned densities appear relatively uninformative, and the densities learned by CHR are noisy and not smooth. We additionally plot several examples of only our learned probability distribution, which we reference in the experiments to demonstrate the different shapes of label distributions in the data.
Figure 6: Example probability distributions outputted by our learned method on datapoints from different datasets (panels include MEPS-20, MEPS-21, Diabetes, Bimodal, Pendulum, Log Normal, Breast Cancer, Concrete, Community, and Parkinson's). A variety of different distribution shapes are learned.

[Length-distribution plots comparing Ours, KDE, Lasso, CQR, and CHR on the Bimodal, Bio, Breast Cancer, Community, Concrete, Diabetes, Energy, Forest, Log Normal, MEPS-19, MEPS-20, MEPS-21, Parkinsons, Pendulum, Solar, and Stock datasets.]

Figure 7: Outputted density functions for ours, KDE, and CHR on BIO and BREAST CANCER.
Figure 8: Outputted density functions for ours, KDE, and CHR on BIMODAL and COMMUNITY.

Figure 9: Outputted density functions for ours, KDE, and CHR on CONCRETE and DIABETES.

Figure 10: Outputted density functions for ours, KDE, and CHR on ENERGY and FOREST.

Figure 11: Outputted density functions for ours, KDE, and CHR on LOG NORMAL and MEPS-19.

Figure 12: Outputted density functions for ours, KDE, and CHR on PARKINSON'S and PENDULUM.

Figure 13: Outputted density functions for ours, KDE, and CHR on SOLAR and STOCK.