# Credal Deep Ensembles for Uncertainty Quantification

Kaizheng Wang1,4, Fabio Cuzzolin3, Shireen Kudukkil Manchingal3, Keivan Shariatmadar2,4, David Moens2,4, Hans Hallez1
1KU Leuven, Department of Computer Science, DistriNet; 2KU Leuven, Department of Mechanical Engineering, LMSD; 3Oxford Brookes University, Visual Artificial Intelligence Laboratory; 4Flanders Make@KU Leuven
{kaizheng.wang, keivan.shariatmadar, david.moens, hans.hallez}@kuleuven.be
{fabio.cuzzolin, 19185895}@brookes.ac.uk

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

This paper introduces an innovative approach to classification called Credal Deep Ensembles (CreDEs), namely, ensembles of novel Credal-Set Neural Networks (CreNets). CreNets are trained to predict a lower and an upper probability bound for each class, which, in turn, determine a convex set of probabilities (credal set) on the class set. The training employs a loss inspired by distributionally robust optimization which simulates the potential divergence of the test distribution from the training distribution, in such a way that the width of the predicted probability interval reflects the epistemic uncertainty about the future data distribution. Ensembles can be constructed by training multiple CreNets, each associated with a different random seed, and averaging the outputted intervals. Extensive experiments are conducted on various out-of-distribution (OOD) detection benchmarks (CIFAR10/100 vs SVHN/Tiny-ImageNet, CIFAR10 vs CIFAR10-C, ImageNet vs ImageNet-O) and using different network architectures (ResNet50, VGG16, and ViT-Base). Compared to Deep Ensemble baselines, CreDEs demonstrate higher test accuracy, lower expected calibration error, and significantly improved epistemic uncertainty estimation.

1 Introduction

The quantification of the uncertainty associated with neural network predictions has recently attracted increasing attention, as a means to enhance the reliability and robustness of neural networks. Researchers agree to distinguish aleatoric uncertainty (AU) from epistemic uncertainty (EU): the former arises from inherent randomness, e.g., data noise, and is irreducible. The latter is caused by a lack of knowledge about the process which generates the data, due to the limited availability of training data, and is reducible [1, 36]. Effective EU quantification is beneficial for out-of-distribution (OOD) detection [30, 54] and can contribute to a variety of safety-critical applications, including autonomous driving [21], medical diagnosis [44], flood uncertainty estimation [10], and structural health monitoring [72]. In classification, standard neural networks (SNNs) whose predictions amount to single probability distributions are unable to account for epistemic uncertainty, because a single distribution assumes precise knowledge about the dependency between inputs and outputs. To properly capture the EU, the network's outcome needs to express the uncertainty about a prediction's uncertainty itself [35, 62]. The most well-known approach to EU quantification in deep learning leverages Bayesian neural networks (BNNs) [7, 22, 38]. BNNs model network parameters as distributions and thus predict a second-order distribution (i.e., a distribution of distributions) [36], although in practice predictions are often generated by running the network on sample parameters extracted from a posterior.
While efficient training techniques (such as sampling [33, 57] and variational inference [7, 22]) have been developed to mitigate their complexity, practical challenges persist for BNNs, including the scaling to large datasets, handling complex network architectures, and real-time applicability [54]. An alternative approach, called Deep Ensembles (DEs), handles uncertainty quantification by aggregating multiple individually-trained SNNs [43], so that predictions amount to finite sets of probability distributions over the classes. DEs, often viewed as an approximation of Bayesian model averaging (BMA) [38], have become a powerful baseline for uncertainty estimation [2, 25, 53, 54, 60]. However, empirical evidence [2] suggests that DEs could yield relatively low-quality estimates of epistemic uncertainty. Further, DEs lack a sound theoretical justification [27, 47]. Credal inference [13, 36, 62] (which predicts convex sets of probability distributions or credal sets [46] on the target space) can provide an alternative way of quantifying epistemic uncertainty. Credal representations have been extensively studied within the broader field of machine learning, including, e.g., the naive credal classifier [14], the notion of credal network [13] or credal random forest classification [65]. Imprecise BNNs have been recently introduced which model both network weights and predictions as credal sets [8]. While these models demonstrate robustness in Bayesian sensitivity analysis, their computational complexity is on a par with that of ensembles of BNNs, significantly limiting their practical applicability. Novelty and Main Contributions. This paper presents an innovative approach to classification tasks called Credal Deep Ensembles (Cre DEs), ensembles of novel Credal-Set Neural Networks (Cre Nets), aiming to improve EU quantification in the framework of credal inference. At the cost of merely doubling the number of output nodes compared to classical SNNs, Cre Nets are trained to predict a lower and an upper probability bound for each class rather than a single probability value. Such probability intervals over classes thus efficiently determine a prediction in the form of a credal set. The training strategy is inspired by Distributionally Robust Optimization [42, 55, 61], which simulates the potential divergence of the test distribution from the training distribution. As a result, the width of the predicted probability interval reflects the epistemic uncertainty about the future data distribution. Adopting an ensemble strategy, Cre DEs derive the final prediction by averaging the probability intervals outputted by the members of the ensemble. A conceptual comparison between Cre DEs and DEs is illustrated in Figure 1. Extensive experimental validation is conducted on several OOD detection benchmarks, including CIFAR10/100 (ID) vs SVHN/Tiny-Image Net (OOD), CIFAR10 (ID) vs CIFAR10-C (OOD), Image Net (ID) vs Image Net-O (OOD), and across different network architectures: Res Net50, VGG16 and Visual Transformer Base (Vi T Base). Compared to traditional Deep Ensembles, our Cre DEs achieve higher test accuracy and lower expected calibration error (ECE) on ID samples, and significantly improve the quality of EU estimation. 
Figure 1: Comparison between the proposed Credal Deep Ensembles and traditional Deep Ensembles (model space vs. target space Y; CreNets output credal sets formed by probability intervals, standard SNNs output single probability distributions). The former aggregate a collection of credal set predictions from CreNets as the final (credal) prediction, whereas the latter average a set of single probability distributions from standard SNNs as the outcome. E.g., in the probability simplex [16] associated with the target space Y = {A, B, D} (the triangle in the figure), a probability vector (q_A, q_B, q_D) is represented as a single point. For each CreNet, the predicted lower and upper probabilities of each class act as constraints (parallel lines) which determine a credal prediction (in gray). Single credal predictions are aggregated as in Sec. 2.4.

Related Work Besides BNNs, DEs, and credal inference, other second-order uncertainty estimation approaches exist, such as Dirichlet-based methods [9, 48, 49, 50, 63], in which predictions are represented as Dirichlet distributions. One significant challenge for the latter is the absence of ground truth labels. Although various loss functions have been proposed, these models' performance often diverges from theoretical EU assumptions [71]. Another rationale for the exclusion of Dirichlet-based approaches as baselines for our CreDE work is that such models often necessitate the inclusion of OOD data during training [48, 49, 56]. This challenges their practical adaptability, as it cannot guarantee their robustness against other forms of unseen OOD data [71]. Moreover, a recent study [39] has shown that these methods often fail to capture the EU properly, making the resulting measures difficult to interpret quantitatively.

Paper Outline The remainder of this paper is structured as follows. Sec. 2 presents our CreNets and CreDEs in full detail. Sec. 3 describes the experimental validations and results. Sec. 4 summarizes our conclusions and future work. Appendices report mathematical proofs in A, additional experiments in B, implementation details in C, the analysis of alternative ensemble strategies for CreDEs in D, and further discussion on future work in E (including achieving statistical guarantees using conformal learning and the framework's extension to regression), respectively.

2 Credal Deep Ensembles

The proposed Credal-Set Neural Network architecture and forward propagation are introduced in Sec. 2.1. CreNets' training procedure is discussed in Sec. 2.2. Class prediction and uncertainty quantification are discussed in Sec. 2.3. Credal Deep Ensembles are presented in Sec. 2.4.

2.1 Credal-Set Neural Networks

Architecturally, our CreNet design focuses only on the final classification layers, and can therefore be applied on top of any representation layers of neural network models. The final layers of a CreNet (Figure 2) first output a deterministic interval for each class, using for each class an output node associated with the interval midpoint m and one associated with its half-length h, respectively (a total of 2C nodes for C classes).

Figure 2: CreNet final layer structure for three classes (deterministic intervals are mapped by Interval SoftMax to probability intervals, which determine a credal set on the probability simplex).

Let z be the input vector to the final layer.
CreNets compute m and h (the vectors collecting interval midpoints and half-lengths for all classes) as:

$$m = g(W_{1:C}\, z + b_{1:C}), \qquad h = \zeta(W_{C+1:2C}\, z + b_{C+1:2C}), \qquad (1)$$

where $W_{1:C}$, $b_{1:C}$, $W_{C+1:2C}$, $b_{C+1:2C}$ are the weights and biases associated with the first C and the remaining C nodes, respectively. Here $g(\cdot)$ is an arbitrary activation function and $\zeta(\cdot)$ denotes the Softplus function [79] that ensures the non-negativity of h. The deterministic intervals associated with all classes, denoted as $[a_L, a_U] := \{[a_{L_i}, a_{U_i}]\}_{i=1}^{C}$, can then be obtained as $[a_L, a_U] = [m - h,\; m + h]$.

A proper mapping from such deterministic intervals $[a_L, a_U]$ to a collection of probability intervals $[q_L, q_U] := \{[q_{L_i}, q_{U_i}]\}_{i=1}^{C}$ for each class needs to ensure that $[q_L, q_U]$ satisfies the conditions:

$$q_{L_i} \leq q_{U_i} \quad \forall i = 1, \dots, C \qquad \text{and} \qquad \sum_{i=1}^{C} q_{L_i} \leq 1 \leq \sum_{i=1}^{C} q_{U_i}. \qquad (2)$$

The former condition guarantees a proper $[q_{L_i}, q_{U_i}]$ for each class. The latter enables the resulting collection of probability intervals to determine a non-empty credal set, $\mathbb{Q}$, as follows [52]:

$$\mathbb{Q} = \Big\{ q \;\Big|\; q_i \in [q_{L_i}, q_{U_i}];\; \sum_{i=1}^{C} q_i = 1 \Big\}. \qquad (3)$$

The probability vectors in $\mathbb{Q}$ meet the normalization condition, and their probability value per class is constrained by the probability intervals (Eq. (2)). Traditional SoftMax activation cannot ensure that the conditions in Eq. (2) are met when computing $[q_L, q_U]$ as $q_L = \mathrm{SoftMax}(a_L)$ and $q_U = \mathrm{SoftMax}(a_U)$, respectively. A toy example is given in Appendix A. Therefore, we employ the Interval SoftMax activation proposed in [75] to compute $[q_L, q_U]$ from $[a_L, a_U]$, as follows:

$$q_{L_i} = \frac{\exp(a_{L_i})}{\exp(a_{L_i}) + \sum_{k \neq i} \exp\!\big(\tfrac{a_{U_k} + a_{L_k}}{2}\big)}, \qquad q_{U_i} = \frac{\exp(a_{U_i})}{\exp(a_{U_i}) + \sum_{k \neq i} \exp\!\big(\tfrac{a_{U_k} + a_{L_k}}{2}\big)}, \qquad (4)$$

where $q_{L_i}$ and $q_{U_i}$ are the lower and upper probability bound for the i-th class, respectively. As proven in Appendix A, the probability intervals generated by the Interval SoftMax duly satisfy Eq. (2).

2.2 Training Procedure

The rationale for the training of a CreNet is for the predicted lower and upper bounds (Eq. (4)) to the probability of the classes, $q_L$ and $q_U$, to express the epistemic uncertainty (induced by the limited size and variability of the training set) about how different the distribution of future test data may be from that of the training data. To this end, we designed a composite loss function with two components: one, which applies classical cross entropy to the upper probability vector, encourages the latter to optimistically assume that the test data distribution will be similar to the training one. The other, inspired by Distributionally Robust Optimization (DRO) [42, 55, 61], pushes the lower probability to reflect a pessimistic stance on future distributional divergence. The width of the resulting interval will thus reflect the epistemic uncertainty associated with the prediction. We first contrast the classical training strategy with that of Distributionally Robust Optimization in Sec. 2.2.1. We then delve into the design and implementation of our CreNet loss in Sec. 2.2.2.

2.2.1 Classical and DRO Training Strategy

Vanilla Strategy Given a training set $D = \{x_n, t_n\}_{n=1}^{N}$, the conventional neural network training process aims to solve the following optimization problem:

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \frac{1}{N}\sum_{n=1}^{N} L\big((x_n, t_n), \theta\big), \qquad (5)$$

where θ denotes the model's trainable parameters in the space Θ and L denotes an arbitrary loss function. The underlying assumption is that the training and test distributions are identical. As a result, the trained network serves as an empirical risk minimizer [36].
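Before turning to how this identical-distribution assumption is relaxed, it may help to make the CreNet forward pass of Sec. 2.1 concrete. The snippet below is a minimal sketch of Eqs. (1) and (4), assuming PyTorch; the class and variable names are ours (not taken from the released code), the midpoint activation g is taken as the identity for simplicity, and a practical implementation would add the usual numerical-stability safeguards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CreNetHead(nn.Module):
    """Sketch of a CreNet output head: 2C nodes -> probability intervals."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        # 2C output nodes: C interval midpoints and C half-lengths (Eq. (1)).
        self.fc = nn.Linear(in_features, 2 * num_classes)
        self.num_classes = num_classes

    def forward(self, z: torch.Tensor):
        out = self.fc(z)                              # (batch, 2C)
        m = out[:, : self.num_classes]                # midpoints, g = identity
        h = F.softplus(out[:, self.num_classes:])     # half-lengths, kept >= 0
        a_low, a_up = m - h, m + h                    # deterministic intervals

        # Interval SoftMax (Eq. (4)): class i competes against exp((a_Uk + a_Lk)/2)
        # of the other classes, i.e. against exp(m_k), since (a_L + a_U)/2 = m.
        exp_mid = torch.exp(m)
        denom_others = exp_mid.sum(dim=1, keepdim=True) - exp_mid  # sum over k != i
        q_low = torch.exp(a_low) / (torch.exp(a_low) + denom_others)
        q_up = torch.exp(a_up) / (torch.exp(a_up) + denom_others)
        return q_low, q_up
```

Because the same midpoint term appears in both denominators, the lower and upper vectors are coupled, which is the property exploited by the composite loss discussed next.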
This assumption of identical training and test distributions, however, often results in over-optimistic predictions because the test observations may, in practice, significantly differ from the training data [34].

DRO Strategy In contrast to the vanilla strategy, the objective of DRO [6, 20] is to minimize the worst-case expected risk R(θ) over an uncertainty set of distributions $\mathcal{U}$, as follows:

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; R(\theta) := \sup_{U \in \mathcal{U}} \; \mathbb{E}_{(x,t) \sim U}\, L\big((x, t), \theta\big), \qquad (6)$$

in which $\mathbb{E}$ is the expectation operator. In practice, a group DRO setting [59] is adopted in which the training distribution P is assumed to be a mixture of m groups $P_g$, indexed by $g \in G = \{1, 2, \dots, m\}$. Because the optimum of a linear program is attained at a vertex, the worst-case risk in Eq. (6) is equivalent to a maximum over the expected loss of each group, as follows:

$$R(\theta) = \underset{g \in G}{\text{maximize}} \;\; \mathbb{E}_{(x,t) \sim P_g}\, L\big((x, t), \theta\big). \qquad (7)$$

In practice, the group DRO model minimizes the empirical worst-group risk $\hat{R}(\theta)$, namely:

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \hat{R}(\theta) := \underset{g \in G}{\text{maximize}} \;\; \mathbb{E}_{(x,t) \sim \hat{P}_g}\, L\big((x, t), \theta\big), \qquad (8)$$

where $\hat{P}_g$ is the empirical distribution of the g-th group of training points. Therefore, group DRO learns models with good worst-group training loss across groups [59]. One special form of group DRO is adversarially reweighted learning [42], which consists of a minimax game between a learner and an adversary. The learner optimizes for the main classification task and aims to learn the best parameters θ that minimize the expected loss. In contrast, the adversary maximizes the expected loss by making an adversarial assignment of weights $w_n$, collected in a vector w. Consequently, the training optimization problem assumes the form

$$\underset{\theta \in \Theta}{\text{minimize}} \;\; \underset{w \in S}{\text{maximize}} \;\; \frac{1}{N}\sum_{n=1}^{N} w_n\, L\big((x_n, t_n), \theta\big), \qquad (9)$$

where the set S of weight vectors varies across different implementations [42, 55, 61].

2.2.2 CreNet Loss Design and Implementation

Design As anticipated, the CreNet training process applies the vanilla training strategy (Eq. (5)) to the upper probability prediction vector $q_U$, and the DRO strategy (Eq. (9)) to the lower probability prediction $q_L$. The resulting overall loss function has a composite structure:

$$L_{\mathrm{CreNet}} := \underbrace{\frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}(q_{U_n}, t_n)}_{\text{Vanilla Component}} \;+\; \underbrace{\underset{w \in S}{\text{maximize}} \;\; \frac{1}{N}\sum_{n=1}^{N} w_n\, \mathrm{CE}(q_{L_n}, t_n)}_{\text{DRO Component}}, \qquad (10)$$

where CE denotes the classical cross-entropy loss function used in classification. Given a predicted discrete probability vector q and the ground-truth label t, CE is defined as $\mathrm{CE} := -\sum_{k=1}^{C} t_k \log_2 q_k$. The vanilla component is applied to the upper probability vector $q_U$ because such a loss takes the training distribution at face value and is thus more likely to encourage optimistic (overconfident) or upper bound predictions for the class scores. The DRO component is computed on the lower probability vectors $q_L$, as it weighs training outliers to simulate future differences in data distribution at test time, encouraging pessimistic or lower bound class score predictions. Thus, the width of the resulting probability interval will reflect the uncertainty associated with the model's ignorance of how much the future test distribution will differ from the training distribution, using the boundary/outlier cases observed at training time to guess what the uncertainty on future test cases will be.

Cross-Entropy of Lower/Upper Probability Vectors Please note that in Eq. (10), the CE is applied to lower/upper probability vectors, which are not (normalized) probability vectors.
However, as the ground truth (label) vector t equals 1 for the true class $j^*$ and 0 for all the other elements, calculating CE(q, t) for any predicted probability vector q reduces to $-\log_2 q(j^*)$. Consequently, all probability vectors with the same component for the true class will generate the same CE for that sample. The consequence for CreNet training is that feeding a lower (upper) probability vector $q_L$ ($q_U$) to Eq. (10) is equivalent to computing the CE with any one of the probability vectors in the credal prediction (Eq. (3)) whose probability for the true class $j^*$ equals the lower (upper) probability value there. It can be shown that these form one of the faces of the boundary of the credal set. Importantly, because of the functional structure of the Interval SoftMax activation (Eq. (4)), upper and lower probability vectors are not computed independently, but are correlated. Thus, they are minimized together via the total loss (Eq. (10)), with the DRO component also influencing the upper probability $q_U$, driving the solution away from the trivial one (all-ones upper probability vectors).

Implementation As the set S of weight vectors in Eq. (10) varies across different implementations [42, 55, 61] and estimating w in Eq. (10) is not straightforward when using batch-wise training [34], we resort to a simpler heuristic proposed by [34]. For each training batch, only the $\delta \in [0.5, 1)$ fraction of samples with the highest cross-entropy with the lower probability vector ($\mathrm{CE}(q_{L_n}, t_n)$) are selected to compute the DRO component of the loss. As a result, $w_n > 1$ is implicitly set in Eq. (10) for selected samples while $w_n = 0$ for deselected samples. The underlying rationale is the following. Within a batch of samples, those instances that demonstrate high losses are identified as hard-to-learn samples, essentially representing the minority group within a training dataset [34]. Setting a value δ thus identifies what fraction of the training points is chosen to represent potential future domain shifts at test time. A smaller δ signifies a more cautious approach, in which even a few training outliers can indicate future challenges. The lower bound to the design range for δ is 0.5 because we empirically observed that values of δ < 0.5 may destabilize the training process, as a too-large averaged loss is returned for backpropagation. When δ approaches 1, the data distribution of the samples considered by the vanilla and the DRO components of the loss becomes similar, implicitly assuming a less pronounced divergence between train and test distributions. The corresponding predicted probability intervals become narrower. If δ were theoretically set to 1, all samples would be selected for backpropagation, implying that $w_n = 1$ for any n in Eq. (10). Consequently, the loss in Eq. (10) would be the sum of the vanilla component on $q_U$ and the vanilla component on $q_L$. Empirically, we observed that this leads to a collapse of the upper and lower probability bounds to single values. The implementation of the CreNet training procedure is shown in Algorithm 1.

Algorithm 1 CreNet Training Procedure
Input: Training dataset $D = \{x_n, t_n\}_{n=1}^{N}$; portion of samples per batch $\delta \in [0.5, 1)$; batch size η
while training is enabled do
1. Compute $\mathrm{CE}(q_{U_n}, t_n)$ and $\mathrm{CE}(q_{L_n}, t_n)$ for each sample in the batch
2. Sort the sample indices $(m_1, \dots, m_\eta)$ in descending order of $\mathrm{CE}(q_{L_n}, t_n)$
3. Define $\eta_\delta = \delta \eta$
4. Minimize $L_{\mathrm{CreNet}} = \frac{1}{\eta}\sum_{n=1}^{\eta} \mathrm{CE}(q_{U_n}, t_n) + \frac{1}{\eta_\delta}\sum_{j=1}^{\eta_\delta} \mathrm{CE}(q_{L_{m_j}}, t_{m_j})$
end while

2.3 Class Prediction and Uncertainty Quantification

Class Prediction For the class prediction we employ the maximin and maximax criteria [66]:

$$\hat{i}_{\min} := \arg\max_i \; \bar{q}_{L_i}, \qquad \hat{i}_{\max} := \arg\max_i \; \bar{q}_{U_i}, \qquad (11)$$

which output (respectively) the class indices with the highest lower and upper reachable probability ($\bar{q}_{L_i}$ and $\bar{q}_{U_i}$) within the credal set induced by the predicted lower and upper probabilities $q_{L_i}$, $q_{U_i}$. Figure 3 illustrates how the lower and upper probabilities $q_{L_i}$, $q_{U_i}$ that determine the credal set $\mathbb{Q}$ may differ from the probabilities $\bar{q}_{L_i}$ and $\bar{q}_{U_i}$ actually reachable for each class within $\mathbb{Q}$. The reachable lower and upper probabilities for class i can be easily obtained as follows [17]:

$$\bar{q}_{U_i} = \min\Big(q_{U_i},\; 1 - \sum_{k \neq i} q_{L_k}\Big), \qquad \bar{q}_{L_i} = \max\Big(q_{L_i},\; 1 - \sum_{k \neq i} q_{U_k}\Big). \qquad (12)$$

Figure 3: If intervals are redundant, some of the (e.g.) upper probabilities $q_{U_A}$ may not be actually reachable in the credal set that results from the intersection of all interval constraints.

Uncertainty Quantification Given a credal set prediction, upper and lower entropies generalizing Shannon's entropy, denoted as $\overline{H}(\mathbb{Q})$ and $\underline{H}(\mathbb{Q})$, can be defined which may serve as measures for TU and AU, respectively [3, 36]. Computing $\overline{H}(\mathbb{Q})$ boils down to solving the following optimization problem:

$$\overline{H}(\mathbb{Q}) = \underset{q}{\text{maximize}} \;\; -\sum_{i=1}^{C} q_i \log_2 q_i \quad \text{s.t.} \quad q_{L_i} \leq q_i \leq q_{U_i} \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{C} q_i = 1. \qquad (13)$$

This seeks the highest entropy value of a probability distribution within the predicted credal set $\mathbb{Q}$. $\underline{H}(\mathbb{Q})$, for which maximize is replaced by minimize, searches for the minimal such entropy. Such optimization problems can be addressed using a standard solver, e.g., the SciPy optimization package [73]. Epistemic uncertainty can then be quantified as $\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$ [36].

Computational Complexity Reduction To reduce the computational complexity of Eq. (13) for a large value of C (e.g., C = 1000), we propose an original approach called Probability Interval Dimension Reduction (PIDR) in Algorithm 2. This method first identifies the K − 1 classes with the highest lower probability values, then merges the remaining elements into a single class with the associated upper and lower probability calculated using Eq. (12). Consequently, the dimension of the probability interval is reduced from C to K.

Algorithm 2 Probability Interval Dimension Reduction Algorithm
Input: $[q_L, q_U]$; chosen number of classes K
Output: Reduced-dimensional probability intervals $[r_L, r_U]$
1. Index vector of $q_L$ in descending order: $l \leftarrow \mathrm{argsort}(q_L)$
2. Define the upper and lower probability per selected class: $r_{L_j} \leftarrow q_{L_{l_j}}$, $r_{U_j} \leftarrow q_{U_{l_j}}$ for $j = 1, \dots, K-1$
3. Define the upper and lower probability for the merged (deselected) classes, applying Eq. (12) to the merged event: $r_{L_K} \leftarrow \max\big(\sum_{i=K}^{C} q_{L_{l_i}},\; 1 - \sum_{j=1}^{K-1} r_{U_j}\big)$; $r_{U_K} \leftarrow \min\big(\sum_{i=K}^{C} q_{U_{l_i}},\; 1 - \sum_{j=1}^{K-1} r_{L_j}\big)$

2.4 Credal Deep Ensembles

Inspired by conventional DEs [43], the final step of our approach is to introduce Credal Deep Ensembles (CreDEs). CreDEs aggregate M individually trained CreNets and predict the aggregated probability intervals, denoted as $[\bar{q}_L, \bar{q}_U]$, as follows:

$$\bar{q}_L = \frac{1}{M}\sum_{m=1}^{M} q_{L_m}, \qquad \bar{q}_U = \frac{1}{M}\sum_{m=1}^{M} q_{U_m}, \qquad (14)$$

where $[q_{L_m}, q_{U_m}]$ is the set of reachable probability intervals predicted by the m-th CreNet. Eq. (20) in Appendix D proves that $[\bar{q}_L, \bar{q}_U]$ satisfies the convexity condition in Eq. (2) for constructing a non-empty credal set. Therefore, class prediction and uncertainty estimation as described in Sec. 2.3 apply to CreDEs. We discuss the rationale for the averaging strategy and alternative ensemble approaches for CreDEs in Appendix D.
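To make the inference-time quantities of this section concrete, the sketch below (NumPy/SciPy assumed; function names are ours and not from the released code) computes the reachable bounds of Eq. (12), the upper and lower entropies of Eq. (13) with a generic constrained solver, and the CreDE interval averaging of Eq. (14). The released implementation may differ in solver choice and numerical details.

```python
import numpy as np
from scipy.optimize import minimize

def reachable_bounds(q_low, q_up):
    """Eq. (12): tighten raw bounds to the values actually attainable in Q."""
    low = np.maximum(q_low, 1.0 - (q_up.sum() - q_up))   # 1 - sum_{k != i} q_up_k
    up = np.minimum(q_up, 1.0 - (q_low.sum() - q_low))   # 1 - sum_{k != i} q_low_k
    return low, up

def credal_entropies(q_low, q_up):
    """Eq. (13): upper / lower Shannon entropy (in bits) over the credal set."""
    def entropy(q):
        q = np.clip(q, 1e-12, 1.0)
        return -np.sum(q * np.log2(q))

    x0 = (q_low + q_up) / 2.0
    x0 = x0 / x0.sum()                                   # rough feasible start
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    bnds = list(zip(q_low, q_up))
    h_up = -minimize(lambda q: -entropy(q), x0, bounds=bnds,
                     constraints=cons, method="SLSQP").fun
    h_low = minimize(entropy, x0, bounds=bnds,
                     constraints=cons, method="SLSQP").fun
    return h_up, h_low                                   # EU estimate: h_up - h_low

def crede_intervals(member_lows, member_ups):
    """Eq. (14): average the members' reachable probability intervals."""
    return np.mean(member_lows, axis=0), np.mean(member_ups, axis=0)
```

For a CreDE, each member's intervals are first tightened via reachable_bounds, then averaged via crede_intervals, and the resulting aggregated interval is fed to credal_entropies to obtain the TU, AU, and EU estimates used in the experiments.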
3 Experimental Validation

Setup We assessed CreDEs through OOD detection benchmarks across various dataset pairings (ID vs OOD samples), including CIFAR10 [41]/CIFAR100 [40] vs SVHN [31]/Tiny-ImageNet [45], CIFAR10 vs CIFAR10-C [29], and ImageNet [18] vs ImageNet-O [31]. We trained 15 CreNets (using δ = 0.5) and SNNs on the ResNet50 architecture [28] starting from different random seeds, using the training set of the ID dataset in the pair. Following this, we constructed 15 different CreDEs and DEs, respectively, by randomly selecting five members from the pool of 15 trained models. The same ensemble member lists are used for both DEs and CreDEs, with each ensemble strictly guaranteed to be distinct. More details are given in Appendix C. Code is available at https://gitlab.kuleuven.be/m-group-campus-brugge/distrinet_public/credal-deep-ensembles.git.

Uncertainty Quantification in DEs Total uncertainty (TU) can be quantified in DEs via the Shannon entropy (H) of the averaged predicted distribution. The AU, on the other hand, can be obtained by averaging the entropies of the predictions of each ensemble member [2, 36]. Namely,

$$\mathrm{TU} := H(\bar{q}) = H\Big(\frac{1}{M}\sum_{m=1}^{M} q_m\Big), \qquad \mathrm{AU} := \overline{H}(q) = \frac{1}{M}\sum_{m=1}^{M} H(q_m), \qquad (15)$$

where M is the number of networks, and $\bar{q}$ and $q_m$ denote the average probability vector and the single probability vector of the m-th SNN model, respectively. The level of epistemic uncertainty, representing an approximation of mutual information [36], can be obtained as $\mathrm{EU} := H(\bar{q}) - \overline{H}(q)$.

Test Accuracy and ECE on ID Samples We evaluated the test accuracy and expected calibration error (ECE) [24, 58] of CreDEs-5 and DEs-5 on the test set of each ID dataset. A lower ECE value signifies a closer alignment between the model's confidence scores and the true probabilities of the events. Since ECE is designed for a singular probability vector, we implemented a compromise calculation as follows. Suppose our model predicts the class indices k and j when using the $\hat{i}_{\min}$ and $\hat{i}_{\max}$ criteria, respectively; ECE values are then computed based on the associated reachable lower probability $\bar{q}_{L_k}$ and upper probability $\bar{q}_{U_j}$ in the credal set. Table 1 reports the test accuracy and ECE for DEs-5 and CreDEs-5 on the various datasets, indicating that our CreDEs-5 achieved higher test accuracy and lower ECE on ID samples. Note that employing the $\hat{i}_{\min}$ prediction showed higher ECE on the challenging ImageNet dataset. This is likely because the strategy, selecting the class with the highest lower reachable probability, is a conservative one.

Table 1: Test accuracy (%, ↑) and ECE (↓) of DEs-5 and CreDEs-5 using CIFAR10, CIFAR100, and ImageNet as ID datasets over 15 runs. The better performance is marked in bold.

| Model | CIFAR10: Test Accuracy | CIFAR10: ECE | CIFAR100: Test Accuracy | CIFAR100: ECE | ImageNet: Test Accuracy | ImageNet: ECE |
|---|---|---|---|---|---|---|
| DEs-5 | 93.32 ± 0.13 | 0.0131 ± 0.0010 | 75.80 ± 0.28 | 0.0392 ± 0.0027 | 77.92 ± 0.02 | 0.2415 ± 0.0009 |
| CreDEs-5 (Ours), î_min | 93.75 ± 0.11 | 0.0092 ± 0.0016 | 79.54 ± 0.21 | 0.0366 ± 0.0025 | 78.41 ± 0.02 | 0.5930 ± 0.0006 |
| CreDEs-5 (Ours), î_max | 93.74 ± 0.11 | 0.0108 ± 0.0017 | 79.65 ± 0.19 | 0.0268 ± 0.0023 | 78.51 ± 0.02 | 0.1685 ± 0.0004 |

Table 2: OOD detection AUROC and AUPRC performance (%, ↑) between CreDEs-5 and DEs-5 based on ResNet50 using EU as uncertainty metric on CIFAR10/100 vs. SVHN/Tiny-ImageNet and ImageNet vs. ImageNet-O. Results are averaged over 15 runs. Best results in bold.
| Model (EU measure) | CIFAR10 vs SVHN: AUROC | AUPRC | CIFAR10 vs Tiny-ImageNet: AUROC | AUPRC | CIFAR100 vs SVHN: AUROC | AUPRC | CIFAR100 vs Tiny-ImageNet: AUROC | AUPRC | ImageNet vs ImageNet-O: AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| DEs-5 ($H(\bar{q}) - \overline{H}(q)$) | 89.58 ± 0.93 | 92.29 ± 1.00 | 86.87 ± 0.20 | 83.02 ± 0.16 | 73.83 ± 1.97 | 84.96 ± 1.25 | 78.80 ± 0.20 | 74.68 ± 0.27 | 65.03 ± 0.53 | 62.77 ± 0.38 |
| CreDEs-5 ($\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$) | 96.55 ± 0.25 | 98.17 ± 0.17 | 88.10 ± 0.26 | 87.85 ± 0.35 | 78.55 ± 1.15 | 86.57 ± 0.65 | 82.54 ± 0.26 | 77.60 ± 0.44 | 67.82 ± 0.06 | 62.80 ± 0.12 |

EU Quantification for OOD Detection It is our hypothesis that OOD data express a higher EU. Hence, we can use a better EU quantification as the means to improve OOD detection [54]. Thus, superior OOD detection performance provides compelling evidence of enhanced uncertainty estimation quality. For the OOD detection performance assessment, we employed AUROC (Area Under the Receiver Operating Characteristic curve) and AUPRC (Area Under the Precision-Recall curve) scores. AUROC captures true and false positive rates, while AUPRC assesses precision and recall trade-offs, offering valuable insights into model effectiveness across various confidence levels. When calculating $\overline{H}(\mathbb{Q})$ and $\underline{H}(\mathbb{Q})$ in the ImageNet vs ImageNet-O experiment, we employed our PIDR Algorithm 2 with K = 20. Table 2 reports the OOD detection performance of CreDEs-5 and DEs-5 in the CIFAR10/CIFAR100 vs SVHN/Tiny-ImageNet, and ImageNet vs ImageNet-O settings. As the CIFAR10-C dataset comprises data from CIFAR10 corrupted in 15 distinct ways, each with 5 different intensities, Figure 4 presents averaged AUROC and AUPRC scores for OOD detection on CIFAR10 vs CIFAR10-C across types of corruption, against the intensity of corruption. Table 2 and Figure 4 confirm CreDEs-5's superior OOD detection performance over DEs-5. This indicates the effectiveness of CreDEs in improving the EU quantification quality, using $\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$ as the uncertainty measure.

Figure 4: OOD detection (CIFAR10 vs CIFAR10-C) over increased corruption intensity, comparing EU measured as $\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$ (Ours) and as $H(\bar{q}) - \overline{H}(q)$.

Qualitative Evaluation Due to the high dimensionality, visualizing or directly computing the size of the credal set becomes challenging as C increases. Consequently, we indirectly evaluate whether our CreDEs consistently generate nearly Dirac credal sets as predictions through the maximum attainable upper bound probability of the prediction. The closer this probability is to 1, the more it approximates a Dirac credal set. Figure 5 shows the results of ResNet50-based CreDEs-5 for the CIFAR10, SVHN, and Tiny-ImageNet datasets. It verifies that our method does not consistently generate nearly Dirac credal sets, especially for OOD samples. For CIFAR10, a substantial proportion of (but not all) the credal sets are quasi-Dirac. This observation is reasonable as it is consistent with the high test accuracy of CreDEs and the low ECE reported in Table 1.

Figure 5: Maximum reachable upper probability $\max(\bar{q}_{U_1}, \dots, \bar{q}_{U_C})$ per sample from 15 runs (panels: CIFAR10, SVHN, Tiny-ImageNet).

Additionally, Figure 6 shows the reliability diagram [24] of the ResNet50-based DEs-5 and CreDEs-5 on the CIFAR10 dataset, demonstrating better calibration performance of our CreDEs. Figure 7 showcases the EU estimation plots for these models. Although the EU estimates for DEs-5 and CreDEs-5 are not directly comparable due to differing representations, CreDEs-5 demonstrates significantly higher EU estimates for OOD samples, as observed qualitatively.
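For reference, one standard way to obtain the scores reported in Table 2 and Figure 4 is to pool the per-sample EU estimates of the ID and OOD test sets, treat OOD membership as the positive class, and score the pooled values with AUROC/AUPRC. The sketch below is our own illustration (scikit-learn assumed; function names are hypothetical) and also includes the DE baseline decomposition of Eq. (15) for completeness.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def de_uncertainties(member_probs):
    """Eq. (15): TU = H(mean prediction), AU = mean member entropy, EU = TU - AU.
    member_probs has shape (M, N, C): M members, N samples, C classes."""
    member_probs = np.clip(member_probs, 1e-12, 1.0)
    mean_p = member_probs.mean(axis=0)
    tu = -np.sum(mean_p * np.log2(mean_p), axis=-1)
    au = -np.sum(member_probs * np.log2(member_probs), axis=-1).mean(axis=0)
    return tu, au, tu - au

def ood_detection_scores(eu_id, eu_ood):
    """AUROC / AUPRC with OOD samples as the positive class and EU as the score."""
    labels = np.concatenate([np.zeros_like(eu_id), np.ones_like(eu_ood)])
    scores = np.concatenate([eu_id, eu_ood])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```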
Ablation Study on Various Network Architectures We also performed an ablation study on network backbones different from ResNet50, including VGG16 [67] and Vision Transformer Base (ViT-Base) [77]. Table 3 reports the test accuracy and ECE of CreDEs-5 and DEs-5 on the CIFAR10 test set (representing ID samples) and the OOD detection performance on CIFAR10 vs SVHN/Tiny-ImageNet. Figure 8 compares OOD detection performance in the CIFAR10 vs CIFAR10-C setting against the intensity of corruption, using both AUPRC and AUROC as metrics. The results consistently demonstrate that CreDEs achieve higher test accuracy, lower ECE, and significantly improved epistemic uncertainty estimation, leading to enhanced OOD detection performance.

Figure 6: Reliability diagram of ResNet50-based DEs-5 and CreDEs-5 (using î_min and î_max, respectively) on CIFAR10. The per-panel ECE values are 0.8110 × 10⁻² and 0.9434 × 10⁻² for the two CreDEs-5 criteria, and 1.4582 × 10⁻² for DEs-5.

Figure 7: EU estimates comparison of ResNet50-based models on CIFAR10 (ID), SVHN (OOD), and Tiny-ImageNet (OOD).

Table 3: Test accuracy (%, ↑) and ECE (↓) of DEs-5 and CreDEs-5 on CIFAR10 as ID dataset (left). AUROC and AUPRC scores (%, ↑) for OOD detection on CIFAR10 vs SVHN/Tiny-ImageNet (right). Results averaged over 15 runs. The best results are in bold.

| Backbone | Model | Test Accuracy (CIFAR10) | ECE |
|---|---|---|---|
| VGG16 | DEs-5 | 85.53 ± 0.10 | 0.0815 ± 0.0011 |
| VGG16 | CreDEs-5 (Ours), î_min | 87.94 ± 0.11 | 0.0203 ± 0.0014 |
| VGG16 | CreDEs-5 (Ours), î_max | 87.92 ± 0.11 | 0.0611 ± 0.0012 |
| ViT-Base | DEs-5 | 90.43 ± 0.97 | 0.0181 ± 0.0019 |
| ViT-Base | CreDEs-5 (Ours), î_min | 93.60 ± 0.40 | 0.0107 ± 0.0014 |
| ViT-Base | CreDEs-5 (Ours), î_max | 93.59 ± 0.39 | 0.0104 ± 0.0012 |

| Backbone | Model (EU measure) | CIFAR10 vs SVHN: AUROC | AUPRC | CIFAR10 vs Tiny-ImageNet: AUROC | AUPRC |
|---|---|---|---|---|---|
| VGG16 | DEs-5 ($H(\bar{q}) - \overline{H}(q)$) | 82.19 ± 0.82 | 87.52 ± 0.81 | 78.58 ± 0.15 | 73.28 ± 0.23 |
| VGG16 | CreDEs-5 (Ours) ($\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$) | 87.68 ± 0.73 | 93.47 ± 0.57 | 82.56 ± 0.28 | 80.81 ± 0.52 |
| ViT-Base | DEs-5 ($H(\bar{q}) - \overline{H}(q)$) | 77.71 ± 1.67 | 88.73 ± 0.32 | 82.27 ± 0.79 | 78.85 ± 0.81 |
| ViT-Base | CreDEs-5 (Ours) ($\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$) | 88.57 ± 2.08 | 93.24 ± 1.25 | 88.73 ± 0.32 | 87.84 ± 0.52 |

Figure 8: OOD detection on CIFAR10 vs CIFAR10-C against increased corruption intensity, using VGG16 and ViT-Base as backbones, comparing EU measured as $\overline{H}(\mathbb{Q}) - \underline{H}(\mathbb{Q})$ (Ours) and as $H(\bar{q}) - \overline{H}(q)$.

Ablation Study on Hyperparameter δ for CreNet Training In our main evaluation, we set by default δ = 0.5 to reflect a balanced assessment of the train-test divergence and show how such a value allows our model to outperform the baselines. Table 4 reports the test accuracy and OOD detection performance (using EU estimates) of CreDEs-5 under various values of δ. The ablation findings verify the robustness of CreDEs across hyperparameter setups and indicate that δ = 0.5 might be too pessimistic a choice in CIFAR10 settings. Performance peaks at δ = 0.875 in most cases, implying that δ = 0.875 may provide the optimal estimate of how test and train sets diverge for CIFAR10. One possible way to find the best δ in practice is to conduct standard cross-validation on specific test scenarios. However, the method is not particularly sensitive to this hyperparameter.

Table 4: Test accuracy (%, ↑) and OOD detection performance (%, ↑) of CreDEs-5 using various δ. Results are averaged over 15 runs.

| δ | 0.5 | 0.625 | 0.75 | 0.875 | 0.9375 | 0.96875 |
|---|---|---|---|---|---|---|
| Test Accuracy (CIFAR10), î_max | 93.74 | 94.54 | 94.47 | 94.57 | 93.88 | 93.99 |
| Test Accuracy (CIFAR10), î_min | 93.75 | 94.55 | 94.47 | 94.56 | 93.87 | 93.99 |
| SVHN (OOD detection), AUROC | 97.44 | 97.44 | 97.92 | 97.95 | 97.42 | 97.51 |
| SVHN (OOD detection), AUPRC | 93.07 | 96.34 | 97.00 | 96.92 | 98.79 | 98.82 |
| Tiny-ImageNet (OOD detection), AUROC | 88.28 | 89.01 | 89.10 | 89.18 | 89.85 | 89.24 |
| Tiny-ImageNet (OOD detection), AUPRC | 88.13 | 89.81 | 89.76 | 89.72 | 89.18 | 89.26 |
Looking ahead, an interesting option, in the presence of multiple datasets (e.g., acquired over time in a continual learning setting), could be applying the DRO loss component to different components of the training set, and assessing the results to robustly select δ. We also report the average EU estimation values of CreDEs-5 for each dataset in Table 5. Increasing the value of δ (i.e., giving less importance to the divergence between test and training distributions) leads to a decreasing trend in the average EU estimates per dataset (particularly for ID CIFAR10 samples). This aligns with the intuition that, if the model is more uncertain about the divergence of the distributions (smaller δ), it should express a larger EU. Despite smaller uncertainty values at high values of δ, the difference between ID and OOD samples remains noticeable. This explains why a δ closer to 1 does not always lead to poor OOD detection performance and why our model's OOD detection performance is robust against the choice of δ.

Table 5: Averaged EU estimates of CreDEs-5 using various δ.

| δ | 0.5 | 0.625 | 0.75 | 0.875 | 0.9375 | 0.96875 |
|---|---|---|---|---|---|---|
| CIFAR10 | 0.3557 | 0.0611 | 0.0586 | 0.0572 | 0.0222 | 0.0215 |
| SVHN | 1.6093 | 0.2450 | 0.2553 | 0.2593 | 0.1612 | 0.1574 |
| Tiny-ImageNet | 1.4996 | 0.2030 | 0.1957 | 0.1970 | 0.1025 | 0.1005 |

Model Inference Complexity Table 6 reports the parameter count and inference cost on one NVIDIA A100-SXM4-40GB GPU for a single SNN and CreNet on ImageNet. CreNets show a marginal increase in complexity due to their minor architectural modifications. More discussion on the inference and training complexity is presented in Appendix C.

Table 6: Model complexity of a ResNet50-based SNN and CreNet on the ImageNet dataset.

| Model | Parameters (million) | Inference time per sample (ms) |
|---|---|---|
| SNN vs CreNet | 25.557 vs 27.606 | 5.5 ± 0.2 vs 5.7 ± 0.3 |

Additional Experiments Appendix B.1 discusses the implementation and performance of EU quantification in CreDEs when using a different uncertainty measure for credal sets (the generalized Hartley measure [4, 36]). The results demonstrate that our CreDEs consistently enhance the quality of EU quantification, exhibiting robustness against different measures. Appendix B.2 reports an ablation study for the hyperparameter K of our PIDR Algorithm 2, which shows the effect of K on CreDEs' uncertainty quantification and time cost. Appendix B.3 assesses the ability of CreDEs to evaluate total uncertainty (TU) (as opposed to EU) in OOD detection, suggesting that our CreDEs also achieve an improved TU estimation, compared to DEs. Appendices B.4, B.5, and B.6 compare the uncertainty quantification abilities of CreDEs versus those of traditional DEs that also apply the DRO strategy, DEs that apply the product of experts [32] ensemble setting, and several BNN baselines, respectively. CreDEs continue to demonstrate superior performance in uncertainty estimation. Appendix B.7 assesses CreDEs in a case study involving active learning [23, 54]. All these additional experiments demonstrate that our CreDEs deliver improved uncertainty quantification.

4 Conclusion

Conclusion In this paper, we introduced a novel Credal-Set Neural Network (CreNet) for classification tasks. Given any input instance, a CreNet is designed to predict a lower and an upper probability for each class, rather than a single probability value, thus providing an efficient and effective implementation of credal inference. We also proposed Credal Deep Ensembles (CreDEs), ensembles of CreNets, which extend the traditional deep ensemble idea to the credal domain.
Extensive experimental validation was conducted on several OOD detection benchmarks, and across different network architectures and uncertainty measures. Compared to traditional Deep Ensembles, our Cre DEs achieve higher test accuracy and lower ECE on ID samples, while significantly improving the quality of EU and TU estimation, leading in turn to strongly enhanced OOD detection performance. Hence, we believe our work can potentially improve neural network safety and reliability, and have wide applicability to real-world scenarios such as medical image analysis. Limitation Despite the superior performance of Cre DEs, neither traditional DEs nor Cre DEs may be desirable when memory usage is stringent and computational resources are limited. Future Work Three essential objectives of our future research include elaborating on statistical coverage guarantees of our Cre DEs (outlined in Appendix E.2), extending our framework to regression tasks (a roadmap is provided in Appendix E.3), and assessing our Cre DEs alongside other uncertainty-aware models in real-world applications comprehensively, like medical image analysis. Acknowledgement We thank the anonymous reviewers for their valuable feedback. This work has received funding from the European Union s Horizon 2020 research and innovation program under grant agreement No. 964505 (E-pi). [1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243 297, 2021. [2] Taiga Abe, Estefany Kelly Buchanan, Geoff Pleiss, Richard Zemel, and John P Cunningham. Deep ensembles work, but are they necessary? Advances in Neural Information Processing Systems, 35:33646 33660, 2022. [3] Joaquín Abellán, George J Klir, and Serafín Moral. Disaggregated total uncertainty measure for credal sets. International Journal of General Systems, 35(1):29 44, 2006. [4] Joaquín Abellán and Serafín Moral. A non-specificity measure for convex sets of probability distributions. International journal of uncertainty, fuzziness and knowledge-based systems, 8(03):357 367, 2000. [5] Tathagata Basu, Matthias CM Troffaes, and Jochen Einbeck. Binary credal classification under sparsity constraints. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 82 95. Springer, 2020. [6] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341 357, 2013. [7] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning, pages 1613 1622. PMLR, 2015. [8] Michele Caprio, Souradeep Dutta, Kuk Jin Jang, Vivian Lin, Radoslav Ivanov, Oleg Sokolsky, and Insup Lee. Imprecise Bayesian neural networks. ar Xiv preprint ar Xiv:2302.09656, 2023. [9] Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. Posterior network: Uncertainty estimation without OOD samples via density-based pseudo-counts. Advances in neural information processing systems, 33:1356 1367, 2020. [10] Priyanka Chaudhary, João P Leitão, Tabea Donauer, Stefano D Aronco, Nathanaël Perraudin, Guillaume Obozinski, Fernando Perez-Cruz, Konrad Schindler, Jan Dirk Wegner, and Stefania Russo. 
Flood uncertainty estimation using deep ensembles. Water, 14(19):2980, 2022. [11] David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models. Advances in neural information processing systems, 7, 1994. [12] Giorgio Corani and Alessandro Antonucci. Credal ensembles of classifiers. Computational statistics & data analysis, 71:818 831, 2014. [13] Giorgio Corani, Alessandro Antonucci, and Marco Zaffalon. Bayesian networks with imprecise probabilities: Theory and application to classification. Data Mining: Foundations and Intelligent Paradigms: Volume 1: Clustering, Association and Classification, pages 49 93, 2012. [14] Giorgio Corani and Marco Zaffalon. Learning reliable classifiers from small or incomplete data sets: The naive credal classifier 2. Journal of Machine Learning Research, 9(4), 2008. [15] Fabio Cuzzolin. Credal semantics of bayesian transformations in terms of probability intervals. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(2):421 432, 2009. [16] Fabio Cuzzolin. The geometry of uncertainty: The geometry of imprecise probabilities. Springer Nature, 2020. [17] Luis M. De Campos, Juan F. Huete, and Serafin Moral. Probability intervals: A tool for uncertain reasoning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 02:167 196, June 1994. [18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248 255. Ieee, 2009. [19] Thierry Denœux. An evidential neural network model for regression based on random fuzzy numbers. In International Conference on Belief Functions, pages 57 66. Springer, 2022. [20] John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946 969, 2021. [21] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019. [22] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1050 1059. PMLR, 2016. [23] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International conference on machine learning, pages 1183 1192. PMLR, 2017. [24] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321 1330. PMLR, 2017. [25] Fredrik K Gustafsson, Martin Danelljan, and Thomas B Schon. Evaluating scalable Bayesian deep learning methods for robust computer vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 318 319, 2020. [26] Ralph VL Hartley. Transmission of information 1. Bell System technical journal, 7(3):535 563, 1928. [27] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. Advances in neural information processing systems, 33:1010 1022, 2020. [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [29] Dan Hendrycks and Thomas Dietterich. 
Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019. [30] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ar Xiv preprint ar Xiv:1610.02136, 2016. [31] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262 15271, 2021. [32] Geoffrey E. Hinton. Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks, 1999. [33] Matthew D Hoffman, Andrew Gelman, et al. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. [34] Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, and Eric P Xing. The two dimensions of worst-case training and their integrated effect for out-of-domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9631 9641, 2022. [35] Eyke Hüllermeier, Sébastien Destercke, and Mohammad Hossein Shaker. Quantification of credal uncertainty in machine learning: A critical analysis and empirical comparison. In Proceedings of the Uncertainty in Artificial Intelligence, pages 548 557. PMLR, 2022. [36] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457 506, 2021. [37] Alireza Javanmardi, David Stutz, and Eyke Hüllermeier. Conformalized credal set predictors. ar Xiv preprint ar Xiv:2402.10723, 2024. [38] Laurent Valentin Jospin, Hamid Laga, Farid Boussaid, Wray Buntine, and Mohammed Bennamoun. Hands-on Bayesian neural networks A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29 48, 2022. [39] Mira Juergens, Nis Meinert, Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Is epistemic uncertainty faithfully represented by evidential deep learning methods? In Forty-first International Conference on Machine Learning, 2024. [40] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012. [41] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute For Advanced Research). Technical report, CIFAR, 2009. [42] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. Advances in neural information processing systems, 33:728 740, 2020. [43] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017. [44] Antonis Lambrou, Harris Papadopoulos, and Alex Gammerman. Reliable confidence measures for medical diagnosis with evolutionary algorithms. IEEE Transactions on Information Technology in Biomedicine, 15(1):93 99, 2010. [45] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015. [46] Isaac Levi. The enterprise of knowledge: An essay on knowledge, credal probability, and chance. MIT press, 1980. [47] Jeremiah Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. 
Advances in Neural Information Processing Systems, 33:7498 7512, 2020. [48] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018. [49] Andrey Malinin and Mark Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019. [50] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In International Conference on Learning Representations, 2019. [51] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International conference on machine learning, pages 2498 2507. PMLR, 2017. [52] Serafín Moral-García and Joaquín Abellán. Credal sets representable by reachable probability intervals and belief functions. International Journal of Approximate Reasoning, 129:84 102, 2021. [53] Bálint Mucsányi, Michael Kirchhof, and Seong Joon Oh. Benchmarking uncertainty disentanglement: Specialized uncertainties for specialized tasks. ar Xiv preprint ar Xiv:2402.19460, 2024. [54] Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip HS Torr, and Yarin Gal. Deep deterministic uncertainty: A new simple baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24384 24394, 2023. [55] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33:20673 20684, 2020. [56] Jay Nandy, Wynne Hsu, and Mong Li Lee. Towards maximizing the representation gap between in-domain & out-of-distribution examples. Advances in neural information processing systems, 33:9239 9250, 2020. [57] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011. [58] Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR workshops, volume 2, 2019. [59] Yonatan Oren, Shiori Sagawa, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust language modeling. ar Xiv preprint ar Xiv:1909.02060, 2019. [60] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019. [61] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019. [62] Yusuf Sale, Michele Caprio, and Eyke Höllermeier. Is the volume of a credal set a good measure for epistemic uncertainty? In Uncertainty in Artificial Intelligence, pages 1795 1804. PMLR, 2023. [63] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. Advances in neural information processing systems, 31, 2018. [64] Glenn Shafer and Vladimir Vovk. A tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(3), 2008. [65] Mohammad Hossein Shaker and Eyke Hüllermeier. Ensemble-based uncertainty quantification: Bayesian versus credal inference. In PROCEEDINGS 31. WORKSHOP COMPUTATIONAL INTELLIGENCE, volume 25, page 63, 2021. 
[66] Keivan Shariatmadar and Mark Versteyhe. Numerical linear programming under nonprobabilistic uncertainty models interval and fuzzy sets. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 28(03):469 495, 2020. [67] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society, 2015. [68] Yutong Song and Yong Deng. Divergence measure of belief function and its application in data fusion. IEEE Access, 7:107465 107472, 2019. [69] Hélène Soubaras. Towards an axiomatization for the generalization of the kullback-leibler divergence to belief functions. In Proceedings of the 7th conference of the European Society for Fuzzy Logic and Technology, pages 1090 1097. Atlantis Press, 2011. [70] Holger Trittenbach, Adrian Englhardt, and Klemens Böhm. An overview and a benchmark of active learning for outlier detection with one-class classifiers. Expert Systems with Applications, 168:114372, 2021. [71] Dennis Thomas Ulmer, Christian Hardmeier, and Jes Frellsen. Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. Transactions on Machine Learning Research, 2023. [72] Manuel A Vega and Michael D Todd. A variational bayesian neural network for structural health monitoring and cost-informed decision-making in miter gates. Structural Health Monitoring, 21(1):4 18, 2022. [73] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. [74] Frans Voorbraak. A computationally efficient approximation of dempster-shafer theory. International Journal of Man-Machine Studies, 30(5):525 536, 1989. [75] Kaizheng Wang, Keivan Shariatmadar, Shireen Kudukkil Manchingal, Fabio Cuzzolin, David Moens, and Hans Hallez. Creinns: Credal-set interval neural networks for uncertainty estimation in classification tasks. ar Xiv preprint ar Xiv:2401.05043, 2024. [76] Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. ar Xiv preprint ar Xiv:1803.04386, 2018. [77] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. ar Xiv preprint ar Xiv:2006.03677, 2020. [78] Marco Zaffalon. The naive credal classifier. Journal of statistical planning and inference, 105(1):5 21, 2002. [79] Hao Zheng, Zhanlei Yang, Wenju Liu, Jizhong Liang, and Yanpeng Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1 4, 2015. 
A Mathematical Proof

Toy Problem for Unavailability of Traditional SoftMax The traditional SoftMax activation function cannot be used to define the credal set, as it cannot ensure the conditions in Eq. (2) when computing $[q_L, q_U]$ as $q_L = \mathrm{SoftMax}(a_L)$ and $q_U = \mathrm{SoftMax}(a_U)$, respectively. For example, assuming that we have $a_L := (-1, 0, 1)$ and $a_U := (0, 1, 3)$ from a CreNet, the $q_L$ and $q_U$ computed using SoftMax are: $q_L = \mathrm{SoftMax}(a_L) = (0.0900, 0.2447, 0.6653)$, $q_U = \mathrm{SoftMax}(a_U) = (0.0420, 0.1142, 0.8438)$. The resulting probability intervals are not properly defined and appear unreasonable, as some lower bounds are considerably higher than the upper bounds.

Mathematical Proof for Interval SoftMax The proof that the Interval SoftMax in Eq. (4) does satisfy the conditions in Eq. (2) is straightforward:

$$q_{L_i} = \frac{\exp(a_{L_i})}{\sum_{k \neq i}\exp\!\big(\tfrac{a_{U_k}+a_{L_k}}{2}\big) + \exp(a_{L_i})} \;\leq\; \frac{\exp\!\big(\tfrac{a_{U_i}+a_{L_i}}{2}\big)}{\sum_{k \neq i}\exp\!\big(\tfrac{a_{U_k}+a_{L_k}}{2}\big) + \exp\!\big(\tfrac{a_{U_i}+a_{L_i}}{2}\big)} \;\leq\; \frac{\exp(a_{U_i})}{\sum_{k \neq i}\exp\!\big(\tfrac{a_{U_k}+a_{L_k}}{2}\big) + \exp(a_{U_i})} = q_{U_i},$$

since $a_{L_i} \leq \tfrac{a_{U_i}+a_{L_i}}{2} \leq a_{U_i}$. Hence $q_{L_i} \leq q_{U_i}$ for every class; moreover, the middle term is the SoftMax of the interval midpoints, which sums to 1 over the classes, so $\sum_{i=1}^{C} q_{L_i} \leq 1 \leq \sum_{i=1}^{C} q_{U_i}$.

B Additional Experiments

In this section, Appendix B.1 discusses the implementation and performance of EU quantification of CreDEs using another uncertainty measure for credal sets: the generalized Hartley measure. Appendix B.2 performs an ablation study on the value of the hyperparameter K of the PIDR Algorithm 2. Appendix B.3 assesses CreDEs' performance in the OOD detection task when quantifying uncertainty using TU instead of EU. Appendices B.4, B.5, and B.6 compare CreDEs versus traditional DEs that also apply the DRO strategy, DEs that apply the product of experts ensemble setting [32], and several BNN baselines, respectively. Appendix B.7 assesses CreDEs in an active learning case study.

B.1 Generalized Hartley Measure for EU Quantification of CreDEs

Uncertainty quantification in credal sets merits further investigation. For instance, recent research [35] has explored, e.g., the use of probability interval length as a measure of epistemic uncertainty, in the special case of binary classification. However, these measures cannot be readily extended to multi-class cases. To date, the most established methods for decomposing the total uncertainty of credal sets are the generalized Entropy [3, 36] and the generalized Hartley measure [4, 36].

Definition The generalized Hartley measure [4], $GH(\mathbb{Q})$, measures the non-specificity across the distributions in the credal set, and can be seen as a proxy for its volume [36]. Mathematically, $GH(\mathbb{Q})$ calculates the expectation of the Hartley measure [26] over all possible subsets B of the target space Y² as follows [4]:

$$GH(\mathbb{Q}) = \sum_{B \subseteq \mathbb{Y}} m_{\mathbb{Q}}(B)\, \log_2(|B|), \qquad (17)$$

in which $m_{\mathbb{Q}}$ denotes the mass assignment function associated to $\mathbb{Q}$ and $|B|$ indicates the cardinality of B. $m_{\mathbb{Q}}(B)$ can be computed using the Möbius inverse of the capacity function $\nu_{\mathbb{Q}}$ [36], as follows:

$$m_{\mathbb{Q}}(B) = \sum_{A \subseteq B} (-1)^{|B \setminus A|}\, \nu_{\mathbb{Q}}(A), \qquad (18)$$

where $B \setminus A = \{y \mid y \in B \text{ and } y \notin A\}$ and $\nu_{\mathbb{Q}}$ describes the lower probability of all possible subsets $A \subseteq B$.

² In classification, the target space Y comprises a finite set of class labels, namely $\mathbb{Y} = \{y_1, \dots, y_C\}$.

Efficient Implementation in CreDEs One of the reasons that hinder the application of the generalized Hartley measure is its computational complexity. In our work, we propose an efficient implementation and an approximate approach (the PIDR Algorithm 2) for computing it. In our case, the lower probability $\nu_{\mathbb{Q}}(A)$ associated with the predicted credal set can be readily computed as follows:

$$\nu_{\mathbb{Q}}(A) = \max\Big(\sum_{y_j \in A} \bar{q}_{L_j},\; 1 - \sum_{y_j \notin A} \bar{q}_{U_j}\Big), \qquad (19)$$

where $\bar{q}_L$ and $\bar{q}_U$ are the reachable lower and upper probability values per class in the defined credal set.
B Additional Experiments

In this section, Appendix B.1 discusses the implementation and the EU-quantification performance of CreDEs using another uncertainty measure for credal sets: the generalized Hartley measure. Appendix B.2 performs an ablation study on the value of the hyperparameter K of the PIDR algorithm (Algorithm 2). Appendix B.3 assesses CreDEs' performance in the OOD detection task when quantifying uncertainty using TU instead of EU. Appendices B.4, B.5, and B.6 compare CreDEs with traditional DEs that also apply the DRO strategy, with DEs that apply the product-of-experts ensemble setting [32], and with several BNN baselines, respectively. Appendix B.7 assesses CreDEs in an active learning case study.

B.1 Generalized Hartley Measure for EU Quantification of CreDEs

Uncertainty quantification on credal sets merits further investigation. For instance, recent research [35] has explored the use of the probability interval length as a measure of epistemic uncertainty, in the special case of binary classification. However, such measures cannot be readily extended to multi-class cases. At present, the most established methods for decomposing the total uncertainty of credal sets are the generalized entropy [3, 36] and the generalized Hartley measure [4, 36].

Definition The generalized Hartley measure [4], GH(Q), measures the non-specificity across the distributions in the credal set, and can be seen as a proxy for its volume [36]. Mathematically, GH(Q) is the expectation of the Hartley measure [26] over all possible subsets B of the target space Y,² as follows [4]:
GH(Q) = \sum_{B \subseteq Y} m_Q(B) \log_2(|B|), (17)
in which m_Q denotes the mass assignment function associated with Q and |B| indicates the cardinality of B. m_Q(B) can be computed via the Möbius inverse of the capacity function ν_Q [36], as follows:
m_Q(B) = \sum_{A \subseteq B} (-1)^{|B \setminus A|} \nu_Q(A), (18)
where B \setminus A = \{y \mid y \in B \text{ and } y \notin A\} and ν_Q describes the lower probability of all possible subsets A \subseteq B.

²In classification, the target space Y comprises a finite set of class labels, namely Y = \{y_1, \ldots, y_C\}.

Efficient Implementation in CreDEs One of the reasons that hinder the application of the generalized Hartley measure is its computational complexity. In our work, we propose an efficient implementation and an approximate approach (the PIDR algorithm, Algorithm 2) for computing it. In our case, the lower probability ν_Q(A) associated with the predicted credal set can be readily computed as
\nu_Q(A) = \max\left( \sum_{y_j \in A} q_{L,j}, \; 1 - \sum_{y_j \notin A} q_{U,j} \right), (19)
where q_L and q_U are the reachable lower and upper probability values per class in the defined credal set. They can be easily obtained from Eq. (12). Figure 3 illustrates how the lower and upper probabilities that determine the credal set Q may differ from the probabilities actually reachable for each class within Q. The full GH(Q) calculation process is presented in Algorithm 3.

Algorithm 3 GH(Q) Calculation
Input: [q_L, q_U] := \{[q_{L,i}, q_{U,i}]\}_{i=1}^{C}; target space Y
Output: GH(Q)
Initialize: GH(Q) = 0
for all B \subseteq Y with |B| \geq 2 do
  Initialize: m_Q(B) = 0
  for all A \subseteq B do
    Compute \nu_Q(A) using Eq. (19)
    m_Q(B) = m_Q(B) + (-1)^{|B \setminus A|} \nu_Q(A) (Eq. (18))
  end for
  GH(Q) = GH(Q) + m_Q(B) \log_2(|B|) (Eq. (17))
end for

Although the use of probability intervals simplifies the calculation of GH(Q) in general, a significant challenge arises for large values of C (e.g., C = 100), due to the need to enumerate subsets of the C classes. However, when applying our proposed PIDR algorithm (Algorithm 2), the dimension of the probability interval is reduced from C to K; therefore, calculating GH(Q) requires only 2^K subsets.
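For illustration, here is a small NumPy sketch of Eqs. (17)-(19) and Algorithm 3 for a credal set defined by per-class probability intervals. Function names such as lower_prob and generalized_hartley are ours; this is an illustrative brute-force sketch, not the optimized implementation used in the experiments, and its cost grows exponentially with the number of classes, which is precisely what motivates PIDR.

```python
import numpy as np
from itertools import combinations

def lower_prob(A, q_L, q_U):
    """Reachable lower probability of a subset A of class indices (Eq. (19))."""
    A = list(A)
    not_A = [i for i in range(len(q_L)) if i not in A]
    return max(q_L[A].sum(), 1.0 - q_U[not_A].sum())

def generalized_hartley(q_L, q_U):
    """Generalized Hartley measure GH(Q) of the credal set induced by the
    probability intervals [q_L, q_U] (Algorithm 3, Eqs. (17)-(18))."""
    q_L, q_U = np.asarray(q_L, float), np.asarray(q_U, float)
    C = len(q_L)
    gh = 0.0
    for size in range(2, C + 1):                 # singletons contribute log2(1) = 0
        for B in combinations(range(C), size):
            m_B = 0.0
            for r in range(len(B) + 1):          # Moebius inverse over all A subset of B
                for A in combinations(B, r):
                    m_B += (-1) ** (len(B) - len(A)) * lower_prob(A, q_L, q_U)
            gh += m_B * np.log2(len(B))
    return gh

# Toy usage: a 3-class probability-interval credal set
q_L = np.array([0.2, 0.1, 0.3])
q_U = np.array([0.5, 0.4, 0.6])
print(generalized_hartley(q_L, q_U))
```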
Experimental Validation of GH(Q) for OOD Detection The OOD detection results for CreDEs-5 using GH(Q) are shown in Tables 7 and 8 and in Figure 9. Probability Interval Dimension Reduction (PIDR, Algorithm 2) is utilized with K = 4 and K = 10 when computing GH(Q) for the dataset pairs containing CIFAR10 and CIFAR100/ImageNet, respectively. The results verify that our CreDEs consistently enhance the quality of EU quantification, exhibiting robustness across different uncertainty measures, i.e., the generalized Shannon entropy and the generalized Hartley measure. This improved EU quantification leads to better OOD detection performance compared to the Deep Ensemble baselines. The proposed PIDR algorithm ensures an efficient implementation of the generalized Hartley measure in our framework. An ablation study on PIDR's hyperparameter is conducted in Appendix B.2. Note, however, that applying K = 10 for the setting ImageNet vs ImageNet-O does not yield a better result, due to the coarseness of approximating 1000 classes using only 10. This suggests that computing GH(Q) is still challenging for tasks involving 1000 or more classes.

Table 7: OOD detection AUROC and AUPRC performance (%, ↑) of CreDEs-5 and DEs-5 based on ResNet50, using EU as the uncertainty metric, on CIFAR10/100 vs. SVHN/Tiny-ImageNet and ImageNet vs. ImageNet-O. Results are averaged over 15 runs. Best results are in bold. Each cell reports AUROC / AUPRC.
Columns: CIFAR10 vs SVHN | CIFAR10 vs Tiny-ImageNet | CIFAR100 vs SVHN | CIFAR100 vs Tiny-ImageNet | ImageNet vs ImageNet-O
DEs-5, H(q̄) − H̄(q): 89.58±0.93 / 92.29±1.00 | 86.87±0.20 / 83.02±0.16 | 73.83±1.97 / 84.96±1.25 | 78.80±0.20 / 74.68±0.27 | 65.03±0.53 / 62.77±0.38
CreDEs-5 (ours), H̄(Q) − H̲(Q): 96.55±0.25 / 98.17±0.17 | 88.10±0.26 / 87.85±0.35 | 78.55±1.15 / 86.57±0.65 | 82.54±0.26 / 77.60±0.44 | 67.82±0.06 / 62.80±0.12
CreDEs-5 (ours), GH(Q): 96.72±0.24 / 98.25±0.17 | 89.54±0.16 / 88.74±0.24 | 79.23±1.19 / 87.17±0.66 | 83.01±0.24 / 78.95±0.44 | 63.46±0.06 / 58.13±0.10

Table 8: AUROC and AUPRC scores (%, ↑) for OOD detection on CIFAR10 vs SVHN/Tiny-ImageNet, using the VGG16 and ViT-Base backbones. Results averaged over 15 runs. Best results are in bold. Each cell reports AUROC / AUPRC.
Columns: CIFAR10 vs SVHN | CIFAR10 vs Tiny-ImageNet
VGG16, DEs-5, H(q̄) − H̄(q): 82.19±0.82 / 87.52±0.81 | 78.58±0.15 / 73.28±0.23
VGG16, CreDEs-5 (ours), H̄(Q) − H̲(Q): 87.68±0.73 / 93.47±0.57 | 82.56±0.28 / 80.81±0.52
VGG16, CreDEs-5 (ours), GH(Q): 86.99±0.72 / 93.18±0.41 | 82.23±0.18 / 80.83±0.24
ViT-Base, DEs-5, H(q̄) − H̄(q): 77.71±1.67 / 88.73±0.32 | 82.27±0.79 / 78.85±0.81
ViT-Base, CreDEs-5 (ours), H̄(Q) − H̲(Q): 88.57±2.08 / 93.24±1.25 | 88.73±0.32 / 87.84±0.52
ViT-Base, CreDEs-5 (ours), GH(Q): 89.07±1.66 / 93.32±1.06 | 89.19±0.42 / 88.21±0.58

Figure 9: OOD detection on CIFAR10 vs CIFAR10-C against increasing corruption intensity, using ResNet50, VGG16, and ViT-Base as backbones.

B.2 Ablation Study on the Hyperparameter of the PIDR Algorithm

Effect on GH(Q) Quantification Figure 10 illustrates the influence of various settings of K on GH(Q) quantification. The average GH(Q) values suggest that using the PIDR algorithm (Algorithm 2) leads to an underestimated GH(Q), compared to the result obtained without PIDR (K = 10). Consequently, increasing the value of K enhances OOD detection performance. However, as K grows, the execution time increases exponentially, due to the iterative calculations of m_Q(B) and ν_Q(A) in Algorithm 3 across 2^K subsets. The time cost is measured on a single Intel Xeon Gold 8358 CPU @ 2.6 GHz. While higher than the time cost of the EU calculation for DEs (4.1e-4 ms), Figure 10 shows that calculating GH for 10 classes (17 ms for K = 10) remains practical and does not pose actual computational constraints. Besides, the reported numbers are for a GH calculation without any optimization: a more efficient code implementation could significantly reduce the cost. The effect of various settings of K on H̄(Q) is examined next.

Figure 10: Average time cost of GH(Q) (black dotted line) and GH(Q) value per sample across various datasets (blue lines), along with the AUROC/AUPRC scores (green/purple lines) for OOD detection, for increasing values of K.

Effect on TU Quantification In this experiment, we examine the effect of K on TU estimation. Figure 11 shows the average values of the TU estimates (H̄(Q)) per sample, together with the AUROC and AUPRC scores for CIFAR100 vs. SVHN/Tiny-ImageNet. The results indicate that applying PIDR (Algorithm 2) tends to underestimate TU values. Consequently, increasing the value of K improves OOD detection performance, but it also increases the execution time, because solving the constrained optimization problem in Eq. (13) involves more variables and constraints. The reported time cost is measured on a single Intel Xeon Gold 8358 CPU @ 2.6 GHz, without any optimization of the calculation process. We believe a more efficient code implementation could significantly mitigate this.

Figure 11: Average H̄(Q) time cost, average H̄(Q) value per sample, and OOD performance on the OOD detection benchmark (CIFAR100 vs. SVHN/Tiny-ImageNet) for increasing values of K.

B.3 Total Uncertainty Estimation Evaluation via OOD Detection

In further additional experiments, we also assess the quality of the total uncertainty (TU) estimates produced by CreDEs-5 on the various OOD detection benchmarks [43, 54]. The results in Tables 9 and 10 consistently demonstrate CreDEs' improved OOD detection performance when using TU as the metric.

B.4 Comparison between CreDEs and DEs with the DRO Strategy

In this experiment, we additionally train 15 ResNet50-based SNNs on the CIFAR10 and CIFAR100 datasets, respectively, using the DRO loss component and the same training strategy as in Algorithm 1. For a fair comparison, we set δ = 0.5 (just as for CreNets) and use the same random seeds and training epochs. Other training configurations are described in Appendix C. We name the resulting deep ensembles with 5 ensemble members DEs*-5.
Table 9: OOD detection AUROC and AUPRC performance (%, ↑) of CreDEs-5 and DEs-5 based on ResNet50, using TU as the uncertainty metric, on CIFAR10/100 vs. SVHN/Tiny-ImageNet and ImageNet vs. ImageNet-O. Results are averaged over 15 runs. Best results in bold. Each cell reports AUROC / AUPRC.
Columns: CIFAR10 vs SVHN | CIFAR10 vs Tiny-ImageNet | CIFAR100 vs SVHN | CIFAR100 vs Tiny-ImageNet | ImageNet vs ImageNet-O
DEs-5, H(q̄): 94.80±0.43 / 97.26±0.29 | 88.80±0.19 / 87.21±0.29 | 78.53±1.94 / 88.83±1.01 | 80.75±0.15 / 77.65±0.19 | 50.20±0.07 / 50.43±0.06
CreDEs-5, H̄(Q): 95.71±0.42 / 97.73±0.27 | 89.02±0.10 / 88.02±0.15 | 79.44±1.45 / 88.10±0.79 | 83.49±0.17 / 80.61±0.33 | 67.56±0.06 / 62.79±0.18

Table 10: OOD detection AUROC and AUPRC performance (%, ↑) of CreDEs-5 and DEs-5 based on VGG16 and ViT-Base, using TU as the uncertainty metric, on CIFAR10 vs. SVHN/Tiny-ImageNet. Results are averaged over 15 runs. Best results in bold. Each cell reports AUROC / AUPRC.
Columns: VGG16: SVHN (OOD), Tiny-ImageNet (OOD) | ViT-Base: SVHN (OOD), Tiny-ImageNet (OOD)
DEs-5, H(q̄): 84.50±0.49 / 90.78±0.35 | 79.40±0.10 / 75.91±0.14 | 79.80±1.75 / 87.97±1.17 | 83.81±0.81 / 81.68±0.89
CreDEs-5, H̄(Q): 87.05±0.80 / 93.36±0.42 | 82.14±0.14 / 80.81±0.16 | 87.30±1.77 / 92.24±1.15 | 88.17±0.44 / 86.94±0.60

We compare test accuracy and ECE for DEs*-5 and CreDEs-5 in Table 11, and their OOD detection performance on the CIFAR10/100 (ID) vs SVHN/Tiny-ImageNet (OOD) benchmark in Table 12.

Table 11: Test accuracy and ECE of DEs*-5 and CreDEs-5 on the CIFAR10 and CIFAR100 datasets. Best results in bold. Each cell reports test accuracy, ECE.
Columns: CIFAR10 | CIFAR100
DEs*-5: 91.53±0.22, 0.0159±0.0019 | 68.34±0.52, 0.0372±0.0033
CreDEs-5 (î_min): 93.75±0.11, 0.0092±0.0016 | 79.54±0.21, 0.0366±0.0025
CreDEs-5 (î_max): 93.74±0.11, 0.0108±0.0017 | 79.65±0.19, 0.0268±0.0023

Table 12: OOD detection performance comparison of DEs*-5 and CreDEs-5 on the dataset pairs CIFAR10/100 (ID) vs SVHN/Tiny-ImageNet (OOD). Each cell reports AUROC / AUPRC.
Columns: CIFAR10 vs SVHN | CIFAR10 vs Tiny-ImageNet | CIFAR100 vs SVHN | CIFAR100 vs Tiny-ImageNet
TU, DEs*-5, H(q̄): 91.82±0.96 / 95.13±0.70 | 86.26±0.30 / 84.09±0.42 | 78.70±1.61 / 88.20±0.91 | 76.99±0.28 / 73.03±0.37
TU, CreDEs-5, H̄(Q): 95.71±0.42 / 97.73±0.27 | 89.02±0.10 / 88.02±0.15 | 79.44±1.45 / 88.10±0.17 | 83.49±0.17 / 80.61±0.33
EU, DEs*-5, H(q̄) − H̄(q): 87.21±1.49 / 91.09±1.39 | 84.58±0.30 / 80.80±0.42 | 74.38±1.39 / 84.67±0.86 | 75.27±0.38 / 70.80±0.48
EU, CreDEs-5, H̄(Q) − H̲(Q): 96.55±0.25 / 98.17±0.17 | 88.10±0.26 / 87.85±0.35 | 78.55±1.15 / 86.57±0.65 | 82.54±0.26 / 77.60±0.44
EU, CreDEs-5, GH(Q): 96.72±0.24 / 98.25±0.17 | 89.54±0.16 / 88.74±0.24 | 79.23±1.19 / 87.17±0.66 | 83.01±0.24 / 78.95±0.44

The reported results demonstrate that CreDEs-5 outperform the DEs*-5 ensembles, achieving higher test accuracy and lower ECE values. Regarding OOD detection, CreDEs generally improve the AUROC and AUPRC scores under both the TU and the EU metric, nearly across the board. These results suggest that CreDEs provide higher-quality EU and TU estimates. In Table 12, a 0.1% drop in AUPRC under the TU metric can be observed. Recall, however, that CreDEs calculate TU (the upper entropy) by solving the constrained optimization problem in Eq. (13) with a numerical solver from SciPy. The slight performance decrease is likely due to numerical errors in the optimization process.
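For concreteness, the upper entropy of a probability-interval credal set can be obtained with an off-the-shelf SciPy solver along the following lines. The exact objective, solver, and options used for Eq. (13) in the paper are not reproduced here; the sketch below, which maximizes the Shannon entropy subject to the interval and simplex constraints, is only our illustrative reading of that optimization problem.

```python
import numpy as np
from scipy.optimize import minimize

def upper_entropy(q_L, q_U):
    """Sketch: maximum Shannon entropy (in bits) over the credal set
    {q : q_L <= q <= q_U, sum(q) = 1} (cf. Eq. (13))."""
    q_L, q_U = np.asarray(q_L, float), np.asarray(q_U, float)
    C = len(q_L)

    def neg_entropy(q):
        q = np.clip(q, 1e-12, 1.0)            # avoid log(0)
        return float(np.sum(q * np.log2(q)))  # negative entropy (to be minimized)

    # Feasible starting point: start at the lower bounds and greedily
    # distribute the remaining probability mass within the upper bounds.
    q0, slack = q_L.copy(), 1.0 - q_L.sum()
    for i in range(C):
        add = min(q_U[i] - q0[i], slack)
        q0[i] += add
        slack -= add

    res = minimize(neg_entropy, q0, method="SLSQP",
                   bounds=list(zip(q_L, q_U)),
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}])
    return -res.fun

# Toy usage on a 3-class probability-interval credal set
print(upper_entropy([0.2, 0.1, 0.3], [0.5, 0.4, 0.6]))
```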
B.5 Comparison between CreDEs and DEs with the POE Ensemble Setting

In this experiment, we compare our CreDEs with DEs using the product-of-experts (POE) [32] ensemble setting, as opposed to the mixture-of-experts approach more commonly employed in our primary analysis. Here, DEsp-5 denotes deep ensembles that aggregate the final predictions of the ensemble members using the POE strategy. The experimental setup mirrors that of the OOD detection benchmarks, involving the dataset pairs CIFAR10 (ID) vs SVHN/Tiny-ImageNet (OOD). Table 13 shows that DEsp-5 can improve test accuracy but significantly degrades the calibration of DEs-5 (larger ECE values). Across these comparisons, CreDEs-5 emerges as the best-performing method. Furthermore, we evaluate uncertainty estimation through the OOD detection benchmark. Specifically, the entropy of the final prediction of DEsp-5 is calculated to quantify the total uncertainty. For CreDEs-5 and DEs-5, we use the upper entropy H̄(Q) and H(q̄), respectively. The results in Table 14 consistently demonstrate the superior performance of our method. Although the POE strategy improves the test accuracy of classical DEs, it significantly degrades calibration and leads to inferior OOD detection performance.

Table 13: Test accuracy (ACC, %) and ECE comparison on the CIFAR10 dataset, using the ResNet50, VGG16, and ViT-Base architectures. Each cell reports ACC, ECE.
Columns: ResNet50 | VGG16 | ViT-Base
CreDEs-5 (î_max): 93.74±0.11, 0.0109±0.0017 | 87.92±0.11, 0.0611±0.0012 | 93.59±0.39, 0.0104±0.0012
CreDEs-5 (î_min): 93.75±0.11, 0.0092±0.0016 | 87.94±0.11, 0.0203±0.0014 | 93.60±0.40, 0.0107±0.0014
DEs-5: 93.32±0.13, 0.0131±0.0010 | 85.53±0.10, 0.0815±0.0011 | 90.43±0.97, 0.0181±0.0019
DEsp-5: 93.47±0.11, 0.0610±0.0011 | 85.55±0.08, 0.1368±0.0008 | 90.56±0.90, 0.0894±0.0087

Table 14: OOD detection performance comparison (%) on CIFAR10 vs SVHN/Tiny-ImageNet, using the ResNet50, VGG16, and ViT-Base architectures. Each cell reports AUROC / AUPRC.
Columns: ResNet50: SVHN, Tiny-ImageNet | VGG16: SVHN, Tiny-ImageNet | ViT-Base: SVHN, Tiny-ImageNet
CreDEs-5: 95.71±0.42 / 97.73±0.27 | 89.02±0.10 / 88.02±0.15 | 87.05±0.80 / 93.36±0.42 | 82.14±0.14 / 80.81±0.16 | 87.30±1.77 / 92.24±1.15 | 88.17±0.44 / 86.94±0.60
DEs-5: 94.80±0.43 / 97.26±0.29 | 88.80±0.19 / 87.21±0.29 | 84.50±0.49 / 90.78±0.35 | 79.40±0.10 / 75.91±0.14 | 79.80±1.75 / 87.97±1.17 | 83.81±0.81 / 81.67±0.89
DEsp-5: 93.90±0.24 / 96.10±0.21 | 88.03±0.20 / 84.11±0.32 | 84.10±0.22 / 89.83±0.16 | 78.11±0.08 / 72.23±0.16 | 82.41±1.56 / 88.51±0.95 | 83.21±1.02 / 78.24±1.17

B.6 Comparison between CreDEs and Bayesian Neural Networks

As discussed in the main body, the main reason for excluding Bayesian neural network (BNN) approaches from our main evaluation is that they generally have difficulty scaling to large datasets and complex network architectures [54]. In this section, we conduct an additional comparison between CreDEs and DEs, MCDropout [22], and two TensorFlow-standardized BNNs (BNN-R [51] and BNN-F [76]). All models are based on ResNet50 and trained on the CIFAR10 dataset from scratch. The input data shape is (32, 32, 3). The Adam optimizer is applied with a learning rate scheduler initialized at 0.001; the learning rate is multiplied by 0.1 at epochs 80 and 120. For the BNNs, 10 forward passes are used for uncertainty estimation. The uncertainty evaluation via OOD detection on CIFAR10 vs SVHN/Tiny-ImageNet is reported in Table 15. The results consistently demonstrate the significant improvements brought by our CreDEs.
B.7 Case Study on Active Learning Settings

Active learning (AL) aims to efficiently train models with minimal data by acquiring additional samples from a vast pool of unlabeled data, which are then labeled by experts [11]. After each acquisition step, the model is retrained using the expanded training set. The iterative process continues until either the desired accuracy or the maximum number of acquired samples is reached. Efficient data acquisition relies on reliable estimates of model uncertainty [23, 54].

Setup We deploy CreDEs-5 (δ = 0.5) and DEs-5 (the baseline) using the ResNet18 architecture, with clean MNIST samples in the pool set. The TU and EU estimates of each approach are used as acquisition functions. We begin with an initial training set of 20 randomly selected MNIST points. In each iteration, we acquire the 5 samples with the highest reported uncertainty estimates (EU or TU, per model). After each step, we train the models using the Adam optimizer for 20 epochs and select the one with the best accuracy on the validation set. The AL process stops when the training set size reaches 150.

Table 15: OOD detection AUROC and AUPRC performance (%, ↑) of CreDEs-5 and Bayesian models based on ResNet50, using EU and TU as uncertainty metrics, on CIFAR10 vs. SVHN/Tiny-ImageNet. Results are averaged over 15 runs. The best results are in bold. "drop" denotes the dropout rate applied to MCDropout. Each cell reports AUROC / AUPRC; the first two columns use the EU measure as the metric, the last two the TU measure.
Columns: EU: SVHN (OOD), Tiny-ImageNet (OOD) | TU: SVHN (OOD), Tiny-ImageNet (OOD)
CreDEs-5: 79.14±1.49 / 86.84±1.18 | 82.85±0.29 / 80.71±0.42 | 81.00±0.75 / 88.66±0.46 | 84.06±0.11 / 82.16±0.13
DEs-5: 73.53±1.65 / 83.81±1.42 | 76.13±0.58 / 70.86±0.67 | 77.93±0.65 / 84.92±0.39 | 80.22±0.26 / 76.94±0.30
BNN-R: 70.30±3.55 / 82.41±2.45 | 72.91±2.01 / 67.82±2.10 | 73.37±2.00 / 82.69±1.58 | 73.98±1.85 / 70.52±1.89
BNN-F: 70.15±4.38 / 82.04±3.01 | 73.66±1.46 / 68.52±1.53 | 73.77±2.62 / 82.90±1.71 | 74.57±1.30 / 71.11±1.29
MCDropout (0.1 drop): 74.19±1.55 / 82.93±1.01 | 75.04±0.77 / 68.25±1.31 | 76.92±1.85 / 85.93±1.22 | 77.48±0.56 / 73.63±0.62
MCDropout (0.4 drop): 61.66±1.89 / 73.47±1.27 | 67.24±1.36 / 59.55±1.41 | 79.25±0.96 / 86.04±0.77 | 76.04±0.57 / 72.73±0.65

Figure 12: AL experiments using different acquisition functions. Achieved test accuracy vs. acquired training set size.

Results Figure 12 compares CreDEs-5 and DEs-5 using the TU and EU estimates of each model as acquisition functions. On MNIST, targeting 90% accuracy or a maximum of 150 acquired samples, CreDEs-5 with the acquisition functions TU (H̄(Q)) and EU (GH(Q)) performs better than DEs-5 using TU (H(q̄)). In addition, CreDEs-5 with EU (H̄(Q) − H̲(Q)) outperforms DEs-5 with EU (H(q̄) − H̄(q)). This additional evidence confirms the improved quality of the EU and TU estimates of CreDEs compared to DEs. In future work, we aim to explore the integration of our methods into other active learning benchmarks [70] and real-world applications, or to further improve on them.

C Experiment Implementation Details

For the main experiments on the ResNet50 backbone, we used two Tesla P100-SXM2-16GB GPUs to independently train 15 SNNs and CreNets on the CIFAR10 and CIFAR100 datasets. The input shape of both networks was (224, 224, 3). We employed the Adam optimizer, with a learning rate scheduler set at 0.001 and reduced to 0.0001 during the last five training epochs.
Figure 13 shows the averaged training and validation accuracy, used to monitor the training process.

Figure 13: Averaged training and validation (Val) accuracy (%) for CreNets and SNNs over 15 runs, on (a) CIFAR10 and (b) CIFAR100. The U and L in the CreNet labels denote the accuracies associated with the upper and lower probability bounds, namely î_max and î_min, respectively.

In the ImageNet experiments, we employed three NVIDIA A100-SXM4-80GB GPUs. To create deep ensembles, we independently retrained 15 deep SNNs based on a pre-trained ResNet50 model for 3 epochs, using the Adam optimizer with an initial learning rate of 1e-6. For CreDEs, we initialized the CreNet weights using a pre-trained ResNet50 model and independently retrained 15 CreNet models for 5 epochs, using the Adam optimizer with an initial learning rate of 1e-5. The choice of a larger learning rate and epoch count for CreNets is a consequence of their modified final layer compared to SNNs.

For the ablation study on various network architectures, we again utilized two Tesla P100-SXM2-16GB GPUs and one NVIDIA A100-SXM4-80GB GPU to independently train 15 SNNs and CreNets based on the VGG16 and ViT-Base architectures, respectively, using the CIFAR10 dataset. VGG16-based SNNs and CreNets were trained for 20 epochs. SNNs and CreNets using the ViT-Base backbone were trained for 25 and 40 epochs, respectively. The input shape for both networks was set to (224, 224, 3). For optimization, we employed the Adam optimizer with a learning rate scheduler initialized at 0.001 and reduced to 0.0001 during the final 5 training epochs.

Training Complexity We did not report training time complexity in the main paper because CreNets use a custom training loop, unlike the TensorFlow-standardized training of standard neural networks (SNNs), precluding a fair comparison. Nevertheless, we did train a single CreNet and a single SNN based on the ResNet50 architecture on the CIFAR10 dataset, from scratch and on a single A100 GPU. The training time per epoch is 16.36 s for the SNN and 73.77 s for the CreNet, respectively. Given the evidence that CreNets only marginally increase the inference time (a single forward pass), we are optimistic that, by standardizing and optimizing the custom training loop and adopting a more efficient code implementation of Algorithm 1, we could significantly reduce the training load.

Further Discussion on Inference Complexity As discussed previously, doubling the number of final-layer nodes only slightly increases the inference time. For instance, the inference time per sample for a ResNet50 architecture on the ImageNet dataset is 5.5 ms for a single standard neural network vs 5.7 ms for a single CreNet (a marginal increase). The inference cost on the CIFAR10/100 datasets reported in Table 16 further demonstrates the slight increase in inference complexity of our method. Moreover, Table 17 presents the inference cost evaluated on a single AMD EPYC 7643 48-core CPU. The results indicate no significant overhead for our CreDEs and also demonstrate that employing VGG16, a lighter model architecture compared to ResNet50, substantially reduces the inference cost for both DEs and CreDEs.

Table 16: Complexity comparison between ResNet50-based SNNs and CreNets using the CIFAR10/100 datasets.
The inference cost per dataset is measured on a single NVIDIA P100-SXM2-16GB GPU for both models.
Columns: Parameters (million) | Inference time per sample (ms)
CIFAR10, SNNs vs CreNets: 26.216 vs 26.221 | 60.6±0.7 vs 63.0±1.1
CIFAR100, SNNs vs CreNets: 26.262 vs 26.314 | 62.5±0.5 vs 63.1±0.7

Table 17: Inference cost comparison on CPU between SNNs and CreNets per single CIFAR10 input, for different architectures.
Columns: VGG16 (ms) | ResNet50 (ms)
SNNs vs CreNets: 19.2±3.8 vs 23.1±5.2 | 148.2±49.0 vs 163.3±39.4

Regarding the uncertainty estimation cost, we report the cost of calculating the generalized Hartley (GH) measure and the upper entropy in Figures 10 and 11, respectively. For example, the time cost of the GH calculation for CIFAR10 without approximation is 17 ms (0.02 ms in the reduced case considering 4 out of 10 classes), while calculating the EU in deep ensembles for CIFAR10 takes 1e-4 ms, measured on the same single CPU. Though higher, these costs keep CreDEs practical, without actual computational constraints. In addition, the reported numbers are obtained without code efficiency optimization: a more efficient code implementation could significantly reduce the cost. The practical takeaway here is that, as demonstrated by extensive experimental validation, our CreDEs exhibit strong potential to enhance the uncertainty quantification performance of DEs in real-world applications, with only a modest increase in computational complexity. However, if DEs are already deemed impractical due to computational limitations, our CreDEs would not be a suitable alternative.

D Discussions on Ensemble Approaches

D.1 Rationale for the Averaging Ensemble Strategy

The randomness of parameter initialization in neural networks is one of the sources of (epistemic) uncertainty about the ground-truth model. As we gather more information, both epistemic and total uncertainty should decrease. For example, if we could train an infinite number of standard neural networks, Deep Ensembles would eliminate the source of ignorance caused by the randomness of parameter initialization. Our proposed averaging approach to creating an ensemble of CreNets follows a similar rationale. Specifically, if we aggregated an infinite number of ensemble members, the uncertainty caused by this randomness would vanish. The probability interval outputted by CreDEs primarily acknowledges the lack of precise insight into the divergence between the training and test distributions.

D.2 Possible Alternative Ensemble Approaches

CreDEs aggregate predictions from multiple individually trained CreNets, producing credal sets based on probability intervals. In addition to averaging, two alternative approaches, namely the union (disjunctive combination) and the intersection (conjunctive combination) [17], can be envisaged. These alternatives are illustrated in Figure 14.

Figure 14: Representation of three ensemble approaches: averaging (a), union (b), and intersection (c). In each subfigure, the ultimate credal set (highlighted in dark red) is formed by aggregating two individual credal sets, each constrained by probability intervals indicated in light green and blue, respectively.

Averaging CreDEs average the upper and lower probabilities per class from the M individually trained CreNets and predict the aggregated probability intervals, denoted as [q̄_L, q̄_U], as discussed in Eq. (14).
It can be proved that [q̄_L, q̄_U] is guaranteed to generate a non-empty credal set, as follows:
\sum_{i=1}^{C} \bar{q}_{L,i} = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{C} q_{L,m,i} \leq 1 \leq \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{C} q_{U,m,i} = \sum_{i=1}^{C} \bar{q}_{U,i}. (20)
The semantics behind averaging is that we trust all pieces of information (the individual credal sets) equally, without judging their authenticity. Similarly to traditional deep ensembles (DEs), the averaging ensemble approach can alleviate the influence of training-process randomness.

Union Given a collection of probability intervals, denoted as \{[q_{L,m}, q_{U,m}]\}_{m=1}^{M}, De Campos et al. [17] proposed a computationally efficient way of calculating the union of the corresponding credal sets:
\bar{q}_{L,i} = \min_{m \in \{1,\ldots,M\}} q_{L,m,i}, \quad \bar{q}_{U,i} = \max_{m \in \{1,\ldots,M\}} q_{U,m,i}. (21)
The union ensemble method implies that at least one piece of information is considered to be true. The union operation in Eq. (21) has a significant limitation, as it produces an enlarged credal set that over-approximates the exact union of the individual credal sets, as shown in Figure 14.

Intersection A collection of probability intervals \{[q_{L,m}, q_{U,m}]\}_{m=1}^{M} can be combined by intersection as
\bar{q}_{L,i} = \max_{m \in \{1,\ldots,M\}} q_{L,m,i}, \quad \bar{q}_{U,i} = \min_{m \in \{1,\ldots,M\}} q_{U,m,i}. (22)
However, the resulting [q̄_L, q̄_U] does not inherently satisfy the condition outlined in Eq. (2) for constructing a credal set [17], as the intersection may even be empty. Therefore, the intersection approach is not applicable to CreDEs.

Empirical evaluation In this experiment, we evaluate the impact of the averaging and union ensemble approaches on the EU estimation (GH(Q)) of CreDEs. Using 15 individually trained ResNet50-based CreNets on the CIFAR10 dataset, we form 15 CreDEs-M by varying the number of ensemble members M from 2 to 10, through either the averaging or the union ensemble methodology. Each kind of CreDEs-M is assessed in terms of the GH(Q) averaged over the samples and over the CreDEs-M ensembles, and the corresponding averaged standard deviation (STD) of GH(Q). The results are plotted in Figure 15 (b) and (a), respectively. We also report the AUROC and AUPRC scores for OOD detection using GH(Q) as the uncertainty metric in Figure 15 (c) and (d), respectively.

Figure 15: Impact of averaging (Avg) and union on the EU estimation of CreDEs on the OOD detection benchmark involving CIFAR10 vs. SVHN/Tiny-ImageNet (TinyImage), implemented on the ResNet50 architecture.

Figure 15 (a) illustrates a reduction in the averaged standard deviation (STD) of the GH(Q) estimates as the number of ensemble members increases. This suggests that averaging helps mitigate the uncertainty resulting from the randomness in the CreNet training process. Consequently, the AUROC and AUPRC scores, shown in Figure 15 (c) and (d), improve when the averaging ensemble approach is used, with a concurrent reduction in the STDs of the scores as the number of ensemble members increases. In contrast, Figure 15 (b) highlights the overestimation of EU across the various datasets when the union ensemble method is employed. While the average EU estimates for ID samples are overall lower than those for OOD instances, the overestimation may lead to OOD mis-detection for some specific samples. This explains the fluctuations in the associated AUPRC and AUROC score curves in Figure 15.
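To make the three combination rules above concrete, the following NumPy sketch aggregates per-class probability intervals from M ensemble members by averaging (Eq. (14)), union (Eq. (21)), and intersection (Eq. (22)), and checks the non-emptiness condition of Eq. (2). The function names are ours; in the toy example the intersection violates the condition, in line with the remark above.

```python
import numpy as np

def aggregate_intervals(q_L_members, q_U_members, mode="average"):
    """Aggregate per-class probability intervals from M ensemble members.
    q_L_members, q_U_members: arrays of shape (M, C).
    mode: 'average' (Eq. (14)), 'union' (Eq. (21)), or 'intersection' (Eq. (22))."""
    q_L_members = np.asarray(q_L_members, float)
    q_U_members = np.asarray(q_U_members, float)
    if mode == "average":          # averaging: always yields a valid credal set
        return q_L_members.mean(axis=0), q_U_members.mean(axis=0)
    if mode == "union":            # union: widest intervals, tends to overestimate EU
        return q_L_members.min(axis=0), q_U_members.max(axis=0)
    if mode == "intersection":     # intersection: may violate Eq. (2), possibly empty
        return q_L_members.max(axis=0), q_U_members.min(axis=0)
    raise ValueError(mode)

def is_valid_credal_set(q_L, q_U):
    """Non-emptiness condition of Eq. (2): q_L <= q_U and sum(q_L) <= 1 <= sum(q_U)."""
    return bool(np.all(q_L <= q_U) and q_L.sum() <= 1.0 <= q_U.sum())

# Two members of a toy 3-class ensemble
q_L_members = [[0.2, 0.1, 0.3], [0.55, 0.1, 0.1]]
q_U_members = [[0.5, 0.4, 0.6], [0.8, 0.3, 0.3]]
for mode in ("average", "union", "intersection"):
    q_L, q_U = aggregate_intervals(q_L_members, q_U_members, mode)
    print(mode, q_L, q_U, is_valid_credal_set(q_L, q_U))
```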
E Further Discussion on Future Work

E.1 Generalizing Cross-Entropy for Lower/Upper Probability

As stated in Sec. 2.2.2, calculating the cross-entropy (CE) loss for the lower and upper probability works due to the one-hot nature of the ground-truth vector t. However, generalizing the CE, which corresponds to the Kullback-Leibler (KL) divergence
D_{KL}(t \| q) = \sum_{j=1}^{C} t(j) \log \frac{t(j)}{q(j)}
between a predicted probability vector q and a general ground-truth vector t, to lower/upper probabilities is still an open research subject [68, 69]. In our case, the credal set Q is defined by the outputted proper probability intervals [q_L, q_U]. Therefore, the KL divergence for a lower probability inducing a credal set may be calculated by:
- finding the probability vector that best approximates it; for probability intervals, there are two established ways of doing so: normalizing either the lower or the upper probability (see [74]), or computing the so-called intersection probability (see [15]);
- computing the KL divergence between the ground-truth vector and the approximation so obtained.
In future work, we aim to investigate this approach and compare such well-founded methods for calculating the cross-entropy loss with the one used in this paper.

E.2 Theoretical Coverage Guarantees

At the current stage, our CreNets do not provide coverage guarantees, e.g., on how likely it is for the divergence of future data distributions to be within the modeled bounds. Nevertheless, various approaches to incorporating statistical guarantees into our framework can be envisaged. In particular, a CreDE, being a classifier, can be employed as the underlying model in an inductive conformal learning framework [64], which builds an empirical cumulative distribution of the nonconformity scores of a set of calibration samples and, at test time, outputs the set of labels whose empirical CDF is above a desired significance level ϵ (e.g., 90%). Namely, given a test input x and the associated predicted probability intervals [q_{L,c}, q_{U,c}], c = 1, ..., C (the output of the CreDE), a sensible choice, for instance, is to set as the nonconformity score of a pair (test input, class), (x, c), the complement of the upper probability for that class given input x, s(x, c) := 1 - q_{U,c} (i.e., a label c is considered non-conformal if its predicted upper probability, for that input x, is low), and compute predictive regions as is standard in conformal learning:
\Gamma(x) = \{ c \in \mathcal{C} : p_c > \epsilon \}, \quad p_c = \frac{|\{(x_j, c_j) : s(x_j, c_j) > s(x, c)\}| + u \, |\{(x_j, c_j) : s(x_j, c_j) = s(x, c)\}|}{q + 1},
where (x_j, c_j) is the j-th calibration point, q is the number of calibration points, and u ∼ U(0, 1) (the uniform distribution on the interval (0, 1)). We plan to explore this integration as the next step of our future work.
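As a purely illustrative sketch of the inductive conformal procedure described above, the snippet below computes the prediction region Γ(x) from CreDE upper probabilities, using the nonconformity score s(x, c) = 1 − q_{U,c} and the smoothed p-value given above. All function names and the toy numbers are ours; this is not an implementation provided with the paper.

```python
import numpy as np

def nonconformity(q_U_row, c):
    """Nonconformity score of (input, class): complement of the upper probability."""
    return 1.0 - q_U_row[c]

def conformal_region(q_U_test, calib_q_U, calib_labels, eps, rng=None):
    """Inductive conformal prediction region for one test input.
    q_U_test: (C,) upper probabilities for the test input (CreDE output).
    calib_q_U: (q, C) upper probabilities for the calibration inputs.
    calib_labels: (q,) true calibration labels.
    eps: significance level; returns the labels c with p_c > eps."""
    rng = np.random.default_rng() if rng is None else rng
    q = len(calib_labels)
    calib_scores = np.array([nonconformity(calib_q_U[j], calib_labels[j])
                             for j in range(q)])
    region = []
    for c in range(len(q_U_test)):
        s = nonconformity(q_U_test, c)
        u = rng.uniform()                         # smoothing variable u ~ U(0, 1)
        p_c = (np.sum(calib_scores > s) + u * np.sum(calib_scores == s)) / (q + 1)
        if p_c > eps:
            region.append(c)
    return region

# Toy usage with made-up upper probabilities for 3 classes
calib_q_U = np.array([[0.9, 0.3, 0.2], [0.2, 0.8, 0.4], [0.3, 0.4, 0.9], [0.7, 0.5, 0.3]])
calib_labels = np.array([0, 1, 2, 0])
print(conformal_region(np.array([0.85, 0.45, 0.25]), calib_q_U, calib_labels, eps=0.1))
```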
E.3 Extension for a Regression Framework

The vast majority of papers using credal sets in machine learning focus on classification [12, 5, 78] or, more recently, on self-supervised learning (but still in a classification setting [37]). Nevertheless, a recent study [19] has shown that the formalism of belief functions (a special class of credal sets) can be extended to regression by leveraging random fuzzy sets. It might thus be possible to explore such connections between probability intervals and random fuzzy sets and devise a suitable regression framework based on CreDEs. In the following, we outline a more direct possible extension of CreDEs to regression problems as a future research direction. Remember that a CreNet outputs a credal set on the simplex of probability distributions over the classes. Each vertex of this credal set is therefore a probability distribution over the target space (the set of classes C for classification). On the other hand, a Bayesian regressor network (trained to learn a distribution of its weights) would output a (continuous) probability density over the target space (for the sake of simplicity, assume Y = R). One could then train an ensemble of Bayesian regressor networks to predict a credal set with a fixed number of vertices (one network outputting one vertex distribution), so that the final predicted credal set is the convex closure of those vertices. Figure 16 briefly illustrates the concept.

Figure 16: Concept of a credal regressor.

The Distributionally Robust Optimization (DRO) framework employed for CreDE training models the divergence between the data distributions of samples belonging to G different groups within the training set. A full DRO formulation with G loss components, in combination with Bayesian deep learning techniques such as variational inference, could then be employed to drive the training of the G credal-vertex networks, which would be encouraged to generate diverse (probabilistic) predictions so as to model different possible data distributions, in a generalization of the two-component loss used here.

Broader Impacts

The main objective of this paper is to advance the field of Machine Learning by improving the quality of uncertainty quantification. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. The proposed method demonstrates superior performance in detecting out-of-distribution (OOD) samples. Such a capability can potentially safeguard end users from misguided decisions that stem from incorrect predictions of neural networks on OOD instances. Therefore, our approach can potentially improve the safety, reliability, and trustworthiness of machine learning systems for classification tasks and be applied in mission-critical domains, such as autonomous driving and medical sciences.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims made in the abstract and introduction accurately reflect our paper's contributions and scope.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations in the Conclusion section.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide the relevant mathematical proofs in the Appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We describe the experiments in the main body and detail the implementation in the Appendix. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. 
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We use open-source datasets with references for the evaluation. All code is provided in the supplementary material. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). 
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide all such experimental details in the Appendix. Ablation studies of hyperparameters are performed. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: The main results are computed from 15 runs. Error bars are provided. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide all information about the computational resources needed in the Appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. 
Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conforms with the Code Of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We discuss the broader impacts of our work in the Appendix. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper focuses on uncertainty quantification in classification tasks and is evaluated on existing benchmarks. The paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. 
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We use open-source datasets with references for the experimental evaluation. We have cited all the datasets and models used in this paper. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? 
Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.