Second-Order Uncertainty Quantification: A Distance-Based Approach

Yusuf Sale 1 2, Viktor Bengs 1 2, Michele Caprio 3, Eyke Hüllermeier 1 2

1 Institute of Informatics, LMU Munich, Munich, Germany. 2 Munich Center for Machine Learning, Munich, Germany. 3 Precise Center, University of Pennsylvania, Philadelphia, USA. Correspondence to: Yusuf Sale.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

In the past couple of years, various approaches to representing and quantifying different types of predictive uncertainty in machine learning, notably in the setting of classification, have been proposed on the basis of second-order probability distributions, i.e., predictions in the form of distributions on probability distributions. A completely conclusive solution has not yet been found, however, as shown by recent criticisms of commonly used uncertainty measures associated with second-order distributions, which identify undesirable theoretical properties of these measures. In light of these criticisms, we propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. Moreover, we provide a general framework for developing uncertainty measures that account for these criteria, and offer an instantiation based on the Wasserstein distance, for which we prove that all criteria are satisfied.

1. Introduction

The need for representing and quantifying uncertainty in machine learning (ML), particularly in supervised learning scenarios, has become more and more obvious in the recent past (Hüllermeier & Waegeman, 2021). This is largely due to the increasing use of AI-driven systems in safety-critical real-world applications with stringent safety requirements, such as healthcare (Lambrou et al., 2010; Senge et al., 2014; Yang et al., 2009) and socio-technical systems (Varshney & Alemzadeh, 2017). Dealing appropriately with uncertainty is a fundamental necessity in all these domains.

Broadly, uncertainties are categorized as aleatoric, stemming from inherent data variability, and epistemic, which arises from a model's incomplete knowledge of the data-generating process. By its very nature, epistemic uncertainty (EU), often characterized as reducible, can be decreased with further information. In contrast, aleatoric uncertainty (AU), rooted in the data-generating process itself, is fixed and cannot be mitigated (Hüllermeier & Waegeman, 2021). The distinction between these uncertainty types has been a subject of keen interest in recent ML and statistical research (Gruber et al., 2023), finding applications in areas such as Bayesian neural networks (Kendall & Gal, 2017), adversarial attack detection mechanisms (Smith & Gal, 2018), and data augmentation strategies in Bayesian classification (Kapoor et al., 2022).

Arguably, predictive uncertainty is the most studied form of uncertainty in both ML and statistics. It pertains to prediction tasks such as those in supervised learning. In the latter, we consider a hypothesis space H, where each hypothesis h ∈ H maps a query instance x_q ∈ X to a probability measure p on (Y, σ(Y)), where Y denotes the outcome space and σ(Y) a suitable σ-algebra on Y. By producing estimates of the ground-truth probability measure p on (Y, σ(Y)), this probabilistic approach encapsulates aleatoric uncertainty about the actual outcome y ∈ Y.
Since epistemic uncertainty is difficult to represent with conventional probability distributions (Hüllermeier & Waegeman, 2021), such predictions fail to capture the epistemic part of (predictive) uncertainty. In order to account for both types of uncertainty, machine learning methods founded on more general theories of probability, such as imprecise probabilities or credal sets (Walley, 1991; Augustin et al., 2014), have been considered (Corani et al., 2012). Another popular approach in this regard is to let the learner map a query instance x_q to a second-order distribution, i.e., a distribution on distributions, effectively assigning a probability to each candidate probability distribution p. Such an approach is realized, for example, by classical Bayesian inference (Gelman et al., 2013) or by the Evidential Deep Learning (EDL) paradigm, which has recently become increasingly popular (Ulmer et al., 2023). In the EDL paradigm, one essentially learns a model (usually a deep neural network) by empirical risk minimization, whose output for a query instance x_q consists of the parameters of a parameterized family of second-order distributions. So far, only the Dirichlet distribution has been used for classification, while the Normal-Inverse-Gamma distribution has been applied for univariate regression (Amini et al., 2020) and the Normal-Inverse-Wishart distribution for multivariate regression (Malinin et al., 2020; Meinert & Lavin, 2021).

Figure 1: Uncertainty awareness in multi-class classification, illustrated on the probability simplex for Y = {y1, y2, y3}. From left to right, increasing degrees of uncertainty awareness: deterministic prediction (no uncertainty awareness), probabilistic prediction (AU, but no EU awareness), and second-order prediction (AU and EU awareness).

However, this approach is not without controversy, as it may lead to convergence issues of the empirical risk minimizer (Bengs et al., 2022; Meinert et al., 2023), and the predominantly used loss functions lack some desirable properties (Bengs et al., 2023). Regardless of the specific design of the EDL approach, the concrete quantification of the total (TU), aleatoric (AU), and epistemic (EU) uncertainty associated with the second-order predictive distribution plays a central role in any case. For regression, essentially, the variances on the different levels of the second-order distribution are used for this purpose, while measures from information theory are applied for classification: Shannon entropy for TU, conditional entropy for AU, and mutual information for EU. Quite recently, Wimmer et al. (2023) criticized the latter for not complying with properties that one could naturally expect of uncertainty measures for second-order distributions. However, the authors do not provide an alternative for reasonable quantification either, which, of course, would be of great importance for practical ML purposes, especially in safety-critical applications.

Contributions. In this paper, we suggest an alternative way to obtain uncertainty measures in classification that overcomes the drawbacks of the commonly used information-theory-based approach. To this end, we first propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. This set extends the criteria suggested by Wimmer et al. (2023).
Moreover, we provide a general framework based on distances on the second-order probability level for developing uncertainty measures that account for these criteria. Using the Wasserstein distance, we instantiate this framework explicitly and prove that all criteria are met. Finally, we elaborate on these quantities when the second-order distribution is a Dirichlet distribution. All proofs of the theoretical statements are provided in the appendix.

2. Second-Order Uncertainty Quantification

In this section, we introduce the formal setting of supervised learning (throughout this paper we will exclusively deal with the case of classification) within which we establish further results. Let (X, σ(X)) and (Y, σ(Y)) be two measurable spaces. We will refer to X as instance (or input) space and to Y as label space, such that |Y| = K ∈ N with K ≥ 2. Further, we call the sequence D = {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n the training data. For i ∈ {1, . . . , n}, the pairs (x_i, y_i) are realizations of random variables (X_i, Y_i), which are independent and identically distributed (i.i.d.) according to some probability measure p on (X × Y, σ(X × Y)). Thus, each instance x ∈ X is associated with a conditional distribution p(· | x) on (Y, σ(Y)), such that p(y | x) is the probability of observing label y ∈ Y given x ∈ X.

To ease the notation, we will denote by P(Y) the set of all probability measures on the measurable space (Y, σ(Y)). Similarly, we write Q(Y) for the set of all probability measures on (P(Y), σ(P(Y))); we refer to Q ∈ Q(Y) as a second-order distribution.¹ While usually upper-case letters denote probability measures and lower-case letters their pdf/pmf, in this paper we use capital letters for second-order and lower-case letters for first-order distributions. The Dirac measure at y ∈ Y is denoted by δ_y ∈ P(Y); likewise, δ_p ∈ Q(Y) denotes the Dirac measure at p ∈ P(Y), where the underlying space of the Dirac measure should be clear from the context. Finally, Unif(Y) denotes the uniform distribution on Y.

¹ There is no general consensus on terminology, as terms such as level-2 or type-2 distributions are also encountered in the literature.

Given an instance x ∈ X, let Q ∈ Q(Y) denote the learner's current probabilistic belief² about p, i.e., Q(p) is the probability (density) of p ∈ P(Y). See Figure 1 for an illustration of the different degrees of uncertainty-aware predictions. As already mentioned in the introduction, there are two popular ways of obtaining such a second-order (predictive) distribution: by means of Bayesian inference or via Evidential Deep Learning. Throughout the rest of this paper, we assume such a second-order predictive distribution Q has been provided by a learner (though without being interested in how the prediction has been obtained). We raise the question of how to quantify the total amount of uncertainty (TU), as well as the aleatoric (AU) and epistemic (EU) uncertainties associated with Q.

² Although it would be more precise to let Q depend on x, for ease of notation we will simply write Q.

2.1. Default Measures of Uncertainty

We begin by revisiting the arguably most common information-theoretic approach in machine learning for measuring predictive uncertainty in classification tasks. This approach exploits (Shannon) entropy and its link to mutual information and conditional entropy for specifying explicit quantities for the total (TU), aleatoric (AU), and epistemic (EU) uncertainties associated with a predictive second-order distribution Q ∈ Q(Y) (Houlsby et al., 2011; Gal, 2016; Depeweg et al., 2018; Mobiny et al., 2021).
The (Shannon) entropy (Shannon, 1948) of p ∈ P(Y) is defined as

H(p) := -\sum_{y \in Y} p(y) \log_2 p(y).   (1)

We can analogously define the entropy of a (discrete) random variable Y : Ω → Y by

H(Y) := -\sum_{y \in Y} p_Y(y) \log_2 p_Y(y),   (2)

where p_Y ∈ P(Y) is the corresponding push-forward measure on the measurable space (Y, 2^Y). The Shannon entropy has established itself as a standard measure of uncertainty due to its appealing theoretical properties and intuitive interpretation. Specifically, it measures the degree of uniformity of the distribution p_Y of a random variable Y, and corresponds to the log-loss of p_Y as a prediction of Y.

In the following, we assume p_R ∼ Q, i.e., p_R : Ω → P(Y) is a random first-order distribution distributed according to a second-order distribution Q and consequently taking values in the (K−1)-dimensional probability simplex. For ω' ∈ Ω, we denote by p' = p_R(ω') the corresponding realization of p_R.

The core idea for obtaining uncertainty measures for a given second-order distribution Q is to consider the expectation of p_R with respect to Q, given by

\bar{p} = E_Q[p_R] = \int_{P(Y)} p \, dQ(p),   (3)

which yields a probability measure \bar{p} on (Y, σ(Y)), i.e., a first-order distribution. With this, it seems natural to define the measure of total uncertainty as the entropy (1) of \bar{p} ∈ P(Y). More precisely, the total uncertainty associated with a second-order distribution Q ∈ Q(Y) can be computed as

TU(Q) = H(E_Q[p_R]).   (4)

In a similar fashion, one defines aleatoric uncertainty as conditional entropy

AU(Q) = E_Q[H(Y | p_R)] = \int_{P(Y)} H(p) \, dQ(p).   (5)

Further, the measure of epistemic uncertainty is in particular motivated by the well-known additive decomposition of entropy into conditional entropy and mutual information (Cover & Thomas, 1999, Equation (2.40)), i.e.,

H(Y) = H(Y | p_R) + I(Y, p_R).   (6)

By rearranging (6), we get a measure of epistemic uncertainty

EU(Q) = I(Y, p_R) = E_Q[D_{KL}(p_R \| \bar{p})],   (7)

where D_{KL}(· ∥ ·) denotes the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951).

Even though the individual measures, i.e., entropy, conditional entropy, and mutual information, have reasonable interpretations in terms of quantifying the respective uncertainty, which are particularly useful when applied to first-order predictive distributions, a different picture emerges for the above approach to second-order predictive distributions. Some issues regarding the quantification of the respective uncertainties have recently been intensively discussed by Wimmer et al. (2023), which we will take up and elaborate on in the following section. Essentially, the problem stems from TU in (4) and EU in (7) depending on the second-order predictive distribution Q only through its expectation \bar{p} in (3).
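To make the default measures concrete, the quantities in (3)–(5) and (7) are straightforward to approximate whenever one can sample first-order distributions from Q. The following is a minimal illustrative sketch (not taken from the paper); the function names and the Dirichlet example belief are assumptions made purely for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a first-order distribution p, cf. Eq. (1)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def entropy_based_measures(p_samples):
    """Entropy-based TU/AU/EU, cf. Eqs. (4), (5), (7), estimated from Monte Carlo
    samples p_samples of shape (m, K), each row a draw p_R ~ Q."""
    p_samples = np.asarray(p_samples, dtype=float)
    p_bar = p_samples.mean(axis=0)                    # Eq. (3): expectation of p_R under Q
    tu = entropy(p_bar)                               # Eq. (4): entropy of the mean prediction
    au = np.mean([entropy(p) for p in p_samples])     # Eq. (5): expected (conditional) entropy
    eu = tu - au                                      # Eq. (7): mutual information I(Y, p_R)
    return tu, au, eu

# Illustrative second-order belief: a Dirichlet distribution over the 3-class simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet([4.0, 2.0, 1.0], size=10_000)
print(entropy_based_measures(samples))
```

Note that TU and EU computed this way depend on Q only through the sample mean of the drawn first-order distributions, which is precisely the issue taken up above.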
2.2. Alternatives for the Default Measures

Recently, a variant of the above approach was proposed, which attempts to overcome the issues mentioned above (Schweighofer et al., 2023). For this purpose, the total uncertainty in (4) is rewritten as TU(Q) = E_Q[CE(p_R, \bar{p})], where CE(·, ·) is the cross-entropy, i.e.,

CE(p, q) := -\sum_{y \in Y} p(y) \log_2 q(y)   for p, q ∈ P(Y).

Then, the alternative measure for total uncertainty suggested by the authors is

TU(Q) = E_{Q, Q'}[CE(p_R, p'_R)],   (8)

where Q' is an i.i.d. copy of Q and p'_R ∼ Q'. Using again the decomposition in (6) and the resulting components as measures for aleatoric and epistemic uncertainty, one obtains the same aleatoric uncertainty measure as in (5), but the epistemic uncertainty measure changes to

EU(Q) = E_{Q, Q'}[D_{KL}(p_R \| p'_R)].   (9)

Thus, the proposed measures do not assume that the Bayesian model average predictive distribution is equivalent to the predictive distribution of the true data-generating process.
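The alternative measures (8) and (9) can be estimated in the same spirit from two independent batches of samples representing Q and its i.i.d. copy Q'. Below is a minimal illustrative sketch under the same assumptions as before; the small constant inside the logarithms is only there to avoid numerical issues near the boundary of the simplex.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) = -sum_y p(y) log2 q(y), cf. Section 2.2; eps guards against log(0)."""
    return -np.sum(p * np.log2(q + eps))

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return np.sum(p * (np.log2(p + eps) - np.log2(q + eps)))

def alternative_measures(p_samples, p_samples_prime):
    """TU of Eq. (8) and EU of Eq. (9): expectations over Q and an i.i.d. copy Q',
    estimated by averaging over all pairs of samples."""
    tu = np.mean([cross_entropy(p, q) for p in p_samples for q in p_samples_prime])
    eu = np.mean([kl_divergence(p, q) for p in p_samples for q in p_samples_prime])
    return tu, eu

rng = np.random.default_rng(1)
alpha = [4.0, 2.0, 1.0]                      # illustrative Dirichlet belief
p_q = rng.dirichlet(alpha, size=200)         # samples p_R  ~ Q
p_qp = rng.dirichlet(alpha, size=200)        # samples p'_R ~ Q' (i.i.d. copy of Q)
print(alternative_measures(p_q, p_qp))
```

Unlike entropy, these pairwise quantities are not bounded from above; this point is taken up in Section 3.1.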
3. Novel Uncertainty Measures

3.1. Axiomatic Foundations

The criticism of the previous approach raised by Wimmer et al. (2023) is grounded in the postulation of criteria that measures of total, aleatoric, and epistemic uncertainties should naturally satisfy when used for quantifying predictive uncertainty associated with second-order distributions. This is similar to the literature on uncertainty quantification for other methods of representing uncertainty, such as belief functions or credal sets (Bronevich & Klir, 2008; Pal et al., 1993; Sale et al., 2023a). In the following, we build on and extend the criteria presented by Wimmer et al. (2023). We begin by recalling some mathematical definitions (see also Wimmer et al. (2023) and Sale et al. (2023b, p. 4)).

Definition 3.1. Let X ∼ Q and X' ∼ Q' be two random vectors, where Q, Q' ∈ Q(Y). Denote by σ(X) the σ-algebra generated by the random vector X. Then we call Q'
(i) a mean-preserving spread of Q, iff X' =_d X + Z for some random vector Z with E[Z | σ(X)] = 0 almost surely (a.s.) and max_k Var(Z_k) > 0;
(ii) a spread-preserving location shift of Q, iff X' =_d X + z, where z ≠ 0 is a constant;
(iii) a spread-preserving center-shift of Q, iff it is a spread-preserving location shift with E[X'] = λ E[X] + (1 − λ)(1/K, . . . , 1/K) for some λ ∈ (0, 1).
For (ii) and (iii) it should be ensured that the shifted probability distribution Q' remains valid within its support.

In the following, we let TU, AU, and EU denote, respectively, measures Q(Y) → R_{≥0} of total, aleatoric, and epistemic uncertainties associated with a second-order uncertainty representation Q ∈ Q(Y). If Y_1 and Y_2 form a partition of Y and Q ∈ Q(Y), then we denote by Q|_{Y_i} the marginalized distribution on Y_i. In the same spirit, we define TU_{Y_i}.

A0 TU, AU, and EU are non-negative.
A1 AU(δ_{Unif(Y)}) ≥ AU(δ_p) ≥ AU(δ_{δ_y}) = 0 holds for any y ∈ Y and any p ∈ P(Y).
A2 EU(Q) ≥ EU(δ_p) = 0 holds for any Q ∈ Q(Y) and any p ∈ P(Y). Further, for any Q ∈ Q(Y) with AU(Q) = 0 we have EU(Q') ≥ EU(Q), where Q' is such that Q'(δ_y) = 1/K for all y ∈ Y.
A3 AU(Q) ≤ TU(Q) and EU(Q) ≤ TU(Q) hold for any Q ∈ Q(Y).
A4 TU(Q) is maximal for Q being the continuous second-order uniform distribution.
A5 If Q' is a mean-preserving spread of Q, then EU(Q') ≥ EU(Q) (weak version) or EU(Q') > EU(Q) (strict version).
A6 If Q' is a spread-preserving location shift of Q, then EU(Q') = EU(Q).
A7 TU_Y(Q) ≤ TU_{Y_1}(Q|_{Y_1}) + TU_{Y_2}(Q|_{Y_2}).
A8 TU_Y(Q|_{Y_1} ⊗ Q|_{Y_2}) = TU_{Y_1}(Q|_{Y_1}) + TU_{Y_2}(Q|_{Y_2}), where ⊗ denotes the product measure.

Before discussing each criterion³, we first start with a joint and more in-depth discussion of A1 and A2, since they play a central role in the discussion of most of the other criteria.

³ A0 is a trivial property and therefore not discussed.

A1 and A2: Since we are interested in second-order distributions Q for the purpose of predictive uncertainty, it is natural to speak of a state of absence of epistemic uncertainty if Q corresponds to a point mass of second order. This is reflected by the lower bound in A2 and is also a viewpoint shared in the literature (Bengs et al., 2022; Wimmer et al., 2023). Moreover, there is agreement in the literature that (i) the uniform distribution of first order, i.e., Unif(Y), represents the case of highest outcome uncertainty, (ii) a degenerate first-order distribution, i.e., a Dirac measure on a point y ∈ Y, represents the case of lowest outcome uncertainty, and (iii) first-order distributions between these extreme cases correspond to an outcome uncertainty that lies somewhere in between. In the absence of epistemic uncertainty in the second-order distribution, this should be reflected by the measure of aleatoric uncertainty (A1). If the uncertainty is only epistemic in nature, that is, if according to A1 only first-order Dirac measures remain as possible candidates, then the epistemic uncertainty should be maximal when the ambiguity around the Diracs is maximal. This happens when the second-order distribution Q is a discrete uniform distribution on the first-order Dirac measures on the elements of Y (A2). Note that this view differs from that of Wimmer et al. (2023), which demands maximum epistemic and total uncertainties for the continuous second-order uniform distribution. However, our criteria are consistent w.r.t. the maximal total uncertainty (A4).

A3: As discussed in detail by Wimmer et al. (2023, Section 4.4), the aleatoric and epistemic uncertainties of a second-order predictive distribution are closely intertwined. Since total uncertainty subsumes both types of uncertainty simultaneously, it should always be an upper bound for AU and EU, respectively.

A5 and A6: These properties are again inspired by Wimmer et al. (2023). If two second-order distributions have the same expectation but differ in their dispersion or spread, the distribution with higher dispersion should be assigned higher epistemic uncertainty (A5). Similarly, with equal dispersion, epistemic uncertainty should be the same in all cases. Thus, if Q and Q' only differ in their respective means, epistemic uncertainty should be the same in both cases (A6).

A7 and A8: These criteria are inspired by those underlying Shannon entropy. Specifically, these properties aim to ensure that the total uncertainty of a second-order predictive distribution does not exceed the sum of the total uncertainties over all its possible marginalizations with respect to the label space Y. Thus, a subadditivity property should also hold here (A7), with equality achieved when the marginalizations are independent (A8).

As shown by Wimmer et al. (2023), the measures for total, aleatoric, and epistemic uncertainties in (4)–(7) fail to satisfy A5 and A6 when it comes to second-order distributions. For the alternative version of these measures suggested by Schweighofer et al. (2023), it has not been shown whether these properties are fulfilled or not. However, total uncertainty in (8) will not be maximal for Q being the continuous second-order uniform distribution, but for Q' as in A2, thus violating A4. In addition, it is apparent from the definition that both TU and EU in (8) and (9) can go to infinity. Thus, the measures are not naturally restricted to an interpretable range.

3.2. Distance-based Measures

We now introduce a general framework for deriving suitable measures for total, aleatoric, and epistemic uncertainties associated with a second-order distribution Q ∈ Q(Y).
The main constituents of the framework are (i) a (suitable) distance d_2(·, ·) on Q(Y) and (ii) specific reference sets of second-order distributions representative for TU, AU, or EU, respectively, each lacking one or both types of uncertainties. Roughly speaking, each uncertainty measure (i.e., TU, AU, or EU) of Q is defined as the minimal distance of Q to the corresponding reference set. This approach is inspired by the field of optimal transport (Villani, 2009; 2021) and guided by the following question: How much do we need to move Q to arrive at the nearest second-order distribution of the respective reference set for TU, AU, or EU? While the distance function according to which Q moves in the space Q(Y) is intentionally kept flexible in our framework, the reference sets are fixed and should naturally lead to the fulfillment of A0–A8, ideally for a broad class of distances.

Total uncertainty. For the total uncertainty we suggest to use all second-order Dirac measures on the set of first-order Dirac measures as the reference set. More specifically, total uncertainty is defined as

TU(Q) := \min_{y \in Y} d_2(Q, \delta_{\delta_y}).   (10)

This choice of the reference set is natural, as each element in this reference set represents the case of an absolutely certain prediction/decision, i.e., there is neither aleatoric (first-order) nor epistemic (second-order) uncertainty present. Thus, the farther Q is from such an element, the farther one is from making a decision without any kind of uncertainty, which is reflected by (10).

Aleatoric uncertainty. The reference set for aleatoric uncertainty should be the set of all mixtures of second-order Dirac measures on first-order Dirac measures, i.e.,

\boldsymbol{\delta}_m := \Big\{ \delta_m \in Q(Y) : \delta_m = \sum_{y \in Y} \lambda_y \, \delta_{\delta_y}, \ \sum_{y \in Y} \lambda_y = 1 \Big\}.

If we agree on A0–A8, each element in this set has no aleatoric uncertainty, so the assessment of a second-order distribution Q is solely in terms of its amount of aleatoric uncertainty. Accordingly, the measure of aleatoric uncertainty is defined as

AU(Q) := \min_{\delta_m \in \boldsymbol{\delta}_m} d_2(Q, \delta_m).   (11)

Epistemic uncertainty. In the same spirit as (11), we want to assess Q solely in terms of its amount of epistemic uncertainty. Again, by agreeing on A0–A8, we naturally obtain as reference set the collection of all second-order Dirac measures on the probability simplex, since these have no epistemic uncertainty. If we denote the latter by \boldsymbol{\delta}_p, we obtain for the measure of epistemic uncertainty

EU(Q) := \min_{\delta_p \in \boldsymbol{\delta}_p} d_2(Q, \delta_p).   (12)

It is worth noting that the entropy-based uncertainty measures in Section 2.1 can also be considered from the perspective of our distance-based framework. Indeed, the entropy of a (discrete) distribution is related to the negative KL divergence (or KL distance) between the distribution and the uniform distribution (on the respective domain) (Cover & Thomas, 1999, Equation (2.107)). Thus, we could rewrite (4), (5), and (7) as

TU(Q) = \log K - D_{KL}(E_Q[p_R] \,\|\, Unif(Y)),
AU(Q) = \log K - E_Q[D_{KL}(p_R \,\|\, Unif(Y))],
EU(Q) = E_Q[D_{KL}(p_R \,\|\, \bar{p})].   (13)

With this representation, we see that the EU measure (7) has similarities to ours. More specifically, it is obtained as a special case of (12) with d_2(·, ·) being the expected KL divergence (for which the minimum is attained at \delta_p with p = \bar{p}). Note, however, that the expected KL divergence is not a proper distance on Q(Y), wherefore (7) is not a special case of our framework in a strict sense.
Moreover, the interpretation of the measures TU and AU is different from our measures (10) and (11), as both are measuring similarity (through the negated KL divergence) to the case of maximal uncertainty, namely the first-order uniform distribution, instead of dissimilarity to a reference set of least uncertain distributions. The alternative version (8)–(9) suggested by Schweighofer et al. (2023) does not have such an interpretation, except for the aleatoric uncertainty, which remains the same. This is due to the lack of a reference set for TU and EU, so that both measures are better interpreted as measures of the diversity of the second-order distribution. On a high level, the approach also follows the idea of including the entire characteristics of Q (first- and second-order) in the respective uncertainty assessment, instead of narrowing down to the expected value \bar{p} in (3) as in the default case.

4. Wasserstein Instantiation

4.1. General Case

So far, we did not specify the distance d_2 : Q(Y) × Q(Y) → R_{≥0} on Q(Y). In the following, we will motivate one specific choice, namely the Wasserstein distance (or Kantorovich-Rubinstein metric). For our discussion, we first recall the concept of a coupling, a term that is central to optimal transport theory (Villani, 2009, Chapter 1). Note that the definition used in this paper is an adaptation of the standard one, as our focus is on second-order distributions.

Definition 4.1. We call the probability measure γ on (P(Y) × P(Y), σ(P(Y) × P(Y))) a coupling of P, Q ∈ Q(Y) iff for all A, B ∈ σ(P(Y)) one has γ[A × P(Y)] = Q[A] and γ[P(Y) × B] = P[B]. Thus, γ admits marginals P and Q.

Let (P(Y), d_1) be a metric space, where Y is defined as before and d_1 is a suitable metric on the space P(Y) (again, equipped with a suitable σ-algebra). Then, for p ∈ [1, ∞), the (second-order) p-Wasserstein distance between two probability measures P, Q ∈ Q(Y) is defined as

W_p(P, Q) := \Big( \inf_{\gamma \in \Gamma(P, Q)} \int_{P(Y) \times P(Y)} d_1(p, q)^p \, d\gamma(p, q) \Big)^{1/p},   (14)

where Γ(P, Q) denotes the set of all couplings between the probability measures P and Q (see Definition 4.1). The choice of this metric for our purposes is quite natural based on its interpretation: the Wasserstein metric quantifies how much mass has to be moved around, and how far, in order to convert one distribution into another. This is perfectly in line with our view of the uncertainty measures in Section 3.2. In accordance with the literature, we will be exclusively concerned with the case p = 1 and omit in the following the subscript in W_p(·, ·). First, we show that W(·, ·) is indeed a well-defined metric on Q(Y).

Lemma 4.2. The second-order Wasserstein distance W : Q(Y) × Q(Y) → R_{≥0} is a well-defined metric on Q(Y).

Since in both (10) and (12) second-order Dirac measures are involved, we now show that the optimal coupling between a second-order distribution Q ∈ Q(Y) and a second-order Dirac measure δ_p, where p ∈ P(Y), is trivially given by the respective product measure. This simplifies the corresponding computations.

Proposition 4.3. For any second-order Dirac measure δ_p ∈ Q(Y), p ∈ P(Y), and any second-order distribution Q ∈ Q(Y), the optimal coupling between δ_p and Q is the product measure γ = Q ⊗ δ_p.

This coupling is also frequently referred to as the trivial coupling (Villani, 2009). Let us elaborate on the choice of the metric d_1 : P(Y) × P(Y) → R_{≥0} in (14). We will define it as the Wasserstein metric between two first-order distributions induced by the trivial distance on the label space Y.
Note that this is not fixed by design, and without loss of generality other metrics on the label space (depending on the specific problem at hand) can be considered. The trivial distance on Y is given for any y, y' ∈ Y by

d_0(y, y') = \mathbf{1}_{\{y \neq y'\}}.   (15)

With the choice of the distance (15), for p, q ∈ P(Y) we obtain the following induced⁴ first-order distance d_1:

d_1(p, q) = \inf_{\gamma \in \Gamma(p, q)} \int_{Y \times Y} d_0(y, y') \, d\gamma(y, y') = \inf_{\gamma \in \Gamma(p, q)} \sum_{y, y' \in Y : \, y \neq y'} \gamma(y, y') = \inf_{\gamma \in \Gamma(p, q)} \Big\{ 1 - \sum_{y \in Y} \gamma(y, y) \Big\}.   (16)

⁴ Using the first-order Wasserstein distance on P(Y) × P(Y).

Regarding the optimal coupling in (16), we can show the following.

Proposition 4.4. The coupling γ* ∈ Γ(p, q) minimizing the expression (16) is such that γ*(y, y) = min{p(y), q(y)}.

Proposition 4.4 yields

d_1(p, q) = 1 - \sum_{y \in Y} \min\{p(y), q(y)\} = \frac{1}{2} \sum_{y \in Y} \big( \max\{p(y), q(y)\} - \min\{p(y), q(y)\} \big) = \frac{1}{2} \sum_{y \in Y} |p(y) - q(y)| =: \frac{1}{2} \| p - q \|_1.

In the context of usual probability measures, Proposition 4.4 is well known in transportation theory, establishing a connection between the Wasserstein metric and the total variation distance. With this, the proposed uncertainty measures for Q ∈ Q(Y) in Section 3.2 simplify as follows.

Proposition 4.5. Using d_1(·, ·) as above, the measures of uncertainty in (10), (11), and (12) simplify to

TU(Q) = 1 - \max_{y \in Y} E_{Q_y}[p(y)],   (17)
AU(Q) = 1 - E_Q\big[ \max_{y \in Y} p(y) \big],   (18)
EU(Q) = \frac{1}{2} \min_{q \in P(Y)} E_{p \sim Q}\big[ \| p - q \|_1 \big].   (19)

Here, Q_y denotes the marginal distribution associated with Q ∈ Q(Y) for some y ∈ Y.

The following proposition elaborates on the ranges of the proposed measures of uncertainty. Although the results appear natural, they yield interesting findings from an uncertainty quantification perspective.

Proposition 4.6. With the choice of d_1(·, ·) as distance on P(Y), we have for all Q ∈ Q(Y) that
i.) TU(Q) ≤ (K−1)/K, where the upper bound is reached for Q' ∈ Q(Y) such that E_{Q'}[p] = Unif(Y);
ii.) AU(Q) ≤ (K−1)/K, where the upper bound is reached for Q' = δ_{Unif(Y)};
iii.) EU(Q) ≤ (K−1)/K, where the upper bound is reached for any Q' ∈ Q(Y) such that Q'(δ_y) = 1/K for all y ∈ Y.

The property from Proposition 4.6 is desirable for two reasons. On the one hand, the value range grows with increasing complexity of the classification problem in terms of the number of labels K. This is similar to the entropy, see (13). On the other hand, the value ranges normalize themselves with increasing complexity. More precisely, for K → ∞, the maximum of the uncertainty measures converges (with respect to the standard Euclidean metric on R) to 1. Needless to say, the upper bounds of the value ranges can also be used to normalize the uncertainty measures a priori by multiplying them with K/(K−1). A direct consequence of Proposition 4.6 is that maximum epistemic uncertainty can be achieved only when there is no aleatoric uncertainty, and vice versa.

Corollary 4.7. For any Q ∈ Q(Y), it holds that EU(Q) = (K−1)/K if and only if AU(Q) = 0.

Finally, we show that the proposed uncertainty measures with the Wasserstein distance instantiation fulfill the criteria specified in Section 3.1.

Theorem 4.8. The uncertainty measures (10)–(12) with the Wasserstein distance instantiation satisfy Axioms A0–A8.
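For a second-order distribution that one can sample from, the simplified expressions (17)–(19) are easy to estimate: TU and AU are plain sample averages, while EU requires the inner minimization over q ∈ P(Y). The sketch below (illustrative, not from the paper) solves that minimization numerically over the probability simplex; the function names and the choice of SLSQP as solver are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def wasserstein_measures(p_samples):
    """Distance-based TU/AU/EU of Eqs. (17)-(19), estimated from samples
    p_samples (shape (m, K)) drawn from the second-order distribution Q."""
    p = np.asarray(p_samples, dtype=float)
    K = p.shape[1]

    tu = 1.0 - np.max(p.mean(axis=0))        # Eq. (17): 1 - max_y E_Q[p(y)]
    au = 1.0 - np.mean(np.max(p, axis=1))    # Eq. (18): 1 - E_Q[max_y p(y)]

    # Eq. (19): EU = 1/2 * min over q in the simplex of E_Q[ ||p - q||_1 ]
    def objective(q):
        return 0.5 * np.mean(np.abs(p - q).sum(axis=1))

    res = minimize(objective, x0=np.full(K, 1.0 / K),
                   bounds=[(0.0, 1.0)] * K,
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],
                   method="SLSQP")
    return tu, au, res.fun

# A spread-out Dirichlet belief: most mass near the vertices, so EU should dominate.
rng = np.random.default_rng(2)
samples = rng.dirichlet([0.3, 0.3, 0.3], size=5_000)
print(wasserstein_measures(samples))
```

All three values lie in [0, (K−1)/K] and can be normalized by the factor K/(K−1), in line with Proposition 4.6.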
4.2. Dirichlet Distribution

Owing to its key role as a conjugate prior for a categorical distribution, the Dirichlet distribution is arguably the most important family of parameterized (second-order) distributions employed in various areas of theoretical and applied research. In Bayesian inference and Evidential Deep Learning, the Dirichlet distribution has become the gold standard. Accordingly, in this section we focus on the computation of the proposed uncertainty measures with the Wasserstein distance instantiation for the case of Dirichlet distributions.

We start with a brief introduction to the Dirichlet distribution and identify, without loss of generality, each element in the label space Y with an integer, i.e., Y = {1, 2, . . . , K}. Let π denote a K-dimensional probability vector, and assume it is distributed according to a Dirichlet distribution, that is, π ∼ Dir(α). The Dirichlet distribution Dir(α) is supported on the (K−1)-dimensional unit simplex, and it is parameterized by α = (α_1, . . . , α_K), a K-dimensional vector whose entries are such that α_j > 0 for all j ∈ Y. Its probability density function (pdf) is given by

\frac{1}{B(\alpha)} \prod_{j=1}^{K} \pi_j^{\alpha_j - 1},

where B(·) denotes the multivariate Beta function. We can interpret the j-th entry α_j of α as a pseudo-count: α_j represents the virtual observations that we have for label j. It captures the agent's (i.e., the machine learning algorithm's) knowledge about label j that comes, e.g., from previous or similar experiments. The expected value of π ∼ Dir(α) is given by E(π_j) = α_j / \sum_{i=1}^{K} α_i, j ∈ Y, and it expresses the belief that j is the true label. This is due to the fact that the marginals π_j of the Dirichlet distribution are distributed according to a Beta distribution with parameters α_j and α_0 − α_j, with α_0 = \sum_{i=1}^{K} α_i. Dirichlet distributions are second-order distributions, since their support is the (K−1)-dimensional simplex, i.e., P(Y). That is, they can be thought of as distributions over the actual probability measures that generated the data.

In the following, we assume that our current probabilistic knowledge is given by Q = Dir(α), so that the marginal distributions are Beta distributions, i.e., Q_i = Beta(α_i, α_0 − α_i) with α_0 = \sum_{j=1}^{K} α_j for each i ∈ Y. Using the closed form for the expectation of the marginals, we obtain

TU(Q) = 1 - \max_{y \in Y} \frac{\alpha_y}{\alpha_0}   (20)

for the total uncertainty in (17). Unfortunately, it is difficult to derive a closed form for the expression in (18). However, the expected value is easily approachable through Monte Carlo simulation. Finally, for EU in (19), we are dealing with a constrained optimization problem, which, however, has appealing properties. Indeed, given a Dirichlet distribution, we seek to solve the following constrained optimization problem for (19):

minimize_{q ∈ (0,1)^K}   h(q) := \frac{1}{2} \sum_{i=1}^{K} E_{p_i \sim Q_i}\big[ |p_i - q_i| \big]   (21)
subject to   c(q) := \sum_{i=1}^{K} q_i - 1 = 0.   (22)

By further evaluating the sum of expectations involved in (21) and using the method of Lagrange multipliers, we obtain the following result.

Proposition 4.9. The convex constrained optimization problem (21)–(22) has a unique solution.

Accordingly, EU for a given Dirichlet distribution can be computed quickly using a common optimization method.
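For Q = Dir(α) specifically, one might evaluate the measures as in the following sketch (again illustrative and not the authors' implementation): TU uses the closed form (20), AU is approximated by Monte Carlo as suggested above, and EU solves the constrained problem (21)–(22) numerically, with each marginal expectation E|p_i − q_i| computed by integrating against the Beta(α_i, α_0 − α_i) density.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import minimize

def dirichlet_uncertainties(alpha, n_mc=20_000, seed=0):
    """TU via Eq. (20), Monte Carlo AU via Eq. (18), and EU via the
    constrained problem (21)-(22) for a Dirichlet belief Q = Dir(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    a0, K = alpha.sum(), len(alpha)

    tu = 1.0 - np.max(alpha / a0)                                        # Eq. (20)

    rng = np.random.default_rng(seed)
    au = 1.0 - np.mean(np.max(rng.dirichlet(alpha, size=n_mc), axis=1))  # Eq. (18)

    marginals = [stats.beta(a, a0 - a) for a in alpha]                   # Q_i = Beta(alpha_i, a0 - alpha_i)

    def h(q):                                                            # objective of Eq. (21)
        return 0.5 * sum(
            quad(lambda x, i=i: abs(x - q[i]) * marginals[i].pdf(x), 0.0, 1.0)[0]
            for i in range(K)
        )

    res = minimize(h, x0=alpha / a0, bounds=[(1e-9, 1.0)] * K,
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],  # Eq. (22)
                   method="SLSQP")
    return tu, au, res.fun

print(dirichlet_uncertainties([5.0, 2.0, 1.0]))
```

Since the problem has a unique solution by Proposition 4.9, a standard solver started at the Dirichlet mean is typically sufficient.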
Figure 2 displays, for |Y| = 3, some exemplary Dirichlet distributions with different α parameters over the 2-simplex, along with their corresponding normalized⁵ values for TU, AU, and EU in (10)–(12). We observe that the desired properties are captured as follows: First, AU and EU are each always smaller than or equal to TU. Moreover, TU is maximal for the uniform distribution, as shown in Fig. 2a. TU also attains its maximum under other parameter conditions, but with varying aleatoric and epistemic contributions: this can occur with a high AU value, stemming from a high concentration around the first-order uniform distribution (see Fig. 2b, c). Alternatively, a high EU value can drive this, due to a strong similarity to the discrete uniform distribution on the first-order Dirac measures, namely the vertices (see Fig. 2f). Additionally, we observe that EU strictly increases for mean-preserving spreads (Fig. 2b, c). Fig. 2e depicts a Dirichlet distribution which is quite confident about one of the actual outcomes. This is reflected accordingly in low values of the uncertainty measures. These observations based on Dirichlet distributions align with our theoretical analysis of the proposed distance-based uncertainty measures.

⁵ The values are normalized by multiplying them with K/(K−1), see the discussion after Proposition 4.6.

Figure 2: Dirichlet distributions with different choices of α, with normalized values for TU, AU, and EU.

Finally, we also consider the same exemplary second-order distributions Q used by Wimmer et al. (2023) to illustrate the issues of the entropy-based uncertainty measures (Figure 3 in Appendix B). In line with our theoretical results, our Wasserstein metric-induced measures behave as desired with respect to the axioms.

5. Conclusion

Recent criticisms have pointed to limitations in widely accepted uncertainty measures for second-order distributions, primarily due to certain unfavorable theoretical properties. Responding to this criticism, we presented a set of formal criteria that any such uncertainty measure should fulfill. Additionally, we introduced a distance-based approach to obtain measures of total, aleatoric, and epistemic uncertainty tailored towards obeying these criteria. On the basis of the Wasserstein metric, we demonstrated that this approach is fruitful and practical, especially for the often-used Dirichlet distributions.

The motivation for adopting a distance-based method for uncertainty quantification stems from the intuitive and geometric interpretation of uncertainty in probability spaces. Traditionally, uncertainty measures such as entropy-based ones provide insight into the spread or unpredictability of a distribution. However, they do not always capture the subtleties of second-order distributions effectively. We address this gap by leveraging a method that quantifies the distance between probabilistic beliefs (represented by second-order distributions). Our approach also closely aligns with a statistical viewpoint: in statistics, it is quite natural to assess the discrepancy between probability distributions using distances. Such distances are well established for (first-order) probability distributions, with prominent examples including the Wasserstein distance and the Kullback-Leibler divergence, among others.

Our results open several avenues for future work. First, it would be interesting to instantiate the proposed uncertainty measures with metrics on probability measures other than the Wasserstein metric and to verify that the proposed criteria are met. In that respect, it would be interesting to work out general properties that a metric must satisfy in order for the criteria to be met. Although the focus of our work is on the theoretical aspects of uncertainty measures, a systematic experimental comparison in the context of evidential deep learning would be intriguing.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Acknowledgements Yusuf Sale is supported by the DAAD program Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. Michele Caprio would like to acknowledge partial funding by the Army Research Office (ARO MURI W911NF2010080). Amini, A., Schwarting, W., Soleimany, A., and Rus, D. Deep evidential regression. In Proc. Neur IPS, 33rd Advances in Neural Information Processing Systems, volume 33, pp. 14927 14937, 2020. Augustin, T., Coolen, F. P., De Cooman, G., and Troffaes, M. C. Introduction to Imprecise Probabilities. John Wiley & Sons, 2014. Bengs, V., H ullermeier, E., and Waegeman, W. Pitfalls of epistemic uncertainty quantification through loss minimisation. In Proc. Neur IPS, 35th Advances in Neural Information Processing Systems, volume 35, pp. 29205 29216, 2022. Bengs, V., H ullermeier, E., and Waegeman, W. On secondorder scoring rules for epistemic uncertainty quantification. In Proc. ICML, 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 2078 2091. PMLR, 2023. Bronevich, A. and Klir, G. J. Axioms for uncertainty measures on belief functions and credal sets. In Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), pp. 1 6. IEEE, 2008. Corani, G., Antonucci, A., and Zaffalon, M. Bayesian networks with imprecise probabilities: Theory and application to classification. Data Mining: Foundations and Intelligent Paradigms: Volume 1: Clustering, Association and Classification, pp. 49 93, 2012. Cover, T. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 1999. Depeweg, S., Hernandez-Lobato, J.-M., Doshi-Velez, F., and Udluft, S. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In Proc. ICML, 35th International Conference on Machine Learning, pp. 1184 1193. PMLR, 2018. Gal, Y. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian Data Analysis. CRC Press, 2013. Gruber, C., Schenk, P. O., Schierholz, M., Kreuter, F., and Kauermann, G. Sources of uncertainty in machine learning a statisticians view. ar Xiv preprint ar Xiv:2305.16703, 2023. Houlsby, N., Husz ar, F., Ghahramani, Z., and Lengyel, M. Bayesian active learning for classification and preference learning. ar Xiv preprint ar Xiv:1112.5745, 2011. H ullermeier, E. and Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457 506, 2021. Kapoor, S., Maddox, W. J., Izmailov, P., and Wilson, A. G. On uncertainty, tempering, and data augmentation in Bayesian classification. In Proc. Neur IPS, 35th Advances in Neural Information Processing Systems, volume 35, pp. 18211 18225, 2022. Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Proc. Neur IPS, 30th Advances in Neural Information Processing Systems, volume 30, pp. 5574 5584, 2017. Second-Order Uncertainty Quantification Kullback, S. and Leibler, R. A. On information and sufficiency. The Annals of Mathematical Statistics, 22(1): 79 86, 1951. Lambrou, A., Papadopoulos, H., and Gammerman, A. Reliable confidence measures for medical diagnosis with evolutionary algorithms. IEEE Transactions on Information Technology in Biomedicine, 15(1):93 99, 2010. Malinin, A., Chervontsev, S., Provilkov, I., and Gales, M. 
Regression prior networks. ar Xiv preprint ar Xiv:2006.11590, 2020. Meinert, N. and Lavin, A. Multivariate deep evidential regression. ar Xiv preprint ar Xiv:2104.06135, 2021. Meinert, N., Gawlikowski, J., and Lavin, A. The unreasonable effectiveness of deep evidential regression. In Proc. AAAI, Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9134 9142, 2023. Mobiny, A., Yuan, P., Moulik, S. K., Garg, N., Wu, C. C., and Van Nguyen, H. Drop Connect is effective in modeling uncertainty of Bayesian deep networks. Scientific Reports, 11:5458, 2021. Pal, N. R., Bezdek, J. C., and Hemasinha, R. Uncertainty measures for evidential reasoning II: A new measure of total uncertainty. International Journal of Approximate Reasoning, 8(1):1 16, 1993. Sale, Y., Caprio, M., and H ullermeier, E. Is the volume of a credal set a good measure for epistemic uncertainty? In Proc. UAI, 39th Conference on Uncertainty in Artificial Intelligence, pp. 1795 1804. PMLR, 2023a. Sale, Y., Hofman, P., Wimmer, L., H ullermeier, E., and Nagler, T. Second-order uncertainty quantification: Variance-based measures. ar Xiv preprint ar Xiv:2401.00276, 2023b. Schweighofer, K., Aichberger, L., Ielanskyi, M., and Hochreiter, S. Introducing an improved informationtheoretic measure of predictive uncertainty. In Neur IPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023. Senge, R., B osner, S., Dembczy nski, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., and H ullermeier, E. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16 29, 2014. Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379 423, 1948. Smith, L. and Gal, Y. Understanding measures of uncertainty for adversarial example detection. In Proc. UAI, 34th Conference on Uncertainty in Artificial Intelligence, pp. 560 570, 2018. Ulmer, D., Hardmeier, C., and Frellsen, J. Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. Transaction of Machine Learning Research, 2023. Varshney, K. R. and Alemzadeh, H. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big Data, 5(3):246 255, 2017. Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2009. Villani, C. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021. Walley, P. Statistical Reasoning with Imprecise Probabilities. Chapman & Hall, 1991. Wimmer, L., Sale, Y., Hofman, P., Bischl, B., and H ullermeier, E. Quantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures? In Proc. UAI, 39th Conference on Uncertainty in Artificial Intelligence, pp. 2282 2292. PMLR, 2023. Yang, F., Wang, H.-Z., Mi, H., Lin, C.-D., and Cai, W.- W. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC Bioinformatics, 10(1):1 14, 2009. Second-Order Uncertainty Quantification Proof of Lemma 4.2 Since Y = {y1, . . . , y K} is finite, this means that a probability measure p P(Y) can be seen as a K-dimensional probability vector. In symbols, p (p(y1), . . . , p(y K)) . The latter is a vector belonging to the (K 1)-unit simplex K 1. As a consequence, a second-order distribution on P(Y) can be seen as a first-order distribution on the (K 1)-unit simplex K 1. In symbols, Q(Y) P( K 1). 
This, together with the first-order Wasserstein distance being a well-defined metric on P(Y) (Villani, 2009; 2021), allows us to conclude that the second-order Wasserstein distance is itself a well-defined metric.

Proof of Proposition 4.3

Let δ_p ∈ Q(Y) be a second-order Dirac measure, where p ∈ P(Y), and Q ∈ Q(Y) a second-order distribution. We show that any coupling γ on (P(Y) × P(Y), σ(P(Y) × P(Y))) has to be necessarily given by γ = Q ⊗ δ_p. Thus, for any A, B ∈ σ(P(Y)) we show that

γ(A × B) = Q(A) if p ∈ B, and γ(A × B) = 0 else.

Let p ∈ B; then we have γ(A × B^c) = 0. Hence, this implies γ(A × B) = γ(A × P(Y)) − γ(A × B^c) = Q(A). This shows the first case. Assume p ∉ B; then γ(A × B) ≤ γ(P(Y) × B) = δ_p(B) = 0, showing the second case.

Proof of Proposition 4.4

Let p, q ∈ P(Y). Note that γ with γ(y, y) = min{p(y), q(y)} is trivially a coupling, hence γ ∈ Γ(p, q). For the corresponding marginals we have p(y) = \sum_{y' \in Y} γ(y, y') and q(y) = \sum_{y' \in Y} γ(y', y), thus γ(y, y) ≤ min{p(y), q(y)}. This directly implies that γ(y, y) = min{p(y), q(y)} maximizes \sum_{y \in Y} γ(y, y), and therefore minimizes the distance d_1(p, q) = \inf_{\gamma \in \Gamma(p, q)} \{ 1 - \sum_{y \in Y} \gamma(y, y) \}.

Proof of Proposition 4.5

Let the distance on P(Y) be given by d_1(p, q) = \frac{1}{2} \| p - q \|_1, where p, q ∈ P(Y). Now, for any Q ∈ Q(Y) the proposed uncertainty measures simplify as follows:

TU(Q) = \min_{y \in Y} W(Q, \delta_{\delta_y}) = \min_{y \in Y} E_{p \sim Q}\big[ \tfrac{1}{2} \| p - \delta_y \|_1 \big]   (23)
= \min_{y \in Y} E_{p \sim Q}[1 - p(y)]   (24)
= 1 - \max_{y \in Y} E_{Q_y}[p(y)].   (25)

Note that for (24) we used the fact that \tfrac{1}{2} \| p - \delta_y \|_1 = \tfrac{1}{2} \big( 1 - p(y) + \sum_{y' \neq y} p(y') \big) = 1 - p(y).

Further, we have

AU(Q) = \min_{\delta_m \in \boldsymbol{\delta}_m} W(Q, \delta_m)   (26)
= \min_{\delta_m \in \boldsymbol{\delta}_m} \inf_{\gamma \in \Gamma(Q, \delta_m)} \int_{P(Y) \times P(Y)} d_1(p, q) \, d\gamma(p, q)   (27)
= \min_{\delta_m \in \boldsymbol{\delta}_m} \inf_{\gamma \in \Gamma(Q, \delta_m)} \int_{P(Y) \times \{\delta_y : y \in Y\}} (1 - p(y)) \, d\gamma(p, q)   (28)
\geq \int_{P(Y)} \big( 1 - \max_{y \in Y} p(y) \big) \, dQ(p)   (29)
= E_{p \sim Q}\big[ 1 - \max_{y \in Y} p(y) \big],   (30)

where we used y = \arg\max_{y' \in Y} q(y'). Equality in (29) is reached for the Dirac mixture δ*_m ∈ \boldsymbol{\delta}_m with δ*_m(δ_y) = Q(y = \arg\max_{y' \in Y} p(y')). Thus, we have the following:

AU(Q) = \inf_{\gamma \in \Gamma(Q, \delta^*_m)} \int_{P(Y) \times \{\delta_y : y \in Y\}} \Big( 1 - \sum_{y' \in Y} p(y') q(y') \Big) \, d\gamma(p, q)
= \inf_{\gamma \in \Gamma(Q, \delta^*_m)} \int_{P(Y)} \sum_{q \in \{\delta_y : y \in Y\}} \Big\{ 1 - \sum_{y \in Y} p(y) q(y) \Big\} \, \gamma(q \mid p) \, dQ(p)
= \inf_{\gamma \in \Gamma(Q, \delta^*_m)} \int_{P(Y)} \sum_{y \in Y} \{ 1 - p(y) \} \, \gamma(\delta_y \mid p) \, dQ(p)
= \int_{P(Y)} \big( 1 - \max_{y \in Y} p(y) \big) \, dQ(p).

The conditional probability measure γ(δ_y | p) = \mathbf{1}_{\{y = \arg\max_{y' \in Y} p(y')\}} is valid, since

δ^*_m(δ_y) = \int_{P(Y)} \gamma(\delta_y \mid p) \, dQ(p) = Q\big( y = \arg\max_{y' \in Y} p(y') \big).

Finally, we also have the following:

EU(Q) = \min_{\delta_p \in \boldsymbol{\delta}_p} W(Q, \delta_p)
= \min_{\delta_p \in \boldsymbol{\delta}_p} \int_{P(Y) \times P(Y)} d_1(p, q) \, d\delta_p(p) \, dQ(q)
= \frac{1}{2} \min_{p \in P(Y)} \int_{P(Y)} \| q - p \|_1 \, dQ(q)
= \frac{1}{2} \min_{p \in P(Y)} E_{q \sim Q}\big[ \| q - p \|_1 \big].

This concludes the proof.

Proof of Proposition 4.6

Let Q ∈ Q(Y); then we have:

i.) TU(Q) = 1 - \max_{y \in Y} E_Q[p(y)] \leq 1 - \frac{1}{K}, where the inequality is a direct consequence of \sum_{y \in Y} p(y) = 1 for any p ∈ P(Y), which implies that \max_{y \in Y} p(y) \geq 1/K. It is clear that this upper bound is reached for Q' ∈ Q(Y) such that E_{Q'}[p] = Unif(Y).

ii.) AU(Q) = 1 - E_Q[\max_{y \in Y} p(y)] \leq 1 - \frac{1}{K}. Clearly, the upper bound is reached for Q' = δ_{Unif(Y)}.

iii.) For EU(Q) we obtain

EU(Q) = \frac{1}{2} \min_{p \in P(Y)} E_{q \sim Q}\big[ \| q - p \|_1 \big]   (31)
\leq \frac{1}{2} E_{q \sim Q}\big[ \| q - E_Q[p] \|_1 \big]   (32)
\leq \frac{1}{2} E_{q \sim \delta_m}\big[ \| q - E_Q[p] \|_1 \big]   (33)
= \sum_{y \in Y} E_Q[p(y)] \big( 1 - E_Q[p(y)] \big)   (34)
= 1 - \sum_{y \in Y} E_Q[p(y)]^2   (35)
\leq 1 - \frac{1}{K},   (36)

where (33) follows from the Dirac mixture δ_m ∈ \boldsymbol{\delta}_m being a mean-preserving spread of Q. Inequality (36) is a consequence of the Cauchy-Schwarz inequality and the linearity of expectation. The upper bound is reached for Q', which is such that Q'(δ_y) = 1/K for all y ∈ Y. This concludes the proof.

Proof of Corollary 4.7

Corollary 4.7 is an immediate consequence of Proposition 4.6.

Proof of Theorem 4.8

We show that the Wasserstein distance instantiated measures (10)–(12) satisfy Axioms A0–A8 discussed in Section 3.1.
A0: Since the proposed measures are distance-based, this property holds trivially true.

A1: Let p ∈ P(Y) and y ∈ Y; then we have

AU(δ_{Unif(Y)}) = 1 - \max_{y' \in Y} Unif(Y)(y') = \frac{K-1}{K} \geq 1 - \max_{y' \in Y} p(y') = AU(δ_p) \geq 1 - \max_{y' \in Y} \delta_y(y') = AU(δ_{\delta_y}) = 0.

The first inequality is a direct consequence of Proposition 4.6.

A2: For p ∈ P(Y) and Q ∈ Q(Y) we immediately have, by definition, EU(Q) ≥ EU(δ_p) = 0. The other inequality in this axiom follows directly from Proposition 4.6 iii.).

A3: Since \bigcup_{y \in Y} \{\delta_{\delta_y}\} \subseteq \boldsymbol{\delta}_p, it follows that EU(Q) = \min_{\delta_p \in \boldsymbol{\delta}_p} W(Q, \delta_p) \leq \min_{y \in Y} W(Q, \delta_{\delta_y}) = TU(Q) for any Q ∈ Q(Y). Similarly, since \bigcup_{y \in Y} \{\delta_{\delta_y}\} \subseteq \boldsymbol{\delta}_m, we obtain AU(Q) = \min_{\delta_m \in \boldsymbol{\delta}_m} W(Q, \delta_m) \leq \min_{y \in Y} W(Q, \delta_{\delta_y}) = TU(Q) for any Q ∈ Q(Y).

A4: This follows from Proposition 4.6, since we have E_Q[p] = Unif(Y) for Q being the continuous second-order uniform distribution.

A5: Further, let Q' ∈ Q(Y) be a mean-preserving spread of Q ∈ Q(Y), i.e., let X ∼ Q, X' ∼ Q' be two random variables such that X' =_d X + Z for some random variable Z with E[Z | X = x] = 0 for all x in the support of X. Then, we have

EU(Q') = \frac{1}{2} \min_{p \in P(Y)} E\big[ \| (X + Z) - p \|_1 \big] = \frac{1}{2} \min_{p \in P(Y)} \sum_{i=1}^{K} E\big( |X_i + Z_i - p_i| \big),

where X_1, . . . , X_K are the marginals of X and Z_1, . . . , Z_K the marginals of Z, respectively. From this, we further infer for any p = (p_1, . . . , p_K) ∈ P(Y) and any x in the support of X that

\frac{1}{2} \min_{p \in P(Y)} \sum_{i=1}^{K} E\big( |X_i + Z_i - p_i| \big) \geq \frac{1}{2} \min_{p \in P(Y)} \sum_{i=1}^{K} E\Big( |X_i - p_i| + Z_i \big( \mathbf{1}_{\{X_i > p_i\}} - \mathbf{1}_{\{X_i \leq p_i\}} \big) \Big).

Since E[Z_i | σ(X)] = 0, the second term inside each expectation vanishes, so the right-hand side equals \frac{1}{2} \min_{p \in P(Y)} \sum_{i=1}^{K} E(|X_i - p_i|) = EU(Q), which yields EU(Q') ≥ EU(Q).