# Rethinking Aleatoric and Epistemic Uncertainty

Freddie Bickford Smith, Jannik Kossen, Eleanor Trollope, Mark van der Wilk, Adam Foster, Tom Rainforth

University of Oxford. Correspondence to Freddie Bickford Smith. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

**Abstract.** The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all the distinct quantities that researchers are interested in. To address this we present a decision-theoretic perspective that relates rigorous notions of uncertainty, predictive performance and statistical dispersion in data. This serves to support clearer thinking as the field moves forward. Additionally we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition.

## 1 Introduction

When making decisions under uncertainty, it can be useful to reason about where that uncertainty comes from (Osband et al., 2023; Wen et al., 2022). Researchers often aim to do this by referring to aleatoric (literal meaning: "relating to chance") and epistemic ("relating to knowledge") uncertainty, ideas with a long history in the study of probability (Hacking, 1975). Aleatoric uncertainty is typically associated with statistical dispersion in data (sometimes thought of as noise), while epistemic uncertainty is associated with a model's internal information state (Hüllermeier & Waegeman, 2021).

Concerningly given their scale of use, these ideas are not being discussed coherently in the literature. The line between model-based predictions and data-generating processes is repeatedly blurred (Amini et al., 2020; Ayhan & Berens, 2018; Immer et al., 2021; Kapoor et al., 2022; Smith & Gal, 2018; van Amersfoort et al., 2020). On top of this, tenuous assumptions are made about how uncertainty will decompose on unseen data (Seeböck et al., 2019; Wang & Aitchison, 2021), and misleading connections are drawn between uncertainty and predictive accuracy (Orlando et al., 2019; Wang et al., 2019). Meanwhile distinct mathematical quantities are used to refer to notionally the same concepts: epistemic uncertainty, for example, has been variously defined using density-based (Mukhoti et al., 2023; Postels et al., 2020), information-based (Gal et al., 2017) and variance-based (Gal, 2016; Kendall & Gal, 2017; McAllister, 2016) quantities.

We suggest this incoherence arises from the aleatoric-epistemic view being too simplistic in the context of machine learning. Researchers are looking for concrete notions of a model's predictive uncertainty and how that uncertainty might or might not change with more data (associated with a decomposition into irreducible and reducible components), but also related notions of predictive performance and data dispersion. The aleatoric-epistemic view cannot satisfy all these needs: many concepts stand to be defined, while the view fundamentally only has capacity for two concepts. Yet the current state of play is to nevertheless appeal to the aleatoric-epistemic view, with different researchers using it in different ways.
A result of this conceptual overloading is to conflate quantities that ought to be recognised as distinct. Far from just a matter of semantics, this is having a meaningful effect on the field's progress: methods are being designed and evaluated based on shaky foundations.

To establish a clearer perspective, we draw on powerful yet underappreciated ideas from decision theory (Dawid, 1998; DeGroot, 1962; Neiswanger et al., 2022). Our starting point is a final decision of interest with an associated loss function. Given this, uncertainty in predictive beliefs can be formalised as the subjective expected loss of acting Bayes-optimally under those beliefs; this generalises quantities like variance and Shannon entropy. From there we show how reasoning about new data gives rise to a notion of expected uncertainty reduction, which we can use to identify a decomposition of uncertainty into irreducible and reducible components. Then we clarify the connection between uncertainty, predictive performance and data dispersion, linking to classic decompositions from statistics and information theory. Overall this provides a coherent synthesis of key quantities that researchers are interested in (Figure 1).

**Figure 1.** Our decision-theoretic view coherently relates machine-learning concepts that have been conflated under the aleatoric-epistemic view. We consider taking an action, $a \in A$, in light of imperfect knowledge of $z \in Z$, with an action's consequences measured by a loss function, $\ell(a, z)$. Since $z$ is unknown, we use any available training data, $y_{1:n}$, to build a predictive model, $p_n(z)$, with which we can reason over possible values of $z$ and thus choose an action. Additionally we can use the model to quantify uncertainty and its reducibility with respect to new data, and we can evaluate the model using a ground-truth realisation of $z$ or a reference distribution, $p_{\text{eval}}(z)$. [The figure's panels span the decision problem, learning, reasoning, uncertainty quantification and model evaluation, and define: training data $y_{1:n} \sim p_{\text{train}}(y_{1:n} \mid \pi)$; predictive model $p_n(z) = p(z; y_{1:n})$; Bayes-optimal action $a_n^* = \arg\min_{a \in A} \mathbb{E}_{p_n(z)}[\ell(a, z)]$; predictive uncertainty $h[p_n(z)] = \mathbb{E}_{p_n(z)}[\ell(a_n^*, z)]$; expected uncertainty reduction $\mathrm{EUR}^{\mathrm{true}}_z(\pi, m) = h[p_n(z)] - \mathbb{E}_{p_{\text{train}}(y^+_{1:m} \mid \pi)}[h[p_{n+m}(z)]]$, estimated by $h[p_n(z)] - \mathbb{E}_{p_n(y^+_{1:m} \mid \pi)}[h[q_{n+m}(z)]]$; proper scoring rule $s(p_n, z) = \ell(a_n^*, z)$; and discrepancy function $d(p_n, p_{\text{eval}}) = \mathbb{E}_{p_{\text{eval}}(z)}[s(p_n, z)] - h[p_{\text{eval}}(z)]$.]

Bridging this generalised perspective back to how aleatoric and epistemic uncertainty have been discussed in past work, we provide new insights on BALD, a popular information-theoretic objective for data acquisition (Gal et al., 2017; Houlsby et al., 2011; Lindley, 1956). In particular we highlight that it should be seen not as a direct measure of long-run reducible predictive uncertainty, as has been suggested in the past, but instead as an estimator that can be highly inaccurate. Reconciling this with BALD's practical utility, we suggest it is often better understood as approximately measuring short-run reductions in parameter uncertainty. It can therefore be useful, albeit still suboptimal in prediction-oriented settings (Bickford Smith et al., 2023; 2024).
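To make the quantity concrete, below is a minimal sketch (ours, not from the paper) of the standard Monte Carlo estimator of BALD for classification: the mutual information between the prediction and the model parameters, computed as the entropy of the marginal predictive minus the average entropy of the per-parameter-sample predictives. The arrays and function names are illustrative; parameter samples could come from, say, an ensemble or Monte Carlo dropout.

```python
# A minimal sketch (not from the paper) of the usual Monte Carlo estimator of
# BALD for classification: I[z; theta] = H[E_theta p(z|theta)] - E_theta H[p(z|theta)].
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)  # guard against log(0)
    return -np.sum(p * np.log(p), axis=axis)

def bald(member_probs):
    """member_probs: array of shape (num_parameter_samples, num_classes)."""
    total = entropy(member_probs.mean(axis=0))           # entropy of the marginal predictive
    conditional = entropy(member_probs, axis=-1).mean()  # average per-sample entropy
    return total - conditional                           # mutual information (nats)

# Hypothetical predictions from three parameter samples for one input.
confident_disagreement = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
shared_noise = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

print(bald(confident_disagreement))  # positive: parameter samples disagree
print(bald(shared_noise))            # zero: uncertainty unaffected by parameters
```

Consistent with the discussion above, this quantity tracks disagreement between parameter samples; it does not directly measure how much predictive uncertainty would resolve in the long run with more data.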
Our work thus serves to inform future work in two key ways. On the one hand it sheds light on the contradictions of the aleatoric-epistemic view and presents a coherent alternative perspective that allows clearer thinking about uncertainty in machine learning. On the other hand it provides more direct practical insights. It clarifies that what might have seemed like arbitrary choices for a decision-maker can instead be made by following well-defined logic: given some basic components and principles, it becomes clear how we should measure predictive uncertainty, predictive performance and data dispersion, and how we can identify good future training data. It also highlights approximations that often have to be made in practice, revealing scope for suboptimal performance and therefore informing future methods research.

## 2 Background

The broad motivation of the aleatoric-epistemic view is to distinguish between different sources of uncertainty. If a model's prediction is uncertain, we might want to know whether that prediction is fundamentally uncertain for the given model class or instead due to a lack of data. This breakdown has clear utility in the context of seeking new data that will reduce predictive uncertainty (Bickford Smith et al., 2023; 2024; MacKay, 1992a;b). But it is also relevant elsewhere: in model selection, for example, we might want to quantify a model's scope for improvement by forecasting how its predictions will change given more data (Barbieri & Berger, 2004; Fong & Holmes, 2020; Geisser & Eddy, 1979; Kadane & Lazar, 2004; Laud & Ibrahim, 1995).

Uncertainty that resolves in light of new data can be thought of as epistemic in the sense that data conveys knowledge. Intuitively the corresponding irreducible uncertainty seems to be determined not only by the model class but also, among other things, by an inherent level of uncertainty associated with the data source at hand, which is often thought of in terms of randomness or chance, hence the word "aleatoric".

While the concepts of aleatoric and epistemic uncertainty had previously been used in machine learning, for example by Lawrence (2012) and Senge et al. (2014), their popularity grew following work by Gal (2016), Gal et al. (2017) and Kendall & Gal (2017). The most widely used mathematical definitions of these ideas, which we will discuss in Section 4, are the information-theoretic quantities used by Gal et al. (2017), building on earlier work on Bayesian experimental design (Lindley, 1956) and Bayesian active learning (Houlsby et al., 2011; MacKay, 1992a;b).

A range of perspectives on aleatoric and epistemic uncertainty in machine learning have been put forward in recent years. These include a discussion of sources of uncertainty in machine learning (Gruber et al., 2023); a case against Shannon entropy as a measure of predictive uncertainty (Wimmer et al., 2023); proposals for alternative information quantities (Schweighofer et al., 2023a;b; 2025); and various other suggestions for how to define uncertainty, such as in terms of class-wise variance (Sale et al., 2023b; 2024b), credal sets (Hofman et al., 2024a; Sale et al., 2023a), distances between probability distributions (Sale et al., 2024a), frequentist risk (Kotelevskii et al., 2022; 2025; Lahlou et al., 2023) and proper scoring rules (Hofman et al., 2024b). As we will show, our replacement for the aleatoric-epistemic view unifies and explains many of these ideas.

## 3 Key concepts

Our aim in this work is to formalise and link together quantities that have been associated with the ideas of aleatoric and epistemic uncertainty in past work. In particular we look to identify a rigorous notion of predictive uncertainty and the extent to which it reduces as more data is observed, and also measures of predictive performance and statistical dispersion in data.
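Before formalising these notions, a toy example (ours, not from the paper) illustrates the kind of reduction at stake. In a conjugate Gaussian model the predictive variance splits exactly into a noise term, which no amount of data removes, and a posterior-variance term that shrinks as observations accumulate. The variable names below are illustrative.

```python
# Toy illustration: predictive uncertainty = irreducible noise + reducible
# posterior uncertainty, in a conjugate Gaussian model where
# theta ~ N(0, tau0^2) and y_i | theta ~ N(theta, sigma^2).
sigma2 = 0.5  # observation-noise variance: the irreducible part
tau02 = 4.0   # prior variance over the latent mean theta

for n in [0, 1, 10, 100, 10_000]:
    posterior_var = 1.0 / (1.0 / tau02 + n / sigma2)  # Var[theta | y_{1:n}]
    predictive_var = sigma2 + posterior_var           # variance of z under p_n(z)
    print(f"n = {n:>6}: predictive variance = {predictive_var:.4f}")
# The printed values fall towards sigma2 = 0.5: the reducible component
# vanishes with data while the irreducible component remains.
```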
We start by highlighting some foundational concepts that will be used throughout our discussion.

### 3.1 Reasoning should start with the decision of interest

We consider taking an action, $a \in A$, under imperfect knowledge of a ground-truth variable, $z \in Z$. Here $z$ could for example be an output relating to a given input (if so, the input is left implicit in our notation) or a parameter in a model, and $a$ could be a direct prediction of $z$, with $A = Z$ for point prediction, or $A = \mathcal{P}(Z)$ for probabilistic prediction. We emphasise our choice to focus on this decision, in deliberate contrast with the more common starting point of learning a model from fixed data. We want a notion of predictive uncertainty that is grounded in actions and their consequences, and we need to reason about different possible datasets to rigorously think about reductions in uncertainty.

### 3.2 Actions induce losses that reflect preferences

We assume we can measure the consequences of taking action $a$ in light of a realisation of $z$ using a loss (or negative utility) function, $\ell: A \times Z \to \mathbb{R}$. In principle the specification of $\ell$ follows directly from having preferences that satisfy basic axioms of rationality (von Neumann & Morgenstern, 1947). In practice it can be hard to know what $\ell$ should be; options for dealing with this include using an intrinsic loss or a random loss (Robert, 1996; 2007).

### 3.3 Subjective expected loss enables decision-making

Since $\ell$ is a function of the unknown $z$, it cannot be used directly as an objective for selecting an action, $a$. A principled solution that we focus on here is to form subjective beliefs over possible values of $z$ (conventionally this belief state would be a Bayesian prior or posterior), average over these to form an expected loss, then choose an action that minimises this subjective expected loss (Ramsey, 1926; Savage, 1951). Alternative decision-making approaches include minimax, which involves minimising the worst-case frequentist risk (von Neumann, 1928; Wald, 1939; 1945).

### 3.4 Machine learning allows data-driven prediction

Minimising subjective expected loss requires beliefs over $z$, and those beliefs can often be informed by some training data, $y_{1:n} \sim p_{\text{train}}(y_{1:n} \mid \pi)$, where $\pi \in \Pi$ is a policy that controls aspects of data generation. We want notions of uncertainty that reflect how we will actually learn from data, rather than assuming idealised updating that we cannot perform in practice. We therefore define our predictive model, $p_n(z) = p(z; y_{1:n})$, to be the output of a generic machine-learning method applied to the training data (and the input of interest if there is one) for any given $n$, which lets us reason about actual changes in uncertainty as $n$ varies. Conventional Bayesian inference, which takes a generative model over possible data and conditions on the observed data to give $p_n(z) = p(z \mid y_{1:n})$, is one possible updating method. Others include deep learning (LeCun et al., 2015), in-context learning (Brown et al., 2020) and non-Bayesian ensemble methods (Breiman, 2001). In some cases the predictive distribution is defined as $p_n(z) = \mathbb{E}_{p_n(\theta)}[p_n(z \mid \theta)]$, where $\theta \sim p_n(\theta) = p(\theta; y_{1:n})$ represents a set of stochastic model parameters that we average over at prediction time.

Regardless of whether there are stochastic parameters, the updating scheme itself could be stochastic. To handle this we take the convention that updating stochasticity is implicitly absorbed into $y_{1:n}$: we can consider our machine-learning method to be a deterministic mapping that takes a random-number seed as an auxiliary input along with the data. Thus, while stochastic updating can be a source of variability in how uncertainty reduces, it can be dealt with as part of the variability already present in what data we observe.
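Putting Sections 3.1-3.4 together, the sketch below (ours, with illustrative names and an assumed posterior rather than one derived from data) draws samples from a parameter-averaged predictive, picks a Bayes-optimal point prediction over a grid of candidate actions, and reads off the predictive uncertainty $h[p_n(z)]$ of Figure 1 as the minimised subjective expected loss.

```python
# A minimal end-to-end sketch (ours, not from the paper): sample z ~ p_n(z)
# via parameter averaging, then find a Bayes-optimal action and its expected loss.
import numpy as np

rng = np.random.default_rng(0)

# Assumed posterior over a mean parameter theta given y_{1:n} (not derived here).
theta_samples = rng.normal(loc=1.0, scale=0.3, size=5_000)  # theta ~ p_n(theta)
z_samples = rng.normal(loc=theta_samples, scale=0.5)        # z ~ p_n(z|theta), marginalised

def bayes_optimal(actions, z_samples, loss):
    """Return (a*, h) minimising the Monte Carlo subjective expected loss."""
    expected_loss = loss(actions[:, None], z_samples[None, :]).mean(axis=1)
    best = expected_loss.argmin()
    return actions[best], expected_loss[best]

action_grid = np.linspace(-1.0, 3.0, 401)  # candidate point predictions
a_star, h = bayes_optimal(action_grid, z_samples, lambda a, z: (a - z) ** 2)

print(a_star)  # close to the predictive mean, 1.0
print(h)       # close to the predictive variance, 0.3^2 + 0.5^2 = 0.34
```

Under the squared loss used here the optimal action is the predictive mean and $h[p_n(z)]$ reduces to the predictive variance; swapping in absolute loss gives the median and the mean absolute deviation, and log loss with probabilistic actions recovers Shannon entropy, which is the sense in which $h$ generalises these familiar quantities.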
### 3.5 Bayes optimality is a subjective notion

An action taken by minimising subjective expected loss under $p_n(z)$ is referred to as Bayes optimal (Murphy, 2022); if the action is an estimator of some quantity of interest then it is known as a Bayes estimator. The notion of Bayes optimality assumes our beliefs, $p_n(z)$, represent our best knowledge of $z$, and that $\ell$ reflects our preferences. It says nothing at all about how well our beliefs match reality, or about the actions we would take if we had different beliefs. Bayes-optimal actions can therefore be suboptimal as judged using realisations of $z$ from somewhere other than $p_n(z)$, such as a system serving as a source of ground truth.

### 3.6 Predictions often do not match data generation

The correspondence between our predictions, $p_n(z)$, and the data-generating process, ptrain(yi|π(y