# EVIDENTIAL TURING PROCESSES

Published as a conference paper at ICLR 2022

Melih Kandemir (Dept. of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark; kandemir@imada.sdu.dk) · Abdullah Akgül (Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey; akgula15@itu.edu.tr) · Manuel Haussmann (Department of Computer Science, Aalto University, Espoo, Finland; manuel.haussmann@aalto.fi; work done while at Heidelberg University, Germany) · Gozde Unal (Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey; gozde.unal@itu.edu.tr)

**Abstract.** A probabilistic classifier with reliable predictive uncertainties i) fits successfully to the target domain data, ii) provides calibrated class probabilities in difficult regions of the target domain (e.g. class overlap), and iii) accurately identifies queries coming from outside the target domain and rejects them. We introduce an original combination of Evidential Deep Learning, Neural Processes, and Neural Turing Machines that provides all three essential properties of total uncertainty quantification. On five classification tasks, we observe our method to be the only one that excels at all three aspects of total calibration with a single standalone predictor. Our unified solution delivers an implementation-friendly and compute-efficient recipe for safety clearance and provides intellectual economy to the investigation of the algorithmic roots of epistemic awareness in deep neural nets.

## 1 INTRODUCTION

The applicability of deep neural nets to safety-critical use cases such as autonomous driving or medical diagnostics is an active matter of fundamental research (Schwalbe & Schels, 2020). The key challenges in the development of total safety in deep learning are at least three-fold: i) evaluation of model fit, ii) risk assessment in difficult regions of the target domain (e.g. class overlap), based on which safety-preserving fallback functions can be deployed, and iii) rejection of inputs that do not belong to the target domain, such as an image of a cat presented to a character recognition system. The attention the machine learning community pays to each of these critical safety elements is steadily increasing (Naeini et al., 2015; Guo et al., 2017; Kuleshov et al., 2018). However, the developments often follow isolated directions, and the proposed solutions are largely fragmented. Calibration of neural net probabilities focuses primarily on post-hoc adjustment algorithms trained on validation sets from in-domain data (Guo et al., 2017), ignoring the out-of-domain detection and model fit evaluation aspects. On the other hand, recent advances in out-of-domain detection build on strong penalization of divergence from the probability mass observed in the target domain (Sensoy et al., 2018; Malinin & Gales, 2018), distorting the quality of the class probability scores. Such fragmentation of best practices not only hinders their accessibility for real-world applications but also complicates the scientific inquiry into the underlying reasons behind the sub-optimality of neural net uncertainties. We aim to identify the guiding principles for Bayesian modeling that could deliver the most complete set of uncertainty quantification capabilities in one single model.
We first characterize Bayesian models with respect to the relationship of their global and local variables: (a) *Parametric Bayesian Models* comprise a likelihood function that maps inputs to outputs via probabilistic global parameters; (b) *Evidential Bayesian Models* apply an uninformative prior distribution on the parameters of the likelihood function and infer the prior hyperparameters by amortizing on an input observation. Investigating the decomposition of the predictive distribution variance, we find that the uncertainty characteristics of these two approaches exhibit complementary strengths.

Figure 1: Plate diagrams of the three main Bayesian modeling approaches: (a) Parametric Bayesian Model, (b) Evidential Bayesian Model, (c) Complete Bayesian Model. Shaded nodes are observed and unshaded ones are latent. The solid arrows denote conditional dependencies in the true model, while the bent dashed arrows denote amortization in variational inference.

We introduce a third approach that employs a prior on the likelihood parameters that both conditions on the input observations in the true model and amortizes on them during inference. The mapping from the input observations to the prior hyperparameters is governed by a random set of global parameters. We refer to the resulting approach as a (c) *Complete Bayesian Model*, since its predictive distribution variance combines the favorable properties of Parametric and Evidential Bayesian Models. Figure 1 gives an overview of the relationship between these three Bayesian modeling approaches.

The advantages of Complete Bayesian Models come with the challenge of choosing an input-dependent prior. We introduce a new stochastic process construction that addresses this problem by accumulating observations from a context set during training time into the parameters of a global hyperprior variable by an explicitly designed update rule. We design the input-specific prior on the likelihood parameters by conditioning it on both the hyperprior and the input observation. As conditioning via explicit parameter updates amounts to maintaining an external memory that can be queried by the prior distribution, we refer to the eventual stochastic process as a *Turing Process*, taking inspiration from the earlier work on Neural Turing Machines (Graves et al., 2014) and Neural Processes (Garnelo et al., 2018b). We arrive at our target Bayesian design, equipped with the complete set of uncertainty quantification capabilities, by incorporating the Turing Process design into a Complete Bayesian Model. We refer to the resulting approach as an *Evidential Turing Process*. We observe on five real-world classification tasks that the Evidential Turing Process is the only model that excels simultaneously at model fit, class overlap quantification, and out-of-domain detection.

## 2 PROBLEM STATEMENT: TOTAL CALIBRATION

Consider the following data generating process with $K$ modes:

$$y|\pi \sim \mathrm{Cat}(y|\pi), \qquad x|y \sim \sum_{k=1}^{K} \mathbb{I}_{y=k}\, p(x|y=k), \qquad (1)$$

where $\mathrm{Cat}(y|\cdot)$ is a categorical distribution, $\mathbb{I}_u$ is the indicator function for predicate $u$, and $p(x|y)$ is the class-conditional density function on input observation $x$. For any prior class distribution $\pi$, the classification problem can be cast as identifying the class probabilities for observed patterns, $\Pr[y|x,\pi] = \mathrm{Cat}(y|f^\pi_{\mathrm{true}}(x))$, where $f^\pi_{\mathrm{true}}(x)$ is a $K$-dimensional vector with $k$-th element

$$f^{\pi,k}_{\mathrm{true}}(x) = \frac{\pi_k\, p(x|y=k)}{\sum_{\kappa=1}^{K} \pi_\kappa\, p(x|y=\kappa)}. \qquad (2)$$
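As a concrete reading of Eq. 2, the minimal NumPy sketch below evaluates $f^\pi_{\mathrm{true}}$ for an illustrative pair of unit-variance Gaussian class conditionals centered at $-1$ and $+1$ (the same construction as the synthetic task in Appendix B.1); the chosen densities and prior are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-class setup: unit-variance Gaussian class conditionals at -1 and +1.
pi = np.array([0.5, 0.5])      # prior class probabilities pi_k
means = np.array([-1.0, 1.0])  # class-conditional means

def f_true(x):
    """Bayes-optimal class probabilities of Eq. 2 for a scalar input x."""
    lik = norm.pdf(x, loc=means, scale=1.0)  # p(x | y = k)
    joint = pi * lik                         # pi_k p(x | y = k)
    return joint / joint.sum()               # normalize over classes kappa

print(f_true(0.0))  # class-overlap region: [0.5, 0.5]
print(f_true(2.0))  # deep inside class 2: roughly [0.02, 0.98]
```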
In many real-world cases $f^\pi_{\mathrm{true}}$ is not known and should be identified from a hypothesis space $h_\pi \in \mathcal{H}_\pi$ via inference, given a set of samples $D = \{(x_n, y_n)\,|\,n=1,\dots,N\}$ obtained from the true distribution. The risk of the Bayes classifier $\arg\max_y \mathrm{Cat}(y|h_\pi(x))$ for a given input $x$ is

$$R(h_\pi(x)) = \sum_{\kappa=1}^{K} \mathbb{I}_{\kappa \neq \arg\max_k h^k_\pi(x)}\, f^{\pi,\kappa}_{\mathrm{true}}(x). \qquad (3)$$

For the whole data distribution, we have $R(h_\pi) = \mathbb{E}_{x|\pi}[R(h_\pi(x))]$. The optimal case is $h_\pi(x) = f^\pi_{\mathrm{true}}(x)$, which brings about the irreducible Bayes risk resulting from class overlap:

$$R^*(f^\pi_{\mathrm{true}}(x)) = \min\big\{1 - \max_k f^{\pi,k}_{\mathrm{true}}(x),\ \max_k f^{\pi,k}_{\mathrm{true}}(x)\big\}. \qquad (4)$$

**Total Calibration.** Given a set of test samples $D_{ts}$ coming from the true data generating process in Eq. 1, we define *Total Calibration* as the capability of a discriminative predictor $h_\pi(x)$ to quantify the following three types of uncertainty simultaneously:

i) *Model misfit*, evaluated by the similarity of $h_\pi(x)$ and $f^\pi_{\mathrm{true}}(x)$, measurable directly via

$$\mathrm{KL}\big(\mathrm{Cat}(y|f^\pi_{\mathrm{true}}(x))\,p(x)\ \big\|\ \mathrm{Cat}(y|h_\pi(x))\,p(x)\big) = \mathbb{E}_{p(x)}\big[\log \mathrm{Cat}(y|f^\pi_{\mathrm{true}}(x))\big] + \mathbb{E}_{p(x)}\big[-\log \mathrm{Cat}(y|h_\pi(x))\big],$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence. Since $\mathbb{E}_{p(x)}[\log \mathrm{Cat}(y|f^\pi_{\mathrm{true}}(x))]$ is a constant with respect to $h_\pi(x)$, an unbiased estimator of the relative goodness of an inferred model is the negative test log-likelihood $\mathrm{NLL}(h_\pi) = -\frac{1}{|D_{ts}|}\sum_{(x,y)\in D_{ts}} \log \mathrm{Cat}(y|h_\pi(x))$.

ii) *Class overlap*, evaluated by $R^*(f^\pi_{\mathrm{true}}(x))$. As there is no way to measure this quantity from $D_{ts}$, it is approximated indirectly via the Expected Calibration Error $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{|D_{ts}|}\big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|$, where $B_m = \{n \,|\, \max_k h^k_\pi(x_n) \in ((m-1)/M, m/M]\}$ are $M$ bins that partition the test set $D_{ts}$ into $D_{ts} = B_1 \cup \dots \cup B_M$ and are characterized by their accuracy and confidence scores

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{n\in B_m} \mathbb{I}_{y_n = \arg\max_k h^k_\pi(x_n)}, \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{n\in B_m} \max_k h^k_\pi(x_n). \qquad (5)$$

iii) *Domain mismatch* defines the rarity of an input pattern $x^*$ for the target task, that is $\Pr[p(x^*) < \epsilon]$ for small $\epsilon$. This quantity cannot be measured since $p(x)$ is unknown. It is approximated indirectly as the success of discriminating between samples $x^*$ coming from a different data distribution and those in $D_{ts}$, measured by the Area Under the ROC Curve (AUROC) with respect to a score derived from $h_\pi(x)$.

## 3 UNCERTAINTY QUANTIFICATION WITH PARAMETRIC AND EVIDENTIAL BAYESIAN MODELS

**Parametric Bayesian Models.** A commonplace design choice is to build the hypothesis space by equipping the hypothesis function with a global parameter $\theta$ that takes values from a feasible set $\Theta$, that is $\mathcal{H}_\pi = \{h^\theta_\pi(x)\,|\,\theta\in\Theta\}$. Parametric Bayesian Models (PBMs) employ a prior belief $\theta \sim p(\theta)$, which is updated via Bayesian inference on a training set $D_{tr}$ as $p(\theta|D_{tr}) = \prod_{(x,y)\in D_{tr}} \mathrm{Cat}(y|h^\theta_\pi(x))\, p(\theta)/p(D_{tr})$. Consider the update on the posterior belief

$$p\big(\theta\,\big|\,D_{tr}\cup\{(x', y')\}\big) = \mathrm{Cat}(y'|h^\theta_\pi(x'))\, p(\theta|D_{tr})/Z \qquad (6)$$

after a new observation $(x', y')$, where the normalizer can be expressed as

$$Z = \frac{1}{p(D_{tr})}\int \mathrm{Cat}(y'|h^\theta_\pi(x'))\, L(\theta)\, p(\theta)\, d\theta, \qquad (7)$$

with $L(\theta) = \prod_{(x,y)\in D_{tr}} \mathrm{Cat}(y|h^\theta_\pi(x))$. This can be seen as an inner product between the functions $\mathrm{Cat}(y'|h^\theta_\pi(x'))$ and $L(\theta)$ on an embedding space modulated by $p(\theta)$. As the sample size grows, the high-density regions of $\mathrm{Cat}(y'|h^\theta_\pi(x'))$ and $L(\theta)$ superpose with higher probability, causing $Z$ to increase and $p(\theta|D_{tr}\cup\{(x',y')\})$ to become increasingly more peaked; consequently, we will have

$$\lim_{|D_{tr}|\to+\infty} \mathrm{Var}[\theta|D_{tr}] = 0. \qquad (8)$$

This conjecture depends on the assumption that a single random variable $\theta$ modulates all samples.
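Eq. 8 is easiest to see in a conjugate toy model where the posterior variance is available in closed form. The sketch below uses a Beta-Bernoulli model as an illustrative stand-in for the neural likelihoods of this section; it is not the paper's inference scheme, only a demonstration of the shrinkage claim.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7  # a single global parameter modulating all samples

# Beta(1, 1) prior on theta; after n Bernoulli(theta) observations with k
# successes, the posterior is the conjugate Beta(1 + k, 1 + n - k).
for n in [10, 100, 1_000, 10_000]:
    k = rng.binomial(n, theta_true)
    a, b = 1 + k, 1 + n - k
    post_var = a * b / ((a + b) ** 2 * (a + b + 1))  # Var[theta | D_tr]
    print(f"n = {n:6d}   Var[theta | D_tr] = {post_var:.2e}")  # decays roughly as 1/n
```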
The variance of a prediction in favor of a class $k$ decomposes according to the law of total variance as

$$\mathrm{Var}[y=k\,|\,x^*, D_{tr}] = \underbrace{\mathrm{Var}_{\theta\sim p(\theta|D_{tr})}\big[h^\theta_{\pi,k}(x^*)\big]}_{\text{Reducible model uncertainty}} + \underbrace{\mathbb{E}_{\theta\sim p(\theta|D_{tr})}\big[(1 - h^\theta_{\pi,k}(x^*))\, h^\theta_{\pi,k}(x^*)\big]}_{\text{Data uncertainty}}. \qquad (9)$$

Due to Eq. 8, the first term on the r.h.s. vanishes in the large-sample regime, and the second one coincides with the asymptotic solution of the maximum likelihood estimator $\theta_{MLE}$. In cases where $\mathcal{H}_\pi$ contains $f^\pi_{\mathrm{true}}(x)$, an ideal inference scheme has the potential to recover it and we get

$$\lim_{N\to+\infty} \mathbb{E}_{\theta\sim p(\theta|D_{tr})}\big[(1-h^\theta_{\pi,k}(x^*))\,h^\theta_{\pi,k}(x^*)\big] = \big(1 - h^{\theta_{MLE}}_{\pi,k}(x^*)\big)\, h^{\theta_{MLE}}_{\pi,k}(x^*) = \big(1 - f^{\pi,k}_{\mathrm{true}}(x^*)\big)\, f^{\pi,k}_{\mathrm{true}}(x^*).$$

The simple proposition below sheds light on the meaning of this result. It points to the fact that the Bayes risk of a classifier (Eq. 4) and the second component of its predictive variance (Eq. 9) are monotonically related. This gives a mechanistic explanation for why the second term on the r.h.s. of Eq. 9 quantifies class overlap. Despite the commonplace allocation of the wording *data uncertainty* for this term, we are not aware of prior work that derives it formally from basic learning-theoretic concepts.

**Proposition.** The following implication holds for any pair $\pi, \pi' \in [0.5, 1]$:

$$\min(1-\pi,\ \pi) \geq \min(1-\pi',\ \pi') \implies (1-\pi)\pi \geq (1-\pi')\pi'.$$

*Proof.* Define $\pi = 0.5 + \epsilon$ and $\pi' = 0.5 + \epsilon'$ for $\epsilon, \epsilon' \geq 0$. As $\min(1-\pi,\pi) \geq \min(1-\pi',\pi')$ implies $\epsilon \leq \epsilon'$, we get $(1-\pi)\pi = (0.5-\epsilon)(0.5+\epsilon) = 0.25 - \epsilon^2 \geq 0.25 - \epsilon'^2 = (1-\pi')\pi'$.

The conclusion is that $(1 - f^{\pi,k}_{\mathrm{true}}(x))\, f^{\pi,k}_{\mathrm{true}}(x)$ grows monotonically with $R^*(f^\pi_{\mathrm{true}}(x))$. Hence, the second term on the r.h.s. of Eq. 9 is caused by the class overlap and can be used to approximate the Bayes risk. Put together, the first term of the decomposition reflects the missing knowledge on the optimal value of $\theta$, which can be reduced by increasing the training set size, while the second reflects the uncertainty stemming from the properties of the true data distribution. Hence we refer to the first term as the *reducible model uncertainty* and to the second as the *data uncertainty*.

**Evidential Bayesian Models.** In all the analysis above, we assumed a fixed and unknown $\pi$ that modulates the whole process and cannot be identified by the learning scheme. When we characterize the uncertainty on the class distribution by a prior belief $\pi \sim p(\pi)$, we attain the joint $p(y,\pi|x) = \Pr[y|x,\pi]\, p(\pi|x)$. The variance of the Bayes-optimal classifier that averages over this uncertainty, $\Pr[y|x] = \int \Pr[y|x,\pi]\, p(\pi|x)\, d\pi$, decomposes as

$$\mathrm{Var}[y=k\,|\,x] = \underbrace{\mathrm{Var}_{\pi\sim p(\pi|x)}\big[f^{\pi,k}_{\mathrm{true}}(x)\big]}_{\text{Irreducible model uncertainty}} + \underbrace{\mathbb{E}_{\pi\sim p(\pi|x)}\big[(1 - f^{\pi,k}_{\mathrm{true}}(x))\, f^{\pi,k}_{\mathrm{true}}(x)\big]}_{\text{Data uncertainty}}.$$

The main source of uncertainty, $p(\pi|x)$, is not a posterior this time but an empirical prior on a single observation $x$, hence its variance will not shrink as $|D_{tr}|\to+\infty$. Therefore, the first term on the r.h.s. reflects the *irreducible model uncertainty* caused by the missing knowledge about $\pi$. The second term indicates data uncertainty as in PBMs, but accounts for the local uncertainty at $x$ caused by $\pi$. Evidential Bayesian Models (EBMs) (Sensoy et al., 2018; Malinin & Gales, 2018) suggest

$$p(y,\pi|x) = \Pr[y|x,\pi]\, p(\pi|x) \approx \Pr[y|\pi]\, q_\psi(\pi|x),$$

where $q_\psi(\pi|x)$ is a density function parameterized by $\psi$ and the dependency of the class-conditional on the input is dropped. Note that the resultant model employs uncertainty on individual data points via $\pi$, but it does not have any global random variables.
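Both two-term decompositions above can be estimated by plain Monte Carlo once draws of the modulating variable are available. The sketch below does this for the PBM form of Eq. 9; the posterior draws are simulated stand-ins rather than the output of an actual inference run.

```python
import numpy as np

def variance_decomposition(probs):
    """Monte Carlo estimate of Eq. 9 from S draws of the modulating variable.

    probs: (S, K) array; row s holds h_pi^{theta_s}(x*) for theta_s ~ p(theta | D_tr),
    or f_true^{pi_s}(x) for pi_s ~ p(pi | x) in the EBM case.
    """
    model_unc = probs.var(axis=0)                    # Var over draws of h^k
    data_unc = ((1.0 - probs) * probs).mean(axis=0)  # E[(1 - h^k) h^k]
    return model_unc, data_unc

rng = np.random.default_rng(1)
confident = np.clip(rng.normal([0.95, 0.05], 0.01, size=(500, 2)), 0.0, 1.0)
boundary = np.clip(rng.normal([0.55, 0.45], 0.01, size=(500, 2)), 0.0, 1.0)
print(variance_decomposition(confident))  # both terms small
print(variance_decomposition(boundary))   # data uncertainty close to 0.25
```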
The standard evidential model training is performed by maximizing the expected likelihood subject to a regularizer that penalizes the divergence of $q_\psi(\pi|x)$ from the prior belief:

$$\arg\min_\psi\ -\int \log \Pr[y|\pi]\, q_\psi(\pi|x)\, d\pi + \beta\, \mathrm{KL}\big(q_\psi(\pi|x)\,\|\,p(\pi)\big).$$

Although presented in the original work as ad-hoc design choices, such a training objective can alternatively be viewed as $\beta$-regularized (Higgins et al., 2017) variational inference of $\Pr[y|\pi]\, p(\pi|x)$ with the approximate posterior $q_\psi(\pi|x)$ (Chen et al., 2019).

## 4 COMPLETE BAYESIAN MODELS: BEST OF BOTH WORLDS

The uncertainty decomposition characteristics of PBMs and EBMs provide complementary strengths towards quantification of the three uncertainty types that constitute total calibration.

i) *Model misfit:* The PBM is favorable since it provides shrinking posterior variance with growing training set size (Eq. 8) and the recovery of $\theta_{MLE}$, which inherits the consistency guarantees of the frequentist approach. Since the uncertainty on the local variable $\pi$ does not shrink for large samples, NLL would measure how well $\psi$ can capture the regularities of heteroscedastic noise across individual samples, which is a more difficult task than quantifying the fit of a parametric model.

ii) *Class overlap:* The EBM is favorable since its data uncertainty term $\mathbb{E}_{\pi\sim p(\pi|x^*)}[(1-h^\theta_{\pi,k}(x^*))\,h^\theta_{\pi,k}(x^*)]$ is likely to have less estimator variance than $\mathbb{E}_{\theta\sim p(\theta|D_{tr})}[(1-h^\theta_{\pi,k}(x^*))\,h^\theta_{\pi,k}(x^*)]$, and calculating $p(\pi|x^*)$ does not require approximate inference as for $p(\theta|D_{tr})$.

iii) *Domain mismatch:* The PBM is favorable since the effect of the modulation of a global $\theta$ on the variance of $Z$ also applies to the posterior predictive distribution, with the only difference that $x'$ is a test sample. When the test sample is far away from the training set, i.e. $\min_{x\in D_{tr}} \|x' - x\|$ is large, then $p(y'|x',\theta)$ is less likely to superpose with those regions of $\theta$ where $L(\theta)$ is pronounced, hence will be flattened out, leading to a higher-variance posterior predictive for samples coming from another domain. Since the EBM has only local random variables, the variance of its posterior is less likely to build a similar discriminative property for domain detection.

**Main Hypothesis.** A model that inherits the model uncertainty of the PBM and the data uncertainty of the EBM can simultaneously quantify: i) model misfit, ii) class overlap, and iii) domain mismatch.

Guided by our main hypothesis, we combine the PBM and EBM designs within a unified framework that inherits the desirable uncertainty quantification properties of each individual approach:

$$\Pr[y|\pi]\, p(\pi|\theta, x)\, p(\theta), \qquad (10)$$

which we refer to as a Complete Bayesian Model (CBM), hinting at its capability to solve the total calibration problem. The model simply introduces a global random variable $\theta$ into the empirical prior $p(\pi|\theta,x)$ of the EBM. The variance of the posterior predictive of the resultant model decomposes as

$$\mathrm{Var}[y|x] = \underbrace{\mathrm{Var}_{p(\theta|D)}\Big[\mathbb{E}_{p(\pi|x,\theta)}\big[\mathbb{E}[y|\pi,x]\big]\Big]}_{\text{Reducible model uncertainty}} + \underbrace{\mathbb{E}_{p(\theta|D)}\Big[\mathrm{Var}_{p(\pi|\theta,x)}\big[\mathbb{E}[y|\pi]\big]\Big]}_{\text{Irreducible model uncertainty}} + \underbrace{\mathbb{E}_{p(\theta|D)}\Big[\mathbb{E}_{p(\pi|x,\theta)}\big[\mathrm{Var}[y|\pi]\big]\Big]}_{\text{Data uncertainty}}, \qquad (11)$$

where the r.h.s. of the decomposition has the following desirable properties: i) the first term recovers the form of the reducible model uncertainty term of the PBM when $\pi$ is marginalized, $\mathrm{Var}_{p(\theta|D)}[\mathbb{E}_{p(\pi|x,\theta)}[\mathbb{E}[y|\pi,x]]] = \mathrm{Var}_{p(\theta|D)}[\mathbb{E}[y|\theta,x]]$; ii) the second term is the irreducible model uncertainty term of the EBM averaged over the uncertainty on the newly introduced global variable $\theta$.
It quantifies the portion of uncertainty stemming from heteroscedastic noise, which is not explicit in the decomposition of the PBM; iii) the third term recovers the data uncertainty term of the EBM when $\theta$ is marginalized: $\mathbb{E}_{p(\theta|D)}\big[\mathbb{E}_{p(\pi|x,\theta)}[\mathrm{Var}[y|\pi]]\big] = \mathbb{E}_{p(\pi|x)}[\mathrm{Var}[y|\pi]]$.

## 5 THE EVIDENTIAL TURING PROCESS: AN EFFECTIVE CBM REALIZATION

Equipping the empirical prior $p(\pi|\theta,x)$ of the CBM with the capability of learning to generate accurate prior beliefs on individual samples is the key to total calibration. We adopt the following two guiding principles to obtain highly expressive empirical priors: i) the existence of a global random variable $\theta$ can be exploited to express complex reducible model uncertainties using the Neural Processes (Garnelo et al., 2018b) framework; ii) the conditioning of $p(\pi|\theta,x)$ on a global variable $\theta$ and an individual observation $x$ can be exploited with an attention mechanism where $\theta$ is a memory and $x$ is a query. The dependency on a context set at test time can be lifted using the Neural Turing Machine (Graves et al., 2014) design, which maintains an external memory that is updated by a rule detached from the optimization scheme of the main loss function.

We extend the stochastic process derivation of Neural Processes by marginalizing out a local variable from a PBM with an external memory that follows the Neural Turing Machine design, and arrive at a new family of stochastic processes called a *Turing Process*, formally defined as follows.

**Definition 1.** A *Turing Process* is a collection of random variables $Y = \{y_i \,|\, i\in I\}$ for an index set $I = \{1,\dots,K\}$ with arbitrary $K\in\mathbb{N}^+$ that satisfies the two properties

$$\text{(i)}\quad p_M(Y) = \int \prod_{y_i\in Y} p(y_i|\theta)\, p_M(\theta)\, d\theta, \qquad \text{(ii)}\quad p_{M'}(Y|Y') = \int \prod_{y_i\in Y} p(y_i|\theta)\, p_{M'}(\theta)\, d\theta,$$

for another random variable collection $Y'$ of arbitrary size that lives in the same probability space as $Y$, a probability measure $p_M(\theta)$ with free parameters $M$, and some function $M' = r(M, Y')$.

The Turing Process can express all stochastic processes, since Property (i) is sufficient to satisfy Kolmogorov's Extension Theorem (Øksendal, 1992), and choosing $r$ simply to be an identity mapping reverts us back to the generic definition of a stochastic process. The Turing Process goes further by permitting an explicit specification of how information accumulates within the prior. This differs from the Neural Process, which performs the conditioning $p(Y|Y')$ by applying an amortized posterior $q(\theta|Y')$ on a plain stochastic process that satisfies only Property (i), hence maintains a data-agnostic prior $p(\theta)$ and depends on a context set also at prediction time.

Given a disjoint partitioning of the data $D = D_C \cup D_T$ into a context set $(x', y') \in D_C$ and a target set $(x, y) \in D_T$, we propose the following data generating process on the target set:

$$p(y, D_T, \pi, \theta) = \underbrace{p(w)\, p_M(Z)}_{\text{External memory}} \prod_{(x,y)\in D_T} \Big[\, p(y|\pi)\ \underbrace{p(\pi|Z, w, x)}_{\text{Input-specific prior}} \,\Big]. \qquad (12)$$

We decouple the global parameters $\theta = \{w, Z\}$ into a $w$ that parameterizes a map from the input to the hyperparameter space and an external memory $Z = \{z_1, \dots, z_R\}$ governed by the distribution $p_M(Z)$, parameterized by an $M$ that embeds the context data during training. The memory variable $Z$ updates its belief conditioned on the context set $D_C$ by updating its parameters with an explicit rule:

$$p_M(Z|D_C) \triangleq p_{M'}(Z), \qquad M' = \mathbb{E}_{Z\sim p_M(Z)}\big[r(Z, D_C)\big].$$

The hyperparameters of $p(\pi|Z,w,x)$ are then determined by querying the memory $Z$ for each input observation $x$ using an attention mechanism, for instance a transformer network (Vaswani et al., 2017).
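A minimal PyTorch sketch of such a query is given below. It mirrors the attention of the case study in Sec. 7 (softmax over key-encoded cells, exponentiated read-out as Dirichlet concentration), but the module is illustrative: the cell count, dimensions, and the shortcut of querying the cell means instead of sampled memory states are our assumptions.

```python
import torch
import torch.nn as nn

class MemoryQuery(nn.Module):
    """Query an external memory of R cells to parameterize the input-specific
    prior p(pi | Z, w, x) of a Complete Bayesian Model."""

    def __init__(self, num_cells: int, dim: int):
        super().__init__()
        self.M = nn.Parameter(torch.randn(num_cells, dim))  # memory cell means m_1..m_R
        self.key = nn.Linear(dim, dim)                      # key generator network k_psi

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, dim) encoded inputs v_w(x). For brevity we attend over the
        # cell means directly instead of sampling Z ~ p_M(Z).
        scores = v @ self.key(self.M).t() / v.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)  # phi(v_w(x), z) over the R cells
        read = attn @ self.M           # attention read-out a(v_w(x); Z)
        return read.exp()              # positive Dirichlet concentration

query = MemoryQuery(num_cells=8, dim=10)
alpha = query(torch.randn(4, 10))             # (batch, K) concentrations
prior = torch.distributions.Dirichlet(alpha)  # input-specific prior on pi
```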
Choosing the variational distribution as $q_\lambda(\theta, \pi|x) = q_\lambda(w)\, p_M(Z)\, q(\pi|Z,w,x)$, we attain the variational free energy formula for our target model:

$$\mathcal{F}_{ETP}(\lambda) = \mathbb{E}_{q_\lambda(w)\, p_M(Z)}\Bigg[\sum_{(x,y)\in D}\mathbb{E}_{q_\lambda(\pi|Z,w,x)}\bigg[-\log p(y|\pi) + \log\frac{q_\lambda(\pi|Z,w,x)}{p(\pi|Z,w,x)}\bigg] + \log\frac{q_\lambda(w)}{p(w)}\Bigg]. \qquad (13)$$

The learning procedure then alternates between updating the variational parameters $\lambda$ via gradient descent on $\mathcal{F}_{ETP}(\lambda)$ and updating the parameters of the memory variable $Z$ via an external update rule, as summarized in Figure 2b. ETP can predict a test sample $(x^*, y^*)$ by only querying the input $x^*$ on the memory $Z$ and evaluating the prior $p(\pi^*|Z,w,x^*)$, without need for any context set:

$$p(y^*|D_{tr}, x^*) \approx \iint \bigg[\int p(y^*|\pi^*)\, q(\pi^*|Z,w,x^*)\, d\pi^*\bigg]\, p_M(Z)\, q_\lambda(w)\, dZ\, dw. \qquad (14)$$

Figure 2: The Evidential Turing Process. (a) Plate diagram over $(x,y)\in D_T$ and $(x',y')\in D_C$: the essential design choice is how $p(\pi|Z,w,x)$ is constructed thanks to an external memory $M$ that can be queried with respect to any input $x$ without needing to store context data at test time. The dashed arrow indicates that $Z$ can condition on a context $D_C$ by updating its parameters $M$ with an explicit function $r$. (b) The ETP training routine:

    while model not converged do
        for D_batch in D do
            choose D_C from D_batch
            M <- E_{Z ~ p_M(Z)}[r(Z, D_C)]
            lambda <- lambda - grad_lambda F_ETP(lambda)
        end
    end

## 6 MOST IMPORTANT SPECIAL CASES OF THE EVIDENTIAL TURING PROCESS

As we build the design of the Evidential Turing Process (ETP) on the CBM framework that overarches the prevalent parametric and evidential approaches, its ablation amounts to recovering many state-of-the-art modeling approaches, as listed in Table 1 and detailed below.

Table 1: Ablation table. Deactivating the components of ETP one at a time equates it to state-of-the-art methods. We curate ENP as a surrogate for the Attentive NP (ANP) of Kim et al. (2019), which requires test-time context, and as an improved NP variant with reduced estimator variance thanks to $\pi$. (✓: component active; ✗: component inactive; $D_C^{(test)}$: demand for context data at test time.)

| Model | $\pi$ | $Z$ | $M$ | $r$ | $D_C^{(test)}$ | Compare |
|---|---|---|---|---|---|---|
| ETP (Target) | ✓ | ✓ | ✓ | ✓ | ✗ | - |
| ANP (Kim et al., 2019) | ✗ | ✓ | ✗ | ✗ | ✓ | No |
| ENP (Surrogate) | ✓ | ✓ | ✗ | ✓ | ✗ | Yes |
| NP (Garnelo et al., 2018b) | ✗ | ✓ | ✗ | ✗ | ✓ | No |
| EDL (Sensoy et al., 2018) | ✓ | ✗ | ✗ | ✗ | ✗ | Yes |
| BNN (Molchanov et al., 2017) | ✗ | ✗ | ✗ | ✗ | ✗ | Yes |

**Bayesian Neural Net (BNN)** (Neal, 1995; MacKay, 1992; 1995) is the most characteristic PBM example, parameterizing a likelihood $p(y|\theta,x)$ with a neural network $f_\theta(\cdot)$ whose weights follow a distribution $\theta\sim p(\theta)$. In the experiments we assume the prior $\theta\sim\mathcal{N}(\theta|0,\beta^{-1}I)$ and infer its posterior with a mean-field variational posterior $q_\lambda(\theta)$, applying the reparameterization trick (Kingma et al., 2015; Molchanov et al., 2017).

**Evidential Deep Learning (EDL)** (Sensoy et al., 2018) is the characteristic example of EBMs. EDL employs an uninformative Dirichlet prior on the class probabilities, which then sets the mean of a normal distribution on the one-hot-coded class labels: $y, \pi \sim \mathrm{Dir}(\pi|1,\dots,1)\,\mathcal{N}(y|\pi, 0.5\, I_K)$. EDL links the input observations to the distribution of class probabilities by performing amortized variational inference with the approximate distribution $q_\lambda(\pi; x) = \mathrm{Dir}(\pi|\alpha_\lambda(x))$.

**Neural Processes (NP)** (Garnelo et al., 2018a;b) are a PBM approach used to generate stochastic processes from neural networks by integrating out the global parameters $\theta$ as in Property (i) of Definition 1. NPs differ from plain PBMs by amortizing an arbitrarily sized context set $D_C$ on the global parameters, $q(\theta|D_C)$, via an aggregator network during inference. NPs can be viewed as Turing Processes with an identity map $r$.
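In contrast to such amortized conditioning, ETP conditions through the explicit memory update of Figure 2b. A minimal training skeleton of that alternation is sketched below; `model` is a hypothetical object exposing a memory buffer and the free energy of Eq. 13, and `rule_r` stands in for the update rule $r$, so all names here are illustrative rather than the reference implementation.

```python
import torch

def split_context(batch, frac=0.5):
    """Illustrative context/target splitter: the first part of the batch is D_C."""
    x, y = batch
    n = int(frac * len(x))
    return (x[:n], y[:n]), (x[n:], y[n:])

def train_etp(model, loader, rule_r, epochs=100, lr=1e-3):
    """Alternate the explicit memory update with a gradient step on F_ETP (Fig. 2b)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            context, target = split_context(batch)  # choose D_C within the batch
            with torch.no_grad():                   # the memory update is detached
                model.memory.copy_(rule_r(model.memory, context))
            loss = model.free_energy(target)        # F_ETP(lambda), Eq. 13
            opt.zero_grad()
            loss.backward()
            opt.step()
```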
Follow-up work equipped NPs with attention (Kim et al., 2019), translation equivariance (Gordon et al., 2020; Foong et al., 2020), and sequential prediction (Singh et al., 2019; Yoon et al., 2020). All of these NP variants assume prediction-time access to context, making them suitable for interpolation tasks such as image in-painting, but not for generic prediction problems.

**Evidential Neural Process (ENP)** is the EBM variant of a neural process that we introduce to bring together the best practices of EDL and NPs and to build the strongest possible baseline for ETP, defined as

$$p(y, \pi, Z|D_C) = \mathrm{Cat}(y|\pi)\ \mathrm{Dir}\big(\pi\,|\,\alpha_\theta(Z)\big)\ \mathcal{N}\big(Z\,|\,\mu, \sigma^2\big), \qquad (\mu, \sigma^2) = r\big(\{h_\theta(x_j, y_j)\ |\ (x_j, y_j)\in D_C\}\big), \qquad (15)$$

with an aggregation rule $r(\cdot)$, an encoder net $h_\theta(\cdot,\cdot)$, and a neural net $\alpha_\theta(\cdot)$ mapping the global variable $Z$ to the class probability simplex. We further improve upon ENP with an attentive NP approach (Kim et al., 2019) by replacing the fixed aggregation rule $r$ with an attention mechanism, thereby allowing the model to switch focus depending on the target. At prediction time, we use $Z\sim\mathcal{N}(Z|1, \kappa^2 I)$ for small $\kappa$ to induce an uninformative prior on $\pi$ in the absence of a context set.

## 7 CASE STUDY: CLASSIFICATION WITH EVIDENTIAL TURING PROCESSES

We demonstrate a realization of an Evidential Turing Process on a classification problem below:

$$y, \pi, w, Z\,|\,M \sim \mathrm{Cat}(y|\pi)\ \mathrm{Dir}\big(\pi\,|\,\exp(a(v_w(x); Z))\big)\ \mathcal{N}(w|0, \beta^{-1}I)\ \prod_{r=1}^{R}\mathcal{N}(z_r|m_r, \kappa^2 I). \qquad (16)$$

The global variable $Z$ is parameterized by the memory $M = (m_1,\dots,m_R)$ consisting of $R$ cells $m_r\in\mathbb{R}^K$ (in the experiments we fix $\kappa^2 = 0.1$; preliminary results showed stability for larger and smaller values as well). The input data $x$ is mapped to the same space via an encoding neural net $v_w(\cdot)$, parameterizing a Dirichlet distribution over the class probabilities $\pi$ via the attention mechanism $a(v_w(x); Z) = \sum_{z'\in Z}\phi(v_w(x), z')\, z'$, where $\phi(v_w(x), z) = \mathrm{softmax}\big(\{k_\psi(z)^\top v_w(x)/\sqrt{K}\ |\ z\in Z\}\big)$. The function $k_\psi(\cdot)$ is the key generator network operating on the embedding space. We update the memory cells as a weighted average between remembering and updating:

$$m \leftarrow \mathbb{E}_{Z\sim p(Z|M)}\Bigg[\tanh\bigg(\gamma m + (1-\gamma)\sum_{(x,y)\in D_C}\phi(v_w(x), z)\,\big[\mathrm{onehot}(y) + \mathrm{soft}(v_w(x))\big]\bigg)\Bigg],$$

where $\gamma\in(0,1)$ is a fixed scaling factor controlling the relative importance of the two parts. The second term adds new information to the cell as a weighted sum over all pairs $(x,y)$, taking both the true label (via $\mathrm{onehot}(y)$) and the uncertainty in the prediction (via $\mathrm{soft}(v_w(x))$) into account. The final $\tanh(\cdot)$ transformation ensures that the memory content remains in a fixed range across updates, as practised in LSTMs (Hochreiter & Schmidhuber, 1997).

## 8 EXPERIMENTS

We benchmark ETP against the state of the art as laid out in the ablation plan of Table 1, according to the total calibration criteria developed in Sec. 2, on five real-world data sets. See Appendix B for full details on the experimental setup in each case and for the hyperparameters used throughout. We provide a reference implementation of the proposed model and the experimental pipeline (https://github.com/ituvisionlab/EvidentialTuringProcess).

Figure 3: Results averaged across 19 types of corruption (e.g. motion blur, fog, pixelation) applied on the test splits of FMNIST, CIFAR10, and SVHN at five severity levels (curves: BNN, EDL, ENP, ETP).

**Total calibration.** Table 2 reports the prediction error and the three performance scores introduced in Sec. 2 that constitute total calibration. We perform the out-of-domain detection task as in Malinin & Gales (2018) and discriminate between the test split of the target domain and a data set from another domain based on the predictive entropy.
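Both test-time scores can be computed from the predictive class probabilities alone. The sketch below gives one plausible implementation of the predictive-entropy OOD score and the ECE of Eq. 5; the binning and scoring details are our assumptions, not extracted from the released pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def entropy(probs, eps=1e-12):
    """Predictive entropy per sample for class probabilities of shape (N, K)."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def ood_auroc(in_probs, out_probs):
    """AUROC of separating in-domain from out-of-domain samples by entropy."""
    scores = np.concatenate([entropy(in_probs), entropy(out_probs)])
    labels = np.concatenate([np.zeros(len(in_probs)), np.ones(len(out_probs))])
    return roc_auc_score(labels, scores)

def ece(probs, labels, num_bins=10):
    """Expected Calibration Error with equal-width confidence bins (Eq. 5)."""
    conf, pred = probs.max(axis=-1), probs.argmax(axis=-1)
    total = 0.0
    for m in range(num_bins):
        mask = (conf > m / num_bins) & (conf <= (m + 1) / num_bins)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            total += mask.mean() * abs(acc - conf[mask].mean())
    return total
```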
ETP is the only method that consistently ranks among the top performers on all data sets with respect to all performance scores, which supports our main hypothesis that a CBM should outperform PBMs and EBMs in total calibration.

**Response to gradual domain shift.** In order to assess how well the models cope with a gradual transition away from their native domain, we evaluate their ECE performance on data perturbed by 19 types of corruption (Hendrycks & Dietterich, 2019) at five severity levels. Figure 3 depicts the performance of the models averaged across all corruptions. ETP gives the best performance in nearly all data sets and distortion severity levels (see Appendix B.3, Table 4, for a tabular version of these results).

**Computational cost.** We measured the average wall-clock time per epoch on CIFAR10 to be 10.8 ± 0.1 seconds for ETP, 8.2 ± 0.1 seconds for BNN, 8.6 ± 0.1 seconds for EDL, and 9.5 ± 0.2 seconds for ENP. The relative standings of the models remain unchanged on all the other data sets.

Table 2: Quantitative results on five data sets showing mean ± standard deviation across 10 repetitions. Best performing models that overlap within three standard deviations are highlighted in bold.

*Prediction accuracy as % test error*

| Model | IMDB (LSTM) | Fashion (LeNet5) | SVHN (LeNet5) | CIFAR10 (LeNet5) | CIFAR100 (ResNet18) |
|---|---|---|---|---|---|
| BNN | 16.4 ± 0.6 | 7.9 ± 0.1 | 7.9 ± 0.1 | 15.3 ± 0.3 | 30.2 ± 0.3 |
| EDL | 38.3 ± 13.3 | 8.6 ± 0.1 | 7.3 ± 0.1 | 18.5 ± 0.2 | 45.2 ± 0.4 |
| ENP | 50.0 ± 0.0 | 7.9 ± 0.2 | 6.7 ± 0.1 | 14.8 ± 0.2 | 39.0 ± 0.3 |
| ETP (Target) | 15.8 ± 1.3 | 7.9 ± 0.2 | 6.9 ± 0.1 | 15.3 ± 0.2 | 29.2 ± 0.3 |

*In-domain calibration as % Expected Calibration Error (ECE)*

| Model | IMDB | Fashion | SVHN | CIFAR10 | CIFAR100 |
|---|---|---|---|---|---|
| BNN | 14.4 ± 0.4 | 6.7 ± 0. | 6.5 ± 0. | 5.5 ± 0.3 | 15.2 ± 0.0 |
| EDL | 41.1 ± 2.6 | 3.7 ± 0.2 | 4.0 ± 0.1 | 9.0 ± 0.2 | 5.3 ± 0.4 |
| ENP | 0.8 ± 1.6 | 6.0 ± 0.2 | 10.7 ± 0.2 | 7.2 ± 0.3 | 39.7 ± 0.4 |
| ETP (Target) | 3.1 ± 0.4 | 2.6 ± 0.2 | 2.6 ± 0.1 | 2.7 ± 0.1 | 6.6 ± 0.1 |

*Model fit as negative test log-likelihood*

| Model | IMDB | Fashion | SVHN | CIFAR10 | CIFAR100 |
|---|---|---|---|---|---|
| BNN | 0.47 ± 0.0 | 0.65 ± 0.0 | 0.71 ± 0.0 | 0.50 ± 0.0 | 1.78 ± 0.0 |
| EDL | 0.66 ± 0.1 | 0.37 ± 0.0 | 0.34 ± 0.0 | 0.72 ± 0.0 | 2.24 ± 0.0 |
| ENP | 0.69 ± 0.0 | 0.34 ± 0.0 | 0.33 ± 0.0 | 0.50 ± 0.0 | 2.52 ± 0.0 |
| ETP (Target) | 0.37 ± 0.0 | 0.29 ± 0.0 | 0.26 ± 0.0 | 0.46 ± 0.0 | 1.36 ± 0.0 |

*Out-of-domain detection as % Area Under ROC Curve (OOD data per in-domain column)*

| Model | Random (IMDB) | MNIST (Fashion) | CIFAR100 (SVHN) | SVHN (CIFAR10) | TinyImageNet (CIFAR100) |
|---|---|---|---|---|---|
| BNN | 60.9 ± 4.2 | 75.9 ± 2.3 | 86.2 ± 0.5 | 84.1 ± 1.3 | 97.2 ± 0.5 |
| EDL | 55.1 ± 5.1 | 77.5 ± 2.0 | 90.9 ± 0.3 | 79.2 ± 0.7 | 89.6 ± 0.3 |
| ENP | 53.7 ± 5.7 | 88.9 ± 1.0 | 92.4 ± 0.4 | 81.4 ± 0.8 | 100.0 ± 0.1 |
| ETP (Target) | 59.1 ± 5.1 | 90.0 ± 0.9 | 90.0 ± 0.4 | 82.1 ± 0.6 | 99.6 ± 0.1 |

## 9 CONCLUSION

**Summary.** We first developed the first formal definition of total calibration. Then we analyzed how the designs of the two mainstream Bayesian modeling approaches affect their uncertainty characteristics. Next we introduced Complete Bayesian Models as a unifying framework that inherits the complementary strengths of the existing ones. We developed the Evidential Turing Process as an optimal realization of Complete Bayesian Models. We derived an experimental setup from our formal definition of total calibration and a systematic ablation of our target model into the strongest representatives of the state of the art. We observed on five real-world tasks that the Evidential Turing Process is the only model that excels at all three aspects of total calibration simultaneously.
**Broad impact.** Our work delivers evidence for the claim that epistemic awareness of a Bayesian model is a capability learnable from in-domain data alone, as it appears in biological intelligence via closed-loop interactions of neuronal systems with stimuli. The implications of our work could be followed up in interdisciplinary venues, with a focus on the relation of associative memory and attention to the neuroscientific roots of epistemic awareness (Lorenz, 1978).

**Limitations and ethical concerns.** While we observed ETP to outperform BNN variants in our experiments, whether BNNs could bridge the gap after further improvements in approximate inference remains an open question. The attention mechanism of ETP could also be improved, for instance by upgrading it to transformer nets (Vaswani et al., 2017). The explainability of the memory content of ETP deserves further investigation. The generalizability of our results to very deep architectures, such as ResNet-152 or DenseNet-161, could be the topic of a separate large-scale empirical study. ETP does not improve the explainability and fairness of decisions stemming from the underlying design choices, such as the architecture of the neural nets used in the pipeline. Potential negative societal impacts of deep neural net classifiers arising from these two factors need to be addressed separately before the real-world deployment of our work.

## REFERENCES

M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions. Dover, 1965.

W. Chen, Y. Shen, H. Jin, and W. Wang. A Variational Dirichlet Framework for Out-of-Distribution Detection. arXiv:1811.07308, 2019.

A. Foong, W. Bruinsma, J. Gordon, Y. Dubois, J. Requeima, and R. Turner. Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural Processes. In NeurIPS, 2020.

M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y.W. Teh, D. Rezende, and S.M.A. Eslami. Conditional Neural Processes. In ICML, 2018a.

M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D.J. Rezende, S.M. Eslami, and Y.W. Teh. Neural Processes. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018b.

J. Gordon, W.P. Bruinsma, A.Y.K. Foong, J. Requeima, Y. Dubois, and R.E. Turner. Convolutional Conditional Neural Processes. In ICLR, 2020.

A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

C. Guo, G. Pleiss, Y. Sun, and K.Q. Weinberger. On Calibration of Modern Neural Networks. In ICML, 2017.

D. Hendrycks and T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In ICML, 2019.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.

S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y.W. Teh. Attentive Neural Processes. In ICLR, 2019.

D.P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.

D.P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. In NeurIPS, 2015.

V. Kuleshov, N. Fenner, and S. Ermon. Accurate Uncertainties for Deep Learning using Calibrated Regression. In ICML, 2018.

Y. LeCun, C. Cortes, and C.J. Burges. MNIST Handwritten Digit Database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
K. Lorenz. Die Rückseite des Spiegels: Versuch einer Naturgeschichte menschlichen Erkennens. Piper, 1978.

D.J.C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 1992.

D.J.C. MacKay. Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems, 1995.

A. Malinin and M. Gales. Predictive Uncertainty Estimation via Prior Networks. In NeurIPS, 2018.

D. Molchanov, A. Ashukha, and D. Vetrov. Variational Dropout Sparsifies Deep Neural Networks. In ICML, 2017.

M.P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining Well Calibrated Probabilities using Bayesian Binning. In AAAI, 2015.

R. Neal. Bayesian Learning for Neural Networks. PhD thesis, 1995.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer-Verlag, 1992.

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.

H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

G. Schwalbe and M. Schels. A Survey on Methods for the Safety Assurance of Machine Learning Based Systems. In European Congress on Embedded Real Time Software and Systems, 2020.

M. Sensoy, L. Kaplan, and M. Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty. In NeurIPS, 2018.

G. Singh, J. Yoon, Y. Son, and S. Ahn. Sequential Neural Processes. In NeurIPS, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. Attention is All You Need. In NeurIPS, 2017.

J. Yoon, G. Singh, and S. Ahn. Robustifying Sequential Neural Processes. In ICML, 2020.

## A FURTHER ANALYTICAL RESULTS AND DERIVATIONS

### A.1 KULLBACK-LEIBLER DIVERGENCE BETWEEN TWO DIRICHLET DISTRIBUTIONS

As both our ETP and the EDL baseline use the Kullback-Leibler divergence between two Dirichlet distributions, we give the full details of its calculation here. For two distributions over a $K$-dimensional probability vector $\pi$, $\mathrm{Dir}(\pi|\alpha)$ and $\mathrm{Dir}(\pi|\beta)$, parameterized by $\alpha, \beta\in\mathbb{R}^K_+$, the following identity holds:

$$\mathrm{KL}\big(\mathrm{Dir}(\pi|\alpha)\,\|\,\mathrm{Dir}(\pi|\beta)\big) = \log\frac{\Gamma\big(\sum_k \alpha_k\big)}{\Gamma\big(\sum_k \beta_k\big)} + \sum_{k=1}^{K}\log\frac{\Gamma(\beta_k)}{\Gamma(\alpha_k)} + \sum_{k=1}^{K}(\alpha_k - \beta_k)\Big(\psi(\alpha_k) - \psi\big(\textstyle\sum_{k'}\alpha_{k'}\big)\Big),$$

where $\Gamma(\cdot)$ and $\psi(a) := \frac{d}{da}\log\Gamma(a)$ denote the gamma and digamma functions (Abramowitz & Stegun, 1965).

### A.2 THE EVIDENTIAL DEEP LEARNING OBJECTIVE

The analytical expression of the Evidential Deep Learning (Sensoy et al., 2018) loss as defined in the main paper can be computed as follows. For a single $K$-dimensional observation $y$, dropping the index $n$ from the notation and instead using the index for the dimensionality throughout the following equations, we have

$$\mathbb{E}_{p(\pi|x)}\big[\|y-\pi\|^2\big] = \mathbb{E}_{p(\pi|x)}\big[(y-\pi)^\top(y-\pi)\big] = \sum_{k=1}^{K}\mathbb{E}_{p(\pi|x)}\big[(y_k - \pi_k)^2\big] = \sum_{k=1}^{K}\mathbb{E}_{p(\pi|x)}\Big[\big(y_k - \mathbb{E}_{p(\pi|x)}[\pi_k] + \mathbb{E}_{p(\pi|x)}[\pi_k] - \pi_k\big)^2\Big] = \sum_{k=1}^{K}\big(y_k - \mathbb{E}_{p(\pi|x)}[\pi_k]\big)^2 + \mathrm{Var}_{p(\pi|x)}[\pi_k],$$

with the tractable expectation and variance of a Dirichlet-distributed $\pi$.
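Both the expectation above and the KL term given next are available in closed form. The PyTorch sketch below evaluates them with `torch.distributions`, whose Dirichlet KL matches the identity in A.1; the regularization weight and example values are illustrative.

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def edl_loss(alpha, y_onehot, lam=1.0):
    """Analytic EDL loss: E||y - pi||^2 under Dir(pi | alpha) plus the KL to the
    uniform Dirichlet (A.1/A.2). alpha: (N, K) concentrations, y_onehot: (N, K)."""
    a0 = alpha.sum(-1, keepdim=True)
    mean = alpha / a0                                # E[pi_k]
    var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))  # Var[pi_k]
    mse = ((y_onehot - mean) ** 2 + var).sum(-1)     # sum_k (y_k - E[pi_k])^2 + Var[pi_k]
    kl = kl_divergence(Dirichlet(alpha), Dirichlet(torch.ones_like(alpha)))
    return (mse + lam * kl).mean()

alpha = torch.tensor([[2.0, 1.0, 1.0]])
y = torch.tensor([[1.0, 0.0, 0.0]])
print(edl_loss(alpha, y))
```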
The Kullback-Leibler term is between two Dirichlet distributions and is given (see the general form in A.1) as

$$\mathrm{KL}\big(\mathrm{Dir}(\pi|\alpha_\theta(x))\,\|\,\mathrm{Dir}(\pi|1,\dots,1)\big) = \log\frac{\Gamma\big(\sum_k \alpha_\theta(x)_k\big)}{\Gamma(K)\prod_k\Gamma(\alpha_\theta(x)_k)} + \sum_{k=1}^{K}\big(\alpha_\theta(x)_k - 1\big)\Big(\psi(\alpha_\theta(x)_k) - \psi\big(\textstyle\sum_{k'}\alpha_\theta(x)_{k'}\big)\Big).$$

### A.3 EVIDENTIAL DEEP LEARNING AS A LATENT VARIABLE MODEL

As discussed in the main paper, for a set of $N$ observations $D = \{(x_1,y_1),\dots,(x_N,y_N)\}$, the objective that EDL minimizes is given as

$$\sum_{n=1}^{N}\mathbb{E}_{p(\pi_n|x_n)}\big[\|y_n - \pi_n\|_2^2\big] + \lambda\,\mathrm{KL}\big(p(\pi_n|x_n)\,\|\,\mathrm{Dir}(\pi_n|1,\dots,1)\big),$$

where $p(\pi_n|x_n) = \mathrm{Dir}(\pi_n|\alpha_\theta(x_n))$. We can instead assume the following generative model:

$$\pi_n \sim \mathrm{Dir}(\pi_n|1,\dots,1)\ \ \forall n, \qquad y_n \sim \mathcal{N}(y_n|\pi_n, 0.5\, I_K)\ \ \forall n,$$

i.e. latent variables $\pi_n$ and $K$-dimensional observations $y_n$ following a multivariate normal likelihood. Approximating the intractable posterior $p(\pi|y)$, where $\pi = (\pi_1,\dots,\pi_N)$ and $y = (y_1,\dots,y_N)$, with an amortized variational posterior $q(\pi_n; x_n) = \mathrm{Dir}(\pi_n|\alpha_\theta(x_n))$, where $\alpha_\theta(\cdot)$ is the same architecture as in the EDL model, we have as the evidence lower bound (ELBO) to be maximized

$$\sum_{n=1}^{N}\mathbb{E}_{q(\pi_n;x_n)}\big[\log p(y_n|\pi_n)\big] - \mathrm{KL}\big(q(\pi_n;x_n)\,\|\,p(\pi_n)\big) = \sum_{n=1}^{N}\mathbb{E}_{q(\pi_n;x_n)}\Big[-\frac{K}{2}\log(2\pi) + \frac{K}{2}\log 2 - (y_n-\pi_n)^\top(y_n-\pi_n)\Big] - \mathrm{KL}\big(q(\pi_n;x_n)\,\|\,p(\pi_n)\big) = \mathrm{const} - \mathcal{L}_{EDL},$$

that is, the negative ELBO is equal to the EDL loss up to an additive constant.

### A.4 POSTERIOR PREDICTIVE

For a new input observation $x^*$, the corresponding posterior predictive distribution on the output $y^*$ can be computed as

$$p(y^* = k\,|\,x^*, D) \approx \mathbb{E}_{p(Z|M)}\Big[\mathbb{E}_{q_\lambda(\pi^*|x^*,Z)}\big[p(y^* = k\,|\,\pi^*)\big]\Big] = \mathbb{E}_{p(Z|M)}\bigg[h^k_\theta\Big/\sum_{k'} h^{k'}_\theta\bigg],$$

where $h^k_\theta = h_\theta\big(v_\theta(x^*), a(v_\theta(x^*); Z)\big)_k$. That is, we can compute the posterior predictive analytically up to a sampling-based evaluation of the expectation over $p(Z|M)$.

## B EXPERIMENTAL DETAILS

### B.1 SYNTHETIC 1D TWO-CLASS CLASSIFICATION

Figure 4: 1D classification task. The upper plot shows the underlying distributions of each of the two classes, as well as the observed data. The lower plot shows the regularizing evidence the generative model places on each of the two classes depending on the location in input space, that is, mean ± one standard deviation over ten samples from $p(Z|M)$.

We illustrate the qualitative behaviour of ETP on a synthetic 1D two-class classification task. The data consist of observations (20 per class) from two highly overlapping Gaussian distributions. The neural net, a simple one-hidden-layer architecture, can learn the task with high accuracy. Figure 4 shows the raw data and underlying distributions as well as the learned memory evidence. The model learns to place more evidence on the correct class further away from the decision boundary, while also becoming more varied with increasing distance. Within the high-density region, the asymmetry in how the specific observations are spread out leads to an asymmetry in the memory evidence. Below zero, the blue points are clearly separated, leading to a quick emphasis on that class as we move towards the left. Above zero, however, some blue observations in the orange region prevent the model from rapidly developing over-confidence.

This synthetic data set consists of two classes with 20 observations each, sampled from unit-variance normal distributions centered at -1 and +1, respectively. The encoding function $v_\phi(\cdot)$ is a multi-layer perceptron with a single hidden layer of 32 neurons and a ReLU activation function. It is optimized for 400 epochs with the Adam optimizer (Kingma & Ba, 2015), using the PyTorch (Paszke et al., 2019) default parameters and a learning rate of 0.001.
In order to keep the model as simple as possible, the key generator function $k_\lambda(\cdot)$ is constrained to be the identity function, while $h_\xi\big(v(x), a(v(x); Z)\big) = v(x) + \tanh\big(a(v(x); Z)\big)$. The memory update is simplified by dropping the $\tanh(\cdot)$ rescaling.

### B.2 IRIS 2D THREE-CLASS CLASSIFICATION

The data set used in this experiment is the classical Iris data set (originally due to Fisher (1936) and Anderson (1935); we rely on the version provided by the scikit-learn library, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets), consisting of 150 samples of three iris species (50 per class), with four features measured per sample. We first map the data to the first two principal components for visualization and also use this modified version as the training data. This gives an interpretable toy learning setup where one class is clearly separated from the other two overlapping classes. The encoding function $v_\phi(\cdot)$ is an MLP with two hidden layers of 32 neurons each and ReLU activation functions. We train it for 400 epochs with the Adam optimizer using the PyTorch default parameters and a learning rate of 0.001. In order to keep the model as simple as possible, we constrain the key generator function $k_\lambda(\cdot)$ to the identity function while setting $h_\xi\big(v(x), a(v(x); Z)\big) = v(x) + \tanh\big(a(v(x); Z)\big)$. We simplify the memory update by dropping the $\tanh(\cdot)$ rescaling.

Figure 5: 2D classification task. The three upper plots visualize the regularizing evidence that the model places on each location in space as the average over ten samples from $p(Z|M)$. The color scales differ across the three plots and vary in overall intensity. The lower plot visualizes a horizontal cut through the middle of these plots (input $x_1$ for $x_2 = 0$), putting them on a common scale. The solid lines in each color indicate the mean memory evidence, and the corresponding shaded areas indicate one standard deviation. Around the separable blue class, the memory strongly emphasizes the correct class with large confidence and suppresses the other two classes, depicted in orange and green. While always preferring the correct class, the memory regularizes against overconfidence around the overlapping orange and green classes.

Figure 5 shows the varying importance the model assigns to different parts of the input space depending on the class under consideration. For the separate blue class, the model is significantly more confident both in relative terms and in absolute values.

### B.3 EXPERIMENTS ON REAL DATA

All experiments are implemented in PyTorch (Paszke et al., 2019) version 1.7.1 and trained on a TITAN RTX GPU. They have been replicated ten times over random initial seeds.

**Fashion MNIST (FMNIST).** We use a LeNet5-sized architecture (see Table 3). The out-of-distribution data is the MNIST (LeCun et al., 2010) data set. We z-score normalize all data sets with the in-domain means and standard deviations. We train each model for 50 epochs with the Adam optimizer (Kingma & Ba, 2015) using a learning rate of 0.001.
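For reference, a direct PyTorch rendering of the LeNet5-sized FMNIST column of Table 3 (below) might look as follows, assuming 28×28 grayscale inputs and ReLU activations between the layers; the final linear layer produces the class logits (or, in the evidential models, the Dirichlet evidence).

```python
import torch.nn as nn

fmnist_net = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(),   # 28x28 -> 24x24
    nn.MaxPool2d(2, stride=2),                    # -> 12x12
    nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(),  # -> 8x8
    nn.MaxPool2d(2, stride=2),                    # -> 4x4
    nn.Flatten(),                                 # 50 * 4 * 4 = 800 features
    nn.Linear(800, 500), nn.ReLU(),
    nn.Linear(500, 10),                           # 10-class output head
)
```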
Table 3: The neural network architectures used for the FMNIST, CIFAR10, and SVHN data sets, with ReLU activations between the layers, together with the ResNet-18 used for CIFAR100 and the LSTM used for IMDB.

| FMNIST | CIFAR10/SVHN |
|---|---|
| Convolution (5×5) with 20 channels | Convolution (5×5) with 192 channels |
| Max pooling (2×2) with stride 2 | Max pooling (2×2) with stride 2 |
| Convolution (5×5) with 50 channels | Convolution (5×5) with 192 channels |
| Max pooling (2×2) with stride 2 | Max pooling (2×2) with stride 2 |
| Linear with 500 neurons | Linear with 1000 neurons |
| Linear with 10 neurons | Linear with 10 neurons |

CIFAR100:

| Layer name | ResNet-18 |
|---|---|
| Conv1 | 7×7, 64, stride 2 |
| Conv2_x | 3×3 max pool, stride 2; blocks of [3×3, 64; 3×3, 64] |
| Conv3_x | blocks of [3×3, 128; 3×3, 128] |
| Conv4_x | blocks of [3×3, 256; 3×3, 256] |
| Conv5_x | blocks of [3×3, 512; 3×3, 512] |
| Average pool | 7×7 average pool |
| Linear | 100 neurons |

IMDB:

| Layer name | LSTM |
|---|---|
| Embedding | 1001 × 64 |
| LSTM | 2 layers with 256 hidden size |
| Linear | 2 neurons |

**CIFAR10 (C10).** We use a LeNet5-sized architecture (see Table 3). The out-of-distribution data is the SVHN data set (Netzer et al., 2011). We z-score normalize all data sets with the in-domain means and standard deviations. We further rely on data augmentation for the in-domain data, using random crops to 32 pixels with four-pixel padding as well as horizontal flips. We train each model for 100 epochs using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001.

**SVHN.** We use a LeNet5-sized architecture (see Table 3). The out-of-distribution data is the CIFAR10 data set. We z-score normalize all data sets with the in-domain means and standard deviations. We train each model for 100 epochs using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001.

**IMDB Sentiment Classification.** We use an LSTM architecture with 64 embedding dimensions and 2 layers with 256 hidden dimensions. We generate the out-of-distribution data by sampling values from the [1, 1000] interval uniformly at random. We train each model for 20 epochs using the SGD optimizer (Robbins & Monro, 1951) with a learning rate of 0.05 and 0.9 momentum. For data preparation, we follow the source code of the original work (https://www.kaggle.com/arunmohan003/sentiment-analysis-using-lstm-pytorch).

**Shared Details and Observations.** In all experiments, we train the BNN models with a cross-entropy loss, using softmax as the squashing function to calculate class probabilities. The EDL model is trained with the original loss (Sensoy et al., 2018). We make predictions with the mean of the Dirichlet distribution, as in the original work. For all models, we use entropy as the criterion for detecting out-of-distribution instances.

Table 4: Quantitative results of the models on the FMNIST-C, CIFAR10-C, and SVHN-C test sets using the LeNet5 architecture. The table reports mean ± three standard deviations.
| Severity | Method | FMNIST-C Err (%) | FMNIST-C ECE (%) | CIFAR10-C Err (%) | CIFAR10-C ECE (%) | SVHN-C Err (%) | SVHN-C ECE (%) |
|---|---|---|---|---|---|---|---|
| 1 | BNN | 9.4 ± 2.8 | 8.0 ± 2.4 | 21.9 ± 4.5 | 7.6 ± 2.9 | 8.8 ± 1.2 | 7.3 ± 1.0 |
| 1 | EDL | 9.9 ± 2.6 | 4.3 ± 1.0 | 25.1 ± 4.6 | 10.4 ± 1.3 | 8.1 ± 1.2 | 4.6 ± 0.6 |
| 1 | ENP | 9.3 ± 2.7 | 5.9 ± 0.5 | 35.9 ± 6.0 | 4.3 ± 1.7 | 8.1 ± 1.3 | 11.4 ± 1.0 |
| 1 | ETP | 9.5 ± 2.8 | 3.0 ± 1.0 | 21.9 ± 4.6 | 3.6 ± 2.2 | 8.0 ± 1.2 | 3.0 ± 0.4 |
| 2 | BNN | 9.9 ± 2.3 | 8.3 ± 1.9 | 26.1 ± 5.3 | 10.0 ± 3.7 | 8.9 ± 1.2 | 7.3 ± 0.9 |
| 2 | EDL | 10.5 ± 2.1 | 4.5 ± 0.8 | 29.1 ± 5.1 | 10.1 ± 1.6 | 8.2 ± 1.1 | 4.6 ± 0.6 |
| 2 | ENP | 9.9 ± 2.3 | 6.1 ± 1.0 | 41.1 ± 6.6 | 5.2 ± 2.1 | 7.8 ± 1.2 | 11.7 ± 1.1 |
| 2 | ETP | 10.1 ± 2.4 | 3.1 ± 0.8 | 26.1 ± 5.3 | 5.2 ± 2.8 | 7.7 ± 1.1 | 3.0 ± 0.5 |
| 3 | BNN | 10.7 ± 2.3 | 9.0 ± 1.9 | 30.0 ± 7.6 | 12.3 ± 5.3 | 9.2 ± 1.6 | 7.5 ± 1.2 |
| 3 | EDL | 11.4 ± 2.1 | 4.9 ± 0.9 | 32.6 ± 7.0 | 9.8 ± 1.5 | 8.5 ± 1.5 | 4.7 ± 0.7 |
| 3 | ENP | 10.8 ± 2.3 | 6.4 ± 1.6 | 44.8 ± 9.0 | 7.0 ± 3.4 | 8.5 ± 1.7 | 11.8 ± 1.5 |
| 3 | ETP | 11.1 ± 2.4 | 3.5 ± 0.9 | 29.8 ± 7.2 | 6.8 ± 3.8 | 8.2 ± 1.5 | 3.2 ± 0.6 |
| 4 | BNN | 12.2 ± 4.2 | 10.2 ± 3.6 | 35.1 ± 10.9 | 15.9 ± 8.3 | 9.9 ± 2.5 | 8.0 ± 1.9 |
| 4 | EDL | 12.9 ± 4.2 | 5.6 ± 1.8 | 37.3 ± 10.0 | 9.6 ± 1.4 | 9.1 ± 2.4 | 5.1 ± 1.0 |
| 4 | ENP | 12.4 ± 4.3 | 6.5 ± 2.2 | 48.9 ± 11.2 | 9.2 ± 5.4 | 9.4 ± 2.8 | 12.3 ± 2.2 |
| 4 | ETP | 12.8 ± 4.6 | 4.1 ± 1.8 | 34.8 ± 10.3 | 9.7 ± 6.4 | 9.1 ± 2.4 | 3.2 ± 1.0 |
| 5 | BNN | 14.2 ± 5.4 | 11.7 ± 4.6 | 42.4 ± 13.7 | 21.0 ± 10.4 | 10.2 ± 2.5 | 8.2 ± 1.9 |
| 5 | EDL | 15.1 ± 5.8 | 6.7 ± 3.1 | 43.9 ± 12.5 | 9.5 ± 2.4 | 9.4 ± 2.4 | 5.1 ± 1.1 |
| 5 | ENP | 14.8 ± 6.5 | 6.9 ± 3.1 | 54.6 ± 12.7 | 12.5 ± 7.0 | 9.5 ± 2.8 | 12.4 ± 2.3 |
| 5 | ETP | 15.3 ± 7.1 | 5.2 ± 3.4 | 42.1 ± 13.1 | 14.1 ± 8.6 | 9.0 ± 2.5 | 3.3 ± 1.0 |

**Robustness against perturbations.** We use the CIFAR10-C data set in the same setup as the original work (Hendrycks & Dietterich, 2019). As FMNIST-C and SVHN-C are not available, we create them following the same procedure as in the original work. For FMNIST we adapt the procedure to gray-scale images by replicating the image three times, once for each channel. After applying the distortions, we map the images back to gray scale. Table 4 gives the numerical results used to create Figure 3 in the main paper.