# Depth Uncertainty in Neural Networks

Javier Antorán* (University of Cambridge, ja666@cam.ac.uk), James Urquhart Allingham* (University of Cambridge, jua23@cam.ac.uk), José Miguel Hernández-Lobato (University of Cambridge, Microsoft Research, The Alan Turing Institute, jmh233@cam.ac.uk)

*Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Existing methods for estimating uncertainty in deep learning tend to require multiple forward passes, making them unsuitable for applications where computational resources are limited. To solve this, we perform probabilistic reasoning over the depth of neural networks. Different depths correspond to subnetworks which share weights and whose predictions are combined via marginalisation, yielding model uncertainty. By exploiting the sequential structure of feed-forward networks, we are able to both evaluate our training objective and make predictions with a single forward pass. We validate our approach on real-world regression and image classification tasks. Our approach provides uncertainty calibration, robustness to dataset shift, and accuracies competitive with more computationally expensive baselines.

1 Introduction

Despite the widespread adoption of deep learning, building models that provide robust uncertainty estimates remains a challenge. This is especially important for real-world applications, where we cannot expect the distribution of observations to be the same as that of the training data. Deep models tend to be pathologically overconfident, even when their predictions are incorrect (Nguyen et al., 2015; Amodei et al., 2016). If AI systems could reliably identify cases in which they expect to underperform, and request human intervention, they could be deployed more safely in medical scenarios (Filos et al., 2019) or self-driving vehicles (Fridman et al., 2019), for example. In response, a rapidly growing subfield has emerged that seeks to build uncertainty-aware neural networks (Hernández-Lobato and Adams, 2015; Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). Regrettably, these methods rarely make the leap from research to production due to a series of shortcomings. 1) Implementation complexity: they can be technically complicated and sensitive to hyperparameter choice. 2) Computational cost: they can take orders of magnitude longer to converge than regular networks, or require training multiple networks; at test time, averaging the predictions from multiple models is often required. 3) Weak performance: they rely on crude approximations to achieve scalability, often resulting in limited or unreliable uncertainty estimates (Foong et al., 2019a).

In this work, we introduce Depth Uncertainty Networks (DUNs), a probabilistic model that treats the depth of a Neural Network (NN) as a random variable over which to perform inference. In contrast to more typical weight-space approaches for Bayesian inference in NNs, ours reflects a lack of knowledge about how deep our network should be.

Figure 1: A DUN is composed of subnetworks of increasing depth (left, colors denote layers with shared parameters). These correspond to increasingly complex functions (centre, colors denote depth at which predictions are made). Marginalising over depth yields model uncertainty through disagreement of these functions (right, error bars denote 1 std. dev.).
We treat network weights as learnable hyperparameters. In DUNs, marginalising over depth is equivalent to performing Bayesian Model Averaging (BMA) over an ensemble of progressively deeper NNs. As shown in Figure 1, DUNs exploit the overparametrisation of a single deep network to generate diverse explanations of the data. The key advantages of DUNs are:

1. Implementation simplicity: requiring only minor additions to vanilla deep learning code, and no changes to the hyperparameters or training regime.
2. Cheap deployment: computing exact predictive posteriors with a single forward pass.
3. Calibrated uncertainty: our experiments show that DUNs are competitive with strong baselines in terms of predictive performance, out-of-distribution (OOD) detection and robustness to corruptions.

2 Related Work

Traditionally, Bayesians tackle overconfidence in deep networks by treating their weights as random variables. Through marginalisation, uncertainty in weight-space is translated to predictions. Alas, the weight posterior in Bayesian Neural Networks (BNNs) is intractable. Hamiltonian Monte Carlo (Neal, 1995) remains the gold standard for inference in BNNs but is limited in scalability. The Laplace approximation (MacKay, 1992; Ritter et al., 2018), Variational Inference (VI) (Hinton and van Camp, 1993; Graves, 2011; Blundell et al., 2015) and expectation propagation (Hernández-Lobato and Adams, 2015) have all been proposed as alternatives. More recent methods are scalable to large models (Khan et al., 2018; Osawa et al., 2019; Dusenberry et al., 2020). Gal and Ghahramani (2016) re-interpret dropout as VI, dubbing it MC Dropout. Other stochastic regularisation techniques can also be viewed in this light (Kingma et al., 2015; Gal, 2016; Teye et al., 2018). These can be seamlessly applied to vanilla networks. Regrettably, most of the above approaches rely on factorised, often Gaussian, approximations, resulting in pathological overconfidence (Foong et al., 2019a). It is also not clear how to place reasonable priors over network weights (Wenzel et al., 2020a). DUNs avoid this issue by targeting depth.

BNN inference can also be performed directly in function space (Hafner et al., 2018; Sun et al., 2019; Ma et al., 2019; Wang et al., 2019). However, this requires crude approximations to the KL divergence between stochastic processes. The equivalence between infinitely wide NNs and Gaussian processes (GPs) (Neal, 1995; de G. Matthews et al., 2018; Garriga-Alonso et al., 2019) can be used to perform exact inference in BNNs. Unfortunately, exact GP inference scales poorly in dataset size.

Deep ensembling is a non-Bayesian method for uncertainty estimation in NNs that trains multiple independent networks and aggregates their predictions (Lakshminarayanan et al., 2017). Ensembling provides very strong results but is limited by its computational cost. Huang et al. (2017), Garipov et al. (2018), and Maddox et al. (2019) reduce the cost of training an ensemble by leveraging different weight configurations found in a single SGD trajectory. However, this comes at the cost of reduced predictive performance (Ashukha et al., 2020). Similarly to deep ensembles, DUNs combine the predictions from a set of deep models. However, this set stems from treating depth as a random variable.

Figure 2: Left: graphical model under consideration. Right: computational model. Each layer's activations are passed through the output block, producing per-depth predictions.
Unlike ensembles, BMA assumes the existence of a single correct model (Minka, 2000). In DUNs, uncertainty arises due to a lack of knowledge about how deep the correct model is. It is worth noting that deep ensembles can also be interpreted as approximate BMA (Wilson, 2020).

All of the above methods, except DUNs, require multiple forward passes to produce uncertainty estimates. This is problematic in low-latency settings or those in which computational resources are limited. Note that certain methods, such as MC Dropout, can be parallelised via batching. This allows for some computation time / memory usage trade-off. Alternatively, Postels et al. (2019) use error propagation to approximate the dropout predictive posterior with a single forward pass. Although efficient, this approach shares pathologies with MC Dropout. van Amersfoort et al. (2020) combine deep RBF networks with a Jacobian regularisation term to deterministically detect OOD points. Nalisnick et al. (2019c) and Meinke and Hein (2020) use generative models to detect OOD data without multiple predictor evaluations. Unfortunately, deep generative models can be unreliable for OOD detection (Nalisnick et al., 2019b) and simpler alternatives might struggle to scale.

There is a rich literature on probabilistic inference for NN structure selection, starting with the Automatic Relevance Determination prior (MacKay et al., 1994). Since then, a number of approaches have been introduced (Lawrence, 2001; Ghosh et al., 2019). Perhaps the closest to our work is that of Nalisnick et al. (2019a), which interprets dropout as a structured shrinkage prior that reduces the influence of residual blocks in ResNets. Conversely, a DUN can be constructed for any feed-forward neural network and marginalises the predictions made at different depths. Similar work from Dikov and Bayer (2019) uses VI to learn both the width and depth of a NN by leveraging continuous relaxations of discrete probability distributions. For depth, they use a Bernoulli distribution to model the probability that any layer is used for a prediction. In contrast to their approach, DUNs use a categorical distribution to model depth, do not require sampling for evaluation of the training objective or making predictions, and can be applied to a wider range of NN architectures, such as CNNs. Huang et al. (2016) stochastically drop layers as a ResNet training regularisation approach. On the other hand, DUNs perform exact marginalisation over architectures at train and test time, translating depth uncertainty into uncertainty over a broad range of functional complexities.

DUNs rely on two insights that have recently been demonstrated elsewhere in the literature. The first is that a single over-parameterised NN is capable of learning multiple, diverse representations of a dataset. This is also a key insight behind the subsequent work MIMO (Havasi et al., 2020). The second is that ensembling NNs with varying hyperparameters, in our case depth, leads to improved prediction robustness. Concurrently to our work, hyper-deep ensembles (Wenzel et al., 2020b) demonstrate this for a large range of hyperparameters.

3 Depth Uncertainty Networks

Consider a dataset $\mathcal{D} = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$ and a neural network composed of an input block $f_0(\cdot)$, $D$ intermediate blocks $\{f_i(\cdot)\}_{i=1}^{D}$, and an output block $f_{D+1}(\cdot)$. Each block is a group of one or more stacked linear and non-linear operations. The activations $a_i$ at depth $i \in [0, D]$ are obtained recursively as $a_i = f_i(a_{i-1})$, with $a_0 = f_0(\mathbf{x})$.
A forward pass through the network is an iterative process, where each successive block $f_i(\cdot)$ refines the previous block's activations. Predictions can be made at each step of this procedure by applying the output block to each intermediate block's activations: $\hat{\mathbf{y}}_i = f_{D+1}(a_i)$. This computational model is displayed in Figure 2. It can be implemented by changing 8 lines in a vanilla PyTorch NN, as shown in Appendix H. Recall, from Figure 1, that we can leverage the disagreement among the intermediate blocks' predictions to quantify model uncertainty.

3.1 Probabilistic Model: Depth as a Random Variable

We place a categorical prior over network depth, $p_{\beta}(d) = \mathrm{Cat}(d \,|\, \{\beta_i\}_{i=0}^{D})$. Referring to the network weights as $\theta$, we parametrise the likelihood for each depth using the corresponding subnetwork's output: $p(\mathbf{y} \,|\, \mathbf{x}, d{=}i; \theta) = p(\mathbf{y} \,|\, f_{D+1}(a_i; \theta))$. A graphical model is shown in Figure 2. For a given weight configuration, the likelihood for every depth, and thus our model's Marginal Log Likelihood (MLL),

$$\log p(\mathcal{D}; \theta) = \log \sum_{i=0}^{D} \Big( \beta_i \prod_{n=1}^{N} p(\mathbf{y}^{(n)} \,|\, \mathbf{x}^{(n)}, d{=}i; \theta) \Big), \qquad (1)$$

can be obtained with a single forward pass over the training set by exploiting the sequential nature of feed-forward NNs. The posterior over depth, $p(d \,|\, \mathcal{D}; \theta) = p(\mathcal{D} \,|\, d; \theta)\, p_{\beta}(d) / p(\mathcal{D}; \theta)$, is a categorical distribution that tells us how well each subnetwork explains the data.

A key advantage of deep neural networks lies in their capacity for automatic feature extraction and representation learning. For instance, Zeiler and Fergus (2014) demonstrate that CNNs detect successively more abstract features in deeper layers. Similarly, Frosst et al. (2019) find that maximising the entanglement of different class representations in intermediate layers yields better generalisation. Given these results, using all of our network's intermediate blocks for prediction might be suboptimal. Instead, by treating network depth as a random variable, we infer whether each block should be used to learn representations or to make predictions, which we can leverage for ensembling. As shown in Figure 3, subnetworks too shallow to explain the data are assigned low posterior probability; they perform feature extraction.
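To make the computational model concrete, the following is a minimal sketch of a fully connected regression DUN that returns predictions at every depth and evaluates the MLL of (1) with a single forward pass. It is illustrative only: the class name, the Gaussian likelihood, and the hyperparameters are our assumptions rather than the code from Appendix H.

```python
import math
import torch
import torch.nn as nn

class ToyDUN(nn.Module):
    """Minimal fully connected DUN for 1D regression (illustrative sketch)."""
    def __init__(self, D=5, width=100, noise_std=0.1):
        super().__init__()
        self.f0 = nn.Linear(1, width)                                               # input block f_0
        self.blocks = nn.ModuleList([nn.Linear(width, width) for _ in range(D)])    # f_1 ... f_D
        self.out = nn.Linear(width, 1)                                               # output block f_{D+1}, shared by all depths
        self.noise_std = noise_std
        # uniform prior over depth: beta_i = 1 / (D + 1)
        self.register_buffer("log_prior", torch.full((D + 1,), -math.log(D + 1)))

    def forward(self, x):
        """Single forward pass returning predictions at every depth, shape (D+1, batch, 1)."""
        a = torch.relu(self.f0(x))
        preds = [self.out(a)]                      # prediction from a_0
        for f in self.blocks:
            a = torch.relu(f(a))
            preds.append(self.out(a))              # prediction from a_i
        return torch.stack(preds)

    def marginal_log_lik(self, x, y):
        """MLL of eq. (1), marginalising depth with the log-sum-exp trick."""
        preds = self.forward(x)                    # (D+1, batch, 1)
        per_depth_ll = torch.distributions.Normal(preds, self.noise_std).log_prob(y).sum(dim=(1, 2))
        return torch.logsumexp(self.log_prior + per_depth_ll, dim=0)
```

Directly maximising `marginal_log_lik` with a gradient optimiser corresponds to the MLL objective discussed in Section 3.2 below; a sketch of the VI alternative follows that section.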
3.2 Inference in DUNs

We consider learning the network weights $\theta$ by directly maximising (1) with respect to $\theta$, using backpropagation and the log-sum-exp trick. In Appendix B, we show that the gradients of (1) reaching each subnetwork are weighted by the corresponding depth's posterior mass. This leads to local optima where the gradients of all but one subnetwork vanish. The posterior collapses to a delta function over an arbitrary depth, leaving us with a deterministic NN. When working with large datasets, one might indeed expect the true posterior over depth to be a delta. However, because modern NNs are underspecified even for large datasets, multiple depths should be able to explain the data simultaneously (shown in Figure 3 and Appendix B).

We can avoid the above pathology by decoupling the optimisation of the network weights from the posterior distribution. In latent variable models, the Expectation Maximisation (EM) algorithm (Bishop, 2007) allows us to optimise the MLL by iteratively computing $p(d \,|\, \mathcal{D}; \theta)$ and then updating $\theta$. We propose to use stochastic gradient variational inference as an alternative more amenable to NN optimisation. We introduce a surrogate categorical distribution over depth, $q_{\alpha}(d) = \mathrm{Cat}(d \,|\, \{\alpha_i\}_{i=0}^{D})$. In Appendix A, we derive the following lower bound on (1):

$$\log p(\mathcal{D}; \theta) \geq \mathcal{L}(\alpha, \theta) = \sum_{n=1}^{N} \mathbb{E}_{q_{\alpha}(d)}\!\left[\log p(\mathbf{y}^{(n)} \,|\, \mathbf{x}^{(n)}, d; \theta)\right] - \mathrm{KL}(q_{\alpha}(d) \,\|\, p_{\beta}(d)). \qquad (2)$$

This Evidence Lower BOund (ELBO) allows us to optimise the variational parameters $\alpha$ and the network weights $\theta$ simultaneously using gradients. Because both our variational and true posteriors are categorical, (2) is convex with respect to $\alpha$. At the optimum, $q_{\alpha}(d) = p(d \,|\, \mathcal{D}; \theta)$ and the bound is tight. Thus, we perform exact rather than approximate inference. $\mathbb{E}_{q_{\alpha}(d)}[\log p(\mathbf{y} \,|\, \mathbf{x}, d; \theta)]$ can be computed from the activations at every depth. Consequently, both terms in (2) can be evaluated exactly, with only a single forward pass. This removes the need for high-variance Monte Carlo gradient estimators, often required by VI methods for NNs. When using mini-batches of size $B$, we stochastically estimate the ELBO in (2) as

$$\mathcal{L}(\alpha, \theta) \approx \frac{N}{B} \sum_{n=1}^{B} \sum_{i=0}^{D} \alpha_i \log p(\mathbf{y}^{(n)} \,|\, \mathbf{x}^{(n)}, d{=}i; \theta) - \mathrm{KL}(q_{\alpha}(d) \,\|\, p_{\beta}(d)). \qquad (3)$$

Figure 3: Top row: progression of MLL and ELBO during training. Bottom: progression of all six depth posterior probabilities. The left column corresponds to optimising the MLL directly and the right to VI. For the latter, variational posterior probabilities $q_{\alpha}(d)$ are shown.

Predictions for new data $\mathbf{x}^{*}$ are made by marginalising depth with the variational posterior:

$$p(\mathbf{y}^{*} \,|\, \mathbf{x}^{*}, \mathcal{D}; \theta) = \sum_{i=0}^{D} p(\mathbf{y}^{*} \,|\, \mathbf{x}^{*}, d{=}i; \theta)\, q_{\alpha}(d{=}i). \qquad (4)$$
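As a companion to the ToyDUN sketch above, the following shows one way to implement the stochastic ELBO of (3) and the predictive marginalisation of (4). It is a hedged sketch under our own assumptions (optimiser choice, Gaussian likelihood, variational logits as free parameters), not the released training code.

```python
import torch
import torch.nn.functional as F

def elbo_step(model, q_logits, x, y, N, optimiser):
    """One stochastic ELBO step, eq. (3): E_q[log lik] rescaled by N/B, minus KL(q || prior)."""
    B = x.shape[0]
    preds = model(x)                                                    # (D+1, B, 1), one forward pass
    per_depth_ll = torch.distributions.Normal(preds, model.noise_std).log_prob(y).sum(dim=(1, 2))
    log_q = F.log_softmax(q_logits, dim=0)                              # log q_alpha(d)
    expected_ll = (log_q.exp() * per_depth_ll).sum() * (N / B)          # E_q over depths, rescaled to the full dataset
    kl = (log_q.exp() * (log_q - model.log_prior)).sum()                # KL between two categoricals, closed form
    loss = -(expected_ll - kl)                                          # maximise the ELBO
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return -loss.item()

def predictive_mean(model, q_logits, x_new):
    """Eq. (4): marginalise depth with q_alpha(d) to obtain the predictive mean."""
    with torch.no_grad():
        q = F.softmax(q_logits, dim=0)                                  # (D+1,)
        preds = model(x_new)                                            # (D+1, B, 1)
        return (q[:, None, None] * preds).sum(dim=0)

# Usage sketch: q_logits are the variational parameters alpha (as logits), trained jointly with theta.
# model = ToyDUN(D=5); q_logits = torch.nn.Parameter(torch.zeros(6))
# optimiser = torch.optim.Adam(list(model.parameters()) + [q_logits], lr=1e-3)
```

Both the expected log-likelihood and the KL term are deterministic functions of the per-depth activations, so no sampling is needed anywhere in the training step.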
4 Experiments

First, we compare the MLL and VI training approaches for DUNs. We then evaluate DUNs on toy regression, real-world regression, and image classification tasks. As baselines, we provide results for vanilla NNs (denoted 'SGD'), MC Dropout (Gal and Ghahramani, 2016), and deep ensembles (Lakshminarayanan et al., 2017), arguably the strongest approach for uncertainty estimation in deep learning (Snoek et al., 2019; Ashukha et al., 2020). For regression tasks, we also include Gaussian Mean Field VI (MFVI) (Blundell et al., 2015) with the local reparametrisation trick (Kingma et al., 2015). For the image classification tasks, we include stochastic depth ResNets (S-ResNets) (Huang et al., 2016), which can be viewed as MC Dropout applied to whole residual blocks, and deep ensembles of networks with different depths (depth-ensembles). We include the former as an alternative method for converting uncertainty over depth into predictive uncertainty. We include the latter to investigate the hypothesis that the different classes of functions produced at different depths provide improved disagreement and, in turn, predictive uncertainty. We study all methods in terms of accuracy, uncertainty quantification, and robustness to corrupted or OOD data. We place a uniform prior over DUN depth. See Appendix C, Appendix D, and Appendix E for detailed descriptions of the techniques we use to compute and evaluate uncertainty estimates, and of our experimental setup, respectively. Additionally, in Appendix G, we explore the effects of architecture hyperparameters on the depth posterior and the use of DUNs for architecture search. Code is available at https://github.com/cambridge-mlg/DUN.

4.1 Comparing MLL and VI training

Figure 3 compares the optimisation of a 5-hidden-layer, fully connected DUN on the concrete dataset using estimates of the MLL (1) and the ELBO (3). The former approach converges to a local optimum where all but one depth's probabilities go to 0. With VI, the surrogate posterior converges more slowly than the network weights. This allows the weights to reach a configuration where multiple depths can be used for prediction. Towards the end of training, the variational gap vanishes: the surrogate distribution approaches the true posterior without collapsing to a delta. The MLL values obtained with VI are larger than those obtained by optimising (1) directly, i.e. our proposed approach finds better explanations for the data. In Appendix B, we optimise (1) after reaching an optimum with VI (3). This does not cause posterior collapse, showing that MLL optimisation's poor performance is due to a propensity for poor local optima.

Figure 4: Top row: toy dataset from Izmailov et al. (2019). Bottom: Wiggle dataset. Black dots denote data points. Error bars represent the standard deviation among mean predictions.

4.2 Toy Datasets

We consider two synthetic 1D datasets, shown in Figure 4. We use 3-hidden-layer, 100-hidden-unit, fully connected networks with residual connections for our baselines. DUNs use the same architecture but with 15 hidden layers. GPs use the RBF kernel. We found these configurations to work well empirically. In Appendix F.1, we perform experiments with different toy datasets, architectures and hyperparameters. DUNs' performance increases with depth, but often 5 layers are sufficient to produce reasonable uncertainty estimates. The first dataset, taken from Izmailov et al. (2019), contains three disjoint clusters of data. Both MFVI and Dropout present error bars that are similar in the data-dense and in-between regions. MFVI underfits slightly, not capturing the smoothness of the data. DUNs perform most similarly to Ensembles. Both are able to fit the data well and express in-between uncertainty. Their error bars become large very quickly in the extrapolation regime because the predictions of different ensemble elements and depths diverge in different directions. Our second dataset consists of 300 samples from $y = \sin(\pi x) + 0.2\cos(4\pi x) - 0.3x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, 0.25)$ and $x \sim \mathcal{N}(5, 2.5)$. We dub it 'Wiggle'. Dropout struggles to fit this faster-varying function outside of the data-dense regions. MFVI fails completely. DUNs and Ensembles both fit the data well and provide error bars that grow as the data becomes sparse.

4.3 Tabular Regression

We evaluate all methods on UCI regression datasets using standard (Hernández-Lobato and Adams, 2015) and gap splits (Foong et al., 2019b). We also use the large-scale non-stationary flight delay dataset, preprocessed by Hensman et al. (2013). Following Deisenroth and Ng (2015), we train on the first 2M data points and test on the subsequent 100k. We select all hyperparameters, including NN depth, using Bayesian optimisation with HyperBand (Falkner et al., 2018). See Appendix E.2 for details. We evaluate methods with Root Mean Squared Error (RMSE), Log Likelihood (LL) and Tail Calibration Error (TCE). The latter measures the calibration of the 10% and 90% confidence intervals, and is described in Appendix D.

UCI standard split results are found in Figure 5. For each dataset and metric, we rank methods from 1 to 5 based on mean performance. We report mean ranks and standard deviations. Dropout obtains the best mean rank in terms of RMSE, followed closely by Ensembles. DUNs are third, significantly ahead of MFVI and SGD. Even so, DUNs outperform Dropout and Ensembles in terms of TCE, i.e. DUNs more reliably assign large error bars to points on which they make incorrect predictions. Consequently, in terms of LL, a metric which considers both uncertainty and accuracy, DUNs perform competitively (the LL rank distributions for all three methods overlap almost completely). MFVI provides the best calibrated uncertainty estimates. Despite this, its mean predictions are inaccurate, as evidenced by it being last in terms of RMSE.
This leads to MFVI's LL rank only being better than SGD's. Results for gap splits, designed to evaluate methods' capacity to express in-between uncertainty, are given in Appendix F.2. Here, DUNs outperform Dropout in terms of LL rank. However, they are both outperformed by MFVI and Ensembles.

Figure 5: Quartiles for results on UCI regression datasets across standard splits. Average ranks are computed across datasets. For LL, higher is better. Otherwise, lower is better.

Table 1: Results obtained on the flights dataset (2M). Mean and standard deviation values are computed across 5 independent training runs.

| Metric | DUN | Dropout | Ensemble | MFVI | SGD |
| --- | --- | --- | --- | --- | --- |
| LL | -4.95 ± 0.01 | -4.95 ± 0.02 | -4.95 ± 0.01 | -5.02 ± 0.05 | -4.97 ± 0.01 |
| RMSE | 34.69 ± 0.28 | 34.28 ± 0.11 | 34.32 ± 0.13 | 36.72 ± 1.84 | 34.61 ± 0.19 |
| TCE | .087 ± .009 | .096 ± .017 | .090 ± .008 | .068 ± .014 | .084 ± .010 |
| Time (s) | .026 ± .001 | .016 ± .001 | .031 ± .001 | .547 ± .003 | .002 ± .000 |

The flights dataset is known for strong covariate shift between its train and test sets, which are sampled from contiguous time periods. LL values are strongly dependent on calibrated uncertainty. As shown in Table 1, DUNs' RMSE is similar to that of SGD, with Dropout and Ensembles performing best. Again, DUNs present superior uncertainty calibration. This allows them to achieve the best LL, tied with Ensembles and Dropout. We speculate that DUNs' calibration stems from being able to perform exact inference, albeit in depth space. In terms of prediction time, DUNs clearly outrank Dropout, Ensembles, and MFVI on UCI. Because depth, or maximum depth D for DUNs, is chosen with Bayesian optimisation, methods' batch times vary across datasets. DUNs are often deeper because the quality of their uncertainty estimates improves with additional explanations of the data. As a result, SGD clearly outranks DUNs. On flights, increased depth causes DUNs' prediction time to lie in between Dropout's and Ensembles'.

4.4 Image Classification

We train ResNet-50 (He et al., 2016a) using all methods under consideration. This model is composed of an input convolutional block, 16 residual blocks and a linear layer. For DUNs, our prior over depth is uniform over the first 13 residual blocks. The last 3 residual blocks and the linear layer form the output block, providing the flexibility to make predictions from activations at multiple resolutions. We use 1×1 convolutions to adapt the number of channels between earlier blocks and the output block. We use default PyTorch training hyperparameters (https://github.com/pytorch/examples/blob/master/imagenet/main.py) for all methods and set per-dataset LR schedules. We use 5-element (standard) deep ensembles, as suggested by Snoek et al. (2019), and 10 dropout samples.

Figure 6: Top left: error and LL for MNIST at varying degrees of rotation. Top right: error and LL for CIFAR10 at varying corruption severities. Bottom left: CIFAR10-SVHN rejection-classification plot. The black line denotes the theoretical maximum performance; all in-distribution samples are correctly classified and OOD samples are rejected first. Bottom right: Pareto frontiers showing LL for corrupted CIFAR10 (severity 5) vs batch prediction time. Batch size is 256, split over 2 Nvidia P100 GPUs. Annotations show ensemble elements and Dropout samples. Note that a single-element ensemble is equivalent to SGD.
We use two variants of depth-ensembles. The first is composed of five elements corresponding to the five shallowest DUN subnetworks. The second is composed of 13 elements, one for each depth used by DUNs. Similarly, S-ResNets are uncertain over only the first 13 layers. Figure 6 contains results for all experiments described below. Mean values and standard deviations are computed across 5 independent training runs. Full details are given in Appendix E.3.

Rotated MNIST: Following Snoek et al. (2019), we train all methods on MNIST and evaluate their predictive distributions on increasingly rotated digits. Although all methods perform well on the original test set, their accuracy degrades quickly for rotations larger than 30°. Here, DUNs and S-ResNets differentiate themselves by being the least overconfident. Additionally, depth-ensembles improve over standard ensembles. We hypothesize that predictions based on features at diverse resolutions allow for increased disagreement.

Corrupted CIFAR: Again following Snoek et al. (2019), we train models on CIFAR10 and evaluate them on data subject to 16 different corruptions with 5 levels of intensity each (Hendrycks and Dietterich, 2019). Here, Ensembles significantly outperform all single-network methods in terms of error and LL at all corruption levels, with depth-ensembles being notably better than standard ensembles. This is true even for the 5-element depth-ensemble, which has relatively shallow networks with far fewer parameters. The strong performance of depth-ensembles further validates our hypothesis that networks of different depths provide useful diversity in predictions. With a single network, DUNs perform similarly to SGD and Dropout on the uncorrupted data. However, leveraging a distribution over depth allows DUNs to be the most robust non-ensemble method.

OOD Rejection: We simulate a realistic OOD rejection scenario (Filos et al., 2019) by jointly evaluating our models on an in-distribution and an OOD test set. We allow our methods to reject increasing proportions of the data based on predictive entropy before classifying the rest. All predictions on OOD samples are treated as incorrect. Following Nalisnick et al. (2019b), we use CIFAR10 and SVHN as the in- and out-of-distribution datasets. Ensembles perform best. In their standard configuration, DUNs show underconfidence: they are incapable of separating very uncertain in-distribution inputs from OOD points. We re-run DUNs using the exact posterior over depth $p(d \,|\, \mathcal{D}; \theta)$ in (4), instead of $q_{\alpha}(d)$. The exact posterior is computed while setting batch-norm to test mode. See Appendix F.3 for additional discussion. This resolves the underconfidence, outperforming Dropout and coming second, within error, to Ensembles. We do not find exact posteriors to improve performance in any other experiments. Hence we abstain from using them, as they require an additional evaluation of the training set.
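The rejection-classification evaluation above can be summarised with a short sketch. This is our own illustrative reading of the protocol, not the released evaluation code: OOD predictions are always counted as errors, and inputs are rejected in order of decreasing predictive entropy.

```python
import numpy as np

def rejection_accuracy(probs_in, labels_in, probs_ood, reject_fractions):
    """Accuracy after rejecting the most uncertain inputs, counting OOD points as errors.

    probs_in:  (N_in, C)  marginalised predictive probabilities on in-distribution data (e.g. CIFAR10)
    probs_ood: (N_ood, C) predictive probabilities on OOD data (e.g. SVHN)
    """
    probs = np.concatenate([probs_in, probs_ood])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)          # predictive entropy per input
    correct = np.concatenate([
        probs_in.argmax(axis=1) == labels_in,                       # in-distribution: usual correctness
        np.zeros(len(probs_ood), dtype=bool),                       # OOD: always treated as incorrect
    ])
    order = np.argsort(entropy)                                     # most confident inputs first
    accuracies = []
    for frac in reject_fractions:
        keep = order[: int(round(len(order) * (1 - frac)))]         # reject the most uncertain fraction
        accuracies.append(correct[keep].mean() if len(keep) else 1.0)
    return accuracies
```

The resulting accuracy-versus-rejection-rate curve is what the bottom-left panel of Figure 6 reports for each method.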
Compute Time: We compare methods' performance on corrupted CIFAR10 (severity 5) as a function of computational budget. The LL obtained by a DUN matches that of a 1.8-element ensemble. A single DUN forward pass is 1.02 times slower than a vanilla network's. On average, DUNs' computational budget matches that of 0.47 ensemble elements or 0.94 dropout samples. These values are smaller than one due to overhead such as ensemble element loading. Thus, making predictions with DUNs is 10× faster than with five-element ensembles. Note that we include loading times for ensembles to reflect that it is often impractical to store multiple ensemble elements in memory. Without loading times, ensemble timing would match Dropout's. For single-element ensembles (SGD) we report only the prediction time.

5 Discussion and Future Work

We have re-cast NN depth as a random variable, rather than a fixed parameter. This treatment allows us to optimise weights as model hyperparameters, preserving much of the simplicity of non-Bayesian NNs. Critically, both the model evidence and the predictive posterior for DUNs can be evaluated with a single forward pass. Our experiments show that networks of different depths obtain diverse fits. As a result, DUNs produce well-calibrated uncertainty estimates, performing well relative to their computational budget on uncertainty-aware tasks. They scale to modern architectures and large datasets. In DUNs, network weights have dual roles: fitting the data well and expressing diverse predictive functions at each depth. In future work, we would like to develop optimisation schemes that better ensure both roles are fulfilled, and to investigate the relationship between excess model capacity and DUN performance. We would also like to investigate the effects of DUN depth on uncertainty estimation, allowing for more principled model selection. Additionally, because depth uncertainty is orthogonal to weight uncertainty, both could potentially be combined to expand the space of hypotheses over which we perform inference. Furthermore, it would be interesting to investigate the application of DUNs to a wider range of NN architectures, for example stacked RNNs or Transformers.

Broader Impact

We have introduced a general method for training neural networks to capture model uncertainty. These models are fairly flexible and can be applied to a large number of applications, including potentially malicious ones. Perhaps our method could have the largest impact on critical decision-making applications, where reliable uncertainty estimates are as important as the predictions themselves; financial default prediction and medical diagnosis are examples of these. We hope that this work will contribute to increased usage of uncertainty-aware deep learning methods in production. DUNs are trained with default hyperparameters and readily converge to reasonable solutions. The computational cost of inference in DUNs is similar to that of vanilla NNs. This makes DUNs especially well suited for applications with real-time requirements or low computational resources, such as self-driving cars or sensor fusion on embedded devices. More generally, DUNs make leveraging uncertainty estimates in deep learning more accessible for researchers and practitioners who lack extravagant computational resources. Despite the above, a hypothetical failure of our method, e.g. providing miscalibrated uncertainty estimates, could have large negative consequences. This is particularly the case for critical decision-making applications, such as medical diagnosis.

Acknowledgments and Disclosure of Funding

We would like to thank Eric Nalisnick and John Bronskill for helpful discussions. We also thank Pablo Morales-Álvarez, Stephan Gouws, Ulrich Paquet, Devin Taylor, Shakir Mohamed, Avishkar Bhoopchand and Taliesin Beynon for giving us feedback on this work. Finally, we thank Marc Deisenroth and Balaji Lakshminarayanan for helping us acquire the flights dataset and Andrew Foong for providing us with the UCI gap datasets.
JA acknowledges support from Microsoft Research, through its Ph D Scholarship Programme, and from the EPSRC. JUA acknowledges funding from the EPSRC and the Michael E. Fisher Studentship in Machine Learning. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service (http://www.hpc.cam.ac.uk) funded by EPSRC Tier-2 capital grant EP/P020259/1. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. Co RR, abs/1606.06565, 2016. Javier Antorán, James Urquhart Allingham, and José Miguel Hernández-Lobato. Variational depth search in Res Nets. Co RR, abs/2002.02797, 2020. Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry P. Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Christopher M. Bishop. Pattern Recognition and Machine Learning, 5th Edition. Information science and statistics. Springer, 2007. Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, page 1613 1622. JMLR.org, 2015. Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. Co RR, abs/1812.01718, 2018. Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. Marc Peter Deisenroth and Jun Wei Ng. Distributed gaussian processes. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1481 1490. JMLR.org, 2015. Georgi Dikov and Justin Bayer. Bayesian learning of neural network architectures. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 730 738. PMLR, 2019. Kevin Dowd. Backtesting Market Risk Models, chapter 15, pages 321 349. John Wiley & Sons, Ltd, Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine A. Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable Bayesian neural nets with rank-1 factors. Co RR, abs/2005.07186, 2020. Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: robust and efficient hyperparameter optimization at scale. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1436 1445. PMLR, 2018. Angelos Filos, Sebastian Farquhar, Aidan N. Gomez, Tim G. J. Rudner, Zachary Kenton, Lewis Smith, Milad Alizadeh, Arnoud de Kroon, and Yarin Gal. 
A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks. Co RR, abs/1912.10481, 2019. Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner. Pathologies of factorised gaussian and MC dropout posteriors in Bayesian neural networks. Co RR, abs/1909.00719, 2019a. Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. In- between uncertainty in Bayesian neural networks. Co RR, abs/1906.11537, 2019b. Lex Fridman, Li Ding, Benedikt Jenik, and Bryan Reimer. Arguing machines: Human supervision of black box AI systems that make life-critical decisions. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1335 1343. Computer Vision Foundation / IEEE, 2019. Nicholas Frosst, Nicolas Papernot, and Geoffrey E. Hinton. Analyzing and improving representations with the soft nearest neighbor loss. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2012 2020. PMLR, 2019. Yarin Gal. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1050 1059. JMLR.org, 2016. Jacob R. Gardner, GeoffPleiss, Kilian Q. Weinberger, David Bindel, and Andrew Gordon Wilson. GPy Torch: Blackbox matrix-matrix gaussian process inference with GPU acceleration. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, 3-8 December 2018, Montréal, Canada, pages 7587 7597. 2018. Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, 3-8 December 2018, Montréal, Canada, pages 8803 8812, 2018. Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional net- works as shallow gaussian processes. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Soumya Ghosh, Jiayu Yao, and Finale Doshi-Velez. Model selection in Bayesian neural networks via horseshoe priors. J. Mach. Learn. Res., 20:182:1 182:46, 2019. Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007. Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training Image Net in 1 hour. Co RR, abs/1706.02677, 2017. Alex Graves. Practical variational inference for neural networks. 
In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 2348 2356. 2011. Danijar Hafner, Dustin Tran, Alex Irpan, Timothy P. Lillicrap, and James Davidson. Reliable un- certainty estimates in deep neural networks using noise contrastive priors. Co RR, abs/1807.09289, 2018. Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Laksh- minarayanan, Andrew M. Dai, and Dustin Tran. Training independent subnetworks for robust prediction. Co RR, abs/2010.06610, 2020. URL https://arxiv.org/abs/2010.06610. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on Image Net classification. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1026 1034. IEEE Computer Society, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770 778. IEEE Computer Society, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 630 645. Springer, 2016b. Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. James Hensman, Nicoló Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Ann Nicholson and Padhraic Smyth, editors, Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013. AUAI Press, 2013. José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1861 1869. JMLR.org, 2015. Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Lenny Pitt, editor, Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, COLT 1993, Santa Cruz, CA, USA, July 26-28, 1993, pages 5 13. ACM, 1993. Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 646 661. Springer, 2016. Gao Huang, Yixuan Li, GeoffPleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. 
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 448 456. JMLR.org, 2015. Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry P. Vetrov, and An- drew Gordon Wilson. Subspace inference for Bayesian deep learning. In Amir Globerson and Ricardo Silva, editors, Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 435. AUAI Press, 2019. Valen E. Johnson and David Rossell. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649 660, 2012. Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2616 2625. PMLR, 2018. Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparame- terization trick. Co RR, abs/1506.02557, 2015. Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009. Paul H. Kupiec. Techniques for verifying the accuracy of risk measurement models. The Journal of Derivatives, 3(2):73 84, 1995. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6402 6413, 2017. Neil D. Lawrence. Note relevance determination. In Roberto Tagliaferri and Maria Marinaro, editors, Proceedings of the 12th Italian Workshop on Neural Nets, WIRN VIETRI 2001, Vietri sul Mare, Salerno, Italy, May 17-19, 2001, Perspectives in Neural Computing, pages 128 133. Springer, 2001. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Lisha Li, Kevin G. Jamieson, Giulia De Salvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hy- perband: a novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1 185:52, 2017. Chao Ma, Yingzhen Li, and José Miguel Hernández-Lobato. Variational implicit processes. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 4222 4233. PMLR, 2019. David J. C. Mac Kay. A practical Bayesian framework for backpropagation networks. Neural Comput., 4(3):448 472, 1992. David J.C. Mac Kay et al. Bayesian nonlinear modeling for the prediction competition. 
ASHRAE transactions, 100(2):1053 1062, 1994. Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13132 13143, 2019. Alexander Meinke and Matthias Hein. Towards neural networks that provably know when they don t know. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Tom Minka. Bayesian model averaging is not model combination. July 2000. URL https: //tminka.github.io/papers/minka-bma-isnt-mc.pdf. Eric T. Nalisnick, José Miguel Hernández-Lobato, and Padhraic Smyth. Dropout as a structured shrinkage prior. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 4712 4722. PMLR, 2019a. Eric T. Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Görür, and Balaji Lakshminarayanan. Do deep generative models know what they don t know? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019b. Eric T. Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Görür, and Balaji Lakshminarayanan. Hybrid models with deep and invertible features. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 4723 4732. PMLR, 2019c. Radford M Neal. Bayesian Learning for Neural Networks. Ph D thesis, University of Toronto, 1995. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. Anh Mai Nguyen, Jason Yosinski, and JeffClune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 427 436. IEEE Computer Society, 2015. Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019, pages 38 41. Computer Vision Foundation / IEEE, 2019. Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz Khan, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, and Rio Yokota. Practical deep learning with Bayesian principles. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 4289 4301, 2019. 
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8024 8035. 2019. Janis Postels, Francesco Ferroni, Huseyin Coskun, Nassir Navab, and Federico Tombari. Sampling-free epistemic uncertainty estimation using approximated variance propagation. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 2931 2940. IEEE, 2019. Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. David Rossell, Donatello Telesca, and Valen E. Johnson. High-dimensional Bayesian classifiers using non-local priors. In Paolo Giudici, Salvatore Ingrassia, and Maurizio Vichi, editors, Statistical Models for Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, pages 305 313. Springer, 2013. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 2960 2968, 2012. Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969 13980, 2019. Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger B. Grosse. Functional variational Bayesian neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4914 4923. PMLR, 2018. Brian Trippe and Richard Turner. Overpruning in variational Bayesian neural networks. Co RR, abs/1801.06230, 2018. Joost van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Simple and scalable epistemic uncertainty estimation using a single deep deterministic neural network. Co RR, abs/2003.02037, 2020. Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for Bayesian neural networks. 
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the bayes posterior in deep neural networks really? Co RR, abs/2002.02405, 2020a. Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020b. URL https://proceedings.neurips.cc/ paper/2020/hash/481fbfa59da2581098e841b7afc122f1-Abstract.html. Andrew Gordon Wilson. The case for Bayesian deep learning. Co RR, abs/2001.10995, 2020. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for bench- marking machine learning algorithms. 2017. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, volume 8689 of Lecture Notes in Computer Science, pages 818 833. Springer, 2014.