# NOMU: Neural Optimization-based Model Uncertainty

Jakob Heiss\*¹ ², Jakob Weissteiner\*² ³, Hanna Wutte\*¹ ², Sven Seuken² ³, Josef Teichmann¹ ²

\*Equal contribution. ¹ETH Zurich, ²ETH AI Center, ³University of Zurich. Correspondence to: Jakob Weissteiner.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

We study methods for estimating model uncertainty for neural networks (NNs) in regression. To isolate the effect of model uncertainty, we focus on a noiseless setting with scarce training data. We introduce five important desiderata regarding model uncertainty that any method should satisfy. However, we find that established benchmarks often fail to reliably capture some of these desiderata, even those that are required by Bayesian theory. To address this, we introduce a new approach for capturing model uncertainty for NNs, which we call Neural Optimization-based Model Uncertainty (NOMU). The main idea of NOMU is to design a network architecture consisting of two connected sub-NNs, one for model prediction and one for model uncertainty, and to train it using a carefully-designed loss function. Importantly, our design enforces that NOMU satisfies our five desiderata. Due to its modular architecture, NOMU can provide model uncertainty for any given (previously trained) NN if given access to its training data. We evaluate NOMU in various regression tasks and noiseless Bayesian optimization (BO) with costly evaluations. In regression, NOMU performs at least as well as state-of-the-art methods. In BO, NOMU even outperforms all considered benchmarks.

## 1. Introduction

Neural networks (NNs) are becoming increasingly important in machine learning applications (LeCun et al., 2015). In many domains, it is essential to be able to quantify the model uncertainty (epistemic uncertainty) of NNs (Neal, 2012; Ghahramani, 2015). Good estimates of model uncertainty are indispensable in Bayesian optimization (BO) and active learning, where exploration is steered by (functions of) these uncertainty estimates. In recent years, BO has been successfully applied in practice to a wide range of problems, including robotics (Martinez-Cantin et al., 2009), sensor networks (Srinivas et al., 2012), and drug development (Gómez-Bombarelli et al., 2018). Better model uncertainty estimates for BO directly translate to improvements in these applications.

*Figure 1. Visualization of estimated model uncertainty $\hat{\sigma}_f$. The unknown true function is depicted as a black solid line with training points as black dots. NOMU's model prediction $\hat{f}$ is shown as a solid blue line and its uncertainty bounds ($\hat{f} \pm 2\hat{\sigma}_f$) as a blue shaded area. As a benchmark, MC Dropout ($\hat{f} \pm 4\hat{\sigma}_f$) is shown in green.*

However, estimating model uncertainty well for NNs is still an open research problem. For settings with scarce training data and negligible data noise, where model uncertainty is the main source of uncertainty, we uncover deficiencies of widely used state-of-the-art methods for estimating model uncertainty for NNs. Prior work often only measures performance in settings dominated by data noise, and thus does not adequately isolate the pure model uncertainty, thereby overlooking these algorithms' deficiencies.
However, in tasks such as BO with costly evaluations, where accurate estimates of model uncertainty are of utmost importance, these deficiencies can drastically decrease performance.

In this paper, we study the problem of estimating model uncertainty for NNs to obtain uncertainty bounds (UBs) that estimate Bayesian credible bounds in a setting with negligible data noise and scarce training data. For this, we propose a novel algorithm (NOMU) that is specialized to such a setting. Figure 1 shows UBs for NOMU and the benchmark method MC Dropout.

### 1.1. Prior Work on Model Uncertainty for NNs

Over the last decade, researchers have developed various methods to quantify model uncertainty for NNs. One strand of research considers Bayesian Neural Networks (BNNs), where distributions are placed over the NN's parameters (Graves, 2011; Blundell et al., 2015; Hernández-Lobato & Adams, 2015). However, variational methods approximating BNNs are usually computationally prohibitive and require careful hyperparameter tuning. Thus, BNNs are rarely used in practice (Wenzel et al., 2020a).

In practice, ensemble methods are more established: Gal & Ghahramani (2016) proposed Monte Carlo dropout (MCDO) to estimate model uncertainty via stochastic forward passes. Interestingly, they could show that training a NN with dropout can also be interpreted as variational inference approximating a BNN. Lakshminarayanan et al. (2017) experimentally evaluated ensembles of NNs and showed that they perform as well as or better than BNNs. They proposed using deep ensembles (DE), which use NNs with two outputs for model prediction and data noise, and estimate model uncertainty via the empirical standard deviation of the ensemble. DE is the most established state-of-the-art ensemble method and has been shown to consistently outperform other ensemble methods (Ovadia et al., 2019; Fort et al., 2019; Gustafsson et al., 2020; Ashukha et al., 2020). Recently, Wenzel et al. (2020b) proposed hyper deep ensembles (HDE), an extension of DE where additional diversity is created via different hyperparameters, and they showed that HDE outperforms DE.

Despite the popularity of MCDO, DE and HDE, our empirical results suggest that none of them reliably captures all essential features of model uncertainty: MCDO yields tubular bounds that do not narrow at observed data points (which can already be observed in Figure 1); DE and HDE can produce UBs that are sometimes unreasonably narrow in regions far from observed data or unreasonably wide at training points (as we will show in Section 4.1).

### 1.2. Overview of our Contribution

We present a new approach for estimating model uncertainty for NNs, which we call neural optimization-based model uncertainty (NOMU). In contrast to a fully Bayesian approach (e.g., BNNs), where approximating the posterior for a realistic prior is in general very challenging, we estimate posterior credible bounds by directly enforcing essential properties of model uncertainty. Specifically, we make the following contributions:

1. We first introduce five desiderata that we argue model UBs should satisfy (Section 3.1).
2. We then introduce NOMU, which consists of a network architecture (Section 3.2) and a carefully-designed loss function (Section 3.3), such that the estimated UBs fulfill these five desiderata.
   NOMU is easy to implement, scales well to large NNs, and can be represented as a single NN without the need for further ensemble distillation (in contrast to MCDO, DE and HDE). Because of its modular architecture, it can easily be used to obtain UBs for already trained NNs.
3. We experimentally evaluate NOMU in various regression settings: in scarce and noiseless settings to isolate model uncertainty (Sections 4.1.1 and 4.1.2) and on real-world data sets (Sections 4.1.3 and 4.1.4). We show that NOMU performs well across all these settings while state-of-the-art methods (MCDO, DE, and HDE) exhibit several deficiencies. (We also conducted experiments with Blundell et al. (2015); however, we found that this method did not perform as well as the other considered benchmarks. Moreover, it was shown in (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017) that deep ensembles and MC dropout outperform the methods by Hernández-Lobato & Adams (2015) and Graves (2011), respectively. Therefore, we do not include (Graves, 2011; Blundell et al., 2015; Hernández-Lobato & Adams, 2015) in our experiments.)
4. Finally, we evaluate the performance of NOMU in high-dimensional Bayesian optimization (BO) and show that NOMU performs as well as or better than all considered benchmarks (Section 4.2).

Our source code is available on GitHub: https://github.com/marketdesignresearch/NOMU.

### 1.3. Further Related Work

Nix & Weigend (1994) were among the first to introduce NNs with two outputs: one for model prediction and one for data noise (aleatoric uncertainty), using the Gaussian negative log-likelihood as loss function. However, such a data noise output cannot be used as an estimator for model uncertainty (epistemic uncertainty); see Appendix G for details. To additionally capture model uncertainty, Kendall & Gal (2017) combined the idea of Nix & Weigend (1994) with MCDO. Similarly, NNs with two outputs for lower and upper UBs, trained on specifically-designed loss functions, were previously considered by Khosravi et al. (2010) and Pearce et al. (2018). However, the method by Khosravi et al. (2010) again only accounts for data noise and does not consider model uncertainty. The method by Pearce et al. (2018) also does not take model uncertainty into account in the design of their loss function and only incorporates it via ensembles (as in DE).

Besides the state-of-the-art ensemble methods HDE and DE, there exist several other papers on ensemble methods that, for example, promote their diversity on the function space (Wang et al., 2019) or reduce their computational cost (Wen et al., 2020; Havasi et al., 2021).

For classification, Malinin & Gales (2018) introduced prior networks, which explicitly model in-sample and out-of-distribution uncertainty, where the latter is realized by minimizing the reverse KL-distance to a selected flat pointwise-defined prior. In a recent working paper (and concurrent to our work), Malinin et al. (2020a) report on progress extending their idea to regression. While the idea of introducing a separate loss for learning model uncertainty is related to NOMU, there are several important differences (loss, architecture, behavior of the model prediction, theoretical motivation; see Appendix E for details). Furthermore, their experiments suggest that DE still performs as well as or better than their proposed method.
In contrast to BNNs, which perform approximate inference over the entire set of weights, Neural Linear Models (NLMs) perform exact inference on only the last layer. NLMs have been extensively benchmarked in (Ober & Rasmussen, 2019) against MCDO and the method from (Blundell et al., 2015). Their results suggest that MCDO and (Blundell et al., 2015) perform competitively, even compared to carefully-tuned NLMs.

Neural processes, introduced by Garnelo et al. (2018a;b), have been used to express model uncertainty for image completion tasks, where one has access to thousands of different images interpreted as functions $f_i$ instead of input points $x_i$. See Appendix F for a detailed comparison of their setting to the setting we consider in this paper.

## 2. Preliminaries

In this section, we briefly review the classical Bayesian uncertainty framework for regression. Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$, and let $f: X \to Y$ denote the unknown ground truth function. Let $D^{\mathrm{train}} := \{(x^{\mathrm{train}}_i, y^{\mathrm{train}}_i) \in X \times Y,\ i \in \{1, \ldots, n^{\mathrm{train}}\}\}$, with $n^{\mathrm{train}} \in \mathbb{N}$, be i.i.d. samples from the data generating process $y = f(x) + \varepsilon$, where $\varepsilon \mid x \sim \mathcal{N}(0, \sigma_n^2(x))$. We use $\sigma_n$ to refer to the data noise (aleatoric uncertainty). We refer to $(x^{\mathrm{train}}_i, y^{\mathrm{train}}_i)$ as a training point and to $x^{\mathrm{train}}_i$ as an input training point.

In the remainder of this paper, we follow the classic Bayesian uncertainty framework by modelling the unknown ground truth function $f$ as a random variable. Hence, with a slight abuse of notation, we use the symbol $f$ to denote both the unknown ground truth function and the corresponding random variable. In Appendix I, we provide a mathematically rigorous formulation of the considered Bayesian uncertainty framework. Given a prior distribution for $f$ and known data noise $\sigma_n$, the posteriors of $f$ and $y$ are well defined. The model uncertainty (epistemic uncertainty) $\sigma_f(x)$ is the posterior standard deviation of $f(x)$, i.e.,

$$\sigma_f(x) := \sqrt{\mathbb{V}[f(x) \mid D^{\mathrm{train}}, x]}, \quad x \in X. \tag{1}$$

Assuming independence between $f$ and $\varepsilon$, the variance of the predictive distribution of $y$ can be decomposed as $\mathbb{V}[y \mid D^{\mathrm{train}}, x] = \sigma_f^2(x) + \sigma_n^2(x)$.

We present our algorithm for estimating model uncertainty $\hat{\sigma}_f$ for the case of zero or negligible data noise, i.e., $\sigma_n \approx 0$ (see Appendix C for an extension to non-negligible $\sigma_n$). For a given model prediction $\hat{f}$, the induced uncertainty bounds (UBs) are then given by

$$\big(\underline{\mathrm{UB}}_c(x),\ \overline{\mathrm{UB}}_c(x)\big) := \big(\hat{f}(x) - c\,\hat{\sigma}_f(x),\ \hat{f}(x) + c\,\hat{\sigma}_f(x)\big),$$

for $x \in X$ and a calibration parameter $c \ge 0$. (Note that our UBs estimate credible bounds (CBs) $\underline{\mathrm{CB}}$ and $\overline{\mathrm{CB}}$, which, for $\alpha \in [0, 1]$, fulfill $\mathbb{P}[f(x) \in [\underline{\mathrm{CB}}, \overline{\mathrm{CB}}] \mid D^{\mathrm{train}}, x] = \alpha$. For $\sigma_n \approx 0$, CBs are equal to predictive bounds $\underline{\mathrm{PB}}, \overline{\mathrm{PB}}$ with $\mathbb{P}[y \in [\underline{\mathrm{PB}}, \overline{\mathrm{PB}}] \mid D^{\mathrm{train}}, x] = \alpha$. See Appendix A for an explanation.)
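To make the role of $c$ concrete, the following is a minimal sketch (our own illustration, not the authors' reference implementation; the candidate grid for $c$ is an assumption) of computing the induced UBs and calibrating $c$ on held-out data by minimizing the Gaussian negative log-likelihood, as is done for the NLL-based metrics in Section 4:

```python
import numpy as np

def uncertainty_bounds(f_hat, sigma_hat, c):
    """Induced UBs (UB_c, UB-bar_c) = f_hat -/+ c * sigma_hat (elementwise)."""
    return f_hat - c * sigma_hat, f_hat + c * sigma_hat

def calibrate_c(y_val, f_hat, sigma_hat, c_grid=np.logspace(-3, 3, 200)):
    """Choose the scalar calibration parameter c minimizing the average
    Gaussian NLL on validation data (all arguments are arrays evaluated
    at the validation inputs)."""
    def avg_nll(c):
        var = (c * sigma_hat) ** 2
        return np.mean(0.5 * np.log(2 * np.pi * var)
                       + (y_val - f_hat) ** 2 / (2 * var))
    return min(c_grid, key=avg_nll)
```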
## 3. The NOMU Algorithm

We now present NOMU. We design NOMU to yield a model prediction $\hat{f}$ and a model uncertainty prediction $\hat{\sigma}_f$, such that the resulting UBs $(\underline{\mathrm{UB}}_c, \overline{\mathrm{UB}}_c)$ fulfill five desiderata.

### 3.1. Desiderata

**D1 (Non-Negativity)** The upper/lower UB between two training points lies above/below the model prediction $\hat{f}$, i.e., $\underline{\mathrm{UB}}_c(x) \le \hat{f}(x) \le \overline{\mathrm{UB}}_c(x)$ for all $x \in X$ and for $c \ge 0$. Thus, $\hat{\sigma}_f \ge 0$.

By definition, for any given prior, the exact posterior model uncertainty $\sigma_f$ is positive, and therefore Desideratum D1 should also hold for any estimate $\hat{\sigma}_f$.

**D2 (In-Sample)** In the noiseless case ($\sigma_n \approx 0$), there is zero model uncertainty at each input training point $x^{\mathrm{train}}$, i.e., $\hat{\sigma}_f(x^{\mathrm{train}}) = 0$. Thus, $\underline{\mathrm{UB}}_c(x^{\mathrm{train}}) = \overline{\mathrm{UB}}_c(x^{\mathrm{train}}) = \hat{f}(x^{\mathrm{train}})$ for $c \ge 0$.

In Appendix D.2, we prove that, for any prior that does not contradict the training data, the exact $\sigma_f$ satisfies D2. Thus, D2 should also hold for any estimate $\hat{\sigma}_f$, and we argue that even in the case of small non-zero data noise, model uncertainty should be small at input training data points.

**D3 (Out-of-Sample)** The larger the distance of a point $x \in X$ to the input training points in $D^{\mathrm{train}}$, the wider the UBs at $x$, i.e., model uncertainty $\hat{\sigma}_f$ increases out-of-sample. (Importantly, D3 also promotes model uncertainty in gaps between input training points.)

For D3 it is often not obvious which metric to choose to measure distances. Some aspects of this metric can be specified via the architecture (e.g., shift invariance for CNNs). In many applications, it is best to learn further aspects of this metric from training data, motivating our next desideratum.

**D4 (Metric Learning)** Changes in those features of $x$ that have high predictive power on the training set have a large effect on the distance metric used in D3.

D4 is not required for every application. However, specifically in deep learning applications, where it is a priori not clear which features are important, D4 is particularly desirable. (To see why, consider the task of learning facial expressions from images. For this, eyes and mouth are important features, while background color is not. A CNN automatically learns which features are important for model prediction. The same features are also important for model uncertainty: consider an image with pixel values similar to those of an image in the training data, but where mouth and eyes are very different. We should be substantially more uncertain about the model prediction for such an image than for one which is almost identical to a training image except for a different background color, even if this change of background color results in a huge Euclidean distance between the pixel vectors. D4 requires that a more useful metric is learned instead.)

**D5 (Vanishing)** As the number of training points $n^{\mathrm{train}}$ tends to infinity, model uncertainty vanishes for each $x$ in the support of the input data distribution, i.e., $\lim_{n^{\mathrm{train}} \to \infty} \hat{\sigma}_f(x) = 0$. Thus, for a fixed $c$, $\lim_{n^{\mathrm{train}} \to \infty} |\overline{\mathrm{UB}}_c(x) - \underline{\mathrm{UB}}_c(x)| = 0$.

In Appendix D, we discuss all desiderata in more detail (see Appendix D.4 for a visualization of D4).

### 3.2. The Network Architecture

*Figure 2. NOMU's network architecture.*

For NOMU, we construct a network $\mathcal{NN}^{\theta}$ with two outputs: the model prediction $\hat{f}$ (e.g., mean prediction) and a raw model uncertainty prediction $\hat{r}_f$. Formally: $\mathcal{NN}^{\theta}: X \to Y \times \mathbb{R}_{\ge 0}$, with $x \mapsto \mathcal{NN}^{\theta}(x) := (\hat{f}(x), \hat{r}_f(x))$. NOMU's architecture consists of two almost separate sub-networks: the $\hat{f}$-network and the $\hat{r}_f$-network (see Figure 2). For each sub-network, any network architecture can be used (e.g., feed-forward NNs, CNNs). This makes NOMU highly modular: we can plug in any previously trained NN for $\hat{f}$, or we can train $\hat{f}$ simultaneously with the $\hat{r}_f$-network. The $\hat{r}_f$-network learns the raw model uncertainty and is connected with the $\hat{f}$-network through the last hidden layer (dashed lines in Figure 2). This connection enables $\hat{r}_f$ to re-use features that are important for the model prediction $\hat{f}$, implementing Desideratum D4 (Metric Learning). (To prevent $\hat{r}_f$ from impacting $\hat{f}$, the dashed connections should only be used in the forward pass during training, i.e., no gradients flow back through them.)
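A minimal PyTorch-style sketch of this two-headed architecture follows (our own illustration, not the authors' reference implementation; the layer widths and the `detach()`-based stop-gradient are assumptions consistent with the description above):

```python
import torch
import torch.nn as nn

class NOMUNet(nn.Module):
    """Sketch of NOMU's architecture: an f-hat sub-network for the model
    prediction and an r-hat_f sub-network for the raw model uncertainty,
    connected through the f-hat network's last hidden layer."""

    def __init__(self, d_in: int, width: int = 64):
        super().__init__()
        # f-hat sub-network (any architecture could be plugged in here)
        self.f_body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.f_head = nn.Linear(width, 1)
        # r-hat_f sub-network
        self.r_body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # the r-head sees its own features plus the f-network's last
        # hidden layer (the dashed connection in Figure 2)
        self.r_head = nn.Linear(2 * width, 1)

    def forward(self, x: torch.Tensor):
        h_f = self.f_body(x)
        f_hat = self.f_head(h_f)
        # detach(): forward pass only, so gradients from the uncertainty
        # loss cannot alter the model prediction
        h_r = torch.cat([self.r_body(x), h_f.detach()], dim=-1)
        r_hat = self.r_head(h_r)
        return f_hat.squeeze(-1), r_hat.squeeze(-1)
```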
**Remark 3.1** NOMU's network architecture can be modified to realize D4 (Metric Learning) in many different ways. For example, if low-level features were important for predicting the model uncertainty, one could additionally add connections from earlier hidden layers of the $\hat{f}$-network to layers of the $\hat{r}_f$-network. Furthermore, one can strengthen D4 (Metric Learning) by increasing the regularization of the $\hat{r}_f$-network (see Appendix D.4).

After training $\mathcal{NN}^{\theta}$, we apply the readout map

$$\varphi(z) = \ell_{\max}\left(1 - \exp\left(-\frac{\max(0, z) + \ell_{\min}}{\ell_{\max}}\right)\right), \quad \ell_{\min} \ge 0,\ \ell_{\max} > 0,$$

to the raw model uncertainty output $\hat{r}_f$ to obtain NOMU's model uncertainty prediction

$$\hat{\sigma}_f(x) := \varphi(\hat{r}_f(x)), \quad x \in X. \tag{2}$$

The readout map $\varphi$ monotonically interpolates between a minimal model uncertainty $\ell_{\min}$ and a maximal model uncertainty $\ell_{\max}$. Here, $\ell_{\min}$ is used for numerical stability, and $\ell_{\max}$ defines the maximal model uncertainty far away from input training points (similarly to the prior variance for GPs). With NOMU's model prediction $\hat{f}$, its model uncertainty prediction $\hat{\sigma}_f$ defined in (2), and a given calibration parameter $c \in \mathbb{R}_{\ge 0}$, we can now define NOMU's UBs for each $x \in X$ as

$$\big(\underline{\mathrm{UB}}_c(x),\ \overline{\mathrm{UB}}_c(x)\big) := \big(\hat{f}(x) - c\,\hat{\sigma}_f(x),\ \hat{f}(x) + c\,\hat{\sigma}_f(x)\big). \tag{3}$$

(Like all other methods, NOMU outputs relative UBs that should be calibrated, e.g., via a parameter $c \ge 0$. See also Kuleshov et al. (2018) for a non-linear calibration method.)

It is straightforward to construct a single NN that directly outputs the upper/lower UB by extending the architecture shown in Figure 2: we monotonically transform and scale the output $\hat{r}_f(x)$ and then add/subtract this to/from the other output $\hat{f}(x)$. It is also straightforward to compute NOMU's UBs for any given, previously trained NN, by attaching the $\hat{r}_f$-network to the trained NN and only training the $\hat{r}_f$-network on the same training points as the original NN.

**Remark 3.2** The readout map $\varphi$ can be modified depending on the subsequent use of the estimated UBs. For example, for BO over discrete domains (e.g., $X = \{0, 1\}^d$) (Baptista & Poloczek, 2018), we propose the linearized readout map $\varphi(z) = \ell_{\min} + \max(0, z - \ell_{\min}) - \max(0, z - \ell_{\max})$. With this $\varphi$ and ReLU activations, one can encode NOMU's UBs as a mixed integer program (MIP) (Weissteiner & Seuken, 2020). This enables optimizing the upper UB as acquisition function without the need for further approximation via ensemble distillation (Malinin et al., 2020b).
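Both readout maps translate directly into code; the following is a short sketch (the `ell_min`/`ell_max` defaults are the values used later in Section 4.1.1):

```python
import numpy as np

def readout(z, ell_min=0.001, ell_max=2.0):
    """Smooth readout map from Eq. (2): monotonically interpolates
    between ell_min (numerical stability) and ell_max (maximal model
    uncertainty far from the training data)."""
    return ell_max * (1.0 - np.exp(-(np.maximum(0.0, z) + ell_min) / ell_max))

def readout_linearized(z, ell_min=0.001, ell_max=2.0):
    """Piecewise-linear readout map from Remark 3.2: clips z to
    [ell_min, ell_max] while staying ReLU/MIP-representable."""
    return ell_min + np.maximum(0.0, z - ell_min) - np.maximum(0.0, z - ell_max)
```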
### 3.3. The Loss Function

We now introduce the loss function $L^{\pi}$ we use for training NOMU's architecture. Let $X$ be such that $0 < \lambda_d(X) < \infty$, where $\lambda_d$ denotes the $d$-dimensional Lebesgue measure. We train $\mathcal{NN}^{\theta}$ with loss $L^{\pi}$ and $L^2$-regularization parameter $\lambda > 0$, i.e., minimizing $L^{\pi}(\mathcal{NN}^{\theta}) + \lambda \|\theta\|_2^2$ via a gradient descent-based algorithm.

**Definition 3.3 (NOMU loss)** Let $\pi := (\pi_{\mathrm{sqr}}, \pi_{\mathrm{exp}}, c_{\mathrm{exp}}) \in \mathbb{R}_{\ge 0}^3$ denote a tuple of hyperparameters. Given a training set $D^{\mathrm{train}}$, the loss function $L^{\pi}$ is defined as

$$L^{\pi}(\mathcal{NN}^{\theta}) := \underbrace{\sum_{i=1}^{n^{\mathrm{train}}} \big(\hat{f}(x^{\mathrm{train}}_i) - y^{\mathrm{train}}_i\big)^2}_{(a)} + \underbrace{\pi_{\mathrm{sqr}} \cdot \sum_{i=1}^{n^{\mathrm{train}}} \big(\hat{r}_f(x^{\mathrm{train}}_i)\big)^2}_{(b)} + \underbrace{\pi_{\mathrm{exp}} \cdot \frac{1}{\lambda_d(X)} \int_X e^{-c_{\mathrm{exp}}\, \hat{r}_f(x)}\, dx}_{(c)}. \tag{4}$$

In the following, we explain how the three terms of $L^{\pi}$ promote the desiderata we introduced in Section 3.1. Note that the behaviour of $\hat{r}_f$ directly translates to that of $\hat{\sigma}_f$.

Term (a) solves the regression task, i.e., learning a smooth function $\hat{f}$ given $D^{\mathrm{train}}$. If $\hat{f}$ is given as a pre-trained NN, this term can be omitted.

Term (b) implements D2 (In-Sample) and D5 (Vanishing) (i.e., $\hat{r}_f(x^{\mathrm{train}}_i) \approx 0$). The hyperparameter $\pi_{\mathrm{sqr}}$ controls the amount of uncertainty at the training points: the larger $\pi_{\mathrm{sqr}}$, the narrower the UBs at training points. (In the noiseless case, in theory $\pi_{\mathrm{sqr}}/\lambda = \infty$; we set $\pi_{\mathrm{sqr}}/\lambda = 10^7$. For small non-zero data noise, setting $\pi_{\mathrm{sqr}}/\lambda < \infty$ captures data-noise-induced model uncertainty $\sigma_f(x^{\mathrm{train}}) > 0$ (Appendix D.2).)

Term (c) has two purposes. First, it implements D1 (Non-Negativity) (i.e., $\hat{r}_f \ge 0$). Second, it pushes $\hat{r}_f$ towards infinity across the whole input space $X$. However, due to the counteracting force of (b) as well as regularization, $\hat{r}_f$ increases continuously as one moves away from the training data. The interplay of (b), (c), and regularization thus promotes D3 (Out-of-Sample). The hyperparameters $\pi_{\mathrm{exp}}$ and $c_{\mathrm{exp}}$ control the size and shape of the UBs. Concretely, the larger $\pi_{\mathrm{exp}}$, the wider the UBs; the larger $c_{\mathrm{exp}}$, the narrower the UBs at points $x$ with large $\hat{\sigma}_f(x)$ and the wider the UBs at points $x$ with small $\hat{\sigma}_f(x)$. In Appendix H.1, we provide detailed visualizations of how the loss hyperparameters $\pi_{\mathrm{sqr}}$, $\pi_{\mathrm{exp}}$, and $c_{\mathrm{exp}}$ shape NOMU's model uncertainty estimate. In the implementation of $L^{\pi}$, we approximate (c) via MC integration over additional, artificial input points $D^{\mathrm{art}} := \{x_i\}_{i=1}^{l} \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(X)$, i.e., by $\frac{1}{l} \sum_{x \in D^{\mathrm{art}}} e^{-c_{\mathrm{exp}}\, \hat{r}_f(x)}$.

**Remark 3.4** In (c), instead of the Lebesgue measure, one can also use a different measure $\nu$, i.e., $\frac{1}{\nu(X)} \int_X e^{-c_{\mathrm{exp}}\, \hat{r}_f(x)}\, d\nu(x)$. This can be relevant in high dimensions, where meaningful data points often lie close to a lower-dimensional manifold (Cayton, 2005); $\nu$ can then be chosen to concentrate on that region. In practice, this can be implemented by sampling $D^{\mathrm{art}}$ from a large unlabeled data set representing this region or by learning the measure $\nu$ using GANs (Goodfellow et al., 2014).
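A sketch of one training step with the loss from Definition 3.3 follows, using the `NOMUNet` sketch above; term (c) is approximated by MC integration over uniformly sampled artificial points. This is our illustration, not the reference implementation; the batch size of artificial points and the use of Adam with weight decay as the $L^2$-regularizer are assumptions.

```python
import torch

def nomu_loss(model, x_train, y_train, x_art,
              pi_sqr=0.1, pi_exp=0.01, c_exp=30.0):
    """NOMU loss, Eq. (4): (a) squared-error fit of f-hat,
    (b) pushes r-hat_f to zero at training points (D2, D5),
    (c) MC estimate of the exponential term pushing r-hat_f up
        everywhere else (D1, D3)."""
    f_hat, r_hat = model(x_train)
    term_a = ((f_hat - y_train) ** 2).sum()
    term_b = pi_sqr * (r_hat ** 2).sum()
    _, r_art = model(x_art)                    # artificial points ~ Unif(X)
    term_c = pi_exp * torch.exp(-c_exp * r_art).mean()
    return term_a + term_b + term_c

# one gradient step; weight decay plays the role of lambda:
# opt = torch.optim.Adam(model.parameters(), weight_decay=1e-8)
# x_art = 2 * torch.rand(128, d_in) - 1       # Unif([-1, 1]^d)
# loss = nomu_loss(model, x_train, y_train, x_art)
# opt.zero_grad(); loss.backward(); opt.step()
```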
**Theory** In Appendix D, we prove that NOMU fulfills D1, D2 and D5 (Propositions D.1.a, D.2.c and D.5.a), and discuss how NOMU fulfills D3 and D4. In Appendix A.1, we show that, under certain assumptions, NOMU's UBs can be interpreted as pointwise worst-case UBs $\overline{\mathrm{UB}}^{\mathrm{pw}}(x) := \sup_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x)$ within a hypothesis class $\mathcal{H}_{D^{\mathrm{train}}}$ of data-explaining functions. In Appendix A.2, we explain how $\underline{\mathrm{UB}}^{\mathrm{pw}}(x)$ and $\overline{\mathrm{UB}}^{\mathrm{pw}}(x)$ estimate posterior CBs of BNNs (with a Gaussian prior on the weights) without performing challenging variational inference. However, while exact posterior CBs of BNNs lose D4 as their width goes to infinity, NOMU's UBs are capable of retaining D4 in this limit.

## 4. Experimental Evaluation

In this section, we experimentally evaluate NOMU's model uncertainty estimate in multiple synthetic and real-world regression settings (Section 4.1) as well as in high-dimensional Bayesian optimization (Section 4.2).

**Benchmarks** We compare NOMU against four benchmarks, each of which gives a model prediction $\hat{f}$ and a model uncertainty prediction $\hat{\sigma}_f$ (see Appendix B.1 for formulas). We calculate model-specific UBs at $x \in X$ as $(\hat{f}(x) \pm c\,\hat{\sigma}_f(x))$ with calibration parameter $c \in \mathbb{R}_{\ge 0}$ and use them to evaluate all methods. We consider three algorithms that are specialized to model uncertainty for NNs — (i) MC dropout (MCDO), (ii) deep ensembles (DE), and (iii) hyper deep ensembles (HDE) — and a non-NN-based benchmark: (iv) a Gaussian process (GP) with RBF kernel.

### 4.1. Regression

To develop intuition, we first study the model UBs of all methods on synthetic test functions with scarce 1D–2D input training points without data noise (Section 4.1.1). We then propose a novel generative test-bed and evaluate NOMU within this setting (Section 4.1.2). Next, we analyze NOMU on a real-world time series (Section 4.1.3). Finally, we evaluate NOMU on the real-world UCI data sets (Section 4.1.4).

*Figure 3. 1D synthetic test functions.*

#### 4.1.1. Toy Regression

**Setting** We consider ten different 1D functions whose graphs are shown in Figure 3. Those include the popular Levy and Forrester functions with multiple local optima (from the library at sfu.ca/~ssurjano/optimization.html). All functions are transformed to $X := [-1, 1] = f(X)$. For each function, we conduct 500 runs. In each run, we randomly sample eight noiseless input training points from $X$, such that the only source of uncertainty is model uncertainty. For each run, we also generate 100 test points in the same fashion to assess the quality of the UBs.

**Metrics** We report the average negative log (Gaussian) likelihood (NLL), minimized over the calibration parameter $c$, which we denote as $\mathrm{NLL}_{\min}$. Following prior work (Khosravi et al., 2010; Kuleshov et al., 2018; Pearce et al., 2018), we further measure the quality of UBs by contrasting their mean width (MW) with their coverage probability (CP). Ideally, MW should be as small as possible, while CP should be close to 1. Since CP counteracts MW, we consider ROC-like curves, plotting MW against CP for a range of calibration parameters $c$, and report the area under the curve (AUC) (see Appendix B.2.1 for details).
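A sketch of the MW–CP curve and its AUC, under our reading of these metric definitions (the grid of calibration parameters is an assumption; $\mathrm{NLL}_{\min}$ can be computed analogously by minimizing the Gaussian NLL over the same grid, as in the calibration sketch of Section 2):

```python
import numpy as np

def mw_cp_auc(y_test, f_hat, sigma_hat, c_grid=np.logspace(-3, 3, 200)):
    """ROC-like curve of mean width (MW) vs. coverage probability (CP)
    of the UBs (f_hat -/+ c * sigma_hat), and its area under the curve."""
    mw = np.array([np.mean(2.0 * c * sigma_hat) for c in c_grid])
    cp = np.array([np.mean(np.abs(y_test - f_hat) <= c * sigma_hat)
                   for c in c_grid])
    order = np.argsort(cp)                 # integrate MW over CP
    return np.trapz(mw[order], cp[order])
```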
**Algorithm Setup** For each of the two NOMU sub-networks, we use a feed-forward NN with three fully-connected hidden layers à $2^{10}$ nodes, ReLU activations, and hyperparameters $\pi_{\mathrm{exp}} = 0.01$, $\pi_{\mathrm{sqr}} = 0.1$, $c_{\mathrm{exp}} = 30$. In practice, the values for $\pi_{\mathrm{exp}}$, $\pi_{\mathrm{sqr}}$, and $c_{\mathrm{exp}}$ can be tuned on a validation set. However, for all synthetic experiments (Sections 4.1.1 and 4.1.2), we use the same values, which lead to good results across all functions. Moreover, we set $\lambda = 10^{-8}$, accounting for zero data noise, $\ell_{\min} = 0.001$ and $\ell_{\max} = 2$. In Appendix H.2, we provide an extensive sensitivity analysis for the hyperparameters $\pi_{\mathrm{exp}}$, $\pi_{\mathrm{sqr}}$, $c_{\mathrm{exp}}$, $\ell_{\min}$ and $\ell_{\max}$. This analysis demonstrates NOMU's robustness within a certain range of hyperparameter values. Furthermore, to enable a fair comparison of all methods, we use generic representatives and do not optimize any of them for any test function. We also choose the architectures of all NN-based methods such that the overall number of parameters is comparable. We provide all methods with the same prior information (specifically, this entails knowledge of zero data noise) and set the corresponding parameters accordingly. Finally, we set all remaining hyperparameters of the benchmarks to the values proposed in the literature. Details on all configurations are provided in Appendix B.2.2.

*Figure 4. UBs resulting from NOMU, GP, MCDO, DE, and HDE for the Levy function (solid black line). For NOMU, we also show $\hat{\sigma}_f$ as a dotted blue line. Training points are shown as black dots.*

**Results** Figure 4 exemplifies our findings, showing typical UBs for the Levy function as obtained in one run. In Appendix B.2.4 we provide further visualisations. We find that MCDO consistently yields tube-like UBs; in particular, its UBs do not narrow at training points, i.e., it fails D2 (In-Sample), even though MCDO's Bayesian interpretation requires D2 to hold (see Appendix D.2). Moreover, it only fulfills D3 (Out-of-Sample) to a limited degree. We frequently observe that DE leads to UBs of somewhat arbitrary shape. This can be seen most prominently in Figure 4 around $x \approx 0.75$ and at the edges of the input range, where DE's UBs differ greatly in width with no clear justification. Thus, DE too is limited in D3 (Out-of-Sample). In addition, we sometimes see that DE's UBs do not narrow sufficiently at training points, i.e., it does not fulfill D2 (In-Sample). HDE's UBs are even more random, i.e., predicting large model uncertainty at training points and sometimes zero model uncertainty in gaps between them (e.g., around $x \approx 0.75$).

In contrast, NOMU displays the behaviour it is designed to show: its UBs nicely tighten at training points and expand in between (D1–D3; for D4 (Metric Learning) see Appendix D.4). Like NOMU, the GP fulfills D2 (In-Sample) and D3 (Out-of-Sample) well, but cannot account for D4 (Metric Learning) (a fixed kernel does not depend on the model prediction).

*Table 1. Ranks (1 = best to 5 = worst) for AUC and $\mathrm{NLL}_{\min}$.*

| Function | NOMU AUC | NOMU NLL | GP AUC | GP NLL | MCDO AUC | MCDO NLL | DE AUC | DE NLL | HDE AUC | HDE NLL |
|---|---|---|---|---|---|---|---|---|---|---|
| ABS | 1 | 1 | 3 | 1 | 3 | 4 | 1 | 1 | 5 | 5 |
| STEP | 2 | 2 | 4 | 3 | 2 | 3 | 1 | 1 | 5 | 5 |
| KINK | 1 | 1 | 3 | 1 | 4 | 4 | 1 | 1 | 5 | 5 |
| SQUARE | 2 | 2 | 2 | 1 | 4 | 4 | 2 | 3 | 5 | 5 |
| CUBIC | 2 | 2 | 1 | 1 | 3 | 4 | 3 | 3 | 5 | 5 |
| SINE 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 5 | 5 |
| SINE 2 | 2 | 2 | 3 | 1 | 1 | 2 | 3 | 3 | 5 | 5 |
| SINE 3 | 1 | 1 | 4 | 1 | 3 | 4 | 1 | 1 | 5 | 5 |
| FORRESTER | 1 | 2 | 1 | 1 | 3 | 4 | 3 | 3 | 5 | 5 |
| LEVY | 1 | 1 | 4 | 3 | 1 | 4 | 3 | 1 | 5 | 5 |

Table 1 provides the ranks achieved by each algorithm (see Appendix B.2.3 for the corresponding metrics). We calculate the ranks based on the medians and a 95% bootstrap CI of AUC and $\mathrm{NLL}_{\min}$; an algorithm loses one rank to each other algorithm that significantly dominates it. We observe that NOMU is the only algorithm that never comes in third place or worse. Thus, while some algorithms do particularly well in capturing uncertainties of functions with certain characteristics (e.g., RBF-GPs for polynomials), NOMU is the only algorithm that consistently performs well. HDE's bad performance can be explained by its randomness and the fact that it sometimes predicts zero model uncertainty out-of-sample. For 2D, we provide results and visualizations in Appendices B.2.3 and B.2.4, highlighting similar characteristics of all algorithms as in 1D.

#### 4.1.2. Generative Test-Bed

**Setting** Instead of relying only on a limited number of data sets, we also evaluate all algorithms on a generative test-bed, which provides an unlimited number of test functions. The importance of using a test-bed (to avoid over-fitting on specific data sets) has also been highlighted in recent work by Osband et al. (2021). For our test-bed, we generate 200 different data sets by randomly sampling 200 different test functions from a BNN with i.i.d. centered Gaussian weights and three fully-connected hidden layers with $[2^{10}, 2^{11}, 2^{10}]$ nodes and ReLU activations. From each of these test functions we sample, uniformly at random, $n^{\mathrm{train}} = 8d$ input training points and $100d$ test points, where $d$ denotes the input dimension. We train all algorithms on these training sets and determine the NLL on the corresponding test sets, averaged over the 200 test functions. This metric converges to the Kullback–Leibler divergence to the exact BNN posterior (see Theorem B.7). We calibrate by choosing, per dimension, an optimal value of $c$ in terms of average NLL, which does not depend on the test function.
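A sketch of how such test functions can be drawn from the BNN prior described above (our own illustration; the unit weight standard deviation is an assumption, the layer widths follow the text):

```python
import numpy as np

def sample_bnn_test_function(d, widths=(2**10, 2**11, 2**10),
                             sigma_w=1.0, seed=0):
    """Draw one random test function f: R^d -> R from a BNN prior with
    i.i.d. centered Gaussian weights and ReLU activations."""
    rng = np.random.default_rng(seed)
    dims = [d, *widths, 1]
    params = [(rng.normal(0.0, sigma_w, size=(m, n)),
               rng.normal(0.0, sigma_w, size=n))
              for m, n in zip(dims[:-1], dims[1:])]

    def f(x):                              # x: (batch, d)
        h = np.asarray(x, dtype=float)
        for i, (W, b) in enumerate(params):
            h = h @ W + b
            if i < len(params) - 1:        # ReLU on hidden layers only
                h = np.maximum(h, 0.0)
        return h[:, 0]
    return f

# f = sample_bnn_test_function(d=2)
# x_train = np.random.uniform(-1, 1, size=(8 * 2, 2)); y_train = f(x_train)
```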
**Results** In Table 2, we provide the results for input dimensions 1, 2 and 5. We see that NOMU outperforms all other algorithms, including MCDO, which is a variational inference method for approximating this posterior (Gal & Ghahramani, 2016). See Appendix B.2.5 for a detailed explanation of the experiment setting and further results in modified settings, including a discussion of higher-dimensional settings.

*Table 2. Average NLL (without the constant $\ln(2\pi)/2$) and a 95% CI over 200 BNN samples. Winners are marked in bold.*

| Function | NOMU | GP | MCDO | DE | HDE |
|---|---|---|---|---|---|
| BNN1D | **−1.65 ± 0.10** | −1.08 ± 0.22 | −0.34 ± 0.23 | −0.38 ± 0.36 | 8.47 ± 1.00 |
| BNN2D | **−1.16 ± 0.05** | −0.52 ± 0.11 | −0.33 ± 0.13 | −0.77 ± 0.07 | 9.11 ± 0.39 |
| BNN5D | **−0.37 ± 0.02** | −0.33 ± 0.02 | −0.05 ± 0.04 | −0.13 ± 0.03 | 8.41 ± 1.00 |

#### 4.1.3. Solar Irradiance Time Series

**Setting** Although the current version of NOMU is specifically designed for scarce settings without data noise, we are also interested in how well it performs in settings where these assumptions are not satisfied. To this end, we now study a setting with many training points and small non-zero data noise. This also allows us to analyze how well NOMU captures D5 (Vanishing). We consider the popular task of interpolating the solar irradiance data (Steinhilber et al., 2009), also studied in Gal & Ghahramani (2016). We scale the data to $X = [-1, 1]$ and split it into 194 training and 197 test points. As in Gal & Ghahramani (2016), we choose five intervals to contain only test points. Since the true function is likely of high frequency, we set $\lambda = 10^{-19}$ for NOMU and the benchmarks' regularization accordingly. To account for small non-zero data noise, we set $\pi_{\mathrm{exp}} = 0.05$ and $\ell_{\min} = 0.01$, and otherwise use the same hyperparameters as in Section 4.1.1.

**Results** Figure 5 visualizes NOMU's UBs. We see that NOMU manages to fit the training data well while capturing model uncertainty between input training points. In particular, large gaps between input training points are successfully modeled as regions of high model uncertainty (D3 (Out-of-Sample)). Moreover, in regions highly populated by input training points, NOMU's model uncertainty vanishes, as required by D5 (Vanishing). Plots for the other algorithms are provided in Appendix B.2.6, where we observe similar patterns as in Section 4.1.1.

*Figure 5. NOMU's model prediction (solid), model uncertainty (dotted) and UBs (shaded area) on the solar irradiance data. Training and test points are shown as black dots and red crosses.*

#### 4.1.4. UCI Data Sets

**Setting** Recall that NOMU is specifically tailored to noiseless settings with scarce input training data. However, even the current version of NOMU, which does not explicitly model data noise, already performs on par with existing benchmarks on real-world regression tasks with data noise. To demonstrate this, we test its performance on (a) the UCI data sets proposed in Hernández-Lobato & Adams (2015), a common benchmark for uncertainty quantification in noisy, real-world regression, and (b) the UCI gap data set extension proposed in Foong et al. (2019).
We consider exactly the same experiment setup as proposed in these works, with a 70/20/10 train-validation-test split, equip NOMU with a shallow architecture of 50 hidden nodes, and train it for 400 epochs. The validation data are used to calibrate the constant $c$ on NLL. See Appendix B.2.7 for details on NOMU's setup.

**Results** Table 3 reports NLLs on test data, averaged across 20 splits, compared to the current state of the art. We report the NLL for MCDO (Gal & Ghahramani, 2016) and DE (Lakshminarayanan et al., 2017) from the original papers. Moreover, we reprint the best results (for comparable network sizes) of neural linear models (NLMs) with and without hyperparameter optimization (HPO) from (Ober & Rasmussen, 2019) (NLM-HPO, NLM), linearized Laplace (LL) (Foong et al., 2019), and a recent strong MCDO baseline (MCDO2) from (Mukhoti et al., 2018). It is surprising that, even though the current design of NOMU does not explicitly incorporate data noise, it already performs comparably to state-of-the-art results. We obtain similar results for the UCI gap data. See Appendix B.2.7 for more details on this experiment.

*Table 3. Average NLL and a 95% normal-CI over 20 runs for the UCI data sets.*

| Data set | NOMU | DE | MCDO | MCDO2 | LL | NLM-HPO | NLM |
|---|---|---|---|---|---|---|---|
| BOSTON | 2.68 ± 0.11 | 2.41 ± 0.49 | 2.46 ± 0.11 | 2.40 ± 0.07 | 2.57 ± 0.09 | 2.58 ± 0.17 | 3.63 ± 0.39 |
| CONCRETE | 3.05 ± 0.06 | 3.06 ± 0.35 | 3.04 ± 0.03 | 2.97 ± 0.03 | 3.05 ± 0.07 | 3.11 ± 0.09 | 3.12 ± 0.09 |
| ENERGY | 0.77 ± 0.06 | 1.38 ± 0.43 | 1.99 ± 0.03 | 1.72 ± 0.01 | 0.82 ± 0.05 | 0.69 ± 0.05 | 0.69 ± 0.05 |
| KIN8NM | −1.08 ± 0.01 | −1.20 ± 0.03 | −0.95 ± 0.01 | −0.97 ± 0.00 | −1.23 ± 0.01 | −1.12 ± 0.01 | −1.13 ± 0.01 |
| NAVAL | −5.63 ± 0.39 | −5.63 ± 0.09 | −3.80 ± 0.01 | −3.91 ± 0.01 | −6.40 ± 0.11 | −7.36 ± 0.15 | −7.35 ± 0.01 |
| CCPP | 2.79 ± 0.01 | 2.79 ± 0.07 | 2.80 ± 0.01 | 2.79 ± 0.01 | 2.83 ± 0.01 | 2.79 ± 0.01 | 2.79 ± 0.01 |
| PROTEIN | 2.79 ± 0.01 | 2.83 ± 0.03 | 2.89 ± 0.00 | 2.87 ± 0.00 | 2.89 ± 0.00 | 2.78 ± 0.01 | 2.81 ± 0.00 |
| WINE | 1.08 ± 0.04 | 0.94 ± 0.23 | 0.93 ± 0.01 | 0.92 ± 0.01 | 0.97 ± 0.03 | 0.96 ± 0.01 | 1.48 ± 0.09 |
| YACHT | 1.38 ± 0.28 | 1.18 ± 0.41 | 1.55 ± 0.05 | 1.38 ± 0.01 | 1.01 ± 0.09 | 1.17 ± 0.13 | 1.13 ± 0.09 |

### 4.2. Bayesian Optimization

In this section, we assess the performance of NOMU in high-dimensional noiseless Bayesian optimization (BO).

**Setting** In BO, the goal is to maximize an unknown expensive-to-evaluate function given a budget of function queries. We use a set of test functions with different characteristics from the same library as before, but now in 5 to 20 dimensions $d$, transformed to $X = [-1, 1]^d$ and $f(X) = [-1, 1]$. (These functions are designed for minimization; we multiply them by $-1$ and equivalently maximize instead.) Additionally, we again use Gaussian BNNs to create a generative test-bed consisting of a large variety of test functions (see Section 4.1.2). For each test function, we randomly sample 8 initial points $(x_i, f(x_i))$ and let each algorithm choose 64 further function evaluations (one by one) using its upper UB as acquisition function. This corresponds to a setting where one can only afford 72 expensive function evaluations in total. We provide details regarding the selected hyperparameters for each algorithm in Appendix B.3.1. We measure the performance of each algorithm by its final regret

$$\frac{\big|\max_{x \in X} f(x) - \max_{i \in \{1, \ldots, 72\}} f(x_i)\big|}{\big|\max_{x \in X} f(x)\big|}.$$

For each algorithm, the UBs must be calibrated by choosing appropriate values of $c$. We do so in the following straightforward way: first, after observing the 8 random initial points, we determine the two values of $c$ for which the resulting mean width (MW) of the UBs is 0.05 and 0.5, respectively (MW SCALING). (We fix MW instead of $c$, since the scales of the algorithms' uncertainty estimates vary by orders of magnitude.) We perform one BO run for each of the two resulting initial values of $c$. Additionally, if in a BO run an algorithm were to choose a point $x_i$ very close to an already observed point, we dynamically increase $c$ to make it select a different point instead (DYNAMIC C; see Appendix B.3.2 for details). A value of 0.05 in MW SCALING corresponds to small model uncertainty, such that exploration is mainly due to DYNAMIC C; smaller values than 0.05 thus lead to similar outcomes. In contrast, a value of 0.5 corresponds to large model uncertainty, such that DYNAMIC C is rarely used. Only for the plain GP (pGP) do we use neither MW SCALING nor DYNAMIC C, as pGP uses its default calibration ($c$ is determined by the built-in hyperparameter optimization in every step). However, a comparison of GP and pGP suggests that MW SCALING and DYNAMIC C surpass the built-in calibration (see Table 4). As a baseline, we also report random search (RAND).
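A sketch of the resulting BO step with the upper UB as acquisition function, maximized here by simple random candidate search (the surrogate-fitting interface, the candidate-sampling scheme, and the DYNAMIC C trigger distance are all our assumptions, not the paper's implementation):

```python
import numpy as np

def bo_step(fit_surrogate, x_obs, y_obs, d, c,
            n_cand=10_000, min_dist=1e-3):
    """One BO step: fit the surrogate, maximize the upper UB over random
    candidates; if the argmax (almost) coincides with an already observed
    point, increase c and retry (DYNAMIC C)."""
    f_hat, sigma_hat = fit_surrogate(x_obs, y_obs)   # returns callables
    cand = np.random.uniform(-1.0, 1.0, size=(n_cand, d))
    while True:
        ucb = f_hat(cand) + c * sigma_hat(cand)      # upper UB, Eq. (3)
        x_next = cand[np.argmax(ucb)]
        if np.min(np.linalg.norm(x_obs - x_next, axis=1)) > min_dist:
            return x_next, c
        c *= 2.0                                     # explore more

# x, c = bo_step(fit_nomu, x_obs, y_obs, d=5, c=c0)
# x_obs = np.vstack([x_obs, x]); y_obs = np.append(y_obs, f(x[None]))
```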
**Results** In Table 4, we present the BO results in 5D, 10D and 20D. We show the average final regret per dimension across the five functions. For each algorithm and dimension, we give the results corresponding to the MW scaling parameter (0.05 or 0.5) that minimizes the average final regret across that dimension (see Appendix B.3.3 for both MWs). In practice, one often only knows the dimensionality of a given BO task, which is why we use the average final regret per dimension as the criterion for setting the optimal MW. For each individual function, we also present the ranks based on the final regret and a 95% CI over 100 (5D) and 50 (10D–20D) runs.

*Table 4. BO results: average final regrets per dimension and ranks for each individual function (1 = best to 7 = worst).*

| Function | NOMU | GP | MCDO | DE | HDE | pGP | RAND |
|---|---|---|---|---|---|---|---|
| LEVY5D | 1 | 1 | 6 | 3 | 3 | 4 | 7 |
| ROSENBROCK5D | 1 | 1 | 1 | 1 | 2 | 5 | 7 |
| G-FUNCTION5D | 2 | 3 | 1 | 4 | 2 | 3 | 7 |
| PERM5D | 3 | 1 | 1 | 5 | 7 | 2 | 4 |
| BNN5D | 1 | 1 | 4 | 1 | 4 | 1 | 7 |
| Average regret 5D | 2.87e−2 | 5.03e−2 | 4.70e−2 | 5.18e−2 | 7.13e−2 | 4.14e−2 | 1.93e−1 |
| LEVY10D | 1 | 3 | 5 | 6 | 1 | 1 | 6 |
| ROSENBROCK10D | 1 | 1 | 2 | 6 | 3 | 2 | 7 |
| G-FUNCTION10D | 2 | 5 | 1 | 3 | 2 | 5 | 7 |
| PERM10D | 2 | 1 | 2 | 6 | 2 | 2 | 1 |
| BNN10D | 1 | 2 | 1 | 1 | 3 | 1 | 7 |
| Average regret 10D | 8.40e−2 | 1.17e−1 | 6.96e−2 | 1.15e−1 | 9.32e−2 | 9.46e−2 | 2.35e−1 |
| LEVY20D | 1 | 1 | 5 | 7 | 1 | 1 | 6 |
| ROSENBROCK20D | 2 | 2 | 2 | 6 | 1 | 4 | 6 |
| G-FUNCTION20D | 1 | 4 | 5 | 1 | 1 | 3 | 7 |
| PERM20D | 3 | 5 | 3 | 2 | 3 | 3 | 1 |
| BNN20D | 1 | 2 | 2 | 2 | 6 | 1 | 7 |
| Average regret 20D | 1.12e−1 | 1.33e−1 | 1.39e−1 | 1.71e−1 | 1.37e−1 | 1.17e−1 | 2.80e−1 |

We see that NOMU performs as well as or better than all benchmarks in terms of average final regret. By inspecting the ranks achieved for each individual function, we further observe that NOMU is never ranked worse than 3rd. In contrast, the performance of the benchmarks heavily depends on the test function, and each benchmark is ranked 4th or worse multiple times. For PERM10D/20D we see that RAND performs best; however, due to a flat optimum of the Perm function, all algorithms achieve similar (very small) final regrets. Finally, we see that NOMU is always ranked first for the BNN test functions. Figure 6 shows the regret plot for BNN20D (see Appendix B.3.4 for all regret plots).

*Figure 6. Regret plot for BNN20D. For each BO step, we show the regrets averaged over 50 runs (solid lines) with 95% CIs.*

## 5. Conclusion

We have introduced NOMU, a new algorithm for estimating model uncertainty for NNs, specialized for scarce and noiseless settings. By using a specific architecture and a carefully-designed loss function, we have ensured that NOMU satisfies five important desiderata regarding model uncertainty that any method should satisfy. However, when analyzing model uncertainty decoupled from data noise, we have experimentally uncovered that, perhaps surprisingly, established state-of-the-art methods fail to reliably capture some of the desiderata, even those that are required by Bayesian theory. In contrast, NOMU satisfies all desiderata, matches the performance of all benchmarks in regression tasks, and performs as well or better in noiseless BO tasks. We see great potential to further improve NOMU, for example by adapting the loss or by modifying the connections between the two sub-NNs. We also envision several extensions of NOMU, including its application to classification, employing different architectures (CNNs, GNNs, RNNs or Transformers), and incorporating data noise.
## Acknowledgements

We thank Marius Högger and Aurelio Dolfini for insightful discussions and their excellent research assistance in implementing the Bayesian optimization experiments and the UCI data set experiments, respectively. Furthermore, we thank the anonymous reviewers for helpful comments. This paper is part of a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 805542).

## References

Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxI5gHKDr.

Baptista, R. and Poloczek, M. Bayesian optimization of combinatorial structures. In International Conference on Machine Learning, pp. 462–471. PMLR, 2018. URL http://proceedings.mlr.press/v80/baptista18a/baptista18a.pdf.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012. URL https://www.jmlr.org/papers/v13/bergstra12a.html.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In 32nd International Conference on Machine Learning (ICML), 2015. URL http://proceedings.mlr.press/v37/blundell15.pdf.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. URL https://arxiv.org/abs/1611.08097.

Caruana, R., Niculescu-Mizil, A., Crew, G., and Ksikes, A. Ensemble selection from libraries of models. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 18, 2004. URL https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf.

Cayton, L. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep., 12(1-17):1, 2005. URL https://www.lcayton.com/resexam.pdf.

Foong, A. Y., Li, Y., Hernández-Lobato, J. M., and Turner, R. E. In-between uncertainty in Bayesian neural networks. arXiv preprint arXiv:1906.11537, 2019. URL https://arxiv.org/abs/1906.11537.

Fort, S., Hu, H., and Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019. URL https://arxiv.org/abs/1912.02757.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In 33rd International Conference on Machine Learning (ICML), pp. 1050–1059, 2016. URL https://proceedings.mlr.press/v48/gal16.html.

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. A. Conditional neural processes. In International Conference on Machine Learning, pp. 1704–1713. PMLR, 2018a. URL https://proceedings.mlr.press/v80/garnelo18a.html.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. Neural processes. arXiv preprint arXiv:1807.01622, 2018b. URL https://arxiv.org/abs/1807.01622.

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015. URL https://www.nature.com/articles/nature14541.
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018. URL https://doi.org/10.1021/acscentsci.7b00572.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011. URL http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.

Gustafsson, F. K., Danelljan, M., and Schön, T. B. Evaluating scalable Bayesian deep learning methods for robust computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 318–319, 2020. URL https://arxiv.org/abs/1906.01620.

Havasi, M., Jenatton, R., Fort, S., Liu, J. Z., Snoek, J., Lakshminarayanan, B., Dai, A. M., and Tran, D. Training independent subnetworks for robust prediction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=OGg9XnKxFAH.

Heiss, J., Teichmann, J., and Wutte, H. How implicit regularization of ReLU neural networks characterizes the learned function — part I: the 1-D case of two layers with random first layer. arXiv preprint arXiv:1911.02903, 2019. URL https://doi.org/10.3929/ethz-b-000402003.

Hernández-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869, 2015. URL http://proceedings.mlr.press/v37/hernandez-lobatoc15.pdf.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pp. 5574–5584, 2017. URL https://papers.nips.cc/paper/2017/hash/2650d6089a6d640c5e85b2b88265dc2b-Abstract.html.

Khosravi, A., Nahavandi, S., Creighton, D., and Atiya, A. F. Lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks, 22(3):337–346, 2010. URL https://doi.org/10.1109/TNN.2010.2096824.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2796–2804, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kuleshov18a.html.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017. URL http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.pdf.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015. ISSN 1476-4687. doi: 10.1038/nature14539. URL https://www.nature.com/articles/nature14539.
Malinin, A. and Gales, M. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pp. 7047–7058, 2018. URL https://papers.nips.cc/paper/2018/file/3ea2db50e62ceefceaf70a9d9a56a6f4-Paper.pdf.

Malinin, A., Chervontsev, S., Provilkov, I., and Gales, M. Regression prior networks, 2020a. URL https://arxiv.org/pdf/2006.11590.pdf.

Malinin, A., Mlodozeniec, B., and Gales, M. Ensemble distribution distillation. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=BygSP6Vtvr.

Martinez-Cantin, R., De Freitas, N., Brochu, E., Castellanos, J., and Doucet, A. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 27(2):93–103, 2009. URL https://doi.org/10.1007/s10514-009-9130-2.

Mukhoti, J., Stenetorp, P., and Gal, Y. On the importance of strong baselines in Bayesian deep learning. arXiv preprint arXiv:1811.09385, 2018. URL https://arxiv.org/abs/1811.09385.

Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012. URL https://api.semanticscholar.org/CorpusID:60809283.

Nix, D. A. and Weigend, A. S. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pp. 55–60. IEEE, 1994. URL https://doi.org/10.1109/ICNN.1994.374138.

Ober, S. W. and Rasmussen, C. E. Benchmarking the neural linear model for regression. arXiv preprint arXiv:1912.08416, 2019. URL https://arxiv.org/abs/1912.08416.

Osband, I., Wen, Z., Asghari, M., Ibrahimi, M., Lu, X., and Van Roy, B. Epistemic neural networks. arXiv preprint arXiv:2107.08924, 2021. URL https://arxiv.org/abs/2107.08924.

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf.

Pearce, T., Zaki, M., Brintrup, A., and Neely, A. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In 35th International Conference on Machine Learning (ICML), pp. 4075–4084, 2018. URL https://proceedings.mlr.press/v80/pearce18a.html.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012. ISSN 1557-9654. doi: 10.1109/tit.2011.2182033. URL http://dx.doi.org/10.1109/TIT.2011.2182033.

Steinhilber, F., Beer, J., and Fröhlich, C. Total solar irradiance during the Holocene. Geophysical Research Letters, 36(19), 2009. doi: 10.1029/2009GL040142. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2009GL040142.

Wahba, G. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society: Series B (Methodological), 40(3):364–372, 1978. URL https://doi.org/10.1111/j.2517-6161.1978.tb01050.x.

Wang, Z., Ren, T., Zhu, J., and Zhang, B. Function space particle optimization for Bayesian neural networks. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=BkgtDsCcKQ.

Weissteiner, J. and Seuken, S. Deep learning-powered iterative combinatorial auctions. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 2284–2293, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/download/5606/5462.

Wen, Y., Tran, D., and Ba, J. BatchEnsemble: An alternative approach to efficient ensemble and lifelong learning, 2020. URL https://arxiv.org/pdf/2002.06715.pdf.

Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., and Nowozin, S. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10248–10259. PMLR, 13–18 Jul 2020a. URL https://proceedings.mlr.press/v119/wenzel20a.html.

Wenzel, F., Snoek, J., Tran, D., and Jenatton, R. Hyperparameter ensembles for robustness and uncertainty quantification. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020b. Curran Associates Inc. ISBN 9781713829546. URL https://papers.nips.cc/paper/2020/file/481fbfa59da2581098e841b7afc122f1-Paper.pdf.

Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006. URL https://gaussianprocess.org/gpml/.

## Appendix A. Theoretical Analysis of NOMU

In this section, we

1. first provide a theoretical motivation for the design of NOMU and establish, via Theorem A.1, a connection to pointwise worst-case UBs $\underline{\mathrm{UB}}^{\mathrm{pw}}, \overline{\mathrm{UB}}^{\mathrm{pw}}$ (Appendix A.1);
2. next provide a Bayesian interpretation of those pointwise worst-case UBs $\underline{\mathrm{UB}}^{\mathrm{pw}}, \overline{\mathrm{UB}}^{\mathrm{pw}}$ and elaborate on relative uncertainties (Appendix A.2);
3. then discuss the case where $\mathcal{H}_{D^{\mathrm{train}}}$ is not upwards directed, as assumed in Theorem A.1, and show how we deal with this challenge (Appendix A.3);
4. finally prove Theorem A.1 (Appendix A.4).

### A.1. Relating NOMU to Pointwise Worst-Case Uncertainty Bounds

In this section, we provide a theoretical motivation for the design of NOMU. To this end, we first define worst-case UBs. We then state a theorem connecting these worst-case bounds and NOMU's UBs via NOMU's loss function. In what follows, we assume zero data noise, $\sigma_n = 0$.

Consider a hypothesis class $\mathcal{H}$ given as a subset of a Banach space of functions $f: X \to Y$. Furthermore, let $\mathcal{H}_{D^{\mathrm{train}}} := \{f \in \mathcal{H} : f(x^{\mathrm{train}}_i) = y^{\mathrm{train}}_i,\ i = 1, \ldots, n^{\mathrm{train}}\}$ denote the set of all functions from the hypothesis class that fit through the training points, and let $\hat{f} \in \mathcal{H}_{D^{\mathrm{train}}}$ be a prediction function (e.g., a NN trained on $D^{\mathrm{train}}$). Worst-case bounds within the class $\mathcal{H}_{D^{\mathrm{train}}}$ can be defined pointwise for each $x \in X$ as:

$$\underline{\mathrm{UB}}^{\mathrm{pw}}(x) := \inf_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x), \tag{5}$$

$$\overline{\mathrm{UB}}^{\mathrm{pw}}(x) := \sup_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x). \tag{6}$$

By definition, these UBs are the tightest possible bounds that cover every $f \in \mathcal{H}_{D^{\mathrm{train}}}$ (i.e., $\underline{\mathrm{UB}}^{\mathrm{pw}}(x) \le f(x) \le \overline{\mathrm{UB}}^{\mathrm{pw}}(x)$ for all $x \in X$). From a Bayesian perspective, such bounds correspond to credible bounds for $\alpha = 1$ if the support of the prior is contained in $\mathcal{H}$. Interestingly, if $\mathcal{H}$ is the class of regularized NNs, these bounds can also be interpreted as an approximation of credible bounds for $\alpha < 1$ with respect to a Gaussian prior on the parameters of a NN (see Appendix A.2 for a derivation; standard BNNs also aim to approximate the posterior coming from exactly this prior).

In applications like BO, when optimizing an acquisition function based on these pointwise-defined bounds, we require the UBs for all $x \in X$. Thus, numerically optimizing such an acquisition function is practically infeasible, as it would require solving the optimization problems from (5) and (6) millions of times. In NOMU, we circumvent this problem by constructing the UBs for all $x \in X$ simultaneously. We can do so by solving only a single optimization problem, i.e., minimizing the NOMU loss from (4). In the following theorem, we show that these pointwise-defined UBs can be computed by solving a single optimization problem under the following assumption.

**Assumption 1 (Upwards Directed)** For every $f_1, f_2 \in \mathcal{H}_{D^{\mathrm{train}}}$ there exists an $f \in \mathcal{H}_{D^{\mathrm{train}}}$ such that $f(x) \ge \max(f_1(x), f_2(x))$ for all $x \in X$.

**Theorem A.1 (Single Optimization Problem)** Let $X = \prod_{i=1}^{d} [a_i, b_i] \subset \mathbb{R}^d$, $a_i < b_i$, let $Y = \mathbb{R}$, and let $D^{\mathrm{train}}$ be a nonempty set of training points. Furthermore, let $\mathcal{H}_{D^{\mathrm{train}}} \subset (C(X, Y), \|\cdot\|_{\infty})$ be compact and upwards directed, and let $\hat{f} \in \mathcal{H}_{D^{\mathrm{train}}}$. Then, for every strictly-increasing and continuous $u: \mathbb{R} \to \mathbb{R}$, it holds that

$$\overline{\mathrm{UB}}^{\mathrm{pw}} = \operatorname*{arg\,max}_{h \in \mathcal{H}_{D^{\mathrm{train}}}} \int_X u\big(h(x) - \hat{f}(x)\big)\, dx. \tag{7}$$

In words, $\overline{\mathrm{UB}}^{\mathrm{pw}}$ can be calculated via the single optimization problem (7) on $\mathcal{H}_{D^{\mathrm{train}}}$. (We formulate Theorem A.1 for the upper UB $\overline{\mathrm{UB}}^{\mathrm{pw}}$; the analogous statement also holds for the lower UB $\underline{\mathrm{UB}}^{\mathrm{pw}}$.)

*Proof.* See Appendix A.4.

In practice, Assumption 1 can be violated, such that a straightforward calculation of the r.h.s. of (7) for an arbitrary $u$ would result in unreasonable UBs. However, for a sensible choice of $u$, NOMU's UBs based on the r.h.s. of (7) still satisfy our Desiderata D1–D5, similarly to $\overline{\mathrm{UB}}^{\mathrm{pw}}$ (see Appendix A.3 for a discussion).

The connection of NOMU's UBs to the pointwise worst-case bounds $\underline{\mathrm{UB}}^{\mathrm{pw}}, \overline{\mathrm{UB}}^{\mathrm{pw}}$ becomes clear by observing that minimizing NOMU's loss function $L^{\pi}$ (Equation (4)) can be interpreted as solving the r.h.s. of (7) for a specific choice of $u$ when $\mathcal{H}$ is the class of regularized NNs. In detail: Term (a) of the NOMU loss (4) implements that $\hat{f}$ solves the regression task, and thus $\hat{f} \in \mathcal{H}_{D^{\mathrm{train}}}$ up to numerical precision (if the regularization $\lambda$ is small enough). Term (b) enforces $\hat{r}_f(x^{\mathrm{train}}_i) \approx 0$, and thus, defining $h := \hat{f} + \hat{r}_f$, we directly obtain $h(x^{\mathrm{train}}_i) \approx y^{\mathrm{train}}_i$, corresponding to the constraint $h \in \mathcal{H}_{D^{\mathrm{train}}}$ in (7). While terms (a) and (b) enforce the constraints of (7), term (c) is the objective function of (7) for the specific choice of $u(z) := -e^{-c_{\mathrm{exp}} z}$, $c_{\mathrm{exp}} \in \mathbb{R}_{\ge 0}$.

### A.2. Bayesian Interpretation of Pointwise Worst-Case Uncertainty Bounds

In this section, we provide a Bayesian interpretation of the pointwise worst-case UBs $\underline{\mathrm{UB}}^{\mathrm{pw}}, \overline{\mathrm{UB}}^{\mathrm{pw}}$ and elaborate on relative uncertainties. In the following, we denote by $\mathcal{NN}^f_{\theta}: X \to Y$ a (standard) NN for model predictions. Note that $\mathcal{NN}^f_{\theta}$ does not represent the whole NOMU architecture but can be used as the $\hat{f}$-sub-network in the NOMU architecture $\mathcal{NN}^{\theta}$. Furthermore, we consider the hypothesis class of regularized NNs, i.e., $\mathcal{H} := \{\mathcal{NN}^f_{\theta} : \|\theta\|_2 \le \gamma\}$.

Recall that one needs to assume that the prior is fully concentrated on $\mathcal{H}$ in order to interpret the pointwise UBs $\underline{\mathrm{UB}}^{\mathrm{pw}}, \overline{\mathrm{UB}}^{\mathrm{pw}}$ as $\alpha = 1$ CBs. In the following, we present an alternative Bayesian interpretation of $\overline{\mathrm{UB}}^{\mathrm{pw}}$. Many other approaches (MC dropout, BNNs) assume a Gaussian prior on the parameters of the NNs, i.e., $\theta \sim \mathcal{N}(0, \sigma_{\theta}^2 I)$, and try to approximate the corresponding posterior CBs. Interestingly, $\underline{\mathrm{UB}}^{\mathrm{pw}}$ and $\overline{\mathrm{UB}}^{\mathrm{pw}}$ can also be seen as approximations of $\alpha < 1$ CBs in the case of such a Gaussian prior on the parameters. This can be seen as follows: let the data generating process be given as $y = \mathcal{NN}^f_{\theta} + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$ (for simplicity, we assume homoskedastic noise in this section). For a Gaussian prior on the parameters, $\theta \sim \mathcal{N}(0, \sigma_{\theta}^2 I)$, the negative log posterior can be written as

$$-\log(p(\theta \mid D^{\mathrm{train}})) = \frac{1}{2\sigma_n^2} \underbrace{\sum_{i=1}^{n^{\mathrm{train}}} \big(\mathcal{NN}^f_{\theta}(x^{\mathrm{train}}_i) - y^{\mathrm{train}}_i\big)^2}_{=:L(\theta)} + \frac{\|\theta\|_2^2}{2\sigma_{\theta}^2} + n^{\mathrm{train}} \log(\sigma_n) + C_{n^{\mathrm{train}}, \sigma_{\theta}} \tag{8}$$

for a constant $C_{n^{\mathrm{train}}, \sigma_{\theta}} := \frac{n^{\mathrm{train}}}{2} \log(2\pi) + \frac{1}{2} \log(2\pi\sigma_{\theta}^2)$. Then the pointwise upper UBs can be reformulated as

$$\overline{\mathrm{UB}}^{\mathrm{pw}}(x) \overset{\mathrm{def.}}{=} \sup_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x) = \lim_{\sigma_n \to 0}\ \sup_{f \in \mathcal{H}^{\sigma_n}_{D^{\mathrm{train}}}} f(x), \tag{9}$$

where

$$\mathcal{H}^{\sigma_n}_{D^{\mathrm{train}}} := \left\{\mathcal{NN}^f_{\theta} : \tfrac{\sigma_{\theta}^2}{\sigma_n^2}\, L(\theta) + \|\theta\|_2^2 \le \gamma^2\right\} \tag{10a}$$

$$\phantom{\mathcal{H}^{\sigma_n}_{D^{\mathrm{train}}}} = \left\{\mathcal{NN}^f_{\theta} : -\log(p(\theta \mid D^{\mathrm{train}})) \le \gamma_{\sigma_n}\right\}, \tag{10b}$$

with $\gamma_{\sigma_n} := \frac{\gamma^2}{2\sigma_{\theta}^2} + n^{\mathrm{train}} \log(\sigma_n) + C_{n^{\mathrm{train}}, \sigma_{\theta}}$. Therefore, for small data noise $\sigma_n \approx 0$, we obtain from Equations (9) and (10b) that

$$\overline{\mathrm{UB}}^{\mathrm{pw}}(x) \approx \sup\left\{\mathcal{NN}^f_{\theta}(x) : p(\theta \mid D^{\mathrm{train}}) \ge e^{-\gamma_{\sigma_n}}\right\}. \tag{11}$$

In words, from a Bayesian point of view, we seek the highest value $\mathcal{NN}^f_{\theta}(x)$ subject to the posterior density satisfying $p(\theta \mid D^{\mathrm{train}}) \ge e^{-\gamma_{\sigma_n}}$; this can be seen as a heuristic to approximate CBs, similarly to how the MAP in parameter space is a commonly used heuristic to approximate the posterior mean of $y$.

**Remark A.2** Note that if $p(\mathcal{NN}^f_{\theta}(x) \mid D^{\mathrm{train}}) = p(\theta \mid D^{\mathrm{train}})$ and this posterior is unimodal, (11) corresponds to a classical highest posterior density interval (HPDI).

Thus, pointwise worst-case UBs can be seen as an alternative method to approximate BNN-based CBs with the typically used Gaussian prior.

#### A.2.1. From Absolute to Relative Model Uncertainty

If the prior scale (e.g., $\gamma$ or $\sigma_{\theta}$ in the approaches mentioned above) is known, in theory no further calibration is needed and one can interpret the resulting UBs in absolute terms (Kuleshov et al., 2018). However, typically the prior scale is unknown, and the resulting UBs can only be interpreted in terms of relative model uncertainty (i.e., how much more model uncertainty does one have at one point $x$ compared to any other point $x'$?); in a second step, careful calibration of the resulting UBs is required. Thus, there are two (almost) independent problems: first, the fundamental question of how to estimate relative model uncertainty (addressed by MC dropout, deep ensembles, hyper deep ensembles, or NOMU), and second, the calibration of these UBs. In this paper, we do not mix up these two challenges and focus only on the first one. Furthermore, Desiderata D1–D4 and our metrics AUC and $\mathrm{NLL}_{\min}$ do not depend on the scaling of the uncertainty. Whenever we use NLL as a metric, we make sure to calibrate the uncertainties by a scalar $c$ to ensure that our evaluations do not depend on the scaling of the uncertainty predictions.

### A.3. A Note on Theorem A.1

In practice, the set $\mathcal{H}_{D^{\mathrm{train}}}$ is often not upwards directed for typical NN architectures, and Equation (7) of Theorem A.1 is not fulfilled in general. Indeed, $h^* := \operatorname{arg\,max}_{h \in \mathcal{H}_{D^{\mathrm{train}}}} \int_X u(h(x) - \hat{f}(x))\, dx$ can be much more overconfident than $\overline{\mathrm{UB}}^{\mathrm{pw}}$. However, in the following we motivate why, for a suitably chosen $u$, the relative model uncertainty of $h^*$ still fulfills Desiderata D1 (Non-Negativity), D2 (In-Sample) and D3 (Out-of-Sample).

**Remark A.3** In our proposed final algorithm NOMU from Section 3, we incorporated D4 (Metric Learning) by modifying the network architecture as presented in Section 3.2.

**Problem** First, we give an example of a specific NN architecture where Theorem A.1 is not fulfilled due to a violation of the upwards directed assumption. Note that, from a Lagrangian perspective, it is equivalent to add a term $\lambda \|\theta\|_2^2$, $\lambda \ge 0$, to the target function of (7) instead of bounding $\|\theta\|_2^2 \le \gamma^2$ as a constraint in $\mathcal{H}_{D^{\mathrm{train}}}$. Moreover, for a specific NN architecture it is shown that regularizing $\|\theta\|_2^2$ is equivalent to regularizing the $L^2$-norm of the second derivative of the function $f = \mathcal{NN}^f_{\theta}$ (Heiss et al., 2019), i.e., regularizing $\int_X f''(x)^2\, dx$. If we choose, for example, $u = \mathrm{id}: \mathbb{R} \to \mathbb{R}$, then increasing $h$ in between the two close points in the middle of Figure 7 would improve the target function $\int_X u(h(x) - \hat{f}(x))\, dx$ of (7) less than the additional regularization cost resulting from the additional second derivative when increasing $h$. Therefore, $h^*$ will lie below the mean prediction $\hat{f}$ in this region, break D1 (Non-Negativity), and $h^* \ne \overline{\mathrm{UB}}^{\mathrm{pw}}$.

*Figure 7. $h^*$ would not fulfill D1 (Non-Negativity) if $u$ were chosen as the identity and the architecture from Heiss et al. (2019) were used.*

**Solution** However, if we choose, e.g., $u: z \mapsto -e^{-c_{\mathrm{exp}} z}$ with $c_{\mathrm{exp}}$ large enough, we highly penalize $h < \hat{f}$, and thus $\hat{f} \le h^*$, i.e., D1 (Non-Negativity) is fulfilled. Since $h^* \in \mathcal{H}_{D^{\mathrm{train}}}$, it follows that $h^*(x^{\mathrm{train}}_i) - \hat{f}(x^{\mathrm{train}}_i) = 0$ for all training points. This, together with $\hat{f} \le h^*$, implies that the gradient of $h^* - \hat{f}$ vanishes at the training points (in the interior of $X$), which subsequently interrupts the downwards trend of $h^* - \hat{f}$ from one side of a training point to the other, as depicted in Figure 8.

*Figure 8. $h^*$ fulfills D1–D3 if $u$ prevents it from becoming negative.*

See Appendix D for a detailed discussion of how NOMU fulfills all five desiderata.

### A.4. Proof of Theorem A.1

In the following, we prove Theorem A.4, an even more general version of Theorem A.1. The statement of Theorem A.1 follows from Theorem A.4 by setting:

1. $X = \prod_{i=1}^{d} [a_i, b_i] \subset \mathbb{R}^d$ compact, $Y := \mathbb{R}$;
2. $\mu = \lambda_d$, where $\lambda_d$ is the Lebesgue measure on $\mathbb{R}^d$ (which is finite on every compact set and has full support).

**Theorem A.4** Let $X$ be a nonempty topological space equipped with a finite measure $\mu$ with full support (i.e., every nonempty open set has non-zero measure), let $Y$ be a normed and totally ordered space, and let $D^{\mathrm{train}}$ denote a set of observations. Moreover, let $\mathcal{H}_{D^{\mathrm{train}}} \subset (C(X, Y), \|\cdot\|_{\infty})$ be compact, and let $\hat{f} \in \mathcal{H}_{D^{\mathrm{train}}}$. Assume further that $\mathcal{H}_{D^{\mathrm{train}}}$ is upwards directed (see Assumption 1). Then, for every strictly-increasing and continuous $u: Y \to \mathbb{R}$, it holds that

$$\overline{\mathrm{UB}}^{\mathrm{pw}} = \operatorname*{arg\,max}_{h \in \mathcal{H}_{D^{\mathrm{train}}}} \int_X u\big(h(x) - \hat{f}(x)\big)\, d\mu(x). \tag{12}$$

*Proof.* First note that, since $\mu$ is finite and $h$, $\hat{f}$ and $\overline{\mathrm{UB}}^{\mathrm{pw}}$ are bounded ($\mathcal{H}_{D^{\mathrm{train}}}$ is assumed to be compact with respect to the $\|\cdot\|_{\infty}$-topology), the integral in (12) is finite. Let $h^* \in \mathcal{H}_{D^{\mathrm{train}}}$ denote an optimizer of (12), which exists since $\mathcal{H}_{D^{\mathrm{train}}}$ is assumed to be compact and the operator $h \mapsto \int_X u(h(x) - \hat{f}(x))\, d\mu(x)$ is continuous in the $\|\cdot\|_{\infty}$-topology. We want to show that $h^*(x) = \overline{\mathrm{UB}}^{\mathrm{pw}}(x)$ for every $x \in X$. Note that, per definition, for all $x \in X$ and $h \in \mathcal{H}_{D^{\mathrm{train}}}$ it holds that

$$\overline{\mathrm{UB}}^{\mathrm{pw}}(x) = \sup_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x) \ge h(x). \tag{13}$$

Thus $\overline{\mathrm{UB}}^{\mathrm{pw}}(x) \ge h^*(x)$ for all $x \in X$. For the reverse inequality, assume on the contrary that there exists an $x^* \in X$ such that

$$\overline{\mathrm{UB}}^{\mathrm{pw}}(x^*) > h^*(x^*). \tag{14}$$

Then we define $f_{x^*} := \operatorname{arg\,max}_{f \in \mathcal{H}_{D^{\mathrm{train}}}} f(x^*)$, which exists because of compactness and continuity. Since $f_{x^*}$ and $h^*$ are both continuous and $f_{x^*}(x^*) = \overline{\mathrm{UB}}^{\mathrm{pw}}(x^*) > h^*(x^*)$, there exists a neighbourhood $U_{x^*}$ of $x^*$ such that

$$f_{x^*}(x) > h^*(x) \quad \text{for all } x \in U_{x^*}. \tag{15}$$

Using the upwards directed property with $f_1 := f_{x^*}$ and $f_2 := h^*$, it follows that there exists an $\tilde{h} \in \mathcal{H}_{D^{\mathrm{train}}}$ with

$$\tilde{h}(x) \ge \max\big(f_{x^*}(x), h^*(x)\big) \quad \text{for all } x \in X. \tag{16}$$
Thus, numerically optimizing such an acquisition function is practically infeasible, as it would require solving the optimization problems from (5) 11Standard BNNs also aim to approximate the posterior coming from exactly this prior. millions of times. In NOMU, we circumvent this problem by constructing the UBs for all x X simultaneously. We can do so by only solving a single optimization problem, i.e., minimizing the NOMU loss from (4). In the following theorem, we show that these pointwisedefined UBs can be computed by solving a single optimization problem under the following assumption. Assumption 1 (UPWARDS DIRECTED) For every f1, f2 HDtrain there exists an f HDtrain such that f(x) max(f1(x), f2(x)) for all x X. Theorem A.1 (SINGLE OPTIMIZATION PROBLEM) Let X = Qd i=1[ai, bi] Rd, ai < bi, let Y = R, and let Dtrain be a nonempty set of training points. Furthermore, let HDtrain (C(X, Y ), ) be compact and upwards directed and ˆf HDtrain. Then, for every strictly-increasing and continuous u : R R, it holds that UBpw = arg max h HDtrain X u(h(x) ˆf(x)) dx. (7) In words, UBpw can be calculated via the single optimization problem (7) on HDtrain.12 Proof. See Appendix A.4. In practice, Assumption 1 can be violated such that a straightforward calculation of the r.h.s. of (7) for an arbitrary u would result in unreasonable UBs. However, for a sensible choice of u, NOMU s UBs based on the r.h.s. of (7) still satisfy our Desiderata D1 D5, similar to UBpw (see Appendix A.3 for a discussion). The connection of NOMU s UBs to the pointwise worstcase bounds UBpw, UBpw becomes clear by observing that minimizing NOMU s loss function Lπ (Equation (4)) can be interpreted as solving the r.h.s of (7) for a specific choice of u, when H is the class of regularized NNs. In detail: Term (a) of the NOMU-loss (4) implements that ˆf solves the regression task and thus ˆf HDtrain up to numerical precision (if the regularization λ is small enough). Term (b) enforces ˆrf(xtrain i ) 0 and thus when defining h := ˆf + ˆrf, we directly obtain h(xtrain i ) ytrain i corresponding to the constraint h HDtrain in (7). While terms (a) and (b) enforce the constraints of (7), term (c) is the objective function of (7) for the specific choice of u(z) := e cexpz, cexp R 0. 12We formulate Theorem A.1 for upper UBs UBpw. The analogous statement also holds for lower UBs UBpw. Appendix of NOMU: Neural Optimization-based Model Uncertainty A.2. Bayesian Interpretation of Pointwise Worst-Case Uncertainty Bounds In this Section, we provide a Bayesian interpretation of the pointwise worst-case UBs UBpw, UBpw and elaborate on relative uncertainties. In the following, we denote by NN f θ : X Y a (standard) NN for model predictions. Note that NN f θ does not represent the whole NOMU architecture but can be used as ˆf-sub-network in the NOMU architecture NN θ. Furthermore, we consider the hypothesis class of regularized NNs, i.e., H := n NN f θ : θ 2 γ o . Recall that one needs to assume that the prior is fully concentrated on H in order to interpret the pointwise UBs UBpw, UBpw as α=1 CBs. In the following, we will present an alternative Bayesian interpretation of UBpw. Many other approaches (MC dropout, BNNs) assume a Gaussian prior on the parameters of the NNs, i.e, θ N(0, σ2 θI), and try to approximate the corresponding posterior CBs. Interestingly, UBpw and UBpw can also be seen as approximations of α<1 CBs in the case of such a Gaussian prior on the parameters. 
This can be seen as follows: Let the data generating process be given as $y = NN^f_\theta(x) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$ (for simplicity, we assume homoskedastic noise in this section). For a Gaussian prior on the parameters $\theta \sim \mathcal{N}(0, \sigma_\theta^2 I)$, the negative log posterior can be written as

$-\log(p(\theta \mid D^{train})) = \frac{1}{2\sigma_n^2} \underbrace{\sum_{i=1}^{n^{train}} \left( NN^f_\theta(x_i^{train}) - y_i^{train} \right)^2}_{=: L(\theta)} + \frac{\|\theta\|_2^2}{2\sigma_\theta^2} + n^{train} \log(\sigma_n) + C_{n^{train}, \sigma_\theta}$   (8)

for a constant $C_{n^{train}, \sigma_\theta} := \frac{n^{train}}{2} \log(2\pi) + \frac{1}{2} \log(2\pi \sigma_\theta^2)$. Then the pointwise upper UBs can be reformulated as

$\overline{UB}_{pw}(x) \stackrel{\text{def.}}{=} \sup_{f \in \mathcal{H}_{D^{train}}} f(x) = \lim_{\sigma_n \to 0} \sup_{f \in \mathcal{H}^{\sigma_n}_{D^{train}}} f(x)$,   (9)

$\mathcal{H}^{\sigma_n}_{D^{train}} := \left\{ NN^f_\theta : \frac{\sigma_\theta^2}{\sigma_n^2} L(\theta) + \|\theta\|_2^2 \le \gamma \right\}$   (10a)
$= \left\{ NN^f_\theta : -\log(p(\theta \mid D^{train})) \le \gamma_{\sigma_n} \right\}$,   (10b)

where $\gamma_{\sigma_n} := \frac{\gamma}{2\sigma_\theta^2} + n^{train} \log(\sigma_n) + C_{n^{train}, \sigma_\theta}$. Therefore, for small data noise $\sigma_n \to 0$, we obtain from Equations (9) and (10b) that

$\overline{UB}_{pw}(x) \approx \sup\left\{ NN^f_\theta(x) : p(\theta \mid D^{train}) > e^{-\gamma_{\sigma_n}} \right\}$.   (11)

In words, from a Bayesian point of view, we seek the highest value $NN^f_\theta(x)$ provided that the posterior density $p(\theta \mid D^{train}) > e^{-\gamma_{\sigma_n}}$, which can be seen as a heuristic to approximate CBs, similarly as the MAP from the parameter space is a commonly used heuristic to approximate the posterior mean of $y$.

Remark A.2 Note that if $p(NN^f_\theta(x) \mid D^{train}) = p(\theta \mid D^{train})$ and this posterior is unimodal, (11) corresponds to a classical highest posterior density interval (HPDI).

Thus, pointwise worst-case UBs can be seen as an alternative method to approximate BNN-based CBs with the typically used Gaussian prior.

A.2.1. FROM ABSOLUTE TO RELATIVE MODEL UNCERTAINTY

If the prior scale (e.g., $\gamma$ or $\sigma_\theta$ in the above-mentioned approaches) is known, in theory no further calibration is needed and one can interpret the resulting UBs in absolute terms (Kuleshov et al., 2018). However, typically the prior scale is unknown; the resulting UBs can then only be interpreted in terms of relative model uncertainty (i.e., how much more model uncertainty does one have at one point $x$ compared to any other point $x'$?), and in a second step careful calibration of the resulting UBs is required. Thus, there are two (almost) independent problems: first, the fundamental concept of how to estimate relative model uncertainty (as in MC dropout, deep ensembles, hyper deep ensembles or NOMU), and second, the calibration of these UBs. In this paper, we do not mix up these two challenges and focus only on the first one. Furthermore, desiderata D1-D4 and our metrics AUC and NLLmin do not depend on the scaling of the uncertainty. Whenever we use NLL as a metric, we make sure to calibrate the uncertainties by a scalar $c$ to ensure that our evaluations do not depend on the scaling of the uncertainty predictions.

A.3. A Note on Theorem A.1

In practice, the set $\mathcal{H}_{D^{train}}$ often is not upwards directed for typical NN architectures, and Equation (7) of Theorem A.1 is not fulfilled in general. Indeed, $h^* := \arg\max_{h \in \mathcal{H}_{D^{train}}} \int_X u(h(x) - \hat{f}(x)) \, dx$ can be much more overconfident than $\overline{UB}_{pw}$. However, in the following we motivate why, for a suitably chosen $u$, the relative model uncertainty of $h^*$ still fulfills the desiderata D1 (Non-Negativity), D2 (In-Sample) and D3 (Out-of-Sample).

Remark A.3 In our proposed final algorithm NOMU from Section 3, we incorporated D4 (Metric Learning) by modifying the network architecture as presented in Section 3.2.
Problem First, we give an example of a specific NN architecture where Theorem A.1 is not fulfilled due to a violation of the upwards-directed assumption. Note that from a Lagrangian perspective it is equivalent to add a term $\lambda \|\theta\|_2^2$, $\lambda \ge 0$, to the target function of (7) instead of bounding $\|\theta\|_2^2 \le \gamma^2$ as a constraint in $\mathcal{H}_{D^{train}}$. Moreover, for a specific NN architecture it is shown that regularizing $\|\theta\|_2^2$ is equivalent to regularizing the $L^2$-norm of the second derivative of the function $f = NN^f_\theta$ (Heiss et al., 2019), i.e., regularizing $\int_X f''(x)^2 \, dx$. If we choose for example $u = \mathrm{id} : \mathbb{R} \to \mathbb{R}$, then increasing $h$ in between the two close points in the middle of Figure 7 would improve the target function $\int_X u(h(x) - \hat{f}(x)) \, dx$ of (7) less than the additional regularization cost resulting from the additional second derivative when increasing $h$. Therefore, $h^*$ will lie below the mean prediction $\hat{f}$ in this region, break D1 (Non-Negativity), and $h^* \ne \overline{UB}_{pw}$.

Figure 7. $h^*$ would not fulfill D1 (Non-Negativity) if $u$ was chosen as the identity and the architecture from Heiss et al. (2019) was used.

Solution However, if, e.g., we choose $u : x \mapsto -e^{-c_{exp} x}$ with $c_{exp}$ large enough, we highly penalize $h < \hat{f}$, and thus $\hat{f} \le h^*$, i.e., D1 (Non-Negativity) is fulfilled. Since $h^* \in \mathcal{H}_{D^{train}}$, it follows that $h^*(x_i^{train}) - \hat{f}(x_i^{train}) = 0$ at all training points. This together with $\hat{f} \le h^*$ implies that $(h^* - \hat{f})'(x_i^{train}) = 0$ at the training points (in the interior of $X$), which subsequently interrupts the downwards trend of $h^* - \hat{f}$ from one side of a training point to the other, as depicted in Figure 8.

Figure 8. $h^*$ fulfills D1-D3 if $u$ prevents it from becoming negative.

See Appendix D for a detailed discussion of how NOMU fulfills all five desiderata.

A.4. Proof of Theorem A.1

In the following, we prove Theorem A.4, an even more general version of Theorem A.1. The statement of Theorem A.1 follows from Theorem A.4 by setting
1. $X = \prod_{i=1}^d [a_i, b_i] \subset \mathbb{R}^d$ compact, $Y := \mathbb{R}$,
2. $\mu = \lambda^d$, where $\lambda^d$ is the Lebesgue measure on $\mathbb{R}^d$ (which is finite on every compact set and has full support).

Theorem A.4 Let $X$ be a nonempty topological space equipped with a finite measure $\mu$ with full support (i.e., every nonempty open set has non-zero measure), let $Y$ be a normed and totally ordered space, and let $D^{train}$ denote a set of observations. Moreover, let $\mathcal{H}_{D^{train}} \subset (C(X, Y), \|\cdot\|_\infty)$ be compact, and $\hat{f} \in \mathcal{H}_{D^{train}}$. Assume further that $\mathcal{H}_{D^{train}}$ is upwards directed (see Assumption 1). Then, for every strictly-increasing and continuous $u : Y \to \mathbb{R}$, it holds that

$\overline{UB}_{pw} = \arg\max_{h \in \mathcal{H}_{D^{train}}} \int_X u(h(x) - \hat{f}(x)) \, d\mu(x)$.   (12)

Proof. First note that since $\mu$ is finite and $h$, $\hat{f}$ and $\overline{UB}_{pw}$ are bounded ($\mathcal{H}_{D^{train}}$ is assumed to be compact with respect to the $\|\cdot\|_\infty$-topology), the integral in (12) is finite. Let $h^* \in \mathcal{H}_{D^{train}}$ denote an optimizer of (12), which exists since $\mathcal{H}_{D^{train}}$ is assumed to be compact and the operator $h \mapsto \int_X u(h(x) - \hat{f}(x)) \, d\mu(x)$ is continuous in the $\|\cdot\|_\infty$-topology. We want to show that $h^*(x) = \overline{UB}_{pw}(x)$ for every $x \in X$. Note that per definition, for all $x \in X$ and $h \in \mathcal{H}_{D^{train}}$ it holds that

$\overline{UB}_{pw}(x) = \sup_{f \in \mathcal{H}_{D^{train}}} f(x) \ge h(x)$.   (13)

Thus $\overline{UB}_{pw}(x) \ge h^*(x)$ for all $x \in X$. For the reverse inequality, assume on the contrary that there exists an $x' \in X$ such that

$\overline{UB}_{pw}(x') > h^*(x')$.   (14)

Then we define $f_{x'} := \arg\max_{f \in \mathcal{H}_{D^{train}}} f(x')$, which exists because of compactness and continuity. Since $f_{x'}$ and $h^*$ are both continuous and $f_{x'}(x') = \overline{UB}_{pw}(x') > h^*(x')$, there exists a neighbourhood $U_{x'}$ of $x'$ such that

$f_{x'}(x) > h^*(x)$ for all $x \in U_{x'}$.   (15)

Using the upwards-directed property with $f_1 := f_{x'}$ and $f_2 := h^*$, it follows that there exists an $\tilde{h} \in \mathcal{H}_{D^{train}}$ with

$\tilde{h}(x) \ge \max(f_{x'}(x), h^*(x))$ for all $x \in X$.   (16)
Using (15) together with (16) further implies that

$\tilde{h}(x) \ge h^*(x)$ for all $x \in X$, and   (17)
$\tilde{h}(x) > h^*(x)$ for all $x \in U_{x'}$.   (18)

However, since $u$ is strictly increasing and $\mu(U_{x'}) > 0$ by the full-support assumption, we get that

$\int_X u(\tilde{h}(x) - \hat{f}(x)) \, d\mu(x) > \int_X u(h^*(x) - \hat{f}(x)) \, d\mu(x)$,

which is a contradiction to the assumption that $h^*$ is the optimizer of (12). Therefore, it holds that $\overline{UB}_{pw}(x) \le h^*(x)$ for all $x \in X$. In total, we get that $\overline{UB}_{pw}(x) = h^*(x)$ for all $x \in X$, which concludes the proof.

Remark A.5 For our algorithm, we select $\mathcal{H} := \{NN^f_\theta : \|\theta\|^2 \le \gamma\}$, the class of regularized NNs with a continuous activation function. Thus, the assumption that $\mathcal{H}_{D^{train}}$ is compact and a subset of $C(X, Y)$ is fulfilled.

B. Experiments

B.1. Benchmark Methods

In this section, we give a brief overview of each considered benchmark algorithm.

B.1.1. GAUSSIAN PROCESS (GP)

A GP defines a distribution over a set of functions $\{f : X \to Y\}$ such that every finite collection of function evaluations follows a Gaussian distribution (see Williams & Rasmussen (2006) for a survey on GPs). A GP is completely specified by a mean function $m : X \to Y$ and a kernel function $k_\pi : X \times X \to Y$, where $\pi$ denotes a tuple of hyperparameters. More formally, for any finite set of $k \in \mathbb{N}$ input points $\mathbf{x} := \{x_1, \dots, x_k\}$, $x_i \in X$, and for $\mathbf{f} := (f_1, \dots, f_k) \in Y^k$ with $f_i := f(x_i)$, it holds that

$\mathbf{f} \sim \mathcal{N}_k(m(\mathbf{x}), K_\pi(\mathbf{x}, \mathbf{x}))$,   (20)

i.e., it follows a $k$-dimensional Gaussian distribution with covariance (or Gramian) matrix $[K_\pi(\mathbf{x}, \mathbf{x})]_{i,j} := k_\pi(x_i, x_j)$ and mean vector $m(\mathbf{x}) := (m(x_1), \dots, m(x_k))$. Let

$f \sim GP(m(\cdot), k_\pi(\cdot, \cdot))$   (21)

denote a GP with mean function $m$ and kernel $k_\pi$. In the following, we summarize the main steps of GP regression for a 1D setting with $m \equiv 0$.

1. Define the probabilistic model:
   $y = f(x) + \varepsilon$.   (22)
2. Specify the prior (kernel and data noise):
   $f \sim GP(0, k_\pi(\cdot, \cdot))$,   (23)
   $\varepsilon \mid x \sim \mathcal{N}(0, \sigma_n^2(x))$.   (24)
3. Calculate the likelihood for training points $\mathbf{x}$ and $\mathbf{y} := \{y_1, \dots, y_k\}$:
   $\mathbf{y} \mid \mathbf{x}, \pi \sim \mathcal{N}_k\left(0, K_\pi(\mathbf{x}, \mathbf{x}) + \mathrm{diag}(\sigma_n^2(\mathbf{x}))\right)$,   (25)
   with $\sigma_n^2(\mathbf{x}) := (\sigma_n^2(x_1), \dots, \sigma_n^2(x_k))$.
4. Optimize kernel hyperparameters (optional):
   $\hat{\pi} \in \arg\max_\pi p(\mathbf{y} \mid \mathbf{x}, \pi)$.   (26)
5. Calculate the posterior predictive distribution at a new point $x^*$:
   $f(x^*) \mid x^*, \mathbf{y}, \mathbf{x}, \hat{\pi} \sim \mathcal{N}(\hat{\mu}(x^*), \hat{\sigma}_f^2(x^*))$,   (27)
   where, for $A := K_{\hat{\pi}}(\mathbf{x}, \mathbf{x}) + \mathrm{diag}(\sigma_n^2(\mathbf{x}))$ and $k_{\hat{\pi}}(x^*, \mathbf{x}) := (k_{\hat{\pi}}(x^*, x_1), \dots, k_{\hat{\pi}}(x^*, x_k))$, the parameters are given as
   $\hat{\mu}(x^*) := k_{\hat{\pi}}(x^*, \mathbf{x}) A^{-1} \mathbf{f}(\mathbf{x})$,   (28)
   $\hat{\sigma}_f^2(x^*) := k_{\hat{\pi}}(x^*, x^*) - k_{\hat{\pi}}(x^*, \mathbf{x}) A^{-1} k_{\hat{\pi}}(x^*, \mathbf{x})^T$.   (29)

Setting the data noise to zero ($\sigma_n \to 0$) in (28) and (29) yields the mean prediction $\hat{f}$ and the model uncertainty prediction $\hat{\sigma}_f^2$ as

$\hat{f}(x^*) = k_{\hat{\pi}}(x^*, \mathbf{x}) (K_{\hat{\pi}}(\mathbf{x}, \mathbf{x}))^{-1} \mathbf{f}(\mathbf{x})$,   (30)
$\hat{\sigma}_f^2(x^*) = k_{\hat{\pi}}(x^*, x^*) - k_{\hat{\pi}}(x^*, \mathbf{x}) (K_{\hat{\pi}}(\mathbf{x}, \mathbf{x}))^{-1} k_{\hat{\pi}}(x^*, \mathbf{x})^T$,   (31)

which are then used to define the Gaussian process's UBs as $\hat{f}(x) \pm c \, \hat{\sigma}_f(x)$ with a calibration parameter $c \in \mathbb{R}_{\ge 0}$.
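As an illustration of Equations (30) and (31), the following minimal sketch uses the scikit-learn GaussianProcessRegressor mentioned in Appendix B.2.2, with a small jitter in place of exactly zero data noise; the data and parameter values are our own illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Noiseless 1D training data.
x_train = np.array([[-0.8], [-0.2], [0.3], [0.9]])
y_train = np.sin(3.0 * x_train).ravel()

# RBF kernel kappa * exp(-||x - x'||^2 / (2 h^2)); a tiny alpha acts as
# numerical jitter instead of exactly zero data noise (cf. Eqs. (30)-(31)).
kernel = ConstantKernel(1.0, (1e-5, 1e5)) * RBF(1.0, (1e-5, 1e5))
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-7, n_restarts_optimizer=10)
gp.fit(x_train, y_train)

x_grid = np.linspace(-1, 1, 201).reshape(-1, 1)
f_hat, sigma_f = gp.predict(x_grid, return_std=True)  # Eq. (30), sqrt of Eq. (31)

c = 1.0                                    # calibration parameter
ub, lb = f_hat + c * sigma_f, f_hat - c * sigma_f
```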
B.1.2. MONTE CARLO DROPOUT (MCDO)

Let $NN^f_\theta = W^K \circ \phi \circ W^{K-1} \circ \dots \circ \phi \circ W^1(x)$ be an NN with $K-1$ hidden layers, activation function $\phi$, and fixed parameters $\theta = \{W^1, \dots, W^K\}$, which have been trained with added dropout regularization. Furthermore, let $(p_1, \dots, p_K)$ denote the dropout probability vector used when training $NN^f_\theta$, i.e., $p_i$ determines the probability for a single node in the $i$th hidden layer $W^i$ to be dropped in each backpropagation step. (One could also use different probabilities $p_{ij}$ for each node within a hidden layer; the equations extend straightforwardly.)

To obtain model uncertainty, one draws $M$ different NNs according to the dropout probability vector and represents model uncertainty using sample estimates of the mean and variance of the model predictions. (Alternatively, one could also determine the UBs using empirical upper and lower quantiles of the different model predictions.) These predictions are frequently termed stochastic forward passes. More formally, given a dropout probability vector $(p_1, \dots, p_K)$, one draws $M$ realisations $\{\theta^{(1)}, \dots, \theta^{(M)}\}$ of the parameters $\theta$, where $\theta^{(m)} := \{W^{1,(m)}, \dots, W^{K,(m)}\}$. $W^{k,(m)}$ is obtained from the original hidden layer $W^k$ by dropping each column with probability $p_i$, i.e., for $W^k \in \mathbb{R}^{d_{row} \times d_{col}}$ set

$W^{k,(m)} = W^k \odot \left[ z^{(m)}_1, \dots, z^{(m)}_{d_{col}} \right]$, $z^{(m)}_j \in \mathbb{R}^{d_{row}}$,   (32)

where each $z^{(m)}_j$ is the all-zeros vector with probability $p_i$ and the all-ones vector with probability $1 - p_i$.   (33)

UBs that represent model uncertainty and known data noise $\sigma_n^2$ are then estimated for each $x \in X$ as

$\hat{f}(x) := \frac{1}{M} \sum_{m=1}^M NN^f_{\theta^{(m)}}(x)$,   (34)
$\hat{\sigma}^2(x) := \underbrace{\frac{1}{M} \sum_{m=1}^M \left( NN^f_{\theta^{(m)}}(x) - \hat{f}(x) \right)^2}_{\text{model uncertainty}} + \underbrace{\sigma_n^2(x)}_{\text{data noise}}$.   (35)

The model uncertainty prediction $\hat{\sigma}_f^2$ is then given as

$\hat{\sigma}_f^2(x) := \frac{1}{M} \sum_{m=1}^M \left( NN^f_{\theta^{(m)}}(x) - \hat{f}(x) \right)^2$,   (36)

which defines MC dropout's UBs as $\hat{f}(x) \pm c \, \hat{\sigma}_f(x)$ with a calibration parameter $c \in \mathbb{R}_{\ge 0}$.

B.1.3. DEEP ENSEMBLES (DE)

Deep ensembles consist of the following two steps (Lakshminarayanan et al. (2017) also considered a third step, adversarial training; however, as the authors point out, the effectiveness of adversarial training drops quickly as the number of networks in the ensemble increases, so we do not consider adversarial training in this paper):

1. Use a NN to define a predictive distribution $p_\theta(y \mid x)$, select a pointwise loss function (proper scoring rule) $\ell(p_\theta, (x, y))$, which measures the quality of the predictive distribution $p_\theta(y \mid x)$ for an observation $(x, y)$, and define the empirical loss used for training as
   $\sum_{(x,y) \in D^{train}} \ell(p_\theta, (x, y))$.   (37)
2. Use an ensemble of NNs, each with different randomly initialized parameters, to represent model uncertainty.

Concretely, for regression, Lakshminarayanan et al. (2017) use a NN $NN^f_\theta$ with two outputs, $\hat{\mu}_\theta$ (mean prediction) and $\hat{\sigma}^\theta_n$ (data noise prediction), and train it using as pointwise loss function the Gaussian negative log-likelihood, i.e.,

$p_\theta(y \mid x) := \mathcal{N}\left(y;\, \hat{\mu}_\theta(x),\, (\hat{\sigma}^\theta_n(x))^2\right)$,   (38)
$\ell(p_\theta, (x, y)) := \frac{\log\left((\hat{\sigma}^\theta_n(x))^2\right)}{2} + \frac{(\hat{\mu}_\theta(x) - y)^2}{2 (\hat{\sigma}^\theta_n(x))^2}$.   (39)

To add model uncertainty, Lakshminarayanan et al. (2017) use an ensemble of $M$ NNs $\{NN^f_{\theta^{(1)}}, \dots, NN^f_{\theta^{(M)}}\}$, where each NN outputs a mean and a data noise prediction, i.e., for $x \in X$ and $m \in \{1, \dots, M\}$

$NN^f_{\theta^{(m)}}(x) := \left( \hat{\mu}_{\theta^{(m)}}(x), \hat{\sigma}^{\theta^{(m)}}_n(x) \right)$.   (40)

This then defines the learned predictive distribution of each NN in regression as

$p_{\theta^{(m)}}(y \mid x) = \mathcal{N}\left(y;\, \hat{\mu}_{\theta^{(m)}}(x),\, (\hat{\sigma}^{\theta^{(m)}}_n(x))^2\right)$.   (41)

Finally, the ensemble is treated as a uniformly-weighted Gaussian mixture model, i.e.,

$p(y \mid x) := \frac{1}{M} \sum_{m=1}^M p_{\theta^{(m)}}(y \mid x)$,   (42)

which is further approximated by a single Gaussian by matching the first and second moments. The deep ensemble predictions for the mean $\hat{f}$ and for the combined model uncertainty and data noise $\hat{\sigma}^2$ are given as

$\hat{f}(x) := \frac{1}{M} \sum_{m=1}^M \hat{\mu}_{\theta^{(m)}}(x)$,   (43)
$\hat{\sigma}^2(x) := \underbrace{\frac{1}{M} \sum_{m=1}^M \left( \hat{\sigma}^{\theta^{(m)}}_n(x) \right)^2}_{\text{data noise}} + \underbrace{\frac{1}{M} \sum_{m=1}^M \left[ \hat{\mu}_{\theta^{(m)}}(x) - \hat{f}(x) \right]^2}_{\text{model uncertainty}}$.   (44)

The model uncertainty prediction $\hat{\sigma}_f^2$ is then given as

$\hat{\sigma}_f^2(x) := \frac{1}{M} \sum_{m=1}^M \left[ \hat{\mu}_{\theta^{(m)}}(x) - \hat{f}(x) \right]^2$,   (45)

which defines deep ensembles' UBs as $\hat{f}(x) \pm c \, \hat{\sigma}_f(x)$ with a calibration parameter $c \in \mathbb{R}_{\ge 0}$.
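A minimal Keras sketch of Equations (43) and (45) for the noiseless single-output variant described in Remark B.1 below. The (shrunken) architecture, epoch count and regularization value are illustrative assumptions, not the exact configuration from Appendix B.2.2.

```python
import numpy as np
import tensorflow as tf

def make_member(l2=1e-8):
    # Single-output member NN (known zero data noise, cf. Remark B.1 below).
    reg = tf.keras.regularizers.l2(l2)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(1, kernel_regularizer=reg),
    ])

def fit_deep_ensemble(x_train, y_train, M=5, epochs=1024):
    members = []
    for _ in range(M):                      # diversity from random initialization
        nn = make_member()
        nn.compile(optimizer="adam", loss="mse")
        nn.fit(x_train, y_train, batch_size=len(x_train), epochs=epochs, verbose=0)
        members.append(nn)
    return members

def predict_with_uncertainty(members, x):
    preds = np.stack([m.predict(x, verbose=0).ravel() for m in members])
    f_hat = preds.mean(axis=0)              # Eq. (43)
    sigma_f = preds.std(axis=0)             # square root of Eq. (45)
    return f_hat, sigma_f
```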
Remark B.1 For known data noise $\sigma_n$, no estimation is required and one can use an NN $NN^f_\theta$ with only one output $\hat{\mu}_\theta$. If additionally the data noise $\sigma_n$ is assumed to be homoskedastic, one can train $NN^f_\theta$ using the mean squared error (MSE) with a suitably chosen $L^2$-regularization parameter $\lambda$. To obtain predictive bounds instead of credible bounds, one can add $\sigma_n^2$ to (45) at the end.

B.1.4. HYPER DEEP ENSEMBLES (HDE)

Hyper deep ensembles (HDE) (Wenzel et al., 2020b) are a simple extension of DE. In HDE, the ensemble is diversified by varying not only weight initializations, but also hyperparameters. HDE involves (i) a random search over different hyperparameters and (ii) stratification across different random initializations. HDE inherits all the components (e.g., the architecture or the loss function) from DE, which are presented in detail in Appendix B.1.3. Specifically, formulas (43) and (45) for the mean prediction $\hat{f}$ and the model uncertainty prediction $\hat{\sigma}_f^2$ are the same for HDE.

Let $NN^f_{\theta^{(m)}, \pi^{(m)}}$ denote a NN for model predictions with weights $\theta^{(m)}$ and hyperparameters $\pi^{(m)}$ (e.g., the dropout rate). Furthermore, let $\mathrm{rand\_search}(\kappa)$ denote the random search algorithm from Bergstra & Bengio (2012), where $\kappa$ is the desired number of different NNs with randomly sampled hyperparameters. The only difference between HDE and DE is the procedure by which the ensemble is built, which we recall in Algorithm 2. Algorithm 2 uses as a subprocedure Algorithm 1 from Caruana et al. (2004), which we present first. Algorithm 1 greedily grows an ensemble from a given pre-defined set of models $\mathcal{M}$, until some target size $M$ is met, by selecting with replacement the NN leading to the best improvement of a certain score $S$ on a validation set. (We use a slightly different notation than Wenzel et al. (2020a) here.)

Algorithm 1 hyper_ens (Caruana et al., 2004)
  Ensemble $E := \{\}$; score $S(\cdot)$; $s_{best} = +\infty$
  while $|E.\mathrm{unique}()| \le M$ do
    $NN^f_{\theta^*} \leftarrow \arg\min_{NN^f_\theta \in \mathcal{M}} S(E \cup \{NN^f_\theta\})$
    if $S(E \cup \{NN^f_{\theta^*}\}) < s_{best}$ then
      $E = E \cup \{NN^f_{\theta^*}\}$; $s_{best} = S(E)$
    else
      output $E$
    end if
  end while
  output $E$

Note that Algorithm 1 does not require the NNs to have the same random weight initialization. However, Wenzel et al. (2020b) consider a fixed initialization for HDE to isolate the effect of just varying the hyperparameters. Finally, Algorithm 2 builds an ensemble of at most $M$ unique NNs which can exploit both sources of diversity: different random initializations and different choices of hyperparameters. A sketch of the greedy selection step follows after the algorithm listings.

Algorithm 2 hyper_deep_ens (Wenzel et al., 2020b)
  $\mathcal{M}_0 = \{NN^f_{\theta^{(j)}, \lambda^{(j)}}\}_{j=1}^{\kappa} \leftarrow \mathrm{rand\_search}(\kappa)$
  $E_0 \leftarrow \mathrm{hyper\_ens}(\mathcal{M}_0, M)$
  $E_{strat} = \{\}$
  for all $NN^f_{\theta, \lambda} \in E_0.\mathrm{unique}()$ do
    for all $m \in \{1, \dots, M\}$ do
      $\theta' \leftarrow$ random initialization (seed number $m$)
      $NN^f_{\theta^{(m)}, \lambda} \leftarrow$ train $NN^f_{\theta', \lambda}$
      $E_{strat} = E_{strat} \cup \{NN^f_{\theta^{(m)}, \lambda}\}$
    end for
  end for
  output $\mathrm{hyper\_ens}(E_{strat}, M)$
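The following minimal Python sketch illustrates the greedy with-replacement selection of Algorithm 1. The function name, score_fn, and the model containers are our own illustrative assumptions.

```python
import math

def hyper_ens(models, M, score_fn):
    """Greedy with-replacement ensemble selection (cf. Algorithm 1).

    models:   list of candidate (trained) models
    M:        target number of unique ensemble members
    score_fn: maps a list of models to a validation score (lower is better),
              e.g. the ensemble NLL on a held-out set (an assumption here).
    """
    ensemble, s_best = [], math.inf
    while len(set(map(id, ensemble))) < M:
        best = min(models, key=lambda nn: score_fn(ensemble + [nn]))
        if score_fn(ensemble + [best]) < s_best:
            ensemble.append(best)           # with replacement: may repeat
            s_best = score_fn(ensemble)
        else:
            break                           # no candidate improves the score
    return ensemble
```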
B.2. Regression

B.2.1. METRICS

In this section, we provide details on the three metrics which we use to assess the quality of UBs in the regression settings. In the following, let

$\underline{UB}_c(x), \overline{UB}_c(x) := \hat{f}(x) \mp c \, \hat{\sigma}_f(x)$   (46)

denote UBs obtained from any of the considered models via a model prediction $\hat{f}$, a model uncertainty prediction $\hat{\sigma}_f$, and a calibration parameter $c \in \mathbb{R}_{\ge 0}$. Furthermore, we use the shorthand notation

$UB_c(x) := \left( \underline{UB}_c(x), \overline{UB}_c(x) \right)$.   (47)

In the following, let $D := \{(x_i, y_i) \in X \times Y,\ i \in \{1, \dots, n\}\}$, $n \in \mathbb{N}$, denote a set of input-output points (depending on the specific purpose, $D$ would typically refer to a validation or test set).

Mean Width vs. Coverage Probability We first formalize the concepts of mean width (MW) and coverage probability (CP). Then we define the metric AUC.

Definition B.2 (COVERAGE PROBABILITY) Let $D$ denote a set of input-output points, and let $UB_c$ denote UBs for a calibration parameter $c$. Then the coverage probability is defined as

$CP(D \mid UB_c) := \frac{1}{|D|} \sum_{(x,y) \in D} \mathbb{1}\{\underline{UB}_c(x) \le y \le \overline{UB}_c(x)\}$.   (48)

Definition B.3 (MEAN WIDTH) Let $D$ denote a set of input-output points, and let $UB_c$ denote UBs for a calibration parameter $c$. Then the mean width is defined as

$MW(D \mid UB_c) := \frac{1}{|D|} \sum_{(x,y) \in D} |\overline{UB}_c(x) - \underline{UB}_c(x)|$.   (49)

Remark B.4 Note that in Definition B.3, uncovered points at which UBs (for some fixed calibration parameter $c$) are narrow have a positive effect on the overall mean width. In order not to reward such overconfident mispredictions, a possible remedy is to consider in (49) only the subset $D^{capt} \subseteq D$ of input-output points captured by the UBs, i.e., $D^{capt} := \{(x, y) \in D : \underline{UB}_c(x) \le y \le \overline{UB}_c(x)\}$. However, focusing on captured data points only punishes UBs that capture some points with large widths and almost cover others with small widths unnecessarily harshly, compared to UBs for which the reverse is true. In other words, a slight change in the calibration parameter $c$ can lead to very diverse evaluations of UBs that had been assessed as almost equal under the original $c$. Since ultimately we are interested in comparing UBs based on a range of calibration parameters (see Metric 1), we decided to include all points in the calculation of MW in our experiments.

Ideally, MW should be as small as possible, while CP should be close to its maximal value 1. Clearly, CP counteracts MW. This naturally motivates considering ROC-like curves, plotting MW against CP for a range of calibration parameters $c$, and comparing different UBs based on their area under the curve (AUC).

Metric 1 (AUC) Let $D$ denote a set of input-output points. Define further $c^*$ as the minimal calibration parameter achieving full coverage of $D$ for given UBs $UB_c$, i.e., $c^* := \arg\min_{c \ge 0} \{CP(D \mid UB_c) = 1\}$. AUC is then defined as the integral of the curve $\{(CP(D \mid UB_c), MW(D \mid UB_c)) : c \in [0, c^*]\}$. (In our experiments, we approximated this integral via the trapezoidal rule.)

Negative Log-likelihood Next, we define the second metric, the average negative log (Gaussian) likelihood (NLL), and then present our third metric, NLLmin. (We remark that NLL can be interpreted as an average marginal negative log predictive density over the set $D$, assuming these marginals are Gaussian with mean $\hat{f}$ and standard deviation $c \hat{\sigma}_f$. In particular, we refrain from posing assumptions of posterior independence, which we believe are highly flawed; i.e., NLL should not be interpreted as a joint negative log predictive density.)

Metric 2 (NLL) Let $D$ denote a set of input-output points, and let $UB_c$ denote UBs for a calibration parameter $c$ with corresponding model prediction $\hat{f} : X \to \mathbb{R}$ and model uncertainty prediction $\hat{\sigma}_f : X \to \mathbb{R}_{\ge 0}$. Then NLL is defined as

$NLL(D \mid UB_c) := \frac{1}{|D|} \sum_{(x,y) \in D} \left( \frac{(y - \hat{f}(x))^2}{2 (c \hat{\sigma}_f(x))^2} + \ln(c \hat{\sigma}_f(x)) \right)$.   (50)

The first term in NLL measures the error of the model prediction, where large errors can be attenuated by large uncertainties, while the second term logarithmically penalizes larger uncertainties. Thus, NLL penalizes both over-confident ($c \hat{\sigma}_f(x) \to 0$) wrong predictions as well as under-confident ($c \hat{\sigma}_f(x) \gg 0$) predictions. We define the third metric as the minimal value of the NLL when varying the calibration parameter $c$.

Metric 3 (MINIMUM NLL) Let $D$ denote a set of input-output points, and let $UB_c$ denote UBs for a calibration parameter $c$. Then NLLmin is defined as

$NLL_{min} := \min_{c \in \mathbb{R}_{\ge 0}} NLL(D \mid UB_c)$.   (51)
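A minimal sketch of Metrics 2 and 3; the grid range for $c$ and the numerical guard are our own assumptions.

```python
import numpy as np

def nll(y, f_hat, sigma_f, c):
    # Eq. (50), without the constant ln(2*pi)/2 that is also omitted in the
    # reported tables.
    s = np.maximum(c * sigma_f, 1e-12)      # guard against zero uncertainty
    return np.mean((y - f_hat) ** 2 / (2.0 * s**2) + np.log(s))

def nll_min(y, f_hat, sigma_f, grid=np.logspace(-3, 3, 601)):
    # Metric 3: minimize NLL over the calibration parameter c on a log grid.
    return min(nll(y, f_hat, sigma_f, c) for c in grid)
```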
Remark B.5 (CALIBRATION) Different approaches to obtaining UBs require rather different scalings to reach a similar CP (as well as MW). For instance, as mentioned above, we found that the calibration parameter $c$ required to achieve a certain value of CP is typically larger for DE than for the remaining methods by a factor of approximately 10. Therefore, in practice it is important to find a well-calibrated parameter $c$. Standard calibration techniques can be applied to NOMU (e.g., methods based on isotonic regression (Kuleshov et al., 2018), or, when assuming Gaussian marginals of the posterior, one can select $c$ as $\arg\min_{c \in \mathbb{R}_{\ge 0}} NLL(D \mid UB_c)$ on a validation set $D$).

B.2.2. TOY REGRESSION: CONFIGURATION DETAILS

All NN-based methods are fully connected feed-forward NNs with ReLU activation functions, implemented in TENSORFLOW.KERAS and trained for 2^10 epochs of TENSORFLOW.KERAS Adam stochastic gradient descent with standard learning rate 0.001 and a full batch of all training points. Moreover, weights and biases are initialized uniformly in the interval [-0.05, 0.05].

NOMU Setup For the MC integration of the integral in the NOMU loss (4), i.e., term (c), we use $l = 128$ and $l = 256$ artificial input points in 1D and 2D, respectively (we draw new artificial input points in every gradient descent step). (In 1D, we tried MC integration based on uniform samples and MC integration based on a deterministic grid; both approaches led to qualitatively similar results, and the latter is presented in this paper.) For numerical stability, we use the parameters $\theta$ that gave the best training loss during training, which are not necessarily the ones after the last gradient descent step.

Table 5. Detailed results of the AUC metric for NOMU, GP, MCDO, DE, HDE and HDE* (our HDE, see Appendix B.2.2) for all ten 1D synthetic functions. Shown are the medians and a 95% bootstrap confidence interval over 500 runs. Winners are marked in grey.

FUNCTION    NOMU               GP                 MCDO               DE                 HDE                HDE*
ABS         0.07 [0.07, 0.08]  0.14 [0.13, 0.16]  0.16 [0.15, 0.17]  0.08 [0.07, 0.09]  1.02 [0.82, 1.20]  0.72 [0.65, 0.81]
STEP        0.29 [0.26, 0.31]  0.93 [0.85, 1.02]  0.25 [0.24, 0.27]  0.14 [0.13, 0.15]  1.90 [1.61, 2.67]  0.82 [0.66, 0.99]
KINK        0.08 [0.08, 0.09]  0.17 [0.15, 0.19]  0.22 [0.21, 0.23]  0.09 [0.07, 0.10]  1.81 [1.40, 2.31]  1.01 [0.90, 1.24]
SQUARE      0.12 [0.11, 0.13]  0.16 [0.13, 0.19]  0.38 [0.36, 0.41]  0.14 [0.12, 0.15]  1.72 [1.50, 2.23]  1.86 [1.54, 2.53]
CUBIC       0.07 [0.07, 0.08]  0.00 [0.00, 0.00]  0.10 [0.10, 0.11]  0.10 [0.09, 0.11]  0.63 [0.54, 0.79]  0.52 [0.45, 0.66]
SINE 1      1.10 [1.03, 1.16]  1.14 [1.09, 1.20]  0.90 [0.87, 0.93]  1.21 [1.16, 1.26]  9.38 [7.42, 11.4]  9.75 [7.64, 12.1]
SINE 2      0.38 [0.37, 0.40]  0.42 [0.41, 0.44]  0.36 [0.35, 0.36]  0.50 [0.47, 0.53]  3.18 [2.82, 4.08]  2.43 [2.12, 3.38]
SINE 3      0.20 [0.19, 0.21]  0.31 [0.29, 0.34]  0.28 [0.26, 0.29]  0.20 [0.19, 0.21]  1.43 [1.19, 1.73]  1.33 [1.10, 1.57]
FORRESTER   0.19 [0.18, 0.20]  0.18 [0.17, 0.19]  0.26 [0.24, 0.28]  0.25 [0.24, 0.27]  1.28 [1.11, 1.51]  1.58 [1.35, 1.93]
LEVY        0.40 [0.39, 0.42]  0.53 [0.51, 0.56]  0.38 [0.36, 0.39]  0.45 [0.43, 0.48]  3.43 [2.67, 4.42]  3.58 [2.83, 4.18]

Table 6. Detailed results of the NLLmin metric (without the constant ln(2π)/2) for NOMU, GP, MCDO, DE, HDE and HDE* (our HDE, see Appendix B.2.2) for all ten 1D synthetic functions. Shown are the medians and a 95% bootstrap confidence interval over 500 runs. Winners are marked in grey.
FUNCTION    NOMU                  GP                    MCDO                  DE                    HDE                   HDE*
ABS         -2.85 [-2.95, -2.80]  -2.82 [-2.90, -2.71]  -1.76 [-1.80, -1.69]  -2.91 [-3.03, -2.76]  -0.19 [-0.36, 0.00]   -0.72 [-0.82, -0.56]
STEP        -1.04 [-1.12, -0.97]  -0.61 [-0.81, -0.49]  -0.78 [-0.85, -0.69]  -2.49 [-2.59, -2.36]   0.61 [0.38, 0.83]    -1.18 [-1.42, -0.84]
KINK        -2.74 [-2.80, -2.69]  -2.64 [-2.75, -2.54]  -1.28 [-1.32, -1.22]  -2.81 [-2.93, -2.68]   0.56 [0.26, 0.83]    -0.35 [-0.46, -0.17]
SQUARE      -2.45 [-2.51, -2.41]  -3.80 [-4.21, -3.38]  -0.56 [-0.65, -0.52]  -2.27 [-2.37, -2.16]   0.53 [0.31, 0.73]     0.07 [-0.17, 0.22]
CUBIC       -2.98 [-3.04, -2.94]  -6.03 [-6.15, -5.92]  -2.41 [-2.47, -2.33]  -2.65 [-2.70, -2.59]  -0.77 [-0.93, -0.64]  -1.42 [-1.63, -1.15]
SINE 1       0.10 [0.04, 0.18]    -0.06 [-0.13, 0.01]    0.09 [0.07, 0.13]     0.09 [0.04, 0.16]     1.97 [1.71, 2.28]     1.65 [1.45, 1.84]
SINE 2      -1.35 [-1.38, -1.32]  -1.48 [-1.52, -1.42]  -1.03 [-1.06, -1.00]  -0.92 [-0.96, -0.88]   1.21 [0.95, 1.65]     0.45 [0.26, 0.73]
SINE 3      -1.55 [-1.61, -1.47]  -1.69 [-1.85, -1.55]  -1.07 [-1.12, -1.01]  -1.66 [-1.75, -1.58]   0.26 [0.10, 0.40]     0.02 [-0.21, 0.21]
FORRESTER   -1.98 [-2.03, -1.91]  -2.68 [-2.86, -2.49]  -1.10 [-1.19, -1.03]  -1.75 [-1.79, -1.69]  -0.11 [-0.23, 0.17]   -0.50 [-0.65, -0.28]
LEVY        -1.12 [-1.14, -1.07]  -0.93 [-0.98, -0.90]  -0.76 [-0.79, -0.73]  -1.04 [-1.08, -0.97]   1.35 [0.99, 1.53]     0.74 [0.53, 0.96]

Deep Ensembles Setup We consider the proposed number of five ensemble members (Lakshminarayanan et al., 2017) with a single output $\hat{\mu}_\theta$ only (accounting for zero data noise), each with three hidden layers consisting of 2^8, 2^10 and 2^9 nodes (resulting in 4 million parameters). We train them on the standard regularized mean squared error (MSE) with regularization parameter $\lambda = 10^{-8}/n^{train}$, chosen to represent the same data noise assumptions as NOMU.

MC Dropout Setup The MC dropout network is likewise set up with three hidden layers, with 2^10, 2^11 and 2^10 nodes (resulting in 4 million parameters). Both training and prediction of this model are performed with constant dropout probability $p := p_i = 0.2$, as proposed in (Gal & Ghahramani, 2016). We perform 100 stochastic forward passes. To represent the same data noise assumptions as NOMU and deep ensembles, we set the regularization parameter to $\lambda = (1-p) \cdot 10^{-8}/n^{train} = (1-0.2) \cdot 10^{-8}/n^{train}$ (based on Equation (7) from (Gal & Ghahramani, 2016)).

Gaussian Process Setup Finally, we compare to a Gaussian process with RBF kernel $k_\pi(x, x') := \kappa \, e^{-\frac{\|x - x'\|^2}{2 h^2}}$ with hyperparameters $\pi := (\kappa, h)$, where both the prior variance parameter $\kappa$ and the length-scale parameter $h$ (initialized as $\kappa = h = 1$) are optimized in the default range $[10^{-5}, 10^5]$ using 10 restarts, and the data noise level is set to $10^{-7}$ (mainly for numerical stability; we use GaussianProcessRegressor from SCIKIT-LEARN).
Hyper Deep Ensembles Setup We consider the same network architectures as for DE with a single output $\hat{\mu}_\theta$ only (accounting for zero data noise), each with three hidden layers consisting of 2^8, 2^10 and 2^9 nodes (resulting in 4 million parameters). We train them on the standard regularized mean squared error (MSE). Furthermore, we use the following proposed parameter values from Wenzel et al. (2020b) to build the ensembles: we consider $M := 5$ networks (termed $K$ in (Wenzel et al., 2020b)) in the final ensemble and search over $\kappa = 50$ networks resulting from random search. Moreover, as random hyperparameters we use the proposed $L^2$ regularization and the dropout rate, which we draw as in (Wenzel et al., 2020b) log-uniformly from $(10^{-8} \cdot 10^{-3},\ 10^{-8} \cdot 10^{3})$ and $(0.001, 0.9)$, respectively. Furthermore, we use the proposed train/validation split of 80%/20% and, as score $S$, the NLL. Finally, as for MCDO and DE, we scale the realized $L^2$ regularization by $(1 - \text{dropout rate})/\lfloor 0.8 \, n^{train} \rfloor$, chosen to represent the same data noise assumptions as NOMU. For our HDE version, termed HDE*, we use a train/validation split of 70%/30%, draw the dropout probability from $(0.001, 0.5)$, and continue training the final ensemble again on all $n^{train}$ training points (rescaling the realized $L^2$ regularization again by $\lfloor 0.8 \, n^{train} \rfloor / n^{train}$).

B.2.3. TOY REGRESSION: DETAILED RESULTS

Results 1D In Table 5 (AUC) and Table 6 (NLLmin), we present detailed results, which correspond to the ranks presented in Table 1 in the main paper. In Table 7, we present aggregate results, i.e., median values (incl. a 95% bootstrap confidence interval (CI)) of AUC and NLLmin across all 5000 runs (500 runs for each of 10 test functions) for each algorithm. We see that NOMU performs well on both metrics and yields numbers similar to those of DE. The GP is only competitive on NLLmin. MCDO performs worse than NOMU and DE on both metrics. Not surprisingly, HDE, shown to be the state-of-the-art ensemble algorithm (Wenzel et al., 2020b), fails to produce reliable model uncertainty estimates in such a scarce and noiseless regression setting (see Appendix B.2.4 for a discussion).

Table 7. Aggregate results for NOMU, GP, MCDO, DE, HDE and HDE* (our HDE, see Appendix B.2.2) for the set of all ten 1D synthetic functions. Shown are the medians and a 95% bootstrap CI of AUC and NLLmin (without the constant ln(2π)/2) across all runs. Winners are marked in grey.

METHOD  AUC   [95%-CI]      NLLMIN  [95%-CI]
NOMU    0.21  [0.21, 0.22]  -1.76   [-1.81, -1.72]
GP      0.33  [0.32, 0.35]  -1.74   [-1.81, -1.69]
MCDO    0.30  [0.29, 0.30]  -1.02   [-1.05, -0.99]
DE      0.22  [0.21, 0.23]  -1.75   [-1.79, -1.70]
HDE     2.04  [1.90, 2.20]   0.59   [0.51, 0.66]
HDE*    1.73  [1.63, 1.84]   0.04   [-0.03, 0.10]

Results 2D Test functions for 2D input are taken from the same library as in the 1D setting (sfu.ca/~ssurjano/optimization.html; all test functions are scaled to $X = [-1,1]^2$ and $f(X) = [-1,1]$). Specifically, we select the following 11 different test functions: Sphere, G-Function, Goldstein-Price, Levy, Bukin N.6, Rosenbrock, Beale, Camel, Perm, Branin, and Styblinski-Tang. In Table 8, we present median values of AUC and NLLmin across all 5500 runs (500 runs for each of 11 test functions) for each algorithm. We also provide a 95% bootstrap confidence interval (CI) to assess statistical significance. We observe a similar ranking of the algorithms as in 1D: NOMU and DE are ranked first in AUC, and the GP is ranked first in NLLmin, followed by NOMU and DE, which share the second rank.

Table 8. Aggregate results for NOMU, GP, MCDO, DE, HDE and HDE* (our HDE, see Appendix B.2.2) for the set of all eleven 2D synthetic functions. Shown are the medians and a 95% bootstrap CI of AUC and NLLmin (without the constant ln(2π)/2) across all runs. Winners are marked in grey.

METHOD  AUC   [95%-CI]      NLLMIN  [95%-CI]
NOMU    0.42  [0.41, 0.42]  -0.99   [-1.01, -0.96]
GP      0.36  [0.36, 0.37]  -1.07   [-1.09, -1.05]
MCDO    0.45  [0.44, 0.45]  -0.61   [-0.62, -0.60]
DE      0.42  [0.41, 0.43]  -0.95   [-0.97, -0.93]
HDE     5.06  [4.75, 5.49]   1.74   [1.66, 1.84]
HDE*    6.33  [5.87, 6.75]   1.58   [1.51, 1.67]
B.2.4. TOY REGRESSION: DETAILED CHARACTERISTICS OF UNCERTAINTY BOUNDS

UB Characteristics in 1D Within the one-dimensional setting, the characteristics of NOMU's UBs can be nicely compared to those of the benchmark approaches. Figure 9 exemplarily shows typical outcomes of the UBs on a number of selected test functions, as obtained in one run.

Figure 9. Visualisations of resulting UBs for NOMU, GP, MCDO, DE, HDE and HDE* for a selection of test functions. Panels include (c) Squared (the GP would not behave reasonably for standard hyper-parameters on this instance, so we changed the range for the prior variance parameter κ to [10^-5, 4]) and (d) Forrester.

Deep Ensembles UBs: Throughout the experiment, we observe that deep ensembles, while sometimes nicely capturing increased uncertainty in regions of high second derivative (e.g., around the kink in Figure 9e, pursuing in some form desideratum D4), still at times lead to UBs of somewhat arbitrary shapes. This can be seen most prominently in Figure 9b around x ≈ 0.25, in Figure 9f around x ≈ 0.8, or in Figure 9d around x ≈ 0.2. Moreover, deep ensembles' UBs are very different in width, with no clear justification. This can be seen in Figure 9b when comparing the UBs for x < 0 against the UBs for x > 0, and at the edges of the input range in Figures 9a and 9f. In addition, we frequently observe that deep ensembles' UBs do not narrow at training points, as for instance depicted in Figure 9c for x < 0.5, in Figure 9f for x > 0.2, or in Figure 9d for x ∈ [-1, -0.7]; thus DE is not able to handle small or zero data noise well.

Hyper Deep Ensembles UBs: Throughout our experiments, HDE produces less accurate UBs than DE, which vary more from one run to another (recall that we consider here the setting of noiseless scarce regression). Possible reasons for this are:

1. The scarcity of training/validation data. HDE trains its NNs based on 80% of the training points and uses the remaining 20% to build an ensemble based on a score, whilst the other methods can use 100% of the training points for training. In a scarce data setting this implies, first, that the mean prediction of HDE does not fit through all the training points, and second, that the scoring rule is less reliable.

2. In a noiseless setting, one already knows that the $L^2$ regularization should be small, and thus optimizing this parameter is less useful here.

Therefore, we believe HDE is less well suited to reliably capture model uncertainty in a noiseless scarce regression setting. This manifests in Figure 9: HDE's uncertainty bounds often do not narrow at training points (Figures 9a-9c and 9e), while they suggest unreasonably overconfident mean predictions in some regions (Figures 9b and 9d).
While our HDE* (see Appendix B.2.2, Hyper Deep Ensembles Setup) manages to narrow better at training data, it tends to result more frequently in unnecessarily narrow bounds (Figures 9a, 9c and 9d). Overall, both HDE and HDE* vary a lot from run to run and thus tend to yield bounds of seemingly arbitrary shapes, where in each run the desiderata are captured in different regions.

MC Dropout UBs: MCDO consistently yields tube-like UBs that do not narrow at training points. Interestingly, we remark that throughout the experiment, MCDO samples frequently exhibit step-function-like shapes (see, e.g., Figure 9d at x ≈ 0.5 or Figure 9b for x ∈ [-0.5, 0]). This effect intensifies with increasing training time.

NOMU UBs: In contrast, NOMU displays the behaviour it is designed to show. Its UBs nicely tighten at training points and expand in between; thus NOMU fulfills D1-D3 across all considered synthetic test functions.

Gaussian Process UBs: The quality of the RBF-Gaussian process UBs (as judged in this simulated setting) naturally varies with the true underlying function. While the UBs nicely capture parts of the true function with low widths in Figures 9c and 9d, they have a hard time accurately enclosing true functions that are not as conformant with the prior belief induced by the choice of the RBF kernel (e.g., Figures 9b and 9e). Nonetheless, we also observe instances in which the training points are misleading for the GP's mean predictions, even when considering ground-truth functions for which this choice of kernel is very suitable. This manifests in over-confident mean predictions far away from the data-generating true function (Figure 9a) or over-swinging behavior of the fitted mean (Figure 9f). It is true that one could find better function-specific kernels. However, the RBF kernel is a good generic choice when evaluating across many different test functions without any specific prior information, which is why we chose this kernel for our comparison.

Figure 10. Contour plot of the 2D Styblinski test function with input training points.

Figure 11. Estimated model uncertainty $\hat{\sigma}_f$ of NOMU for a single run of the Styblinski test function.

UB Characteristics in 2D To visualize the UBs in 2D, we select as the ground truth the Styblinski test function depicted in Figure 10. In Figure 11 (NOMU) and Figure 12 (benchmarks), we visualize the estimated model uncertainty as obtained in one run for this Styblinski test function.

NOMU (Figure 11): As in 1D, we observe that NOMU's UBs nicely tighten at input training points and expand in between, with overall steady shapes. Specifically, NOMU's UBs are larger for extrapolation, e.g., on $[0,1] \times [0.5,1]$, than in regions which are surrounded by input training data points, e.g., $[-0.25, 0.25] \times [-0.25, 0.75]$, even though the distance to the closest input training point is approximately the same. Thus, NOMU's UBs do not only depend on the distance to the closest training point but rather on the arrangement of all surrounding training points.

Benchmarks (Figure 12): As in 1D, we see that the RBF-based GP's UBs have the expected smooth and isotropic shape with zero model uncertainty at input training points. Moreover, as in 1D, MCDO's UBs exhibit a tubular shape of approximately constant size (≈ 0.25) across the whole input space.
Whilst DE nicely captures zero model uncertainty at input training points, it again exhibits somewhat arbitrary behaviour in areas with few input training points. Both HDE and HDE* yield model uncertainty estimates that are small at input training points, except for HDE at one input training point $(x_1, x_2) \approx (0.6, 0.2)$, where the estimated model uncertainty is only small when training is continued on all training points (see HDE*). However, these estimates drastically fail to capture increased uncertainty in out-of-sample regions. For both algorithms, the model uncertainty estimate is unreasonably low in most regions of the input space and inexplicably high in a tiny fraction of the input space.

Figure 12. Estimated model uncertainty for a single run of the Styblinski test function: (a) GP with calibration constant c = 1; (b) MCDO with c = 10; (c) DE with c = 15; (d) HDE with c = 30; (e) HDE* with c = 30.

B.2.5. GENERATIVE TEST-BED: DETAILS

In this section, we give a detailed description of the generative test-bed setting and provide further results for up to 20-dimensional input.

Detailed Setting

1. We sample a test function (i.e., the ground-truth function) $f : [-1,1]^d \to \mathbb{R}$ from a BNN with i.i.d. Gaussian weights. This means we sample i.i.d. random parameters $\theta_i \sim \mathcal{N}(0, \sigma_{prior}^2)$ and set $f = NN_\theta$. The BNN is set up with three hidden layers with 2^10, 2^11 and 2^10 nodes, respectively. We choose $\sigma_{prior}$ such that $\mathbb{V}_x[NN_\theta(x) \mid \theta] \approx 1$ (this holds true for $x \sim \mathrm{Unif}([-1,1]^d)$ and might deviate slightly for Gaussian $x$). The resulting values of $\sigma_{prior}$ are shown in Table 9. A sketch of this sampling step follows after this list.

Table 9. $\sigma_{prior}$ depending on the input dimension $d$.

d         1      2      5      10     20
σ_prior   0.114  0.102  0.092  0.084  0.070

2. We sample $n^{train} = 8d$ input training points uniformly at random from the input space $[-1,1]^d$ for Tables 2 and 10. In the case of Table 11, we sample $n^{train} = 8d$ input training points from a $d$-dimensional centered normal distribution with a covariance matrix with $\min(5, d)$ eigenvalues of value 0.15, while the remaining eigenvalues only have the value 0.001. In both cases, we obtain for each input training point $x_i^{train}$ the corresponding noiseless output training point $y_i^{train} = f(x_i^{train}) = NN_\theta(x_i^{train})$, and we create $n^{test} = 100d$ test data points from the same distribution as the training data points. (Note that we always sample the artificial data points for NOMU uniformly from $[-1,1]^d$, since we assume that the low-dimensional manifold is often unknown in practice. NOMU could probably be improved by learning the distribution of the input data or by using prior information about it; it is good to see how robustly NOMU handles a different distribution of the input data points without the need to adapt the distribution of the artificial data points.)

3. We calculate UBs for all considered algorithms including NOMU. For this, we use for all algorithms the same configuration as in the 1D and 2D toy regression settings (see Section 4.1.1 and Appendix B.2.2), except for NOMU, where we set $\ell_{min} = 0.1$, $\ell_{max} = 1$ and use $l = 100d$ artificial data points, where $d$ denotes the respective dimension.

4. We measure the quality of the UBs via NLL for a fine grid of different values of the calibration parameter $c$.

We repeat these four steps $m_{seeds} = 200$ times. Then we choose for each method the value $c$ where the average NLL is minimal. Importantly, the chosen calibration constant $c$ only depends on the input dimension $d$ and on the algorithm, but not on the randomly chosen test function $f$.
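A minimal sketch of Step 1, sampling a ground-truth function from a BNN prior with i.i.d. Gaussian weights. The layer widths follow the setting above; treating biases like weights and the use of ReLU as in the other NN-based methods are our simplifying assumptions.

```python
import numpy as np

def sample_bnn_function(d, sigma_prior, widths=(2**10, 2**11, 2**10), seed=0):
    """Draw f = NN_theta with theta_i ~ N(0, sigma_prior^2) i.i.d. (Step 1)."""
    rng = np.random.default_rng(seed)
    dims = [d, *widths, 1]
    params = [(rng.normal(0.0, sigma_prior, (m, n)),
               rng.normal(0.0, sigma_prior, n))
              for m, n in zip(dims[:-1], dims[1:])]

    def f(x):                                   # x: (batch, d) in [-1, 1]^d
        h = np.asarray(x, dtype=float)
        for i, (w, b) in enumerate(params):
            h = h @ w + b
            if i < len(params) - 1:             # ReLU on hidden layers
                h = np.maximum(h, 0.0)
        return h.ravel()
    return f

# Steps 1-2 for d = 2: ground truth and noiseless training data.
d = 2
f = sample_bnn_function(d, sigma_prior=0.102)   # sigma_prior from Table 9
x_train = np.random.default_rng(1).uniform(-1, 1, (8 * d, d))
y_train = f(x_train)
```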
In Table 10 and Table 11, we report the mean and a 95% CI over these 200 runs for the chosen calibration constant $c$.

Table 10. Uniform data generation. Shown are average NLL (without the constant ln(2π)/2) and a 95% CI over 200 BNN samples.

FUNCTION  NOMU          GP            MCDO          DE            HDE          TUBE
BNN1D     -1.65 ± 0.10  -1.08 ± 0.22  -0.34 ± 0.23  -0.38 ± 0.36  8.47 ± 1.00  -0.86 ± 0.19
BNN2D     -1.16 ± 0.05  -0.52 ± 0.11  -0.33 ± 0.13  -0.77 ± 0.07  9.10 ± 0.39  -0.81 ± 0.06
BNN5D     -0.37 ± 0.02  -0.33 ± 0.02  -0.05 ± 0.04  -0.13 ± 0.03  8.41 ± 1.00  -0.33 ± 0.02
BNN10D    -0.09 ± 0.01  -0.11 ± 0.01   0.06 ± 0.02   0.10 ± 0.01  6.44 ± 1.36  -0.12 ± 0.01
BNN20D     0.15 ± 0.01   0.06 ± 0.01   0.18 ± 0.01   0.30 ± 0.01  3.58 ± 0.81   0.07 ± 0.01

Discussion of the Results in Table 10 We see that for $d \le 5$, NOMU significantly outperforms all other considered benchmarks. For dimensions $d \ge 10$, it gets harder to find a good metric for measuring the quality of the uncertainty bounds: in high dimensions, almost no test data point sampled uniformly from $[-1,1]^d$ is close to the training data points, so they all have quite high uncertainty. We further verified this for dimensions $d = 10$ and $d = 20$ by introducing another algorithm, which we call TUBE. TUBE uses NOMU's mean prediction and assigns the same (calibrated) constant uncertainty $c$ to all test data points. As we can see in Table 10, TUBE is ranked first (on par with GP) in $d = 10$ and $d = 20$, highlighting that the NLL metric is flawed in this setting for dimensions $d \ge 10$. However, for dimensions $d \le 5$, the trivial TUBE algorithm is significantly outperformed by other methods. Thus, for dimensions $d \ge 10$ and uniformly distributed data on $[-1,1]^d$, the NLL metric mainly measures whether D3 (Out-of-Sample) is reliably fulfilled, which is particularly reliably the case for TUBE and for GP. For BO, however, especially D2 (In-Sample) is important in order not to get stuck in local optima, because in BO the test data points are not uniformly distributed and the predicted uncertainty close to the training points is particularly important. Therefore, in this setting, we only include results for input dimensions 1-5D (see Table 2) in the main part of the paper.

Discussion of the Results in Table 11 Another approach to circumvent the problem that the NLL metric loses its relevance in scenarios where all input test data points are far away from the input training data points is to use a different distribution for $X$. The well-known manifold hypothesis (Cayton, 2005) states that in typical real-world data sets, relevant input data points lie close to a low-dimensional manifold. Therefore, we run a further experiment where we sample our training and test data from a Gaussian distribution that concentrates mainly close to a $\min(d, 5)$-dimensional flat manifold, as described in Item 2, and report the results in Table 11.
For this experiment, we see that dimensions $d \ge 10$ are interesting too, since TUBE is significantly beaten by NOMU and GP. In Table 11, one can see that NOMU is among the best methods for all considered dimensions.

Table 11. 5D-Gaussian data generation. Shown are average NLL (without the constant ln(2π)/2) and a 95% CI over 200 BNN samples.

FUNCTION  NOMU          GP            MCDO          DE            HDE          TUBE
BNN1D     -1.91 ± 0.13  -1.13 ± 0.27  -0.49 ± 0.16  -0.27 ± 0.54  8.95 ± 1.18  -0.85 ± 0.13
BNN2D     -1.55 ± 0.05  -0.85 ± 0.12  -0.62 ± 0.13  -1.15 ± 0.08  7.92 ± 0.42  -0.95 ± 0.11
BNN5D     -0.77 ± 0.02  -0.72 ± 0.02  -0.47 ± 0.03  -0.55 ± 0.03  9.68 ± 1.08  -0.68 ± 0.02
BNN10D    -1.36 ± 0.01  -1.31 ± 0.01  -1.08 ± 0.03  -1.14 ± 0.02  5.16 ± 1.55  -1.25 ± 0.01
BNN20D    -1.70 ± 0.01  -1.72 ± 0.01  -1.52 ± 0.02  -1.53 ± 0.01  0.99 ± 0.39  -1.68 ± 0.01

Our Generative Test-Bed vs. Osband et al. (2021) Osband et al. (2021) measure the Kullback-Leibler divergence between the posterior of any method and the posterior of a shallow GP with a fixed (deep) neural tangent kernel (NTK). Thus, they evaluate methods by their similarity to an NTK-GP posterior. Because of the fixed kernel, the posterior of the NTK-GP does not fulfill D4 (Metric Learning) at all. This implies that their evaluation metric does not reward D4 (Metric Learning) at all. However, we think that D4 (Metric Learning) is very important for real-world deep learning applications (see Footnote 4 and Appendix D.4). The posterior of a finite-width BNN fulfills D4 (Metric Learning); therefore, approximating such a posterior is more desirable. However, in contrast to the NTK-GP posterior in (Osband et al., 2021), there is no closed formula for a finite-width BNN posterior. Thus, at first sight, one cannot straightforwardly evaluate methods based on the Kullback-Leibler divergence between their posterior and this intractable BNN posterior as in Osband et al. (2021). Nonetheless, in the following, we prove in Theorem B.7 that the metric we use in Tables 2, 10 and 11 indeed converges, up to a constant, to the average Kullback-Leibler divergence $\bar{d}_{KL}$ to the posterior of a finite-width BNN as $m_{seeds}$ tends to infinity. (If we are just interested in the relative performance of different methods compared to each other, the constant does not matter.)

First, we define the average Kullback-Leibler divergence $\bar{d}_{KL}$ and introduce some notation.

Definition B.6 (AVERAGE KULLBACK-LEIBLER DIVERGENCE) Let $D^{train}$ be a finite set of training points and consider a prior distribution $P[f]$ on the function space $\{f : X \to Y\}$ (in our generative test-bed, the prior distribution is a BNN prior with i.i.d. centered Gaussian weights) and the corresponding posterior $P[f \mid D^{train}]$ on the function space. Then the marginal of the posterior, $P[f(x) \mid D^{train}, x]$, is a measure on $\mathbb{R}$ for every given input data point $x \in X$ and every given training data set $D^{train}$. Let $Q[\cdot \mid D^{train}, x]$ also be a measure on $\mathbb{R}$ for every given input data point $x \in X$ and every given training data set $D^{train}$ (in our context, $Q[\cdot \mid D^{train}, x]$ can be the approximation of the marginal posterior at $x \in X$ given training data $D^{train}$ obtained from any method such as NOMU, GP, MCDO, DE, HDE, etc.). The average Kullback-Leibler divergence is then defined as

$\bar{d}_{KL} := \mathbb{E}_{D^{train}, x}\left[ d_{KL}\left( P[f(x) \mid D^{train}, x] \,\|\, Q[\cdot \mid D^{train}, x] \right) \right]$,

where the expectation is taken over $x$ and $D^{train}$ according to $P$, and $d_{KL}$ is the classical Kullback-Leibler divergence between two probability measures on $\mathbb{R}$. This is equivalent to Equation (1) from (Osband et al., 2021).
Theorem B.7 Using the notation from Definition B.6, let $q(\cdot \mid D^{train}, x)$ be the density of $Q[\cdot \mid D^{train}, x]$ on $\mathbb{R}$, and let $(f_j(x_j), D^{train}_j, x_j)_{j \in \{1, \dots, m_{seeds}\}}$ be i.i.d. samples of $P[(f(x), D^{train}, x)]$. Then the average NLL

$\lim_{m_{seeds} \to \infty} \frac{1}{m_{seeds}} \sum_{j=1}^{m_{seeds}} -\log q(f_j(x_j) \mid D^{train}_j, x_j) = \bar{d}_{KL} + C_P$

converges in probability to $\bar{d}_{KL}$ up to an additive constant $C_P$, where $C_P$ only depends on $P$ and not on $Q$. ($P$ does not depend on the chosen method; only $Q$, and accordingly $q$, differs among the methods. Thus, $C_P$ does not change the ranking among different methods.)

(We formulate the theorem for the hardest case of $n^{test} = 1$. The convergence in probability of $\lim_{m_{seeds} \to \infty} \frac{1}{m_{seeds} \, n^{test}} \sum_j \sum_i -\log q(f_j(x_{j,i}) \mid D^{train}_j, x_{j,i}) = \bar{d}_{KL} + C_P$ is obviously even faster for $n^{test} > 1$. For the proof, it is only important that $m_{seeds} \to \infty$. If one formulates the theorem for general $n^{test} > 1$, one has to state that the $(D^{train}_j)_{j \in \{1, \dots, m_{seeds}\}}$ are i.i.d., while $(f_j(x_{j,i}), x_{j,i})_{i \in \{1, \dots, n^{test}\}}$ are only conditionally independent given $D^{train}_j$, which would make the presentation unnecessarily technical.)

Proof. Let $p$ be the density of $P[(f(x), D^{train}, x)]$ and $p(\cdot \mid D^{train}, x)$ the density on $\mathbb{R}$ of the marginal posterior $P[f(x) \mid D^{train}, x]$. Further, we write $p(D^{train}, x)$ for the density of the marginal $P[(D^{train}, x)]$ evaluated at $(D^{train}, x)$. Since $\lim_{m_{seeds} \to \infty} \frac{1}{m_{seeds}} \sum_{j=1}^{m_{seeds}} -\log q(f_j(x_j) \mid D^{train}_j, x_j)$ can be seen as a Monte Carlo approximation, it converges in probability to

$\mathbb{E}_{f(x), D^{train}, x}\left[ -\log q(f(x) \mid D^{train}, x) \right] = \int -\log\left( q(f(x) \mid D^{train}, x) \right) p(f(x), D^{train}, x) \, d(f(x), D^{train}, x)$
$= \int -\log\left( q(f(x) \mid D^{train}, x) \right) p(f(x) \mid D^{train}, x) \, p(D^{train}, x) \, d(f(x), D^{train}, x)$.

By Fubini, this is equal to

$\int \left( \int -\log\left( q(f(x) \mid D^{train}, x) \right) p(f(x) \mid D^{train}, x) \, d(f(x)) \right) p(D^{train}, x) \, d(D^{train}, x) = \mathbb{E}_{D^{train}, x}\Big[ \underbrace{\textstyle\int -\log\left( q(f(x) \mid D^{train}, x) \right) p(f(x) \mid D^{train}, x) \, d(f(x))}_{H\left( P[f(x) \mid D^{train}, x],\ Q[\cdot \mid D^{train}, x] \right)} \Big]$,

where $H$ is the cross-entropy. So this is further equal to

$\mathbb{E}_{D^{train}, x}\left[ d_{KL}\left( P[f(x) \mid D^{train}, x] \,\|\, Q[\cdot \mid D^{train}, x] \right) \right] + \mathbb{E}_{D^{train}, x}\left[ H\left( P[f(x) \mid D^{train}, x] \right) \right] = \bar{d}_{KL} + C_P$,

where $C_P = \mathbb{E}_{D^{train}, x}\left[ H\left( P[f(x) \mid D^{train}, x] \right) \right]$ only depends on $P$ and not on $Q$ or $q$.

To apply Theorem B.7 to the metrics reported in Tables 2, 10 and 11, one sets $q(\cdot \mid D^{train}, x)$ to be the density corresponding to the already calibrated uncertainty at $x$ obtained from any method trained on $D^{train}$; for example, let $c$ be a fitted calibration parameter, and $\hat{f}$ and $\hat{\sigma}$ the fitted model and model uncertainty prediction of NOMU, then $q(\cdot \mid D^{train}, x) := \mathcal{N}(\cdot\,; \hat{f}(x), (c\hat{\sigma}(x))^2)$ for $x \in X$. (In theory, more general posteriors than Gaussians could be used too, but within this paper we always assume Gaussian marginals of the posterior, as all the considered benchmarks also output Gaussian approximations of the marginal posteriors.) In our setting, we made sure that the correct calibration constant $c$ is chosen in the limit $m_{seeds} \to \infty$, since we chose one fixed value of $c$ per dimension and method and do not overfit on specific seeds.

B.2.6. SOLAR IRRADIANCE TIME SERIES: DETAILS

Configuration Details We use for all algorithms the same configuration as in the 1D and 2D toy regression settings (see Section 4.1.1 and Appendix B.2.2), except that we now train all algorithms for 2^14 epochs and set as $L^2$-regularization $\lambda = 10^{-19}$ for NOMU, $\lambda = 10^{-19}/n^{train}$ for DE, and $\lambda = (1-0.2) \cdot 10^{-19}/n^{train}$ for MCDO. For HDE, we accordingly draw the $L^2$-regularization parameter log-uniformly from $(10^{-19} \cdot 10^{-3}, 10^{-19} \cdot 10^{3})$ (which affects the $L^2$-regularization parameter of HDE* in the same way as in Appendix B.2.2).

Results Figure 13 visualizes the UBs from NOMU and the benchmark algorithms.
As becomes evident, NOMU manages to best fit the training data while keeping model uncertainty in between training points. (Gal & Ghahramani (2016) consider the same data set to compare UBs of MC dropout and GPs; in their work, however, NNs are trained for 10^6 epochs, possibly explaining why MC dropout fits the training data more nicely in their case.) The corresponding metrics for this specific run are given in Table 12.

Figure 13. Visualization of each algorithm's model prediction (solid lines) and UBs (shaded areas) on the solar irradiance data set. Training and test points are shown as black dots and red crosses, respectively. We present UBs obtained by NOMU (c=2) compared to the benchmarks MC dropout (MCDO) (c=5), deep ensembles (DE) (c=5), Gaussian process (GP) (c=1) and hyper deep ensembles HDE (c=15) and HDE* (c=40). Additionally, NOMU's estimated and scaled model uncertainty $2\hat{\sigma}_f$ is shown as a dotted line.

Table 12. Metrics for the solar irradiance time series.

METHOD  AUC   NLLMIN
NOMU    0.32  1.44
GP      0.46  1.24
MCDO    0.30  1.11
DE      0.35  1.28
HDE     0.47  0.98
HDE*    0.59  1.08

B.2.7. UCI DATA SETS: DETAILS

To get a feeling for how well NOMU (without data noise extensions or hyperparameter tuning) performs in noisy regression with high-dimensional input, we test our algorithm on the popular UCI benchmark (Hernandez-Lobato & Adams, 2015) and the UCI gap data sets introduced in (Foong et al., 2019). The UCI gap data sets were designed to capture whether an algorithm's uncertainty estimate increases in between separated clusters of observations, where epistemic uncertainty should be higher compared to regions with many data points. This is also required by D3. However, in Foong et al. (2019), only those data points that had been removed to create gaps in the training data are used as test data for calculating NLL. Thus, NLL on UCI gap data fails by construction to capture the uncertainty quality outside of the gap: First, D2 is not properly measured by this evaluation, since there are not many points in the test data set that are close to training points. Second, D3 is also not properly measured, since the smaller gaps in between data points outside of the gap are not part of the test data set. Note that D3 also concerns these kinds of gaps. To provide a better evaluation of uncertainty, future work should focus on mixed test data sets (moving some input points outside of the gap from the training set to the test set), similar to our train/test split from the experimental setting in Appendix B.2.6. Nonetheless, we test NOMU on the UCI and UCI gap data sets using the same experimental setup as the respective works.

NOMU Setup In line with the literature, NOMU's main and side architectures were chosen as a single-hidden-layer NN with 50 hidden nodes (except for the large protein data set, where the number of hidden nodes was 100).
NOMU was trained for 400 (UCI) and 40 (UCI gap) epochs, respectively, with $L^2$-regularization 1e-09 and 1e-04 on the main and side architectures, respectively, using a batch size of 100 in Adam stochastic gradient descent with learning rate 0.01. NOMU used $l = 100$ artificial data points randomly sampled on each batch and was refit on validation data after the constant $c$ had been calibrated on these. For the standard UCI data sets, we used the same loss parameters as in the remaining regression experiments, namely $\pi_{exp} = 0.01$, $\pi_{sqr} = 0.1$, and $c_{exp} = 30$. For the UCI gap data sets, where uncertainty is only evaluated in gap regions, we chose $\pi_{exp} = 0.1$ and $\pi_{sqr} = 0.01$, relaxing the requirement of small uncertainty at observed data points and strengthening the pull upward on our raw uncertainty output.

Moreover, for numerical stability we use the following slight adaptation of the NOMU loss from Definition 3.3:

$L^\pi_{stable}(NN_\theta) := \sum_{i=1}^{n^{train}} (\hat{f}(x_i^{train}) - y_i^{train})^2 + \pi_{sqr} \underbrace{\sum_{i=1}^{n^{train}} \rho_{stable}(\hat{r}_f(x_i^{train}))}_{\text{stable version of (b)}} + \pi_{exp} \underbrace{\frac{1}{\lambda^d(X)} \int_X u_{stable}(\hat{r}_f(x)) \, dx}_{\text{stable version of (c)}}$,

where

$\rho_{stable}(r) = \begin{cases} r^2, & |r| \le 1 \\ 2|r| - 1, & |r| > 1 \end{cases}$   (54)

is the Huber loss and

$u_{stable}(r) = \begin{cases} u(r) = e^{-c_{exp} r}, & r \ge 0 \\ 1 - c_{exp} r, & r < 0. \end{cases}$   (55)

This stable loss $L^\pi_{stable}$ behaves exactly like the standard NOMU loss $L^\pi$ as long as the outputs of the NN stay in a reasonable range. Only if gradient descent becomes unstable, such that the NN outputs very extreme values, does this stabilized loss ensure a bounded gradient of $L^\pi_{stable}$ with respect to the outputs of the NN $NN_\theta$.

Results UCI Table 3 in the main paper reports NLL on test data averaged over 20 runs (as in Hernandez-Lobato & Adams (2015)) for the UCI data sets. It includes the following benchmark models: NLM-HPO and NLM correspond to BN(BO)-1 NL in Section D.1 and Section D.2, respectively, of (Ober & Rasmussen, 2019); LL corresponds to LL 1HL TANH in (Foong et al., 2019); MCDO2 represents Dropout (Convergence) in (Mukhoti et al., 2018); MCDO corresponds to Dropout in (Gal & Ghahramani, 2016); and DE to Deep Ensembles in (Lakshminarayanan et al., 2017). We used the standard deviations reported in these respective works to derive 95% normal-CIs.

Results UCI Gap Table 13 lists Gaussian 95%-confidence intervals for NLL on the UCI gap data sets for NOMU as well as the values reported for the benchmarks NLM, NLM-HPO and LL (Linearized Laplace) as listed above. These values result from different runs where, in each run, the gap was introduced in a different input dimension (see Foong et al. (2019)). We see that NOMU performs comparably, and we confirm the observation that LL tends to best capture uncertainty in the large gaps artificially introduced in the training data.

Table 13. UCI-GAP average NLL and a 95% normal-CI.

DATASET   NOMU          LL            NLM-HPO       NLM
BOSTON    2.80 ± 0.11   2.79 ± 0.09   2.81 ± 0.15   3.60 ± 0.21
CONCRETE  3.38 ± 0.08   3.53 ± 0.13   3.72 ± 0.11   3.89 ± 0.21
ENERGY    3.71 ± 1.62   6.49 ± 5.50   3.78 ± 2.27   3.40 ± 1.99
KIN8NM    -0.97 ± 0.02  -1.14 ± 0.03  -1.06 ± 0.03  -1.04 ± 0.03
NAVAL     31.81 ± 28.0  15.66 ± 8.34  1.17 ± 1.56   1.48 ± 1.64
CCPP      2.91 ± 0.04   2.89 ± 0.03   2.96 ± 0.09   2.90 ± 0.05
PROTEIN   3.21 ± 0.10   3.09 ± 0.05   3.22 ± 0.09   3.35 ± 0.11
WINE      0.99 ± 0.02   0.96 ± 0.01   0.98 ± 0.01   1.75 ± 0.07
YACHT     2.15 ± 0.30   1.33 ± 0.29   1.72 ± 0.33   1.44 ± 0.21
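A minimal TensorFlow sketch of the stabilized loss components from Equations (54) and (55). Tensor shapes, the framework choice, and the overflow guard are our own assumptions.

```python
import tensorflow as tf

def rho_stable(r):
    # Eq. (54): Huber loss, quadratic for |r| <= 1, linear beyond.
    return tf.where(tf.abs(r) <= 1.0, r**2, 2.0 * tf.abs(r) - 1.0)

def u_stable(r, c_exp=30.0):
    # Eq. (55): e^{-c_exp r} for r >= 0, continued linearly as 1 - c_exp r
    # for r < 0, so the gradient w.r.t. the NN output stays bounded.
    r_pos = tf.maximum(r, 0.0)   # avoid overflow in the unselected exp branch
    return tf.where(r >= 0.0, tf.exp(-c_exp * r_pos), 1.0 - c_exp * r)
```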
Results UCI. Table 3 in the main paper reports NLL on test data averaged over 20 runs (as in Hernández-Lobato & Adams (2015)) for the UCI data sets. It includes the following benchmark models: NLM-HPO and NLM correspond to BN(BO)-1 NL in Sections D.1 and D.2 of Ober & Rasmussen (2019), respectively; LL corresponds to LL 1HL TANH in Foong et al. (2019); MCDO2 represents Dropout (Convergence) in Mukhoti et al. (2018); MCDO corresponds to Dropout in Gal & Ghahramani (2016); and DE to Deep Ensembles in Lakshminarayanan et al. (2017). We used the standard deviations reported in these respective works to derive 95% normal-CIs.

Results UCI Gap. Table 13 lists Gaussian 95%-confidence intervals for NLL on the UCI gap data sets for NOMU, as well as the values reported for the benchmarks NLM, NLM-HPO and LL (Linearized Laplace) as listed above. These values result from different runs, where in each run the gaps were introduced in a different input dimension (see Foong et al. (2019)). We see that NOMU performs comparably, and we confirm the observation that LL tends to best capture uncertainty in the large gaps artificially introduced in the training data.

B.3. Bayesian Optimization

B.3.1. BO: CONFIGURATION DETAILS

In the following, we describe the detailed hyperparameter setup of all algorithms.

NOMU Setup. For NOMU, we set $\pi_{sqr} = 1$, $\ell_{min} = 10^{-6}$ and use l = 500 artificial input points for 5D, 10D and 20D. Otherwise, we use the exact same hyperparameters as in 1D and 2D regression (see the paragraph "Algorithm setup" in Section 4.1).

Deep Ensembles Setup. For deep ensembles, we use the exact same hyperparameters as in 1D and 2D regression (see Appendix B.2.2).

MC Dropout Setup. For MC dropout, due to computational constraints, we were only able to use 10 stochastic forward passes (instead of 100 in the regression setting). However, this still results in an increase in compute time by a factor of 10 for every single acquisition function evaluation compared to NOMU. Otherwise, we use the exact same hyperparameters as in 1D and 2D regression (see Appendix B.2.2).

Gaussian Process Setup. For Gaussian processes (pGP and GP), we use the exact same setup as in 1D and 2D regression (see Appendix B.2.2).

Hyper Deep Ensembles Setup. For hyper deep ensembles, due to computational constraints, we were only able to use κ = 20 (instead of κ = 50 in the regression setting). However, this still results in an increase of training time by a factor of 5 in each single BO step compared to NOMU. Otherwise, we use the exact same hyperparameters as in the 1D and 2D regression setting (see Appendix B.2.2).

Acquisition Function Optimization. For each algorithm and each BO step, we determine the next BO input point by maximizing the corresponding upper-bound acquisition function, i.e., $\hat f(x) + c\,\hat\sigma_f(x)$. We maximize this upper-bound acquisition function using the established DIRECT (DIviding RECTangles) algorithm.
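For concreteness, the following sketch shows one way to implement this maximization with the DIRECT implementation available in SciPy (scipy.optimize.direct, SciPy >= 1.9); the paper does not specify which DIRECT implementation was used, and the function names are ours.

```python
from scipy.optimize import direct  # DIRECT (DIviding RECTangles) optimizer

def next_bo_point(f_hat, sigma_f_hat, bounds, c=1.0):
    """Maximize the upper-bound acquisition f_hat(x) + c * sigma_f_hat(x).
    DIRECT minimizes, hence the sign flip. f_hat and sigma_f_hat map a
    1D numpy array to a float; bounds is a list of (low, high) pairs."""
    result = direct(lambda x: -(f_hat(x) + c * sigma_f_hat(x)), bounds)
    return result.x
```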
Gaussian BNN Test Functions. Note that for the Gaussian BNN test functions in BO (see Table 4 and Table 14, respectively), we do not have access to the ground-truth global maximum. Thus, to determine the final regret, we also use the established DIRECT (Dividing Rectangles) algorithm. With DIRECT, we calculate a point $x_{direct}$ s.t. $f(x_{direct}) \approx \max_{x \in X} f(x)$ and report $|f(x_{direct}) - \max_{i \in \{1,\dots,72\}} f(x_i)|\,/\,f(x_{direct})$ in Table 4 and Table 14, respectively.

B.3.2. BO: CALIBRATION

MW SCALING. In general, having a good understanding of the prior scale is very hard. Nevertheless, it is often easier to have at least some intuition about the order of magnitude the posterior UBs should have on average. When function values approximately range in [-1, 1], it is reasonable to require that, after observing the 8 initial points, the mean width (MW) of the UBs lies between the orders of magnitude 0.05 and 0.5; hence our motivation for choosing the calibration parameter c accordingly. An advantage of such a calibration is that it can be applied to every method equally, whereas there is in general no clear notion of setting the prior scale of two different methods (e.g., MC dropout and deep ensembles) to the same value. Note that we only use MW to calibrate c directly after the mean and variance predictions have been fit on the 8 initial data points; MW is not fixed again when further data points are observed in subsequent BO steps.

DYNAMIC C. The initial choice of c can still be corrected in each BO step. Certainly, in the noiseless case it does not make sense to pick an input point $x_{i'}$ that is identical to an already observed input point $x_i$, $i < i'$, where nothing new is to be learned. Therefore, we want our NN-agent to get "bored" if its acquisition function optimization would suggest picking an input point $x_{i'} \approx x_i$, $i < i'$. The idea of DYNAMIC C is to encourage the agent, in case it was "bored", to become more curious and explore something new instead of picking a boring input point. This can be achieved by iteratively increasing c until the acquisition function optimization suggests an input point $x_{i'} \not\approx x_i$, $i < i'$; we then apply the expensive function evaluation only to this $x_{i'}$, obtained after increasing c enough to not bore the NN. However, towards the last BO steps, if we already know approximately where a global optimum is and only want to fix the last digits of the optimizer, we have to evaluate the function closer to already observed points. In contrast, a very young NN (i.e., a network that has been trained on few training points) should get bored much more easily, since it is not sensible to exploit a given local optimum up to the last digits when there is plenty of time to reach other, possibly better local optima. Thus, it makes sense to first explore on a large scale where the good local optima approximately are, then to find out which of them are the best ones, and finally to exploit the best one in greater detail at the very end. Therefore, we want the threshold $\delta_{i'}$, which determines whether $x_{i'}$ is too close to an already observed point (i.e., whether $\|x_{i'} - x_i\|_2 \le \delta_{i'}$ for some $i < i'$), to decrease in each BO step. In our experiments, we choose an exponential decay of

$$\delta_{i'} = \delta_{n_{start}} \left(\frac{\delta_{n_{end}}}{\delta_{n_{start}}}\right)^{(i' - n_{start})/(n_{end} - n_{start})}, \qquad (56)$$

with $\delta_{n_{start}} = \frac{1}{16}$ and $\delta_{n_{end}} = 0.01$. Concretely, we only evaluate f at $x_{i'}$ if $\|x_{i'} - x_i\|_2 > \delta_{i'}$ for all $i < i'$. Otherwise, we double c until this is fulfilled (with larger c, more emphasis is put on exploration, so $x_{i'}$ tends to be further away from the observed input points the larger we pick c, provided D3 (Out-of-Sample) is fulfilled). After doubling c 15 times without success, we evaluate f at $x_{i'}$ no matter how close it is to the already observed input points (for methods that have severe trouble fulfilling D3 (Out-of-Sample), such as MCDO, even doubling c infinitely often would not help if the maximal uncertainty lies within a $\delta_{i'}$-ball around an already observed input point).
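The following minimal sketch summarizes the DYNAMIC C rule as we read it from the description above (function and variable names are ours; acq_argmax stands for the DIRECT maximization of $\hat f + c\,\hat\sigma_f$ sketched earlier):

```python
import numpy as np

def dynamic_c_query(acq_argmax, X_observed, c, delta, max_doublings=15):
    """Double c until the acquisition maximizer is more than delta away
    (Euclidean norm) from every already observed input; after
    max_doublings unsuccessful doublings, accept the point anyway.
    acq_argmax: callable c -> argmax_x (f_hat(x) + c * sigma_f_hat(x));
    X_observed: array of shape (n, d) of already evaluated inputs."""
    for _ in range(max_doublings):
        x_next = acq_argmax(c)
        if np.min(np.linalg.norm(X_observed - x_next, axis=1)) > delta:
            return x_next, c
        c *= 2.0  # the agent is "bored": put more weight on exploration
    return acq_argmax(c), c
```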
Table 14. Results for 15 different BO tasks. Shown are the average final regrets and a 95% normal-CI per dimension and for each individual function over 100 (5D) and 50 (10D and 20D) runs, for an initial mean width (MW) of 0.05 and 0.5, respectively. Winners are marked in grey.

FUNCTION | NOMU MW 0.05 | NOMU MW 0.5 | GP MW 0.05 | GP MW 0.5 | MCDO MW 0.05 | MCDO MW 0.5 | DE MW 0.05 | DE MW 0.5 | HDE MW 0.05 | HDE MW 0.5 | PGP MW 0.05 | RAND
LEVY5D | 2.12e-3 ± 5.95e-4 | 1.52e-3 ± 4.34e-4 | 1.10e-3 ± 1.68e-4 | 8.84e-3 ± 1.15e-3 | 1.98e-2 ± 3.59e-3 | 1.27e-2 ± 2.57e-3 | 7.09e-3 ± 3.76e-3 | 9.09e-2 ± 9.16e-3 | 4.48e-3 ± 1.25e-3 | 3.76e-3 ± 1.22e-3 | 6.25e-3 ± 5.26e-4 | 5.44e-2 ± 4.22e-3
ROSENBROCK5D | 1.75e-4 ± 2.76e-5 | 3.57e-4 ± 6.39e-5 | 3.62e-4 ± 6.52e-5 | 8.44e-4 ± 1.62e-4 | 2.83e-4 ± 4.64e-5 | 7.35e-4 ± 1.08e-4 | 7.93e-4 ± 7.74e-4 | 5.96e-3 ± 1.78e-3 | 1.11e-3 ± 7.25e-4 | 5.02e-4 ± 1.04e-4 | 7.56e-4 ± 8.13e-5 | 4.73e-3 ± 7.04e-4
G-FUNCTION5D | 1.12e-1 ± 1.99e-2 | 1.06e-1 ± 1.97e-2 | 1.68e-1 ± 1.62e-2 | 2.60e-1 ± 1.72e-2 | 2.66e-2 ± 8.57e-3 | 7.80e-2 ± 8.82e-3 | 1.88e-1 ± 2.08e-2 | 2.03e-1 ± 2.20e-2 | 1.35e-1 ± 2.46e-2 | 1.37e-1 ± 2.59e-2 | 1.78e-1 ± 1.22e-2 | 3.63e-1 ± 1.09e-2
PERM5D | 4.06e-4 ± 1.70e-4 | 2.58e-4 ± 1.13e-4 | 7.76e-5 ± 3.19e-5 | 1.80e-3 ± 4.44e-4 | 7.79e-5 ± 4.19e-5 | 7.11e-5 ± 1.45e-5 | 6.96e-4 ± 2.20e-4 | 1.73e-3 ± 4.55e-4 | 2.74e-3 ± 5.20e-4 | 2.50e-3 ± 5.58e-4 | 2.06e-4 ± 7.03e-5 | 4.62e-4 ± 1.20e-4
BNN5D | 5.39e-2 ± 3.34e-2 | 3.56e-2 ± 1.87e-2 | 8.24e-2 ± 4.09e-2 | 4.41e-2 ± 3.04e-2 | 1.88e-1 ± 5.37e-2 | 1.51e-1 ± 5.12e-2 | 6.27e-2 ± 5.28e-2 | 6.32e-2 ± 3.61e-2 | 2.45e-1 ± 7.66e-2 | 2.13e-1 ± 6.85e-2 | 2.15e-2 ± 6.37e-3 | 5.42e-1 ± 9.21e-2
Average Regret 5D | 3.38e-2 ± 7.77e-3 | 2.87e-2 ± 5.42e-3 | 5.03e-2 ± 8.80e-3 | 6.31e-2 ± 6.99e-3 | 4.70e-2 ± 1.09e-2 | 4.86e-2 ± 1.04e-2 | 5.18e-2 ± 1.14e-2 | 7.30e-2 ± 8.66e-3 | 7.75e-2 ± 1.61e-2 | 7.13e-2 ± 1.47e-2 | 4.14e-2 ± 2.75e-3 | 1.93e-1 ± 1.86e-2
LEVY10D | 6.27e-3 ± 2.22e-3 | 6.27e-3 ± 2.22e-3 | 1.04e-2 ± 2.10e-3 | 1.04e-2 ± 2.10e-3 | 2.16e-2 ± 3.71e-3 | 2.17e-2 ± 5.30e-3 | 8.65e-2 ± 2.01e-2 | 1.46e-1 ± 1.41e-2 | 6.21e-3 ± 2.48e-3 | 5.81e-3 ± 1.99e-3 | 6.16e-3 ± 8.94e-4 | 1.06e-1 ± 7.64e-3
ROSENBROCK10D | 2.34e-3 ± 1.43e-3 | 2.01e-3 ± 6.91e-4 | 9.14e-4 ± 1.25e-4 | 5.67e-3 ± 1.60e-3 | 2.25e-3 ± 2.58e-4 | 4.96e-3 ± 4.40e-4 | 1.42e-2 ± 4.43e-3 | 7.65e-2 ± 1.03e-2 | 5.10e-3 ± 1.44e-3 | 5.00e-3 ± 1.70e-3 | 3.09e-3 ± 5.77e-4 | 2.82e-2 ± 3.92e-3
G-FUNCTION10D | 2.92e-1 ± 3.85e-2 | 3.53e-1 ± 2.74e-2 | 3.79e-1 ± 2.24e-2 | 4.21e-1 ± 2.35e-2 | 1.79e-1 ± 2.54e-2 | 4.15e-1 ± 1.12e-2 | 3.24e-1 ± 2.28e-2 | 3.33e-1 ± 2.06e-2 | 2.53e-1 ± 3.76e-2 | 2.55e-1 ± 3.72e-2 | 3.80e-1 ± 1.34e-2 | 4.50e-1 ± 6.48e-3
PERM10D | 3.74e-4 ± 1.26e-4 | 4.27e-4 ± 1.34e-4 | 3.59e-4 ± 2.13e-4 | 6.52e-4 ± 2.04e-4 | 3.37e-4 ± 8.49e-5 | 5.53e-4 ± 1.33e-4 | 1.01e-3 ± 3.10e-4 | 1.48e-3 ± 3.92e-4 | 4.07e-4 ± 1.21e-4 | 4.08e-4 ± 1.48e-4 | 5.57e-4 ± 1.43e-4 | 1.81e-4 ± 5.44e-5
BNN10D | 1.19e-1 ± 3.71e-2 | 9.72e-2 ± 3.40e-2 | 1.92e-1 ± 5.15e-2 | 1.64e-1 ± 4.10e-2 | 1.45e-1 ± 3.58e-2 | 1.58e-1 ± 2.86e-2 | 1.49e-1 ± 4.23e-2 | 1.10e-1 ± 3.90e-2 | 2.57e-1 ± 4.99e-2 | 2.00e-1 ± 3.75e-2 | 8.38e-2 ± 2.25e-2 | 5.92e-1 ± 5.34e-2
Average Regret 10D | 8.40e-2 ± 1.07e-2 | 9.18e-2 ± 8.74e-3 | 1.17e-1 ± 1.12e-2 | 1.21e-1 ± 9.46e-3 | 6.96e-2 ± 8.80e-3 | 1.20e-1 ± 6.23e-3 | 1.15e-1 ± 1.05e-2 | 1.33e-1 ± 9.48e-3 | 1.04e-1 ± 1.25e-2 | 9.32e-2 ± 1.06e-2 | 9.46e-2 ± 5.23e-3 | 2.35e-1 ± 1.09e-2
LEVY20D | 1.51e-2 ± 1.69e-3 | 1.40e-2 ± 1.63e-3 | 1.98e-2 ± 8.61e-3 | 2.64e-2 ± 1.12e-2 | 4.27e-2 ± 4.16e-3 | 6.91e-2 ± 9.00e-3 | 1.88e-1 ± 1.15e-2 | 2.01e-1 ± 1.07e-2 | 1.13e-2 ± 4.60e-3 | 1.61e-2 ± 5.47e-3 | 1.98e-2 ± 7.88e-3 | 1.48e-1 ± 6.58e-3
ROSENBROCK20D | 3.47e-2 ± 7.08e-3 | 6.03e-3 ± 9.93e-4 | 8.94e-3 ± 4.00e-3 | 1.91e-2 ± 4.04e-3 | 7.41e-3 ± 1.14e-3 | 7.61e-3 ± 2.15e-3 | 6.46e-2 ± 1.22e-2 | 1.41e-1 ± 1.29e-2 | 2.96e-3 ± 1.36e-4 | 4.27e-3 ± 1.33e-3 | 1.09e-2 ± 1.52e-3 | 7.80e-2 ± 8.38e-3
G-FUNCTION20D | 4.16e-1 ± 8.99e-3 | 4.18e-1 ± 1.70e-2 | 4.59e-1 ± 6.77e-3 | 4.86e-1 ± 3.95e-3 | 4.70e-1 ± 4.43e-3 | 4.92e-1 ± 1.58e-3 | 4.18e-1 ± 1.33e-2 | 4.02e-1 ± 1.61e-2 | 3.79e-1 ± 2.97e-2 | 3.73e-1 ± 2.56e-2 | 4.47e-1 ± 9.37e-3 | 4.86e-1 ± 2.44e-3
PERM20D | 7.45e-5 ± 2.09e-5 | 7.46e-5 ± 2.50e-5 | 2.20e-4 ± 8.98e-5 | 1.57e-4 ± 7.50e-5 | 9.98e-5 ± 3.19e-5 | 2.03e-4 ± 8.86e-5 | 3.54e-5 ± 6.52e-6 | 1.85e-4 ± 6.90e-5 | 7.98e-5 ± 2.19e-5 | 9.28e-5 ± 3.89e-5 | 9.44e-5 ± 3.25e-5 | 1.90e-5 ± 4.67e-6
BNN20D | 1.55e-1 ± 4.07e-2 | 1.20e-1 ± 3.08e-2 | 1.79e-1 ± 3.36e-2 | 1.38e-1 ± 2.67e-2 | 1.76e-1 ± 3.45e-2 | 2.43e-1 ± 4.40e-2 | 1.87e-1 ± 3.90e-2 | 3.01e-1 ± 7.09e-2 | 2.92e-1 ± 5.71e-2 | 3.04e-1 ± 4.99e-2 | 1.09e-1 ± 2.06e-2 | 6.85e-1 ± 3.29e-2
Average Regret 20D | 1.24e-1 ± 8.46e-3 | 1.12e-1 ± 7.04e-3 | 1.33e-1 ± 7.11e-3 | 1.34e-1 ± 5.90e-3 | 1.39e-1 ± 7.02e-3 | 1.62e-1 ± 9.00e-3 | 1.71e-1 ± 8.89e-3 | 2.09e-1 ± 1.49e-2 | 1.37e-1 ± 1.29e-2 | 1.40e-1 ± 1.13e-2 | 1.17e-1 ± 4.80e-3 | 2.80e-1 ± 6.94e-3

B.3.3. BO: DETAILED RESULTS

In Table 14, we present the mean final regrets, which correspond to the ranks shown in Table 4 in the main paper.

B.3.4. BO: REGRET PLOTS

In Figures 14-16, we present the regret plots for each test function and both MW values.

Figure 14. Regret plots for all 5D test functions and MWs of 0.05 and 0.5, respectively (panels (a)-(j): G-Function, Levy, Perm, Rosenbrock and BNN, each at MW 0.05 and 0.5; x-axis: number of evaluations, 8 to 72). We show regrets averaged over 100 runs (solid lines) with 95% CIs.

Figure 15. Regret plots for all 10D test functions and MWs of 0.05 and 0.5, respectively (same panel layout as Figure 14). We show regrets averaged over 100 runs (solid lines) with 95% CIs.

Figure 16. Regret plots for all 20D test functions and MWs of 0.05 and 0.5, respectively (same panel layout as Figure 14). We show regrets averaged over 100 runs (solid lines) with 95% CIs.

C. Extensions

C.1. Incorporating Data Noise

We now discuss one way to incorporate data noise in NOMU. If $\sigma_n$ is unknown, one option to learn it is to add another output $\hat\sigma_n$ to NOMU's architecture and, in the case of Gaussian data noise uncertainty, change the loss function to

$$\mathcal{L}^{\pi}(\mathcal{NN}_\theta) := \sum_{i=1}^{n_{train}} \left[ \frac{\big(\hat f(x_i^{train}) - y_i^{train}\big)^2}{2\,\big(\hat\sigma_n(x_i^{train})\big)^2} + \ln\big(\hat\sigma_n(x_i^{train})\big) \right] + \pi_{sqr} \sum_{i=1}^{n_{train}} \frac{\big(\hat r_f(x_i^{train})\big)^2}{2\,\big(\hat\sigma_n(x_i^{train})\big)^2} + \pi_{exp}\, \frac{1}{\lambda^d(X)} \int_X e^{-c_{exp}\, \hat r_f(x)}\, dx. \qquad (57)$$

In this case, we recommend to first train the model prediction $\hat f$ and the data noise $\hat\sigma_n$ only, then freeze their parameters and train the $\hat r_f$-network for capturing the model uncertainty $\hat\sigma_f$. Note that in the NOMU loss (4) we implicitly assumed a constant, very small and negligible data noise $\sigma_n^2$, absorbed as a factor into the hyperparameter $\pi_{exp}$ and the regularization factor λ. Therefore, if one wishes to be consistent with these prior assumptions, the hyperparameters $\pi_{exp}$ and λ need to be divided by $2\sigma_n^2$ in Equation (57). Thus, when using the loss (57) instead of the NOMU loss (4), $\pi_{exp}$ and λ need to be chosen significantly larger.
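As an illustration, here is a minimal NumPy sketch of (57), with the integral in the last term approximated by a Monte Carlo average over artificial input points sampled uniformly from X. All names are ours; in practice, each argument would be the corresponding network output.

```python
import numpy as np

def nomu_loss_with_noise(f_hat, sigma_n_hat, r_hat, y_train, r_hat_art,
                         pi_sqr=0.1, pi_exp=0.01, c_exp=30.0):
    """Eq. (57): Gaussian NLL for mean and data noise, a noise-scaled
    pull of the raw uncertainty r_hat to zero at training inputs, and a
    Monte Carlo estimate of the exponential push-up term on artificial
    points (r_hat_art = r_hat evaluated on uniform samples from X)."""
    nll = np.sum((f_hat - y_train) ** 2 / (2.0 * sigma_n_hat ** 2)
                 + np.log(sigma_n_hat))
    b = pi_sqr * np.sum(r_hat ** 2 / (2.0 * sigma_n_hat ** 2))
    c = pi_exp * np.mean(np.exp(-c_exp * r_hat_art))
    return nll + b + c
```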
In the case of known (heteroscedastic) data noise $\sigma_n(x)$, (57) can be simplified by replacing $\hat\sigma_n$ with $\sigma_n$ and dropping the term $\ln(\sigma_n)$ (in this case, one can then also drop the $\hat\sigma_n$-output of the NOMU architecture). Predictive UBs are then obtained as

$$\hat f(x) \pm \sqrt{c_1\, \hat\sigma_f^2(x) + c_2\, \hat\sigma_n^2(x)} \qquad (58)$$

with suitable calibration parameters $c_1, c_2 \in \mathbb{R}_{\ge 0}$. In the case of known, normally distributed data noise (and under the assumption that the posterior of each f(x) is Gaussian), it is sufficient to calibrate a single calibration parameter c to obtain approximate α-predictive bounds

$$\hat f(x) \pm \Phi^{-1}\!\left(1 - \tfrac{1-\alpha}{2}\right) \sqrt{c\, \hat\sigma_f^2(x) + \sigma_n^2(x)}, \qquad (59)$$

where c relates to the typically unknown prior scale.
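A small sketch of (58)/(59), where $\Phi^{-1}$ is the standard normal quantile from scipy.stats (names are ours):

```python
import numpy as np
from scipy.stats import norm

def predictive_bounds(f_hat, sigma_f_hat, sigma_n, alpha=0.95, c=1.0):
    """Approximate alpha-predictive bounds per Eq. (59), assuming Gaussian
    marginals: f_hat(x) +/- Phi^{-1}(1 - (1 - alpha)/2) *
    sqrt(c * sigma_f^2 + sigma_n^2); Eq. (58) is the two-parameter variant."""
    z = norm.ppf(1.0 - (1.0 - alpha) / 2.0)   # e.g. ~1.96 for alpha = 0.95
    half_width = z * np.sqrt(c * sigma_f_hat ** 2 + sigma_n ** 2)
    return f_hat - half_width, f_hat + half_width
```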
C.2. NOMU for Upwards and Downwards Directed Hypothesis Classes

As mentioned in Appendix A.1 and discussed in more detail in Appendix A.3, the set $\mathcal{H}_{D^{train}}$ is often not upwards directed for typical NN architectures, and Equation (7) of Theorem A.1 is not fulfilled in general. Therefore, we carefully designed our NOMU algorithm to cope with settings where the set $\mathcal{H}_{D^{train}}$ is not upwards and/or downwards directed. The downwards directed property is defined analogously as follows:

Assumption 2 (DOWNWARDS DIRECTED). For every $f_1, f_2 \in \mathcal{H}_{D^{train}}$ there exists an $f \in \mathcal{H}_{D^{train}}$ such that $f(x) \le \min(f_1(x), f_2(x))$ for all $x \in X$.

In the following, however, we discuss a modification of NOMU that is specifically designed for the case that $\mathcal{H}_{D^{train}}$ is indeed upwards and/or downwards directed. In this case, by Theorem A.1, we can directly solve

$$\arg\max_{h \in \mathcal{H}_{D^{train}}} \int_X u\big(h(x) - \hat f(x)\big)\, dx \qquad (60)$$

to obtain an upper UB and/or

$$\arg\min_{h \in \mathcal{H}_{D^{train}}} \int_X u\big(h(x) - \hat f(x)\big)\, dx \qquad (61)$$

to obtain a lower UB, without the modifications used in the proposed NOMU algorithm (we do not need the dashed connections in the architecture from Figure 2, we do not need a specific choice of u, and we do not need to introduce $\hat r_f$ and the final activation function ϕ). The UBs obtained in this way then exactly coincide with the pointwise upper and lower UBs defined in (5), respectively. Moreover, in this case $\hat f$ can even be removed from Equations (60) and (61) (as can be seen from the proof of Theorem A.4). Thus, in the following loss formulation, we will remove $\hat f$ in the respective term $(\tilde c)$.

C.2.1. THE ARCHITECTURE

Under Assumption 1 (upwards directed) and Assumption 2 (downwards directed), we propose an architecture $\widetilde{\mathcal{NN}}_\theta$ consisting of three sub-networks with three outputs: the model prediction $\hat f$, the estimated lower UB $\underline{\widehat{UB}}$ and the estimated upper UB $\overline{\widehat{UB}}$. In Figure 17, we provide a schematic representation of $\widetilde{\mathcal{NN}}_\theta$.

Figure 17. $\widetilde{\mathcal{NN}}_\theta$: a modification of NOMU's original network architecture for upwards and downwards directed hypothesis classes (a $\underline{\widehat{UB}}$-network and an $\overline{\widehat{UB}}$-network with outputs $\underline{\widehat{UB}}(x), \overline{\widehat{UB}}(x)$, alongside the $\hat f$-network).

C.2.2. THE LOSS FUNCTION

From Equations (60) and (61) we can then directly formulate the following modified NOMU loss function $\tilde{\mathcal{L}}^{\pi}$.

Definition C.1 (NOMU LOSS, UPWARDS AND DOWNWARDS DIRECTED). Let $\pi := (\pi_{sqr}, \pi_{exp}, c_{exp}) \in \mathbb{R}^3_{\ge 0}$ denote a tuple of hyperparameters, let $\lambda^d$ denote the d-dimensional Lebesgue measure, and let $u: Y \to \mathbb{R}$ be strictly increasing and continuous. Given a training set $D^{train}$, the loss function $\tilde{\mathcal{L}}^{\pi}$ is defined as

$$\tilde{\mathcal{L}}^{\pi}(\widetilde{\mathcal{NN}}_\theta) := \underbrace{\sum_{i=1}^{n_{train}} \big(\hat f(x_i^{train}) - y_i^{train}\big)^2}_{(\tilde a)} + \pi_{sqr} \underbrace{\sum_{i=1}^{n_{train}} \big(\overline{\widehat{UB}}(x_i^{train}) - y_i^{train}\big)^2 + \big(\underline{\widehat{UB}}(x_i^{train}) - y_i^{train}\big)^2}_{(\tilde b)} + \pi_{exp} \underbrace{\frac{1}{\lambda^d(X)} \int_X u\big(\underline{\widehat{UB}}(x)\big) - u\big(\overline{\widehat{UB}}(x)\big)\, dx}_{(\tilde c)}.$$

The interpretations of the three terms $(\tilde a)$, $(\tilde b)$ and $(\tilde c)$ are analogous to the ones in the original NOMU loss formulation. Note that the three sub-networks, i.e., the $\overline{\widehat{UB}}$-network, the $\underline{\widehat{UB}}$-network and the $\hat f$-network, can also be trained independently using the corresponding terms in the loss function.
Moreover, if one is only interested in the upper (lower) UB, or if $\mathcal{H}_{D^{train}}$ is only upwards (downwards) directed, i.e., fulfills only Assumption 1 (Assumption 2), then one can remove the respective sub-network from the architecture as well as the corresponding terms from the loss function. Furthermore, note that the obtained UBs can now also be asymmetric.

D. Discussion of the Desiderata

In this section, we discuss in more detail the desiderata proposed in Section 3.1. Specifically, we discuss how NOMU fulfills them and prove several propositions along the way. First, we establish a relation of NOMU to the classical Bayesian approach. The Bayesian point of view allows for mathematically rigorous estimation of uncertainty. However, in general, a fully Bayesian approach for quantifying uncertainty is very challenging and involves to

i. formulate a realistic prior,

ii. use an algorithm to approximate the posterior (it is challenging to obtain a good approximation in feasible time for complex models such as NNs),

iii. use this approximation of the posterior to obtain UBs.

We follow a different approach by directly approximating iii. based on essential properties of the posterior, e.g., that for zero data noise model uncertainty vanishes at, and becomes larger far away from, training data. This can be a reasonable approach in applications, since many Bayesian state-of-the-art methods even fail to fulfill these basic properties when implemented in practice (Malinin & Gales, 2018) (see Figure 1). Since especially for NNs ii. is very costly, we ask ourselves: is it even true that in practice one has more intuition about the important properties of the prior than about the important properties of the posterior? In other words, can i. and ii. be skipped by directly approximating iii.? For mean predictions, many successful algorithms following this procedure already exist. These algorithms directly approximate the posterior mean by exploiting one's intuition of how the posterior of a realistic prior should behave, without the need to precisely specify the prior, e.g.:

1. Spline regression: In many applications it is very intuitive that a good approximation of the posterior mean should not have an unnecessarily large second derivative, without explicitly stating the corresponding prior. Even though spline regression can be formulated as the posterior mean of a limit of priors (Wahba, 1978), for a practitioner it can be much more intuitive to decide whether spline regression fits her prior beliefs by looking at the smoothness properties of the posterior mean than by looking at such complex priors.

2. Convolutional Neural Networks (CNNs): In image recognition, it is very intuitive that a good approximation of the posterior mean should fulfill typical properties, e.g., two pictures that are only slightly shifted should be mapped to similar outputs. CNNs fulfill such typical properties to a large extent and have thus proven to be very successful in practice. Nevertheless, from a Bayesian point of view, these properties rely on a yet unknown prior.

D.1. Desideratum D1 (Non-Negativity)

Desideratum D1 (Non-Negativity) is trivial, since $\sigma_f \ge 0$ per definition. Credible bounds $\underline{CB}$ and $\overline{CB}$ are lower and upper bounds of an interval $[\underline{CB}, \overline{CB}]$; therefore $\underline{CB} \le \overline{CB}$ holds by definition as well. To the best of our knowledge, every established method to estimate model uncertainty yields bounds that fulfill D1 (Non-Negativity).
Furthermore, note that D1 (Non-Negativity) also holds in the presence of data noise uncertainty.

D.1.1. HOW DOES NOMU FULFILL D1 (NON-NEGATIVITY)?

By definition, NOMU exactly fulfills D1 (Non-Negativity).

Proposition D.1.a. For NOMU, $\hat\sigma_f \ge 0$ and thus $\underline{UB}_c = \hat f - c\,\hat\sigma_f \le \hat f \le \hat f + c\,\hat\sigma_f = \overline{UB}_c$ for all $c \ge 0$.

Proof. This holds since $\ell_{min} \ge 0$ in the readout map ϕ (see Equation (2)).

D.2. Desideratum D2 (In-Sample)

In the case of zero data noise, Desideratum D2 (In-Sample) holds true exactly.

Proposition D.2.a (ZERO MODEL UNCERTAINTY AT TRAINING POINTS). Let $\sigma_n \equiv 0$. Furthermore, let $D^{train}$ be a finite set of training points and consider a prior distribution $P[f]$ on the function space $\{f: X \to Y\}$ such that there exists a function in the support of $P[f]$ that exactly fits through the training data. Then, for the posterior distribution, it holds for all $(x^{train}, y^{train}) \in D^{train}$ that

$$P\big(f(x^{train}) = y^{train} \,\big|\, D^{train}\big) = 1 \qquad (65)$$

$$P\big(f(x^{train}) \neq y^{train} \,\big|\, D^{train}\big) = 0. \qquad (66)$$

In words, there is no model uncertainty at input training points, i.e., $\sigma_f(x^{train}) = 0$ for all $x^{train} \in \{x^{train} : (x^{train}, y^{train}) \in D^{train}\}$.

Proof. Intuitively, if the noise is zero, the data generating process is $y^{train} = f(x^{train}) + 0$. Thus, if we observe $(x^{train}, y^{train}) \in D^{train}$, we know that $f(x^{train}) = y^{train}$ with zero uncertainty. More formally, let $(x^{train}, y^{train}) \in D^{train}$ and define for some ϵ > 0

$$U_\epsilon(y^{train}) := (y^{train} - \epsilon,\, y^{train} + \epsilon), \qquad U_\epsilon(D^{train}) := \bigcup_{(x,y) \in D^{train}} U_\epsilon(x, y),$$

where $U_\epsilon(x, y)$ denotes an ϵ-ball around $(x, y) \in X \times Y$. Furthermore, let D be a random variable describing the data generating process assuming zero noise. Then it holds that

$$P\big(f(x^{train}) \in U_\epsilon(y^{train})^c \,\big|\, D \in U_\epsilon(D^{train})\big) = \frac{P\big(D \in U_\epsilon(D^{train}) \wedge f(x^{train}) \in U_\epsilon(y^{train})^c\big)}{P\big(D \in U_\epsilon(D^{train})\big)} = \frac{0}{P\big(D \in U_\epsilon(D^{train})\big)}.$$

Note that $P(D \in U_\epsilon(D^{train})) > 0$ for every ϵ > 0, since by assumption there exists a function in the support of the prior that exactly fits through the training data.[34] Defining the posterior in the canonical way via

$$P\big(f(x^{train}) \neq y^{train} \,\big|\, D^{train}\big) := \lim_{\epsilon \to 0} P\big(f(x^{train}) \in U_\epsilon(y^{train})^c \,\big|\, D \in U_\epsilon(D^{train})\big) = 0$$

concludes the proof.

[34] Formally, there exists an $f^*$ that fits exactly through $D^{train}$ with the property $P[f \in U_\epsilon(f^*)] > 0$. Since $0 < P[f \in U_\epsilon(f^*)] \le P(D \in U_\epsilon(D^{train}))$ (for the canonical $L^\infty$-topology), the claim follows. Given this, one can also see that Proposition D.2.a still holds true under the even weaker assumption $P(D \in U_\epsilon(D^{train})) > 0$.

Even if, theoretically, we know that $\sigma_f(x^{train}) = 0$ at all training points $x^{train}$, in practice $\hat\sigma_f(x^{train}) \approx 0$ can be acceptable (for numerical reasons).

D.2.1. NON-ZERO DATA NOISE

For non-zero data noise, there is non-zero data-noise-induced model uncertainty at input training points. However, also for non-zero but small data noise, the model uncertainty at input training points should not be significantly larger than the data noise. In fact, for GPs one can rigorously show that $\sigma_f(x^{train}) \le \sigma_n(x^{train})$ in the case of known $\sigma_n$.

Proposition D.2.b (GPS' MODEL UNCERTAINTY AT TRAINING POINTS). Let $D^{train}$ be a set of training points. For a prior $f \sim GP(m(\cdot), k(\cdot,\cdot))$ and fixed $\sigma_n$, it holds that

$$\sigma_f(x^{train}) \le \sigma_n(x^{train}) \qquad (67)$$

for all input training points $x^{train}$.

Proof. We prove the proposition by induction over the number of training points $n_{train}$. For this, let $A_{n_{train}} := \{x^{train} : (x^{train}, y^{train}) \in D^{train}\}$.

Base case $n_{train} = 1$: In this case $A_1 = \{x_1^{train}\}$. Let $k := k(x_1^{train}, x_1^{train})$. Since

$$\sigma_f^2(x_1^{train}) \overset{(29)}{=} k - k\,\big(k + \sigma_n^2(x_1^{train})\big)^{-1} k \le \sigma_n^2(x_1^{train}) \qquad (68)$$

$$\iff k^2 \ge k^2 - \sigma_n^4(x_1^{train}), \qquad (69)$$

the claim follows.
Inductive hypothesis $n_{train} = m$: Let $K(A_m, A_m)$ be the Gram matrix and let $P := K(A_m, A_m) + \mathrm{diag}(\sigma_n^2(A_m))$. We then assume for all $x^{train} \in A_m$ that

$$\sigma_f^2(x^{train} \,|\, A_m) \overset{(29)}{=} k(x^{train}, x^{train}) - k(x^{train}, A_m)^T P^{-1} k(x^{train}, A_m) \le \sigma_n^2(x^{train}). \qquad (70, 71)$$

Inductive step $n_{train} = m + 1$: We now show that, under the inductive hypothesis, for any $x^{train} \in A_{m+1}$ we have

$$\sigma_f^2(x^{train}) \overset{(29)}{=} k(x^{train}, x^{train}) - \begin{pmatrix} k(x^{train}, A_m) \\ k(x^{train}, x_{m+1}^{train}) \end{pmatrix}^{\!T} \begin{pmatrix} P & Q \\ R & S \end{pmatrix}^{-1} \begin{pmatrix} k(x^{train}, A_m) \\ k(x^{train}, x_{m+1}^{train}) \end{pmatrix} \le \sigma_n^2(x^{train}), \qquad (72)$$

where

$$R := \big(k(x_1^{train}, x_{m+1}^{train}), \dots, k(x_m^{train}, x_{m+1}^{train})\big) = k(x_{m+1}^{train}, A_m)^T, \qquad Q := R^T, \qquad S := k(x_{m+1}^{train}, x_{m+1}^{train}) + \sigma_n^2(x_{m+1}^{train}).$$

Setting $k := k(x^{train}, x^{train})$, $v := k(x^{train}, A_m)$ and $w := k(x^{train}, x_{m+1}^{train})$, (72) can be rewritten as

$$k - \begin{pmatrix} v \\ w \end{pmatrix}^{\!T} \begin{pmatrix} P & Q \\ R & S \end{pmatrix}^{-1} \begin{pmatrix} v \\ w \end{pmatrix} \le \sigma_n^2(x^{train}) \qquad (73)$$

with submatrices P, Q, R, S as in (Williams & Rasmussen, 2006, (A.12)). Furthermore, with $M := (S - R P^{-1} Q)^{-1}$ as in (Williams & Rasmussen, 2006, (A.12)), we get that (73) is equivalent to

$$k - v^T P^{-1} v - v^T P^{-1} Q M R P^{-1} v + v^T P^{-1} Q M w + w M R P^{-1} v - w^2 M \le \sigma_n^2(x^{train})$$

$$\iff k - v^T P^{-1} v - \big(v^T P^{-1} Q - w\big) M \big(R P^{-1} v - w\big) \le \sigma_n^2(x^{train})$$

$$\iff \underbrace{k - v^T P^{-1} v}_{= \sigma_f^2(x^{train} | A_m)} - \big(v^T P^{-1} Q - w\big)^2 M \le \sigma_n^2(x^{train}), \qquad (74)$$

where the last line follows since $R^T = Q$ and P is symmetric. Note that

$$M = \Big( k(x_{m+1}^{train}, x_{m+1}^{train}) + \sigma_n^2(x_{m+1}^{train}) - k(x_{m+1}^{train}, A_m)^T P^{-1} k(x_{m+1}^{train}, A_m) \Big)^{-1} = \Big( \sigma_n^2(x_{m+1}^{train}) + \sigma_f^2(x_{m+1}^{train} \,|\, A_m) \Big)^{-1}.$$

With this, (74) can be further reformulated as

$$\sigma_f^2(x^{train} \,|\, A_m) - \underbrace{\frac{\Big(k(x^{train}, A_m)^T P^{-1} k(x_{m+1}^{train}, A_m) - k(x^{train}, x_{m+1}^{train})\Big)^2}{\sigma_n^2(x_{m+1}^{train}) + \sigma_f^2(x_{m+1}^{train} \,|\, A_m)}}_{\ge 0} \le \sigma_n^2(x^{train}). \qquad (75)$$

First, for $x^{train} \in A_m$, (75) holds true by the inductive hypothesis, since the subtracted term is non-negative. Second, for $x^{train} = x_{m+1}^{train}$ we obtain

$$\sigma_f^2(x_{m+1}^{train} \,|\, A_m) - \frac{\big(\sigma_f^2(x_{m+1}^{train} \,|\, A_m)\big)^2}{\sigma_n^2(x_{m+1}^{train}) + \sigma_f^2(x_{m+1}^{train} \,|\, A_m)} \le \sigma_n^2(x_{m+1}^{train}) \iff \sigma_f^4(x_{m+1}^{train} \,|\, A_m) - \sigma_n^4(x_{m+1}^{train}) \le \sigma_f^4(x_{m+1}^{train} \,|\, A_m) \iff -\sigma_n^4(x_{m+1}^{train}) \le 0,$$

which always holds. Thus, (72) holds true for any $x^{train} \in A_{m+1}$.

In fact, Proposition D.2.b does not come as a surprise: even if we only observe one training point $(x^{train}, y^{train})$ and ignore all our prior knowledge by using a flat, uninformative, improper prior $p(f(x^{train})) \propto 1$, this results in $\sigma_f(x^{train}) = \sigma_n(x^{train})$. Introducing additional information, e.g., observing more training points or introducing additional prior information (such as smoothness assumptions instead of a flat uninformative prior), typically reduces model uncertainty further. Thus, we believe that $\sigma_f(x^{train}) \le \sigma_n(x^{train})$ holds for most reasonable priors. Finally, note that Proposition D.2.a and Proposition D.2.b hold true for every prior, respectively every Gaussian process prior, as long as there exists an f in the support of this prior that explains the observed training points (even if this prior is strongly misspecified). For example, this assumption is obviously fulfilled for the prior of Gaussian distributed weights of an overparameterized NN (BNN).

D.2.2. WHY DOES MC DROPOUT STRONGLY VIOLATE D2 (IN-SAMPLE)?

In Figure 1, MC dropout (MCDO) predicts $\hat\sigma_f(x_i^{train}) > 100\,\sigma_n$ for every input training point. Thus, if $\hat\sigma_f(x_i^{train})$ were correctly calculated as posterior model uncertainty, this would be a practically unobservable event as long as f actually comes from this prior ($P[|y_i^{train} - f(x_i^{train})| > 100\,\sigma_n] < 10^{-2173}$). Therefore, this is clear statistical evidence that MCDO severely fails to estimate posterior model uncertainty at training points.
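As a quick sanity check of the stated order of magnitude (a sketch, assuming Gaussian data noise; the two-sided tail beyond 100 standard deviations is $2\Phi(-100)$):

```python
import numpy as np
from scipy.stats import norm

# log10 of P[|y - f(x)| > 100 * sigma_n] = 2 * Phi(-100) under Gaussian noise
log10_p = (norm.logsf(100.0) + np.log(2.0)) / np.log(10.0)
print(log10_p)  # approx -2173.6, i.e., P < 10^(-2173) as stated above
```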
This failure can have one of the following three reasons:

1. MCDO severely fails in correctly approximating the posterior given its prior (i.i.d. Gaussian on weights).

2. MCDO's prior does not fit the data generating process at all.

3. During our experiments, we very often observed extreme events that should only happen with probabilities smaller than $10^{-2000}$.

We agree with prior work (Gal & Ghahramani, 2016; Blundell et al., 2015) that a Gaussian prior on the weights of a NN, i.e., the prior mentioned in Item 2, is a very reasonable assumption. Note that NOMU can also be seen as a heuristic to approximate the posterior model uncertainty given exactly the same prior (see Appendix A.2). Therefore, since Item 3 can be ruled out, we can conclude that MCDO's problem is Item 1.

MC Dropout's Failure in Approximating the Posterior. Table 10 and Table 11 show that, even though we generate the ground-truth function from the same prior assumed by MCDO (and also assumed by most BNN algorithms), NOMU significantly outperforms MCDO. This empirically shows (with the help of Theorem B.7) that (i) NOMU is able to better approximate posterior BNN credible bounds than MCDO in terms of average Kullback-Leibler divergence $\bar d_{KL}$ (including further popular variational BNN approximations from (Graves, 2011; Blundell et al., 2015; Hernández-Lobato & Adams, 2015), which themselves are outperformed by MCDO), and (ii) MCDO's variational approximation algorithm severely fails in approximating the targeted posterior.

D.2.3. IMPORTANCE OF D2 (IN-SAMPLE) IN BO

Especially in Bayesian optimization (BO), it is particularly important to fulfill D2 (In-Sample) as much as possible, since D2 (In-Sample) helps a lot to prevent the BO algorithm from getting stuck in local maxima. For NNs, we often observed that at the $i'$-th step the mean prediction is maximized/minimized either at the boundary or exactly at the training point with the largest/smallest function value observed so far, $\max_{i \in \{1,\dots,i'\}} y_i$ / $\min_{i \in \{1,\dots,i'\}} y_i$ (see Figure 9). In the latter case, without model uncertainty (or with almost constant model uncertainty, as is the case for MC dropout), one would query all future function evaluations at exactly this point, without learning anything new. E.g., consider the situation of Figure 9d when minimizing the Forrester function: each new function evaluation of MC dropout would be sampled at an already observed training point $x \approx 0.4$. This intuitively explains why estimating model uncertainty precisely at the training data points is especially important in BO, and why it can be very problematic in BO if the model uncertainty does not decrease sufficiently at the training data points. To summarize, D2 (In-Sample) strongly influences the acquisition function in a direction that discourages the algorithm from choosing the same point again, and D2 (In-Sample) together with D3 (Out-of-Sample) can prevent the BO algorithm from getting stuck (see also Appendix B.3.2).

D.2.4. DOMINATING DATA NOISE

In the case of dominating data noise uncertainty $\sigma_n \gg 0$, the model uncertainty $\sigma_f$ should not be small at input training points (only if one observes a very large number of input training points very close to an input training point $x^{train}$ should the model uncertainty become small). However, in this paper, we do not focus on the case of large data noise uncertainty, but on the case of negligible or zero data noise. In particular, D2 (In-Sample) is only formulated for this case.
D.2.5. HOW DOES NOMU FULFILL D2 (IN-SAMPLE)?

Recall that we train NOMU by minimizing

$$\mathcal{L}^{\pi}(\mathcal{NN}_\theta) + \lambda \|\theta\|_2^2, \qquad (76)$$

where the NOMU loss $\mathcal{L}^{\pi}(\mathcal{NN}_\theta)$ is defined as

$$\mathcal{L}^{\pi}(\mathcal{NN}_\theta) := \underbrace{\sum_{i=1}^{n_{train}} \big(\hat f(x_i^{train}) - y_i^{train}\big)^2}_{(a)} + \pi_{sqr} \underbrace{\sum_{i=1}^{n_{train}} \big(\hat r_f(x_i^{train})\big)^2}_{(b)} + \pi_{exp} \underbrace{\frac{1}{\lambda^d(X)} \int_X e^{-c_{exp}\, \hat r_f(x)}\, dx}_{(c)}. \qquad (77)$$

Then, the following proposition holds:

Proposition D.2.c. Let $\lambda, \pi_{exp}, c_{exp} \in \mathbb{R}_{\ge 0}$ be fixed and let $\hat\sigma_f$ be NOMU's model uncertainty prediction. Then, $\hat\sigma_f(x_i^{train})$ converges to $\ell_{min}$ for $\pi_{sqr} \to \infty$ for all input training points $x_i^{train}$, where $\ell_{min} \ge 0$ is an arbitrarily small constant modelling a minimal model uncertainty, used for numerical reasons.

Proof. By the definition of $\mathcal{L}^{\pi}(\mathcal{NN}_\theta)$, i.e., since (b) dominates the loss function as $\pi_{sqr} \to \infty$, it follows that $\hat r_f(x_i^{train}) \to 0$. More precisely, for the NN $\mathcal{NN}_{\theta^*} = (\hat f^*, \hat r_f^*)$ with parameters $\theta^*$ that minimize (76), it holds that

$$\mathcal{L}^{\pi}(\mathcal{NN}_{\theta^*}) + \lambda \|\theta^*\|_2^2 \le \mathcal{L}^{\pi}(0) + \lambda \|0\|_2^2 = \sum_{i=1}^{n_{train}} (y_i^{train})^2 + \pi_{exp} \cdot 1,$$

and therefore, since all terms of the loss are non-negative,

$$\pi_{sqr} \sum_{i=1}^{n_{train}} \big(\hat r_f^*(x_i^{train})\big)^2 \le \mathcal{L}^{\pi}(\mathcal{NN}_{\theta^*}) + \lambda \|\theta^*\|_2^2 \le \sum_{i=1}^{n_{train}} (y_i^{train})^2 + \pi_{exp} =: C,$$

where, for fixed $\lambda, \pi_{exp}, c_{exp} \in \mathbb{R}_{\ge 0}$, C > 0 is a constant. Assume now that $\hat r_f^*$ does not vanish at all training data points for $\pi_{sqr} \to \infty$, i.e., that there exists an ϵ > 0 such that for every $\pi_{sqr}$ large enough $\sum_{i=1}^{n_{train}} (\hat r_f^*(x_i^{train}))^2 > \epsilon$. This, however, implies $\pi_{sqr}\, \epsilon < C \iff \pi_{sqr} < \frac{C}{\epsilon}$ for $\pi_{sqr}$ large enough, which yields a contradiction. Thus, $\lim_{\pi_{sqr} \to \infty} \hat r_f^*(x^{train}) = 0$ for all training input points $x^{train}$. Finally, by Equation (2), it follows that $\hat\sigma_f(x_i^{train}) \to \ell_{min}$.

Note that even for a finite (sufficiently large) $\pi_{sqr}$, the raw model uncertainty $\hat r_f$ converges to zero as λ goes to zero ($\frac{\pi_{sqr}}{\lambda} \to \infty$), if the model is sufficiently over-parameterized. Empirically, one can see in Figures 1, 4, 5, 9, 11, 13, 18 and 19 that NOMU fulfills D2 (In-Sample) with high precision for our choice of hyperparameters.
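For illustration, the following is a minimal sketch of how the objective (76)/(77) can be approximated in practice, with the integral in term (c) replaced by a Monte Carlo average over artificial input points sampled uniformly from X; the function and variable names are ours.

```python
import numpy as np

def nomu_loss(f_hat, r_hat, y_train, r_hat_art,
              pi_sqr=0.1, pi_exp=0.01, c_exp=30.0):
    """Monte Carlo version of Eq. (77): (a) fits the mean prediction,
    (b) pulls the raw uncertainty r_hat to zero at training inputs
    (dominant for large pi_sqr, cf. Proposition D.2.c), and (c) pushes
    r_hat up on artificial points r_hat_art sampled uniformly from X
    (the mean over samples approximates the normalized integral)."""
    a = np.sum((f_hat - y_train) ** 2)
    b = pi_sqr * np.sum(r_hat ** 2)
    c = pi_exp * np.mean(np.exp(-c_exp * r_hat_art))
    return a + b + c
```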
D.3. Desideratum D3 (Out-of-Sample)

We first consider the case of zero (or negligible) data noise $\sigma_n \equiv 0$ and then discuss possible extensions to settings with non-zero data noise.

D.3.1. ZERO DATA NOISE

The notion of distance used in D3 (Out-of-Sample) heavily depends on the specific application (i.e., on the prior used in this application). More concretely, there are the following two "hyperparameters":

1. First, the metric[35] $d: X \times X \to \mathbb{R}_{\ge 0}$ on X that we use to measure distances can heavily depend on the prior for the specific application. For example, in the case of image recognition, two pictures that are only slightly shifted can be seen as very close to each other even if the Euclidean distance of their pixel values is quite high.[36] If one uses a CNN architecture in NOMU, this prior belief on d is approximately captured. The successful generalization properties of many different network architectures can be explained precisely by their use of application-dependent non-Euclidean metrics (Bronstein et al., 2017). (Additionally, instead of fixing d a priori, further aspects of the metric d can be learned from the training data, as we discuss in detail in Appendix D.4.)

2. Second, even if we can precisely write down a metric $d: X \times X \to \mathbb{R}_{\ge 0}$, it is a priori not clear how to define the distance $\bar d: X \times 2^X \to \mathbb{R}_{\ge 0}$ between a point x and the input training points $D_X^{train} := \{x^{train} : (x^{train}, y^{train}) \in D^{train}\}$. Both common definitions, $\bar d(x, D_X^{train}) := \inf_{z \in D_X^{train}} d(x, z)$ and $\bar d(x, \mathrm{Conv}(D_X^{train}))$, where Conv(·) denotes the convex hull, are inappropriate choices for $\bar d$.[37]

[35] We use the term metric to describe a general pseudometric.

[36] We assume in this paper that if one sees a 1920×1080-pixel image which is perfectly recognizable as a cat, every 10-pixel shift of this picture is also recognizable as a cat with almost no uncertainty (even though this cannot be proven mathematically). Thus, it is very desirable to predict very small model uncertainty $\hat\sigma_f(x)$ for every image $x \in X$ that is shifted by less than 10 pixels from at least one noiselessly labeled training image $x^{train}$.

[37] E.g., for GPs, neither of these two classical notions of distance between a point and a set is entirely applicable (see Equation (29)). An appropriate choice of $\bar d$ should be a compromise between these two notions.

In Section 3.1, we consider a point x closer to the input training points if it is surrounded by input training points in all directions, as opposed to a point x which only has close input training points in some directions and for which there is a large range of directions without any close input training points. This implies, for example, that

(i) very close to noiseless input training points that are surrounded by many other noiseless input training points, there is very little model uncertainty;

(ii) for extrapolation, one typically has more uncertainty than for interpolation;

(iii) far away from the convex hull of the training points, model uncertainty is very high;

(iv) also within the convex hull of the training data, model uncertainty is high within big gaps in between training points.

Figure 11 shows how well NOMU fulfills these properties of $\bar d$, similarly to a GP (see Figure 12a).

D.3.2. NON-ZERO HOMOSCEDASTIC DATA NOISE

If there is homoscedastic non-zero data noise $\sigma_n(x) \equiv \sigma_n > 0$, it is important that the distance $\bar d$ of x to the input training points is not minimal if x exactly equals one of the input training points. Instead, one should use a notion of distance $\bar d$ that can become even smaller if there are multiple input training points at x or very close to x.

D.3.3. NON-ZERO HETEROSCEDASTIC DATA NOISE

One can also extend D3 (Out-of-Sample) to heteroscedastic settings. In that case, the used notion of distance $\bar d$ of x to the input training points needs to be weighted by the precision of the input training points, i.e., if x is close to multiple input training points with low data noise $\sigma_n(\cdot)$, one considers x closer to the input training points than if x is close to multiple input training points with high data noise.

D.3.4. EXAMPLE FOR $\bar d$ BASED ON GPS

In this section, we give the concrete example of Gaussian process regression (GPR) from Appendix B.1.1, in which $\bar d$ from D3 (Out-of-Sample) can be written down explicitly in closed form. For any arbitrary metric d on X, consider for instance the kernel $k(x_i, x_j) = e^{-d(x_i, x_j)^2}$. Then $\bar d(x, D_X^{train}) = \hat\sigma_f(x \,|\, D_X^{train})$, with the posterior model uncertainty $\hat\sigma_f$ from Equation (29). While this is one of the simplest possible ways to define $\bar d$, one could alternatively also define it differently, as long as it shares similar qualitative properties.[38]

[38] For instance, $\bar d$ could also be defined based on a kernel of the form $k(x_i, x_j) = g(d(x_i, x_j))$ with a monotonically decreasing function g, e.g., a Matérn-type kernel.

Why do we still consider it interesting to formulate D3 (Out-of-Sample) vaguely, given that there is already such a precise formula as is the case for GPs?

1. The GP's formula only holds true for the specific prior of a GP. We, however, want to formulate desiderata that capture the most essential properties of credible bounds that almost all reasonable priors have in common.

2. We want to provide some easy-to-understand intuition for D3 (Out-of-Sample): it might be challenging to see directly from the GP's formula (29) how the posterior model uncertainty qualitatively behaves, as visualized in Figure 12a.
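To make the GP example of Appendix D.3.4 concrete, here is a minimal NumPy sketch that computes $\bar d(x, D_X^{train})$ as the GP posterior standard deviation for the kernel $k(x_i, x_j) = e^{-d(x_i, x_j)^2}$ with Euclidean d; a small jitter is added for numerical stability in the noiseless setting, and the names are ours.

```python
import numpy as np

def gp_distance(x, X_train, jitter=1e-10):
    """d_bar(x, X_train): posterior std of a noiseless GP with kernel
    k(a, b) = exp(-||a - b||^2), cf. Eq. (29). Small near training
    points, close to 1 (the prior std) far away from them."""
    sq_dists = lambda A, B: np.sum((A[:, None, :] - B[None, :, :]) ** 2, -1)
    K = np.exp(-sq_dists(X_train, X_train)) + jitter * np.eye(len(X_train))
    k_x = np.exp(-sq_dists(x[None, :], X_train))[0]
    var = 1.0 - k_x @ np.linalg.solve(K, k_x)   # k(x, x) = exp(0) = 1
    return np.sqrt(max(var, 0.0))

# Example: the "distance" shrinks near the training input (0, 0)
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gp_distance(np.array([0.1, 0.0]), X))  # small
print(gp_distance(np.array([3.0, 3.0]), X))  # close to 1
```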
To summarize, both the exact notions of distance d, $\bar d$ and the exact rate at which model uncertainty increases with increasing distance to the input training points depend on one's prior belief. However, Section 3.1 gives a qualitative description of properties that most reasonable (generic) priors have in common (see Items (i)-(iv)).

D.3.5. HOW DOES NOMU FULFILL D3 (OUT-OF-SAMPLE)?

Recall that we train NOMU by minimizing

$$\mathcal{L}^{\pi}(\mathcal{NN}_\theta) + \lambda \|\theta\|_2^2, \qquad (79)$$

where the NOMU loss $\mathcal{L}^{\pi}(\mathcal{NN}_\theta)$ with terms (a), (b) and (c) is defined as in (77). The interplay of (b), (c) and regularization promotes D3 (Out-of-Sample) (note that the behaviour of $\hat r_f$ directly translates to the behaviour of $\hat\sigma_f$): term (c) pushes $\hat r_f$ towards infinity across the whole input space X. However, due to the counteracting force of (b) as well as regularization, $\hat r_f$ increases continuously as one moves away from the training data; see for example Figure 9 and Figure 11 (or any other plot showing NOMU, i.e., Figures 1, 4, 5, 13, 18 and 19). In Figure 11, one can see how NOMU fulfills the properties (i)-(iv) of $\bar d: X \times 2^X \to \mathbb{R}_{\ge 0}$ mentioned in Appendix D.3.1. In Figures 18 and 19, one can observe how NOMU behaves when a non-stationary metric ($d_{Fig.18}$), respectively a non-stationary, non-isotropic metric ($d_{Fig.19}$), is used (because $d_{Fig.18}$ and $d_{Fig.19}$ were learned from the data in these examples, as desired by D4 (Metric Learning)).

The hyperparameters $\pi_{exp}$ and $c_{exp}$ control the size and shape of the UBs. Concretely, the larger $\pi_{exp}$, the wider the UBs; the larger $c_{exp}$, the narrower the UBs at points x with large $\hat\sigma_f(x)$ and the wider the UBs at points x with small $\hat\sigma_f(x)$.

Finally, we give some intuition that, if CNNs are used for the two sub-networks in NOMU's architecture, D3 (Out-of-Sample) will be fulfilled with respect to an almost shift-invariant metric d: In the noiseless setting, we can choose $\pi_{sqr}$ large enough such that D2 (In-Sample) is fulfilled, so that we have $\hat\sigma_f(x^{train}) \approx 0$ at any training input point $x^{train}$. Regularized CNNs have the property that if one slightly shifts the input, the output barely changes. So if x can be obtained from $x^{train}$ by slightly shifting it, the CNN output $\hat\sigma_f(x) \approx \hat\sigma_f(x^{train}) \approx 0$ also does not move too far away from zero. Only if one moves further away with respect to the almost shift-invariant metric d is the CNN output $\hat\sigma_f$ able to move further away from zero. The same principle can also be used for other geometric NNs (e.g., graph neural networks (GNNs)), which correspond to different (non-Euclidean) metrics (Bronstein et al., 2017).

D.4. Desideratum D4 (Metric Learning)

A priori, it is often not obvious which metric d to choose in D3 to measure distances. In many applications, it is therefore best to learn this metric from the training data (as explained in Footnote 4 on Page 4). In the following section, we present visualizations of D4 (Metric Learning) for all benchmark algorithms in easy-to-understand, low-dimensional settings.
D.4.1. VISUALIZATION OF D4 (METRIC LEARNING)

1D. In order to visualize D4 and show how, for NOMU, the mean prediction impacts its model uncertainty prediction, we conduct the following experiment: We sample 16 equidistant noiseless training points of a trend-adjusted version of Sine 3. We then train NOMU (hyperparameters are as in Appendix B.2.2, with $\pi_{sqr} = 0.5$, $\ell_{min} = 10^{-4}$, regularization parameter $10^{-4}$ on the $\hat r_f$-network, and number of training epochs 212) and compute the corresponding UBs. Figure 18 shows that NOMU's UBs are wider (cp. the dotted blue line) in those areas of the input space where small changes of x lead to large variation in the target, compared to areas without large variation in the target. This effect is present even though the input training points are sampled from an equidistant grid, and it thus isolates the effect of D4.

Figure 18. Visualization of D4 (Metric Learning).

2D. Analogously, we visualize D4 for two-dimensional input by training NOMU on 16 training points sampled on an equidistant 4×4 grid and evaluated at the two-dimensional extension of the Step function, i.e.,

$$f: \mathbb{R}^2 \to \mathbb{R}: (x_1, x_2) \mapsto \begin{cases} -1 & \text{if } x_1 < 0 \\ 1 & \text{if } x_1 \ge 0. \end{cases} \qquad (82)$$

Here, D4 can be interpreted as follows: imagine we do not have any prior knowledge of whether $x_1$ or $x_2$ is more important for predicting the unknown function f. However, when NOMU observes the 16 training points, it should be able to learn that $x_1$ is more important for the model prediction than $x_2$, and that the function is more regular/predictable far away from $\{x_1 \approx 0\}$. In this example, D4 requires that feature $x_1$ should also have a higher impact than feature $x_2$ on the model uncertainty prediction. If a model for UBs did not incorporate D4, we would expect the uncertainty in this example to fulfill $\hat\sigma_f((x_1, x_2)) = \hat\sigma_f((x_2, x_1))$ because of the equidistant grid of the training points (this is indeed the case for GPs, see Figure 20a). For NOMU, however, we have very good control over how strongly we enforce D4; e.g., we can strengthen D4 by increasing the L2-regularization of the hidden layers in the $\hat r_f$-network and/or decreasing the size of the $\hat r_f$-network.

NOMU: Visualization of D4 in 2D. In Figure 19, we present the estimated model uncertainty $\hat\sigma_f$ obtained for different hyperparameters of the $\hat r_f$-network, with the $\hat f$-architecture fixed across all four subplots. Thus, Figure 19 shows how D4 materializes in different magnitudes. In Figure 19a, we use the same hyperparameters for the $\hat r_f$-network as for the $\hat f$-network. In Figure 19b, we only increase the L2-regularization of the $\hat r_f$-network. In Figure 19c, we only decrease the size of the $\hat r_f$-network. In Figure 19d, we combine both, i.e., we increase the L2-regularization of the $\hat r_f$-network and decrease its size. While D4 is barely visible in Figure 19a, it is clearly visible in Figures 19b-19d. In Figures 19b-19d, we observe that the estimated model uncertainty $\hat\sigma_f$ grows faster in horizontal directions (corresponding to changes in $x_1$) than in vertical directions. We further observe that the estimated model uncertainty $\hat\sigma_f$ is larger around $\{x_1 \approx 0\}$ than far away from this region. The magnitude of both effects increases from Figure 19b to Figure 19d. Both of these effects can also be observed for MC dropout (MCDO) and deep ensembles (DE) (see Figure 20b and Figure 20c).

Benchmarks: Visualization of D4 in 2D. In Figure 20, we present uncertainty plots of all benchmark methods.
We can see that deep ensembles (DE) give high preference to capturing D4, even though their estimated model uncertainty is still subject to some randomness, with non-uniform patterns for $x_1 \in [-0.25, 0.25]$ (Figure 20c). Moreover, MC dropout (MCDO) also captures higher model uncertainty for $x_1 \in [-0.25, 0.25]$, as desired by D4, but it does not fulfill D2 (Figure 20b). The Gaussian process (GP) with RBF kernel does not account for D4 (Figure 20a), which directly follows from its definition. Similarly to deep ensembles (DE), hyper deep ensembles (HDE) and HDE* strongly capture D4 but show even more random behaviour. This randomness is most prominently visible along $x_1 = 0$, where one should observe large model uncertainty, whereas their estimated model uncertainty is surprisingly close to 0.

Figure 19. Estimated model uncertainty $\hat\sigma_f$ of NOMU, visualizing D4: (a) same L2-regularization on the $\hat r_f$-network and $\hat f$-network (λ = 1e-8); (b) larger L2-regularization on the $\hat r_f$-network (λ = 1e-4) than on the $\hat f$-network (λ = 1e-8); (c) shallow $\hat r_f$-network consisting of 4 hidden nodes; (d) shallow $\hat r_f$-network consisting of 4 hidden nodes and larger regularization of λ = 1e-4 on the $\hat r_f$-network.

Figure 20. Estimated model uncertainty of (a) Gaussian process (GP) (c=15), (b) MC dropout (MCDO) (c=20), (c) deep ensembles (DE) (c=30), (d) hyper deep ensembles (HDE) (c=30), and (e) HDE* (c=30).

D.4.2. HOW DOES NOMU FULFILL D4 (METRIC LEARNING)?

Recall NOMU's architecture depicted in Figure 2 in the main paper. The $\hat r_f$-network learns the raw model uncertainty and is connected with the $\hat f$-network through the latter's last hidden layer (dashed lines in Figure 2). This connection enables $\hat r_f$ to re-use features that are important for the model prediction $\hat f$. This behaviour directly translates to $\hat\sigma_f$ (see Equation (2)), implementing Desideratum D4 (Metric Learning). In Figures 18 and 19, one can observe that NOMU fulfills D4 (Metric Learning). Moreover, Figure 19 shows how to control the strength of D4 (Metric Learning) by varying the L2-regularization and the number of neurons of the $\hat r_f$-architecture.

D.5. Desideratum D5 (Vanishing)

First, note that D3 (Out-of-Sample) already suggests D5 (Vanishing), since in the limit $n_{train} \to \infty$ every point x in the support of the input data generating distribution is arbitrarily close to infinitely many input training points $x^{train}$, and therefore arbitrarily close to the input training set. To the best of our knowledge, D5 (Vanishing) is fulfilled by most reasonable algorithms that estimate model uncertainty. Specifically, NOMU, GPs, DE and HDE capture D5 (Vanishing). This can be nicely observed in Figure 13, where all these algorithms result in zero model uncertainty in areas with many input training points (up to numerical precision). MC dropout can be seen as a variational algorithm for approximating BNNs.
While in theory BNNs should also fulfill D5 (Vanishing), MC dropout often struggles to do so, as can be observed in Figure 13 (see also Appendix D.2.2 for a discussion of why MC dropout struggles to approximate BNNs). Finally, D5 (Vanishing) is widely accepted (Kendall & Gal, 2017; Malinin & Gales, 2018) and most of the time loosely stated along the lines of: "While data noise uncertainty (aleatoric uncertainty) is irreducible, model uncertainty (epistemic uncertainty) vanishes with an increasing number of training observations." In other words, the width of credible bounds converges to zero, whilst the width of predictive bounds does not converge to zero in the presence of data noise. However, whilst such statements are qualitatively true, formally D5 (Vanishing) only holds in the limit of the number of training points $n_{train}$ to infinity, and for $x \in X$ that are in the support of the input data generating distribution. Furthermore, note that D3 (Out-of-Sample) also holds in the presence of data noise uncertainty. Moreover, note that while D1-D4 are statements on relative model uncertainty, i.e., statements that are independent of the calibration parameter $c \ge 0$ (see Appendix A.2.1), D5 (Vanishing) is a statement about absolute model uncertainty. Thus, D5 (Vanishing) only holds for a fixed calibration parameter $c \ge 0$ (if c increases sufficiently fast with increasing $n_{train}$, model uncertainty does not vanish).

D.5.1. WHY DOES D5 (VANISHING) ONLY HOLD IN THE LIMIT?

1. Even for fixed $c \ge 0$, in the case of large unknown data noise uncertainty that is simultaneously learned by the algorithm, adding another input training point x close to existing input training points $x^{train}$, whose corresponding target y is very far away from $y^{train}$, can lead to an increase in model uncertainty, since this new training point (x, y) would increase the predicted data noise uncertainty and thus increase the data-noise-induced model uncertainty.

2. Even if there is no data noise uncertainty ($\sigma_n \equiv 0$) and $c \ge 0$ is fixed, adding another input training point can increase the model uncertainty when D4 (Metric Learning) is fulfilled. To see this, consider the following scenario: an already observed set of training points suggests that f is very flat/simple/predictable (e.g., linear) in a certain region. However, adding a new training point (x, y) shows that f is much more irregular in this region than expected. Then the learned metric can drastically change, resulting in increased model uncertainty in this region (outside an ϵ-ball around the new input training point x).

D.5.2. HOW DOES NOMU FULFILL D5 (VANISHING)?

Recall that we train NOMU by minimizing

$$\mathcal{L}^{\pi}(\mathcal{NN}_\theta) + \lambda \|\theta\|_2^2, \qquad (83)$$

where the NOMU loss $\mathcal{L}^{\pi}(\mathcal{NN}_\theta)$ with terms (a), (b) and (c) is defined as in (77). Then, the following proposition holds:

Proposition D.5.a. Let $\lambda, \pi_{exp}, c_{exp}, \pi_{sqr} \in \mathbb{R}_{\ge 0}$ be fixed, let the activation functions of $\mathcal{NN}_\theta$ be Lipschitz-continuous, and let $\hat\sigma_f(x)$ be NOMU's model uncertainty prediction. Then, $\hat\sigma_f(x)$ converges in probability to $\ell_{min}$ for $n_{train} \to \infty$ for all input points x in the support of the input data generating distribution $P_X$, i.e., more formally,

$$\forall x \in \mathrm{supp}(P_X),\ \forall \epsilon > 0,\ \forall \delta > 0:\ \exists n_0:\ \forall n \ge n_0:\ P\big[|\hat\sigma_f(x) - \ell_{min}| \ge \epsilon\big] < \delta, \qquad (86)$$

where $\ell_{min} \ge 0$ is an arbitrarily small constant modelling a minimal model uncertainty, used for numerical reasons.

Proof. Let $x \in \mathrm{supp}(P_X)$, δ > 0 and ϵ > 0.
First, let $\mathcal{L}^{\pi}(0) := c < \infty$ denote the value of the loss function when inserting the constant zero function. Then, for the optimal solution $\mathcal{NN}_{\theta^*}$ of (83), it immediately follows that

$$\mathcal{L}^{\pi}(\mathcal{NN}_{\theta^*}) + \lambda \|\theta^*\|_2^2 \le \mathcal{L}^{\pi}(0) + \lambda \|0\|_2^2 = c. \qquad (87)$$

Using the fact that all activation functions φ of $\mathcal{NN}_\theta$ are Lipschitz-continuous, together with (87), one can show that there exists a constant $L := L(c, \lambda, \varphi, \text{architecture})$ such that the raw model uncertainty prediction $\hat r_f^{\theta^*}$ is Lipschitz-continuous with constant L. Next, let $U := U_{\frac{\epsilon}{4L}}(x)$ be an open ball with radius $\frac{\epsilon}{4L}$ around x. Given that the diameter of U is equal to $\frac{\epsilon}{2L}$ and that $\hat r_f^{\theta^*}$ is Lipschitz-continuous with constant L, it follows that

$$\max_{z \in U} \hat r_f^{\theta^*}(z) - \min_{z \in U} \hat r_f^{\theta^*}(z) < \frac{\epsilon}{2}. \qquad (88)$$

Given U, let $p := P[x' \in U]$ for $x' \sim P_X$. Since $x \in \mathrm{supp}(P_X)$, per definition it holds that p > 0. In the following, let $D^{train}_{n,x} \sim (P_X)^n$ denote the random variable representing a set of n input training points. Now, let $n_0$ be sufficiently large such that

$$P\left[\big|D^{train}_{n_0,x} \cap U\big| > \frac{4c}{\pi_{sqr}\, \epsilon^2}\right] > 1 - \delta. \qquad (89)$$

Note that one can explicitly calculate the value of $n_0$, since $|D^{train}_{n_0,x} \cap U| \in \mathbb{N}_0$ is binomially distributed with p > 0 and $n_0 \in \mathbb{N}$. Finally, we show that $\hat r_f^{\theta^*}(x) < \epsilon$ by contradiction. For this, assume on the contrary that

$$\hat r_f^{\theta^*}(x) \ge \epsilon. \qquad (90)$$

Using (88), it follows that $\hat r_f^{\theta^*}(z) > \frac{\epsilon}{2}$ for all $z \in D^{train}_{n_0,x} \cap U$, with probability larger than 1 - δ. This, together with the fact that each summand in term (b) of the NOMU loss $\mathcal{L}^{\pi}$ is non-negative, implies that

$$\pi_{sqr} \sum_{x^{train} \in D^{train}_{n_0,x} \cap U} \big(\hat r_f(x^{train})\big)^2 > \pi_{sqr}\, \frac{4c}{\pi_{sqr}\, \epsilon^2}\, \frac{\epsilon^2}{4} = c.$$

Putting everything together and using the fact that each term in the NOMU loss is non-negative implies that

$$\mathcal{L}^{\pi}(\mathcal{NN}_{\theta^*}) + \lambda \|\theta^*\|_2^2 \ge (b) > c = \mathcal{L}^{\pi}(0) + \lambda \|0\|_2^2, \qquad (91)$$

which is a contradiction to $\mathcal{NN}_{\theta^*}$ being optimal in (83). Therefore, we can conclude that $\hat r_f^{\theta^*}(x) < \epsilon$ with probability larger than 1 - δ. By the definition of $\hat\sigma_f$, this implies that $|\hat\sigma_f(x) - \ell_{min}| < \epsilon$ with probability larger than 1 - δ, which concludes the proof.

Note that, empirically, one can see in Figure 13 (in the areas with many input training points) how well NOMU fulfills D5 (Vanishing) in real-world settings. Furthermore, one can see that the statement only holds true, and is only desirable, for x in the support of the input data generating distribution $P_X$ (not in the gaps).

E. NOMU vs. Prior Networks

In this section, we highlight several differences between NOMU and the prior regression networks recently introduced in a working paper by Malinin et al. (2020a). While the high-level idea of introducing separate loss terms for the in-sample distribution and the out-of-distribution (OOD) distribution is related to NOMU, there are several important differences, which we discuss next:

1. The approach of Malinin et al. (2020a) focuses on estimating both model and data noise uncertainty. Thus, to properly compare it to NOMU, we consider their approach for known and negligible data noise uncertainty; e.g., for $\sigma_n = 10^{-10}$, we need to set, in their paper, $(L, \nu) = (I\, l^{-1},\, \frac{1}{\sigma_n}\, l)$ with $l \to \infty$, such that their corresponding model uncertainty prediction is given by $(\kappa(x)\Lambda(x))^{-1} \overset{l \to \infty}{=} \frac{\sigma_n}{\kappa(x)}\, I$. In the following, we consider for simplicity a one-dimensional output, i.e., $\hat\sigma_f = \frac{\sigma_n}{\kappa(x)}$.

2. They explicitly define a prior target distribution independent of $x \in X$, parametrized by $\kappa_0$ (model uncertainty) and $m_0$ (mean prediction) for OOD input points. Specifically, for OOD input points, their mean prediction $\hat f$ is pushed towards $m_0$. In many applications, the success of classical mean predictions of deep NNs is evident.
In none of these applications was there a term that pushed the mean prediction towards a fixed, predefined prior mean. Therefore, for NOMU we keep the mean prediction untouched.

3. Instead of our loss, their loss (derived from a reverse KL-divergence) is of the form

$$\underbrace{\sum_{i=1}^{n_{train}} \left( \big\|\hat f(x_i^{train}) - y_i^{train}\big\|^2 + \frac{1}{\kappa(x_i^{train})} \right)}_{\text{in-sample}} + \int \underbrace{\left( \kappa_0\, \frac{\big\|\hat f(x) - m_0\big\|^2}{2\,\sigma_n^2} + \frac{\kappa_0}{\kappa(x)} - \log\frac{\kappa_0}{\kappa(x)} - 1 \right)}_{\text{out-of-distribution}} d\mu_{OOD}(x), \qquad (92)$$

where $m_0, \kappa_0$ are the prior parameters for the mean and model uncertainty prediction and $\mu_{OOD}$ is an OOD measure. Specifically, they enforce zero model uncertainty at input training points via the linear term $\frac{1}{\kappa(x)}$, while we use a quadratic term. Moreover, they only use an OOD term and no out-of-sample (OOS) term (see Item 4 below).

4. In their loss formulation in (92), they only use an out-of-distribution (OOD) term, while we use an out-of-sample (OOS) term. By OOD they refer to input points only far away from the training data; e.g., in (Malinin et al., 2020a, Section 3), $\mu_{OOD}$ only has support far away from the convex hull of the input training points. Thus, they do not enforce model uncertainty in gaps between input training points. In contrast, by out-of-sample (OOS) we refer to a distribution with no mass on the input training points, i.e., we sample new input points that are not equal to the input training points but come from the same range (we recommend using a distribution similar to the data generating process). Therefore, our loss explicitly also enforces model uncertainty in gaps between input training points.

5. They use a different architecture and train only one NN. This implies that their mean prediction m(x) can be influenced in unwanted ways by the model uncertainty prediction $(\kappa(x)\Lambda(x))^{-1}$.

6. Their theoretical motivation substantially differs from ours: they only partially specify a prior belief by defining marginal distributions for f(x) for each input point $x \in X$, without specifying a joint prior distribution for f. However, given only marginals, no joint distribution (which is the crucial aspect when defining a prior in regression) can be derived without further information. (E.g., consider Gaussian processes (GPs); here, all one-dimensional marginal distributions are simply given by $\mathcal{N}(m(x), k(x, x))$. However, the crucial part is how to define $k(x, x')$, specifying the relation of x and x'. Only defining the marginals does not suffice to fully define GPs, leaving this most crucial part undefined.) For NOMU, in contrast, we give in Appendix A a theoretical connection to BNNs with a Gaussian prior on the weights. This induces a prior on the function space, i.e., a distribution over f, rather than separate marginal distributions over f(x) for each $x \in X$.

7. Parametrizing the model precision instead of the model uncertainty can have negative effects due to (implicit) regularization of NNs in the case of negligible or zero data noise. To obtain uncertainty in gaps between the input training points (small κ(x)) while having almost zero uncertainty at these input training points (κ(x) → ∞) would imply very high regularization costs for the function κ(x), and is thus very hard to learn for a NN. For NOMU, we therefore directly parameterize the model uncertainty (which is always finite) instead of the model precision (which should be infinite at noiseless training data points).

8. Our experimental results suggest that NOMU clearly outperforms DE in BO, whilst DE outperforms prior regression networks in their set of experiments.
F. NOMU vs. Neural Processes

In this section, we discuss the differences between neural processes (NPs), introduced by Garnelo et al. (2018a;b), and NOMU. Specifically, we explain in the following why NPs and NOMU solve very different problems in different settings.

For training NPs, one has to observe data from thousands of realizations $f_k$, sampled i.i.d. from the prior distribution (for each $f_k$, one observes $x_i$ and $f_k(x_i)$ to train the NP). This is mentioned repeatedly in Garnelo et al. (2018a;b), e.g., Garnelo et al. (2018a, Section 4.1): "We generate [...] datasets that consist of functions generated from a GP [...] At every training step we sample a curve from the GP [...]". For NOMU, we consider the very different task of estimating $p(y_{\mathrm{test}}|x_{\mathrm{test}}; D^{\mathrm{train}})$ based on a single dataset, i.e., one generated from a single realization $f = f_1$. For example, in the case of the Boston housing data set from Section 4.1.4, there is only one function $f = f_1$ involved, which maps a (multi-dimensional) input data point x corresponding to a house in Boston to its price f(x). For this data set, NPs would not be well suited, since it only contains data $(x_i, y_i) = (x_i, f(x_i) + \varepsilon_i)$ coming from this specific function f. One does not have access to data corresponding to another function $f_2$ sampled from the same prior distribution. The same is true for the other data sets we consider in this paper (e.g., for the solar irradiance data set, we only use the data visible in Figure 5, and NOMU does not have access to any data coming from other time series to make its predictions). Thus, NPs cannot be applied to the tasks considered in this paper.

Summary. NOMU, GP, MCDO, DE, and HDE are designed to be trained on data coming from one unknown function f, without having access to data from other functions $f_2, f_3, \ldots$. In contrast, NPs are designed to be trained on multiple data sets generated from multiple functions $f_1, f_2, f_3, \ldots$.

G. Aleatoric Neural Networks: Aleatoric vs. Epistemic Uncertainty

In this section, we discuss the classical approach of a NN with two outputs, one for the model prediction and another for aleatoric uncertainty, which is trained using the (scaled) Gaussian negative log-likelihood as introduced by Nix & Weigend (1994). We use the terms aleatoric uncertainty and data noise, as well as model uncertainty and epistemic uncertainty, interchangeably. In what follows, we call this method aleatoric neural network (ANN). Within this section, we show that such an ANN does not explicitly estimate model uncertainty (in contrast to all other benchmark methods discussed in this paper), i.e., when naively using the aleatoric uncertainty output ˆσn as ˆσf, the so-obtained ˆσf := ˆσn does not represent model uncertainty (epistemic uncertainty).³⁹

First, we give the definition of an ANN.

Definition G.1 (ANN). An ANN is a fully-connected feed-forward NN $\mathcal{NN}_{\mathrm{ANN}}: \mathbb{R}^d \to \mathbb{R} \times \mathbb{R}_+$ with two outputs: (i) the model prediction $\hat{f} \in \mathbb{R}$ and (ii) a data noise prediction $\hat{\sigma}_n \in \mathbb{R}_+$, trained for a given set of training points $D^{\mathrm{train}}$ using the following loss function:

$\mathcal{L}_{\mathrm{ANN}}(\mathcal{NN}_{\mathrm{ANN}}) := \sum_{(x,y) \in D^{\mathrm{train}}} \frac{\big(\hat{f}(x) - y\big)^2}{2\,(\hat{\sigma}_n(x))^2} + \ln\big(\hat{\sigma}_n(x)\big).$

The idea of ANN is to estimate the noise scale $\sigma_n(x) = \sqrt{\mathbb{V}[y\,|\,x, f(x)]}$.

³⁹The remainder of this section only targets readers who do not directly see that substituting ˆσf by ˆσn is an extremely bad idea.
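To make Definition G.1 concrete, here is a minimal numpy sketch of the ANN loss; the function name and array conventions are our own illustrative assumptions.

```python
import numpy as np

def ann_loss(f_hat: np.ndarray, sigma_n_hat: np.ndarray, y: np.ndarray) -> float:
    """Gaussian negative log-likelihood of Definition G.1 (constants dropped).

    f_hat       -- model predictions at the training inputs
    sigma_n_hat -- data noise outputs (> 0) at the training inputs
    y           -- observed training targets
    """
    return float(np.sum((f_hat - y) ** 2 / (2.0 * sigma_n_hat ** 2)
                        + np.log(sigma_n_hat)))
```

Minimizing this loss fits ˆσn(x) to the spread of the targets around ˆf(x), i.e., to the data noise. Nothing in the loss rewards expressing uncertainty far away from the training inputs, which is exactly the failure mode discussed next.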
Figure 21. Comparison of UBs resulting from ANN (c = 1) (where ˆσn is used as a substitute for ˆσf) vs. NOMU for the Forrester function (solid black line). For NOMU, we also show ˆσf as a dotted blue line. Training points are shown as black dots.

Figure 21 shows that, as we would expect, the trained ANN has learned the true σn ≈ 0 ≈ ˆσn quite precisely. However, it also becomes evident that the ANN does not learn any form of model uncertainty. Very far away from all observed training data points, the ANN does not express any uncertainty about its prediction (in Figure 22, to the right of x = 0.5, the predictions are very far away from the truth, but ˆσn does not capture this uncertainty). Therefore, the ANN's aleatoric uncertainty output ˆσn does not fulfill desideratum D3 (Out-of-Sample).

The problem with misusing ˆσn as a substitute for σf is not that ˆσn is too small (since we study relative uncertainty in this paper (see Appendix A.2.1), one can always scale the uncertainty by a factor c). Rather, Figure 22 shows that the scaled uncertainty also completely fails to capture the desiderata, i.e., the aleatoric uncertainty output ˆσn is almost constant across all input points x. Thus, the UBs of an ANN fulfill neither D2 (In-Sample) nor D3 (Out-of-Sample). In Figure 22, we can see that 5ˆσn is far too underconfident at input training points and at the same time far too overconfident far away from the observed input training points. This would result in very bad NLL scores on a test set. (Moreover, in noisy settings, i.e., σn ≠ 0, the aleatoric uncertainty output ˆσn of an ANN does not fulfill D5 (Vanishing) either: ˆσn should converge to σn, while ˆσf should converge to zero as the amount of training data increases.)

Furthermore, especially in Bayesian optimization (BO), cˆσn would be a terrible substitute for σf: maximizing the upper-UB acquisition function ˆf + cˆσn would be almost equivalent to maximizing ˆf, since cˆσn is almost constant due to the lack of D2 (In-Sample) and D3 (Out-of-Sample). If one wants to maximize the function in Figure 22 on [−1, 1], the next BO step would propose to query an input training point at the left boundary x = −1 (even for large c). However, one does not learn anything new from evaluating at x = −1, because this input training point has already been evaluated in a previous BO step. All subsequent BO steps would propose to query the same point x = −1 without exploring any other region of the input space, and one would never find the true maximum at x ≈ 1. In contrast, a reasonable model for estimating σf (such as NOMU) would directly (after scaling up c dynamically by a factor 2 as described in Appendix B.3.2) choose a point in the unexplored right region x ≈ 1, because the left side x ≈ −1 is already well explored (a small sketch of this degenerate acquisition behavior follows below).

Figure 22. Comparison of UBs resulting from ANN (c = 5) (where ˆσn is used as a substitute for ˆσf) vs. NOMU for the Forrester function (solid black line). For NOMU, we also show ˆσf as a dotted blue line. Training points are shown as black dots.

Overall, aleatoric uncertainty σn and epistemic uncertainty σf are two very different objects. Thus, an estimator ˆσn designed to estimate σn is usually a bad estimator for σf.
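The following toy numpy sketch illustrates the BO argument above; the surrogate outputs are mocked for illustration (no model is trained), and all concrete values are our own assumptions.

```python
import numpy as np

x_grid = np.linspace(-1.0, 1.0, 201)

# Mocked surrogate outputs: a mean prediction that is largest at the
# already-evaluated left boundary, and two uncertainty models.
f_hat = 1.0 - x_grid                    # mean prediction, maximal at x = -1
sigma_ann = np.full_like(x_grid, 0.05)  # ANN: nearly constant "uncertainty"
sigma_nomu = np.abs(x_grid + 1.0)       # NOMU-like: grows away from queried x = -1

c = 5.0
# With a nearly constant uncertainty, the upper-UB acquisition just re-ranks
# the mean prediction, so BO keeps querying the same boundary point.
print(x_grid[np.argmax(f_hat + c * sigma_ann)])   # -1.0: queries the same point again
print(x_grid[np.argmax(f_hat + c * sigma_nomu)])  #  1.0: explores the unexplored region
```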
H. Hyperparameter Sensitivity Analysis

In this section, we provide a sensitivity analysis with respect to NOMU's loss hyperparameters, i.e., πsqr, πexp, cexp, and Dart. First, we present a visual, qualitative analysis in 1D showing how each of these hyperparameters affects the shape of NOMU's UBs (Appendix H.1). Second, we present an extensive quantitative sensitivity analysis in the generative test-bed setting from Section 4.1.2, where, in addition to the loss hyperparameters, we also include the hyperparameters of the readout map, ℓmin and ℓmax, in our analysis (Appendix H.2).

H.1. Qualitative Sensitivity Analysis

In this section, we consider the setting of Section 4.1.1 and visualize the effect of increasing or decreasing each of NOMU's loss hyperparameters πsqr, πexp, cexp, and Dart on the example of the 1D Levy function. For reference, Figure 23 shows NOMU's UBs (with scaling factor c = 2) for the default loss hyperparameters πsqr = 0.1, πexp = 0.01, cexp = 30, and Dart = 128 that are used in Section 4.1.1 in the main paper.

Figure 23. NOMU's UBs (c = 2) for the generic loss hyperparameters from Section 4.1.1 in the main paper.

For each of Dart and cexp, we fit two additional NOMU models, where we ceteris paribus de- and increase the hyperparameter's default value by factors 1/s and s, respectively. The multiplicative factors πsqr and πexp we treat jointly in our sensitivity analysis: first, we show the effect of ceteris paribus de- and increasing the default value of the product πexpπsqr by factors sl = 0.001 and su = 10; second, we vary the ratio πexp/πsqr in the same fashion, with scaling factors sl = 1/16 and su = 16 (a short sketch of this scheme follows below). Within all of the experiments in this section, we sample the artificial input points Dart on an equidistant (deterministic) grid on [−1.1, 1.1]. This allows us to give a qualitative analysis of the hyperparameters' effects as follows.

Varying πexpπsqr with Scaling Factors of sl = 0.001 and su = 10. Decreasing πexpπsqr by decreasing both πexp and πsqr leads to more tubular bounds by relaxing the desiderata D2 (In-Sample) and D3 (Out-of-Sample). This can be seen in Figure 24: NOMU's blue dashed uncertainty (corresponding to small πexpπsqr) is larger at data points than the orange one of NOMU2 (corresponding to large πexpπsqr), and it is smaller further away from the training data points. NOMU's default hyperparameters are in a range where the loss is already at its limit⁴⁰ in enforcing the desiderata D2 (In-Sample) and D3 (Out-of-Sample). Therefore, further increasing πexpπsqr (while keeping their ratio and the other hyperparameters fixed) barely causes the UBs to change, as can be seen for the orange UBs in Figure 24. Increasing πexpπsqr too much can lead to numerical instabilities.

⁴⁰For NOMU's default parameters, the ratio πexpπsqr/λ is already very large, such that the explicit regularization via λ∥θ∥²₂ is almost negligible. Thus, the UBs are only prevented from having even larger curvature by implicit regularization, i.e., within a given number of epochs the training algorithm cannot reach a function with more curvature, because increasing the amplitude of the loss is (partially) compensated by the adaptivity of the ADAM algorithm. Only in ranges where πexpπsqr/λ is small enough for the explicit regularization to actually matter do the UBs become sensitive to the ratio πexpπsqr/λ. Then the regularization of λ∥θ∥²₂ keeps the curvature of ˆσf low, i.e., the UBs become more tubular.

Figure 24. NOMU's UBs (c = 2) for πsqr = 1e−4, πexp = 1e−5 (blue) and πsqr = 1, πexp = 0.1 (orange).
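As a minimal illustration of the ceteris paribus scheme described above (the helper function and the dictionary of defaults are our own construction, not code from the paper):

```python
# Default loss hyperparameters from Section 4.1.1.
defaults = {"pi_sqr": 0.1, "pi_exp": 0.01, "c_exp": 30, "d_art": 128}

def ceteris_paribus_variants(name: str, s: float) -> list[dict]:
    """Return two configs where only `name` is scaled by 1/s and s."""
    variants = []
    for factor in (1.0 / s, s):
        cfg = dict(defaults)
        cfg[name] *= factor
        variants.append(cfg)
    return variants

# E.g., the two additional models fitted for c_exp (s = 2) use c_exp = 15 and 60.
print([cfg["c_exp"] for cfg in ceteris_paribus_variants("c_exp", 2.0)])
```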
Varying πexp/πsqr with Scaling Factors of sl = 1/16 and su = 16. Increasing the ratio πexp/πsqr (while keeping their product and all other hyperparameters fixed) simply causes NOMU's UBs to widen uniformly across the entire domain. Indeed, in Figure 25, the orange UBs of NOMU2 (corresponding to large πexp/πsqr) are blown up and cover the blue UBs (corresponding to small πexp/πsqr) throughout the input space.

Figure 25. NOMU's UBs (c = 2) for πexp = 0.0025, πsqr = 0.4 (blue) and πexp = 0.04, πsqr = 0.025 (orange).

Varying cexp with a Scaling Factor of s = 2. Increasing the hyperparameter cexp causes the UBs to shrink in areas of large uncertainty and to widen in areas of small uncertainty. This effect is visualized in Figure 26: the orange dashed uncertainty line of NOMU2 (large cexp = 60) lies above the blue one of NOMU (small cexp = 15) at data points; and in regions of large uncertainty, the orange UBs (corresponding to large cexp) turn out to be narrower than the blue ones (corresponding to small cexp). Thus, increasing cexp causes NOMU's UBs to become more tubular.

Figure 26. NOMU's UBs (c = 2) for cexp = 15 (blue) and cexp = 60 (orange).

Varying Dart with a Scaling Factor of s = 8. Finally, we assess the effect of changing the number of artificial data points Dart used to approximate the integral (c) of NOMU's loss function defined in Equation (4). As expected, the UBs behave overall very similarly. However, for very small gaps in between input training points, Dart can have an influence on the estimated UBs. For example, in the gap between the training input point at x ≈ −0.77 and the one at x ≈ −0.73, the ˆσf obtained from the smaller Dart (blue) vanishes (because of a lack of artificial data points falling into this gap), while the ˆσf obtained from the larger Dart also estimates non-zero model uncertainty in this small gap.

Figure 27. NOMU's UBs (c = 2) for Dart = 16 (blue) and Dart = 1024 (orange).

Varying the Architecture and L2-Regularization. In Appendix D.4.1, we visualize how certain changes to the architecture and different L2-regularization parameters λ for different parts of the network influence NOMU's model uncertainty estimate ˆσf. In particular, we show in Appendix D.4.1 how the choice of the architecture and the L2-regularization determine the degree to which NOMU fulfills desideratum D4 (Metric Learning). Finally, in Appendix H.2, we empirically show that NOMU is robust with respect to its hyperparameters within a certain range.

H.2. Quantitative Sensitivity Analysis

We now present an extensive quantitative sensitivity analysis of NOMU's loss hyperparameters πsqr, πexp, cexp, and Dart in the generative test-bed setting (see Section 4.1.2 for details on this setting). Additionally, we also consider in this analysis the hyperparameters corresponding to the readout map, i.e., ℓmin and ℓmax. We decided to perform the quantitative sensitivity analysis in the generative test-bed setting, since it offers a particularly rich variety of different test functions and thus exposes each hyperparameter selection to hundreds of different test functions.
Setting. We use the same default hyperparameters as in Section 4.1.2. This includes NOMU's loss hyperparameters πsqr, πexp, cexp, and Dart, the hyperparameters of the readout map ℓmin and ℓmax, as well as all other hyperparameters. For the sensitivity analysis, we then vary only a single hyperparameter at a time, i.e., one of πsqr, πexp, cexp, Dart, ℓmin, and ℓmax, and set all other hyperparameters to their default values (i.e., we perform a ceteris paribus analysis as in Appendix H.1). In Table 15, we present for each considered hyperparameter a grid of five different values which we use to test its sensitivity. The NOMU column in Table 15 corresponds to NOMU's default hyperparameters used in the generative test-bed setting (Section 4.1.2). The columns NOMU1 to NOMU4 in Table 15 then correspond to deviations from these default hyperparameters.

Table 15. Grid selection for each hyperparameter (HP). The column NOMU corresponds to NOMU's default hyperparameters used in the generative test-bed setting. NOMU1 to NOMU4 correspond to deviations from these default hyperparameters. For Dart, d denotes the input dimension.

HP      NOMU1       NOMU2       NOMU      NOMU3        NOMU4
πsqr    0.01        0.02        0.1       0.5          1
πexp    0.001       0.002       0.01      0.05         0.1
cexp    10          20          30        45           90
Dart    100·d/4     100·d/2     100·d     2·(100·d)    4·(100·d)
ℓmin    0.01        0.05        0.1       0.15         0.20
ℓmax    0.50        0.75        1         2.0          4.0

Results. In Table 16, Table 17, and Table 18, we present for each of the ceteris paribus runs the average NLL values for input dimensions 1D, 2D, and 5D, respectively. Each cell in those tables represents a single hyperparameter selection, where we use NOMU's default hyperparameters except for the hyperparameter of the corresponding row, which we choose according to the cell's column; e.g., to obtain the result for the cell (πexp, NOMU3), we use the default NOMU hyperparameters from Section 4.1.2, except for πexp, which we set according to column NOMU3 in Table 15 to πexp := 0.05.

Overall, we can make the following three main observations:

1. The majority of all cells in Table 16, Table 17, and Table 18 are marked in grey. This shows that their corresponding hyperparameters lead to average NLL results which are statistically on par with the results obtained via NOMU's default hyperparameters in the NOMU columns. This highlights NOMU's robustness with respect to all considered hyperparameters within the chosen grids. Furthermore, it confirms our claim from the main paper that using generic hyperparameters for NOMU often works well without much hyperparameter tuning.

2. Since NOMU with the default hyperparameters already outperforms all other considered benchmark methods in this setting, i.e., each NOMU column represents the winning method among benchmarks (see Table 2), all grey-marked deviations of hyperparameters lead to results that outperform all other considered benchmark methods too. Moreover, all but one (the 5D cell (ℓmin, NOMU1)) of the NOMU models corresponding to cells which are not marked in grey, i.e., with hyperparameters that lead to statistically worse results than NOMU's default hyperparameters, are as good as or better than the best benchmark methods from Table 2.

3. By varying NOMU's hyperparameters, we can even obtain better results (i.e., with a smaller average NLL) than the ones reported in Table 2 of the main paper, e.g., in 1D with ℓmin := 0.01 the average NLL is −1.83 < −1.65. While these improvements are not statistically significant, these results still suggest that systematic hyperparameter tuning could improve the performance of NOMU even further. (A short sketch of the statistical on-par check reported in the following tables is given below.)
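As a minimal sketch of the kind of statistical comparison reported in Tables 16 to 18, the following assumes a normal-approximation 95% CI and treats two configurations as on par when their intervals overlap; the paper's exact procedure may differ.

```python
import numpy as np

def mean_and_ci95(nll_samples: np.ndarray) -> tuple[float, float]:
    """Average NLL and half-width of a normal-approximation 95% CI."""
    mean = nll_samples.mean()
    half_width = 1.96 * nll_samples.std(ddof=1) / np.sqrt(len(nll_samples))
    return mean, half_width

def on_par(m1, hw1, m2, hw2) -> bool:
    """Treat two configurations as statistically on par if their CIs overlap."""
    return abs(m1 - m2) <= hw1 + hw2

# E.g., with the 1D values from Table 16: default NOMU (-1.65 +/- 0.10)
# vs. the cexp = 90 cell (-1.11 +/- 0.13) -> not on par.
print(on_par(-1.65, 0.10, -1.11, 0.13))  # False
```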
Table 16. Sensitivity analysis for the 1D generative test-bed setting. We present for each hyperparameter (HP) and its five corresponding grid points the average NLL (without the constant ln(2π)/2) and a 95% CI over 200 BNN samples. Results which are statistically on par with NOMU's default HPs, i.e., the column NOMU, are marked in grey. Note that the best benchmark method for this experiment is GP with NLL = −1.08 ± 0.22 (see Table 2).

HP      NOMU1          NOMU2          NOMU           NOMU3          NOMU4
πsqr    −1.59 ± 0.11   −1.63 ± 0.10   −1.65 ± 0.10   −1.55 ± 0.13   −1.54 ± 0.14
πexp    −1.64 ± 0.09   −1.67 ± 0.09   −1.65 ± 0.10   −1.62 ± 0.10   −1.60 ± 0.10
cexp    −1.77 ± 0.12   −1.73 ± 0.10   −1.65 ± 0.10   −1.49 ± 0.12   −1.11 ± 0.13
Dart    −1.65 ± 0.09   −1.62 ± 0.11   −1.65 ± 0.10   −1.63 ± 0.12   −1.65 ± 0.10
ℓmin    −1.83 ± 0.09   −1.72 ± 0.10   −1.65 ± 0.10   −1.58 ± 0.10   −1.47 ± 0.12
ℓmax    −1.49 ± 0.12   −1.60 ± 0.10   −1.65 ± 0.10   −1.63 ± 0.16   −1.66 ± 0.14

Table 17. Sensitivity analysis for the 2D generative test-bed setting. We present for each hyperparameter (HP) and its five corresponding grid points the average NLL (without the constant ln(2π)/2) and a 95% CI over 200 BNN samples. Results which are statistically on par with NOMU's default HPs, i.e., the column NOMU, are marked in grey. Note that the best benchmark method for this experiment is DE with NLL = −0.77 ± 0.07 (see Table 2).

HP      NOMU1          NOMU2          NOMU           NOMU3          NOMU4
πsqr    −1.18 ± 0.04   −1.18 ± 0.04   −1.16 ± 0.05   −1.15 ± 0.04   −1.15 ± 0.04
πexp    −1.15 ± 0.04   −1.15 ± 0.05   −1.16 ± 0.05   −1.18 ± 0.04   −1.18 ± 0.04
cexp    −1.17 ± 0.04   −1.19 ± 0.04   −1.16 ± 0.05   −1.11 ± 0.05   −1.00 ± 0.05
Dart    −1.16 ± 0.05   −1.16 ± 0.05   −1.16 ± 0.05   −1.16 ± 0.04   −1.15 ± 0.05
ℓmin    −1.07 ± 0.05   −1.17 ± 0.04   −1.16 ± 0.05   −1.14 ± 0.05   −1.12 ± 0.05
ℓmax    −1.13 ± 0.05   −1.15 ± 0.05   −1.16 ± 0.05   −1.16 ± 0.04   −1.16 ± 0.04

Table 18. Sensitivity analysis for the 5D generative test-bed setting. We present for each hyperparameter (HP) and its five corresponding grid points the average NLL (without the constant ln(2π)/2) and a 95% CI over 200 BNN samples. Results which are statistically on par with NOMU's default HPs, i.e., the column NOMU, are marked in grey. Note that the best benchmark method for this experiment is GP with NLL = −0.33 ± 0.02 (see Table 2).

HP      NOMU1          NOMU2          NOMU           NOMU3          NOMU4
πsqr    −0.37 ± 0.03   −0.37 ± 0.02   −0.37 ± 0.02   −0.36 ± 0.02   −0.35 ± 0.02
πexp    −0.33 ± 0.02   −0.34 ± 0.02   −0.37 ± 0.02   −0.40 ± 0.02   −0.40 ± 0.02
cexp    −0.39 ± 0.02   −0.38 ± 0.02   −0.37 ± 0.02   −0.37 ± 0.02   −0.34 ± 0.05
Dart    −0.37 ± 0.02   −0.37 ± 0.02   −0.37 ± 0.02   −0.37 ± 0.02   −0.37 ± 0.02
ℓmin    −0.21 ± 0.03   −0.33 ± 0.02   −0.37 ± 0.02   −0.38 ± 0.02   −0.39 ± 0.02
ℓmax    −0.39 ± 0.02   −0.37 ± 0.02   −0.37 ± 0.02   −0.35 ± 0.02   −0.33 ± 0.02

I. Details on our Notation

In Section 2, we use a slightly overloaded notation, where the same symbol f denotes different mathematical objects: sometimes we use f for a function-valued random variable $F : (\Omega, \Sigma, \mathbb{P}) \to Y^X$, and sometimes we use f for the specific unknown ground truth function $f_{\mathrm{true}} := F(\omega)$ (i.e., $\forall x \in X: f_{\mathrm{true}}(x) = (F(\omega))(x)$). While we used this slightly overloaded notation for the sake of readability in the main paper, we now introduce our Bayesian uncertainty framework in its full mathematical detail.

In practice, there exists an unknown ground truth function $f_{\mathrm{true}}$. In the classical Bayesian paradigm, one assumes that everything unknown (i.e., here $f_{\mathrm{true}}$) was sampled from a prior distribution. Specifically, $f_{\mathrm{true}} := F(\omega)$ is a realization of a random variable $F : (\Omega, \Sigma, \mathbb{P}) \to Y^X$ distributed according to a prior belief, i.e., a prior distribution.
Using this notation, one can mathematically describe the full Bayesian data generating process in a rigorous way as follows: let $F : (\Omega, \Sigma, \mathbb{P}) \to Y^X$ be a function-valued random variable distributed according to a prior belief, i.e., a prior distribution. Moreover, let $(X_i, E_i)_{i \in \{1,\ldots,n_{\mathrm{train}}\}}$ denote the random variables representing the input data points and the corresponding data noise, i.e., a family of i.i.d. random variables independent of F, where each random variable $(X_i, E_i) : (\Omega, \Sigma, \mathbb{P}) \to X \times \mathbb{R}$ fulfills $\forall x \in X: (E_i \,|\, X_i = x) \sim \mathcal{N}(0, \sigma_n(x))$. Finally, let $Y_i$ be the random variable associated with the targets, which we define as $Y_i : (\Omega, \Sigma, \mathbb{P}) \to Y, \;\omega \mapsto Y_i(\omega) := F(\omega)(X_i(\omega)) + E_i(\omega)$.

With this notation in place, the objects used in the main paper in Section 2, $x_i = X_i(\omega)$, $y_i = Y_i(\omega)$, and $\varepsilon_i = E_i(\omega)$, are the values one actually observes in practice as training data, i.e., the realizations of the data generating process. Therefore, $\sigma_f(x)$ from Equation (1) should be interpreted for all $x \in X$ as

$\sqrt{\mathbb{V}\big[F(\cdot)(x) \,\big|\, \forall i \in \{1,\ldots,n_{\mathrm{train}}\}: (X_i, Y_i) = (x_i, y_i)\big]}, \quad (94)$

which, for all $x \in X$, is defined mathematically even more rigorously via the conditional variance as

$\sqrt{\mathbb{V}\big[F(\cdot)(x) \,\big|\, (X_i, Y_i)_{i \in \{1,\ldots,n_{\mathrm{train}}\}}\big](\omega)}. \quad (95)$

Throughout the paper, it should always be clear from context whether $x_i$ refers to the random variable $X_i$ or its realization $X_i(\omega)$; the same holds true for $y_i$, $\varepsilon_i$, and f. Importantly, note that in the setting considered in this paper, one only observes data coming from a single function $f = F(\omega)$ and does not have access to more functions $f_i$; i.e., we neither have access to other samples of the random variable F, nor do we consider multiple i.i.d. random variables $F_i$, in contrast to the setting considered for neural processes as described in Appendix F.
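To make this data generating process concrete, the following Python sketch produces one realization of the training data in exactly the order described above; the GP prior over F, the uniform input distribution, and the tiny noise scale are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 10

# 1) Sample f_true = F(omega) from the prior. As an illustrative stand-in for
#    a prior over functions, we draw one GP sample on a fine grid (RBF kernel).
grid = np.linspace(-1.0, 1.0, 400)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1 ** 2)
f_vals = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))
f_true = lambda x: np.interp(x, grid, f_vals)

# 2) Sample the inputs X_i ~ P_X, independent of F.
x = rng.uniform(-1.0, 1.0, size=n_train)

# 3) Sample the noise E_i | X_i = x ~ N(0, sigma_n(x)); here sigma_n is
#    negligible, matching the noiseless setting considered in the paper.
sigma_n = lambda x: 1e-10 * np.ones_like(x)
eps = rng.normal(0.0, sigma_n(x))

# 4) Observe the realizations (x_i, y_i), where Y_i = F(X_i) + E_i.
y = f_true(x) + eps
```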