# Activation-Level Uncertainty in Deep Neural Networks

Published as a conference paper at ICLR 2021

Pablo Morales-Álvarez, Department of Computer Science and AI, University of Granada, Spain (pablomorales@decsai.ugr.es)
Daniel Hernández-Lobato, Department of Computer Science, Universidad Autónoma de Madrid, Spain
Rafael Molina, Department of Computer Science and AI, University of Granada, Spain
José Miguel Hernández-Lobato, Department of Engineering, University of Cambridge, UK; Alan Turing Institute, London, UK

ABSTRACT

Current approaches for uncertainty estimation in deep learning often produce overconfident results. Bayesian Neural Networks (BNNs) model uncertainty in the space of weights, which is usually high-dimensional and limits the quality of variational approximations. The more recent functional BNNs (fBNNs) address this only partially because, although the prior is specified in the space of functions, the posterior approximation is still defined in terms of stochastic weights. In this work we propose to move uncertainty from the weights (which are deterministic) to the activation function. Specifically, the activations are modelled with simple 1D Gaussian Processes (GPs), for which a triangular kernel inspired by the ReLU non-linearity is explored. Our experiments show that activation-level stochasticity provides more reliable uncertainty estimates than BNN and fBNN, while performing competitively in standard prediction tasks. We also study the connection with deep GPs, both theoretically and empirically. More precisely, we show that activation-level uncertainty requires fewer inducing points and is better suited for deep architectures.

1 INTRODUCTION

Deep Neural Networks (DNNs) have achieved state-of-the-art performance in many different tasks, such as speech recognition (Hinton et al., 2012), natural language processing (Mikolov et al., 2013) or computer vision (Krizhevsky et al., 2012). In spite of their predictive power, DNNs are limited in terms of uncertainty estimation. This has been a classical concern in the field (MacKay, 1992; Hinton & Van Camp, 1993; Barber & Bishop, 1998), which has attracted a lot of attention in recent years (Lakshminarayanan et al., 2017; Guo et al., 2017; Sun et al., 2019; Wenzel et al., 2020). Indeed, the ability to know what is not known is essential for critical applications such as medical diagnosis (Esteva et al., 2017; Mobiny et al., 2019) or autonomous driving (Kendall & Gal, 2017; Gal, 2016).

Bayesian Neural Networks (BNNs) address this problem through a Bayesian treatment of the network weights¹ (MacKay, 1992; Neal, 1995). This will be referred to as weight-space stochasticity. However, dealing with uncertainty in weight space is challenging, since it contains many symmetries and is highly dimensional (Wenzel et al., 2020; Sun et al., 2019; Snoek et al., 2019; Fort et al., 2019). Here we focus on two specific limitations. First, it has been recently shown that BNNs with well-established inference methods such as Bayes by Backprop (BBP) (Blundell et al., 2015) and MC-Dropout (Gal & Ghahramani, 2016) underestimate the predictive uncertainty for instances located in-between two clusters of training points (Foong et al., 2020; 2019; Yao et al., 2019). Second, the weight-space prior does not allow BNNs to guide extrapolation to out-of-distribution (OOD) data (Sun et al., 2019; Nguyen et al., 2015; Ren et al., 2019).
Both aspects are illustrated graphically in Figure 3; more details are given in Section 3.1.

*Work developed mostly while visiting Cambridge University, UK.
¹ The bias term will be absorbed within the weights throughout the work.

Figure 1: Graphical representation of the artificial neurons for closely related methods. The subscript d and the superscript l refer to the d-th unit in the l-th layer, respectively. (a) In standard Neural Networks (NN), both the weights and the activation function are deterministic. (b) In Bayesian NNs, weights are stochastic and the activation is deterministic. (c) In auNN (this work), weights are deterministic and the activation is stochastic. (d) Deep GPs do not have a linear projection through weights, and the output is modelled directly with a GP defined on the $D_{l-1}$-dimensional input space.

As an alternative to standard BNNs, functional Bayesian Neural Networks (fBNNs) specify the prior and perform inference directly in function space (Sun et al., 2019). This provides a mechanism to guide the extrapolation in OOD data, e.g. predictions can be encouraged to revert to the prior in regions with no observed data. However, the posterior stochastic process is still defined by a factorized Gaussian on the network weights (i.e. as in BBP), see (Sun et al., 2019, Sect. 3.1). We will show that this makes fBNN inherit the problem of underestimating the predictive uncertainty for in-between data.

In this work, we adopt a different approach by moving stochasticity from the weights to the activation function, see Figure 1. This will be referred to as auNN (activation-level uncertainty for Neural Networks). The activation functions are modelled with (one-dimensional) GP priors, for which a triangular kernel inspired by the ReLU non-linearity (Nair & Hinton, 2010; Glorot et al., 2011) is used. Since non-linearities are typically simple functions (e.g. ReLU, sigmoid, tanh), our GPs are sparsified with few inducing points. The network weights are deterministic parameters which are estimated to maximize the marginal likelihood of the model. The motivation behind auNN is to avoid inference in the complex space of weights. We hypothesise that introducing stochasticity in the activation functions that follow the linear projections may be enough to provide sensible uncertainty estimates. We show that auNN obtains well-calibrated estimates for in-between data, and its prior allows extrapolation to OOD data to be guided by reverting to the empirical mean. This is visualized in a simple 1D example (Figure 3 and Table 1). Moreover, auNN obtains competitive performance on standard benchmarks, is scalable (datasets of up to ten million training points are used), and can be readily used for classification.

The use of GPs for the activations establishes an interesting connection with deep GPs (DGPs) (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017). The main difference is the linear projection before the GP, recall Figure 1(c-d). This allows auNN units to model simpler mappings between layers, which are defined along one direction of the input space, similarly to neural networks. By contrast, DGP units model more complex mappings defined on the whole input space, see also Figure 2a. We will show that auNN units require fewer inducing points and are better suited for deep architectures, achieving superior performance. A thorough discussion of additional related work is provided in Section 4.
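To make the comparison in Figure 1 concrete, the following is a minimal NumPy sketch of the four unit types. It is purely illustrative: prior samples stand in for trained posteriors, and the kernel, the helper functions and all names are assumptions of this sketch, not code from the paper.

```python
# A conceptual sketch (not the authors' code) of the four unit types in Figure 1.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, ell=1.0, gamma=1.0):
    """RBF kernel gamma * exp(-||a - b||^2 / (2 ell^2)) on batches of inputs."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return gamma * np.exp(-0.5 * d2 / ell**2)

def gp_prior_sample(inputs, kernel):
    """Draw one joint sample of a zero-mean GP prior evaluated at `inputs` (N x d)."""
    K = kernel(inputs, inputs) + 1e-8 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))

X = rng.standard_normal((5, 3))          # batch of N=5 inputs with D_{l-1}=3 features
w = rng.standard_normal(3)               # weights of one unit

# (a) Standard NN unit: deterministic weights, deterministic activation.
out_nn = np.maximum(X @ w, 0.0)          # ReLU(w^T x)

# (b) BNN unit: stochastic weights (e.g. factorized Gaussian), deterministic activation.
w_sample = w + 0.1 * rng.standard_normal(3)
out_bnn = np.maximum(X @ w_sample, 0.0)

# (c) auNN unit (this work): deterministic weights, stochastic activation
#     modelled by a 1D GP on the pre-activation a = w^T x.
a = (X @ w)[:, None]                     # N x 1 pre-activations
out_aunn = gp_prior_sample(a, rbf_kernel)

# (d) DGP unit: no linear projection; the GP is defined directly on the
#     D_{l-1}-dimensional input.
out_dgp = gp_prior_sample(X, rbf_kernel)

print(out_nn.shape, out_bnn.shape, out_aunn.shape, out_dgp.shape)
```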
In summary, the main contributions of this paper are: (1) a new approach to model uncertainty in DNNs, based on deterministic weights and simple stochastic non-linearities (in principle, not necessarily modelled by GPs); (2) the specific use of non-parametric GPs as a prior, including the triangular kernel inspired by the ReLU; (3) auNN addresses a well-known limitation of BNNs and fBNNs (uncertainty underestimation for in-between data), can guide the extrapolation to OOD data by reverting to the empirical mean, and is competitive in standard prediction tasks; (4) auNN units require fewer inducing points and are better suited for deep architectures than DGP ones, achieving superior performance.

2 PROBABILISTIC MODEL AND INFERENCE

Model specification. We focus on a supervised task (e.g. regression or classification) with training data² $\{\mathbf{x}_{n,:}, \mathbf{y}_{n,:}\}_{n=1}^N$. The graphical model in Figure 2b will be useful throughout this section.

² The output is represented as a vector since all the derivations apply to the multi-output case.

Figure 2: (a) Type of mappings modelled by DGP and auNN units (colours represent different values). Whereas DGP units describe complex functions defined on the whole $D_{l-1}$-dimensional input space, the linear projection through $\mathbf{w}_d^l$ in auNN yields simpler functions defined on just one direction. This is closer in spirit to NNs, requires fewer inducing points, and is better suited for deep architectures. The inducing points are shown in black (for auNN, these correspond to (hyper)planes in the input space before the projection). (b) Probabilistic graphical model for an auNN layer. Yellow variables are to be estimated (light ones through point estimates and the dark one through a posterior distribution). The box highlights the auxiliary variables (inducing points and their values). (c) Graphical representation of the UCI gap splits. In red, a segment that crosses the gap joining two training points from different components, which will be used in the experiments.

We assume a model of $L$ layers, each one with $D_l$ units as in Figure 1c. Each activation is modelled with a (1D) GP prior, i.e. $f_d^l(a_d^l) \sim \mathcal{GP}(\mu_d^l, k_d^l)$, with $\mu_d^l : \mathbb{R} \to \mathbb{R}$ and $k_d^l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. The GP hyperparameters $\theta_d^l$ will be omitted for clarity (for the kernels used here, $\theta_d^l$ includes the amplitude and the lengthscale). Assuming independence between units, each layer depends on the previous one as:

$$p(\mathbf{F}^l \mid \mathbf{F}^{l-1}, \mathbf{W}^l) = p(\mathbf{F}^l \mid \mathbf{A}^l) = \prod_{d=1}^{D_l} p(\mathbf{f}_d^l \mid \mathbf{a}_d^l), \qquad (1)$$

where $\mathbf{F}^l$ is the $N \times D_l$ matrix of outputs of the $l$-th layer for $N$ inputs, $\mathbf{W}^l$ is the $D_{l-1} \times D_l$ matrix of weights in that layer, and $\mathbf{A}^l$ is the $N \times D_l$ matrix of pre-activations, i.e. $\mathbf{A}^l = \mathbf{F}^{l-1} \mathbf{W}^l$. As usual, the columns and rows of $\mathbf{F}^l$ are denoted as $\mathbf{f}_d^l$ and $\mathbf{f}_{n,:}^l$, respectively (and analogously for the other matrices). Since the activation is defined by a GP, we have $p(\mathbf{f}_d^l \mid \mathbf{a}_d^l) = \mathcal{N}(\mathbf{f}_d^l \mid \boldsymbol{\mu}_d^l, \mathbf{K}_d^l)$, with $\boldsymbol{\mu}_d^l$ (resp. $\mathbf{K}_d^l$) the result of evaluating $\mu_d^l$ (resp. $k_d^l$) on $\mathbf{a}_d^l$ (that is, $\boldsymbol{\mu}_d^l$ is an $N$-dimensional vector and $\mathbf{K}_d^l$ is an $N \times N$ matrix). To fully specify the model, the output $\mathbf{Y}$ is defined from the last layer with a distribution that factorizes across data points, i.e. $p(\mathbf{Y} \mid \mathbf{F}^L) = \prod_{n=1}^N p(\mathbf{y}_{n,:} \mid \mathbf{f}_{n,:}^L)$. This formulation resembles that of DGPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017). The main difference is that we model $\mathbf{F}^l \mid \mathbf{F}^{l-1}$ through $D_l$ one-dimensional GPs evaluated on the pre-activations $\mathbf{A}^l$ (i.e.
the projections of $\mathbf{F}^{l-1}$ through $\mathbf{W}^l$), whereas DGPs use $D_l$ GPs of dimension $D_{l-1}$ evaluated directly on $\mathbf{F}^{l-1}$, recall Figure 1(c-d).

Variational Inference. Inference in the proposed model is intractable. To address this, we follow standard sparse variational GP approaches (Titsias, 2009; Hensman et al., 2013; 2015), similarly to the Doubly Stochastic Variational Inference (DSVI) for DGPs (Salimbeni & Deisenroth, 2017). Specifically, in each unit of each layer we introduce $M^l$ inducing values $\mathbf{u}_d^l$, which are the result of evaluating the GP on the one-dimensional inducing points $\mathbf{z}_d^l$. We naturally write $\mathbf{U}^l$ and $\mathbf{Z}^l$ for the corresponding $M^l \times D_l$ matrices associated to the $l$-th layer. Following eq. (1), the augmented model for one layer is

$$p(\mathbf{F}^l, \mathbf{U}^l \mid \mathbf{F}^{l-1}, \mathbf{W}^l, \mathbf{Z}^l) = p(\mathbf{F}^l \mid \mathbf{U}^l, \mathbf{A}^l, \mathbf{Z}^l)\, p(\mathbf{U}^l \mid \mathbf{Z}^l) = \prod_{d=1}^{D_l} p(\mathbf{f}_d^l \mid \mathbf{u}_d^l, \mathbf{a}_d^l, \mathbf{z}_d^l)\, p(\mathbf{u}_d^l \mid \mathbf{z}_d^l). \qquad (2)$$

Variational inference (VI) involves the approximation of the true posterior $p(\{\mathbf{F}^l, \mathbf{U}^l\}_l \mid \mathbf{Y})$. Following (Hensman et al., 2013; Salimbeni & Deisenroth, 2017), we propose a posterior given by $p(\mathbf{F} \mid \mathbf{U})$ and a parametric Gaussian on $\mathbf{U}$:

$$q(\{\mathbf{F}^l, \mathbf{U}^l\}_l) = \prod_{l=1}^{L} p(\mathbf{F}^l \mid \mathbf{U}^l, \mathbf{A}^l, \mathbf{Z}^l)\, q(\mathbf{U}^l) = \prod_{l=1}^{L} \prod_{d=1}^{D_l} p(\mathbf{f}_d^l \mid \mathbf{u}_d^l, \mathbf{a}_d^l, \mathbf{z}_d^l)\, q(\mathbf{u}_d^l), \qquad (3)$$

where $q(\mathbf{u}_d^l) = \mathcal{N}(\mathbf{u}_d^l \mid \mathbf{m}_d^l, \mathbf{S}_d^l)$, with $\mathbf{m}_d^l \in \mathbb{R}^{M^l}$ and $\mathbf{S}_d^l \in \mathbb{R}^{M^l \times M^l}$ variational parameters to be estimated. Minimizing the KL divergence between $q(\{\mathbf{F}^l, \mathbf{U}^l\}_l)$ and the true posterior is equivalent to maximizing the following evidence lower bound (ELBO):

$$\log p(\mathbf{Y} \mid \{\mathbf{W}^l, \mathbf{Z}^l\}_l) \geq \mathrm{ELBO} = \sum_{n=1}^{N} \mathbb{E}_{q(\mathbf{f}_{n,:}^L)}\!\left[\log p(\mathbf{y}_{n,:} \mid \mathbf{f}_{n,:}^L)\right] - \sum_{l=1}^{L}\sum_{d=1}^{D_l} \mathrm{KL}\!\left(q(\mathbf{u}_d^l)\,\|\,p(\mathbf{u}_d^l)\right). \qquad (4)$$

In the ELBO, the KL term can be computed in closed form, as both $q(\mathbf{u}_d^l)$ and $p(\mathbf{u}_d^l)$ are Gaussians. The log-likelihood term can be approximated by sampling from the marginal posterior $q(\mathbf{f}_{n,:}^L)$, which can be done efficiently through univariate Gaussians as in (Salimbeni & Deisenroth, 2017). Specifically, $\mathbf{U}^l$ can be analytically marginalized in eq. (3), which yields $q(\{\mathbf{F}^l\}_l) = \prod_l q(\mathbf{F}^l \mid \mathbf{F}^{l-1}, \mathbf{W}^l) = \prod_{l,d} \mathcal{N}(\mathbf{f}_d^l \mid \tilde{\boldsymbol{\mu}}_d^l, \tilde{\boldsymbol{\Sigma}}_d^l)$, with:

$$[\tilde{\boldsymbol{\mu}}_d^l]_i = \mu_d^l(a_{id}^l) + \boldsymbol{\alpha}_d^l(a_{id}^l)^\top \left(\mathbf{m}_d^l - \mu_d^l(\mathbf{z}_d^l)\right), \qquad (5)$$

$$[\tilde{\boldsymbol{\Sigma}}_d^l]_{ij} = k_d^l(a_{id}^l, a_{jd}^l) - \boldsymbol{\alpha}_d^l(a_{id}^l)^\top \left(k_d^l(\mathbf{z}_d^l) - \mathbf{S}_d^l\right) \boldsymbol{\alpha}_d^l(a_{jd}^l), \qquad (6)$$

where $\boldsymbol{\alpha}_d^l(x) = k_d^l(x, \mathbf{z}_d^l)\,[k_d^l(\mathbf{z}_d^l)]^{-1}$ and $\mathbf{a}_{n,:}^l = \mathbf{f}_{n,:}^{l-1}\mathbf{W}^l$. Importantly, the marginal posterior $q(\mathbf{f}_{n,:}^l)$ is a Gaussian that depends only on $\mathbf{a}_{n,:}^l$, which in turn only depends on $q(\mathbf{f}_{n,:}^{l-1})$. Therefore, sampling from $\mathbf{f}_{n,:}^l$ is straightforward using the reparametrization trick (Kingma & Welling, 2013):

$$\hat{f}_{nd}^l = [\tilde{\boldsymbol{\mu}}_d^l]_n + \varepsilon \cdot [\tilde{\boldsymbol{\Sigma}}_d^l]_{nn}^{1/2}, \quad \text{with } \varepsilon \sim \mathcal{N}(0, 1), \text{ and } \mathbf{f}_{n,:}^0 = \mathbf{x}_{n,:}. \qquad (7)$$

Training consists in maximizing the ELBO, eq. (4), w.r.t. the variational parameters $\{\mathbf{m}_d^l, \mathbf{S}_d^l\}$, the inducing points $\{\mathbf{z}_d^l\}$, and the model parameters (i.e. weights $\{\mathbf{w}_d^l\}$ and kernel parameters $\{\theta_d^l\}$). This can be done in mini-batches, allowing for scalability to very large datasets. The complexity to evaluate the ELBO is $\mathcal{O}(NM^2(D_1 + \dots + D_L))$, the same as DGPs with DSVI (Salimbeni & Deisenroth, 2017).³

³ As in (Salimbeni & Deisenroth, 2017), there exists also a cubic term $\mathcal{O}(M^3(D_1 + \dots + D_L))$ that is dominated by the former (since the batch size $N$ is typically larger than $M$). Moreover, in auNN we have the multiplication by weights, with complexity $\mathcal{O}(N D_{l-1} D_l)$ for each layer. This is also dominated by the former.

Predictions. Given a new $\mathbf{x}_{*,:}$, we want to compute⁴ $p(\mathbf{f}_{*,:}^L \mid \mathbf{X}, \mathbf{Y}) \approx \mathbb{E}_{q(\{\mathbf{U}^l\})}\!\left[p(\mathbf{f}_{*,:}^L \mid \{\mathbf{U}^l\})\right]$. As in (Salimbeni & Deisenroth, 2017), this can be approximated by sampling $S$ values up to the $(L-1)$-th layer with the same eq. (7), but starting with $\mathbf{x}_{*,:}$. Then, $p(\mathbf{f}_{*,:}^L \mid \mathbf{X}, \mathbf{Y})$ is given by the mixture of the $S$ Gaussian distributions obtained from eqs. (5)-(6).

⁴ The distribution $p(\mathbf{y}_{*,:}^L \mid \mathbf{X}, \mathbf{Y})$ is obtained as the expectation of the likelihood over $p(\mathbf{f}_{*,:}^L \mid \mathbf{X}, \mathbf{Y})$. A Gaussian likelihood is used for regression, and the Robust-Max (Hernández-Lobato et al., 2011) for classification.

Triangular kernel. One of the most popular kernels in GPs is the RBF (Williams & Rasmussen, 2006), which produces very smooth functions. However, the ReLU non-linearity led to a general boost in performance in DNNs (Nair & Hinton, 2010; Glorot et al., 2011), and we aim to model similar activations. Therefore, we introduce the use of the triangular (TRI) kernel. Just like the RBF, TRI is an isotropic kernel, i.e. it depends on the distance between the inputs, $k(x, y) = \gamma\, g(|x - y| / \ell)$, with $\gamma$ and $\ell$ the amplitude and lengthscale. For RBF, $g(t) = e^{-t^2/2}$. For TRI, $g(t) = \max(1 - t, 0)$. This is a valid kernel (Williams & Rasmussen, 2006, Section 4.2.1). Similarly to the ReLU, the functions modelled by TRI are piecewise linear, see Figure 6a in the main text and Figure 8 in Appendix C.
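The following is a minimal NumPy sketch of eqs. (5)-(7) for a single auNN unit, intended only to illustrate the mechanics described above (zero prior mean, marginal variances only, both kernels). The helper names, the toy initialisation and the plain matrix inverse are assumptions of this sketch, not the released implementation.

```python
# Sketch of one auNN unit: pre-activations a = X w, marginal posterior over f(a),
# and a reparametrized sample (eqs. (5)-(7)). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def k_rbf(a, b, gamma=1.0, ell=1.0):
    t = np.abs(a[:, None] - b[None, :]) / ell
    return gamma * np.exp(-0.5 * t**2)

def k_tri(a, b, gamma=1.0, ell=1.0):
    # Triangular kernel: gamma * max(1 - |a - b| / ell, 0); piecewise-linear samples.
    t = np.abs(a[:, None] - b[None, :]) / ell
    return gamma * np.maximum(1.0 - t, 0.0)

def unit_posterior_sample(a, z, m, S, kernel, jitter=1e-6):
    """Sample f(a) ~ q for one unit, marginalising the inducing values analytically."""
    Kzz = kernel(z, z) + jitter * np.eye(len(z))
    Kaz = kernel(a, z)
    alpha = Kaz @ np.linalg.inv(Kzz)                 # alpha(a_i) for every input
    mean = alpha @ m                                 # eq. (5), zero prior mean assumed
    # Marginal variances only (eq. (6) with i = j), as in the doubly stochastic scheme.
    var = kernel(a, a).diagonal() - np.einsum('im,mn,in->i', alpha, Kzz - S, alpha)
    var = np.maximum(var, 1e-10)
    eps = rng.standard_normal(len(a))                # reparametrization trick, eq. (7)
    return mean + eps * np.sqrt(var)

# Toy forward pass through one unit of the first hidden layer.
X = rng.standard_normal((8, 4))                      # N = 8 inputs, D_0 = 4
w = rng.standard_normal(4)                           # deterministic weights of the unit
a = X @ w                                            # pre-activations (1D!)

M = 5                                                # few inducing points suffice in 1D
z = np.linspace(a.min() - 0.1, a.max() + 0.1, M)     # inducing points
m = np.maximum(z, 0.0)                               # ReLU-like initial mean (TRI case)
S = 1e-2 * np.eye(M)                                 # small posterior covariance

f_sample = unit_posterior_sample(a, z, m, S, k_tri)
print(f_sample.shape)                                # (8,) -> one output per data point
```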
Comparison with DGP. The difference between auNN and DGP units is graphically illustrated in Figure 2a. Whereas DGP mappings from one layer to the next are complex functions defined on $D_{l-1}$ dimensions ($D_{l-1} = 2$ in the figure), auNN mappings are defined along just one direction via the weight projection. This is closer in spirit to NNs, whose mappings are also simpler and better suited for feature extraction and learning more abstract concepts. Moreover, since the GP is defined on a 1D space, auNN requires fewer inducing points than DGP (intuitively, they can be interpreted as inducing (hyper)planes in the $D_{l-1}$-dimensional space before the projection).

3 EXPERIMENTS

In this section, auNN is compared to BNN, fBNN (Sun et al., 2019) and DSVI DGP (Salimbeni & Deisenroth, 2017). BNNs are trained with BBP (Blundell et al., 2015), since auNN also leverages a simple VI-based inference approach. In each section we highlight the most relevant experimental aspects; all the details can be found in Appendix B. In the sequel, NLL stands for Negative Log Likelihood. Anonymized code for auNN is provided in the supplementary material, along with a script to run it on the 1D illustrative example of Section 3.1.

3.1 AN ILLUSTRATIVE EXAMPLE

Here we illustrate the two aspects highlighted in the introduction: the underestimation of predictive uncertainty for instances located in-between two clusters of training points, and the extrapolation to OOD data.

Figure 3: Predictive distribution (mean and one standard deviation) after training on a 1D dataset with two clusters of points. This simple example illustrates the main limitations of NN, BNN and fBNN, which are overcome by the novel auNN. See Table 1 for a summary and the text for details.

Table 1: Visual overview of conclusions from the 1D experiment in Figure 3. This shows that NN, BNN, fBNN and auNN increasingly expand their capabilities.

|      | Epistemic uncertainty | Reverts to the mean | In-between uncertainty |
|------|-----------------------|---------------------|------------------------|
| NN   | ✗                     | ✗                   | ✗                      |
| BNN  | ✓                     | ✗                   | ✗                      |
| fBNN | ✓                     | ✓                   | ✗                      |
| auNN | ✓                     | ✓                   | ✓                      |
Figure 3 shows the predictive distribution of NN, BNN, fBNN and auNN (with RBF and TRI kernels) after training on a simple 1D dataset with two clusters of points. All the methods have one hidden layer with 25 units, and 5 inducing points are used for auNN.

In Figure 3, the deterministic nature of NNs prevents them from providing epistemic uncertainty (i.e. the uncertainty originating from the model (Kendall & Gal, 2017)). Moreover, there is no prior to guide the extrapolation to OOD data. BNNs provide epistemic uncertainty. However, the prior in the complex space of weights does not allow for guiding the extrapolation to OOD data (e.g. by reverting to the empirical mean). Moreover, note that BNNs underestimate the predictive uncertainty in the region between the two clusters, where there is no observed data (this region is usually called the gap). More specifically, as shown in (Foong et al., 2020), the predictive uncertainty for data points in the gap is limited by that at the extremes. By specifying the prior in function space, fBNN can induce properties in the output, such as reverting to the empirical mean for OOD data through a zero-mean GP prior. However, the underestimation of in-between uncertainty persists, since the posterior stochastic process for fBNN is based on a weight-space factorized Gaussian (as BNN with BBP), see (Sun et al., 2019, Section 3.1) for details. Finally, auNN (either with RBF or TRI kernel) addresses both aspects through the novel activation-level modelling of uncertainty, which utilizes a zero-mean GP prior for the activations. Table 1 summarizes the main characteristics of each method. Next, a more comprehensive experiment with deeper architectures and more complex multidimensional datasets is provided.

3.2 UCI REGRESSION DATASETS WITH GAP SPLITS

Standard splits are not appropriate to evaluate the quality of uncertainty estimates for in-between data, since both train and test sets may cover the space equally. This motivated the introduction of gap splits (Foong et al., 2019). Namely, a dataset with D dimensions admits D such train-test partitions: for each dimension, sort the points according to its value and select the middle 1/3 for test (and the outer 2/3 for training), see Figure 2c. With these partitions, overconfident predictions for data points in the gap manifest as very high values of test negative log likelihood.

Using the gap splits, it was recently shown that BNNs yield overconfident predictions for in-between data (Foong et al., 2019). The authors highlight the case of the Energy and Naval datasets, where BNNs fail catastrophically. Figure 4a reproduces these results for BNNs and confirms that fBNNs also obtain overconfident predictions, as theoretically expected. However, notice that activation-level stochasticity performs better, especially through the triangular kernel, which dramatically improves the results (see the plot scale). Figure 4b confirms that the difference is due to the underestimation of uncertainty, since the predictive performance in terms of RMSE is on a similar scale for all the methods. In all cases, D = 50 hidden units are used, and auNN uses M = 10 inducing points.

To further understand the intuition behind the different results, Figure 5 shows the predictive distribution over a segment that crosses the gap, recall Figure 2c. We observe that activation-level approaches obtain more sensible (less confident) uncertainties in the gap, where there is no observed data. For instance, BNN and fBNN predictions in Naval are unjustifiably overconfident, since the output in that dataset ranges from 0.95 to 1.
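For concreteness, here is a minimal sketch of the gap-split construction described in this section (illustrative function names; not the authors' preprocessing code).

```python
# Gap splits: for each input dimension, sort by that dimension and hold out the
# middle third as the test set.
import numpy as np

def gap_split(X, dim):
    """Return (train, test) indices for the gap split along dimension `dim`."""
    order = np.argsort(X[:, dim])
    n = len(order)
    lo, hi = n // 3, 2 * n // 3
    test_idx = order[lo:hi]                                  # middle 1/3 -> the "gap"
    train_idx = np.concatenate([order[:lo], order[hi:]])     # outer 2/3
    return train_idx, test_idx

# Example: a D-dimensional dataset admits D such partitions.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
splits = [gap_split(X, d) for d in range(X.shape[1])]
print([len(te) for _, te in splits])                         # each test set holds ~1/3 of the points
```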
Also, to illustrate the internal mechanism of auNN, Figure 6a shows one example of the activations learned with each kernel. Although it is just one example, it allows the different nature to be visualised: smoother for RBF and piecewise linear for TRI. All the activations for a particular network and for both kernels are shown in Appendix C (Figure 8).

Figure 4: Test NLL (a) and RMSE (b) for the gap splits in the Energy and Naval datasets (mean and one standard error, the lower the better). Activation-level uncertainty, especially through the triangular kernel, avoids the dramatic failure of BNN and fBNN in terms of NLL (see the scale). The similar values in RMSE reveal that this failure actually comes from an extremely overconfident estimation by BNN and fBNN, see also Figure 5.

Figure 5: Predictive distribution (mean and one standard deviation) over a segment that crosses the gap, joining two training points from different connected components. auNN avoids overconfident predictions by allocating more uncertainty in the gap, where there is no observed data.

In addition to the paradigmatic cases of Energy and Naval illustrated here, four more datasets are included in Appendix C. Figure 7 there is analogous to Figure 4 here, and Tables 4 and 5 there show the full numeric results and ranks. We observe that auNN, especially through the triangular kernel, obtains the best results and does not fail catastrophically on any dataset (unlike BNN and fBNN, which do on Energy and Naval). Finally, the performance on the gap splits is complemented by that on standard splits, see Tables 6 and 7 in Appendix C. This shows that, in addition to the enhanced uncertainty estimation, auNN is a competitive alternative in general practice.

3.3 COMPARISON WITH DGPS

As explained in Section 2, the choice of a GP prior for activation stochasticity establishes a strong connection with DGPs. The main difference is that auNN performs a linear projection from $D_{l-1}$ to $D_l$ dimensions before applying $D_l$ 1D GPs, whereas DGPs define $D_l$ GPs directly on the $D_{l-1}$-dimensional space. This means that auNN units are simpler than those of DGP, recall Figure 2a. Here we show two practical implications of this.

First, it is reasonable to hypothesise that DGP units may require a higher number of inducing points M than auNN, since they need to cover a multi-dimensional input space. By contrast, auNN may require a higher number of hidden units D, since these are simpler. Importantly, the computational cost is not symmetric in M and D, but significantly cheaper in D, recall Section 2. Figure 6b shows the performance of auNN and DGP for different values of M and D on the UCI Kin8 dataset (with one hidden layer; depth will be analyzed next). As expected, note the different influence of M and D: whereas auNN improves by rows (i.e. as D grows), DGP does so by columns (i.e. as M grows)⁵. The next section (Section 3.4) will show that this makes auNN faster than DGP in practice.
An analogous figure for RMSE and the full numeric results are in Appendix C (Figure 9 and Tables 9-10).

Second, the simpler auNN units might be better suited for deeper architectures. Figure 6c shows the performance on the UCI Power dataset when depth is additionally considered. It can be observed that auNN is able to take greater advantage of depth, which translates into better overall performance. Moreover, the aforementioned different influence of D and M on DGP and auNN is also confirmed here. The results on RMSE are similar, see Figure 10 and Tables 11-12 in Appendix C.

⁵ Interestingly, the fact that DGP is not greatly influenced by D can be appreciated in its recommended value in the original work (Salimbeni & Deisenroth, 2017). They set D = min(30, D_0), where D_0 is the input dimension. This limits D to a maximum value of 30.

Figure 6: (a) One example of an activation function (mean and standard deviation) learned by auNN with each kernel. The RBF one is smoother, whereas the TRI one is piecewise linear, inspired by the ReLU. Black dots represent (the mean of) the inducing point values. Green dots are the locations of input data when propagated to the corresponding unit. (b) Test NLL of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units). The lower the better. The results are the average over five independent runs with different splits. Whereas DGP improves by columns (i.e. with M), auNN does so by rows (i.e. with D). This is as hypothesized, and is convenient from a scalability viewpoint. (c) Test NLL with increasing depth (L = 2, 3, 4). This supports that auNN might benefit more than DGP from deeper networks. Moreover, the aforementioned different influence of M and D on DGP and auNN is confirmed here.

Finally, it may be argued that auNN closely resembles a DGP with an additive kernel (Duvenaud et al., 2011; Durrande et al., 2011) (DGP-add hereafter). Recall that an additive kernel models functions that decompose as $f(\mathbf{x}) = f_1(x_1) + \dots + f_D(x_D)$. Therefore, the model for $\mathbf{a}^{l+1} \mid \mathbf{a}^l$ in auNN is very similar to that of $\mathbf{f}^{l+1} \mid \mathbf{f}^l$ in DGP-add, see Figure 11 in Appendix C. Specifically, in both cases, the input ($\mathbf{a}^l$ in auNN, $\mathbf{f}^l$ in DGP-add) goes through 1D GPs and these are then aggregated (a linear combination through $\mathbf{W}$ in auNN, a summation in DGP-add) to yield the output ($\mathbf{a}^{l+1}$ in auNN, $\mathbf{f}^{l+1}$ in DGP-add). However, there exists a key difference. In auNN, all the nodes in the (l+1)-th layer (i.e. $a_i^{l+1}$) aggregate a shared set of distinct functions (namely, $f_i^l$), each node using its own weights to aggregate them. In DGP-add, there is no such shared set of functions, and each node in the (l+1)-th layer (i.e. $f_i^{l+1}$) aggregates a different set of GP realizations (i.e. the unlabelled blue nodes in Figure 11c). This subtle theoretical difference has empirical implications, since many more functions need to be learned for DGP-add. Indeed, Figures 12 and 13 in Appendix C compare the performance of DGP-add and auNN-RBF (the experimental setting is analogous to that of Figure 6c)⁶. We observe that the results obtained by DGP-add are worse than those obtained by auNN-RBF, probably due to the larger number of functions that need to be learned in DGP-add.

⁶ For a fair comparison, here we use auNN-RBF (and not TRI), because DGP-add leverages an RBF kernel.
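To make the counting argument above concrete, here is a small, purely illustrative sketch of how many 1D functions an auNN layer shares across output nodes versus how many a DGP-add layer needs. Random piecewise-linear maps stand in for GP samples, both models are applied to the same toy pre-activations for comparability, and all names are assumptions, not code from the paper.

```python
# auNN layer: D_l shared 1D functions, combined by each output node with its own weights.
# DGP-add layer: every output node aggregates its OWN set of D_l 1D functions.
import numpy as np

rng = np.random.default_rng(0)
D_l, D_next, N = 4, 3, 6                          # widths of layers l and l+1, batch size
a_l = rng.standard_normal((N, D_l))               # toy pre-activations of layer l

def draw_1d_function():
    """Stand-in for a 1D GP sample (a random piecewise-linear map)."""
    s_pos, s_neg = rng.standard_normal(2)
    return lambda t: np.where(t > 0, s_pos * t, s_neg * t)

# auNN: one 1D function per unit of layer l (D_l in total), shared by every node of
# layer l+1, which combines them through its own column of weights.
f_shared = [draw_1d_function() for _ in range(D_l)]
W_next = rng.standard_normal((D_l, D_next))
F_l = np.stack([f_shared[d](a_l[:, d]) for d in range(D_l)], axis=1)
a_next_aunn = F_l @ W_next                        # only D_l functions learned in this layer

# DGP-add: each of the D_next output nodes sums a separate set of D_l 1D functions.
f_per_node = [[draw_1d_function() for _ in range(D_l)] for _ in range(D_next)]
a_next_add = np.stack(
    [sum(f_per_node[i][d](a_l[:, d]) for d in range(D_l)) for i in range(D_next)],
    axis=1,
)                                                 # D_l * D_next functions learned

print("auNN functions per layer:", D_l)
print("DGP-add functions per layer:", D_l * D_next)
print(a_next_aunn.shape, a_next_add.shape)        # both (N, D_next)
```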
3.4 CLASSIFICATION, SCALABILITY, AND ADDITIONAL METRICS

So far, we have experimented with small to medium regression datasets, and uncertainty estimation has been measured through the (negative) log likelihood and the visual inspection of the predictive distribution (Figures 3 and 5). Here we focus on two large scale classification datasets (up to 10^7 instances), and report additional metrics that account for uncertainty calibration. We use the well-known particle physics binary classification datasets HIGGS (N = 11M, D = 28) and SUSY (N = 5M, D = 18) (Baldi et al., 2014). We consider DGP as a baseline, as it obtained state-of-the-art results on these datasets (Salimbeni & Deisenroth, 2017). For all the methods, we consider a Robust-Max classification likelihood (Hernández-Lobato et al., 2011).

The metrics used are the Brier score (Gneiting & Raftery, 2007) and the Expected Calibration Error (ECE) (Guo et al., 2017). The former is a proper scoring rule that measures the accuracy of probabilistic predictions for categorical variables. In practice, it is computed as the mean squared difference between the vector of predicted class probabilities and the one-hot encoding of the actual class. The latter measures miscalibration as the difference in expectation between confidence and accuracy. This is done by partitioning the predictions into M equally spaced bins and taking a weighted average of the bins' accuracy/confidence difference, see (Guo et al., 2017, Eq. (3)) for details.

Table 2: Brier score and expected calibration error (ECE) for auNN and DGP on the large scale classification datasets HIGGS and SUSY (the lower the better for both metrics). The standard error (over three splits) is close to zero in all cases, see Table 13 in Appendix C.

|       |       | N   | D  | RBF-2  | RBF-3  | RBF-4  | TRI-2  | TRI-3  | TRI-4  | DGP-2  | DGP-3  | DGP-4  |
|-------|-------|-----|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Brier | HIGGS | 11M | 28 | 0.3363 | 0.3159 | 0.3098 | 0.3369 | 0.3172 | 0.3118 | 0.4527 | 0.4399 | 0.4378 |
| Brier | SUSY  | 5M  | 18 | 0.2746 | 0.2739 | 0.2737 | 0.2749 | 0.2742 | 0.2738 | 0.3815 | 0.3816 | 0.3804 |
| ECE   | HIGGS | 11M | 28 | 0.2196 | 0.2383 | 0.2427 | 0.2198 | 0.2390 | 0.2397 | 0.4352 | 0.4303 | 0.4251 |
| ECE   | SUSY  | 5M  | 18 | 0.3453 | 0.3496 | 0.3504 | 0.3462 | 0.3485 | 0.3465 | 0.5304 | 0.5291 | 0.5273 |

Table 3: Average training time per batch over 50 independent runs (in seconds). The standard error is low in all cases, see Table 14 in Appendix C.

|       | RBF-2  | RBF-3  | RBF-4  | TRI-2  | TRI-3  | TRI-4  | DGP-2  | DGP-3  | DGP-4  |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| HIGGS | 0.0962 | 0.1607 | 0.2259 | 0.0922 | 0.1647 | 0.2308 | 0.1918 | 0.3102 | 0.3930 |
| SUSY  | 0.0926 | 0.1564 | 0.2245 | 0.0923 | 0.1563 | 0.2265 | 0.1430 | 0.2129 | 0.2771 |

Table 2 shows the Brier score and ECE for auNN and DGP for different values of L (depth). We observe that auNN outperforms DGP on both metrics, achieving superior uncertainty estimation. Both TRI and RBF kernels obtain similar results for auNN. Notice that the Brier score generally improves with the network depth, whereas the performance in ECE decreases with depth. Interestingly, this behavior was also observed for standard NNs (Guo et al., 2017, Figure 2a). Finally, as was theoretically justified in Section 2, auNN can scale up to very large datasets (HIGGS has more than 10^7 training instances).
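For reference, here is a minimal NumPy sketch of the two metrics as described above. It follows one common convention (the exact normalisation used for the paper's tables is not stated here), so treat it as illustrative rather than the paper's evaluation code.

```python
# Brier score and Expected Calibration Error (ECE) for generic multi-class predictions.
import numpy as np

def brier_score(probs, labels):
    """Mean squared difference between predicted class probabilities (N x C)
    and the one-hot encoding of the true labels (N,)."""
    n, c = probs.shape
    one_hot = np.eye(c)[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average over equally spaced confidence bins of
    |accuracy(bin) - confidence(bin)|, in the spirit of Guo et al. (2017), Eq. (3)."""
    conf = probs.max(axis=1)                 # confidence of the predicted class
    pred = probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(acc[mask].mean() - conf[mask].mean())
    return ece

# Toy usage with random "predictions".
rng = np.random.default_rng(0)
logits = rng.standard_normal((1000, 2))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 2, size=1000)
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```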
Regarding the practical computational cost, Table 3 shows the average training time per batch for both auNN and DGP on the previous datasets. Although the theoretical complexity is analogous for both methods (recall Section 2), the experiments in Figures 6b-c showed that DGP requires larger values of M, whereas auNN needs larger D⁷. Since the computational cost is not symmetric in M and D, but significantly cheaper in the latter (recall Section 2), auNN is faster than DGP in practice.

⁷ In this section, both DGP and auNN are trained with one hidden layer and their optimal configuration according to the previous experiment: large M for DGP (M = 100; D is set as recommended by the authors, i.e. D = min(30, D_0)), and large D for auNN (D = 50; M is set to the intermediate value of M = 25).

4 RELATED WORK

Activation-level uncertainty is introduced here as an alternative to weight-space stochasticity. The expressiveness of the latter has been analyzed in the recent work of Wenzel et al. (2020), where the authors advocate a modified BNN objective. Alternatively, different prior specifications are studied in (Hafner et al., 2020; Pearce et al., 2019; Flam-Shepherd et al., 2017), in addition to the fBNN discussed here (Sun et al., 2019). However, none of these works considers stochasticity in the activations.

Since we present a straightforward use of VI for auNN, in this work we have compared empirically with the well-known VI-based BBP for BNNs. Yet, we expect auNN to benefit from independent inference refinements like those proposed over the last years for BNNs. For instance, natural-gradient VI allows for leveraging techniques such as Batch Norm or data augmentation (Osawa et al., 2019), and the information contained in the SGD trajectory can be exploited as well (Maddox et al., 2019). Also, getting rid of the gradient variance through deterministic approximate moments has provided enhanced results in BNNs (Wu et al., 2019).

A key aspect of auNN is the modelling of the activation function. This element of neural nets has been analyzed before. For instance, self-normalizing neural nets (Klambauer et al., 2017) induce the normalization that is explicitly performed in related approaches such as Batch Norm (Ioffe & Szegedy, 2015) and weight and layer normalization (Salimans & Kingma, 2016; Ba et al., 2016). Learnable deterministic activations have been explored too, e.g. (He et al., 2015; Agostinelli et al., 2014). However, as opposed to auNN, in all these cases the activations are deterministic.

Probabilistic neural networks such as Natural-Parameter Networks (NPN) (Wang et al., 2016) propagate probability distributions through layers of transformations. Therefore, the values of the activations are also described by probability distributions (specifically, the exponential family is used in NPN). Fast dropout training (Wang & Manning, 2013) and certain variants of NPNs can also be viewed in this way (Shekhovtsov & Flach, 2018; Postels et al., 2019). However, in auNN the activations are themselves modelled as stochastic learnable components that follow a GP prior. Along with the deterministic weights, this provides a conceptually different approach to model uncertainty.

A very preliminary study on GP-based activation functions is proposed in (Urban & van der Smagt, 2018). However, the method is not empirically evaluated, no connection with deep GPs is provided, and the inference approach is limited. Namely, the output of each unit is approximated with a Gaussian whose mean and covariance are computed in closed form, as was done in (Bui et al., 2016) for DGPs.
However, this is only tractable for the RBF kernel (in particular, it cannot leverage the more convenient TRI kernel studied here), and the Gaussian approximation typically yields worse results than the Monte Carlo approximation to the ELBO used here (indeed, DSVI (Salimbeni & Deisenroth, 2017) substantially improved the results for DGPs compared to (Bui et al., 2016)).

5 CONCLUSIONS AND FUTURE WORK

We proposed a novel approach for uncertainty estimation in neural network architectures. Whereas previous methods are mostly based on a Bayesian treatment of the weights, here we move the stochasticity to the activation functions, which are modelled with simple 1D GPs and a triangular kernel inspired by the ReLU. Our experiments show that the proposed method obtains better calibrated uncertainty estimates and is competitive in standard prediction tasks. Moreover, the connection with deep GPs is analyzed. Namely, our approach requires fewer inducing points and is better suited for deep architectures, achieving superior performance. We hope this work raises interest in alternative approaches to model uncertainty in neural networks.

One of the main directions of future research is to understand in depth the properties induced by each of the kernels considered here (i.e. the triangular one and the RBF). In particular, it would be interesting to automatically learn the optimal kernel for each unit in a probabilistic way. Also, the use of a GP prior for the activation function may hamper the scalability of auNN to wider and/or deeper networks. In these cases, the GP-based activation model could be substituted by a simpler Bayesian parametric one. This would allow for a cheaper modelling of uncertainty within the activations. Finally, since only the activation function is modified, important deep learning elements such as convolutional layers can still be incorporated.

ACKNOWLEDGEMENTS

This work was supported by the Agencia Estatal de Investigación of the Spanish Ministerio de Ciencia e Innovación under contract PID2019-105142RB-C22/AEI/10.13039/501100011033, and the Spanish Ministerio de Economía, Industria y Competitividad under contract DPI2016-77869C2-2-R. DHL acknowledges support from the Spanish Ministerio de Ciencia e Innovación (projects TIN2016-76406-P and PID2019-106827GB-I00/AEI/10.13039/501100011033). PMA was funded by La Caixa Banking Foundation (ID 100010434, Barcelona, Spain) through La Caixa Fellowship for Doctoral Studies LCF/BQ/ES17/11600011, and the University of Granada through the program Proyectos de Investigación Precompetitivos para Jóvenes Investigadores del Plan Propio 2019 (ref. PPJIB2019-03).

REFERENCES

F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
J.L. Ba, J.R. Kiros, and G.E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
D. Barber and C.M. Bishop. Ensemble learning for multi-layer networks. In Advances in neural information processing systems, pp. 395-401, 1998.
C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pp. 1613-1622, 2015.
T. Bui, D. Hernández-Lobato, J.M. Hernández-Lobato, Y. Li, and R.E. Turner.
Deep Gaussian processes for regression using approximate expectation propagation. In International conference on machine learning, pp. 1472 1481, 2016. A. Damianou and N.D. Lawrence. Deep Gaussian processes. In International conference on artificial intelligence and statistics, pp. 207 215, 2013. A.G. De G. Matthews, M. Van Der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. Gpflow: A Gaussian process library using tensorflow. The Journal of Machine Learning Research, 18(1):1299 1304, 2017. N. Durrande, D. Ginsbourger, and O. Roustant. Additive kernels for gaussian process modeling. ar Xiv preprint ar Xiv:1103.4023, 2011. D.K. Duvenaud, H. Nickisch, and C.E. Rasmussen. Additive Gaussian processes. In Advances in neural information processing systems, pp. 226 234, 2011. A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115 118, 2017. D. Flam-Shepherd, J. Requeima, and D. Duvenaud. Mapping Gaussian process priors to Bayesian neural networks. In NIPS Bayesian deep learning workshop, 2017. A.Y.K. Foong, Y. Li, J.M. Hernández-Lobato, and R.E. Turner. in-between uncertainty in Bayesian neural networks. ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning, 2019. A.Y.K. Foong, D.R. Burt, Y. Li, and R.E. Turner. On the expressiveness of approximate inference in Bayesian neural networks. In Advances in neural information processing systems, 2020. S. Fort, H. Hu, and B. Lakshminarayanan. Deep ensembles: A loss landscape perspective. ar Xiv preprint ar Xiv:1912.02757, 2019. Y. Gal. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050 1059, 2016. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, pp. 249 256, 2010. X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International conference on artificial intelligence and statistics, pp. 315 323, 2011. T. Gneiting and A.E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. C. Guo, G. Pleiss, Y. Sun, and K.Q. Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp. 1321 1330, 2017. D. Hafner, D. Tran, T. Lillicrap, A. Irpan, and J. Davidson. Noise contrastive priors for functional uncertainty. In Uncertainty in Artificial Intelligence, pp. 905 914, 2020. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. Published as a conference paper at ICLR 2021 J. Hensman, N. Fusi, and N.D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pp. 282 290, 2013. J. Hensman, A.G. De G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In International conference on artificial intelligence and statistics, pp. 351 360, 2015. D. Hernández-Lobato, J.M. Hernández-Lobato, and P. Dupont. Robust multi-class Gaussian process classification. 
In Advances in neural information processing systems, pp. 280 288, 2011. G.E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5 13, 1993. G.E. Hinton, L. Deng, D. Yu, G.E. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.R. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6): 82 97, 2012. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 456, 2015. A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574 5584, 2017. D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. D.P. Kingma and M. Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971 980, 2017. A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097 1105, 2012. B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402 6413, 2017. D.J.C. Mac Kay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448 472, 1992. W.J. Maddox, P. Izmailov, T. Garipov, D.P. Vetrov, and A.G. Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in neural information processing systems, pp. 13132 13143, 2019. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ar Xiv preprint ar Xiv:1301.3781, 2013. A. Mobiny, A. Singh, and H. Van Nguyen. Risk-aware machine learning classifier for skin lesion diagnosis. Journal of clinical medicine, 8(8):1241, 2019. V. Nair and G.E. Hinton. Rectified linear units improve restricted boltzmann machines. In International conference on machine learning, pp. 807 814, 2010. R.M. Neal. Bayesian Learning for Neural Networks. Ph D thesis, University of Toronto, 1995. A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427 436, 2015. Published as a conference paper at ICLR 2021 K. Osawa, S. Swaroop, M.E.E. Khan, A. Jain, R. Eschenhagen, R.E. Turner, and R. Yokota. Practical deep learning with Bayesian principles. In Advances in neural information processing systems, pp. 4289 4301, 2019. T. Pearce, R. Tsuchida, M. Zaki, A. Brintrup, and A. Neely. Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. In Uncertainty in Artificial Intelligence, 2019. J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari. Sampling-free epistemic uncertainty estimation using approximated variance propagation. In Proceedings of the IEEE international conference on computer vision, pp. 2931 2940, 2019. J. Ren, P.J. Liu, E. Fertig, J. Snoek, R. 
Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in neural information processing systems, pp. 14680 14691, 2019. T. Salimans and D.P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in neural information processing systems, pp. 901 909, 2016. H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in neural information processing systems, pp. 4588 4599, 2017. A. Shekhovtsov and B. Flach. Feed-forward propagation in probabilistic neural networks with categorical and max layers. In International conference on learning representations, 2018. J. Shi, S. Sun, and J. Zhu. A spectral approach to gradient estimation for implicit distributions. In International conference on machine learning, pp. 4644 4653, 2018. J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado. Can you trust your model s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in neural information processing systems, pp. 13969 13980, 2019. S. Sun, G. Zhang, J. Shi, and R. Grosse. Functional variational Bayesian neural networks. In International conference on learning representations, 2019. M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International conference on artificial intelligence and statistics, pp. 567 574, 2009. S. Urban and P. van der Smagt. Gaussian process neurons. https://openreview.net/ forum?id=By-Iif ZRW, 2018. Accessed: 2020-05-15. H. Wang, X. Shi, and D.Y. Yeung. Natural-parameter networks: A class of probabilistic neural networks. Advances in neural information processing systems, pp. 118 126, 2016. S. Wang and C. Manning. Fast dropout training. In International conference on machine learning, pp. 118 126, 2013. F. Wenzel, K. Roth, B.S. Veeling, J. Swi atkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin. How good is the bayes posterior in deep neural networks really? ar Xiv preprint ar Xiv:2002.02405, 2020. C.K.I. Williams and C.E. Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006. A. Wu, S. Nowozin, E. Meeds, R.E. Turner, Hernández-Lobato J.M., and A.L. Gaunt. Deterministic variational inference for robust bayesian neural networks. In International Conference on Learning Representations, 2019. J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez. Quality of uncertainty quantification for bayesian neural network inference. ar Xiv preprint ar Xiv:1906.09686, 2019. Published as a conference paper at ICLR 2021 A PRACTICAL SPECIFICATIONS FOR AUNN Whitening transformation for q(ul d). The proposed parametric posterior for each unit is given by the Gaussian q(ul d) = N(ul d|ml d, Sl d). The GP prior on ul d is p(ul d) = N(ul d|µl d, Kl d), with µl d = µl d(zl d) and Kl d = kl d(zl d, zl d). For numerical stability and to reduce the amount of operations, we use a white representation for q(ul d), as is common practice in (D)GPs (De G. Matthews et al., 2017; Salimbeni & Deisenroth, 2017). That is, we consider the variable vl d N( ml d, Sl d), with ul d = µl d + (Kl d)1/2vl d. Specifically, in the code the variable ml d is denoted as q_mu, and Sl d is represented through its Cholesky factorization ( Sl d)1/2, which is named q_sqrt. Initialization of the variational parameters {ml d}. 
These are the mean of the posterior distribution on the inducing points. Therefore, their value determines the initialization of the activation function. If the RBF kernel is used, {ml d} are initialized to the prior µl d = µl d(zl d) (since we are using the aforementioned white representation, q_mu is initialized to zero). This is the most standard initialization in GP literature. For the TRI kernel, {ml d} are initialized according to the Re Lu which TRI is inspired by, i.e. ml d = Re Lu(zl d). Initialization of the variational parameters {Sl d}. The posterior distribution covariance matrices are initialized to the prior Kl d = kl d(zl d, zl d) (that is, q_sqrt is initialized to the identity matrix). Following common practise for DGPs (Salimbeni & Deisenroth, 2017), the covariance matrices of inner layers are scaled by 10 5. Initialization of the weights. The Glorot uniform initializer (Glorot & Bengio, 2010), also called Xavier uniform initializer, is used for the weights. The biases are initialized to zero. Initialization of the kernel hyperparameters. The kernels used (RBF and TRI) have two hyperparameters: the variance γ and the lengthscale ℓ. Both are always initialized to 1 (except for the lengthscale in the 1D example in Section 3.1, where ℓis initialized to 0.1). Initialization of the inducing points. In order to initialize zl d, the N input data points are propagated through the network with the aforementioned initial weights, biases, and activation function. Then, in each layer and unit, zl d is initialized with a linspace between the minimum and maximum of the N values there (the minimum (resp. the maximum) is decreased (resp. increased) by 0.1 to strictly contain the interval of interest). Initialization of the regression likelihood noise. In the regression problems, we use a Gaussian likelihood p(y|f) = N(y|f, σ2). The standard deviation of the noise is initialized to σ = 0.1. Mean function. We always use a zero mean function. Since data is normalized to have zero mean (and standard deviation equal to one), a zero mean function allows for reverting to the empirical mean for OOD data, as explained in the main text. Optimizer and learning rate. Throughout the work, we use the Adam Optimizer (Kingma & Ba, 2014) with default parameters and learning rate of 0.001. B EXPERIMENTAL DETAILS FOR THE EXPERIMENTS All the experiments were run on a NVIDIA Tesla P100. In order to predict, all the methods utilize 100 test samples in all the experiments. Details for each section are provided below. An illustrative example (Section 3.1 in the main text). All the methods use two layers (i.e. one hidden layer). The hidden layer has D = 25 units in all cases. BNN and f BNN use Re Lu activations. The au NN methods use M = 10 inducing points in each unit (the rest of methods do not have such inducing points). The methods are trained during 5000 epochs with the whole dataset (no mini-batches). The dataset is synthetically generated to have two clusters of points around x = 1. More specifically, 30 points are sampled uniformly in each interval (x 0.3, x + 0.3) for x = 1, and the output is given by the sin function plus a Gaussian noise of standard deviation 0.1. We have also trained DGP and GP on this dataset, see Figure 14. Both methods use M = 10 inducing points, and are trained during 5000 epochs with the whole dataset (no mini-batches). DGP has one one hidden layer with D = 25 units. 
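The inducing-point initialisation described above can be sketched as follows. This is an assumption-laden reading of Appendix A (e.g. ReLU is used as the stand-in initial activation for the forward propagation, and the helper names are illustrative), not the released code.

```python
# Propagate the training inputs with the initial weights and an initial activation,
# then place a linspace over the observed range of pre-activations in each unit,
# extended by 0.1 on both sides.
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_inducing_points(X, layer_widths, M):
    """Return a list with one (M, D_l) array of inducing points per layer."""
    Z, F = [], X
    for D_prev, D_l in zip([X.shape[1]] + layer_widths[:-1], layer_widths):
        W = glorot_uniform(D_prev, D_l)            # biases initialised to zero
        A = F @ W                                  # pre-activations of this layer
        z_layer = np.stack(
            [np.linspace(A[:, d].min() - 0.1, A[:, d].max() + 0.1, M)
             for d in range(D_l)], axis=1)
        Z.append(z_layer)
        F = np.maximum(A, 0.0)                     # ReLU as the stand-in initial activation
    return Z

X = rng.standard_normal((200, 8))
Z = init_inducing_points(X, layer_widths=[50, 50, 1], M=10)
print([z.shape for z in Z])                        # [(10, 50), (10, 50), (10, 1)]
```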
Published as a conference paper at ICLR 2021 4 6 au NN-TRI-3 au NN-TRI-2 au NN-RBF-3 au NN-RBF-2 f BNN-3 f BNN-2 BNN-3 BNN-2 Test neg log lik 0 1000 2000 4 6 8 au NN-TRI-3 au NN-TRI-2 au NN-RBF-3 au NN-RBF-2 f BNN-3 f BNN-2 BNN-3 BNN-2 0.7 0.8 0.9 Figure 7: Performance of the compared methods in the gap splits for six UCI datasets. Mean and one standard error of NLL (upper row) and RMSE (lower row) are shown, the lower the better. UCI regression datasets with gap (and standard) splits (Section 3.2 in the main text). The methods use L = 2, 3 layers. In all cases, the hidden layers have D = 50 units. BNN and f BNN use Re Lu activations. The methods are trained during 10000 epochs, with a mini-batch size that depends on the size of the dataset. For those with fewer than 5000 instances (i.e. Boston, Concrete, Energy, Wine and Yacht), the mini-batch size is 500. For those with more than 5000 (i.e. Naval), the mini-batch size is 5000. Recall from the main text that each dataset has as many gap splits as dimensionality, with 2/3 for train and 1/3 for test. In the case of standard splits, each dataset uses 10 random 90%-10% train-test splits. Regarding the segment used in Figure 5, each extreme of the segment is a point from a different connected component of the training set. These are chosen so that the function is well-known in the extremes (but not along the segment, which crosses the gap). Namely, the extremes are chosen as the training points who have minimum average distance to the closest five points in its connected component. Comparison with DGPs (Section 3.3 in the main text). Here, different values of depth L, number of inducing points M and number of hidden layers D are studied (see the main text). au NN is trained during 5000 epochs, with a mini-batch size of 5000 (20000 epochs are used for DGP, as proposed by the authors (Salimbeni & Deisenroth, 2017)). Each experiment is repeated on five random 90%-10% train-test splits. DGP uses a RBF kernel. The experimental details for DGP-add are the same as for DGP, with the only difference of the kernel. Namely, an additive kernel using RBF components is used for DGP-add. Large scale experiments (Section 3.4 in the main text). Since we are dealing with classification datasets, a Robust-Max likelihood is used in all cases (Hernández-Lobato et al., 2011). The values of D and M are chosen following the conclusions from Section 3.3. That is, DGP needs large M (the largest M = 100 is used), but is less influenced by D (this is chosen as recommended by the authors (Salimbeni & Deisenroth, 2017): D = min(30, D0), with D0 the dimensionality of the input data). au NN needs large D (the largest D = 50 is used), but is less influenced by M (the intermediate value M = 25 is chosen). All the methods are trained during 100 epochs, with a mini-batch size of 5000. Three random train-test splits are used. In both datasets, 500000 instances are used for test (which leaves 10.5M and 4.5M training instances for HIGGS and SUSY, respectively). C ADDITIONAL FIGURES AND TABLES Finally, additional material is provided here. Every figure and table is referred from the main text. Published as a conference paper at ICLR 2021 Table 4: Test NLL for the gap splits of the six UCI datasets (mean and one standard error, the lower the better). Last column is the per-group (weight-space stochasticity vs activation-level stochasticity) average rank. 
Boston Concrete Energy Naval Wine Yacht Rank Rank (group) BNN-2 3.29 0.10 3.58 0.09 114.84 70.69 2186.30 464.32 0.96 0.01 1.54 0.09 3.92 0.79 4.83 0.32 BNN-3 3.54 0.03 4.23 0.04 30.91 19.97 618.44 147.99 0.98 0.02 4.10 0.03 4.98 0.70 f BNN-2 3.67 0.25 4.60 0.39 111.65 69.68 1050.65 192.61 2.80 0.31 1.77 0.12 5.04 0.36 f BNN-3 3.69 0.24 4.49 0.34 93.92 56.45 1060.54 247.21 198.76 30.24 1.47 0.15 5.36 0.50 au NN-RBF-2 5.19 0.47 4.27 0.26 39.93 20.89 379.55 67.74 1.44 0.05 1.68 0.35 4.69 0.61 4.17 0.40 au NN-RBF-3 5.68 0.75 5.54 0.40 50.48 28.26 352.94 72.13 16.05 1.13 1.28 0.23 5.29 0.89 au NN-TRI-2 2.77 0.06 3.45 0.06 3.99 1.14 30.47 5.54 1.06 0.03 2.34 0.03 3.25 0.57 au NN-TRI-3 2.70 0.04 3.39 0.06 5.50 2.45 2.38 3.23 1.23 0.04 2.68 0.30 3.47 0.80 Table 5: Test RMSE for the gap splits of the six UCI datasets (mean and one standard error, the lower the better). Last column is the per-group (weight-space stochasticity vs activation-level stochasticity) average rank. Boston Concrete Energy Naval Wine Yacht Rank Rank (group) BNN-2 6.54 0.56 7.62 0.35 4.23 1.91 0.03 0.00 0.63 0.01 1.18 0.11 4.09 0.67 4.91 0.37 BNN-3 7.77 0.40 16.33 0.67 5.27 1.41 0.02 0.00 0.64 0.01 14.31 0.76 6.15 0.91 f BNN-2 3.75 0.21 7.58 0.41 3.95 1.82 0.03 0.00 0.78 0.02 1.25 0.08 4.70 0.54 f BNN-3 3.81 0.20 7.52 0.36 4.48 1.79 0.03 0.00 0.87 0.04 1.13 0.12 4.71 0.53 au NN-RBF-2 4.90 0.47 7.81 0.47 3.41 1.46 0.03 0.00 0.72 0.01 0.99 0.18 4.32 0.47 4.09 0.23 au NN-RBF-3 4.27 0.29 7.74 0.21 2.72 1.03 0.03 0.00 0.82 0.01 1.03 0.14 4.27 0.55 au NN-TRI-2 4.01 0.30 7.44 0.38 2.72 0.79 0.02 0.00 0.67 0.01 1.51 0.20 3.90 0.33 au NN-TRI-3 3.78 0.19 7.03 0.23 3.36 1.23 0.02 0.00 0.68 0.01 3.80 2.41 3.85 0.46 Table 6: Test NLL for the standard splits of the six UCI datasets (mean and one standard error, the lower the better). Last column is the per-group (weight-space stochasticity vs activation-level stochasticity) average rank. test NLL Boston Concrete Energy Naval Wine Yacht Rank Rank (group) BNN-2 2.71 0.07 3.12 0.02 0.65 0.04 -5.38 0.59 0.99 0.02 1.01 0.07 3.78 0.41 4.5 0.39 BNN-3 3.62 0.05 4.24 0.01 0.80 0.03 -5.02 0.33 1.01 0.02 4.06 0.05 6.25 0.70 f BNN-2 2.83 0.20 3.20 0.04 0.67 0.04 -6.17 0.02 1.55 0.08 0.77 0.02 4.13 0.57 f BNN-3 2.75 0.14 3.13 0.05 0.65 0.03 -6.26 0.00 207.43 9.12 0.79 0.02 3.83 0.85 au NN-RBF-2 3.38 0.30 3.14 0.05 0.63 0.03 -5.40 0.08 1.16 0.06 0.52 0.04 3.97 0.60 4.5 0.43 au NN-RBF-3 3.89 0.47 3.25 0.13 0.53 0.07 -5.69 0.03 8.98 1.51 0.54 0.03 4.42 0.85 au NN-TRI-2 2.56 0.05 3.08 0.02 1.47 0.04 -4.81 0.07 0.96 0.03 2.25 0.02 4.78 0.92 au NN-TRI-3 2.50 0.02 2.98 0.02 1.42 0.02 -3.43 0.32 1.10 0.07 2.26 0.01 4.83 1.01 Table 7: Test RMSE for the standard splits of the six UCI datasets (mean and one standard error, the lower the better). Last column is the per-group (weight-space stochasticity vs activation-level stochasticity) average rank. 
Table 7: Test RMSE for the standard splits of the six UCI datasets (mean and one standard error, the lower the better). The last column is the per-group (weight-space stochasticity vs activation-level stochasticity) average rank.

Boston | Concrete | Energy | Naval | Wine | Yacht | Rank | Rank (group)
BNN-2 | 3.47 ± 0.34 | 5.49 ± 0.13 | 0.45 ± 0.02 | 0.00 ± 0.00 | 0.65 ± 0.01 | 0.68 ± 0.08 | 4.70 ± 0.48 | 4.59 ± 0.41
BNN-3 | 8.89 ± 0.45 | 16.71 ± 0.20 | 0.51 ± 0.02 | 0.00 ± 0.00 | 0.67 ± 0.02 | 13.49 ± 0.94 | 6.50 ± 0.64
fBNN-2 | 2.80 ± 0.21 | 5.34 ± 0.13 | 0.47 ± 0.02 | 0.00 ± 0.00 | 0.70 ± 0.02 | 0.33 ± 0.04 | 3.70 ± 0.61
fBNN-3 | 2.74 ± 0.16 | 5.07 ± 0.12 | 0.46 ± 0.02 | 0.00 ± 0.00 | 0.83 ± 0.02 | 0.36 ± 0.04 | 3.45 ± 0.88
auNN-RBF-2 | 3.16 ± 0.23 | 5.13 ± 0.16 | 0.45 ± 0.02 | 0.00 ± 0.00 | 0.67 ± 0.02 | 0.41 ± 0.04 | 4.25 ± 0.35 | 4.41 ± 0.41
auNN-RBF-3 | 3.01 ± 0.25 | 4.51 ± 0.18 | 0.41 ± 0.03 | 0.00 ± 0.00 | 0.76 ± 0.02 | 0.38 ± 0.03 | 3.35 ± 0.77
auNN-TRI-2 | 3.00 ± 0.26 | 5.21 ± 0.10 | 0.72 ± 0.02 | 0.00 ± 0.00 | 0.62 ± 0.02 | 1.15 ± 0.14 | 5.40 ± 0.80
auNN-TRI-3 | 2.81 ± 0.17 | 4.67 ± 0.15 | 0.65 ± 0.03 | 0.01 ± 0.00 | 0.62 ± 0.02 | 1.16 ± 0.15 | 4.65 ± 1.00

Table 8: Standard error obtained by auNN and DGP in three splits of the large-scale classification datasets HIGGS and SUSY.

N | D | RBF-2 | RBF-3 | RBF-4 | TRI-2 | TRI-3 | TRI-4 | DGP-2 | DGP-3 | DGP-4
HIGGS | 11M | 28 | 0.0001 | 0.0006 | 0.0007 | 0.0003 | 0.0004 | 0.0008 | 0.0005 | 0.0009 | 0.0010
SUSY | 5M | 18 | 0.0004 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0004 | 0.0005 | 0.0027 | 0.0035

Figure 8 (panels: auNN-RBF and auNN-TRI; columns: Layer 1, Layer 2, Layer 3): A complete example of the activation functions learned by auNN with RBF and TRI kernels. These were obtained for the Energy dataset with the first gap split, using three layers, 10 hidden units per (hidden) layer, and 5 inducing points in each unit. Whereas auNN-RBF learns smoother activations, those of auNN-TRI are piecewise linear, inspired by the ReLU. Notice that auNN allows units to switch off if they are not required. Black dots represent the five inducing points in each unit. Green points are the locations of the input data when propagated to the corresponding unit.

Figure 9 (x-axis: M = 5, 10, 25, 50, 75, 100): Test RMSE of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units). Results are the average over 5 independent runs on the UCI Kin8 dataset. The lower the better. Whereas DGP improves along columns (i.e. with M), auNN improves along rows (i.e. with D). This is as theoretically expected, and it is convenient from a scalability viewpoint.

Figure 10 (x-axis: M = 5, 10, 25, 50, 75, 100; panels: L = 2, 3, 4): Test RMSE with increasing depth (L = 2, 3, 4). This supports that auNN might benefit more than DGP from deeper networks. Moreover, the aforementioned different influence of M and D on DGP and auNN is confirmed here.
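The different dependence on M and D shown in Figures 9 and 10 follows from the structure of the auNN unit: the GP activation acts on a one-dimensional pre-activation, so a handful of inducing points already covers the relevant input range, and capacity is instead added by increasing the number of units D. The following is a minimal sketch of the predictive mean of one such unit under a sparse GP with an RBF kernel (a simplification of the actual model: the weights and inducing outputs are taken as given, no predictive variance is propagated, and all names are ours):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """1D RBF kernel between column vectors a (N, 1) and b (M, 1)."""
    d2 = (a - b.T) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def aunn_unit_mean(x, w, Z, u, jitter=1e-6):
    """Predictive mean of one auNN unit: a 1D GP applied to the projection
    a = x @ w.  x: (N, D_in), w: (D_in,), Z: (M, 1) inducing locations,
    u: (M,) inducing outputs (e.g. variational means)."""
    a = (x @ w)[:, None]                       # 1D pre-activation, shape (N, 1)
    Kmm = rbf(Z, Z) + jitter * np.eye(len(Z))  # covariance among inducing points
    Kam = rbf(a, Z)                            # (N, M) cross-covariance
    return Kam @ np.linalg.solve(Kmm, u)       # E[f(a)] under the sparse GP
```

Because the GP input is one-dimensional, a small set of inducing locations Z already covers the range of pre-activations, which is consistent with the weak dependence on M observed above for auNN.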
Table 9: Test NLL of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units). Mean and one standard error over 5 independent runs on the UCI Kin8 dataset are shown. The lower the better. Within each method, columns correspond to M = 5, 10, 25, 50, 75, 100.

D = 5 | auNN-RBF: -0.85 ± 0.01, -0.89 ± 0.01, -0.89 ± 0.01, -0.89 ± 0.01, -0.89 ± 0.01, -0.90 ± 0.01 | auNN-TRI: -0.78 ± 0.00, -0.79 ± 0.03, -0.78 ± 0.05, -0.77 ± 0.04, -0.67 ± 0.06, -0.71 ± 0.04 | DGP: -0.67 ± 0.01, -0.98 ± 0.00, -1.19 ± 0.01, -1.30 ± 0.01, -1.33 ± 0.01, -1.34 ± 0.01
D = 10 | auNN-RBF: -1.06 ± 0.01, -1.09 ± 0.01, -1.09 ± 0.01, -1.09 ± 0.01, -1.10 ± 0.02, -1.10 ± 0.01 | auNN-TRI: -0.96 ± 0.01, -1.02 ± 0.01, -1.03 ± 0.01, -0.98 ± 0.03, -0.94 ± 0.03, -0.89 ± 0.03 | DGP: -0.69 ± 0.01, -0.98 ± 0.00, -1.19 ± 0.00, -1.30 ± 0.01, -1.33 ± 0.01, -1.35 ± 0.01
D = 25 | auNN-RBF: -1.27 ± 0.02, -1.30 ± 0.02, -1.30 ± 0.02, -1.30 ± 0.02, -1.31 ± 0.01, -1.31 ± 0.02 | auNN-TRI: -1.09 ± 0.01, -1.19 ± 0.01, -1.22 ± 0.01, -1.15 ± 0.02, -1.11 ± 0.01, -1.06 ± 0.03 | DGP: -0.68 ± 0.01, -0.98 ± 0.00, -1.17 ± 0.01, -1.26 ± 0.01, -1.29 ± 0.01, -1.30 ± 0.01
D = 50 | auNN-RBF: -1.33 ± 0.01, -1.34 ± 0.01, -1.34 ± 0.02, -1.33 ± 0.01, -1.34 ± 0.02, -1.32 ± 0.03 | auNN-TRI: -1.15 ± 0.01, -1.24 ± 0.01, -1.29 ± 0.01, -1.26 ± 0.01, -1.24 ± 0.02, -1.19 ± 0.02 | DGP: -0.69 ± 0.01, -0.96 ± 0.01, -1.16 ± 0.01, -1.21 ± 0.01, -1.22 ± 0.01, -1.24 ± 0.01

Table 10: Test RMSE of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units). Mean and one standard error over 5 independent runs on the UCI Kin8 dataset are shown. The lower the better. Within each method, columns correspond to M = 5, 10, 25, 50, 75, 100.

D = 5 | auNN-RBF: 0.10 ± 0.00, 0.10 ± 0.00, 0.10 ± 0.00, 0.10 ± 0.00, 0.10 ± 0.00, 0.10 ± 0.00 | auNN-TRI: 0.11 ± 0.00, 0.11 ± 0.00, 0.11 ± 0.01, 0.11 ± 0.00, 0.12 ± 0.01, 0.12 ± 0.00 | DGP: 0.12 ± 0.00, 0.09 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00
D = 10 | auNN-RBF: 0.08 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00 | auNN-TRI: 0.09 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00, 0.09 ± 0.00, 0.09 ± 0.00, 0.10 ± 0.00 | DGP: 0.12 ± 0.00, 0.09 ± 0.00, 0.07 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00
D = 25 | auNN-RBF: 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00 | auNN-TRI: 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00, 0.08 ± 0.00 | DGP: 0.12 ± 0.00, 0.09 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.06 ± 0.00
D = 50 | auNN-RBF: 0.06 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00, 0.06 ± 0.00 | auNN-TRI: 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00 | DGP: 0.12 ± 0.00, 0.09 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00, 0.07 ± 0.00
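For reference, the NLL and RMSE values reported in these tables are computed from the predictive distribution at each test point. A minimal sketch for the Gaussian case is shown below (a simplification: the compared models actually average over Monte Carlo samples of the predictive distribution; the names are ours):

```python
import numpy as np

def gaussian_nll_rmse(y, mu, var):
    """Average negative log-likelihood and RMSE on a test set, assuming a
    Gaussian predictive distribution N(mu, var) at each test point.
    y, mu, var: arrays of shape (N_test,)."""
    nll = 0.5 * np.mean(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
    rmse = np.sqrt(np.mean((y - mu) ** 2))
    return nll, rmse
```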
Table 11: Test NLL of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units) as the depth increases from L = 2 to L = 4. Mean and one standard error over 5 independent runs on the UCI Power dataset are shown. The lower the better. Within each method, columns correspond to M = 5, 10, 25, 50, 75, 100.

L = 2, D = 5 | auNN-RBF: 2.86 ± 0.02, 2.84 ± 0.02, 2.84 ± 0.02, 2.84 ± 0.02, 2.84 ± 0.02, 2.84 ± 0.02 | auNN-TRI: 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.84 ± 0.02, 2.89 ± 0.03, 2.84 ± 0.02 | DGP: 2.87 ± 0.02, 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02
L = 2, D = 10 | auNN-RBF: 2.84 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02 | auNN-TRI: 2.84 ± 0.02, 2.82 ± 0.02, 2.80 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02 | DGP: 2.87 ± 0.02, 2.86 ± 0.02, 2.83 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 2, D = 25 | auNN-RBF: 2.83 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02 | auNN-TRI: 2.83 ± 0.02, 2.80 ± 0.02, 2.78 ± 0.02, 2.78 ± 0.02, 2.78 ± 0.02, 2.79 ± 0.02 | DGP: 2.87 ± 0.02, 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02
L = 2, D = 50 | auNN-RBF: 2.82 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.80 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02 | auNN-TRI: 2.82 ± 0.02, 2.80 ± 0.02, 2.77 ± 0.02, 2.76 ± 0.02, 2.76 ± 0.02, 2.76 ± 0.03 | DGP: 2.86 ± 0.02, 2.87 ± 0.02, 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 3, D = 5 | auNN-RBF: 2.84 ± 0.02, 2.83 ± 0.02, 2.83 ± 0.02, 2.83 ± 0.03, 2.83 ± 0.02, 2.83 ± 0.02 | auNN-TRI: 2.84 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.82 ± 0.02, 2.85 ± 0.02, 2.82 ± 0.02 | DGP: 2.86 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.79 ± 0.02, 2.78 ± 0.02, 2.77 ± 0.01
L = 3, D = 10 | auNN-RBF: 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02, 2.80 ± 0.02 | auNN-TRI: 2.82 ± 0.02, 2.80 ± 0.02, 2.79 ± 0.02, 2.79 ± 0.02, 2.78 ± 0.02, 2.78 ± 0.02 | DGP: 2.86 ± 0.02, 2.83 ± 0.02, 2.80 ± 0.02, 2.81 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 3, D = 25 | auNN-RBF: 2.80 ± 0.02, 2.79 ± 0.02, 2.77 ± 0.02, 2.77 ± 0.02, 2.77 ± 0.02, 2.77 ± 0.02 | auNN-TRI: 2.79 ± 0.02, 2.77 ± 0.02, 2.74 ± 0.02, 2.72 ± 0.02, 2.74 ± 0.03, 2.74 ± 0.03 | DGP: 2.86 ± 0.02, 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02, 2.81 ± 0.02
L = 3, D = 50 | auNN-RBF: 2.78 ± 0.02, 2.78 ± 0.02, 2.77 ± 0.02, 2.76 ± 0.02, 2.76 ± 0.02, 2.76 ± 0.03 | auNN-TRI: 2.78 ± 0.02, 2.75 ± 0.02, 2.71 ± 0.02, 2.71 ± 0.03, 2.70 ± 0.03, 2.70 ± 0.02 | DGP: 2.87 ± 0.02, 2.87 ± 0.02, 2.84 ± 0.02, 2.82 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 4, D = 5 | auNN-RBF: 2.84 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.01, 2.83 ± 0.02 | auNN-TRI: 3.69 ± 0.35, 2.83 ± 0.01, 2.83 ± 0.02, 2.83 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02 | DGP: 2.86 ± 0.02, 2.83 ± 0.02, 2.80 ± 0.02, 2.79 ± 0.02, 2.78 ± 0.02, 2.77 ± 0.02
L = 4, D = 10 | auNN-RBF: 2.81 ± 0.02, 2.80 ± 0.02, 2.80 ± 0.02, 2.80 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02 | auNN-TRI: 2.83 ± 0.02, 2.81 ± 0.01, 2.79 ± 0.01, 2.79 ± 0.02, 2.79 ± 0.02, 2.79 ± 0.02 | DGP: 2.86 ± 0.02, 2.84 ± 0.02, 2.83 ± 0.02, 2.79 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 4, D = 25 | auNN-RBF: 2.79 ± 0.02, 2.78 ± 0.02, 2.77 ± 0.02, 2.75 ± 0.02, 2.77 ± 0.02, 2.76 ± 0.02 | auNN-TRI: 2.80 ± 0.01, 2.78 ± 0.02, 2.75 ± 0.02, 2.74 ± 0.02, 2.75 ± 0.03, 2.75 ± 0.02 | DGP: 2.86 ± 0.02, 2.85 ± 0.02, 2.83 ± 0.02, 2.82 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02
L = 4, D = 50 | auNN-RBF: 2.79 ± 0.02, 2.80 ± 0.02, 2.75 ± 0.03, 2.77 ± 0.03, 2.75 ± 0.03, 2.74 ± 0.03 | auNN-TRI: 2.79 ± 0.01, 2.77 ± 0.01, 2.73 ± 0.02, 2.74 ± 0.02, 2.74 ± 0.02, 2.75 ± 0.02 | DGP: 2.87 ± 0.02, 2.85 ± 0.02, 2.84 ± 0.02, 2.82 ± 0.02, 2.82 ± 0.02, 2.81 ± 0.02

Table 12: Test RMSE of auNN and DGP for different values of M (number of inducing points) and D (number of hidden units) as the depth increases from L = 2 to L = 4. Mean and one standard error over 5 independent runs on the UCI Power dataset are shown. The lower the better. Within each method, columns correspond to M = 5, 10, 25, 50, 75, 100.
L = 2, D = 5 | auNN-RBF: 4.20 ± 0.09, 4.14 ± 0.08, 4.12 ± 0.08, 4.13 ± 0.08, 4.13 ± 0.07, 4.14 ± 0.09 | auNN-TRI: 4.16 ± 0.08, 4.09 ± 0.08, 4.06 ± 0.09, 4.12 ± 0.09, 4.32 ± 0.13, 4.12 ± 0.09 | DGP: 4.24 ± 0.10, 4.19 ± 0.09, 4.08 ± 0.09, 4.05 ± 0.08, 4.03 ± 0.07, 4.01 ± 0.07
L = 2, D = 10 | auNN-RBF: 4.15 ± 0.09, 4.08 ± 0.08, 4.03 ± 0.08, 4.03 ± 0.07, 4.03 ± 0.09, 4.03 ± 0.09 | auNN-TRI: 4.10 ± 0.10, 4.03 ± 0.08, 3.99 ± 0.08, 4.01 ± 0.08, 4.03 ± 0.09, 4.03 ± 0.07 | DGP: 4.24 ± 0.10, 4.21 ± 0.09, 4.10 ± 0.08, 4.08 ± 0.08, 4.03 ± 0.08, 4.02 ± 0.07
L = 2, D = 25 | auNN-RBF: 4.09 ± 0.08, 4.01 ± 0.08, 4.01 ± 0.08, 4.01 ± 0.08, 4.00 ± 0.09, 4.02 ± 0.08 | auNN-TRI: 4.04 ± 0.08, 3.96 ± 0.07, 3.90 ± 0.08, 3.91 ± 0.08, 3.89 ± 0.08, 3.92 ± 0.07 | DGP: 4.24 ± 0.09, 4.18 ± 0.09, 4.10 ± 0.09, 4.06 ± 0.08, 4.03 ± 0.08, 4.01 ± 0.08
L = 2, D = 50 | auNN-RBF: 4.06 ± 0.08, 4.00 ± 0.07, 4.00 ± 0.07, 3.98 ± 0.07, 4.01 ± 0.09, 3.99 ± 0.08 | auNN-TRI: 4.04 ± 0.08, 3.93 ± 0.07, 3.86 ± 0.09, 3.83 ± 0.08, 3.81 ± 0.08, 3.81 ± 0.10 | DGP: 4.24 ± 0.10, 4.24 ± 0.10, 4.18 ± 0.09, 4.11 ± 0.09, 4.06 ± 0.09, 4.03 ± 0.08
L = 3, D = 5 | auNN-RBF: 4.14 ± 0.09, 4.08 ± 0.08, 4.11 ± 0.06, 4.09 ± 0.11, 4.11 ± 0.09, 4.09 ± 0.09 | auNN-TRI: 4.10 ± 0.09, 4.10 ± 0.07, 4.02 ± 0.08, 4.04 ± 0.08, 4.15 ± 0.08, 4.04 ± 0.09 | DGP: 4.22 ± 0.09, 4.10 ± 0.08, 4.07 ± 0.09, 3.92 ± 0.07, 3.90 ± 0.06, 3.86 ± 0.05
L = 3, D = 10 | auNN-RBF: 4.02 ± 0.08, 4.02 ± 0.08, 4.00 ± 0.07, 4.02 ± 0.07, 4.02 ± 0.07, 3.99 ± 0.07 | auNN-TRI: 4.01 ± 0.08, 3.95 ± 0.07, 3.92 ± 0.07, 3.92 ± 0.08, 3.90 ± 0.07, 3.90 ± 0.08 | DGP: 4.20 ± 0.09, 4.10 ± 0.08, 3.98 ± 0.08, 3.99 ± 0.07, 4.05 ± 0.08, 4.03 ± 0.08
L = 3, D = 25 | auNN-RBF: 3.96 ± 0.08, 3.93 ± 0.07, 3.87 ± 0.07, 3.87 ± 0.07, 3.87 ± 0.07, 3.83 ± 0.07 | auNN-TRI: 3.88 ± 0.08, 3.84 ± 0.08, 3.76 ± 0.07, 3.67 ± 0.06, 3.75 ± 0.10, 3.71 ± 0.09 | DGP: 4.23 ± 0.10, 4.19 ± 0.08, 4.11 ± 0.09, 4.06 ± 0.08, 4.03 ± 0.08, 4.02 ± 0.08
L = 3, D = 50 | auNN-RBF: 3.89 ± 0.08, 3.88 ± 0.07, 3.85 ± 0.06, 3.80 ± 0.09, 3.82 ± 0.07, 3.80 ± 0.08 | auNN-TRI: 3.86 ± 0.08, 3.77 ± 0.09, 3.62 ± 0.06, 3.61 ± 0.08, 3.59 ± 0.09, 3.60 ± 0.07 | DGP: 4.24 ± 0.10, 4.24 ± 0.10, 4.12 ± 0.09, 4.07 ± 0.08, 4.04 ± 0.08, 4.03 ± 0.08
L = 4, D = 5 | auNN-RBF: 4.14 ± 0.10, 4.10 ± 0.08, 4.08 ± 0.08, 4.11 ± 0.09, 4.07 ± 0.05, 4.09 ± 0.09 | auNN-TRI: 12.00 ± 3.26, 4.04 ± 0.07, 4.06 ± 0.09, 4.07 ± 0.07, 4.09 ± 0.08, 4.07 ± 0.08 | DGP: 4.22 ± 0.09, 4.10 ± 0.08, 3.97 ± 0.07, 3.93 ± 0.07, 3.88 ± 0.07, 3.85 ± 0.07
L = 4, D = 10 | auNN-RBF: 4.01 ± 0.08, 3.98 ± 0.07, 3.99 ± 0.07, 3.99 ± 0.07, 4.05 ± 0.07, 4.01 ± 0.06 | auNN-TRI: 4.03 ± 0.09, 3.99 ± 0.06, 3.94 ± 0.06, 3.94 ± 0.07, 3.92 ± 0.07, 3.93 ± 0.08 | DGP: 4.20 ± 0.09, 4.12 ± 0.08, 4.08 ± 0.10, 3.94 ± 0.07, 4.06 ± 0.08, 4.01 ± 0.08
L = 4, D = 25 | auNN-RBF: 3.93 ± 0.09, 3.91 ± 0.08, 3.87 ± 0.07, 3.78 ± 0.06, 3.84 ± 0.07, 3.82 ± 0.07 | auNN-TRI: 3.94 ± 0.07, 3.85 ± 0.08, 3.76 ± 0.08, 3.75 ± 0.08, 3.77 ± 0.10, 3.78 ± 0.09 | DGP: 4.24 ± 0.09, 4.18 ± 0.09, 4.11 ± 0.09, 4.06 ± 0.08, 4.04 ± 0.08, 4.03 ± 0.08
L = 4, D = 50 | auNN-RBF: 3.92 ± 0.08, 3.96 ± 0.07, 3.78 ± 0.11, 3.82 ± 0.07, 3.74 ± 0.10, 3.70 ± 0.07 | auNN-TRI: 3.90 ± 0.08, 3.82 ± 0.06, 3.71 ± 0.09, 3.73 ± 0.09, 3.75 ± 0.08, 3.77 ± 0.08 | DGP: 4.24 ± 0.09, 4.20 ± 0.10, 4.12 ± 0.09, 4.06 ± 0.08, 4.04 ± 0.08, 4.03 ± 0.08

Figure 11 (panels: (a) auNN, (b) DGP, (c) DGP-add): Representation of two hidden layers (with two units per layer) for auNN (a), DGP (b), and DGP-add (c).

D COMPUTATIONAL COST SUMMARY

Table 15 shows the training computational complexity of the methods compared in this paper. Moreover, in order to evaluate the computational cost in practice, the table also shows the actual running time for the experiment of Section 3.1. BNN is the fastest algorithm, since it uses a factorized Gaussian for the approximate posterior. Although fBNN has the same theoretical complexity, the Spectral Stein Gradient Estimator (Shi et al., 2018) is used to compute the gradients of the KL divergence. Moreover, its GP prior is specified in function space, and this GP must be trained as a previous step. DGP and auNN have the same theoretical complexity. In practice, auNN is typically faster because it requires fewer inducing points; recall Section 3.3 and Table 3. The running times in Table 15 are very similar for both because the same number of inducing points (M = 10) is used in this simple experiment.
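To make the complexity expressions in Table 15 concrete, the helper below simply instantiates them for a given architecture. This is our own illustrative code, not code from the paper; which layer widths enter each sum is our reading of the notation, and constant factors are ignored.

```python
def cost_terms(layer_widths, N, M):
    """Instantiate the complexity expressions of Table 15.

    layer_widths = [D_0, D_1, ..., D_L] (input, hidden(s), output).
    Returns the dominant per-pass operation counts (up to constants)
    for the two families of models."""
    # O(N * sum_i D_i * D_{i+1}) for the weight-space models (BNN, fBNN)
    weight_space = N * sum(d_in * d_out
                           for d_in, d_out in zip(layer_widths[:-1], layer_widths[1:]))
    # O(N * M^2 * sum_i D_i) for the sparse-GP models (DGP, auNN),
    # summing over the GP layers' widths
    sparse_gp = N * M ** 2 * sum(layer_widths[1:])
    return {"BNN / fBNN": weight_space, "DGP / auNN": sparse_gp}

# e.g. a UCI-style setting with L = 2, D = 50 hidden units and M = 25 inducing points
print(cost_terms([10, 50, 1], N=1000, M=25))
```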
Figure 12 (x-axis: M = 5, 10, 25, 50, 75, 100): Test NLL on the Power dataset for different values of D and M (the lower the better).

Figure 13 (x-axis: M = 5, 10, 25, 50, 75, 100): Test RMSE on the Power dataset for different values of D and M (the lower the better).

Table 13: Standard error for the results in Table 2 (Brier score and ECE). Three random train-test splits are considered. A computation sketch for these two calibration metrics is provided after Table 15.

N | D | RBF-2 | RBF-3 | RBF-4 | TRI-2 | TRI-3 | TRI-4 | DGP-2 | DGP-3 | DGP-4
Brier | HIGGS | 11M | 28 | 0.0001 | 0.0007 | 0.0008 | 0.0003 | 0.0005 | 0.0009 | 0.0018 | 0.0016 | 0.0006
Brier | SUSY | 5M | 18 | 0.0005 | 0.0005 | 0.0006 | 0.0005 | 0.0005 | 0.0005 | 0.0011 | 0.0014 | 0.0021
ECE | HIGGS | 11M | 28 | 0.0015 | 0.0020 | 0.0022 | 0.0010 | 0.0035 | 0.0019 | 0.0006 | 0.0004 | 0.0008
ECE | SUSY | 5M | 18 | 0.0012 | 0.0011 | 0.0014 | 0.0018 | 0.0012 | 0.0014 | 0.0005 | 0.0006 | 0.0008

Table 14: Standard error for the results in Table 3. Fifty independent runs are considered.

RBF-2 | RBF-3 | RBF-4 | TRI-2 | TRI-3 | TRI-4 | DGP-2 | DGP-3 | DGP-4
HIGGS | 0.0258 | 0.0325 | 0.0371 | 0.0188 | 0.0318 | 0.0378 | 0.0248 | 0.0266 | 0.0269
SUSY | 0.0215 | 0.0274 | 0.0369 | 0.0202 | 0.0258 | 0.0350 | 0.0108 | 0.0126 | 0.0144

Figure 14: DGP and GP trained on the dataset of Section 3.1. The experimental details are analogous to those in Section 3.1; see Appendix B. Whereas DGP underestimates the uncertainty for in-between data, a simpler GP does provide increased uncertainty in the gap.

Table 15: Training computational cost for the models compared in this paper. The running time (in seconds) corresponds to the mean and one standard error over 10 independent runs of the experiment in Section 3.1. More details in Appendix D.

BNN | fBNN | DGP | auNN
Running time (s) | 15.21 ± 0.78 | 51.92 ± 1.07 | 22.37 ± 0.97 | 21.16 ± 0.89
Complexity | O(N Σ_i D_i D_{i+1}) | O(N Σ_i D_i D_{i+1}) | O(N M² Σ_i D_i) | O(N M² Σ_i D_i)
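As referenced in Table 13, the calibration metrics reported there are the Brier score and the expected calibration error (ECE). Below is a minimal sketch of both for binary classification (HIGGS and SUSY are binary problems); it implements the standard definitions with 10 equal-width confidence bins assumed for ECE, and is not the exact evaluation code used for the experiments.

```python
import numpy as np

def brier_score(p, y):
    """Binary Brier score: mean squared error between the predicted
    probability of class 1 (p) and the labels y in {0, 1}."""
    return np.mean((p - y) ** 2)

def expected_calibration_error(p, y, n_bins=10):
    """ECE with equal-width bins over the predicted confidence.
    p: predicted probability of class 1, y: labels in {0, 1}."""
    conf = np.maximum(p, 1 - p)               # confidence of the predicted class
    pred = (p >= 0.5).astype(int)             # predicted class
    acc = (pred == y).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weight each bin by its share of test points
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece
```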