# Repulsive Deep Ensembles are Bayesian

Francesco D'Angelo (ETH Zürich, Zürich, Switzerland; dngfra@gmail.com)
Vincent Fortuin (ETH Zürich, Zürich, Switzerland; fortuin@inf.ethz.ch)

Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this not only affects the quality of the ensemble's predictions, but even more so its uncertainty estimates, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence to the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.

## 1 Introduction

There have been many recent advances on the theoretical properties of sampling algorithms for approximate Bayesian inference, which have changed our interpretation and understanding of them. Particularly worth mentioning is the work of Jordan et al. [38], who reinterpret Markov Chain Monte Carlo (MCMC) as a gradient flow of the KL divergence over the Wasserstein space of probability measures. This new formulation not only allowed for a deeper understanding of approximate inference methods but also inspired the inception of new and more efficient inference strategies. Following this direction, Liu and Wang [51] recently proposed the Stein Variational Gradient Descent (SVGD) method to perform approximate Wasserstein gradient descent. Conceptually, this method, which belongs to the family of particle-optimization variational inference (POVI) methods, introduces a repulsive force through a kernel acting in the parameter space to evolve a set of samples towards high-density regions of the target distribution without collapsing to a point estimate.

Another class of methods that has achieved great success recently are ensembles of neural networks (so-called deep ensembles), which work well both in terms of predictive performance [42, 80] and uncertainty estimation [65], and have also been proposed as a way to perform approximate inference in Bayesian neural networks [82, 36]. That being said, while they might allow for the averaging of predictions over several hypotheses, they do not offer any guarantees for the diversity between those hypotheses, nor do they provably converge to the true Bayesian posterior under any meaningful limit.

In this work, we show how the introduction of a repulsive term between the members of the ensemble, inspired by SVGD, not only naïvely guarantees the diversity among the members, avoiding their collapse in parameter space, but also allows for a reformulation of the method as a gradient flow of the KL divergence in the Wasserstein space of distributions. It thus endows deep ensembles with convergence guarantees to the true Bayesian posterior.
Figure 1: BNN 1D regression. The function-space methods (SVGD and WGD) approach the HMC posterior more closely, while the standard deep ensembles and weight-space methods fail to properly account for the uncertainty, especially the in-between uncertainty.

An additional problem is that BNN inference in weight space can lead to degenerate solutions, due to the overparametrization of these models. That is, several samples could have very different weights but map to the same function, thus giving a false sense of diversity in the ensemble. This property, which we will refer to as non-identifiability of neural networks (see Appendix A), can lead to redundancies in the posterior distribution. It implies that methods like MCMC sampling, deep ensembles, and SVGD waste computation on local modes that account for equivalent functions. Predictive distributions approximated using samples from these modes do not improve over a simple point estimate and lead to poor uncertainty estimation. Following this idea, Wang et al. [76] introduced a new method to extend POVI methods to function space, overcoming this limitation. Here, we also study an update rule that allows for an approximation of the gradient flow of the KL divergence in function space in our proposed repulsive ensembles.

We make the following contributions:

- We derive several different repulsion terms that can be added as regularizers to the gradient updates of deep ensembles to endow them with Bayesian convergence properties.
- We show that these terms approximate Wasserstein gradient flows of the KL divergence and can be used both in weight space and function space.
- We compare these proposed methods theoretically to standard deep ensembles and SVGD and highlight their different guarantees.
- We assess all these methods on synthetic and real-world deep learning tasks and show that our proposed repulsive ensembles can achieve competitive performance and improved uncertainty estimation.

## 2 Repulsive Deep Ensembles

In supervised deep learning, we typically consider a likelihood function $p(y \mid f(x; \mathbf{w}))$ (e.g., Gaussian for regression or categorical for classification) parameterized by a neural network $f(x; \mathbf{w})$ and training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. In Bayesian neural networks (BNNs), we are interested in the posterior distribution of all likely networks, given by $p(\mathbf{w} \mid \mathcal{D}) \propto \prod_{i=1}^{n} p(y_i \mid f(x_i; \mathbf{w}))\, p(\mathbf{w})$, where $p(\mathbf{w})$ is the prior distribution over weights. Crucially, when making a prediction on a test point $x^{\ast}$, in the Bayesian approach we do not only use a single parameter $\widehat{\mathbf{w}}$ to predict $y^{\ast} = f(x^{\ast}; \widehat{\mathbf{w}})$, but we marginalize over the whole posterior, thus taking all possible explanations of the data into account:

$$p(y^{\ast} \mid x^{\ast}, \mathcal{D}) = \int p\big(y^{\ast} \mid f(x^{\ast}; \mathbf{w})\big)\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w} \quad (1)$$

While approximating the posterior of Bayesian neural networks (or sampling from it) is a challenging task, performing maximum a posteriori (MAP) estimation, which corresponds to finding the mode of the posterior, is usually simple. Ensembles of neural networks use the non-convexity of the MAP optimization problem to create a collection of K independent and possibly different solutions.
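To make Eq. (1) concrete, here is a minimal sketch (not from the paper) of how the integral is replaced by an equally weighted average over the K ensemble solutions; `predict_proba` is an illustrative placeholder for a member's forward pass returning class probabilities:

```python
import numpy as np

def bma_predictive(members, x):
    """Monte Carlo version of Eq. (1): treat the K ensemble members as equally
    weighted posterior samples and average their likelihoods p(y | f(x; w_k))."""
    probs = np.stack([m.predict_proba(x) for m in members])  # (K, N, C)
    return probs.mean(axis=0)                                # (N, C) BMA predictive
```

For regression, the same averaging would be applied to the members' predictive densities instead of class probabilities.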
Considering $n$ weight configurations of a neural network $\{\mathbf{w}_i\}_{i=1}^{n}$ with $\mathbf{w}_i \in \mathbb{R}^d$, the dynamics of the ensemble under the gradient of the posterior lead to the following update rule at iteration $t$:

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \epsilon_t\, \nabla_{\mathbf{w}_i^{t}} \log p(\mathbf{w}_i^{t} \mid \mathcal{D}) \quad (2)$$

with step size $\epsilon_t$. Ensemble methods have a long history [e.g., 45, 26, 6] and were recently revisited for neural networks [42] and coined deep ensembles. The predictions of the different members are combined into a predictive distribution by using the solutions to compute the Bayesian model average (BMA) in Eq. (1). Recent works [65] have shown that deep ensembles can outperform some of the Bayesian approaches for uncertainty estimation. Even more recently, Wilson and Izmailov [82] argued that deep ensembles can be considered a compelling approach to Bayesian model averaging. Despite these ideas, the ability of deep ensembles to efficiently average over multiple hypotheses and to explore the functional landscape of the posterior distribution, studied in [18], does not guarantee sampling from the right distribution. Indeed, the additional Langevin noise introduced in [77], which is not considered in deep ensembles, is crucial to ensure samples from the true Bayesian posterior.

From a practical standpoint, since the quality of an ensemble hinges on the diversity of its members, many methods were recently proposed to improve this diversity without compromising the individual accuracy. For instance, Wenzel et al. [80] propose hyper-deep ensembles that combine deep networks with different hyperparameters. Similarly, cyclical learning-rate schedules can explore several local minima for the ensemble members [33]. Alternatively, Rame and Cord [67] proposed an information-theoretic framework to avoid redundancy in the members, and Oswald et al. [63] studied possible interactions between members based on weight sharing. However, the absence of a constraint that prevents particles from converging to the same mode limits the possibility of improvement by introducing more ensemble members. This means that any hope to converge to different modes must exclusively rely on:

1. the randomness of the initialization,
2. the noise in the estimation of the gradients due to minibatching,
3. the number of local optima that might be reached during gradient descent.

Moreover, the recent study of Geiger et al. [25] showed how the empirical test error of the ensemble converges to that of a single trained model when the number of parameters goes to infinity, leading to a deterioration of the performance. In other words, the bigger the model, the harder it is to maintain diversity in the ensemble and avoid collapse to the same solution. This is intuitively due to the fact that bigger models are less sensitive to the initialization. Namely, in order for them to get stuck in a local minimum, they must have second derivatives that are positive simultaneously in all directions; as the number of hidden units gets larger, this becomes less likely.

### 2.1 Repulsive force in weight space

To overcome the aforementioned limitations of standard deep ensembles, we introduce, inspired by SVGD [51], a deep ensemble whose members interact with each other through a repulsive component. Using a kernel function to model this interaction, the single models repel each other based on their position in weight space, so that two members can never assume the same weights. Considering a stationary kernel $k(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ acting in the parameter space of the neural network, a repulsive term $\mathcal{R}$ can be parameterized through its gradients:

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \epsilon_t \Big( \nabla_{\mathbf{w}_i^{t}} \log p(\mathbf{w}_i^{t} \mid \mathcal{D}) - \mathcal{R}\big(\{\nabla_{\mathbf{w}_i^{t}} k(\mathbf{w}_i^{t}, \mathbf{w}_j^{t})\}_{j=1}^{n}\big) \Big) \quad (3)$$

To get an intuition for the behavior of this repulsive term and its gradients, we can consider the RBF kernel $k(\mathbf{w}_i, \mathbf{w}_j) = \exp\!\big(-\tfrac{1}{h}\lVert \mathbf{w}_i - \mathbf{w}_j \rVert^2\big)$ with lengthscale $h$ and notice how its gradient drives $\mathbf{w}_i$ away from its neighboring members $\mathbf{w}_j$, thus creating a repulsive effect. Naturally, not all choices of $\mathcal{R}$ induce this effect. One of the simplest formulations that does is a linear combination of the kernel gradients scaled by a positive factor, that is, $\beta \sum_{j=1}^{n} \nabla_{\mathbf{w}_i^{t}} k(\mathbf{w}_i^{t}, \mathbf{w}_j^{t})$ with $\beta \in \mathbb{R}^{+}$. We will see in Section 3 how the choice of $\beta$ can be justified in order to obtain convergence to the Bayesian posterior, together with alternative formulations of $\mathcal{R}$ that preserve this convergence.
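As an illustration of Eq. (3), the NumPy sketch below performs one repulsive gradient step with an RBF kernel, using the KDE-normalized sum of kernel gradients anticipated in Section 3 as $\mathcal{R}$ (the $\beta$-scaled variant amounts to replacing the normalization by a constant). The function `grad_log_post` is a user-supplied stand-in for the per-particle posterior gradients and is not part of the paper:

```python
import numpy as np

def rbf_and_grads(W, h):
    """Pairwise RBF kernel k(w_i, w_j) = exp(-||w_i - w_j||^2 / h) for an
    (n, d) array of particles, plus the gradients d k(w_i, w_j) / d w_i."""
    diff = W[:, None, :] - W[None, :, :]            # (n, n, d), w_i - w_j
    K = np.exp(-(diff ** 2).sum(-1) / h)            # (n, n)
    grad_K = -2.0 / h * diff * K[..., None]         # d k(w_i, w_j) / d w_i
    return K, grad_K

def repulsive_step(W, grad_log_post, h, eps):
    """One update of Eq. (3): posterior gradient minus a kernel repulsion.
    Here R is the KDE-normalized sum of kernel gradients from Section 3."""
    K, grad_K = rbf_and_grads(W, h)
    repulsion = grad_K.sum(axis=1) / K.sum(axis=1, keepdims=True)   # (n, d)
    return W + eps * (grad_log_post(W) - repulsion)
```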
### 2.2 Repulsive force in function space

To overcome the aforementioned overparameterization issue, the update in Eq. (3) can be formulated in function space instead of weight space. Let $f: \mathbf{w} \mapsto f(\cdot\,; \mathbf{w})$ be the map that takes a configuration of weights $\mathbf{w} \in \mathbb{R}^d$ to the corresponding neural network regression function, and denote by $f_i := f(\cdot\,; \mathbf{w}_i)$ the function with a certain configuration of weights $\mathbf{w}_i$. We can now consider $n$ particles in function space $\{f_i\}_{i=1}^{n}$ with $f \in \mathcal{F}$ and model their interaction with a general positive definite kernel $k(\cdot, \cdot)$. We also consider the implicit functional likelihood $p(y \mid x, f)$, determined by the measure $p(y \mid x, \mathbf{w})$ in weight space, as well as the functional prior $p(f)$, which can either be defined separately (e.g., using a GP) or modeled as a push-forward measure of the weight-space prior $p(\mathbf{w})$. Together, they determine the posterior in function space $p(f \mid \mathcal{D})$. The functional evolution of a particle can then be written as:

$$f_i^{t+1} = f_i^{t} + \epsilon_t \Big( \nabla_{f_i^{t}} \log p(f_i^{t} \mid \mathcal{D}) - \mathcal{R}\big(\{\nabla_{f_i^{t}} k(f_i^{t}, f_j^{t})\}_{j=1}^{n}\big) \Big) \quad (4)$$

However, computing the update in function space is neither tractable nor practical, which is why two additional considerations are needed. The first one regards the infinite dimensionality of function space, which we circumvent using a canonical projection onto a subspace:

Definition 1 (Canonical projection). For any $A \subset \mathcal{X}$, we define $\pi_A: \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{A}$ as the canonical projection onto $A$, that is, $\pi_A(f) = \{f(a)\}_{a \in A}$.

In other words, the kernel will not be evaluated directly in function space, but on the projection $k\big(\pi_B(f), \pi_B(f')\big)$, with $B$ being a subset of the input space given by a batch of training data points. The second consideration is to project this update back into the parameter space and evolve a set of particles there, because ultimately we are interested in representing the functions by parameterized neural networks. For this purpose, we can use the Jacobian of the $i$-th particle as a projector:

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \epsilon_t \left( \frac{\partial f_i^{t}}{\partial \mathbf{w}_i^{t}} \right)^{\!\top} \Big( \nabla_{f_i^{t}} \log p(f_i^{t} \mid \mathcal{D}) - \mathcal{R}\big(\{\nabla_{f_i^{t}} k(f_i^{t}, f_j^{t})\}_{j=1}^{n}\big) \Big) \quad (5)$$
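The following PyTorch sketch illustrates one step in the spirit of Eq. (5) under simplifying assumptions: an RBF kernel on the minibatch projection of Definition 1, a KDE-style repulsion, scalar network outputs, and the functional prior term omitted. The callback `log_lik_grad` is hypothetical and must return $\nabla_f \log p(y \mid x, f)$ for each member; the Jacobian transpose is applied implicitly through a vector-Jacobian product rather than formed explicitly:

```python
import torch

def functional_repulsive_step(nets, X_batch, log_lik_grad, eps, h):
    # Evaluate every member on the minibatch: the canonical projection of Definition 1.
    with torch.no_grad():
        F = torch.stack([net(X_batch).squeeze(-1) for net in nets])   # (n, B)
        diff = F[:, None, :] - F[None, :, :]                          # (n, n, B), f_i - f_j
        K = torch.exp(-(diff ** 2).sum(-1) / h)                       # RBF on projected functions
        # KDE-style repulsion: sum_j grad_{f_i} k(f_i, f_j) / sum_j k(f_i, f_j)
        repulsion = (-2.0 / h * diff * K[..., None]).sum(1) / K.sum(1, keepdim=True)
        V = log_lik_grad(F) - repulsion                               # (n, B) functional directions
    # Pull each functional direction back to weight space with (df_i/dw_i)^T V_i.
    for i, net in enumerate(nets):
        params = list(net.parameters())
        f_i = net(X_batch).squeeze(-1)                                # fresh forward pass builds the graph
        vjps = torch.autograd.grad(f_i, params, grad_outputs=V[i])
        with torch.no_grad():
            for p, g in zip(params, vjps):
                p.add_(eps * g)                                       # gradient ascent on the log-posterior
```

With an analytic functional prior such as a GP (see Section 3.3), its gradient would simply be added to `V` before the pull-back.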
### 2.3 Comparison to Stein variational gradient descent

Note that our update is reminiscent of SVGD [51], which in parameter space can be written as:

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \frac{\epsilon_t}{n} \sum_{j=1}^{n} \Big( k(\mathbf{w}_j^{t}, \mathbf{w}_i^{t})\, \nabla_{\mathbf{w}_j^{t}} \log p(\mathbf{w}_j^{t} \mid \mathcal{D}) + \nabla_{\mathbf{w}_j^{t}} k(\mathbf{w}_j^{t}, \mathbf{w}_i^{t}) \Big) \quad (6)$$

It is important to notice that here, the gradients are averaged across all the particles using the kernel matrix. Interestingly, SVGD can be asymptotically interpreted as a gradient flow of the KL divergence under a new metric induced by the Stein operator [16, 50] (see Appendix D for more details). Moving the inference from parameter to function space [76] leads to the update rule

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \epsilon_t \left( \frac{\partial f_i^{t}}{\partial \mathbf{w}_i^{t}} \right)^{\!\top} \frac{1}{n} \sum_{j=1}^{n} \Big( k(f_j^{t}, f_i^{t})\, \nabla_{f_j^{t}} \log p(f_j^{t} \mid \mathcal{D}) + \nabla_{f_j^{t}} k(f_j^{t}, f_i^{t}) \Big) \quad (7)$$

This way of averaging gradients using a kernel can be dangerous in high-dimensional settings, where kernel methods often suffer from the curse of dimensionality. Moreover, in Eq. (6), the posterior gradients of the particles are averaged using their similarity in weight space, which can be misleading in multi-modal posteriors. Worse yet, in Eq. (7), the gradients are averaged in function space and are then projected back using exclusively the $i$-th Jacobian, which can be harmful given that it is not guaranteed that distances between functions evaluated on a subset of their input space resemble their true distances. Our proposed method, on the other hand, does not employ any averaging of the posterior gradients and thus comes closest to the true particle gradients in deep ensembles.
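For contrast, here is a minimal NumPy sketch of the SVGD update in Eq. (6) under the same RBF-kernel assumption as above; note how the posterior gradients of all particles are mixed through the kernel matrix before the repulsion is added, which is exactly the averaging our method avoids. `grad_log_post` is again a user-supplied stand-in for the per-particle posterior gradients:

```python
import numpy as np

def svgd_step(W, grad_log_post, h, eps):
    """One SVGD update (Eq. (6)) with an RBF kernel k = exp(-||.||^2 / h)."""
    diff = W[:, None, :] - W[None, :, :]                 # (n, n, d), w_i - w_j
    K = np.exp(-(diff ** 2).sum(-1) / h)                 # symmetric kernel matrix
    grad_K = 2.0 / h * diff * K[..., None]               # d k(w_j, w_i) / d w_j
    G = grad_log_post(W)                                 # (n, d)
    phi = (K @ G + grad_K.sum(axis=1)) / W.shape[0]      # kernel-averaged driving + repulsion
    return W + eps * phi
```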
## 3 Repulsive deep ensembles are Bayesian

So far, we represented the repulsive force as a general function of the gradients of a kernel. In this section, we show how to determine the explicit form of the repulsive term, such that the resulting update rule is equivalent to the discretization of the gradient flow dynamics of the KL divergence in Wasserstein space. We begin by introducing the concepts of particle approximation and gradient flow.

### 3.1 Particle approximation

A particle-based approximation of a target measure depends on a set of weighted samples $\{(x_i, w_i)\}_{i=1}^{n}$, for which an empirical measure can be defined as

$$\rho(x) = \sum_{i=1}^{n} w_i\, \delta(x - x_i), \quad (8)$$

where $\delta(\cdot)$ is the Dirac delta function and the weights $w_i$ satisfy $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$. To approximate a target distribution $\pi(x)$ using the empirical measure, the particles and their weights need to be selected in a principled manner that minimizes some measure of distance between $\rho(x)$ and $\pi(x)$ (e.g., a set of $N$ samples with weights $w_i = 1/N$ obtained using an MCMC method).

### 3.2 Gradient flow in parameter space

Given a smooth function $J: \mathbb{R}^d \to \mathbb{R}$ in Euclidean space, we can minimize it by creating a path that follows its negative gradient starting from some initial condition $x_0$. The curve $x(t)$ with starting point $x_0$ described by that path is called a gradient flow. The dynamics and evolution in time of a considered point in the space under this minimization problem can be described by the following ODE (together with the initial condition, this is known as the Cauchy problem):

$$\frac{dx}{dt} = -\nabla J(x). \quad (9)$$

We can extend this concept to the space of probability distributions (Wasserstein gradient flow) [3]. Let us consider the space of probability measures $\mathcal{P}_2(\mathcal{M})$, that is, the set of probability measures with finite second moments defined on the manifold $\mathcal{M}$:

$$\mathcal{P}_2(\mathcal{M}) = \Big\{ \varrho: \mathcal{M} \to [0, \infty) \;\Big|\; \int |x|^2\, \varrho(x)\, dx < +\infty \Big\}.$$

Taking $\Gamma(\mu, \nu)$ as the set of joint probability measures with marginals $\mu$ and $\nu$, we can define the Wasserstein metric on the space $\mathcal{P}_2(\mathcal{M})$ as:

$$W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int |x - y|^2\, d\gamma(x, y). \quad (10)$$

Considering the optimization problem of a functional $J: \mathcal{P}_2(\mathcal{M}) \to \mathbb{R}$, such as the KL divergence between the particle approximation $\rho(x)$ in Eq. (8) and the target posterior $\pi(x)$,

$$\inf_{\rho \in \mathcal{P}_2(\mathcal{M})} D_{\mathrm{KL}}(\rho, \pi) = \int \big(\log \rho(x) - \log \pi(x)\big)\, \rho(x)\, dx,$$

the evolution in time of the measure under the equivalent of the gradient, the Wasserstein gradient flow, is described by the Liouville equation (also referred to as the continuity equation) [38, 3, 64]:

$$\frac{\partial \rho_t}{\partial t} = \nabla \cdot \Big( \rho_t\, \nabla \big( \log \rho_t(x) - \log \pi(x) \big) \Big), \quad (11)$$

where $\nabla \frac{\delta}{\delta \rho_t} D_{\mathrm{KL}}(\rho_t, \pi) =: \nabla_{W_2} D_{\mathrm{KL}}(\rho_t, \pi)$ is the Wasserstein gradient and the operator $\frac{\delta}{\delta \rho}: \mathcal{P}_2(\mathcal{M}) \to \mathbb{R}$ represents the functional derivative or first variation (see Appendix C for more details). In the particular case of the KL functional, we recover the Fokker-Planck equation,

$$\frac{\partial \rho_t}{\partial t} = \nabla \cdot \Big( \rho_t \nabla \big( \log \rho_t(x) - \log \pi(x) \big) \Big) = -\nabla \cdot \big( \rho_t \nabla \log \pi(x) \big) + \nabla^2 \rho_t(x),$$

which admits the posterior $\pi(x)$ as its unique stationary distribution. The deterministic particle dynamics ODE [2] related to Eq. (11), namely the mean-field Wasserstein dynamics, is then given by:

$$\frac{dx}{dt} = -\nabla \big( \log \rho_t(x) - \log \pi(x) \big). \quad (12)$$

Considering a discretization of Eq. (12) for a particle system $\{x_i\}_{i=1}^{n}$ and a small step size $\epsilon_t$, we can rewrite Eq. (12) as:

$$x_i^{t+1} = x_i^{t} + \epsilon_t \big( \nabla \log \pi(x_i^{t}) - \nabla \log \rho^{t}(x_i^{t}) \big). \quad (13)$$

Unfortunately, we do not have access to the analytical form of the gradient $\nabla \log \rho^{t}$, so an approximation is needed. At this point, it is crucial to observe the similarity between the discretization of the Wasserstein gradient flow in Eq. (13) and the repulsive update in Eq. (3) to notice how, if the kernelized repulsion is an approximation of the gradient of the log empirical particle measure, the update rule minimizes the KL divergence between the particle measure and the target posterior.

Different sample-based approximations of this gradient that use a kernel function have recently been studied. The simplest one is given by kernel density estimation (KDE) (details in Appendix E), $\tilde{\rho}^{t}(x) \propto \sum_{i=1}^{n} k(x, x_i^{t})$, where $k(\cdot, \cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and the gradient of its log density is given by [70]:

$$\nabla \log \tilde{\rho}^{t}(x) = \frac{\sum_{j=1}^{n} \nabla_x k(x, x_j^{t})}{\sum_{j=1}^{n} k(x, x_j^{t})}. \quad (14)$$

Using this approximation in Eq. (13), we obtain:

$$x_i^{t+1} = x_i^{t} + \epsilon_t \left( \nabla \log \pi(x_i^{t}) - \frac{\sum_{j=1}^{n} \nabla_{x_i^{t}} k(x_i^{t}, x_j^{t})}{\sum_{j=1}^{n} k(x_i^{t}, x_j^{t})} \right), \quad (15)$$

where, if we substitute the posterior for $\pi$, we obtain an expression for the repulsive force in Eq. (3). This shows that if the repulsive term in Eq. (3) is the normalized sum of the kernel gradients, $\mathcal{R} = \sum_{j=1}^{n} \nabla_{x_i^{t}} k(x_i^{t}, x_j^{t}) \big/ \sum_{j=1}^{n} k(x_i^{t}, x_j^{t})$, we do not only encourage diversity of the ensemble members and thus avoid collapse, but, surprisingly, in the asymptotic limit of $n \to \infty$, where the KDE approximation is exact [66], we also converge to the true Bayesian posterior!

Nevertheless, approximating the gradient of the empirical measure with the KDE can lead to suboptimal performance, as already studied by Li and Turner [46]. They instead introduced a new Stein gradient estimator (SGE) that offers better performance while maintaining the same computational cost. Even more recently, Shi et al. [69] introduced a spectral method for gradient estimation (SSGE), which also allows for a simple estimation on out-of-sample points. These two estimators can be used in Eq. (13) to formulate the following update rules with two alternative repulsive forces. The one using the Stein estimator, which we will call SGE-WGD, is:

$$x_i^{t+1} = x_i^{t} + \epsilon_t \left( \nabla \log \pi(x_i^{t}) + \sum_{j=1}^{n} (\mathbf{K} + \eta \mathbf{I})^{-1}_{ij} \sum_{k=1}^{n} \nabla_{x_k^{t}} k(x_k^{t}, x_j^{t}) \right), \quad (16)$$

where $\mathbf{K}$ is the kernel Gram matrix, $\eta$ a small constant, and $\mathbf{I}$ the identity matrix. We can notice an important difference between KDE and SGE: the former only considers the interaction of the $i$-th particle being updated with all the others, while the latter simultaneously also considers the interactions between the remaining particles. The spectral method, which we will call SSGE-WGD, leads to the following update rule:

$$x_i^{t+1} = x_i^{t} + \epsilon_t \Big( \nabla \log \pi(x_i^{t}) - \widehat{\nabla \log \rho^{t}}_{\mathrm{SSGE}}(x_i^{t}) \Big), \quad (17)$$

where $\widehat{\nabla \log \rho^{t}}_{\mathrm{SSGE}}$ denotes the spectral Stein gradient estimate of Shi et al. [69], which expands the log-density gradient in the leading $J$ eigenpairs of the kernel matrix; here $\lambda_j$ is the $j$-th eigenvalue of the kernel matrix and $u_{jk}$ is the $k$-th component of the $j$-th eigenvector. Computationally, both SSGE and SGE have a cost of $\mathcal{O}(M^3 + M^2 d)$, with $M$ being the number of points and $d$ their dimensionality. SSGE has an additional cost for predictions of $\mathcal{O}(M(d + J))$, where $J$ is the number of eigenvalues.

Figure 2: Single Gaussian. We show samples from SVGD, KDE-WGD, SGE-WGD, and SSGE-WGD (from left to right). The upper and right plots show the empirical one-dimensional marginal distributions obtained using KDE on the samples (red) and on the particles (blue).
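The following self-contained sketch mimics the spirit of the single-Gaussian experiment in Figure 2: particles are evolved with the discretized flow of Eq. (13), with $\nabla \log \rho^{t}$ estimated either by the KDE gradient of Eq. (14) or by the Stein gradient estimator. The fixed lengthscale and the toy target are choices made here for brevity; the paper's experiments use the median heuristic instead:

```python
import numpy as np

def kde_grad(X, h):
    """KDE estimate of grad log rho at the particles (Eq. (14))."""
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)
    return (-2.0 / h * diff * K[..., None]).sum(1) / K.sum(1, keepdims=True)

def sge_grad(X, h, eta=0.01):
    """Stein gradient estimator of grad log rho (Li and Turner [46]):
    -(K + eta I)^{-1} <grad, K>, with <grad, K>_j = sum_k grad_{x_k} k(x_k, x_j)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)
    grad_K = (-2.0 / h * diff * K[..., None]).sum(0)     # (n, d)
    return -np.linalg.solve(K + eta * np.eye(n), grad_K)

# Toy target: standard 2D Gaussian, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * 3.0 + 5.0                # badly initialised particles
eps, h = 0.05, 1.0
for _ in range(500):                                     # discretised WGD, Eq. (13)
    X = X + eps * (-X - sge_grad(X, h))                  # swap in kde_grad for KDE-WGD
```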
### 3.3 Gradient flow in function space

To theoretically justify the update rule introduced in function space in Eq. (5), we can rewrite the Liouville equation for the gradient flow in Eq. (11) in function space as

$$\frac{\partial \rho_t}{\partial t} = \nabla \cdot \Big( \rho_t\, \nabla \big( \log \rho_t(f) - \log \pi(f) \big) \Big). \quad (18)$$

Following this, the mean-field functional dynamics are

$$\frac{df}{dt} = -\nabla \big( \log \rho_t(f) - \log \pi(f) \big). \quad (19)$$

Using the same KDE approximation as above, we can obtain a discretized evolution in function space and, with it, an explicit form for the repulsive force in Eq. (4):

$$\mathbf{w}_i^{t+1} = \mathbf{w}_i^{t} + \epsilon_t \left( \frac{\partial f_i^{t}}{\partial \mathbf{w}_i^{t}} \right)^{\!\top} \left( \nabla_{f_i^{t}} \log p(f_i^{t} \mid \mathcal{D}) - \frac{\sum_{j=1}^{n} \nabla_{f_i^{t}} k(f_i^{t}, f_j^{t})}{\sum_{j=1}^{n} k(f_i^{t}, f_j^{t})} \right). \quad (20)$$

The update rules using the SGE and SSGE approximations follow as in the parametric case. It is important to notice that this update rule requires the function-space prior gradient: $\nabla_{f_j} \log p(f_j \mid x, y) = \nabla_{f_j} \log p(y \mid x, f_j) + \nabla_{f_j} \log p(f_j)$. If one wants to use an implicit prior defined in weight space, an additional estimator is needed due to its analytical intractability. We again adopted the SSGE, introduced by Shi et al. [69], which was already used for a similar purpose in Sun et al. [71]. It is also interesting to note that the update rule in Eq. (20) readily allows for the use of alternative priors that have an analytical form, such as Gaussian processes. This is an important feature of our method that allows for an explicit encoding of function-space properties, which can be useful, for example, for achieving better out-of-distribution detection capabilities [14].

### 3.4 The choice of the kernel

A repulsive effect can always be created in the ensemble by means of the gradient of any kernel function that measures the similarity between two members. Nevertheless, it is important to keep in mind that to ensure asymptotic convergence to the Bayesian posterior, the repulsive component must be a consistent estimator of the gradient in Eq. (13), as shown in Section 3. Therefore, some important constraints on the kernel choice are needed. In particular, the SGE and SSGE need a kernel function belonging to the Stein class (see Shi et al. [69] for more details). For the KDE, on the other hand, any symmetric probability density function can be used. On this subject, the work of Aggarwal et al. [1] has shown how the Manhattan distance metric (L1 norm) might be preferable over the Euclidean distance metric (L2 norm) in high-dimensional settings. We performed some additional experiments using the L1 norm (Laplace kernel) for the KDE but could not observe any substantial difference compared to using the L2 norm. Further investigations regarding this hypothesis are left for future research.

Figure 3: BNN 2D classification. We show the entropy of the predictive posteriors. Again, the function-space methods capture the uncertainty better than the weight-space ones, thus approaching the gold-standard HMC posterior.

## 4 Experiments

In this section, we compare the different proposed WGD methods with deep ensembles and SVGD on synthetic sampling, regression, and classification tasks, as well as on real-world image classification tasks. We use an RBF kernel (except where otherwise specified) with the popular median heuristic [51] to choose the kernel bandwidth. In our experiments, an adaptive bandwidth leads to better performance than fixing and tuning a single constant value for the entire evolution of the particles. We also quantitatively assess the uncertainty estimation of the methods in terms of calibration and OOD detection. In our experiments, we report the test accuracy, negative log-likelihood (NLL), and expected calibration error (ECE) [58]. To assess the robustness on out-of-distribution (OOD) data, we report the ratio between the predictive entropy on OOD and test data points (Ho/Ht) and the OOD detection area under the ROC curve, AUROC(H) [47].
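For reference, a sketch of one common variant of the median heuristic mentioned above, for the kernel parameterization $k(x, x') = \exp(-\lVert x - x' \rVert^2 / h)$; the exact normalization (e.g., dividing by $\log n$ or $\log(n+1)$) varies between implementations and is an assumption here:

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Adaptive RBF bandwidth: median of the pairwise squared distances,
    divided by log(n + 1) so that sum_j k(x_i, x_j) is roughly O(1) per particle.
    Recomputed at every update when used with an adaptive schedule."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.median(sq) / np.log(X.shape[0] + 1)
```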
Moreover, to assess the diversity of the ensembles generated by the different methods in function space, we measure the functional diversity using the model disagreement (MD) (details in Appendix B). In particular, we report the ratio between the average model disagreement on the OOD and test data points (MDo/MDt) and, additionally, the OOD detection AUROC(MD) computed using this measure instead of the entropy.

**Sampling from synthetic distributions.** As a sanity check, we first assessed the ability of our different approximations for Wasserstein gradient descent (using KDE, SGE, and SSGE) to sample from a two-dimensional Gaussian distribution (Figure 2). We see that our SGE-WGD and SSGE-WGD and the SVGD fit the target almost perfectly. We also tested the different methods on a more complex two-dimensional Funnel distribution [59] and present the results in Figure F.1 in the Appendix. There, SGE-WGD and SVGD also perform best.

**BNN 1D regression.** We then assessed the different methods in fitting a BNN posterior on a synthetically generated one-dimensional regression task. The results are reported in Figure 1, showing the mean prediction and 1, 2, and 3 standard deviations of the predictive distribution. We can see that all methods performing inference in weight space (DE, w-SVGD, WGD) are unable to capture the epistemic uncertainty between the two clusters of training data points. Conversely, the functional methods (f-SVGD, f-WGD) are perfectly able to infer the diversity of the hypotheses in this region due to the lack of training evidence. They thereby achieve a predictive posterior that very closely resembles the one obtained with the gold-standard HMC sampling.
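The predictive bands in Figure 1 can be assembled from the member predictions as the moments of an equally weighted Gaussian mixture; the sketch below assumes each member outputs a Gaussian predictive $\mathcal{N}(\mu_k(x), \sigma_k^2(x))$ over the test inputs:

```python
import numpy as np

def gaussian_mixture_bands(mu, sigma):
    """Combine per-member Gaussian predictions, mu and sigma of shape (K, N),
    into the mixture mean and total standard deviation (law of total variance:
    average aleatoric variance plus between-member, i.e. epistemic, variance)."""
    mean = mu.mean(axis=0)                              # (N,)
    var = (sigma ** 2).mean(axis=0) + mu.var(axis=0)
    return mean, np.sqrt(var)
```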
**BNN 2D classification.** Next, we investigated the predictive performance and quality of uncertainty estimation of the methods in a two-dimensional synthetic classification setting. The results are displayed in Figure 3. We can clearly observe that the weight-space methods are overconfident and do not capture the uncertainty well. Moreover, all the functions seem to collapse to the optimal classifier. These methods thus only account for uncertainty close to the decision boundaries and to the origin region, for which the uncertainty is purely aleatoric. In this setting, f-SVGD suffers from similar issues as the weight-space methods, being overconfident away from the training data. Conversely, our f-WGD methods are confident (low entropy) around the data but not out-of-distribution, thus representing the epistemic uncertainty better. This suggests that the functional diversity captured by this method naturally leads to a distance-aware uncertainty estimation [49, 20], a property that translates into confident predictions only in the proximity of the training data, allowing for a principled OOD detection.

Table 1: BNN image classification. AUROC(H) is the AUROC computed using the entropy, whereas AUROC(MD) is computed using the model disagreement. Ho/Ht is the ratio of the entropies on OOD and test points, and MDo/MDt is the corresponding ratio for the model disagreement. We see that the best accuracy is achieved by our WGD methods, while our f-WGD methods yield the best OOD detection and functional diversity. All our proposed methods improve over standard deep ensembles in terms of accuracy and diversity, highlighting the effect of our repulsion.

Fashion-MNIST (OOD: MNIST):

| Method | AUROC(H) | AUROC(MD) | Accuracy (%) | Ho/Ht | MDo/MDt | ECE | NLL |
|---|---|---|---|---|---|---|---|
| Deep ensemble [42] | 0.958 ± 0.001 | 0.975 ± 0.001 | 91.122 ± 0.013 | 6.257 ± 0.005 | 6.394 ± 0.001 | 0.012 ± 0.001 | 0.129 ± 0.001 |
| SVGD [51] | 0.960 ± 0.001 | 0.973 ± 0.001 | 91.134 ± 0.024 | 6.315 ± 0.019 | 6.395 ± 0.018 | 0.014 ± 0.001 | 0.127 ± 0.001 |
| f-SVGD [76] | 0.956 ± 0.001 | 0.975 ± 0.001 | 89.884 ± 0.015 | 5.652 ± 0.009 | 6.531 ± 0.005 | 0.013 ± 0.001 | 0.150 ± 0.001 |
| hyper-DE [80] | 0.968 ± 0.001 | 0.981 ± 0.001 | 91.160 ± 0.007 | 6.682 ± 0.065 | 7.059 ± 0.152 | 0.014 ± 0.001 | 0.128 ± 0.001 |
| kde-WGD (ours) | 0.960 ± 0.001 | 0.970 ± 0.001 | 91.238 ± 0.019 | 6.587 ± 0.019 | 6.379 ± 0.018 | 0.014 ± 0.001 | 0.128 ± 0.001 |
| sge-WGD (ours) | 0.960 ± 0.001 | 0.970 ± 0.001 | 91.312 ± 0.016 | 6.562 ± 0.007 | 6.363 ± 0.009 | 0.012 ± 0.001 | 0.128 ± 0.001 |
| ssge-WGD (ours) | 0.968 ± 0.001 | 0.979 ± 0.001 | 91.198 ± 0.024 | 6.522 ± 0.009 | 6.610 ± 0.012 | 0.012 ± 0.001 | 0.130 ± 0.001 |
| kde-fWGD (ours) | 0.971 ± 0.001 | 0.980 ± 0.001 | 91.260 ± 0.011 | 7.079 ± 0.016 | 6.887 ± 0.015 | 0.015 ± 0.001 | 0.125 ± 0.001 |
| sge-fWGD (ours) | 0.969 ± 0.001 | 0.978 ± 0.001 | 91.192 ± 0.013 | 7.076 ± 0.004 | 6.900 ± 0.005 | 0.015 ± 0.001 | 0.125 ± 0.001 |
| ssge-fWGD (ours) | 0.971 ± 0.001 | 0.980 ± 0.001 | 91.240 ± 0.022 | 7.129 ± 0.006 | 6.951 ± 0.005 | 0.016 ± 0.001 | 0.124 ± 0.001 |

CIFAR-10 (OOD: SVHN):

| Method | AUROC(H) | AUROC(MD) | Accuracy (%) | Ho/Ht | MDo/MDt | ECE | NLL |
|---|---|---|---|---|---|---|---|
| Deep ensemble [42] | 0.843 ± 0.004 | 0.736 ± 0.005 | 85.552 ± 0.076 | 2.244 ± 0.006 | 1.667 ± 0.008 | 0.049 ± 0.001 | 0.277 ± 0.001 |
| SVGD [51] | 0.825 ± 0.001 | 0.710 ± 0.002 | 85.142 ± 0.017 | 2.106 ± 0.003 | 1.567 ± 0.004 | 0.052 ± 0.001 | 0.287 ± 0.001 |
| f-SVGD [76] | 0.783 ± 0.001 | 0.712 ± 0.001 | 84.510 ± 0.031 | 1.968 ± 0.004 | 1.624 ± 0.003 | 0.049 ± 0.001 | 0.292 ± 0.001 |
| hyper-DE [80] | 0.789 ± 0.001 | 0.743 ± 0.001 | 84.743 ± 0.011 | 1.951 ± 0.010 | 1.690 ± 0.015 | 0.046 ± 0.001 | 0.288 ± 0.001 |
| kde-WGD (ours) | 0.838 ± 0.001 | 0.735 ± 0.004 | 85.904 ± 0.030 | 2.205 ± 0.003 | 1.661 ± 0.008 | 0.053 ± 0.001 | 0.276 ± 0.001 |
| sge-WGD (ours) | 0.837 ± 0.003 | 0.725 ± 0.004 | 85.792 ± 0.035 | 2.214 ± 0.010 | 1.634 ± 0.004 | 0.051 ± 0.001 | 0.275 ± 0.001 |
| ssge-WGD (ours) | 0.832 ± 0.003 | 0.731 ± 0.005 | 85.638 ± 0.038 | 2.182 ± 0.015 | 1.655 ± 0.001 | 0.049 ± 0.001 | 0.276 ± 0.001 |
| kde-fWGD (ours) | 0.791 ± 0.002 | 0.758 ± 0.002 | 84.888 ± 0.030 | 1.970 ± 0.004 | 1.749 ± 0.005 | 0.044 ± 0.001 | 0.282 ± 0.001 |
| sge-fWGD (ours) | 0.795 ± 0.001 | 0.754 ± 0.002 | 84.766 ± 0.060 | 1.984 ± 0.003 | 1.729 ± 0.002 | 0.047 ± 0.001 | 0.288 ± 0.001 |
| ssge-fWGD (ours) | 0.792 ± 0.002 | 0.752 ± 0.002 | 84.762 ± 0.034 | 1.970 ± 0.006 | 1.723 ± 0.005 | 0.046 ± 0.001 | 0.286 ± 0.001 |

**Fashion-MNIST classification.** Moving on to real-world data, we used an image classification setting with the Fashion-MNIST dataset [83] for training and the MNIST dataset [43] as an out-of-distribution (OOD) task. The results are reported in Table 1 (top). We can see that all our methods improve upon standard deep ensembles and SVGD, highlighting the effectiveness of our proposed repulsion terms when training neural network ensembles. In particular, sge-WGD offers the best accuracy, whereas the methods in function space all offer better OOD detection. This is probably due to the fact that these methods achieve a higher entropy ratio and greater functional diversity, measured via the model disagreement, when compared to their weight-space counterparts. Interestingly, they also reach the lowest NLL values.
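A sketch of how the OOD scores reported in Table 1 can be computed from stacked member predictives. The model-disagreement formula below is one plausible choice (mean squared deviation of the member predictives from the ensemble mean) and may differ from the exact definition in Appendix B:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def predictive_entropy(probs):
    """Entropy H of the ensemble-averaged predictive; probs has shape (K, N, C)."""
    p_bar = probs.mean(axis=0)
    return -(p_bar * np.log(p_bar + 1e-12)).sum(axis=-1)

def model_disagreement(probs):
    """Functional-diversity score: mean squared deviation of member predictives
    from the ensemble mean (an assumption about the MD metric, not its exact form)."""
    p_bar = probs.mean(axis=0, keepdims=True)
    return ((probs - p_bar) ** 2).sum(axis=-1).mean(axis=0)

def ood_report(probs_test, probs_ood):
    """Table 1 style scores: the OOD/test ratio of each score and the AUROC of
    separating OOD from test points using that score."""
    out = {}
    for name, score in (("H", predictive_entropy), ("MD", model_disagreement)):
        s_t, s_o = score(probs_test), score(probs_ood)
        labels = np.concatenate([np.zeros_like(s_t), np.ones_like(s_o)])
        out[f"{name}o/{name}t"] = s_o.mean() / s_t.mean()
        out[f"AUROC({name})"] = roc_auc_score(labels, np.concatenate([s_t, s_o]))
    return out
```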
We can also notice how the model disagreement (MD) not only serves its purpose as a metric for the functional heterogeneity of the ensemble but also allows for better OOD detection in comparison to the entropy. To the best of our knowledge, this insight has not been described before, although it has been used in continual learning [31]. Interestingly, using this metric, the hyper-deep ensemble [80] shows OOD detection performance comparable with our repulsive ensembles in function space.

**CIFAR classification.** Finally, we use a ResNet32 architecture [29] on CIFAR-10 [41] with the SVHN dataset [61] as OOD data. The results are reported in Table 1 (bottom). We can see that in this case, the weight-space methods achieve better performance in accuracy and OOD detection using the entropy than the ones in function space. Nevertheless, all our repulsive ensembles improve functional diversity, accuracy, and OOD detection when compared to standard SVGD, whereas the standard deep ensemble achieves the best OOD detection using the entropy.

## 5 Related Work

The theoretical and empirical properties of SVGD have been well studied [40, 48, 13], and it can also be seen as a Wasserstein gradient flow of the KL divergence in the Stein geometry [16, 50] (see Appendix D for more details). Interestingly, a gradient flow interpretation is also possible for (stochastic gradient) MCMC-type algorithms [48], which can be unified under a general particle inference framework [10]. Moreover, our Wasserstein gradient descent using the SGE approximation can also be derived using an alternative formulation as a gradient flow with smoothed test functions [48]. A projected version of WGD has been studied in Wang et al. [75], which could also be readily applied in our framework. Besides particle methods, Bayesian neural networks [54, 59] have gained popularity recently [79, 22, 19, 36], using modern MCMC [59, 79, 22, 24, 21] and variational inference techniques [5, 72, 17, 34]. On the other hand, ensemble methods have also been extensively studied [42, 18, 82, 23, 80, 32, 85, 78], and repulsive interactions between the members have been studied in Wabartha et al. [74]. Moreover, providing Bayesian interpretations for deep ensembles has previously been attempted through the lenses of stationary SGD distributions [56, 8], ensembles of linear models [57], additional random functions [62, 12, 27], approximate inference [82], Stein variational inference [15], and marginal likelihood lower bounds [53], and ensembles have also been shown to provide good approximations to the true BNN posterior in some settings [36]. Furthermore, variational inference in function space has recently gained attention [71], and the limitations of the KL divergence in this setting have been studied in Burt et al. [7].

## 6 Conclusion

We have presented a simple and principled way to improve upon standard deep ensemble methods. To this end, we have shown that the introduction of a kernelized repulsion between members of the ensemble not only improves the accuracy of the predictions but, even more importantly, can be seen as Wasserstein gradient descent on the KL divergence, thus transforming the MAP inference of deep ensembles into proper Bayesian inference. Moreover, we have shown that incorporating functional repulsion between ensemble members can improve the quality of the estimated uncertainties on simple synthetic examples and OOD detection on real-world data, and can approach the true Bayesian posterior more closely.
In future work, it will be interesting to study the impact of the Jacobian in the f-WGD update and its implications for the Liouville equation in more detail, also compared to other neural network Jacobian methods, such as neural tangent kernels [37] and generalized Gauss-Newton approximations [35]. Moreover, it would be interesting to derive explicit convergence bounds for our proposed method and compare them to the existing bounds for SVGD [40].

## Acknowledgments

VF would like to acknowledge financial support from the Strategic Focus Area Personalized Health and Related Technologies of the ETH Domain through the grant #2017-110 and from the Swiss Data Science Center through a PhD fellowship. We thank Florian Wenzel, Alexander Immer, Andrew Gordon Wilson, Pavel Izmailov, Christian Henning, and Johannes von Oswald for helpful discussions. We also thank Dr. Sheldon Cooper, Dr. Leonard Hofstadter, and Penny Hofstadter for their support and inspiration.

## References

[1] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420-434. Springer, 2001.
[2] Luigi Ambrosio and Gianluca Crippa. Continuity equations and ODE flows with non-smooth velocity. Proceedings of the Royal Society of Edinburgh: Section A Mathematics, 144(6):1191-1244, 2014.
[3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
[4] Vijay Badrinarayanan, Bamdev Mishra, and Roberto Cipolla. Symmetry-invariant optimization in deep networks. arXiv preprint arXiv:1511.01754, 2015.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[6] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[7] David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, and Mark van der Wilk. Understanding variational inference in function-space. arXiv preprint arXiv:2011.09421, 2020.
[8] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1-10. IEEE, 2018.
[9] An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910-927, 1993.
[10] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
[11] Jiefeng Chen, Xi Wu, Yingyu Liang, Somesh Jha, et al. Robust out-of-distribution detection in neural networks. arXiv preprint arXiv:2003.09711, 2020.
[12] Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, and Richard Turner. Conservative uncertainty estimation by fitting prior networks. In International Conference on Learning Representations, 2019.
[13] Francesco D'Angelo and Vincent Fortuin. Annealed Stein variational gradient descent. arXiv preprint arXiv:2101.09815, 2021.
[14] Francesco D'Angelo and Christian Henning. Uncertainty-based out-of-distribution detection requires suitable function space priors. arXiv preprint arXiv:2110.06020, 2021.
[15] Francesco D'Angelo, Vincent Fortuin, and Florian Wenzel. On Stein variational neural network ensembles. arXiv preprint arXiv:2106.10760, 2021.
[16] A. Duncan, Nikolas Nuesken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.
[17] Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-an Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable Bayesian neural nets with rank-1 factors. arXiv preprint arXiv:2005.07186, 2020.
[18] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[19] Vincent Fortuin. Priors in Bayesian deep learning: A review. arXiv preprint arXiv:2105.06868, 2021.
[20] Vincent Fortuin, Mark Collier, Florian Wenzel, James Allingham, Jeremiah Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, and Effrosyni Kokiopoulou. Deep classifiers with label noise modeling and distance awareness. arXiv preprint arXiv:2110.02609, 2021.
[21] Vincent Fortuin, Adrià Garriga-Alonso, Mark van der Wilk, and Laurence Aitchison. BNNpriors: A library for Bayesian neural network inference with different prior distributions. Software Impacts, page 100079, 2021.
[22] Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. arXiv preprint arXiv:2102.06571, 2021.
[23] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv preprint arXiv:1802.10026, 2018.
[24] Adrià Garriga-Alonso and Vincent Fortuin. Exact Langevin dynamics with stochastic gradients. arXiv preprint arXiv:2102.01691, 2021.
[25] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020.
[26] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001, October 1990. ISSN 0162-8828. doi: 10.1109/34.58871. URL https://doi.org/10.1109/34.58871.
[27] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[30] Robert Hecht-Nielsen. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pages 129-135. Elsevier, 1990.
[31] Christian Henning, Maria R. Cervera, Francesco D'Angelo, Johannes von Oswald, Regina Traber, Benjamin Ehret, Seijin Kobayashi, João Sacramento, and Benjamin F. Grewe. Posterior meta-replay for continual learning. arXiv preprint arXiv:2103.01133, 2021.
[32] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109, 2017.
[33] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. CoRR, abs/1704.00109, 2017. URL http://arxiv.org/abs/1704.00109.
[34] Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. Scalable marginal likelihood estimation for model selection in deep learning. arXiv preprint arXiv:2104.04975, 2021.
[35] Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural nets via local linearization. In International Conference on Artificial Intelligence and Statistics, pages 703-711. PMLR, 2021.
[36] Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Wilson. What are Bayesian neural network posteriors really like? arXiv preprint arXiv:2104.14421, 2021.
[37] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8580-8589, 2018.
[38] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.
[39] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[40] Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. Advances in Neural Information Processing Systems, 33, 2020.
[41] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
[42] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413, 2017.
[43] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[44] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[45] E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, Special Issue on Neural Networks, 1990.
[46] Yingzhen Li and Richard E. Turner. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.
[47] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
[48] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pages 4082-4092. PMLR, 2019.
[49] Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. arXiv preprint arXiv:2006.10108, 2020.
[50] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115-3123, 2017.
[51] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.
[52] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276-284. PMLR, 2016.
[53] Clare Lyle, Lisa Schut, Robin Ru, Yarin Gal, and Mark van der Wilk. A Bayesian perspective on training speed and model selection. Advances in Neural Information Processing Systems, 33, 2020.
[54] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
[55] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019.
[56] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
[57] Alexander G. de G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Sample-then-optimize posterior sampling for Bayesian linear models. In NeurIPS Workshop on Advances in Approximate Bayesian Inference, 2017.
[58] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[59] Radford M. Neal. Bayesian learning for neural networks. 1995.
[60] Radford M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
[61] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[62] Ian Osband, Benjamin Van Roy, Daniel J. Russo, Zheng Wen, et al. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1-62, 2019.
[63] Johannes von Oswald, Seijin Kobayashi, Joao Sacramento, Alexander Meulemans, Christian Henning, and Benjamin F. Grewe. Neural networks with late-phase weights. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=C0qJUx5dxFb.
[64] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation.
[65] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13991-14002, 2019.
[66] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[67] Alexandre Rame and Matthieu Cord. DICE: Diversity in deep ensembles via conditional redundancy adversarial estimation. arXiv preprint arXiv:2101.05544, 2021.
[68] Geoffrey Roeder, Luke Metz, and Diederik P. Kingma. On linear identifiability of learned representations. arXiv preprint arXiv:2007.00810, 2020.
[69] Jiaxin Shi, Shengyang Sun, and Jun Zhu. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, pages 4644-4653. PMLR, 2018.
[70] Radhey S. Singh. Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. The Annals of Statistics, pages 394-399, 1977.
[71] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
[72] Jakub Swiatkowski, Kevin Roth, Bastiaan S. Veeling, Linh Tran, Joshua V. Dillon, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. The k-tied normal distribution: A compact parameterization of Gaussian mean field posteriors in Bayesian neural networks. arXiv preprint arXiv:2002.02655, 2020.
[73] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/1906.00695.
[74] Maxime Wabartha, Audrey Durand, Vincent François-Lavet, and Joelle Pineau. Handling black swan events in deep learning with diversely extrapolated neural networks. In IJCAI, pages 2140-2147, 2020.
[75] Yifei Wang, Peng Chen, and Wuchen Li. Projected Wasserstein gradient descent for high-dimensional Bayesian inference. arXiv preprint arXiv:2102.06350, 2021.
[76] Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for Bayesian neural networks. arXiv preprint arXiv:1902.09754, 2019.
[77] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.
[78] Yeming Wen, Dustin Tran, and Jimmy Ba. BatchEnsemble: An alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
[79] Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
[80] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. In Advances in Neural Information Processing Systems, 2020.
[81] Christopher K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203-1216, 1998.
[82] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
[83] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[84] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[85] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. arXiv preprint arXiv:1902.03932, 2019.