# Nonlinear MCMC for Bayesian Machine Learning

James Vuckovic (james@jamesvuckovic.com)

Abstract

We explore the application of a nonlinear MCMC technique first introduced in [1] to problems in Bayesian machine learning. We provide a convergence guarantee in total variation that uses novel results for long-time convergence and large-particle ("propagation of chaos") convergence. We apply this nonlinear MCMC technique to sampling problems including a Bayesian neural network on CIFAR10.

1 Introduction

Characterizing uncertainty is a fundamental problem in machine learning. It is often desirable for a machine learning model to provide a prediction and a measure of how certain the model is about that prediction. Having access to a robust measure of uncertainty becomes particularly important in real-world, high-risk scenarios such as self-driving cars [2-4], medical diagnosis [5, 6], and classifying harmful text [7]. However, despite the need for uncertainty in machine learning predictions, it is well known that traditional ML training, i.e. training based on optimizing an objective function, frequently does not provide robust uncertainty measures [8], yielding overconfident predictions for popular neural networks such as ResNets [9].¹

An appealing alternative to the traditional optimization paradigm for ML is the Bayesian probabilistic framework, due to its relatively simple formulation and extensive theoretical grounding; see for example [10]. From the probabilistic perspective of machine learning [10], one combines a prior P(θ) over the parameter space Θ and a likelihood of the data given model parameters P(D|θ) using Bayes' rule to obtain a posterior over the parameters P(θ|D) ∝ P(D|θ)P(θ). The traditional approach in machine learning is to optimize the posterior (or the likelihood) to obtain θ̂ ∈ arg max_θ P(θ|D) and generate predictions via P(y|x, D) ≈ P(y|x, θ̂). However, if we adopt the Bayesian approach, the posterior characterizes the uncertainty about the parameters of the model (i.e. epistemic uncertainty), which can propagate to uncertainty about a prediction by integration: P(y|x, D) = ∫ P(y|x, θ) P(θ|D) dθ. This paper studies the problem of how to approximate this integration with samples from P(θ|D).

1.1 Contributions

Our main contribution is the novel analysis of a modification of the general nonlinear Markov Chain Monte Carlo (MCMC) sampling method from [1] to obtain quantitative convergence guarantees in both the number of iterations and the number of samples. We apply these general results to determine the convergence of two specific nonlinear MCMC samplers. In experiments, we compare these nonlinear MCMC samplers to their linear counterparts, and find that nonlinear MCMC provides additional flexibility in designing sampling algorithms with performance as good as, or better than, the linear variety.

¹ In Appendix C.2.4, we provide an experiment that demonstrates this effect.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

1.2 Background

Bayesian ML & MCMC. In Bayesian machine learning, the computationally difficult step is integration, since the integral ∫ P(y|x, θ) P(dθ|D) is not analytically solvable except in rare cases. In practice, one typically uses a Monte Carlo approximation such as

\[
\int P(y\mid x,\theta)\, P(\theta\mid D)\, d\theta \;\approx\; \frac{1}{N}\sum_{i=1}^{N} P(y\mid x, \theta_i), \qquad \theta_i \overset{\text{iid}}{\sim} P(\theta\mid D),
\]

where the expected error in this approximation is well known to converge to zero like O(1/√N) by the Central Limit Theorem (CLT).
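To make this concrete, here is a minimal sketch (not taken from the paper's code; the `predict` function and the posterior samples are assumed to be given) of how a set of posterior samples is turned into a Monte Carlo estimate of the predictive distribution:

```python
import numpy as np

def predictive_distribution(predict, posterior_samples, x):
    """Monte Carlo estimate of P(y | x, D) = integral of P(y | x, theta) P(theta | D) dtheta.

    predict(theta, x)  -> vector of class probabilities P(y | x, theta)
    posterior_samples  -> iterable of parameter samples theta_i ~ P(theta | D)
    """
    probs = np.stack([predict(theta, x) for theta in posterior_samples])
    # Average over samples; for iid samples the error decays like O(1/sqrt(N)).
    return probs.mean(axis=0)
```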
There are various approaches to sampling from P(θ|D), but Markov chain Monte Carlo (MCMC) is perhaps the most widely used. The basic idea of MCMC is to use a Markov transition kernel T with stationary measure P(θ|D) to simulate a Markov chain θ_{n+1} ∼ T(θ_n, ·) that converges rapidly to P(θ|D). In this case, we can estimate

\[
\int P(y\mid x,\theta)\, P(\theta\mid D)\, d\theta \;\approx\; \frac{1}{N}\sum_{i=1}^{N} P(y\mid x, \theta^i_{n_{\mathrm{sim}}}),
\]

where n_sim is some large number of simulation steps and {θ^i_n}_{n=0}^{n_sim}, i = 1, ..., N, are independent Markov chains governed by T. The basic problem is then to design an efficient transition kernel T. There is a vast body of literature studying various choices of T; some well-known choices are the Metropolis-Hastings algorithm [11, 12], the Gibbs sampler [13], the Langevin algorithm [14-17], Metropolis-Adjusted Langevin [18, 19], and Hamiltonian Monte Carlo [20-24].

However, there are various challenges in Bayesian ML that make applying these samplers difficult in practice. One challenge is that the posterior P(θ|D) can be highly multimodal [25, 26], which makes it difficult to ensure that the Markov chain θ_n explores all modes of the target distribution. One can combat this issue by employing auxiliary samplers that explore a more tractable variation of P(θ|D) [27, 28]. Other methods that empirically improve posterior sampling quality include tempering [29-32], RMSProp-style preconditioning [33], and adaptive MCMC algorithms [34, 35] such as the popular No-U-Turn Sampler [36].

Nonlinear MCMC. Another class of powerful MCMC algorithms, which is less studied in the context of Bayesian ML, arises from allowing the transition kernel T to depend on the distribution of the Markov chain, as in θ_{n+1} ∼ T_{Distribution(θ_n)}(θ_n, ·). This approach gives rise to so-called nonlinear MCMC, since {θ_n} is no longer a true Markov chain. Nonlinear Markov theory is a rich area of research [37-42] and has strong connections to nonlinear filtering problems [43, 44], sequential Monte Carlo [45, 46], and nonlinear Feynman-Kac models [47]. One can replace Distribution(θ_n), which is often intractable, with an empirical estimate Distribution(θ_n) ≈ (1/N) Σ_{i=1}^N δ_{θ^i_n} to obtain interacting particle MCMC (iMCMC, or iPMCMC) methods; see for example [48, 49, 1, 50, 51].

Our view is that nonlinear MCMC offers some appealing features that traditional linear MCMC lacks. One such feature is the ability to leverage global information about the state space contained in Distribution(θ^i_n) to improve exploration, a central issue in Bayesian ML. Another feature is the increased flexibility of nonlinear MCMC algorithms, which can be leveraged to correct biases that are introduced by other design decisions in MCMC for Bayesian ML, such as tempering. These features will be explored empirically in Section 4. However, the theoretical analysis of nonlinear MCMC presents an added difficulty in that the particles {θ^1_n, ..., θ^N_n} of an interacting particle system are now statistically dependent. This means that, in addition to studying the long-time behaviour, which is classical in MCMC [52, 53], one must study the large-particle behaviour separately to obtain Monte Carlo estimates, since the CLT does not apply. One such large-particle behaviour is the propagation of chaos [41], which is the tendency for groups of interacting particles to become independent as the number of particles N increases. We will need both of these elements, long-time convergence and propagation of chaos, to properly characterize the convergence of nonlinear MCMC.
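For reference, here is a minimal sketch of the simplest linear building block mentioned above, a random-walk Metropolis-Hastings kernel T; it is a generic illustration rather than anything used in the paper, and `log_target`, `step_size`, and `rng` are assumed inputs:

```python
import numpy as np

def rw_metropolis_step(theta, log_target, step_size, rng):
    """One step of a target-invariant random-walk Metropolis-Hastings kernel T(theta, .)."""
    proposal = theta + step_size * rng.standard_normal(theta.shape)
    # Accept with probability min(1, pi(proposal) / pi(theta)); only the ratio is needed,
    # so the (unknown) normalizing constant of the target never appears.
    log_accept = log_target(proposal) - log_target(theta)
    if np.log(rng.uniform()) < log_accept:
        return proposal
    return theta
```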
Other Sampling Methods. Finally, let us mention that MCMC is certainly not the only way to obtain Monte Carlo sample estimates in Bayesian ML; some popular examples include MC dropout [8], black-box variational inference [54], and normalizing flows [55, 56].

1.3 Common Notation

Let P(R^d) be the set of probability measures on the measurable space (R^d, B(R^d)). For µ ∈ P(R^d) and f ∈ B_b(R^d) := {f : R^d → R | f is bounded}, we will denote µ(f) := ∫ f dµ. If K : R^d × B(R^d) → [0, 1] is a Markov kernel² then we will denote Kf(x) := ∫ f(y) K(x, dy) and µK(dy) := ∫ µ(dx) K(x, dy). Finally, for y := {y_1, ..., y_N} ⊂ R^d, we will denote the empirical measure of y as m(y) := (1/N) Σ_{i=1}^N δ_{y_i} ∈ P(R^d).

2 Nonlinear MCMC

In this section, we outline the family of MCMC algorithms that will be studied in the rest of the work. We will use general notation for simplicity, but this difference is merely cosmetic; the target distribution π in this section corresponds directly to P(θ|D) from the previous section.

2.1 Nonlinear Jump Interaction Markov Kernels

To specify an MCMC algorithm, we must specify the Markov transition kernel. The family of nonlinear Markov kernels that we will study was introduced in [1] and is a mixture of a linear kernel, denoted K, and a nonlinear jump-interaction kernel indexed by a probability measure ν, denoted J_ν:

\[
K_{\varepsilon,\nu}(x, dy) := (1-\varepsilon)\, K(x, dy) + \varepsilon\, J_\nu(x, dy), \qquad (1)
\]

where ε ∈ (0, 1) is the mixture hyperparameter. The Markov kernel K_{ε,ν} will be the main object of interest throughout this paper. We will give specific examples of J_ν in Section 2.2, which were also introduced in [1]. Despite building on the constructions of [1], this work proceeds in some substantially different directions; see Appendix A.1 for more details.

Mean Field System. Now we show how the kernel K_{ε,ν} can be used to construct a Markov chain. Following [1], we use an auxiliary Markov chain {Y_n} with transition kernel Q on the same state space as K_{ε,ν} (i.e. R^d) to obtain the nonlinear Markov chain {(Y_n, X_n)}_{n=0}^∞ defined by

\[
\begin{cases}
Y_{n+1} \sim Q(Y_n, \cdot) \\
\nu_{n+1} := \mathrm{Distribution}(Y_{n+1}) \\
X_{n+1} \sim K_{\varepsilon,\nu_{n+1}}(X_n, \cdot) \\
Y_0 \sim \nu_0, \; X_0 \sim \mu_0
\end{cases} \qquad (2)
\]

where µ_0, ν_0 ∈ P(R^d) are the initial distributions and ∼ denotes "sample from". One should interpret this as a sequence of steps: first we sample the auxiliary state Y_{n+1} from Q, then we obtain the distribution of Y_{n+1}, denoted ν_{n+1}, and we use this distribution to index the primary kernel K_{ε,ν_{n+1}} and obtain a sample X_{n+1}. We sample X_{n+1} with probability (1 − ε) from the linear kernel K, and with probability ε it will jump according to J_{ν_{n+1}}(X_n, ·). Because the Markov dynamics depend on Distribution(Y_n), we call this a "mean field system".

Interacting Particle System. One issue with the mean field system (2) is the fact that computing Distribution(Y_{n+1}) is generally impossible except in special cases. Hence, to get a viable simulation algorithm, we must approximate Distribution(Y_n), and we do this by replacing Distribution(Y_n) with its empirical measure estimated from a set of N particles Y_n := {Y^1_n, ..., Y^N_n} as follows:

\[
\begin{cases}
Y^i_{n+1} \sim Q(Y^i_n, \cdot), & i = 1, \dots, N \\
\nu^N_{n+1} := m(Y_{n+1}) \\
X^i_{n+1} \sim K_{\varepsilon,\nu^N_{n+1}}(X^i_n, \cdot), & i = 1, \dots, N \\
Y^i_0 \sim \nu_0, \; X^i_0 \sim \mu_0 \ \text{iid}, & i = 1, \dots, N.
\end{cases} \qquad (3)
\]

2.2 Application to MCMC

Now we detail how to apply K_{ε,ν} and the Markov chains (2) and (3) to MCMC. In particular, we must understand how to choose Q, K, J_ν such that K_{ε,ν} will be invariant w.r.t. a target distribution π.

² i.e. K(x, ·) is a probability measure for all x ∈ R^d and K(·, A) is measurable for all A ∈ B(R^d)

As is usually the case in probabilistic inference problems, we will assume that the target distribution π is known only up to a normalizing constant and that it has a density, also denoted π.
We also make the simplifying assumption that Q has an invariant measure π⋆ (also with density denoted π⋆), i.e. π⋆Q = π⋆. This is not burdensome; in practice we can, and will, obtain Q from a linear MCMC algorithm for some choice of π⋆. In fact, being able to choose π⋆ is a powerful design parameter of our methods, as we will see in Section 4. We will also assume that the linear kernel K is π-invariant, i.e. πK = π.

To see how we can ensure that π is K_{ε,ν}-invariant, consider the fact that we will design Q s.t. ν_n := Distribution(Y_n) converges to π⋆. This means we will eventually be sampling from the kernel K_{ε,π⋆}, and we already have π-invariance of K. Therefore, if we arrange for J_{π⋆} to be π-invariant, π will be invariant for K_{ε,π⋆}, since

\[
\pi K_{\varepsilon,\pi_\star} = (1-\varepsilon)\, \pi K + \varepsilon\, \pi J_{\pi_\star} = (1-\varepsilon)\pi + \varepsilon\pi = \pi.
\]

Intuitively, if the auxiliary chain converges to a steady state and the jumps in that steady state preserve π (and K preserves π), then so will K_{ε,π⋆}. Now the remaining task is to design nonlinear interaction kernels J_ν that will yield good performance; we detail two choices below.

Boltzmann-Gibbs Interaction. The first choice of J_ν we will investigate, from [1], relies on the Boltzmann-Gibbs transformation [47], which we now explain. Let G : R^d → (0, ∞) be a potential function; then the Boltzmann-Gibbs (BG) transformation is a nonlinear mapping Ψ_G : P(R^d) → P(R^d) defined by

\[
\Psi_G(\mu)(dx) := \frac{G(x)}{\mu(G)}\, \mu(dx), \quad \text{or equivalently} \quad \int f(x)\, \Psi_G(\mu)(dx) := \frac{\mu(fG)}{\mu(G)},
\]

for any f ∈ B_b(R^d) and whenever µ(G) ≠ 0. This transformation has many interesting properties and has been extensively studied in [47] and related works. To use the BG transformation in MCMC, we will assume that the densities π and π⋆ are positive³ and make the choice that G will be the function

\[
G(x) := \frac{\pi(x)}{\pi_\star(x)}. \qquad (4)
\]

With this choice, we get an interaction kernel J^{BG}_ν(x, dy) := Ψ_G(ν)(dy). We can easily see that π⋆(G) = ∫ (π/π⋆) dπ⋆ = 1 and

\[
\Psi_G(\pi_\star)(dx) = \frac{G(x)}{\pi_\star(G)}\, \pi_\star(dx) = \frac{\pi(x)}{\pi_\star(x)}\, \pi_\star(dx) = \pi(dx),
\]

i.e. Ψ_G is the multiplicative change of measure from π⋆ to π. Hence the first nonlinear Markov kernel we will investigate is

\[
K^{BG}_{\varepsilon,\nu}(x, dy) := (1-\varepsilon)\, K(x, dy) + \varepsilon\, \Psi_G(\nu)(dy). \qquad (5)
\]

From the remarks above, it is clear that π is K^{BG}_{ε,π⋆}-invariant.

Accept-Reject Interaction. The second choice of jump interaction we will study, also introduced in [1], is a type of accept-reject interaction related to the Metropolis-Hastings algorithm. For the same choice of potential function G in (4), we can define the acceptance ratio

\[
\alpha(x, y) := 1 \wedge \frac{G(y)}{G(x)} = 1 \wedge \frac{\pi(y)\,\pi_\star(x)}{\pi_\star(y)\,\pi(x)}
\]

and the quantity A_ν(x) := ∫ α(x, y) ν(dy) for ν ∈ P(R^d). Hence we can define the accept-reject interaction kernel as⁴

\[
J^{AR}_\nu(x, dy) := \alpha(x, y)\, \nu(dy) + (1 - A_\nu(x))\, \delta_x(dy).
\]

We can interpret this jump interaction as follows: starting in state x, we jump to a new state distributed according to ν(dy) with probability α(x, y) (i.e. accept the proposed jump) and remain in the current state with probability 1 − A_ν(x) (i.e. reject the proposed jump). This is a form of adaptive Metropolis-Hastings in which the proposal distribution evolves over time as the distribution of the auxiliary Markov chain. Hence we obtain the accept-reject nonlinear jump interaction kernel

\[
K^{AR}_{\varepsilon,\nu}(x, dy) := (1-\varepsilon)\, K(x, dy) + \varepsilon\,[\alpha(x, y)\, \nu(dy) + (1 - A_\nu(x))\, \delta_x(dy)]. \qquad (6)
\]

We note that π is also J^{AR}_{π⋆}-invariant; see Proposition 3 in Appendix G for a simple calculation.

³ This positivity assumption can be relaxed to absolute continuity of π and π⋆ with respect to a common dominating measure µ.
⁴ Given f ∈ B_b(R^d), we can also write this as J^{AR}_ν f(x) = ∫ [f(y) − f(x)] α(x, y) ν(dy) + f(x).

Simulation. Let us note briefly that using both K^{BG}_{ε,ν} and K^{AR}_{ε,ν} in (3) produces interacting particle systems that can, and will, be simulated. The simulation is relatively straightforward; see Appendix A for pseudocode implementing the nonlinear MCMC algorithms we have now constructed.
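As an illustration of how (3) can be simulated with the two interactions above, here is a minimal sketch of a single step; it is a simplification of, not a substitute for, the pseudocode in Appendix A, and `K_step`, `Q_step`, and `log_G` (the log-potential log π − log π⋆, for which unnormalized densities suffice) are assumed to be supplied:

```python
import numpy as np

def nonlinear_mcmc_step(X, Y, eps, log_G, K_step, Q_step, rng, interaction="BG"):
    """One step of the interacting particle system (3) with a BG or AR jump interaction.

    X, Y   : (N, d) arrays of primary and auxiliary particles
    log_G  : callable returning log G(x) = log pi(x) - log pi_star(x)
    K_step : one step of a pi-invariant linear kernel K (e.g. MALA/ULA)
    Q_step : one step of a pi_star-invariant auxiliary kernel Q
    """
    N = X.shape[0]
    # 1. Advance the auxiliary chains with Q; their empirical measure is nu^N = m(Y).
    Y = np.stack([Q_step(y, rng) for y in Y])
    log_w = np.array([log_G(y) for y in Y])

    # 2. Update each primary particle with the mixture kernel K_{eps, nu^N}.
    X_new = np.empty_like(X)
    for i in range(N):
        if rng.uniform() >= eps:
            # With probability 1 - eps: an ordinary move from the linear kernel K.
            X_new[i] = K_step(X[i], rng)
        elif interaction == "BG":
            # Boltzmann-Gibbs jump: resample an auxiliary particle with weight prop. to G(Y_j).
            w = np.exp(log_w - log_w.max())
            X_new[i] = Y[rng.choice(N, p=w / w.sum())]
        else:
            # Accept-reject jump: propose a uniform draw from m(Y), accept w.p. min(1, G(y)/G(x)).
            j = rng.integers(N)
            accept = np.log(rng.uniform()) < log_w[j] - log_G(X[i])
            X_new[i] = Y[j] if accept else X[i]
    return X_new, Y
```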
3 Convergence Analysis

We will now study whether the nonlinear MCMC algorithms based on K_{ε,ν} from Section 2 (i.e., the interacting particle system (3) with the restrictions on K, Q, J_ν from Section 2.2) will actually converge to the target distribution π. In other words, we would like to estimate ‖µ^N_n − π‖ for some suitable notion of distance on P(R^d), where µ^N_n := Distribution(X^1_n) is the distribution of a single particle (it doesn't matter which particle, as the X^i_n are exchangeable). The nonlinear nature of K_{ε,ν} makes this analysis more difficult than that of a linear MCMC method. We break the problem into two parts: one studying the convergence of the mean-field system (2) as the number of steps n → ∞, and one studying the convergence of the interacting particle system (3) to the mean-field system as the number of particles N → ∞. This will allow us to apply the triangle inequality as follows:

\[
\|\mu^N_n - \pi\| \;\le\; \underbrace{\|\mu^N_n - \mu_n\|}_{\text{large-particle convergence}} \;+\; \underbrace{\|\mu_n - \pi\|}_{\text{long-time convergence}}
\]

where µ_n := Distribution(X_n) is the distribution of the mean-field system. Crucially, our analysis of the large-particle limit is uniform in the number of steps n, which will allow us to establish bounds above that hold as n → ∞. The actual result is contained in Theorem 1.

While our analysis does not rely on heavy mathematical machinery, stating the full set of conditions and results for long-time and large-particle convergence (each of which is a substantial result in its own right) would occupy too much space in the main text. Instead, we will state the main result in Theorem 1, which is essentially a corollary of the long-time and large-particle analyses (Theorems 2 and 3 in Appendices E and F respectively), and below we will sketch the general arguments used in those appendices. The proofs of the main results are in Appendix F.2. Note that our analysis, and the results we obtain, are novel and not found in [1]; see Appendix D for an elaboration.

The following result is stated in terms of the total variation metric, defined here for µ, ν ∈ P(R^d) as ‖µ − ν‖_tv := sup_{‖f‖_∞ ≤ 1} |µ(f) − ν(f)|, where ‖·‖_∞ is the sup-norm on B_b(R^d).

Theorem 1 (Convergence of Nonlinear MCMC). Under suitable conditions on K and Q, there exist fixed constants C_1, C_2, C_3 > 0, a function R : [0, ∞) → [1, ∞), and λ > 0 s.t.

\[
\|\mu^N_n - \pi\|_{\mathrm{tv}} \;\le\; \frac{C_1}{\sqrt{N}}\, R(1/N) \;+\; C_2\, \lambda^n \;+\; C_3\, n\, \lambda^n \qquad \text{for all } n.
\]

Let us make a couple of remarks. This result shows that, to control the approximation error ‖µ^N_n − π‖_tv, it does not necessarily suffice to run the MCMC algorithm for a large number of steps n, since if n → ∞ but N < ∞ then our bound on ‖µ^N_n − π‖ does not tend to 0. However, this approximation cannot lead to arbitrarily bad results: Theorem 1 provides a quantitative upper bound on how much the MCMC algorithm can be biased. This behaviour is supported empirically; in Figure 5 of Appendix C.1.4 we provide a clear illustration of how changing N significantly affects the bias of our nonlinear MCMC methods while having no effect on the bias of linear MCMC, as expected.

This result uses total variation, which is a strong metric that represents a worst case over all bounded functions f (up to rescaling by ‖f‖_∞). It is entirely possible that, for many practical choices of f, the approximation will be better, as we will see empirically.

The constant λ is, roughly speaking, the slower of the rates of convergence for Q and for K.
Hence if K, Q are chosen to be efficient samplers with fast convergence, this will result in λ ≪ 1, and hence µ^N_n will also converge quickly. In our specific samplers K^{BG}_{ε,ν} and K^{AR}_{ε,ν}, we will see in Appendix G that R is a monotonically increasing function that is lower-bounded by 1. Hence, as N → ∞, (1/√N) R(1/N) → 0 as expected.

A corollary of Theorem 1 is that we regain a Monte Carlo estimate for the interacting particle system. This result is essentially due to [41], Theorem 2.2.

Corollary 1 (Adapted from [41], Theorem 2.2). Suppose that Theorem 1 applies to K_{ε,ν}. Let X_n := {X^1_n, ..., X^N_n} be the interacting particle system from (3). Then for every n ∈ N and f ∈ B_b(R^d), the particle average m(X_n)(f) satisfies a quantitative error bound as an estimator of π(f).

This corollary directly relates to the application to Bayesian ML we are interested in, where we would have f(θ) = P(y|x, θ).

3.1 Long-Time Bounds

There are two main ingredients in the general result on long-time convergence: ergodicity of K and Q, and Lipschitz regularity of the interaction kernel ν ↦ J_ν. These, along with other technical conditions, produce an estimate of the form ‖µ_n − π‖ ≤ C_2 λ^n + C_3 n λ^n, where ‖·‖ is a weighted total variation norm. The full statement is in Theorem 2 of Appendix E.

Ergodicity of K and Q. A fundamental requirement of our results is that the linear building blocks of K_{ε,ν} must converge to their respective stationary measures in an appropriate metric. This type of result is now standard in the Markov chain literature, and we use a result from [57] for K and a result from [1] for Q. The former is actually able to ensure that K is a contraction on P(R^d) w.r.t. a suitable weighted total variation; we use this feature repeatedly in our analysis.

Lipschitz Regularity of J_ν. We also need that ν ↦ J_ν is Lipschitz-continuous w.r.t. a weighted total variation norm on Markov kernels (the Lipschitz constant does not have to be < 1). This regularity is used to translate the convergence of ν_n → π⋆ (as guaranteed by the ergodicity of Q) into convergence of J_{ν_n} → J_{π⋆} with the Lipschitz estimate ‖J_{ν_n} − J_{π⋆}‖ ≲ ‖ν_n − π⋆‖. In Appendix G, we verify this analytically for J^{BG}_ν; see Lemma 5 and [1], Proposition 5.3.

3.2 Large-Particle Bounds

To study the large-particle behaviour, we would like to measure how close a subset of q ∈ {1, ..., N} interacting particles from X_n := {X^1_n, ..., X^N_n} in (3) is to being i.i.d. according to the mean-field measure µ_n. This analysis was pioneered in [41] under the name "propagation of chaos" and formalizes the intuition that, as N → ∞, the influence of any individual particle tends to 0. To state this more precisely, first note that if we had random variables Z_i iid∼ ν for some ν ∈ P(R^d), then the joint distribution of Z := {Z_1, ..., Z_N} would be ν^{⊗N}. Hence, as N → ∞, for the interacting particles X_n at time n, we expect the distribution of {X^1_n, ..., X^q_n}, denoted µ^{q,N}_n, to get closer to the distribution of q i.i.d. mean-field particles from (2), namely µ_n^{⊗q}. In other words, we expect ‖µ^{q,N}_n − µ_n^{⊗q}‖_tv → 0 as N → ∞. The full statement of this result is Theorem 3 of Appendix F.

The main condition in our propagation of chaos result is another type of regularity for ν ↦ J_ν, which basically requires that, if one approximates a distribution ν ∈ P(R^d) by its empirical measure m(Y), where Y := {Y^1, ..., Y^N} and Y^i iid∼ ν, then J_{m(Y)} → J_ν as N → ∞. More precisely, there should be a function R : [0, ∞) → [1, ∞), which is ideally nondecreasing, s.t.

\[
\mathbb{E}\big[\,\big|J_{m(Y)}f(x) - J_\nu f(x)\big|\,\big] \;\lesssim\; \frac{q}{\sqrt{N}}\, R(q^2/N) \qquad \forall x \in \mathbb{R}^d
\]

and f ∈ B_b(R^d) with oscillations osc(f) ≤ 1. The expectation is taken over ν^{⊗N}, and oscillations are defined precisely in Appendix D.
This inequality is essentially a total variation regularity, since we can alternately write ‖µ − ν‖_tv = sup{|µ(f) − ν(f)| : f ∈ B_b(R^d), osc(f) ≤ 1} [47].

3.3 Analysis of Specific Interaction Kernels

The main result, Theorem 1, is stated in terms of conditions on a general K_{ε,ν} (i.e. general choices of K, Q, J_ν). To apply this result to the samplers in Section 2, we must establish whether these conditions hold for K^{BG}_{ε,ν} and K^{AR}_{ε,ν}. Fortunately this can be done analytically; in Appendix G, we present conditions under which the Lipschitz regularity (see Lemma 5 and [1], Proposition 5.3) and large-particle regularity (see Corollaries 3 and 5) hold. These results, particularly for K^{BG}_{ε,ν}, are interesting and rely on novel techniques for controlling the various quantities, sometimes improving over previous methods. Due to space constraints, the results for K^{BG}_{ε,ν} are in Corollaries 2 and 4 from Appendix G.

4 Experiments

In this section, we detail two experiments designed to explore how one might apply the nonlinear MCMC methods developed in the previous sections to Bayesian machine learning.⁵ Let us state explicitly that our aim is not to achieve state-of-the-art with these experiments, nor do we claim that this method will necessarily lead to state-of-the-art results on a particular task. Rather, the aims of these experiments are: to demonstrate that nonlinear MCMC can be applied successfully to large-scale problems; to compare linear vs nonlinear methods to understand what benefits and drawbacks nonlinear MCMC offers compared to linear MCMC in practice; and to develop some recipes for choosing the various hyperparameters and samplers that determine a nonlinear MCMC method.

4.1 Two-Dimensional Toy Experiments

First, we use a toy setting of two-dimensional distributions to compare the relative benefits of linear and nonlinear MCMC. A benefit of this simple setting is that the multimodal toy distributions can be exactly sampled. This gives us the opportunity to quantify the quality of our samples via an unbiased estimator of the Maximum Mean Discrepancy (MMD) metric on P(R^d) [58]. This approach stands in contrast to many previous works, which use simplistic distributions (e.g. Gaussians) with analytically tractable statistics to measure quality. See Appendix C.1 for an overview of our methodology.

Setup. Our setup compares the Metropolis-Adjusted Langevin Algorithm (MALA) [18] (see Appendix B for an overview of MALA) with the nonlinear BG and AR samplers using MALA for the kernels K, Q in our nonlinear setup from Section 2. This allows us to examine the effects of the nonlinearity in K^{BG}_{ε,ν} and K^{AR}_{ε,ν} while controlling for the type of sampler used and its hyperparameters. The main difference between the linear and nonlinear algorithms, aside from the interaction itself, is the extra design knob to control in the form of the choice of the auxiliary density π⋆. Below, we show how one can use π⋆ to incorporate additional insight to guide the sampling, such as regions of the state space to explore. In our experiments, we choose π⋆ to be a centered, 2-dimensional Gaussian with a large variance (Σ = 4 I_2 for the circular MoG and two-rings densities, and Σ = 20 I_2 for the grid MoG). This conveys coarse-grained information about roughly where the support of the target density is located, in this case a neighbourhood of (0, 0). See Table 2 in Appendix C.1 for a full account of our experimental settings.
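For reference, here is a minimal sketch of a MALA step of the kind used for K and Q in this experiment; it is a generic illustration rather than the implementation described in Appendix B, and `log_pi`, `grad_log_pi`, and the step size `h` are assumed inputs:

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, h, rng):
    """One Metropolis-adjusted Langevin (MALA) step targeting pi, with step size h."""
    def log_q(y_to, y_from):
        # Log density (up to a constant) of the Langevin proposal N(y_from + h * grad, 2h I).
        diff = y_to - y_from - h * grad_log_pi(y_from)
        return -np.sum(diff ** 2) / (4.0 * h)

    prop = x + h * grad_log_pi(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    # Metropolis-Hastings correction of the discretized Langevin dynamics.
    log_accept = (log_pi(prop) + log_q(x, prop)) - (log_pi(x) + log_q(prop, x))
    return prop if np.log(rng.uniform()) < log_accept else x
```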
Results. From Figure 1, we see that having a simple auxiliary density with good coverage of the support of the target distribution is quite helpful. In all three examples, one or both of the nonlinear samplers outperformed the equivalent linear sampler in the empirical MMD metric. For the two most challenging densities, the two-rings and grid MoG distributions, the improved exploration is particularly evident. We also include an analysis of the runtime of the algorithms in Appendix C.1.3.

Comparison With [1]. We also compared the performance of our methods with those of [1] in this two-dimensional toy setting. See Appendix C.1.4 for an overview of the results; they support all of the theoretical considerations and design principles we have introduced in this paper.

⁵ The code used in our experiments can be found at https://github.com/jamesvuc/nonlinear-mcmc-paper. See also Appendix A for a discussion of the implementation details.

[Figure 1 panel titles include "Target Density" and "Linear (MALA)".]

Figure 1: Visualizations of the 2d experiment. The top row shows the empirical MMD-squared plotted over the number of sampled steps, where the shaded region is ±1 standard deviation over 5 independent runs. The bottom three rows show histograms of the N = 2000 samples for the Circular Mixture of Gaussians (MoG) density [59], the Two Rings density [59], and the Grid Mixture of Gaussians density [60], respectively.

4.2 CIFAR10

4.2.1 Setup

To examine the properties of the nonlinear sampler outside of a toy setting, we have also implemented a Bayesian neural network on the CIFAR10 dataset. We use a likelihood P(y|x, θ) parameterized by a ResNet-18 convolutional neural network [61] and a Gaussian prior P(θ) on the parameters of this neural network, which are combined to form a posterior P(θ|D_train) ∝ P(θ) Π_{(x_i,y_i)∈D_train} P(y_i|x_i, θ). The goal is to sample θ_i, i = 1, ..., N, from this posterior. See Table 4 in Appendix C.2 for a full account of the experimental settings.

To deal with the fact that this sampling problem is very high-dimensional (d ≈ 11M) and |D_train| is large (|D_train| = 60,000), we use a variety of techniques:

1. We sample minibatches D̂ of size 256 to obtain the surrogate target density P(θ|D̂) [17].
2. We use an RMSProp-like preconditioned Langevin algorithm, called RMS-Langevin or RMS-ULA, as in [33] for the auxiliary sampler Q; see Appendix B for details on this sampler. As shown in [33], this sampler is biased.
3. We use tempering, wherein we aim to sample from π⋆ or π ∝ P(θ|D_train)^{1/τ}, where τ is a small number. This substantially improves mixing for hard-to-sample distributions such as P(θ|D_train), at the cost of bias, since we are no longer sampling from the true posterior [32].

Using our nonlinear algorithm presents a novel opportunity to correct the bias introduced by tempering. For our experiments, we pick π⋆ ∝ P(θ|D_train)^{1/τ} and π = P(θ|D_train). This means that the auxiliary chain Y_n explores a tempered version of the target, whereas the target chain X_n (in theory) explores the true target distribution. This is a novel strategy that is made possible by being able to select π⋆ almost independently of the target π. We also study the case when π is tempered as well.

For our experiments, we use the RMS-Langevin sampler as the baseline, and we also use it for the auxiliary sampler Q. For the target sampler, it is not possible to use the RMS-Langevin algorithm, because the smoothed square-gradient estimate is incompatible with the discontinuities (i.e. jumps) introduced by the nonlinear interaction. Instead, for the linear kernel K we use the unadjusted Langevin algorithm, ULA [16] (see Appendix B). We investigate both test accuracy and calibration error [9] to assess performance.
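To illustrate items 1-3 and the choice of K, here is a minimal sketch of a tempered minibatch surrogate log-posterior together with a ULA update; it is not the paper's implementation, `log_prior` and `log_lik` are assumed inputs, and the RMS-Langevin variant of [33] would additionally rescale the gradient and noise by a running estimate of the squared gradients:

```python
import numpy as np

def minibatch_log_posterior(theta, batch, n_train, log_prior, log_lik, tau=1.0):
    """Stochastic surrogate for (1/tau) log P(theta | D) computed from a minibatch.

    Rescaling the minibatch log-likelihood by |D| / |batch| gives an unbiased estimate of
    the full-data log-likelihood [17]; dividing by tau targets the tempered posterior
    P(theta | D)^(1/tau) [32].
    """
    scale = n_train / len(batch)
    log_post = log_prior(theta) + scale * sum(log_lik(theta, x, y) for x, y in batch)
    return log_post / tau

def ula_step(theta, grad_log_post, h, rng):
    """One unadjusted Langevin (ULA) step: a gradient step on the log-density plus Gaussian noise."""
    return theta + h * grad_log_post(theta) + np.sqrt(2.0 * h) * rng.standard_normal(theta.shape)
```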
Table 1: Results for CIFAR10 experiments. ± represents 1 standard deviation over 5 random seeds. The tempered results use τ = 10⁻⁴. See Appendix C.2 for an overview of expected calibration error. We also compute the maximum calibration error in Appendix C.2. All Expected Calibration Error numbers are multiplied by 10² in this table.

| Algorithm | Test Accuracy (↑), Non-Tempered | Test Accuracy (↑), Tempered | Expected Calibration Error (↓), Non-Tempered | Expected Calibration Error (↓), Tempered |
|---|---|---|---|---|
| Linear | 85.01 ± 0.10 | 85.01 ± 0.19 | 0.24 ± 0.02 | 0.26 ± 0.014 |
| Nonlinear (BG) | 84.28 ± 0.28 | 84.74 ± 0.08 | 0.14 ± 0.03 | 0.16 ± 0.03 |
| Nonlinear (AR) | Diverged | 84.67 ± 0.23 | Diverged | 0.15 ± 0.05 |

Figure 2: Evaluation of test accuracy during sampling for CIFAR10. The shaded areas represent ±1 standard deviation over 5 random seeds. For readability, we omit the AR interaction curve in the non-tempered result, since it diverged and distorts the scale of the plot. For completeness, all the curves for the non-tempered case are plotted in Appendix C.2.2, Figure 6.

4.2.2 Results

Linear vs Nonlinear. From Table 1, we see that the linear (RMS-Langevin) sampler has slightly higher, but comparable, test accuracy relative to the nonlinear samplers. This is likely because the RMS-Langevin algorithm has better stability around the regions of high probability due to its adaptive stepsize scaling. However, from Figure 2, we see that in both the tempered and non-tempered cases, the nonlinear interaction appears to provide a benefit during early exploration. This is an expected and desired property of these nonlinear samplers, which incorporate global information about the sampler state (in this case, the relative potential G of each auxiliary chain's state) and are able to emphasize those states with higher probability. However, the linear method is eventually able to explore the relevant regions of the state space, and the difference disappears. See Appendix C.2.5 for a comparison of linear vs nonlinear performance scaled by the number of gradient evaluations.

Tempering vs Non-Tempering. The linear MCMC sampler is always tempered in our experiments, so there should not be any statistically significant difference in the linear case. For the nonlinear sampler, the tempered version has slightly higher accuracy; this trend is also observed in [32]. On the other hand, the calibration errors are the same for both tempered and non-tempered variants. This is somewhat surprising given the aggressive tempering used, since one would expect tempering to reduce the variance of the posterior estimate.⁶ This observation can perhaps be explained by the fact that we are using N = 10 samples, which may not be enough to accurately change the tempered auxiliary distribution π⋆ into the non-tempered primary distribution π in the jump interaction.

⁶ As τ → 0, P(θ|D_train)^{1/τ} converges to the maximum a posteriori estimate with zero variance.

Calibration. Considering the expected calibration error (ECE) [9], in Table 1 we see that the nonlinear method has statistically significantly lower ECE (p < 0.05) than the linear method. We hypothesize that this is due to a tension introduced by the RMS scaling of the gradient, which improves the efficiency of each MCMC step but at the cost of bias, and this bias may be measurable in the form of calibration error. The Langevin algorithm is also biased, but is generally known to have good convergence properties [16], whereas much less is known about the RMS-Langevin variant. By using our nonlinear setup, we are able to aggressively explore the auxiliary distribution without sacrificing calibration on the target distribution.
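For readers unfamiliar with this metric, the following sketch computes the standard binned expected calibration error of [9] from predictive probabilities (e.g. the Monte Carlo averages described earlier); the binning details used in Appendix C.2 may differ:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE in the sense of [9]: bin-weighted average of |accuracy - confidence|.

    probs  : (n, K) array of predictive class probabilities
    labels : (n,) array of true class indices
    """
    conf = probs.max(axis=1)               # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```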
5 Conclusion

In this paper, we have studied the theoretical and empirical properties of nonlinear MCMC methods. We have obtained powerful theoretical results to characterize the convergence of our MCMC methods, and we have applied these methods to Bayesian neural networks. The results on BNNs are comparable to, but not better than, the linear methods we studied. We hypothesize that this is because more investigation into choosing the best auxiliary density π⋆ is required; our choice is simplistic and may not be optimal. This hypothesis is supported by our toy experiments, which show significant improvement when π⋆ is able to incorporate some additional insight into the problem. How to do this in high dimensions is an exciting direction for future research.

Broader Impact. There are benefits and drawbacks to the nonlinear MCMC methods we describe. The benefits are mainly that properly accounting for uncertainty in machine learning will lead to better real-world outcomes for high-value scenarios such as self-driving cars or medical imaging. The drawbacks are that MCMC methods require O(N) storage and computation relative to the O(1) of deterministic methods; in fact, our nonlinear method requires 2N resources compared with N for a linear method (see also Appendices C.1.3 and C.2.5). If our algorithms were applied to a large swath of ML "as is", this would mean a substantial increase in the energy consumption required for experimentation and deployment, worsening an already substantial issue in the field.

References

[1] Christophe Andrieu, Ajay Jasra, Arnaud Doucet, and Pierre Del Moral. On nonlinear Markov chain Monte Carlo. Bernoulli, 17(3):987-1014, 2011. doi: 10.3150/10-BEJ307. URL https://doi.org/10.3150/10-BEJ307.

[2] Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control. arXiv preprint arXiv:1811.06817, 2018.

[3] Sina Shafaei, Stefan Kugele, Mohd Hafeez Osman, and Alois Knoll. Uncertainty in machine learning: A safety perspective on autonomous driving. In International Conference on Computer Safety, Reliability, and Security, pages 458-464. Springer, 2018.

[4] Liuhui Ding, Dachuan Li, Bowen Liu, Wenxing Lan, Bing Bai, Qi Hao, Weipeng Cao, and Ke Pei. Capture uncertainties in deep neural networks for safe operation of autonomous driving vehicles. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pages 826-835. IEEE, 2021.

[5] Benjamin Kompa, Jasper Snoek, and Andrew L Beam. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):1-6, 2021.

[6] Roohallah Alizadehsani, Mohamad Roshanzamir, Sadiq Hussain, Abbas Khosravi, Afsaneh Koohestani, Mohammad Hossein Zangooei, Moloud Abdar, Adham Beykikhoshk, Afshin Shoeibi, Assef Zare, et al. Handling of uncertainty in medical data using machine learning and probability theory techniques: A review of 30 years (1991-2020). Annals of Operations Research, pages 1-42, 2021.

[7] Jianfeng He, Xuchao Zhang, Shuo Lei, Zhiqian Chen, Fanglan Chen, Abdulaziz Alhamadani, Bei Xiao, and Chang-Tien Lu. Towards more accurate uncertainty estimation in text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8362-8372, 2020.

[8] Yarin Gal and Zoubin Ghahramani.
Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050 1059. PMLR, 2016. [9] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321 1330. PMLR, 2017. [10] K.P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series. MIT Press, 2012. ISBN 9780262018029. URL https://books. google.ca/books?id=NZP6AQAAQBAJ. [11] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087 1092, 1953. [12] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970. [13] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6): 721 741, 1984. [14] Donald L. Ermak. A computer simulation of charged particles in solution. i. technique and equilibrium properties. The Journal of Chemical Physics, 62(10):4189 4196, 1975. doi: 10.1063/1.430300. URL https://doi.org/10.1063/1.430300. [15] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341 363, 1996. [16] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 2882, 2019. [17] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681 688. Citeseer, 2011. [18] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255 268, 1998. [19] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. doi: bj/1178291835. URL https://doi.org/. [20] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216 222, 1987. [21] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. ar Xiv preprint ar Xiv:1701.02434, 2017. [22] Radford M. Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111:194 203, 1992. [23] R.M. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer New York, 2012. ISBN 9781461207450. URL https://books.google.ca/books?id= LHHr Bw AAQBAJ. [24] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2):123 214, 2011. [25] Peter Müller and David Rios Insua. Issues in Bayesian analysis of neural network models. Neural Computation, 10(3):749 770, 1998. [26] Arya A Pourzanjani, Richard M Jiang, and Linda R Petzold. Improving the identifiability of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep Learning, volume 4, page 31, 2017. [27] David M. Higdon. Auxiliary variable methods for Markov chain Monte Carlo with applications. Journal of the American Statistical Association, 93(442):585 595, 1998. ISSN 01621459. 
URL http://www.jstor.org/stable/2670110. [28] Raza Habib and David Barber. Auxiliary variational MCMC. In International Conference on Learning Representations, 2018. [29] Malcolm Sambridge. A Parallel Tempering algorithm for probabilistic sampling and multimodal optimization. Geophysical Journal International, 196(1):357 374, 10 2013. ISSN 0956-540X. doi: 10.1093/gji/ggt342. URL https://doi.org/10.1093/gji/ggt342. [30] Robert Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Physical review letters, 57:2607 2609, 12 1986. doi: 10.1103/Phys Rev Lett.57.2607. [31] Rohitash Chandra, Konark Jain, Ratneel V Deo, and Sally Cripps. Langevin-gradient parallel tempering for Bayesian neural learning. Neurocomputing, 359:315 326, 2019. [32] Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In International Conference on Machine Learning, pages 10248 10259. PMLR, 2020. [33] Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. [34] Christophe Andrieu and Éric Moulines. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462 1505, 2006. [35] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and computing, 18(4):343 373, 2008. [36] Matthew D Hoffman, Andrew Gelman, et al. The no-u-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. [37] Henry P Mc Kean Jr. A class of Markov processes associated with nonlinear parabolic equations. Proceedings of the National Academy of Sciences of the United States of America, 56(6):1907, 1966. [38] Sylvie Méléard. Asymptotic behaviour of some interacting particle systems; mckean-vlasov and boltzmann models. In Probabilistic models for nonlinear partial differential equations, pages 42 95. Springer, 1996. [39] José A Carrillo, Robert J Mc Cann, and Cédric Villani. Kinetic equilibration rates for granular media and related equations: entropy dissipation and mass transportation estimates. Revista Matematica Iberoamericana, 19(3):971 1018, 2003. [40] Arnaud Guillin and Pierre Monmarché. Uniform long-time and propagation of chaos estimates for mean field kinetic particles in non-convex landscapes. Journal of Statistical Physics, 185(2): 1 20, 2021. [41] Alain-Sol Sznitman. Topics in propagation of chaos. In Ecole d été de probabilités de Saint- Flour XIX 1989, pages 165 251. Springer, 1991. [42] OA Butkovsky. On ergodic properties of nonlinear Markov chains and stochastic Mc Kean Vlasov equations. Theory of Probability & Its Applications, 58(4):661 674, 2014. [43] Pierre Del Moral. Nonlinear filtering: Interacting particle resolution. Comptes Rendus de l Académie des Sciences-Series I-Mathematics, 325(6):653 658, 1997. [44] Pierre Del Moral and Laurent Miclo. Asymptotic results for genetic algorithms with applications to nonlinear estimation. In Theoretical aspects of evolutionary computing, pages 439 493. Springer, 2001. [45] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pages 3 14. Springer, 2001. [46] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. 
Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436, 2006. [47] P.D. Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and Its Applications. Springer New York, 2004. ISBN 9780387202686. URL https://books.google.ca/books?id=8Lypfu G8ZLYC. [48] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72 (3):269 342, 2010. [49] Tom Rainforth, Christian Naesseth, Fredrik Lindsten, Brooks Paige, Jan-Willem Vandemeent, Arnaud Doucet, and Frank Wood. Interacting particle Markov chain Monte Carlo. In International Conference on Machine Learning, pages 2616 2625. PMLR, 2016. [50] Grégoire Clarté, Antoine Diez, and Jean Feydy. Collective proposal distributions for nonlinear MCMC samplers: Mean-field theory and fast implementation. ar Xiv preprint ar Xiv:1909.08988, 2019. [51] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 32, 2019. [52] Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability surveys, 1:20 71, 2004. [53] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. [54] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial intelligence and statistics, pages 814 822. PMLR, 2014. [55] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530 1538. PMLR, 2015. [56] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1 64, 2021. [57] Martin Hairer and Jonathan C Mattingly. Yet another look at Harris ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, pages 109 117. Springer, 2011. [58] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. [59] Vincent Stimper, Bernhard Schölkopf, and José Miguel Hernández-Lobato. Resampling base distributions of normalizing flows. In International Conference on Artificial Intelligence and Statistics, pages 4915 4936. PMLR, 2022. [60] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2019. [61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [62] Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association, 94(446):590 599, 1999. [63] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. [64] Grigorios A Pavliotis. 
Stochastic processes and applications: diffusion processes, the Fokker- Planck and Langevin equations, volume 60. Springer, 2014. [65] Denis Talay and Luciano Tubaro. Expansion of the global error for numerical schemes solving stochastic differential equations. Stochastic analysis and applications, 8(4):483 509, 1990. [66] T. Tieleman and G. Hinton. Lecture 6.5 Rms Prop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012. [67] Chii-Ruey Hwang, Shu-Yin Hwang-Ma, and Shuenn-Jyi Sheu. Accelerating diffusions. The Annals of Applied Probability, 15(2):1433 1444, 2005. [68] Andrew B Duncan, Tony Lelievre, and Grigorios A Pavliotis. Variance reduction using nonre- versible Langevin samplers. Journal of statistical physics, 163(3):457 491, 2016. [69] Alessandro Barp, Lancelot Da Costa, Guilherme França, Karl Friston, Mark Girolami, Michael I Jordan, and Grigorios A Pavliotis. Geometric methods for sampling, optimisation, inference and adaptive agents. ar Xiv preprint ar Xiv:2203.10592, 2022. [70] Joris Bierkens, Paul Fearnhead, and Gareth Roberts. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288 1320, 2019. [71] George Deligiannidis, Alexandre Bouchard-Côté, and Arnaud Doucet. Exponential ergodicity of the bouncy particle sampler. The Annals of Statistics, 47(3):1268 1287, 2019. [72] Augustin Chevallier, Sam Power, Andi Q Wang, and Paul Fearnhead. PDMP Monte Carlo methods for piecewise-smooth densities. ar Xiv preprint ar Xiv:2111.05859, 2021. [73] Alain Durmus, Arnaud Guillin, and Pierre Monmarché. Piecewise deterministic Markov processes and their invariant measures. In Annales de l Institut Henri Poincaré, Probabilités et Statistiques, volume 57, pages 1442 1475. Institut Henri Poincaré, 2021. [74] Pierre Monmarché. High-dimensional MCMC with a standard splitting scheme for the under- damped Langevin diffusion. Electronic Journal of Statistics, 15(2):4117 4166, 2021. [75] Dimitris Bertsimas and John Tsitsiklis. Simulated annealing. Statistical science, 8(1):10 15, [76] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [77] Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku. [78] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.