# diffusive_gibbs_sampling__ef6fb9bb.pdf

Diffusive Gibbs Sampling

Wenlin Chen * 1 2 Mingtian Zhang * 3 Brooks Paige 3 Jos e Miguel Hern andez-Lobato 1 David Barber 3

The inadequate mixing of conventional Markov Chain Monte Carlo (MCMC) methods for multimodal distributions presents a significant challenge in practical applications such as Bayesian inference and molecular dynamics. Addressing this, we propose Diffusive Gibbs Sampling (Di GS), an innovative family of sampling methods designed for effective sampling from distributions characterized by distant and disconnected modes. Di GS integrates recent developments in diffusion models, leveraging Gaussian convolution to create an auxiliary noisy distribution that bridges isolated modes in the original space and applying Gibbs sampling to alternately draw samples from both spaces. A novel Metropolis-within-Gibbs scheme is proposed to enhance mixing in the denoising sampling step. Di GS exhibits a better mixing property for sampling multi-modal distributions than state-of-the-art methods such as parallel tempering, attaining substantially improved performance across various tasks, including mixtures of Gaussians, Bayesian neural networks and molecular dynamics.

1. Introduction

Generating samples from complex unnormalized probability distributions is an important problem in machine learning, statistics and natural sciences. Consider an unnormalized target distribution of the form

p(x) = exp( E(x))

where x Rd is the variable of interest, E : Rd R is a lower-bounded differentiable energy function, and

*Equal contribution 1University of Cambridge, Cambridge, UK 2Max Planck Institute for Intelligent Systems, T ubingen, Germany 3University College London, London, UK. Correspondence to: Wenlin Chen <wc337@cam.ac.uk>, Mingtian Zhang <m.zhang@cs.ucl.ac.uk>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Z = R exp( E(x)) dx is the (intractable) normalization constant. We aim to draw independent samples x p(x) from the target distribution and estimate expectations of functions Ep(x)[h(x)] = R h(x)p(x) dx under the target distribution p(x).

The gradient of the log density of the target distribution is known as the score function:

x log p(x) = x E(x), (2)

which is independent of Z. The score function can be evaluated at any location x since E is assumed to be differentiable. This assumption is commonly satisfied in various practical applications, such as posteriors in Bayesian inference (Welling & Teh, 2011), score/energy networks in generative image modelling (Song & Ermon, 2019), and Boltzmann distributions in statistical mechanics (No e et al., 2019). Below, we introduce score-based methods for sampling from unnormalized distributions.

1.1. Score-Based MCMC Methods

Unadjusted Langevin Algorithm (ULA) (Grenander & Miller, 1994; Roberts & Tweedie, 1996) follows the transition rule given by a discrete-time Langevin SDE:

xk+1 = xk + η x log p(xk) + p

where ϵk N(0, I). For an infinitesimal step size η, the Markov chain converges to the target p(x) as k .

Metropolis-adjusted Langevin Algorithm (MALA) (Roberts & Tweedie, 1996; Roberts & Stramer, 2002) defines a proposal xk+1 using the ULA update rule and additionally corrects the bias according to the transition probability given by the Metropolis-Hasting (MH) algorithm:

a MALA min 1, exp( E(xk+1))q(xk|xk+1)

exp( E(xk))q(xk+1|xk)

where the proposal distribution is given by

q(x |x) = N(x |x + η x log p(x), 2ηI). (5)

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011) augments the original variable x with an auxiliary momentum variable v, which defines a joint distribution

p(x, v) = p(x)p(v) e E(x) K(v), (6)

Diffusive Gibbs Sampling

Figure 1. Challenge of multi-modal sampling with score-based MCMC. The true samples represent a mixture of 9 Gaussians and each Gaussian has a standard deviation σ = 0.1. The generated samples are produced by MALA initialized at the origin.

where p(v) = N(v|0, M) corresponds to the kinetic energy K(v) = 1

2v TM 1v. The total energy, or Hamiltonian, is denoted by H(x, v) = E(x) + K(v). HMC generates samples of x and v by simulating the Hamiltonian equations

v = M 1v, dv

x = x log p(x). (7)

Accurate numerical simulation of the Hamiltonian equations can be done by the leapfrog algorithm (Neal et al., 2011), with discretization bias corrected by the MH algorithm.

Despite the effort to address the challenge of exploring the whole support of a distribution using Langevin and Hamiltonian dynamics, MALA and HMC can still be ineffective for distributions with disconnected modes in practice since they often struggle to cross low-density barriers that separate the modes, leading to prolonged transitions from one mode to another (Pompe et al., 2020). More generally, score-based sampling methods utilize local gradient information to propose subsequent states, which presents challenges in sampling multi-modal distributions when there is insufficient bridging density to connect different modes. This limited connectivity impairs the ability of the resulting samples to accurately represent the entire distribution, a limitation illustrated in Figure 1.

1.2. Convolution-Based Method

A popular approach to bridging disconnected modes in multi-modal distributions is Gaussian convolution, which has been widely used in the recent developments of diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020). For a target distribution p(x) and a Gaussian convolution kernel p( x|x) = N( x|αx, σ2I), a convolved distribution p( x) can be constructed as follows:

p( x) = Z p( x|x)p(x) dx. (8)

Since p( x|x) has the full support of Rd, it can effectively create non-negligible density paths between disconnected modes in p( x), which makes the modes in p( x) exhibit better connectivity compared to those in p(x). This cherished

property has made Gaussian convolution a popular remedy to heal the blindness of score matching (Song & Ermon, 2019; Wenliang & Kanagawa, 2020; Zhang et al., 2022) and fix KL divergence training for distributions with disjoint or ill-defined density (Roth et al., 2017; Zhang et al., 2020; 2023b; Brown et al., 2022).

Due to the mode-bridging property of convolution, it is generally easier for score-based samplers to explore the whole space of p( x) than p(x). If we could obtain numerous samples x p( x), then it is more likely that these samples will encapsulate a broader range of modes in p( x) which are close to different high-density areas in p(x). Consequently, these samples of x can then serve as initial points for sampling from the original target p(x), which facilitates score-based samplers in capturing different modes in p(x). However, the score function of the convolved noisy distribution p( x) has the form

x log p( x) = x log Z exp E(x) x αx 2

which is typically intractable for non-Gaussian targets. This makes score-based sampling infeasible for p( x).

2. Diffusive Gibbs Sampling

We introduce a novel sampling method, Diffusive Gibbs Sampling (Di GS), which leverages Gaussian convolution with an innovative Metropolis-within-Gibbs scheme to enhance multi-modal sampling, while avoiding the intractability of the convolved score function as shown in Equation 9.

2.1. Sampler Construction

Instead of trying to directly produce samples from the intractable convolved distribution p( x), Di GS employs a Gibbs sampler to sample from the joint distribution p(x, x) = p( x|x)p(x). This Gibbs sampling procedure involves alternately drawing samples from the two conditional distributions p( x|x) (the convolution kernel) and p(x| x) (the denoising posterior). In each step, for a given clean sample x(i 1) from the target p(x), we draw

1. a noisy sample x(i 1) p( x|x = x(i 1)), 2. a new clean sample x(i) p(x| x = x(i 1)).

For a Gaussian convolution kernel p( x|x) = N( x|αx, σ2I), noisy samples can be easily obtained by corrupting clean samples with Gaussian noises:

x = αx + σϵ, ϵ N(0, I), (10)

where α [0, 1] is a contraction factor inspired by Diffusion models (Ho et al., 2020) and σ determines the level of smoothness in the Gaussian convolution. Intuitively, a small

Diffusive Gibbs Sampling

(b) p(x| x(i 1)).

Figure 2. Visualization of an Mo G target with unequal weights w = [0.1, 0.1, 0.1, 0.7] for different components. (a) Density heatmap of the target p(x), a clean sample x(i 1) and a noisy sample x(i 1). (b) Density heatmap of the denoising posterior p(x| x(i 1)) with Gaussian convolution parameters α = 1, σ = 1.

α will compress the distribution and make the modes closer, and a large σ will encourage the sampler to jump out of the local modes. The effects of these hyperparameters will be investigated in Section 2.3, and hyperparameter tuning strategies will be further discussed in Section 2.4.

Unlike the convolved distribution p( x) which has an intractable score as shown in Equation 9, the denoising posterior p(x| x) p(x, x) = p( x|x)p(x), taking the form

p(x| x) exp E(x) αx x 2

has a tractable score function (Gao et al., 2020; Huang et al., 2023):

x log p(x| x) = x E(x) α (αx x)

Therefore, common score-based methods like MALA, and HMC could be directly applied to sample from the denoising posterior p(x| x) in the denoising sampling step. It is worth noting that, in contrast to directly sampling from the original target p(x) exp( E(x)) using score-based methods, incorporating an additional quadratic term as in Equation 11 improves the Log-Sobolev conditions (i.e., makes it more Gaussian-like ), which in turn significantly increases the convergence speed of score-based samplers (Vempala & Wibisono, 2019); see Huang et al. (2023) for further indepth analysis. Below, we show that the Di GS yields a p(x, x)-irreducible and recurrent Markov Chain under certain regularity conditions. Theorem 2.1. For an absolutely continuous target distribution p(x), Di GS with a Gaussian convolution kernel p( x|x) = N( x|αx, σ2I) (α > 0, σ > 0) yields a p(x, x)- irreducible and recurrent Markov Chain.

The proof of Theorem 2.1 can be found in Appendix A. Intuitively, these properties ensure that the chain comprehensively explores the state space from any starting point

(irreducibility) and effectively captures the target distribution by infinitely revisiting every state (recurrence) (Robert et al., 1999).

In addition to being a valid MCMC sampler, there are some practical considerations that can impact the performance of Di GS. We discuss these in the following sections.

2.2. Initialization of the Denoising Sampling Step

Ideally, one might hope to construct an exact Gibbs sampler where the score-based sampler targeting p(x| x(i 1)) draws a true sample x(i) at each iteration. Unfortunately, when the target distribution has very disconnected modes, the resulting denoising posterior p(x| x) may still exhibit a multi-modal nature. For instance, Figure 2(a) illustrates the density of an Mo G with unbalanced weights. For a given previous clean sample x(i 1), we generate a noisy sample x(i 1) using a Gaussian convolutional kernel with α = 1 and σ = 1. The corresponding denoising density p(x| x(i 1)) is depicted in Figure 2(b), which exhibits four distinct modes with varying weights. In such scenarios, selecting an appropriate initial point, x(i) init, for the subsequent sampling process x(i) p(x| x = x(i 1)) is crucial for score-based samplers.

An ideal initial point for sampling from the denoising posterior p(x| x(i 1) would be the mean of the denoising distribution, defined as µ( x) R xp(x| x) dx. By Tweedie s lemma (Efron, 2011; Robbins, 1992), the mean can be expressed as a function of the noisy score x log p( x):

µ( x(i 1)) = x(i 1) + σ2 x log p( x(i 1))

Figure 2(b) demonstrates that the mean function µ( x(i 1)) provides an initial point positioned in the middle of the four modes according to their weights, slightly favoring the mode with the largest weight. However, a challenge arises since x log p( x) is generally intractable (see Equation 9), making it impractical to compute.

Alternatively, a more straightforward approach is to initialize the denoising sampler at the previous state x(i 1). However, this strategy has a drawback: since x(i 1) is typically close to one of the modes, the score-based sampler often remains trapped in the vicinity of that mode, thereby hindering effective exploration of the entire distribution. Another heuristic initialization strategy is to use the (scaled) noisy sample x(i 1)/α, where the scaling factor α reflects the scale relationship between x and x as suggested by the mean function in Equation 13. This approach exhibits a uniform preference for a random mode (for instance, the upper-left mode in Figure 2(b)) and ignores the underlying weighting between modes, resulting in a bias in representing the true weights of different modes and consequently diminishing the overall quality of the samples.

Diffusive Gibbs Sampling

(b) x(i 1).

(c) x(i 1)/α.

Figure 3. Comparison of different initialization techniques for denoising posterior sampling on an unequally weighted Mo G target described in Figure 2(a). In each case, we generate 1,000 samples using a Gaussian convolution kernel with α = 1, σ = 1.

Table 1. MMD between true samples and samples obtained using different initialization techniques for denoising posterior sampling on an unequally weighted Mo G target described in Figure 2(a).

Init. x(i 1) x(i 1)/α MH MMD 0.15 0.01 0.92 0.04 0.03 0.01

2.2.1. A METROPOLIS-WITHIN-GIBBS SCHEME

To avoid potential bias, we propose a Metropolis-within Gibbs scheme to facilitate mixing across modes by sampling an initialization position x init from an additional MCMC transition kernel for the subsequent finite score-based sampling steps. Specifically, to initialize score-based sampling from the denoising posterior p(x| x(i 1)), we employ a linear Gaussian proposal x init q(x| x(i 1)) which is centered at the (scaled) noisy sample x(i 1)/α:

q(x| x(i 1)) = N(x| x(i 1)/α, (σ/α)2I), (14)

where the mean and the variance are chosen according to q(x| x) p( x|x) inspired by the mean function µ( x(i 1)). Note that this proposal only depends on the noisy sample x(i 1) and is independent of the previous state x(i 1). We use the MH algorithm to calculate the acceptance rate for this proposal:

ainit = min 1, p(x init| x(i 1))q(x(i 1)| x(i 1)) p(x(i 1)| x(i 1))q(x init| x(i 1))

where the denoising posterior ratio is tractable since

p(x init| x(i 1))) p(x(i 1)| x(i 1)) = e E(x init)p( x(i 1)|x init) e E(x(i 1))p( x(i 1)|x(i 1)). (16)

If the proposal is accepted, we will initialize the denoising sampling process with the updated value x init, rather than the previous state x(i 1). Algorithm 1 summarizes the proposed Diffusion Gibbs Sampling (Di GS) procedure.

To demonstrate the benefits of the additional MCMC kernel updating the initialization, we run Di GS with three different initializations (the previous state x(i 1), a heuristic reinitialization at x(i 1)/α, and our MH transition strategy)

Algorithm 1 Diffusive Gibbs Sampling (Di GS)

1: Input: target energy E(x); Gaussian convolution hyperparameters α, σ; score-based denoising sampler S; the number of denoising sampling steps L; the number of Gibbs sampling sweeps K; initial clean sample x(0). 2: for i 1 to K do 3: Draw x(i 1) p( x|x(i 1)) using Equation 10. 4: Propose x init q(x| x(i 1)) as initialization for the denoising process from the proposal in Equation 14.

5: Accept x(i) init x init with probability ainit in Equation 15; otherwise set x(i) init x(i 1). 6: Draw x(i) p(x| x(i 1)) by running the score-based sampler S for L steps from the initial point x(i) init using Equation 12. 7: end for 8: Output: x(K)

on an Mo G target with different component weights. Figure 3 provides a visual comparison of the samples obtained from these initializations, showing that only the MH transition scheme captures all modes with the correct weightings. For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) throughout all the experiments. The results are shown in Table 1, which demonstrates that the MH strategy outperforms the other two in capturing all modes and accurately representing the true weightings. Detailed experimental setup can be found in Appendix D.1.

2.3. Choosing the Gaussian Convolution Kernels

The performance of Di GS can be influenced by the hyperparameters α and σ in the Gaussian convolution kernel. Intuitively, for a given fixed α, a large σ will enhance the exploration capability of the sampler. However, this also makes the denoising posterior p(x| x) closer to p(x) as illustrated in Equation 11, thereby elevating the complexity of denoising sampling. Similarly, with a fixed σ, reducing α will bring the modes closer but will also make x log p(x| x) close to x log p(x) as shown in Equation 12, making the denoising sampling challenging. To illustrate the effects of hyperparameters, we apply Di GS to the Mo G problem described in Figure 1 with varying values of α, σ and show the results in Figures 4(a) and 4(b), respectively. This demonstrates that within a specific range of α and σ values, Di GS consistently achieves optimal performance indicated by almost zero MMD. However, the quality of samples degrades when hyperparameters deviate beyond certain thresholds.

Although these results suggest that Di GS is robust to a certain range of hyperparameters, selecting an appropriate range remains crucial for each specific target distribution. The optimal range can depend on the support and the shape of the target density, which is typically unknown in practice.

Diffusive Gibbs Sampling

0 1 2 3 4 5

(a) Effects of α (fix σ = 1).

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0

(b) Effects of σ (fix α = 1).

2 3 4 5 The Number of Time Steps (T)

(c) Effects of T in the VP schedule.

Figure 4. Effects of the hyperparameters α, σ in Gaussian convolution kernels and the number T of noise levels in the variance-preserving (VP) noise scheduling. The y-axis in all three plots is the MMD between true samples and generated samples generated by Di GS with varying hyperparameters. Experimental setups can be found in Appendices D.2 and D.3.

Next section will introduce a multi-level noise scheduling to mitigate the need for precise hyperparameter selection.

2.4. Multi-Level Noise Scheduling

Drawing inspiration from the noise scheduling technique used in diffusion models (Ho et al., 2020; Song & Ermon, 2019; Song et al., 2021; Gao et al., 2018), we propose to use a sequence of Gaussian convolution kernels with 0 < αT < < α1 < 1 and the corresponding variance σ2 t = 1 α2 t in each Gaussian convolution kernel pt( xt|x) = N( x|αtx, σ2 t I). This approach is commonly known as the variance-preserving (VP) schedule, where pt( xt) = R pt( xt|x)p(x) dx and p0( x0) p(x). As αT 0, it follows that p T ( x T |x) N( x T |0, I) and p T ( x T ) N( x T |0, I), which are easy to sample from since they are independent of the characteristics of the target distribution p(x).

We follow the common sampling procedure in the scorebased diffusion model literature (Song & Ermon, 2019; 2020) and propose the following sampling procedure: for each t from T to 1, we run Di GS to generate a sample, which is used as an initial point in the subsequent time step t 1. For a given number of noise levels T, we apply a simple linear scheduling scheme to determine the values of αt. Specifically, given the end points α1 and αT , we have

αt = αT + (α1 αT ) T t

T 1, σt = q

1 α2 t. (17)

Despite the similarity in noise scheduling to diffusion models, the fundamental sampling mechanism in Di GS is different. In diffusion models, the sampling process requires a progression from time step T to 0 to yield valid samples. However, Di GS produces a valid sample at any timestep t in principle. This property allows us to set αT > 0 and α1 < 1, thereby enhancing the efficiency of Di GS without the necessity to align with the asymptotic distributions. To illustrate the effect of the number of noise levels T, we implement the VP schedule with a linear noise scheme. We set αT = 0.1, α1 = 0.9 and vary T from 2 to 5. We test this multi-level Di GS on the Mo G problem described in Figure 1. Figure 4(c) shows that an optimal sampler is achieved

Figure 5. Comparison of a multi-modal target distribution p(x), tempered distribution pβ(x), and convolved distribution p( x).

with T > 2 for this problem, circumventing the need for manually selecting Gaussian convolution hyperparameters.

3. Comparison to Related Methods

In this section, we explore the relationship between Di GS and related methods, complementing this with empirical comparisons to highlight their distinct characteristics.

3.1. Tempering-Based Sampling

Tempering-based sampling is a state-of-the-art method for multi-modal target distributions, which samples from smoothed versions of the target distribution and exchanges samples with the original target once in a while. A tempered target distribution is defined as

pβ(x) p(x)β exp( βE(x)), (18)

where β 1/τ < 1 is the inverse temperature. As τ , the tempered target pβ 0(x) converges to a flat distribution, which encourages transitions among different modes. Tempering is a key building block to develop state-of-the-art multi-modal sampling methods such as parallel tempering (PT) (Swendsen & Wang, 1986; Geyer & Thompson, 1995; Syed et al., 2022; Surjanovic et al., 2022) and annealed importance sampling (AIS) (Neal, 2001).

Tempering-based sampling makes transitions between distant modes easier. However, tempering is unable to bridge disconnected modes even with a large temperature as it does not alter the support of a distribution, since p(x)β = 0 wherever p(x) = 0. Moreover, Figure 5 shows that the tempering-based method is inefficient in connecting modes

Diffusive Gibbs Sampling

(a) True Samples

Figure 6. Comparison between parallel tempering (PT) and Di GS on a Mixture of Deltas problem in 2D with the same budge. PT consists of 5 temperatures, while Di GS uses one noise level. Each sampler is initialized at the origin and generates 1,000 samples. Detailed experimental setup can be found in Appendix D.4.

even when they are not completely disconnected but far away. Appendix B gives an analytical example, showing the log-density of a point between two modes could approach in the tempered distribution pβ(x), whereas the logdensity of that point in the convolved distribution p( x) is lower bounded, regardless of the shape of the target p(x).

To empirically compare the mode-bridging property of PT and Di GS, we conduct experiments on an extreme problem, Mixture of Deltas , which is an Mo G with extremely small variance in each Gaussian. Figure 6 shows the comparison of the samples from these two samplers with the same budget: PT is unable to escape the central mode in this extreme setting even with large temperatures τ, whereas Di GS manages to recover all 9 modes, demonstrating its superiority over PT in terms of mode exploration and coverage. To make PT work on this problem, one needs to use more chains with the DEO scheme (Deng et al., 2023); see Appendix D.4 for a detailed description.

3.2. Score-Based Diffusion Model

The convolution technique plays a key role in score-based diffusion models (Song & Ermon, 2019; Song et al., 2021; Song & Ermon, 2020; Ho et al., 2020; Sohl-Dickstein et al., 2015; Gao et al., 2020). These models employ a series of convolutions (indexed by t) to form a distribution sequence from p( x0) p(x) to a simple distribution p( x T ). At each time step t, the score function at time xt log p( xt) is learned from samples x1, , x N p(x) using denoising score matching (Vincent, 2011). To sample from p(x), ULA and Euler discretization are typically applied in reverse (t = T 0), using samples from each time step t as the initial point at time t 1. This scheme has shown stateof-the-art generation quality for complex data like images. However, unlike score-based/diffusion generative models where x log p( x) is learned directly from samples, our setting only assumes access to an energy function E(x) without samples, making the noisy marginal score x log p( x) intractable, as shown in Equation 9.

Zhang et al. (2023a) proposes a pseudo-Gibbs sampler for the joint p(x, x) = p(x)p( x|x) to obtain clean samples from score-based diffusion models, where p( x|x) = N( x|αx, σ2I) and p(x| x) is approximated by a fullcovariance Gaussian. This approach also requires training data to estimate the score function x log p( x), which is different from our problem setting.

3.3. Proximal Sampler

Di GS also belong to the proximal sampler family (Chen et al., 2022; Lee et al., 2021), where Gibbs sampling is executed between the Gaussian convolution distribution p( x|x) and the denoising posterior p(x| x). In the denoising sampling step, proximal samplers find a mode using gradientbased optimization technique and perform rejection sampling around that mode. We highlight several significant improvements of Di GS over the proximal samplers. First, we employ a score-based sampler such as MALA and HMC in the denoising step instead of rejection sampling around a single mode as in proximal samplers, which helps the method scale to higher-dimensional distributions. Second, we employ an innovative Metropolis-within-Gibbs scheme to initialize the denoising sampling step, demonstrating its importance in achieving correct density allocation in multimodal situations, as discussed in Section 2.2. Third, we emphasize the importance of tuning Gaussian convolution kernel in Section 2.4 and employ a multi-level scheduling to eliminate the need of choosing Gaussian convolution hyperparameters. These improvements make the Di GS applicable to practical problems with multi-modal target distributions in high dimensional spaces.

Entropy-MCMC (Li & Zhang, 2023) is designed to sample from the flat regions within the posterior of the Bayesian neural network p(θ|D), where D represents the training dataset. Specifically, the approach begins by defining a surrogate distribution using Gaussian convolution:

p(θa|D) = Z p(θa|θ)p(θ|D) dθ, (19)

where p(θa|θ) exp 1

2η θa θ 2 2 is a Gaussian kernel aimed at smoothing out the sharp regions in the posterior. Subsequently, SGLD (Welling & Teh, 2011) is applied to sample within the joint space of (θ, θa), where samples of both θ and θa are kept. It is crucial to highlight that, while this method aims to seek flat modes in the target distribution p(θ|D), similar to the approaches described in Chaudhari et al. (2019); Staines & Barber (2012), it is not intended and cannot provide exact samples from the target distribution p(θ|D) due to its optimization nature. In contrast, Di GS is designed as a valid MCMC sampler that is specifically designed to identify all modes (including both sharp and flat modes) and their corresponding density allocations in the target distribution, thus pursuing a distinct objective.

Diffusive Gibbs Sampling

0.0 0.5 1.0 1.5 2.0 The Number of Energy Evaluations 1e6

Figure 7. Computational cost comparison between Di GS and RDMC. The x-axis is the number of energy evaluations ( 106) and the y-axis is the MMD between true and generated samples. We use Di GS with a single noise level (α = 1, σ = 1) and run Gibbs sampling for 1-10 sweeps, represented by the 10 blue scatters. The RDMC is experimented with T {1, 2, 3, 4}, represented by the four orange points in the plot. Detailed experimental setup can be found in Appendix D.5.

3.4. Reverse Diffusion Monte Carlo

To emulate a diffusion model sampling process with access to the unnormalized density of the target only, Huang et al. (2023) introduces Reverse Diffusion Monte Carlo (RDMC), which approximates the score function in the noisy space at each time t. Specifically, by rewriting the Tweedie s lemma in Equation 13, the noisy score can be expressed as

xt log p(xt) = (αµ(xt) xt) /σ2 t , (20)

where the denoising posterior mean µ(xt) = R xp(x|xt)

1 K PK k=1 x(t,k) 0 is approximated by the Monte Carlo method, with samples x(t,1) 0 , , x(t,K) 0 obtained by running a score-based sampler such as ULA targeting the posterior p(x0|xt) exp( E(x0) xt αtx0 2/2σ2 t ). Therefore, in RDMC, obtaining a single sample xt p(xt) via ULA at each time step t [1, T] involves an intermediate step of generating K posterior samples via another ULA, leading to a nested MCMC sampling procedure that imposes substantial computational demands. In contrast, Di GS does not have such hierarchy since it requires only one MCMC chain during the denoising sampling step in each Gibbs sweep, significantly reducing the computational burden. Figure 7 compares the computational costs of these two methods when applied to Mo G target as shown in Figure 1, demonstrating that Di GS can achieve the same accuracy as RDMC with 10 fewer energy evaluations. The efficiency of Di GS is particularly vital in applications such as molecular configuration sampling (Section 4.3), where even a single energy evaluation is costly.

3.5. Auxiliary Variable MCMC

Di GS belongs to the broader auxiliary variable MCMC family. This family encompasses various notable methods such as the Swendsen-Wang algorithm (Swendsen & Wang, 1987), slice sampling (Neal, 2003), Hamiltonian Monte

Table 2. Sample quality on Mo G-40. MMD is computed between true samples and samples generated by each sampler. MAE is computed between the true and estimated expectation values of a quadratic function under the target and is expressed as the percentage of the true expectation value.

Sampler MMD MAE (%) #energy eval.

MALA 1.73 0.12 93.3 0.73 1.0 107

HMC 1.70 0.09 92.8 0.34 1.0 107

PT (1.89 0.44) 10 2 7.32 1.85 1.0 107

Di GS (4.57 1.10) 10 4 0.75 0.19 1.0 107

Carlo (HMC) (Duane et al., 1987; Neal et al., 2011), and auxiliary variational MCMC (Habib & Barber, 2018; Agakov & Barber, 2004); see Barber (2012) for a detailed introduction. Among these, HMC bears the closest resemblance to Di GS. We delve into the similarities and differences below.

As introduced in Section 1.1, HMC generates samples from the joint distribution p(x, v) = p(x)p(v) where the v is the auxiliary variable that represents the momentum. The momentum is usually distributed as a Gaussian p(v) = N(v|0, σ2 v I) that is independent of x. However, there are other variants of HMC where the momentum auxiliary variable v depends on x. For example, in Riemannian Manifold HMC (Girolami & Calderhead, 2011), the momentum is distributed as p(v|x) = N(v|0, Σv(x)), where Σv(x) is the Fisher information matrix that captures the local curvature of the energy around x. The auxiliary variable in Di GS is the convolved variable x, which is a noisy version of x, given by p( x|x) = N( x|αx, σ2I). Notably, when α = 0, it follows that p( x|x) = p( x) = N( x|0, σ2I), and the joint distribution p(x, x) = p(x)p( x) in Di GS is identical to p(x, v) in the classic HMC formulation. Despite this similarity, Di GS employs Gibbs sampling which alternately samples from the two conditionals p( x|x) and p(x| x), whereas HMC simulates the Hamiltonian equations as discussed in Section 1.1, interleaved with samples from p(v).

4. Empirical Evaluation

We evaluate Di GS on three complex multi-modal sampling tasks across various domains1: a mixture of 40 Gaussians, Bayesian neural network, and molecular dynamics. For Di GS, we employ MALA with the Metropolis-within-Gibbs scheme. We compare Di GS with three baselines: MALA, HMC and PT. In all experiments, the step sizes of MALA and HMC are tuned via trial-and-error so that the acceptance rates are close to 0.574 (Roberts & Rosenthal, 1998) and 0.65 (Neal et al., 2011), respectively. We choose not to compare with RDMC, since it is computationally intractable on these complex tasks as demonstrated in Section 3.4.

1The code of our experiments can be found in https:// github.com/Wenlin-Chen/Di GS.

Diffusive Gibbs Sampling

(a) True Samples

Figure 8. Visualization of 104 samples for Mo G-40 generated by each sampler. All samplers are initialized at the origin.

4.1. Mixture of 40 Gaussians

We first consider a synthetic problem from Midgley et al. (2023), which is a 2D Mo G with 40 mixture components. This is a relatively challenging multi-modal sampling task, yet it allows for visual examination of the mode-coverage property for each method. In this experiment, each method is initialized at the origin and generates 104 samples for evaluation. MALA runs 1,000 Langevin steps per sample. HMC runs 1,000 leapfrog steps per sample. PT consists of 5 chains with temperatures τ = {1.0, 5.62, 31.62, 177.83, 1000.0}, where each chain is constructed by an HMC sampler with 200 leapfrog steps per sample. Di GS uses T = 1 noise level with α = 0.1 and σ2 = 1 α2, 200 Gibbs sweeps, and 5 MALA denoising sampling steps per Gibbs sweep.

Figure 8 shows a visual comparison for 104 samples generated by each method. We can see that MALA and HMC fail to explore the modes that are far away from the origin. PT covers all 40 modes but produces significantly less samples for the modes on the top-right and bottom-right corners. Di GS manages to cover all modes with the right amount of samples in each mode. Table 2 shows the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) (computed with 5 kernels with bandwidths {2 2, 2 1, 20, 21, 22}) between the true samples and samples generated by each sampler and the Mean Absolute Error (MAE) between the true and estimated expectations of a quadratic function under the Mo G-40 target. This demonstrates that our method significantly outperforms all baselines on this problem.

4.2. Bayesian Neural Networks

The posterior density of the parameters in a Bayesian neural network (BNN) is known to be complex and multimodal (Barber & Bishop, 1998; Hern andez-Lobato & Adams, 2015; Louizos & Welling, 2017; Izmailov et al., 2018). For a given training dataset Dtrain = {(xi, yi)}N i=1, the posterior density can be expressed as p(θ|Dtrain) p(θ) Q

i p(yi|xi, θ), where p(θ) is the prior density over the parameters and p(y|x, θ) is the likelihood given by the NN fθ(x) for a data point (x, y). We consider a threelayer neural network with Re LU activation, input-layer size dx = 20, hidden-layer size dh = 25, and output-layer size

Table 3. Average test predictive NLL for the BNN estimated by 103 samples generated by each sampler.

Sampler NLL #energy evaluations

MAP 0.548 0.066 5.0 106

MALA 0.399 0.014 5.0 106

HMC 0.315 0.012 5.0 106

PT 0.241 0.005 5.0 106

Di GS 0.189 0.002 5.0 106

dy = 1. This results in d = 550 parameters in total. We use a Gaussian prior p(θ) = N(θ|0, σ2 p I) with σp = 1/ dfan-in and a Gaussian likelihood p(y|x, θ) = N(y|fθ(x), σ2 n) with σn = 0.1. We sample the ground-truth parameters θ p(θ) from the prior and use θ to generate N = 500 training points and 500 test points for evaluation.

All methods are initialized at the same random sample from the prior and generate 150 samples from the posterior p(θ|Dtrain) for evaluation. MAP (Maximum a Posteri) runs 7.5 105 full-batch gradient descent steps. MALA runs 5,000 Langevin steps per sample. HMC runs 5,000 leapfrog steps per sample. PT consists of 5 chains with temperatures τ = {1.0, 5.62, 31.62, 177.83, 1000.0}, where each chain is constructed by an HMC sampler with 1,000 leapfrog steps per sample. Di GS uses the VP schedule with T = 5 noise levels, ranging from αT = 0.1 to α1 = 0.9, each with 100 Gibbs sweeps and 10 MALA denoising sampling steps per Gibbs sweep. Table 3 reports the average predictive negative log-likelihood (NLL) for each method on the test data, which shows that Di GS significantly outperforms other baselines. We speculate that the performance gain comes from the fact that Di GS captures a broader range of modes.

4.3. Molecular Configuration Sampling

Finally, we consider a real-world problem of sampling equilibrium molecular configurations from the Boltzmann distribution of the 22-atom molecule alanine dipeptide in an implicit solvent at temperature 300K, where the potential energy E(x) is a function of 3D atomic coordinates obtained by simulating physical laws (Wu et al., 2020; Dibak et al., 2022; Campbell et al., 2021; Stimper et al., 2022;

Diffusive Gibbs Sampling

Table 4. KL divergences for the Ramachandran plots and the marginals of the dihedral angles ϕ and ψ in alanine dipeptide.

Sampler p(ϕ) p(ψ) Ramachandran #energy evaluations #samples

MALA 1.9 10 3 5.3 10 4 1.1 10 2 1.0 109 106

HMC 4.0 10 4 3.1 10 4 7.1 10 3 1.0 109 106

Di GS 2.3 10 4 1.2 10 4 4.3 10 3 1.0 109 106

MD PT (Midgley et al., 2023) 2.5 10 4 1.4 10 4 5.3 10 3 2.3 1010 106

Ground-truth (Midgley et al., 2023) 2.3 1011 107

Midgley et al., 2023). This is a very challenging problem since E(x) is highly multi-modal with many high energy barriers and is also costly to evaluate. Following the setup in Midgley et al. (2023), we represent the molecule with d = 60 roto-translation invariant internal coordinates.

Figure 9. Alanine dipeptide (Midgley et al., 2023)

Each method is initialized at the minimum energy configuration as in Midgley et al. (2023) and generates 106 samples of molecular configurations for evaluation. MALA runs 1,000 Langevin steps per sample. HMC runs 1,000 leapfrog steps per sample. Di GS uses the VP schedule with T = 2 noise levels (α2 = 0.1 and α1 = 0.9), each with 100 Gibbs sweeps and 5 MALA denoising sampling steps per Gibbs sweep. Note that PT is the gold-standard method in molecular dynamics (MD) simulation, which serves as the ground-truth. The PT samples are taken from Midgley et al. (2023), which is generated using 21 chains starting at temperature 300K and increasing the temperature by 50K for each subsequent chain, where each chain is constructed by an HMC sampler with 1,000 leapfrog steps per sample. We follow Midgley et al. (2023) and treat 107 PT samples as the ground-truth. In addition, we consider a baseline of 106 PT samples as a reference. The quality of the sampled configurations is assessed by the Ramachandran plot (Ramachandran et al., 1963), which can be used to analyze how the protein folds locally. A Ramachandran plot is a 2D histogram for joint distribution of the two dihedral angles ϕ and ψ in the bonds connecting an amino acid to the protein backbone, as shown in Figure 9. Table 4 shows the KL divergences for the Ramachandran plots p(ϕ, ψ) and the marginals p(ϕ) and p(ψ) between the ground-truth samples and samples generated by each sampler. We can see that Di GS significantly outperforms MALA and HMC with the same number of energy evaluations. Moreover, Di GS also outperforms the goldstandard MD simulation method PT with 23 less energy evaluations. Figure 10 shows a visual comparison between the ground-truth Ramachandran plots and the one produced by Di GS, confirming that Di GS captures all modes with the correct weightings.

(a) Ground-truth

(b) Di GS (106 samples)

Figure 10. Ramachandran plots for alanine dipeptide.

5. Conclusion

Diffusive Gibbs Sampling (Di GS) is an innovative sampler that combines recent development in diffusion models with a novel Metropolis-within-Gibbs scheme. Di GS offers significant improvements in sampling multi-modal distributions, surpassing traditional methods in both efficiency and accuracy. Its applicability across various fields (e.g., Bayesian inference and molecular dynamics) showcases the potential of Di GS to facilitate sampling complex distributions in numerous scientific applications.

5.1. Limitations and Future Work

In our experiments, we observed that the acceptance rate of the Metropolis-within-Gibbs scheme given by Equation 15 was around 0.15, due to the random walk behavior of the linear Gaussian proposal as in Equation 14. A potential future work direction would be to design more efficient proposals for the Metropolis-within-Gibbs scheme with higher acceptance rate, for instance, combining the Gaussian convolution kernel and the linear Gaussian proposal by marginalizing out the noisy variable: q(x|x(i 1)) = R q(x| x)p( x|x(i 1))d x (Titsias & Papaspiliopoulos, 2018). Empirically, it would be interesting to investigate the the performance of Di GS with different score-based samplers such as HMC in the denoising sampling step in future work. Besides, other multi-modal samplers such as population-based MCMC could be considered as baselines for comparison in future work. We also leave the convergence analysis of Di GS in the general multi-level noise scheduling setting as a future work; see Appendix C for some preliminary discussions.

Diffusive Gibbs Sampling

Acknowledgements

We thank Andi Zhang for useful discussions on RDMC, Ruqi Zhang for highlighting Entropy-MCMC, an anonymous reviewer for discussions on PT, and Arnaud Doucet, Michael Hutchinson and Sam Power for pointing out proximal samplers.

MZ and DB acknowledge funding from the Cisco Centre of Excellence. WC acknowledges funding via a Cambridge Trust Scholarship (supported by the Cambridge Trust) and a Cambridge University Engineering Department Studentship (under grant G105682 NMZR/089 supported by Huawei R&D UK). JMHL acknowledges support from a Turing AI Fellowship under grant EP/V023756/1.

Impact Statement

The goal of this paper is to advance the field of machine learning. There could be many potential positive societal consequences of this methodological work, too numerous to specifically highlight here.

Agakov, F. V. and Barber, D. An auxiliary variational method. In Neural Information Processing: 11th International Conference, ICONIP 2004, Calcutta, India, November 22-25, 2004. Proceedings 11, pp. 561 566. Springer, 2004.

Barber, D. Bayesian reasoning and machine learning. Cambridge University Press, 2012.

Barber, D. and Bishop, C. M. Ensemble learning in Bayesian neural networks. Nato ASI Series F Computer and Systems Sciences, 168:215 238, 1998.

Brown, B. C., Caterini, A. L., Ross, B. L., Cresswell, J. C., and Loaiza-Ganem, G. The union of manifolds hypothesis and its implications for deep generative modelling. ar Xiv preprint ar Xiv:2207.02862, 2022.

Campbell, A., Chen, W., Stimper, V., Hernandez-Lobato, J. M., and Zhang, Y. A gradient based strategy for Hamiltonian Monte Carlo hyperparameter optimization. In International Conference on Machine Learning, pp. 1238 1248. PMLR, 2021.

Chaudhari, P., Choromanska, A., Soatto, S., Le Cun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.

Chen, Y., Chewi, S., Salim, A., and Wibisono, A. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp. 2984 3014. PMLR, 2022.

Deng, W., Zhang, Q., Feng, Q., Liang, F., and Lin, G. Nonreversible parallel tempering for deep posterior approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7332 7339, 2023.

Dibak, M., Klein, L., Kr amer, A., and No e, F. Temperature steerable flows and Boltzmann generators. Physical Review Research, 4(4):L042005, 2022.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics letters B, 195(2):216 222, 1987.

Efron, B. Tweedie s formula and selection bias. Journal of the American Statistical Association, 106(496):1602 1614, 2011.

Gao, R., Lu, Y., Zhou, J., Zhu, S.-C., and Wu, Y. N. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155 9164, 2018.

Gao, R., Song, Y., Poole, B., Wu, Y. N., and Kingma, D. P. Learning energy-based models by diffusion recovery likelihood. ar Xiv preprint ar Xiv:2012.08125, 2020.

Geyer, C. J. and Thompson, E. A. Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90 (431):909 920, 1995.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73(2):123 214, 2011.

Grenander, U. and Miller, M. I. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549 581, 1994.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012.

Habib, R. and Barber, D. Auxiliary variational MCMC. In International Conference on Learning Representations, 2018.

Hern andez-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International conference on machine learning, pp. 1861 1869. PMLR, 2015.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020.

Diffusive Gibbs Sampling

Huang, X., Hanze Dong, Y. H., Ma, Y., and Zhang, T. Reverse Diffusion Monte Carlo. ar Xiv preprint ar Xiv:2307.02037, 2023.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. ar Xiv preprint ar Xiv:1803.05407, 2018.

Lee, Y. T., Shen, R., and Tian, K. Structured logconcave sampling with a restricted gaussian oracle. In Conference on Learning Theory, pp. 2993 3050. PMLR, 2021.

Li, B. and Zhang, R. Entropy-mcmc: Sampling from flat basins with ease. ar Xiv preprint ar Xiv:2310.05401, 2023.

Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning, pp. 2218 2227. PMLR, 2017.

Midgley, L. I., Stimper, V., Simm, G. N. C., Sch olkopf, B., and Hern andez-Lobato, J. M. Flow annealed importance sampling bootstrap. In The Eleventh International Conference on Learning Representations, 2023.

Neal, R. M. Annealed importance sampling. Statistics and computing, 11:125 139, 2001.

Neal, R. M. Slice sampling. The annals of statistics, 31(3): 705 767, 2003.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11):2, 2011.

No e, F., Olsson, S., K ohler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.

Pompe, E., Holmes, C., and Łatuszy nski, K. A framework for adaptive MCMC targeting multimodal distributions. The Annals of Statistics, 48(5):2930 2952, 2020.

Ramachandran, G., Ramakrishnan, C., and Sasisekharan, V. Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology, 7(1):95 99, 1963.

Robbins, H. E. An empirical bayes approach to statistics. In Breakthroughs in Statistics: Foundations and basic theory, pp. 388 394. Springer, 1992.

Robert, C. P., Casella, G., and Casella, G. Monte Carlo statistical methods, volume 2. Springer, 1999.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255 268, 1998.

Roberts, G. O. and Smith, A. F. Simple conditions for the convergence of the Gibbs sampler and Metropolis Hastings algorithms. Stochastic processes and their applications, 49(2):207 216, 1994.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4:337 357, 2002.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996.

Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Stabilizing training of generative adversarial networks through regularization. Advances in neural information processing systems, 30, 2017.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256 2265. PMLR, 2015.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Staines, J. and Barber, D. Variational optimization. ar Xiv preprint ar Xiv:1212.4507, 2012.

Stimper, V., Sch olkopf, B., and Hern andez-Lobato, J. M. Resampling base distributions of normalizing flows. In International Conference on Artificial Intelligence and Statistics, pp. 4915 4936. PMLR, 2022.

Surjanovic, N., Syed, S., Bouchard-Cˆot e, A., and Campbell, T. Parallel tempering with a variational reference. Advances in Neural Information Processing Systems, 35: 565 577, 2022.

Swendsen, R. H. and Wang, J.-S. Replica Monte Carlo simulation of spin-glasses. Physical review letters, 57 (21):2607, 1986.

Diffusive Gibbs Sampling

Swendsen, R. H. and Wang, J.-S. Nonuniversal critical dynamics in Monte Carlo simulations. Physical review letters, 58(2):86, 1987.

Syed, S., Bouchard-Cˆot e, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: a scalable highly parallel mcmc scheme. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):321 350, 2022.

Titsias, M. K. and Papaspiliopoulos, O. Auxiliary gradientbased sampling algorithms. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4): 749 767, 2018.

Vempala, S. and Wibisono, A. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.

Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681 688, 2011.

Wenliang, L. and Kanagawa, H. Blindness of score-based methods to isolated components and mixing proportions. ar Xiv preprint ar Xiv:2008.10087, 2020.

Wu, H., K ohler, J., and No e, F. Stochastic normalizing flows. Advances in Neural Information Processing Systems, 33: 5933 5944, 2020.

Zhang, M., Hayes, P., Bird, T., Habib, R., and Barber, D. Spread divergence. In International Conference on Machine Learning, pp. 11106 11116. PMLR, 2020.

Zhang, M., Key, O., Hayes, P., Barber, D., Paige, B., and Briol, F.-X. Towards healing the blindness of score matching. ar Xiv preprint ar Xiv:2209.07396, 2022.

Zhang, M., Hawkins-Hooker, A., Paige, B., and Barber, D. Moment matching denoising gibbs sampling. Advances in Neural Information Processing Systems, 36, 2023a.

Zhang, M., Sun, Y., Zhang, C., and Mcdonagh, S. Spread flows for manifold modelling. In International Conference on Artificial Intelligence and Statistics, pp. 11435 11456. PMLR, 2023b.

Diffusive Gibbs Sampling

A. Proof of Theorem 2.1

Proof. A sufficient condition for irreducibility and recurrency of a Markov chain is that the joint distribution p(x, x) should satisfy the positivity condition (Robert et al., 1999; Roberts & Smith, 1994), which requires p(x , x ) > 0 for all x , x such that p(x ) > 0 and p( x ) > 0, where p( x ) = R p( x |x)p(x) dx. This requirement is satisfied in the Di GS. This is because p(x, x) = p( x|x)p(x), where the Gaussian convolution kernel p( x|x) has full support in Rd. Consequently, if p(x ) > 0, then it follows that p( x ) > 0 and p(x , x ) > 0.

B. An Analytical Example of Tempering v.s. Convolution

Although tempering-based sampling can alleviate the issue of multi-modal sampling in certain contexts, they still struggle in situations where the modes are less connected or completely isolated. Consider a toy example of a mixture of two Gaussians in 1D, given by p(x) = 1

2N(x|µ, σ2 g) + 1

2N(x| µ, σ2 g) where both Gaussian components have the same variance σ2 g and symmetric means positioned at µ, µ. A point of interest is x = 0, which lies in the low-density region between the two modes and acts as a barrier point, hindering the state transition from one mode to another. In this example, the tempered log-density has a closed-form expression up to some constant Cβ:

log pβ(x = 0) = β log p(x = 0) + Cβ

2σ2g log σg

For any given inverse temperature β and position µ, as σg 0, we have log pβ(x = 0) and thus pβ(x = 0) 0. This Mixture of Deltas example shows that tempering methods do not effectively overcome the low-density barrier in such situations. Furthermore, if we consider scenarios where each component of the mixture distribution is entirely disconnected, the tempered density in these regions remains zero, as tempering does not alter the support of a distribution.

In contrast, we define the convolved distribution p( x) = R p( x|x)p(x) dx for the same target p(x) with a convolution kernel p( x|x) = N( x|x, σ2). For any given µ and σg, without loss of generality, we choose σ σg/δ with a small constant δ > 0. This leads to a lower bound:

log p( x = 0) = µ2

2(σ2g + σ2) 1

2 log(2π(σ2 g + σ2))

2 log(2πσ2(1 + δ2)), (22)

illustrating that the convolved log-density at x = 0 is lower-bounded for any σ > 0 as σg 0, since the lower bound of log p( x = 0) is independent of σg and remains finite. This approach effectively guarantees the maintenance of a non-negligible density within the bridges connecting different modes of distribution. Figure 5 presents a one-dimensional comparison between the tempering and convolution methods. However, in cases where the standard deviation of the Mixture of Gaussians (Mo G) is exceptionally small, the efficacy of tempering diminishes significantly. In such instances, even with a large temperature T, tempering fails to connect the modes. Conversely, the convolution method continues to effectively bridge the modes, even with a relatively small σ. This comparison underscores the convolution method s robustness in handling scenarios with disconnected modes, a situation that also mirrors challenges often encountered in high-dimensional cases, whereas the tempering method struggles.

C. Discussions of Convergence Guarantees

In some specific cases, the convergence rate of Di GS may be established. Specifically, the convergence rate of Di GS depends on both the convergence of the inner denoising sampling step and the convergence of the outer Gibbs sampler. For instance, when using MALA to sample from the denoising posterior p(x| x), the convergence of MALA has been studied under different conditions such as strong log-concavity and log-Sobolev inequality (Vempala & Wibisono, 2019). Compared to direct sampling from the target distribution, the L2 regularizer in the log denoising posterior can improve the log-Sobolev condition:

log p(x| x) = E(x) αx x 2

2σ2 + const, (23)

Diffusive Gibbs Sampling

which in turn accelerates the convergence of MALA (Huang et al., 2023). Intuitively, this makes the target distribution more Gaussian-like . The convergence of the outer Gibbs sampler can be analyzed following the approach in Chen et al. (2022), demonstrating convergence under conditions of strong log-concavity/log-concavity under Wasserstein distance and log-Sobolev inequality under KL divergence. Therefore, the theoretical convergence guarantee of Di GS with a single noise level can be obtained by combining the convergence analyses of the inner MALA denoising sampling step and the outer Gibbs sampler from Vempala & Wibisono (2019); Chen et al. (2022); Huang et al. (2023).

However, in addition to the aforementioned factors, the multi-level noise schedule also plays a part in accelerating the convergence speed. Furthermore, our empirical investigation revealed that the performance of Di GS for practical problems such as molecular dynamics could also be affected by the initial condition (e.g., we followed the convention of molecular dynamics to initialize Di GS at the minimum energy configuration, which turns out to be better than random initialization). These analyses are out of the scope of this work, and we leave the theoretical study of the general Di GS framework as a future research direction.

D. Experimental Details

In all experiments, the step sizes of MALA and HMC are tuned via trial-and-error so that the acceptance rates are close to optimal values 0.574 (Roberts & Rosenthal, 1998) and 0.65 (Neal et al., 2011), respectively.

D.1. Comparison of Initialization Strategies for the Denoising Sampling Step

For the comparison of three initialization strategies for the denoising sampling step in Figure 3 and Table 1 in Section 2.2, we run Di GS with α = 1, σ = 1 for 200 Gibbs sweeps on an Mo G with unbalanced weights. The denoising sampling step begins with an initialization using the Metropolis-Hastings (MH) algorithm, followed by 50 MALA steps with a step size of 1 10 3. For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) that utilizes 5 kernels with bandwidths {2 2, 2 1, 20, 21, 22}.

D.2. Comparison of Gaussian Convolution Hyperparameters

For the convolution parameter comparison in Section 2.3, we use Di GS to generate 1,000 samples. For each sample, we execute 1,000 Gibbs sweeps. The denoising sampling step begins with an initialization using the Metropolis-Hastings (MH) algorithm, followed by 10 MALA steps. The step size for MALA is set at 1 10 3. We set σ = 1.0 and vary α across {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 5.0} for the experiment shown in Figure 4(a). Similarly, we set α = 1.0 and vary σ across {0.1, 0.3, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 15, 20} for the experiment shown in Figure 4(b). For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) that utilizes 5 kernels with bandwidths {2 2, 2 1, 20, 21, 22}.

D.3. Multi-Level Noise Scheduling

For the multi-level noise experiment with the VP schedule in Section 2.4, we define αt = αT + (T t) α, where α = (α1 αT )/(T 1) and σt = p

1 α2 t. We set α1 = 0.9 and αT = 0.1, and experiment with T {2, 3, 4, 5} for the comparison shown in Figure 4(c). For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) that utilizes 5 kernels with bandwidths {2 2, 2 1, 20, 21, 22}.

D.4. Comparison with Parallel Tempering

For comparison with parallel tempering on the Mixture of Delta problem in Figure 6 in Section 3.1, each sampler is initialized at the origin and generates 1,000 samples. Di GS employs a single noise level Gaussian convolution with parameters α = 1 and σ = 1, complemented by 1,000 Gibbs sweeps per sample. The denoising sampling step in each Gibbs sweep begins with MH initialization, followed by 5 steps of MALA with a step size of 1 10 3. PT consists of 5 chains with temperatures τ={1.0, 5.62, 31.62, 177.83, 1000.0}, where each chain is constructed by an HMC sampler with 1,000 leapfrog steps per sample and a step size of 1 10 2.

To make PT work on this problem, one needs to use 20 chains with temperatures τ={1.0, 2.07, 4.28, 8.86, 18.33, 37.93, 78.48, 162.38, 335.98, 695.19, 1432.45, 2976.35, 6158.48, 12742.75, 26366.51, 54555.95, 112883.79, 233572,15, 483293.02, 1000000.0} and the DEO scheme (Deng et al., 2023), where each chain is constructed by an HMC sampler with 500 leapfrog

Diffusive Gibbs Sampling

steps per sample and a step size of 1 10 2.

D.5. Comparison with RDMC

For the comparison with Reverse Diffusion Monte Carlo (RDMC) in Figure 7 in Section 3.4, we follow the original setting discussed in Huang et al. (2023) and use ULA for generating samples for the posterior distribution. The score function of the reverse distribution using K = 5 to construct the score approximation

xt log p(xt)

k=1 x(t,k) 0 xt

/σ2 t , (24)

where x(t,k) 0 p(x0|xt) denotes the samples drawn using ULA with LULA = 5 steps and a step size of 1 10 2. These samples are initialized through importance sampling (IS) with Sis = 100 samples. The discretization step size η is set to Γ/T, with the scaling factor at = e (Γ t η) and the standard deviation σt = p

1 α2 t. Following the procedure in Huang et al. (2023), we further apply ULA for additional LULA steps after RDMC. The parameter Γ is fixed at 0.1 for our experiments, and we explore different values of T {1, 2, 3, 4}, necessitating TK(LULA + Sis) + LULA energy evaluations to generate a single sample.

For Di GS, we use Di GS with one noise level Gaussian convolution with parameter α = 1, σ = 1. For sampling from the denoising distribution, we use the MH initialization followed by the ULA sampling with LULA = 5 steps and a step size of 1 10 2. This MH+ULA scheme ensures a fair comparison to the IS+ULA scheme used in the RDMC. We vary Sgibbs {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} Gibbs sweeps to generate a sample, which takes Sgibbs(LULA + 2) energy evaluations in total, where the constant 2 accounts for the two energy evaluations required by MH. For both Di GS and RDMC, a total of 1,000 samples are generated for comparison.

For evaluation, we employ the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) that utilizes 5 kernels with bandwidths {2 2, 2 1, 20, 21, 22}.

D.6. Mixture of 40 Gaussians

For the Mo G-40 experiment in Section 4.1, each method is initialized at the origin and generates 104 samples for evaluation. MALA runs 1,000 Langevin steps per sample with a step size of 1 10 1. HMC runs 1,000 leapfrog steps per sample with a step size of 1 10 1. PT consists of 5 chains with temperatures τ = {1.0, 5.62, 31.62, 177.83, 1000.0}, where each chain is constructed by an HMC sampler with 200 leapfrog steps per sample with a step size of 1 10 1. Di GS uses T = 1 noise level with α = 0.1 and σ2 = 1 α2, 200 Gibbs sweeps, and 5 MALA denoising sampling steps per Gibbs sweep with a step size of 1 10 1.

D.7. Bayesian Neural Networks

For the BNN experiment in Section 4.2, all methods are initialized at the same random sample from the prior and generate 150 samples from the posterior p(θ|Dtrain) for evaluation. MAP runs 7.5 105 full-batch gradient descent steps with a step size of 3 10 2. MALA runs 5,000 Langevin steps per sample with a step size of 1 10 4. HMC runs 5,000 leapfrog steps per sample with a step size of 5 10 4. PT consists of 5 chains with temperatures τ = {1.0, 5.62, 31.62, 177.83, 1000.0}, where each chain is constructed by an HMC sampler with 1,000 leapfrog steps per sample with a step size of 5 10 4. Di GS uses the VP schedule with T = 5 noise levels, ranging from αT = 0.1 to α1 = 0.9, each with 100 Gibbs sweeps and 10 MALA denoising sampling steps per Gibbs sweep with a step size of 1 10 4.

D.8. Molecular Configuration Sampling

For the molecular configuration sampling experiment in Section 4.3, each method is initialized at the minimum energy configuration as in Midgley et al. (2023) and generates 106 samples of configurations for evaluation. MALA runs 1,000 Langevin steps per sample with a step size of 1 10 4. HMC runs 1,000 leapfrog steps per sample with a step size of 1 10 3. Di GS uses the VP schedule with T = 2 noise levels (α2 = 0.1 and α1 = 0.9), each with 100 Gibbs sweeps and 5 MALA denoising sampling steps per Gibbs sweep with a step size of 1 10 4. Note that PT is the gold-standard method in molecular dynamics (MD) simulation, which serves as the ground-truth. The PT samples are taken from Midgley et al. (2023), which is generated using 21 chains starting at temperature 300K and increasing the temperature by 50K for

Diffusive Gibbs Sampling

Table 5. Sample quality on a 200-dimensional mixture of Gaussians problem.

Sampler MMD #energy evaluations

MALA 0.123 0.035 106

HMC 0.104 0.025 106

PT 0.094 0.014 106

Di GS 0.027 0.010 106

each subsequent chain, where each chain is constructed by an HMC sampler with 1,000 leapfrog steps per sample. We follow Midgley et al. (2023) and treat 107 PT samples as the ground-truth. In addition, we consider a baseline of 106 PT samples as a reference.

E. Additional Experimental Results

E.1. 200-Dimensional Mixture of Gaussians

We take the Mo G target in Figure 2(a) and generalize it to the 200-dimensional space. For each compared sampler, we draw 1,000 samples within a given compute budget of 106 energy evaluations. All samplers are tuned to achieve their optimal acceptance rate as described in Section 4. Table 5 shows that Di GS significantly outperforms all the other samplers even in the high-dimensional space.

E.2. Molecular Configuration Sampling

The Ramachandran plots produced by all compared methods for the molecular configuration sampling experiment in Section 4.3 can be found in Figure 11.

Diffusive Gibbs Sampling

(a) Dihedral angles ϕ, ψ in alanine dipeptide

(b) MD (107 samples, 2.3 1011 energy evaluations)

(c) MALA (106 samples, 1.0 109 energy evaluations)

(d) HMC (106 samples, 1.0 109 energy evaluations)

(e) PT (106 samples, 2.3 1010 energy evaluations)

(f) Di GS (106 samples, 1.0 109 energy evaluations)

Figure 11. (a) Visualization of the dihedral angles ϕ and ψ in alanine dipeptide (Midgley et al., 2023). (b)-(d) Ramachandran plots for the dihedral angles ϕ and ψ of alanine dipeptide. MD simulation (PT with 107 samples) serves as ground-truth.