# Particle Denoising Diffusion Sampler

Angus Phillips¹, Hai-Dang Dau¹, Michael John Hutchinson¹, Valentin De Bortoli², George Deligiannidis¹, Arnaud Doucet¹

¹University of Oxford. ²CNRS, ENS Ulm. Correspondence to: Angus Phillips.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Denoising diffusion models have become ubiquitous for generative modeling. The core idea is to transport the data distribution to a Gaussian by using a diffusion. Approximate samples from the data distribution are then obtained by estimating the time-reversal of this diffusion using score matching ideas. We follow here a similar strategy to sample from unnormalized probability densities and compute their normalizing constants. However, the time-reversed diffusion is here simulated by using an original iterative particle scheme relying on a novel score matching loss. Contrary to standard denoising diffusion models, the resulting Particle Denoising Diffusion Sampler (PDDS) provides asymptotically consistent estimates under mild assumptions. We demonstrate PDDS on multimodal and high-dimensional sampling tasks.

## 1. Introduction

Consider a target probability density $\pi$ on $\mathbb{R}^d$ of the form
$$\pi(x) = \frac{\gamma(x)}{\int_{\mathbb{R}^d}\gamma(x)\,\mathrm{d}x}, \qquad (1)$$
where $\gamma:\mathbb{R}^d\to\mathbb{R}_+$ can be evaluated pointwise but its normalizing constant $Z=\int_{\mathbb{R}^d}\gamma(x)\,\mathrm{d}x$ is intractable. We develop here a Monte Carlo scheme to sample approximately from $\pi$ and estimate $Z$. We follow an approach inspired by denoising diffusion models (Ho et al., 2020; Song et al., 2021; Song & Ermon, 2019) by considering a noising diffusion progressively transporting the original target to a Gaussian. The time-reversal of this diffusion, the denoising diffusion, theoretically allows us to sample from the target starting from Gaussian noise. However, it is impossible to simulate this process exactly as its drift depends on the gradient of the logarithm of the
intractable marginal densities of the noising diffusion, i.e. the score. For generative modeling, where one has access to samples from $\pi$, one can rely on neural networks and score matching (Hyvärinen, 2005; Vincent, 2011). This strategy is not applicable in the Monte Carlo sampling context: we cannot simulate the noising diffusion, since we have no samples from $\pi$ with which to approximate its initial distribution.

The idea of using denoising diffusion models for Monte Carlo sampling has already been explored by Berner et al. (2022); Mc Donald & Barron (2022); Vargas et al. (2023); Huang et al. (2024); Richter et al. (2024); Zhang et al. (2024). Berner et al. (2022); Richter et al. (2024); Vargas et al. (2023) focus on the minimization of a reverse Kullback-Leibler divergence or a log-variance criterion, while Mc Donald & Barron (2022) rely on an importance sampling scheme which scales poorly in high dimensions. Finally, Huang et al. (2024) rely on a sequence of Markov chain Monte Carlo (MCMC) runs to estimate the score; this last scheme does not provide estimates of normalizing constants.

We develop here an alternative approach inspired by denoising diffusion models with guidance. Guided diffusions combine pre-trained diffusion models with a guidance term derived from a likelihood to sample approximately from posterior distributions; see e.g. Song et al. (2021); Chung et al. (2023); Song et al. (2023); Corso et al. (2023). While they provide samples with appealing perceptual properties, they rely on various approximations, in order of importance: (1) approximation of the score and guidance terms, (2) time-discretization of the diffusion and (3) approximate initialization of the diffusion. Wu et al. (2023); Cardoso et al. (2024) have used particle methods, also known as Sequential Monte Carlo (SMC) (Doucet et al., 2001; Chopin & Papaspiliopoulos, 2020), to obtain consistent estimates in the generative modeling context.
Our contributions are as follows: (1) we adapt guided diffusions to sampling problems; (2) we provide theoretical results quantifying the error introduced by current guided diffusions in a simple scenario; (3) we develop an SMC scheme providing consistent estimates in this setup and establish limit theorems; (4) we introduce an algorithm that reduces the variance of the SMC estimates based on a novel score matching loss. All proofs are postponed to the Appendix.

## 2. Denoising Diffusions with Guidance

### 2.1. Noising and denoising diffusions

Consider the following noising diffusion $(X_t)_{t\in[0,T]}$,
$$\mathrm{d}X_t = -\beta_t X_t\,\mathrm{d}t + \sqrt{2\beta_t}\,\mathrm{d}W_t, \qquad X_0 \sim \pi, \qquad (2)$$
where $(W_t)_{t\in[0,T]}$ is a $d$-dimensional Brownian motion and $\beta_t > 0$. The transition density of this diffusion is given by $p(x_t|x_0) = \mathcal{N}(x_t; \sqrt{1-\lambda_t}\,x_0, \lambda_t I)$ for $\lambda_t = 1 - \exp[-2\int_0^t\beta_s\,\mathrm{d}s]$. We denote by $\pi_t$ the density of $X_t$ under (2). In practice, we consider $\int_0^T\beta_s\,\mathrm{d}s \gg 1$, and therefore $\pi_T(x) \approx \mathcal{N}(x; 0, I)$. The diffusion (2) thus transforms $\pi_0 = \pi$ into approximately $\mathcal{N}(0, I)$. If instead we initialize (2) using $p_0(x) = \mathcal{N}(x; 0, I)$, its marginals $(p_t)_{t\in[0,T]}$ satisfy $p_t = p_0$.

The time-reversal $(Y_t)_{t\in[0,T]} = (X_{T-t})_{t\in[0,T]}$ of (2), the denoising diffusion, satisfies $Y_T \sim \pi$ and
$$\mathrm{d}Y_t = [\beta_{T-t} Y_t + 2\beta_{T-t}\nabla\log\pi_{T-t}(Y_t)]\,\mathrm{d}t + \sqrt{2\beta_{T-t}}\,\mathrm{d}B_t, \qquad (3)$$
where $(B_t)_{t\in[0,T]}$ is another Brownian motion; see e.g. Haussmann & Pardoux (1986); Cattiaux et al. (2023). The main idea of denoising diffusions is to sample from $\pi$ by simulating (3), as $Y_T \sim \pi$ (Ho et al., 2020; Song et al., 2021). However, we cannot simulate (3) exactly as, in order of importance, (1) the score terms $(\nabla\log\pi_t)_{t\in[0,T]}$ are intractable, (2) it is necessary to time-discretize the diffusion and (3) $\pi_T$ cannot be sampled. We can always use numerical integrators and approximate $\pi_T$ with a unit Gaussian distribution to mitigate (2) and (3).
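As a concrete illustration, the transition density above can be sampled in closed form. The sketch below is a toy NumPy check, assuming a constant schedule $\beta_s \equiv \beta$ (an illustrative choice, not the paper's) so that $\lambda_t$ is explicit; it verifies that the noising diffusion (2) pushes even a point mass far from the origin towards $\mathcal{N}(0, I)$:

```python
import numpy as np

def lambda_t(t, beta=1.0):
    # lambda_t = 1 - exp(-2 * int_0^t beta_s ds), here for a constant schedule beta_s = beta
    return 1.0 - np.exp(-2.0 * beta * t)

def noise_sample(x0, t, beta=1.0, rng=None):
    # One draw from the transition p(x_t | x_0) = N(sqrt(1 - lambda_t) x0, lambda_t I)
    rng = np.random.default_rng() if rng is None else rng
    lam = lambda_t(t, beta)
    return np.sqrt(1.0 - lam) * x0 + np.sqrt(lam) * rng.standard_normal(x0.shape)

# For int_0^T beta_s ds >> 1 the marginal is approximately N(0, I), whatever the target was.
rng = np.random.default_rng(0)
x0 = np.full(100_000, 3.0)        # degenerate "target": a point mass far from the origin
xT = noise_sample(x0, t=5.0, rng=rng)
```

With $t = 5$ and $\beta = 1$ we have $\sqrt{1-\lambda_t} = e^{-5} \approx 0.007$, so the sample mean of `xT` is nearly $0$ and its variance nearly $1$.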
In generative modeling, (1) is addressed by leveraging tools from the score matching literature (Vincent, 2011; Hyvärinen, 2005) and using neural network estimators. In our sampling setting we do not have access to samples from $\pi$ but only to its unnormalized density; therefore, alternative approximations must be developed.

### 2.2. Denoising diffusions with guidance

For generative modeling, the use of denoising diffusions with guidance terms to sample approximately from posterior distributions has become prominent in the inverse problem literature; see e.g. Song et al. (2021); Chung et al. (2023); Song et al. (2023); Corso et al. (2023). We present here a simple extension of this idea, applicable to any target $\pi(x)$ defined by (1), based on the rewriting
$$\pi(x) = \frac{p_0(x)\,g_0(x)}{Z}, \qquad \text{for } p_0(x) = \mathcal{N}(x; 0, I), \qquad (4)$$
where $g_0(x_0) = \gamma(x_0)/p_0(x_0)$.

**Lemma 2.1.** The following identities hold:
$$\pi_t(x_t) = \frac{p_0(x_t)\,g_t(x_t)}{Z}; \qquad \nabla\log\pi_t(x_t) = -x_t + \nabla\log g_t(x_t), \qquad (5)$$
where
$$g_t(x_t) = \int g_0(x_0)\,p(x_0|x_t)\,\mathrm{d}x_0, \qquad (6)$$
and $p(x_0|x_t) = \mathcal{N}(x_0; \sqrt{1-\lambda_t}\,x_t, \lambda_t I)$ is the conditional density of $X_0$ given $X_t = x_t$ for the diffusion (2) initialized using $X_0 \sim p_0$.

From Lemma 2.1, it follows that (3) can be rewritten as
$$\mathrm{d}Y_t = [-\beta_{T-t} Y_t + 2\beta_{T-t}\nabla\log g_{T-t}(Y_t)]\,\mathrm{d}t + \sqrt{2\beta_{T-t}}\,\mathrm{d}B_t. \qquad (7)$$
This is akin to having a diffusion model with tractable scores $\nabla\log p_t(x_t) = \nabla\log p_0(x_t) = -x_t$ and with $g_t(x_t)$ as a guidance term. We stress that this guidance formulation is simply a restatement of Section 2.1 using new notation. While it is possible to write all of the sequel in terms of $\pi_t$ alone, introducing $g_t$ makes the exposition more intuitive.

### 2.3. Guidance approximation

The most important source of error when approximating (7) is the lack of a closed-form expression for $g_t$, as it involves an intractable integral. In the context of inverse problems, a simple approximation used by Chung et al. (2023); Song et al. (2023) is given by
$$g_t(x_t) \approx g_0\Big(\int x_0\,p(x_0|x_t)\,\mathrm{d}x_0\Big) = g_0(\sqrt{1-\lambda_t}\,x_t) =: \hat g_t(x_t). \qquad (8)$$
This approximation is good when $t$ is close to $0$ or $T$ but crude otherwise, as established by the following result.

**Proposition 2.2.** Let $\pi(x) = \mathcal{N}(x; \mu, \sigma^2)$ and let $(\beta_t)_{t\in[0,T]}$ be any schedule satisfying $\lim_{T\to\infty}\int_0^T\beta_s\,\mathrm{d}s = \infty$. Consider the following approximation of (7):
$$\mathrm{d}Z^{(T)}_t = [-\beta_{T-t} Z^{(T)}_t + 2\beta_{T-t}\nabla\log\hat g_{T-t}(Z^{(T)}_t)]\,\mathrm{d}t + \sqrt{2\beta_{T-t}}\,\mathrm{d}B_t, \qquad Z^{(T)}_0 \sim \mathcal{N}(0, 1). \qquad (9)$$
Then $\lim_{T\to\infty}\mathbb{E}[Z^{(T)}_T] = \mu$ and $\lim_{T\to\infty}\mathrm{Var}(Z^{(T)}_T) = 1$ for $\sigma = 1$, and otherwise
$$\lim_{T\to\infty}\mathbb{E}[Z^{(T)}_T] = \frac{\mu}{\sigma^2}\cdot\frac{1 - e^{-(1/\sigma^2 - 1)}}{1/\sigma^2 - 1}, \qquad (10)$$
$$\lim_{T\to\infty}\mathrm{Var}(Z^{(T)}_T) = \frac{1 - e^{-2(1/\sigma^2 - 1)}}{2(1/\sigma^2 - 1)}. \qquad (11)$$

Hence, even without considering any time-discretization error, $\lim_{T\to\infty}\mathbb{E}[Z^{(T)}_T] \neq \mu$ and $\lim_{T\to\infty}\mathrm{Var}(Z^{(T)}_T) \neq \sigma^2$ for $\sigma \neq 1$. If we consider the target $\mathcal{N}(\mu, \sigma^2)^{\otimes d}$, a similar result holds along each dimension. Therefore, the Kullback-Leibler divergence between the target and the output of (9) grows linearly with $d$. As a result, an exponentially increasing number of samples is needed to obtain an importance sampling approximation of the target with reasonable relative variance (Chatterjee & Diaconis, 2018).

## 3. Particle Denoising Diffusion Sampler

In this section, we propose a particle method to correct the discrepancy between the distribution output by the guided diffusion and the target. Let $[P] = \{1, \dots, P\}$ for $P \in \mathbb{N}$. We first present the exact joint distribution of $(X_{t_k})_{k\in\{0,\dots,K\}}$ with $t_k = k\delta$, for a fixed step size $\delta$, for the diffusion (2), where $K = T/\delta$ is an integer. We then make explicit the corresponding reverse-time Markov transitions for this joint distribution and show how they can be approximated. Finally, we show how these approximations can be corrected to obtain consistent estimates using SMC. Instead of writing $X_{t_k}$, we will write $X_k$ to simplify notation; similarly, we write $g_k$ for $g_{t_k}$, $\lambda_k$ for $\lambda_{t_k}$, etc.

### 3.1. From continuous time to discrete time

For $k \in [K]$, let $\alpha_k = 1 - \exp\big[-2\int_{(k-1)\delta}^{k\delta}\beta_s\,\mathrm{d}s\big]$.
The joint distribution of $X_{0:K} = (X_0, X_1, \dots, X_K)$ under (2) satisfies
$$\pi(x_{0:K}) = \pi(x_0)\prod_{k\in[K]} p(x_k|x_{k-1}), \qquad (12)$$
for $p(x_k|x_{k-1}) = \mathcal{N}(x_k; \sqrt{1-\alpha_k}\,x_{k-1}, \alpha_k I)$. This implies in particular
$$\pi(x_k, x_{k+1}) = \pi_k(x_k)\,p(x_{k+1}|x_k). \qquad (13)$$
We also denote by $p(x_{0:K})$ the joint distribution of (2) initialized at $p_0(x_0)$; in this case the Markov process is stationary, i.e. $p_k(x_k) = p_0(x_k)$, and reversible w.r.t. $p_0$. From Bayes' theorem and equations (5) and (13), the backward transitions of (12) satisfy
$$\pi(x_k|x_{k+1}) = \frac{\pi_k(x_k)\,p(x_{k+1}|x_k)}{\pi_{k+1}(x_{k+1})} = \frac{p_0(x_k)\,p(x_{k+1}|x_k)}{p_0(x_{k+1})}\cdot\frac{g_k(x_k)}{g_{k+1}(x_{k+1})} = p(x_k|x_{k+1})\,\frac{g_k(x_k)}{g_{k+1}(x_{k+1})}$$
$$\approx p(x_k|x_{k+1})\exp\big[\langle\nabla\log g_{k+1}(x_{k+1}),\, x_k - x_{k+1}\rangle\big] \propto \mathcal{N}\big(x_k; \sqrt{1-\alpha_{k+1}}\,x_{k+1} + \alpha_{k+1}\nabla\log g_{k+1}(x_{k+1}),\, \alpha_{k+1} I\big), \qquad (14)$$
where we used $g_k \approx g_{k+1}$ and a Taylor expansion of $\log g_{k+1}(x_k)$ around $x_{k+1}$, both of which are reasonable when $\delta \ll 1$. Since $\alpha_{k+1} \approx 2\beta_{k+1}\delta$ and $\sqrt{1-\alpha_{k+1}} \approx 1 - \beta_{k+1}\delta$, this approximation of $\pi(x_k|x_{k+1})$ corresponds to a discretization of the time-reversal (7). While this discrete-time representation is interesting, it is typically impossible to exploit it to obtain exact samples from $\pi$ since, like its continuous-time counterpart (7), it relies on the intractable potentials $(g_k)_{k=1}^K$. In the following Section 3.2, given an approximation $\hat g_k$ of $g_k$, we propose a particle mechanism which samples exactly from $\pi$ as the number of particles goes to infinity.

### 3.2. From discrete time to particles

We use a particle method to sample from $\pi$. The key idea is to break the difficult problem of sampling from $\pi$ into a sequence of simpler intermediate sampling problems. Ideally we would sample from $\pi_K$ first, then $\pi_{K-1}, \pi_{K-2}, \dots$ until $\pi_0 = \pi$. Unfortunately this is not possible as it requires knowing $g_k$. Suppose that we have an approximation $\hat g_k$ of $g_k$ (for instance, the guidance approximation defined in (8)). Inspired by (5), we sample instead, for $k \in [K]$, from the sequence of densities
$$\hat\pi_k(x_k) \propto p_0(x_k)\,\hat g_k(x_k), \qquad \hat Z_k = \int p_0(x_k)\,\hat g_k(x_k)\,\mathrm{d}x_k, \qquad (15)$$
backward in time.
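To see how coarse the simple guidance approximation (8) can be, the following toy check compares $\hat g_t$ against a Monte Carlo estimate of the exact potential (6) for a hypothetical 1-D Gaussian target $\mathcal{N}(2, 0.5^2)$ (all values here are illustrative assumptions, not the paper's experiments). It reproduces the behaviour described in Section 2.3: accurate near $t = 0$, crude in between:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 0.5            # hypothetical 1-D Gaussian target N(mu, sigma^2)

def log_g0(x):
    # log g0(x) = log gamma(x) - log N(x; 0, 1), with gamma = N(mu, sigma^2) here
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) + 0.5 * x ** 2

def g_hat(xt, lam):
    # Simple guidance approximation (8): evaluate g0 at the conditional mean of x0 given xt
    return np.exp(log_g0(np.sqrt(1.0 - lam) * xt))

def g_mc(xt, lam, n=200_000):
    # Monte Carlo estimate of the exact potential (6): average g0 over p(x0 | xt)
    x0 = np.sqrt(1.0 - lam) * xt + np.sqrt(lam) * rng.standard_normal(n)
    return np.exp(log_g0(x0)).mean()

# Near t = 0 (lam small) the approximation is close; for intermediate lam it is crude.
ratio_small = g_hat(1.0, 0.001) / g_mc(1.0, 0.001)   # close to 1
ratio_mid = g_hat(1.0, 0.5) / g_mc(1.0, 0.5)         # far from 1 for this narrow target
```

Here `ratio_mid` underestimates the true potential by more than an order of magnitude, which is exactly the mismatch the particle correction of Section 3.2 is designed to fix.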
At $k = K$, the function $g_K$ is almost constant, so we choose $\hat g_K \equiv 1$ and sampling from $\hat\pi_K = \mathcal{N}(0, I)$ is easy. Given particles approximately distributed according to $\hat\pi_{k+1}$, we aim to obtain samples from $\hat\pi_k$. Drawing inspiration from (14), we sample according to the proposal
$$\hat\pi(x_k|x_{k+1}) := \mathcal{N}\big(x_k;\, \sqrt{1-\alpha_{k+1}}\,x_{k+1} + \alpha_{k+1}\nabla\log\hat g_{k+1}(x_{k+1}),\, \alpha_{k+1} I\big). \qquad (16)$$
This is not the only option: one can also use an exponential-integrator variant of (16), which integrates the linear part of the drift exactly. We could instead use the Euler integrator for (7), but the latter would clearly induce greater error. We then reweight the pairs $(x_k, x_{k+1})$ using the weights
$$w_k(x_k, x_{k+1}) := \frac{\hat\pi_k(x_k)\,p(x_{k+1}|x_k)}{\hat\pi_{k+1}(x_{k+1})\,\hat\pi(x_k|x_{k+1})} \propto \frac{\hat g_k(x_k)\,p_0(x_k)\,p(x_{k+1}|x_k)}{\hat g_{k+1}(x_{k+1})\,p_0(x_{k+1})\,\hat\pi(x_k|x_{k+1})} = \frac{\hat g_k(x_k)\,p(x_k|x_{k+1})}{\hat g_{k+1}(x_{k+1})\,\hat\pi(x_k|x_{k+1})}, \qquad (17)$$
where the second expression follows from (15) and the third from the reversibility of $p$ w.r.t. $p_0$. The first expression in (17) means that the $x_k$-marginal of the weighted system consistently approximates $\hat\pi_k$. It should be noted that $w_k$ quantifies the error in (16). Finally, we resample these particles with weights proportional to (17). This resampling operation allows us to focus the computational effort on promising regions of the space, but some particles are replicated multiple times, reducing the population diversity.

**Algorithm 1** Particle Denoising Diffusion Sampler
- **Input:** schedule $(\beta_t)_{t\in[0,T]}$ as in (2); approximations $(\hat g_k)_{k=0}^K$ s.t. $\hat g_0 = g_0$, $\hat g_K \equiv 1$; number of particles $N$.
- Sample $X^i_K \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$ for $i \in [N]$; set $\hat Z_K \leftarrow 1$ and $\omega^i_K \leftarrow 1/N$ for $i \in [N]$.
- **for** $k = K-1, \dots, 0$ **do**
  - **Move.** Sample $\tilde X^i_k \sim \hat\pi(\cdot \mid X^i_{k+1})$ for $i \in [N]$ (see (16)).
  - **Weight.** $\omega^i_k \leftarrow w_k(\tilde X^i_k, X^i_{k+1})$ for $i \in [N]$ (see (17)). Set $\hat Z_k \leftarrow \hat Z_{k+1}\cdot\frac{1}{N}\sum_{i\in[N]}\omega^i_k$; normalize $\omega^i_k \leftarrow \omega^i_k / \sum_{j\in[N]}\omega^j_k$.
  - **Resample.** $X^{1:N}_k \leftarrow \mathrm{resample}(\tilde X^{1:N}_k, \omega^{1:N}_k)$ (see Section 3.3).
  - **MCMC (optional).** Sample $X^i_k \sim M_k(X^i_k, \cdot)$ for $i \in [N]$ using a $\hat\pi_k$-invariant MCMC kernel $M_k$.
- **end for**
- **Output:** estimates $\hat\pi^N = \frac{1}{N}\sum_{i\in[N]}\delta_{X^i_0}$ of $\pi$ and $\hat Z^N_0$ of $Z$.
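To make Algorithm 1 concrete, here is a minimal self-contained sketch for a hypothetical 1-D target $\pi = \mathcal{N}(2, 1)$, so that $Z = 1$ and $\log g_0(x) = \mu x - \mu^2/2$ is known in closed form. It assumes constant $\beta_t \equiv 1$, the simple approximation (8), multinomial resampling at every step, and no MCMC moves; none of these choices are the paper's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0                        # hypothetical target pi = N(mu, 1): Z = 1, log g0(x) = mu*x - mu^2/2
K, T, N = 40, 3.0, 5000
delta = T / K
lam = 1.0 - np.exp(-2.0 * delta * np.arange(K + 1))   # lambda_{t_k} for beta_t = 1
alpha = 1.0 - np.exp(-2.0 * delta)                    # alpha_k (constant here)

def log_g_hat(k, x):
    # Simple approximation (8); by convention g_hat_K = 1 (Algorithm 1)
    if k == K:
        return np.zeros_like(x)
    return mu * np.sqrt(1.0 - lam[k]) * x - 0.5 * mu ** 2

def grad_log_g_hat(k, x):
    if k == K:
        return np.zeros_like(x)
    return mu * np.sqrt(1.0 - lam[k]) * np.ones_like(x)

x = rng.standard_normal(N)      # X_K^i ~ N(0, 1)
log_Z = 0.0
for k in range(K - 1, -1, -1):
    # Move: guided proposal (16)
    mean = np.sqrt(1.0 - alpha) * x + alpha * grad_log_g_hat(k + 1, x)
    x_new = mean + np.sqrt(alpha) * rng.standard_normal(N)
    # Weight (17): g_hat_k(x_k) p(x_k | x_{k+1}) / (g_hat_{k+1}(x_{k+1}) * proposal)
    log_p = -0.5 * (x_new - np.sqrt(1.0 - alpha) * x) ** 2 / alpha
    log_q = -0.5 * (x_new - mean) ** 2 / alpha
    log_w = log_g_hat(k, x_new) - log_g_hat(k + 1, x) + log_p - log_q
    m = log_w.max()
    w = np.exp(log_w - m)
    log_Z += m + np.log(w.mean())               # Z_hat_k = Z_hat_{k+1} * (1/N) sum_i w_i
    # Resample (multinomial for simplicity; the paper uses systematic resampling)
    x = x_new[rng.choice(N, size=N, p=w / w.sum())]

Z_hat = np.exp(log_Z)           # should concentrate around Z = 1
```

With these settings the final particles approximate $\pi$ and $\hat Z^N_0$ concentrates around $Z = 1$ even though $\hat g_k \neq g_k$ for intermediate $k$, illustrating the consistency discussed in Section 3.4.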
Therefore, we then optionally perturb the resampled particles using an MCMC kernel with invariant distribution $\hat\pi_k$. The resulting Particle Denoising Diffusion Sampler (PDDS) is summarized in Algorithm 1.

### 3.3. Algorithm settings

**Reparameterization.** In practice, we would like $p_0$ to be such that $g_0$ is bounded or has bounded moments. To achieve this, it can be desirable to obtain a variational approximation $\mathcal{N}(x; \mu, \Sigma)$ of $\pi$, perform the change of variables $x' = \Sigma^{-1/2}(x - \mu)$, sample in this space using PDDS, and then map the samples back using $x = \mu + \Sigma^{1/2}x'$.

**Resampling.** The idea of resampling is to propagate only particles in promising regions of the state space. Given $N$ particles $\tilde X^{1:N}_k$ and $N$ weights $\omega^{1:N}_k$ summing to 1, resampling selects $N$ output particles $X^{1:N}_k$ such that, for any function $\varphi: \mathbb{R}^d \to \mathbb{R}$,
$$\mathbb{E}\Big[\tfrac{1}{N}\textstyle\sum_{i\in[N]}\varphi(X^i_k)\,\Big|\,\tilde X^{1:N}_k, \omega^{1:N}_k\Big] = \sum_{i\in[N]}\omega^i_k\,\varphi(\tilde X^i_k).$$
Popular schemes satisfying this identity are multinomial, stratified, residual, and systematic resampling (Douc & Cappé, 2005). We employ systematic resampling in all our simulations as it provides the lowest-variance estimates. Resampling can however reduce particle diversity by introducing identical particles in the output. As such, a popular recipe is to trigger resampling at time $k$ only when the Effective Sample Size (ESS), a measure of particle diversity defined by $(\sum_{i\in[N]}(\omega^i_k)^2)^{-1}$, falls below a certain threshold (Del Moral et al., 2012; Dai et al., 2022). This is implemented using Algorithm 3 presented in Appendix A.

**MCMC kernel.** We want to design an MCMC kernel with invariant distribution $\hat\pi_k(x_k)$ defined in (15). As we have access to $\nabla\log\hat\pi_k(x_k) = -x_k + \nabla\log\hat g_k(x_k)$, we can use a Metropolis-adjusted Langevin algorithm (MALA) or Hamiltonian Monte Carlo; e.g. MALA uses the proposal $x'_k = x_k + \gamma\nabla\log\hat\pi_k(x_k) + \sqrt{2\gamma}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, for a step size $\gamma$. This proposal is accepted with probability
$$\min\left\{1,\; \frac{\hat\pi_k(x'_k)\,\mathcal{N}(x_k;\, x'_k + \gamma\nabla\log\hat\pi_k(x'_k),\, 2\gamma I)}{\hat\pi_k(x_k)\,\mathcal{N}(x'_k;\, x_k + \gamma\nabla\log\hat\pi_k(x_k),\, 2\gamma I)}\right\}.$$

### 3.4. Theoretical results

**Fixed number of discretization steps.** We show below that the estimates $\hat Z^N_0$ and $\hat\pi^N f = \frac{1}{N}\sum_{i=1}^N f(X^i_0)$ of Algorithm 1 satisfy a central limit theorem. This follows from standard SMC theory (Del Moral, 2004; Webber, 2019).

**Proposition 3.1.** Assume that $\mathbb{E}[w_k(X_k, X_{k+1})^2] < \infty$, where the expectation is w.r.t. $\hat\pi_{k+1}(x_{k+1})\hat\pi(x_k|x_{k+1})$, and that $\int\pi_k(x)(g_k/\hat g_k)(x)\,\mathrm{d}x < \infty$. Then $\hat Z^N_0$ is an unbiased estimate of $Z$ and has finite variance. If multinomial resampling is used at every step, $\sqrt{N}(\hat Z^N_0/Z - 1)$ is asymptotically normal with asymptotic variance
$$\sigma^2_K = \chi^2(\pi_K\|\hat\pi_K) + \sum_{k=0}^{K-1}\chi^2\big(\pi_{k+1}(x_{k+1})\,\pi(x_k|x_{k+1})\,\big\|\,\hat\pi_{k+1}(x_{k+1})\,\hat\pi(x_k|x_{k+1})\big),$$
with $\chi^2(\cdot\|\cdot)$ the chi-squared divergence between two distributions. Moreover, for any bounded function $f$, $\sqrt{N}(\hat\pi^N f - \pi f)$ is also asymptotically normal.

The finiteness assumptions on $\mathbb{E}[w_k(X_k, X_{k+1})^2]$ and $\int\pi_k(x)(g_k/\hat g_k)(x)\,\mathrm{d}x$ require that $\hat g_k$ is not too far from $g_k$, but still allow enough freedom in the choice of $\hat g_k$. The expression for $\sigma^2_K$ suggests that choosing $\hat g_k$ close to $g_k$ will reduce the asymptotic variance, as (16) will then better approximate (14).

**Infinitely fine discretization limit.** We now investigate the performance of PDDS as the number of discretization time steps goes to infinity. Even when all the weights are equal, multinomial resampling still kills over a third of the particles on average (Chopin & Papaspiliopoulos, 2020). Hence a fine discretization with repeated applications of multinomial resampling leads to the total collapse of the particle approximation; this has been formalized in the continuous-time setting (Chopin et al., 2022). In our case, it can readily be checked that $\sigma^2_K \to \infty$ as $K \to \infty$ in general. This justifies using a more sophisticated resampling strategy. Indeed, when using the sorted stratified resampling strategy of Gerber et al. (2019), the following results show that our particle approximations remain well behaved as $K \to \infty$.

**Proposition 3.2.**
Consider the setting of Proposition 3.1 with sorted stratified resampling. Then there exists a sequence of sets $(B_N)_N$ such that $\mathbb{P}(B_N) \to 1$ and
$$\limsup_{N\to\infty}\mathbb{E}\big[N(\hat Z^N_0/Z - 1)^2\,\mathbf{1}_{B_N}\big] \le \zeta^2_K,$$
$$\zeta^2_K := \chi^2(\pi_K\|\hat\pi_K) + \sum_{k=0}^{K-1}\int\frac{\pi_{k+1}(x_{k+1})^2}{\hat\pi_{k+1}(x_{k+1})}\,\chi^2\big(\pi(\cdot|x_{k+1})\,\big\|\,\hat\pi(\cdot|x_{k+1})\big)\,\mathrm{d}x_{k+1}.$$

The next result bounds the limit when $K \to \infty$, i.e. $\delta = T/K \to 0$.

**Proposition 3.3.** Under the setting of Proposition 3.2, assume that $\beta_t \equiv 1$ and that the target distribution satisfies the regularity conditions in Appendix C.4.1. Let $\bar g_t := g_t/Z$ and $\bar{\hat g}_t := \hat g_t/\int p_0(x)\hat g_t(x)\,\mathrm{d}x$. Assume further that the approximations satisfy $C_1^{-1} \le \bar{\hat g}_t/\bar g_t \le C_1$ and $\|\nabla\log\hat g_t(x_t)\| \le C_2(1 + \|x_t\|)$ for some $C_1, C_2 \ge 0$. Then
$$\limsup_{K\to\infty}\zeta^2_K \le \chi^2(\pi_T\|\hat\pi_T) + 2\int_0^T\mathbb{E}_{X_t\sim\pi_t}\Big[\frac{\bar g_t}{\bar{\hat g}_t}(X_t)\,\big\|\nabla\log g_t(X_t) - \nabla\log\hat g_t(X_t)\big\|^2\Big]\,\mathrm{d}t.$$

This result highlights precisely how the asymptotic error depends on the quality of the estimates of $g_t$ and of its logarithmic gradient. Let $\pi_{\hat g}(\mathrm{d}x_{[0,T]})$ be the path measure induced by running (7) with the approximation $\hat g_t$ in place of $g_t$, i.e. the distribution of $(Y_t)_{t\in[0,T]}$ given by (7). If importance sampling were used directly to correct between $\pi_{\hat g}(\mathrm{d}x_{[0,T]})$ and $\pi(\mathrm{d}x_{[0,T]})$, the asymptotic error for $\hat Z^N_0/Z$ would be equal to $\chi^2(\pi\|\pi_{\hat g})$, which is greater than $\exp(\mathrm{KL}(\pi\|\pi_{\hat g})) - 1$. In contrast, ignoring the negligible first term $\chi^2(\pi_T\|\hat\pi_T)$, the error in this proposition is upper bounded by $2C_1\,\mathrm{KL}(\pi\|\pi_{\hat g})$. This shows how PDDS helps reduce the error of naive importance sampling.

## 4. Learning Potentials via Score Matching

The previous theoretical results establish the consistency of PDDS estimates for any reasonable approximation $\hat g_k$ of $g_k$, but also show that better approximations lead to lower Monte Carlo errors. We show here how to use the approximation of $\pi$ output by PDDS to learn a better neural network (NN) approximation $\hat g_\theta(k, x_k)$ of the potential functions $g_k$ by leveraging score matching ideas. Once we have learned those approximations, we can run PDDS (Algorithm 1) again with the new learned potentials.

### 4.1. Different score identities

We follow the notation outlined at the beginning of Section 3. From (5), we have the identity
$$\nabla\log g_k(x_k) = x_k + \nabla\log\pi_k(x_k). \qquad (18)$$
Hence, if we obtain an approximation $\nabla\log\hat\pi_\theta(k, x_k)$ of $\nabla\log\pi_k(x_k)$, then $\log\hat g_\theta(k, x_k) = \frac{1}{2}\|x_k\|^2 + \log\hat\pi_\theta(k, x_k)$ gives an approximation of $\log g_k(x_k)$. To learn the score, we rely on the following result.

**Proposition 4.1.** The score satisfies the standard Denoising Score Matching (DSM) identity
$$\nabla\log\pi_k(x_k) = \int\nabla_{x_k}\log p(x_k|x_0)\,\pi(x_0|x_k)\,\mathrm{d}x_0. \qquad (19)$$
Moreover, if $\int\pi(x_0)\,e^{\eta\|x_0\|^2}\,\mathrm{d}x_0 < \infty$ for some $\eta > 0$, the Novel Score Matching (NSM) identity
$$\nabla\log\pi_k(x_k) = \kappa_k\int\nabla\log g_0(x_0)\,\pi(x_0|x_k)\,\mathrm{d}x_0 - x_k \qquad (20)$$
holds for $\kappa_k = \sqrt{1-\lambda_k}$ and $\nabla\log g_0(x_0) = \nabla\log\pi_0(x_0) + x_0$.

Hence, we can approximate the score by minimizing one of the two following loss functions:
$$\ell_{\mathrm{DSM}}(\theta) = \sum_{k\in[K]}\mathbb{E}\,\big\|\nabla\log\hat\pi_\theta(k, X_k) - \nabla_{X_k}\log p(X_k|X_0)\big\|^2,$$
$$\ell_{\mathrm{NSM}}(\theta) = \sum_{k\in[K]}\mathbb{E}\,\big\|\nabla\log\hat\pi_\theta(k, X_k) + X_k - \kappa_k\nabla\log g_0(X_0)\big\|^2,$$
where the first loss is applicable if $\pi$ has a finite second moment, and the second additionally requires $\mathbb{E}_\pi[\|\nabla\log\pi(X)\|^2] < \infty$. All expectations are taken w.r.t. $\pi(x_0)p(x_k|x_0)$. For expressive neural networks, both $\ell_{\mathrm{DSM}}(\theta)$ and $\ell_{\mathrm{NSM}}(\theta)$ are such that $\nabla\log\hat\pi_\theta(k, x_k) = \nabla\log\pi_k(x_k)$ at the minimizer. The benefit of the novel loss is that it is much better behaved than the standard denoising score matching loss when $\delta \ll 1$, as established below; this is because the variance of the terms appearing in $\ell_{\mathrm{DSM}}$ for $k$ close to zero becomes very large. This loss is not applicable to generative modeling, where $\pi$ is only available through samples.

### 4.2. Benefits of the alternative score matching identity

We establish here formally the benefits of NSM over DSM: NSM score matching prevents the variance blow-up as $k$ approaches time 0 and $\delta \to 0$. This is more elegantly formalized in continuous time.
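The contrast between the two identities can be seen directly in the regression targets of $\ell_{\mathrm{DSM}}$ and $\ell_{\mathrm{NSM}}$. The toy check below assumes a hypothetical 1-D target $\pi = \mathcal{N}(2, 1)$, for which $\nabla\log g_0 \equiv \mu$ is constant and everything is in closed form; it shows the DSM target variance scaling like $1/\lambda_k$ while the NSM target stays bounded:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n = 2.0, 400_000
lam = 1e-3                        # small lambda_k, i.e. t_k close to 0
kappa = np.sqrt(1.0 - lam)        # kappa_k in (20)

x0 = mu + rng.standard_normal(n)                          # X_0 ~ pi = N(mu, 1)
xk = kappa * x0 + np.sqrt(lam) * rng.standard_normal(n)   # X_k | X_0

# DSM regression target: grad_{x_k} log p(x_k | x_0) = -(x_k - kappa x_0) / lambda_k;
# its variance is of order 1 / lambda_k and blows up as lambda_k -> 0.
dsm_target = -(xk - kappa * x0) / lam
# NSM regression target: kappa_k * grad log g0(X_0) - x_k, with grad log g0 = mu here;
# its variance stays of order 1.
nsm_target = kappa * mu - xk
```

Both targets are conditionally unbiased estimates of the score of $\pi_k$ at $X_k$ (their means over the joint law are both 0), but their variances differ by a factor of order $1/\lambda_k$, which is exactly the blow-up described above.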
In this case, for $s_\theta(t, x_t) = \nabla\log\hat\pi_\theta(t, x_t)$, the renormalized loss functions $\ell_{\mathrm{DSM}}$ and $\ell_{\mathrm{NSM}}$ introduced in Proposition 4.1 become
$$\ell_{\mathrm{DSM}}(\theta) = \int_0^T\mathbb{E}\,\big\|s_\theta(t, X_t) - \nabla\log p(X_t|X_0)\big\|^2\,\mathrm{d}t,$$
$$\ell_{\mathrm{NSM}}(\theta) = \int_0^T\mathbb{E}\,\big\|s_\theta(t, X_t) + X_t - \kappa_t\nabla\log g_0(X_0)\big\|^2\,\mathrm{d}t,$$
where the expectations are w.r.t. $\pi(x_0)p(x_t|x_0)$. Unbiased estimates $\hat\nabla\ell_{\mathrm{DSM}}(\theta)$ and $\hat\nabla\ell_{\mathrm{NSM}}(\theta)$ of the gradients of these losses are obtained by computing the gradient w.r.t. $\theta$ of the argument inside the expectation at a sample $\tau \sim \mathrm{Unif}[0, T]$, $X_0 \sim \pi$ and $X_\tau \sim p(x_\tau|X_0)$. The following result clarifies the advantage of NSM score matching.

**Proposition 4.2.** Let $d = 1$, $\beta_t \equiv 1$, and $\pi(x_0) = \mathcal{N}(x_0; \mu, \sigma^2)$. Suppose that the score network $s_\theta(t, x_t)$ is continuously differentiable jointly in its three variables. Moreover, assume that $|s| + |\partial_\theta s| \le C(1 + |\theta| + |t| + |x_t|)^\alpha$ for some $C, \alpha > 0$, and that $\mathbb{E}_\pi[\|\partial_\theta s_\theta(0, X_0)\|^2] > 0$ for all $\theta$. Then, for all $\theta$, $\ell_{\mathrm{DSM}}(\theta)$ and $\mathbb{E}[\|\hat\nabla\ell_{\mathrm{DSM}}(\theta)\|^2]$ are infinite, whereas $\ell_{\mathrm{NSM}}(\theta)$ and $\mathbb{E}[\|\hat\nabla\ell_{\mathrm{NSM}}(\theta)\|^2]$ are finite.

### 4.3. Neural network parametrization

Contrary to standard practice in generative modeling (Ho et al., 2020; Song et al., 2021), we parameterize a function and not a vector field. This is necessary to run PDDS (Algorithm 1), which relies on potentials to weight particles. We use a parameterization of the form $\log\hat\pi_\theta(k, x_k) = \log\hat g_\theta(k, x_k) - \frac{1}{2}\|x_k\|^2$, where we parametrize $\hat g_\theta$ using two neural networks $r_\eta$ and $N_\gamma$ such that $\theta = (\eta, \gamma)$. The network $r_\eta$ returns a scalar whereas $N_\gamma$ returns a vector in $\mathbb{R}^d$. The precise expression is given by
$$\log\hat g_\theta(k, x_k) = [r_\eta(k) - r_\eta(0)]\,\langle N_\gamma(k, x_k),\, x_k\rangle + [1 - r_\eta(k) + r_\eta(0)]\,\log g_0\big(\sqrt{1-\lambda_k}\,x_k\big).$$
This parametrization takes advantage of the simple approximation (8), with which it coincides at time 0. Zhang & Chen (2022) used a similar parametrization to incorporate gradient information in their control policy. Although $\log\hat g_\theta(k, x_k)$ is a scalar, we deliberately let $N_\gamma(k, x_k)$ return a vector, which is then scalar-multiplied with $x_k$.
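This parametrization can be sketched with stand-in networks. Everything below is an illustrative assumption (random untrained weights, a hypothetical $\mathcal{N}((2, -1), I)$ target, an arbitrary schedule), not the paper's architecture; the point is the structural property that the parametrization collapses to the simple approximation (8), and hence to $\log g_0$, at $k = 0$ regardless of the network weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 2, 16
lam = 1.0 - np.exp(-2.0 * (3.0 / K) * np.arange(K + 1))  # assumed schedule: beta_t = 1, T = 3

mu = np.array([2.0, -1.0])      # hypothetical target N(mu, I): log g0(x) = <mu, x> - |mu|^2 / 2
def log_g0(x):
    return x @ mu - 0.5 * mu @ mu

# Stand-ins for the two networks: r_eta maps k to a scalar, N_gamma maps (k, x) to a vector in R^d.
W1 = 0.1 * rng.standard_normal((d + 1, 32))
W2 = 0.1 * rng.standard_normal((32, d))
r_eta = lambda k: 0.5 * np.tanh(0.3 * k)
N_gamma = lambda k, x: np.tanh(np.append(x, k / K) @ W1) @ W2

def log_g_theta(k, x):
    # Section 4.3 parametrization: interpolates between <N_gamma, x> and the approximation (8)
    w = r_eta(k) - r_eta(0)
    return w * (N_gamma(k, x) @ x) + (1.0 - w) * log_g0(np.sqrt(1.0 - lam[k]) * x)

x = rng.standard_normal(d)
# At k = 0: lam[0] = 0 and r_eta(k) - r_eta(0) = 0, so log_g_theta(0, x) equals log_g0(x) exactly.
```

Because the interpolation weight vanishes at $k = 0$, the constraint $\hat g_0 = g_0$ required by Algorithm 1 holds by construction rather than having to be learned.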
This is usually done in the literature when the scalar potential rather than just its gradient is learned (see e.g. Salimans & Ho, 2021) and helps improve model expressiveness.

### 4.4. Training the neural network

Both $\ell_{\mathrm{DSM}}(\theta)$ and $\ell_{\mathrm{NSM}}(\theta)$ can be written as $\sum_{k\in[K]}\mathbb{E}[\ell(\theta, k, X_0, X_k)]$ for appropriately defined local loss functions $\ell$. Algorithm 2 describes the gradient updates with batch size $B$. In practice, we first run Algorithm 1 with a simple approximation such as (8). We then use Algorithm 2 to learn $\hat g_\theta(k, x_k)$, and can then execute Algorithm 1 again with $\hat g_k(x_k) = \hat g_\theta(k, x_k)$ to obtain lower-variance estimates of $\pi$ and $Z$. We can further refine the approximation of $g_k$ by iterating Algorithm 2. In practice we found that one or two iterations with larger $N_{\mathrm{up}}$ are sufficient, although more frequent iterations with smaller $N_{\mathrm{up}}$ can give better performance under a limited budget of target density evaluations.

**Algorithm 2** Potential neural network training
- **Input:** particle approximation $\hat\pi^N$ output by Algorithm 1; potential NN $\hat g_\theta(k, x_k)$; initialization $\theta_0$; local loss functions $\ell(\theta, k, x_0, x_k)$; batch size $B$; number of gradient updates $N_{\mathrm{up}}$; learning rate $\eta > 0$.
- **for** $i = 1, 2, \dots, N_{\mathrm{up}}$ **do**
  - **for** $b = 1, 2, \dots, B$ **do** sample $X^b_0 \sim \hat\pi^N(\cdot)$, $k_b \sim \mathrm{Unif}[K]$, and $X^b_{k_b} \sim p(x_{k_b}|X^b_0)$ **end for**
  - $\theta_i := \theta_{i-1} - \eta\sum_{b\in[B]}\nabla_\theta\,\ell(\theta_{i-1}, k_b, X^b_0, X^b_{k_b})$
- **end for**
- **Output:** potential $\hat g_{\theta_{N_{\mathrm{up}}}}(k, x_k)$.

### 4.5. Mechanisms behind potential improvement

It may be unclear at first sight how the iterative procedure described in Section 4.4 leads to an improvement of $\hat g_k(x_k)$. Superficially, it seems that we are trying to improve a poor potential approximation using particles produced by that poor approximation itself. However, the results in Section 3.4 imply that the SMC mechanism provides a consistent estimate of the target for any approximation of the potential. Thus, we expect that for $N$ large enough, the output of Algorithm 1 is of higher quality than the particles used to learn the current $\hat g_k$.
This mechanism has been studied in a different setting in Heng et al. (2020), and it would be interesting to extend their results to our case. Moreover, the training process uses not only the output of Algorithm 1, but also further information about the target, injected via a variety of mechanisms (the guidance loss described in Section 4.1 and the neural network parametrization described in Section 4.3). Quantifying the gain from these techniques is an open question. We provide ablation studies in Appendix D.3.

## 5. Related Work

SMC samplers (Del Moral et al., 2006; Dai et al., 2022) are a general methodology for sampling sequentially from a sequence of distributions. They rely on the notion of forward and backward kernels to move from one distribution to the next. PDDS can be cast in this framework, with the forward kernel chosen to be the forward noising diffusion and the backward kernel its approximate time-reversal. The novelty of our work is the exploitation of the special structure induced by this choice to come up with efficient backward kernel estimates. Other standard methods include Annealed Importance Sampling (Neal, 2001) and Parallel Tempering (Geyer, 1991). Unlike our work, all these methods approximate a sequence of tempered versions of the target. While they are standard, it is also well known that tempering strategies can exhibit poor performance for multimodal targets (Woodard et al., 2009; Tawn et al., 2020; Syed et al., 2022), as tempering can dramatically change the masses of distribution modes depending on their widths. Adding noise does not suffer from this issue (Máté & Fleuret, 2023): it only perturbs the distribution locally, preserving the weights of the modes; see Figure 1 for an illustration.

Figure 1: Tempered (top) and noised (bottom) sequences of distributions for the target $\pi(x) = 0.8\,\mathcal{N}(x; -4, 0.5^2) + 0.2\,\mathcal{N}(x; 4, 1)$. The tempered sequence follows $\pi_t(x) \propto \pi(x)^{1-\eta_t}\phi(x)^{\eta_t}$, where $\phi$ is the standard normal density and $\eta_t$ increases from 0 to 1. The noising sequence follows the forward diffusion in Equation (2). Red dots indicate the positions of the modes of the original target. The tempered sequence suffers from mode switching, i.e. the low-mass, large-width mode becomes dominant along the tempered path; the noised sequence does not suffer from this problem.

Our sampler can also be interpreted as sampling from a sequence of distributions using so-called twisted proposals. The general framework for approximating these twisted proposals within SMC has been considered in Guarniero et al. (2017); Heng et al. (2020); Lawson et al. (2022). In particular, Wu et al. (2023) and Cardoso et al. (2024) recently applied such ideas to conditional simulation in generative modeling given access to a pretrained score network. Wu et al. (2023) rely on the simple approximation (8) and do not quantify its error, while Cardoso et al. (2024) use a different proposal kernel which leverages the structure of a linear inverse problem. Our setup is different, as we consider general Monte Carlo sampling problems: we do not rely on any pretrained network, we refine our potential approximations using an original loss function, and we additionally provide theoretical results, in particular when the discretization time step goes to zero.

## 6. Experimental Results

### 6.1. Normalizing constant estimation

We evaluate the quality of normalizing constant estimates produced by PDDS on a variety of sampling tasks. We consider two synthetic target densities and two posterior distributions, with results on additional targets included in Appendix D.4. The synthetic targets are Mixture, a 2-dimensional mixture of Gaussian distributions with separated modes (Arbel et al., 2021), and Funnel, a 10-dimensional target displaying challenging variance structure (Neal, 2003). The posterior distributions are Sonar, a logistic regression posterior fitted to the (61-dimensional) Sonar dataset, and LGCP, a log-Gaussian Cox process (Møller et al., 1998) modelling the rate parameter of a Poisson point process on a $40 \times 40 = 1600$-point grid, fitted to the Pines dataset. Precise specifications of these targets can be found in Appendix D.1.

Figure 2: $\log\hat Z^N_0$ for our method (PDDS and PDDS-MCMC), compared with SMC, CRAFT, DDS and PIS, as a function of the number of steps. Dotted black represents the analytic ground truth where available, otherwise long-run SMC. Variation is displayed over both training and sampling seeds (2000 total). The y-axes on Sonar and LGCP have been cropped and outliers (present in all methods) removed for clarity. Uncurated samples are presented in Appendix D.4.

Figure 3: $\log\hat Z^N_0$ for CRAFT and PDDS-MCMC on the GMM task in 20 dimensions. Variation displayed over training and sampling seeds (1000 total).

We compare our performance to a selection of strong baselines. We consider two annealing algorithms: an SMC sampler (Del Moral et al., 2006) with HMC kernels, and CRAFT (Matthews et al., 2022), which uses normalizing flows to transport particles at each step of an SMC algorithm. We also consider two diffusion-based sampling methods, the Path Integral Sampler (PIS) (Zhang & Chen, 2022) and the Denoising Diffusion Sampler (DDS) (Vargas et al., 2023). As mentioned in Section 3.3, we reparameterize the target using a variational approximation for all methods. We include hyperparameter and optimizer settings, run times, and experimental procedures in Appendix D.2. We present the normalizing constant estimation results in Figure 2. PDDS uses the same training budget as PIS and DDS. We also include PDDS with optional MCMC steps (PDDS-MCMC).
Considering the posterior sampling tasks, PDDS-MCMC is the best performing method in terms of estimation bias and variance everywhere, except for 4 steps on the Sonar task where CRAFT performs the best. PDDS without MCMC steps performs on par with CRAFT on average, specifically PDDS outperforms CRAFT for larger step regimes on Sonar and low step regimes on LGCP while CRAFT outperforms PDDS in the opposite regimes. Both PDDS methods uniformly outperform the diffusion-based approaches (PIS and DDS). Considering the synthetic target densities, CRAFT is the best performing method on the synthetic Funnel task while PDDS and PDDS-MCMC are the best performing methods on Mixture. Our approach again outperforms both of the diffusion-based approaches. While CRAFT performs competitively with our method on certain tasks, we note that CRAFT cannot be easily refined. Indeed, if we want more intermediate distributions between the reference and the target, we can simply decrease the discretization step size for PDDS, but would need to relearn the flows for CRAFT. In addition, choosing the flow structure in CRAFT can be challenging and is problem-specific. In contrast, we used the same simple MLP structure for all tasks for PDDS. Finally, CRAFT relies on MCMC moves to inject noise into the system and prevent particle degeneracy. On the other hand PDDS can produce competitive results without MCMC, with the option of boosting performance with MCMC steps if the computational budget allows. We further compared PDDS-MCMC with CRAFT on a challenging Gaussian Mixture Model (GMM) with 40 highly separated modes (Midgley et al., 2023) in a range of dimensions. We present the normalising constant estimates for the task in 20 dimensions in Figure 3. We observe that PDDS-MCMC significantly outperforms CRAFT in bias and variance, particularly in low step regimes. Furthermore, PIS and DDS failed to produce competitive results. Results in additional dimensions are included in Appendix D.4. 6.2. 
Sample quality

We also visually assess the quality of samples from each method in Figure 4. We choose the multi-modal Mixture task in 2 dimensions for ease of visualization. Unsurprisingly, we find that both PIS and DDS do not capture all 6 modes of the distribution due to the mode-seeking behaviour of the reverse KL objective. Samples from our approach appear of similar quality to those of SMC and CRAFT. We further quantitatively compared the sample quality of our method versus CRAFT by evaluating the entropy-regularized Wasserstein-2 distance between samples from the model and the target on the challenging GMM task. We present the results in Figure 6 for the task in 20 dimensions. Here PDDS-MCMC clearly outperforms CRAFT, producing samples with significantly lower transport cost to the target distribution. From the sample visualization, we observe that PDDS-MCMC is able to correctly recover a far greater proportion of the target modes than our CRAFT implementation.

6.3. Iterations of potential approximation

Here we demonstrate the improvement in normalizing constant estimates due to our iterative potential approximation scheme. Figure 5 shows the improvement in ESS and the reduction in the number of resampling steps.

Figure 4: Samples from each method on the Gaussian Mixture task with 4 steps.

Figure 5: ESS curves on the Sonar task with 200 steps. Left: PDDS with approximation Equation (8); right: PDDS with learnt potential approximation.

In the top pane of Figure 7, we see that the simple potential approximation (8) provides very poor normalizing constant estimates. Iterations of Algorithm 2 considerably improve performance. In the second pane of Figure 7, we show the evolution of $\log \hat{Z}_0^N$ during training for each of the diffusion-based approaches.
While PDDS initially falls below PIS and DDS due to the simple initial approximation, we exceed each of these methods after around $7 \times 10^7$ density evaluations.

7. Discussion

This paper contributes to the growing literature on the use of denoising diffusion ideas for Monte Carlo sampling (Berner et al., 2022; McDonald & Barron, 2022; Vargas et al., 2023; Huang et al., 2024; Richter et al., 2024). It proposes an original iterative SMC algorithm which provides an unbiased estimate of the normalizing constant for any finite number of particles by leveraging an original score matching technique. This algorithm also provides asymptotically consistent estimates of the normalizing constant and of expectations with respect to the target. One limitation of PDDS is that it practically relies on $g_0(x)$ being a well-behaved potential function. While our approach using a variational approximation to guide a reparameterization of the target has been effective in our examples, more sophisticated techniques might have to be implemented (Hoffman et al., 2019).

Figure 6: Top: $W_2^\gamma$ distance between samples from CRAFT and PDDS-MCMC and the GMM20 target. Variation displayed over training and sampling seeds (200 total). Bottom: samples (first two dimensions) from PDDS-MCMC and CRAFT (orange) using 32 steps versus the GMM20 target (blue).

Figure 7: Top: $\log \hat{Z}_0^N$ every 2 iterations of Algorithm 2 on the Sonar task. Left-most bar of each group shows PDDS with the simple approximation. Bottom: $\log \hat{Z}_0^N$ during training, one realization on the Sonar task with 16 steps.

Impact Statement

Sampling is a ubiquitous problem. Therefore, PDDS can be applied to a wide range of applications.
While we have obtained limit theorems for this scheme, it is important to exercise caution if the output were to be used for decision making, as we practically only use a finite number of particles.

Acknowledgements

We would like to thank Pierre Del Moral, Christian A. Naesseth, Sam Power, Saifuddin Syed, Brian Trippe, Francisco Vargas, and Luhuan Wu for helpful discussions. AP is funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning through grant EP/S023151/1. HD is funded by a postdoctoral position through the CoSInES grant EP/R034710/1. AD acknowledges support from EPSRC grants EP/R034710/1 and EP/R018561/1.

References

Anastasiou, A., Barp, A., Briol, F.-X., Ebner, B., Gaunt, R. E., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, C., Liu, Q., Mackey, L., Oates, C. J., Reinert, G., and Swan, Y. Stein's method meets computational statistics: a review of some recent developments. Statistical Science, 38(1):120–139, 2023.

Arbel, M., Matthews, A., and Doucet, A. Annealed flow transport Monte Carlo. In International Conference on Machine Learning, 2021.

Assaraf, R. and Caffarel, M. Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83:4682–4685, 1999.

Berner, J., Richter, L., and Ullrich, K. An optimal control perspective on diffusion-based generative modeling. In NeurIPS Workshop on Score-Based Methods, 2022.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018.

Cardoso, G., Idrissi, Y. J. E., Corff, S. L., and Moulines, E. Monte Carlo guided diffusion for Bayesian linear inverse problems. In International Conference on Learning Representations, 2024.

Cattiaux, P., Conforti, G., Gentil, I., and Léonard, C. Time reversal of diffusion processes under a finite entropy condition.
Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, 59(4):1844–1881, 2023.

Chatterjee, S. and Diaconis, P. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

Chopin, N. and Papaspiliopoulos, O. An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, 2020.

Chopin, N., Singh, S. S., Soto, T., and Vihola, M. On resampling schemes for particle filters with weakly informative observations. The Annals of Statistics, 50(6):3197–3222, 2022.

Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations, 2023.

Corso, G., Xu, Y., De Bortoli, V., Barzilay, R., and Jaakkola, T. Particle guidance: non-iid diverse sampling with diffusion models. In International Conference on Learning Representations, 2023.

Dai, C., Heng, J., Jacob, P. E., and Whiteley, N. An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association, 117(539):1587–1600, 2022.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, 2021.

Del Moral, P. Feynman-Kac Formulae: Genealogical and Interacting Particle Approximations. Springer, 2004.

Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

Del Moral, P., Doucet, A., and Jasra, A. On adaptive resampling strategies for sequential Monte Carlo methods. Bernoulli, 18(1):252–278, 2012.

Douc, R. and Cappé, O. Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp. 64–69. IEEE, 2005.

Doucet, A., De Freitas, N., and Gordon, N. J.
Sequential Monte Carlo Methods in Practice. Information Science and Statistics. Springer, New York, NY, 2001.

Gerber, M., Chopin, N., and Whiteley, N. Negative association, ordering and convergence of resampling methods. The Annals of Statistics, 47(4):2236–2260, 2019.

Geyer, C. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, Interface Foundation, Fairfax Station, 1991, pp. 156–163, 1991.

Guarniero, P., Johansen, A. M., and Lee, A. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.

Haussmann, U. G. and Pardoux, E. Time reversal of diffusions. The Annals of Probability, 14(3):1188–1205, 1986.

Heng, J., Bishop, A. N., Deligiannidis, G., and Doucet, A. Controlled sequential Monte Carlo. The Annals of Statistics, 48(5):2904–2929, 2020.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Huang, X., Dong, H., Hao, Y., Ma, Y., and Zhang, T. Reverse diffusion Monte Carlo. In International Conference on Learning Representations, 2024.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. The Journal of Machine Learning Research, 6:695–709, 2005.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kloeden, P. E. and Platen, E. Numerical Solution of Stochastic Differential Equations, volume 23 of Applications of Mathematics. Springer-Verlag, Berlin, 1992.

Lai, C.-H., Takida, Y., Murata, N., Uesaka, T., Mitsufuji, Y., and Ermon, S. Regularizing score-based models with score Fokker-Planck equations. In NeurIPS Workshop on Score-Based Methods, 2022.
Lawson, D., Raventós, A., and Linderman, S. SIXO: Smoothing inference with twisted objectives. In Advances in Neural Information Processing Systems, 2022.

Liptser, R. S. and Shiryayev, A. N. Statistics of Random Processes. I. General Theory. Translated by A. B. Aries, volume 5 of Applications of Mathematics. Springer, New York, 1977.

Máté, B. and Fleuret, F. Learning deformation trajectories of Boltzmann densities. Transactions on Machine Learning Research, 2023.

Matthews, A. G. D. G., Arbel, M., Rezende, D. J., and Doucet, A. Continual repeated annealed flow transport Monte Carlo. In International Conference on Machine Learning, 2022.

McDonald, C. J. and Barron, A. R. Proposal of a score based approach to sampling using Monte Carlo estimation of score and oracle access to target density. In NeurIPS Workshop on Score-Based Methods, 2022.

Midgley, L. I., Stimper, V., Simm, G. N., Schölkopf, B., and Hernández-Lobato, J. M. Flow annealed importance sampling bootstrap. In International Conference on Learning Representations, 2023.

Mira, A., Solgi, R., and Imparato, D. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.

Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Neal, R. M. Slice sampling. The Annals of Statistics, 31:705–767, 2003.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.

Richter, L., Berner, J., and Liu, G.-H. Improved sampling via learned diffusions. In International Conference on Learning Representations, 2024.

Salimans, T. and Ho, J. Should EBMs model the energy or the score? In Energy Based Models Workshop, ICLR 2021, 2021.

Song, J., Vahdat, A., Mardani, M., and Kautz, J.
Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Sountsov, P., Radul, A., and contributors. Inference Gym, 2020. URL https://pypi.org/project/inference_gym.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: A scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society Series B, 84(2):321–350, 2022.

Tawn, N. G., Roberts, G. O., and Rosenthal, J. S. Weight-preserving simulated tempering. Statistics and Computing, 30(1):27–41, 2020.

Vargas, F., Grathwohl, W., and Doucet, A. Denoising diffusion samplers. In International Conference on Learning Representations, 2023.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Webber, R. J. Unifying sequential Monte Carlo with resampling matrices. arXiv preprint arXiv:1903.12583, 2019.

Woodard, D. B., Schmidler, S. C., and Huber, M. Sufficient conditions for torpid mixing of parallel and simulated tempering. Electronic Journal of Probability, 14:780–804, 2009.

Wu, L., Trippe, B. L., Naesseth, C. A., Blei, D., and Cunningham, J. P. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in Neural Information Processing Systems, 2023.

Zhang, D., Chen, R. T., Liu, C.-H., Courville, A., and Bengio, Y. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. In International Conference on Learning Representations, 2024.

Zhang, Q. and Chen, Y.
Path integral sampler: a stochastic control approach for sampling. In International Conference on Learning Representations, 2022.

The Appendix is organized as follows. In Appendix A we detail the PDDS algorithm when adaptive resampling is used. In Appendix B we reformulate the results of Webber (2019) in terms of chi-squared divergences. In Appendix C we present our proofs. Finally, in Appendix D we present details and additional work relating to our experiments.

A. Particle Denoising Diffusion Sampler with Adaptive Resampling

Algorithm 3 Particle Denoising Diffusion Sampler with Adaptive Resampling

Input: Schedule $(\beta_t)_{t \in [0,T]}$ as in (2); approximations $(\hat{g}_k)_{k=0}^K$ s.t. $\hat{g}_0 = g_0$, $\hat{g}_K = 1$; number of particles $N$; ESS resampling threshold $\alpha$
Sample $X_K^i \sim_{iid} N(0, I)$ for $i \in [N]$
Set $\hat{Z}_K \leftarrow 1$ and $\omega_K^i \leftarrow 1/N$ for $i \in [N]$
for $k = K-1, \ldots, 0$ do
  Move. Sample $\tilde{X}_k^i \sim \hat{\pi}(\cdot \,|\, X_{k+1}^i)$ for $i \in [N]$ (see (16))
  Weight. $\omega_k^i \leftarrow \omega_{k+1}^i\, \omega_k(\tilde{X}_k^i, X_{k+1}^i)$ for $i \in [N]$ (see (17))
  Set $\hat{Z}_k \leftarrow \hat{Z}_{k+1} \sum_{i \in [N]} \omega_k^i$
  Normalize $\omega_k^i \leftarrow \omega_k^i / \sum_{j \in [N]} \omega_k^j$
  Resample and MCMC. if $\left(\sum_{i \in [N]} (\omega_k^i)^2\right)^{-1} < \alpha N$ then
    $X_k^{1:N} \leftarrow \mathrm{resample}(\tilde{X}_k^{1:N}, \omega_k^{1:N})$ (see Section 3.3)
    (Optional.) Sample $X_k^i \sim M_k(X_k^i, \cdot)$ for $i \in [N]$ using a $\hat{\pi}_k$-invariant MCMC kernel
    Reset $\omega_k^{1:N} \leftarrow 1/N$
  else
    $X_k^{1:N} \leftarrow \tilde{X}_k^{1:N}$
  end if
end for
Output: Particle estimates $\hat{\pi}^N = \sum_{i \in [N]} \omega_0^i \delta_{X_0^i}$ of $\pi$ and $\hat{Z}_0^N$ of $Z$

B. Asymptotic Error Formulae for SMC

The results of Webber (2019) play a fundamental role in our work. Compared to the traditional SMC literature, they put more stress on the normalizing constant estimate, have results for different resampling schemes, and do not require weight boundedness. In this section, we present these results using the language of chi-squared divergence. This gives alternative expressions which are easier to manipulate in our context. B.1.
Chi-squared divergence

For two probability distributions $p$ and $q$ such that $q \ll p$, the chi-squared divergence of $q$ with respect to $p$ is defined as
$$\chi^2(q||p) = \int \left(\frac{dq}{dp}\right)^2 dp - 1.$$
We state without proof two simple properties of this divergence.

Lemma B.1. Let $g$ be a nonnegative function from $X$ to $R$ such that $p(g) := E_p[g(X)] < \infty$. Define the probability distribution $\pi$ by $\pi(dx) \propto p(dx)\, g(x)$. Then $\mathrm{Var}_p(g) = (p(g))^2\, \chi^2(\pi||p)$.

Lemma B.2. The chi-squared divergence between two probability distributions $p$ and $q$ on $X \times Y$ satisfies the decomposition
$$\chi^2(q(dx, dy)||p(dx, dy)) = \chi^2(q(dx)||p(dx)) + \int p(dx) \left(\frac{dq}{dp}\right)^2(x)\, \chi^2(q(dy|x)||p(dy|x)).$$

B.2. Generic Feynman-Kac formula and the associated SMC algorithm

Let $M(dx_{0:T}) = M_0(dx_0) M_1(dx_1|x_0) \ldots M_T(dx_T|x_{T-1})$ be a Markov measure defined on $X_0 \times \ldots \times X_T$. Let $G_0(x_0), G_1(x_0, x_1), \ldots, G_T(x_{T-1}, x_T)$ be strictly positive functions. For any $t \le T$, assume that there exists $Z_t > 0$ such that
$$Q_t(dx_{0:T}) = \frac{1}{Z_t} M_0(dx_0) G_0(x_0) M_1(dx_1|x_0) G_1(x_0, x_1) \ldots M_t(dx_t|x_{t-1}) G_t(x_{t-1}, x_t)\, M(dx_{t+1:T}|x_t)$$
is a probability measure. Then $Q_t(dx_{0:t})$ is called the Feynman-Kac model associated with the Markov measure $M(dx_{0:t})$ and the weight functions $(G_s)_{s \le t}$. Given a number of particles $N$, Algorithm 4 approximates $Q_T(dx_T)$ and the normalizing constant $Z_T$. It is called the SMC algorithm associated to the Feynman-Kac model $Q_T$.

Algorithm 4 Generic SMC algorithm

Input: Markov kernels $M_t(dx_t|x_{t-1})$; functions $G_t(x_{t-1}, x_t)$; number of particles $N$
Sample $X_0^{1:N} \sim_{iid} M_0(dx_0)$
Set $\omega_0^n = G_0(X_0^n)$ and $Z_0^N = \frac{1}{N}\sum_n \omega_0^n$
Normalize $\omega_0^n \leftarrow \omega_0^n / \sum_m \omega_0^m$
for $t = 1, \ldots, T$ do
  Resample particles $\tilde{X}_{t-1}^{1:N}$ among particles $X_{t-1}^{1:N}$ with weights $\omega_{t-1}^{1:N}$
  Sample $X_t^n \sim M_t(dx_t|\tilde{X}_{t-1}^n)$
  Set $\omega_t^n = G_t(\tilde{X}_{t-1}^n, X_t^n)$ and $Z_t^N = Z_{t-1}^N \frac{1}{N}\sum_n \omega_t^n$
  Normalize $\omega_t^n \leftarrow \omega_t^n / \sum_m \omega_t^m$
end for
Output: Empirical measure $\sum_n \omega_T^n \delta_{X_T^n}$ approximating $Q_T(dx_T)$ and estimate $Z_T^N$ approximating $Z_T$

B.3.
Sorted stratified resampling schemes

The resampling step of Algorithm 4 can be done in a number of different ways. It is well known that multinomial resampling should be avoided. Practitioners often rely on alternative schemes, in particular systematic resampling. However, the theoretical properties of these schemes are less well studied. To investigate theoretically the asymptotic error of PDDS when the discretization step tends to zero, we consider the stratified resampling scheme where particles are sorted by a certain coordinate $\theta : R^d \to R$; see Algorithm 5.

Algorithm 5 Generic sorted stratified resampling

Input: Particles $X^{1:N}$ in $R^d$ with weights $W^{1:N}$; sorting function $\theta : R^d \to R$
Sort the particles $X^{1:N}$ such that $\theta(X^{s_1}) \le \ldots \le \theta(X^{s_N})$ for a permutation $s_{1:N}$ of $\{1, 2, \ldots, N\}$
for $n = 1, \ldots, N$ do
  Simulate $U^n \sim \mathrm{Uniform}[0,1]$
  Find the index $k_n$ such that $\sum_{i=1}^{k_n - 1} W^{s_i} < \frac{n - 1 + U^n}{N} \le \sum_{i=1}^{k_n} W^{s_i}$
  Set $\tilde{X}^n \leftarrow X^{s_{k_n}}$
end for
Output: Resampled particles $\tilde{X}^{1:N}$

Depending on the chosen coordinate $\theta$, sorted stratified resampling can significantly reduce the asymptotic error of the particle filter. A precise formulation is given in Webber (2019, Theorem 3.2). In a nutshell, the variance of $Z_T^N / Z_T$ comes from two sources: the mismatch between the particle filter proposal $M_t(dx_t|x_{t-1})$ and the target law $Q_T(dx_t|x_{t-1})$, and the error caused by the resampling step. The first source is common to all resampling methods. The magnitude of the second source for multinomial resampling is $\sum_{t=1}^T \mathrm{Var}_{Q_{t-1}}[h_{t-1}(X_{t-1})]$ where
$$h_{t-1}(x_{t-1}) = \frac{Q_T(dx_{t-1})}{Q_{t-1}(dx_{t-1})} = \frac{\int \prod_{s=t}^T M_s(dx_s|x_{s-1}) G_s(x_{s-1}, x_s)\, dx_{t:T}}{\int Q_{t-1}(dx'_{t-1}) \prod_{s=t}^T M_s(dx_s|x_{s-1}) G_s(x_{s-1}, x_s)\, dx'_{t-1}\, dx_{t:T}}.$$
Write the resampling error for multinomial resampling as
$$\sum_{t=1}^T \mathrm{Var}_{Q_{t-1}}\left[h_{t-1}(X_{t-1})\right] = \sum_{t=1}^T E_{Q_{t-1}}\left[\mathrm{Var}_{Q_{t-1}}\left(h_{t-1}(X_{t-1}) \,|\, \theta(X_{t-1})\right)\right] + \sum_{t=1}^T \mathrm{Var}_{Q_{t-1}}\left(E_{Q_{t-1}}\left[h_{t-1}(X_{t-1}) \,|\, \theta(X_{t-1})\right]\right).$$
Then the error for stratified resampling when particles are sorted by $\theta : R^d \to R$ contains only the first term and not the second one.
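Algorithm 5 can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: particles are scalars, `theta` is a user-supplied sorting coordinate (the paper uses the pseudo-inverse of the Hilbert curve; any scalar coordinate works for illustration), and the function names are ours, not from the paper's code.

```python
import random

def sorted_stratified_resample(particles, weights, theta, rng):
    """Sketch of Algorithm 5: stratified resampling after sorting by theta.

    Draws one uniform per stratum ((m, m+1]/N) and locates the index k_m
    whose cumulative (sorted) weight interval contains the draw.
    """
    n = len(particles)
    # theta(X^{s_1}) <= ... <= theta(X^{s_N})
    order = sorted(range(n), key=lambda i: theta(particles[i]))
    w = [weights[i] for i in order]
    out = []
    j, cum = 0, w[0]  # cum = sum of sorted weights up to index j
    for m in range(n):
        point = (m + rng.random()) / n  # stratified draw in ((m)/n, (m+1)/n)
        while point > cum and j < n - 1:  # find k_m as in Algorithm 5
            j += 1
            cum += w[j]
        out.append(particles[order[j]])
    return out

rng = random.Random(7)
parts = [3.0, -1.0, 0.5, 2.0]
out = sorted_stratified_resample(parts, [0.25] * 4, theta=lambda x: x, rng=rng)
```

With equal weights, stratified resampling keeps each particle exactly once (each stratum contains exactly one cumulative-weight boundary), which is a convenient sanity check of the implementation.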
Thus ideally we would like to choose $\theta = h_{t-1}$ so that
$$\mathrm{Var}\left(h_{t-1}(X_{t-1}) \,|\, \theta(X_{t-1})\right) = 0. \quad (21)$$
While the ideal function $h_{t-1}$ is intractable, there are more generic choices of $\theta$ which guarantee (21). If $\theta : R^d \to R$ is an injective map then (21) automatically holds. Such maps usually arise as pseudo-inverses of space-filling curves. Following Gerber et al. (2019), we take $\theta$ to be the pseudo-inverse of the Hilbert curve, whose existence is given by their Proposition 2. More precisely, that proposition gives an injective map from $[0,1]^d$ to $[0,1]$; so to get an injection from $R^d$ to $[0,1]$ one might first apply any injection from $R^d$ to $[0,1]^d$. In practice, numerical implementations are available to sort particles with the Hilbert curve, for instance the hilbert module of the Python package particles. We point out that we consider this particular resampling strategy mainly for theoretical convenience. In our experiments, adaptive systematic resampling works well but, as we mentioned earlier, very little is known about its theoretical properties.

B.4. Asymptotic error

We recall the following definition from Webber (2019).

Definition B.3. The notation $|Y_n - c| \preceq U_n$ means that there exists a sequence of sets $(B_n)$ such that $P(B_n) \to 1$ and $\limsup_n E[\cdot] \le 1$; see Webber (2019) for the precise statement.

We are now ready to restate parts of Theorem 3.2 and Example 3.4 of Webber (2019) in terms of chi-squared divergences.

Theorem B.4. Given the Feynman-Kac model defined in Section B.2, assume that $\chi^2(Q_T(dx_0)||M_0(dx_0)) < \infty$ and $\chi^2(Q_T(dx_t)||Q_t(dx_t)) < \infty$ for all $t$. Then
- $\sqrt{N}(Z_T^N / Z_T - 1)$ is asymptotically normal with variance $\sigma^2_{\mathrm{mult}}$ if multinomial resampling is used;
- $|\sqrt{N}(Z_T^N / Z_T - 1)|^2 \preceq \sigma^2_{\mathrm{sort}}$ if sorted resampling (Gerber et al., 2019) is used;
with
$$\sigma^2_{\mathrm{mult}} = \chi^2(Q_T(dx_0)||M_0(dx_0)) + \sum_{t=1}^T \chi^2\left(Q_T(dx_{t-1}, dx_t)\,||\,Q_{t-1}(dx_{t-1}) M_t(x_{t-1}, dx_t)\right)$$
$$= \underbrace{\chi^2(Q_T(dx_0)||M_0(dx_0)) + \sum_{t=1}^T \int Q_{t-1}(dx_{t-1}) \left(\frac{Q_T(dx_{t-1})}{Q_{t-1}(dx_{t-1})}\right)^2 \chi^2\left(Q_T(dx_t|x_{t-1})\,||\,M_t(dx_t|x_{t-1})\right)}_{\sigma^2_{\mathrm{sort}}} + \sum_{t=1}^T \chi^2\left(Q_T(dx_{t-1})\,||\,Q_{t-1}(dx_{t-1})\right)$$
where we have decomposed $\sigma^2_{\mathrm{mult}}$ using the chain rule (Lemma B.2).
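The chain rule of Lemma B.2, which underlies the decomposition of $\sigma^2_{\mathrm{mult}}$ above, can be verified exactly on a small discrete example. The sketch below uses arbitrary illustrative joint distributions on $\{0,1\} \times \{0,1\}$; the helper name `chi2` is ours.

```python
import math

def chi2(q, p):
    """chi^2(q||p) = sum_x q(x)^2 / p(x) - 1 for discrete distributions."""
    return sum(qi * qi / pi for qi, pi in zip(q, p)) - 1.0

# Joint distributions on {0,1} x {0,1}, stored as p[x][y] (rows sum over y).
p = [[0.2, 0.3], [0.1, 0.4]]
q = [[0.25, 0.25], [0.2, 0.3]]

joint = chi2([v for row in q for v in row], [v for row in p for v in row])
px = [sum(row) for row in p]
qx = [sum(row) for row in q]
marg = chi2(qx, px)
# sum_x p(x) (q(x)/p(x))^2 chi^2(q(.|x) || p(.|x))
cond = sum(
    px[x] * (qx[x] / px[x]) ** 2
    * chi2([q[x][y] / qx[x] for y in (0, 1)],
           [p[x][y] / px[x] for y in (0, 1)])
    for x in (0, 1)
)
assert math.isclose(joint, marg + cond)  # Lemma B.2 holds exactly
```

The decomposition holds to machine precision here because every quantity is a finite sum; the same identity is what splits the joint term of $\sigma^2_{\mathrm{mult}}$ into a marginal mismatch plus a conditional mismatch.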
Proof. The original formulation of Theorem 3.2 (Webber, 2019) is written in terms of the quantities $h_t(x_t) := E\left[\prod_{s=t+1}^T G_s \,|\, X_t = x_t\right]$. To translate these notations into our case, note that $G_t = Q_t(dx_{0:T})/Q_{t-1}(dx_{0:T})$ up to a constant, and thus
$$\min_{c \in R} E_{Q_t}\left[|h_t - c|^2\right] = \mathrm{Var}_{Q_t}(h_t) = [Q_t(h_t)]^2\, \chi^2(Q_T(dx_t)||Q_t(dx_t))$$
using Lemma B.1. Moreover,
$$\mathrm{Var}_M\left(G_{t+1}(X_t, X_{t+1})\, h_{t+1}(X_{t+1}) \,|\, X_t\right) = h_t^2(X_t)\, \chi^2\left(Q_T(dx_{t+1}|x_t)\,||\,M_{t+1}(dx_{t+1}|x_t)\right)$$
and thus, using $h_t(x_t) = \frac{Q_T(dx_t)}{Q_t(dx_t)}\, Q_t(h_t)$, we get
$$E\left[\prod_{s=0}^t G_s\, \mathrm{Var}(G_{t+1} h_{t+1} | X_t)\right] = Z_t\, Q_t(h_t)^2\, E_{Q_t}\left[\left(\frac{Q_T(dx_t)}{Q_t(dx_t)}(X_t)\right)^2 \chi^2\left(Q_T(dx_{t+1}|X_t)\,||\,M_{t+1}(dx_{t+1}|X_t)\right)\right].$$
The identity $Z_t\, Q_t(h_t) = Z_T$ helps conclude the proof.

C.1. Proof of Proposition 2.1

Proof. Write
$$\pi_t(x_t) = \int \pi_0(x_0) p(x_t|x_0)\, dx_0 = \frac{1}{Z}\int g_0(x_0) p_0(x_0) p(x_t|x_0)\, dx_0 = \frac{1}{Z}\int g_0(x_0) p_0(x_t) p(x_0|x_t)\, dx_0 = \frac{1}{Z}\, p_0(x_t)\, g_t(x_t).$$
The proof is concluded by taking the log gradient of the obtained identity with respect to $x_t$.

C.2. Proof of Lemma 2.2

We first state the following elementary lemma on the solution of a linear SDE (Kloeden & Platen, 1992, Chapter 4.2).

Lemma C.1. Let $a : [0,T] \to R$ and $c : [0,T] \to R$ be two continuous functions. Put $\Phi_t = \exp\left(\int_0^t a_s\, ds\right)$, $t \in [0,T]$. Then the solution to $dZ_t = (a_t Z_t + c_t)\, dt + \sqrt{2}\, dB_t$ is
$$Z_t = \Phi_t\left(Z_0 + \int_0^t \Phi_s^{-1} c_s\, ds + \sqrt{2}\int_0^t \Phi_s^{-1}\, dB_s\right).$$
Using the Itô isometry, we get the following straightforward corollary.

Corollary C.2. Under the setting of Lemma C.1, if $Z_0 \sim N(0,1)$, then
$$E[Z_t] = \Phi_t \int_0^t \Phi_s^{-1} c_s\, ds, \qquad \mathrm{Var}(Z_t) = \Phi_t^2 + 2\Phi_t^2 \int_0^t \Phi_s^{-2}\, ds. \quad (25)$$

We are now ready to give the proof of Proposition 2.2.

Proof of Proposition 2.2. Without loss of generality, we can assume that $\beta_t \equiv 1$. Indeed, putting $\hat{g}_t(x_t) := g_0(e^{-t} x_t)$ and defining $Z_t^{(T)}$ by
$$dZ_t^{(T)} = \left[-Z_t^{(T)} + 2\nabla \log \hat{g}_{T-t}(Z_t^{(T)})\right] dt + \sqrt{2}\, dB_t, \qquad Z_0^{(T)} \sim N(0,1), \quad (26)$$
we see that $Z_t^{(T)}$ has the same law as $Z_{\int_{T-t}^T \beta_s ds}^{(\int_0^T \beta_s ds)}$. We only consider the case $\sigma \neq 1$ here. The case $\sigma = 1$ can be treated using similar but simpler calculations; it can also be recovered by letting $\sigma \to 1$ in the expressions below.
Since $\nabla \log g_0(x_0) = \nabla \log \pi(x_0) - \nabla \log p_0(x_0) = -(x_0 - \mu)/\sigma^2 + x_0$, Equation (9) has the form (23) with
$$a_t = -\left[1 + 2e^{2(t-T)}(1/\sigma^2 - 1)\right], \qquad c_t = 2e^{t-T}\mu/\sigma^2. \quad (27)$$
Tedious but standard calculations yield
$$\Phi_t = \exp\left(-t - e^{-2T}(1/\sigma^2 - 1)(e^{2t} - 1)\right) \quad (28)$$
and the integral of the $t$-dependent terms in $\int_0^T \Phi_t^{-1} c_t\, dt$ is
$$\int_0^T \exp\left\{2t + (1/\sigma^2 - 1)e^{2(t-T)}\right\} dt = \frac{1}{2e^{-2T}(1/\sigma^2 - 1)}\left[\exp\left(1/\sigma^2 - 1\right) - \exp\left(e^{-2T}(1/\sigma^2 - 1)\right)\right].$$
Using Corollary C.2, we have
$$E[Z_T^{(T)}] = \frac{\mu}{1 - \sigma^2}\left[1 - \exp\left((e^{-2T} - 1)(1/\sigma^2 - 1)\right)\right] \quad (29)$$
and $\mathrm{Var}(Z_T^{(T)}) = v_{T,1} + v_{T,2}$, where
$$v_{T,1} = \frac{1}{2(1/\sigma^2 - 1)}\left[1 - \exp\left(-2(1/\sigma^2 - 1)(1 - e^{-2T})\right)\right], \quad (30)$$
$$v_{T,2} = \exp\left(-2T - 2(1/\sigma^2 - 1)(1 - e^{-2T})\right). \quad (31)$$
The lemma is proved.

C.3. Proof of Propositions 3.1 and 3.2

The propositions follow from a direct application of Theorem B.4 to the Feynman-Kac model
$$Q_K(y_{0:K}) = \frac{1}{Z} M_K(y_{0:K})\, G_0(y_0) \prod_{k=1}^K G_k(y_{k-1}, y_k)$$
where we have the correspondence $y_k = x_{K-k}$ and
$$M_K(y_{0:K}) = N(x_K|0, I_d)\, \hat{\pi}(x_{K-1}|x_K) \ldots \hat{\pi}(x_0|x_1), \qquad G_0(y_0) = \hat{g}_K(x_K), \qquad G_k(y_{k-1}, y_k) = \omega_{K-k}(x_{K-k}, x_{K-k+1}).$$

C.4. Proof of Proposition 3.3

We start by providing some intuition for the result. Recall that $X_k$ is a shorthand for $X_{k\delta}$ where $\delta$ is the discretization step size. This convention only applies to $X_k$. When $K$ is big ($\delta$ is small), we have, as established in (14),
$$\pi(x_k|x_{k+1}) \approx N\left(x_k;\, \sqrt{1 - \alpha_{k+1}}\, x_{k+1} + \alpha_{k+1} \nabla \log g_{k+1}(x_{k+1}),\, \alpha_{k+1} I\right),$$
$$\hat{\pi}(x_k|x_{k+1}) = N\left(x_k;\, \sqrt{1 - \alpha_{k+1}}\, x_{k+1} + \alpha_{k+1} \nabla \log \hat{g}_{k+1}(x_{k+1}),\, \alpha_{k+1} I\right).$$
Using the analytic formula for the chi-squared divergence between two Gaussians, we get
$$\chi^2(\pi(x_k|x_{k+1})||\hat{\pi}(x_k|x_{k+1})) \approx e^{\alpha_{k+1}\|\nabla \log g_{k+1} - \nabla \log \hat{g}_{k+1}\|^2(x_{k+1})} - 1 \approx 2\delta\, \|\nabla \log g_{k+1} - \nabla \log \hat{g}_{k+1}\|^2(x_{k+1}).$$
Hence
$$\zeta_K^2 = \chi^2(\pi_K||N(0, I_d)) + \sum_k \int \frac{\pi_{k+1}(x_{k+1})^2}{\hat{\pi}_{k+1}(x_{k+1})}\, \chi^2(\pi(x_k|x_{k+1})||\hat{\pi}(x_k|x_{k+1}))\, dx_{k+1}$$
$$\approx \chi^2(\pi_K||N(0, I_d)) + \sum_k \int \frac{\pi_{k+1}(x_{k+1})^2}{\hat{\pi}_{k+1}(x_{k+1})}\, 2\delta\, \|\nabla \log g_{k+1} - \nabla \log \hat{g}_{k+1}\|^2(x_{k+1})\, dx_{k+1}$$
$$\approx \chi^2(\pi_T||N(0, I_d)) + 2\int_0^T \int \frac{\pi_t(x)^2}{\hat{\pi}_t(x)}\, \|\nabla \log g_t(x) - \nabla \log \hat{g}_t(x)\|^2\, dx\, dt.$$

C.4.1. REGULARITY CONDITIONS FOR PROPOSITION 3.3

We assume that the sequence of distributions $\pi_t(\cdot)$ satisfies the following properties.

Assumption C.3. There exists $M_1 > 0$ such that $\|\nabla \log \pi_t(x_t)\| \le M_1(1 + \|x_t\|)$.

Assumption C.4.
There exist $M_2 > 0$, $M_3 > 0$, $\alpha_2 \ge 1$, and $\alpha_3 \ge 1$ such that $\|\nabla^2 \log \pi_t(x_t)\| \le M_2(1 + \|x_t\|)^{\alpha_2}$ and $\|\nabla^3 \log \pi_t(x_t)\| \le M_3(1 + \|x_t\|)^{\alpha_3}$.

Assumption C.5. There exist $\vartheta > 0$ and $M_\vartheta < \infty$ such that $\int \pi_t(x_t)\, e^{\vartheta \|x_t\|^2}\, dx_t < M_\vartheta$ for all $t$.

These assumptions are satisfied, for example, when the target distribution $\pi_0$ is Gaussian.

Remark C.6. Let $\varphi(x_t) = \nabla \log \pi_t(x_t)$. Then the notation $\nabla^3 \log \pi_t(x_t)$ refers to the second-order differential $\varphi''(x_t)$, which is a bilinear mapping from $R^d \times R^d$ to $R^d$. For any multilinear operator $H : X_1 \times \ldots \times X_n \to Y$, we define
$$\|H\|_{op} := \sup_{x_1, \ldots, x_n \neq 0} \frac{\|H(x_1, \ldots, x_n)\|}{\|x_1\| \ldots \|x_n\|}.$$
By writing $\|H\|$ we implicitly refer to $\|H\|_{op}$. In fact, the spaces of such operators are finite-dimensional in our cases of interest, hence any two norms are bounded by a constant factor of each other.

The above assumptions only concern the differentials of $\log \pi_t(x_t)$ with respect to $x$. The following lemma derives a bound with respect to $t$.

Lemma C.7. Under the above assumptions, there exist constants $\tilde{M}_1$ and $\tilde{\alpha}_1$ such that, for all $t$,
$$\|\partial_t \nabla \log \pi_t(x_t)\| \le \tilde{M}_1(1 + \|x_t\|)^{\tilde{\alpha}_1}.$$

Proof. Using the Fokker-Planck equation for the score (Lai et al., 2022), we write
$$\partial_t \log g_t(x_t) = \mathrm{div}_x\, \nabla \log g_t(x_t) + \nabla \log g_t(x_t) \cdot \left(\nabla \log g_t(x_t) - x_t\right).$$
For a fixed $t$, put $\varphi(x_t) = \nabla \log g_t(x_t)$ and $\psi(x_t) = \mathrm{Tr}(\varphi'(x_t)) + \varphi(x_t) \cdot (\varphi(x_t) - x_t)$. Then $\partial_t \nabla \log g_t(x_t) = \nabla \psi(x_t)$. Viewing $\psi'(x_t)$, $\varphi'(x_t)$, and $\varphi''(x_t)$ as elements of $L(R^d, R)$, $L(R^d, R^d)$, and $L(R^d, L(R^d, R^d))$ respectively, where $L(A, B)$ is the space of linear operators from $A$ to $B$, we can write
$$\psi'(x_t)h = \mathrm{Tr}(\varphi''(x_t)h) + (\varphi'(x_t)h) \cdot (\varphi(x_t) - x_t) + \varphi(x_t) \cdot (\varphi'(x_t)h - h), \qquad h \in R^d,$$
where $\cdot$ stands for the usual scalar product between two vectors in $R^d$. There is a constant $C$ depending on the dimension such that $\mathrm{Tr}(L) \le C\|L\|_{op}$ for endomorphisms $L$. Thus
$$\|\psi'(x_t)h\| \le C\|\varphi''(x_t)\|_{op}\|h\| + \|\varphi'(x_t)\|_{op}\|\varphi(x_t) - x_t\|\|h\| + \|\varphi(x_t)\|\,\|\varphi'(x_t) - \mathrm{Id}\|_{op}\|h\| \le \tilde{M}_1(1 + \|x_t\|)^{\tilde{\alpha}_1}\|h\|$$
for some $\tilde{M}_1$ and $\tilde{\alpha}_1$, by Assumptions C.3 and C.4. This entails
$$\|\partial_t \nabla \log g_t(x_t)\| = \|\psi'(x_t)\|_{op} \le \tilde{M}_1(1 + \|x_t\|)^{\tilde{\alpha}_1}.$$

C.4.2.
FORMAL PROOF

To formalize the error of the heuristic approximations presented at the beginning of Section C.4, we need the following technical lemma.

Lemma C.8. Let $x_0$ and $v$ be two vectors in $R^d$ and suppose that $X_t^A$ and $X_t^B$ are respectively the unique solutions of the SDEs
$$(A): \quad dX_t^A = (-X_t^A + 2v)\, dt + \sqrt{2}\, dW_t^A, \qquad X_0^A = x_0,$$
$$(B): \quad dX_t^B = (-X_t^B + 2f_t(X_t^B))\, dt + \sqrt{2}\, dW_t^B, \qquad X_0^B = x_0,$$
where there exist strictly positive constants $M_1$, $\tilde{M}_1$, $M_2$, and $M_3$, and constants $\tilde{\alpha}_1$, $\alpha_2$, $\alpha_3$ strictly greater than 1, such that the function $f_t(x_t)$ satisfies $\|f_t(x_t)\| \le M_1(1 + \|x_t\|)$, $\|\partial_t f_t(x_t)\| \le \tilde{M}_1(1 + \|x_t\|)^{\tilde{\alpha}_1}$, and $\|\nabla^i f_t(x_t)\| \le M_{i+1}(1 + \|x_t\|)^{\alpha_{i+1}}$ for $i \in \{1, 2\}$. (The notation $\nabla$ refers implicitly to the gradient with respect to $x$.) Denote by $P^A(dx_{[0,T]})$ and $P^B(dx_{[0,T]})$ respectively the path measures associated with the solutions of (A) and (B). Then there exist a parameter $\widetilde{M}$ depending on all the aforementioned constants and a parameter $\widetilde{M}_1$ depending only on $M_1$ such that, for any $t \le 1/\widetilde{M}_1$, the chi-squared divergence of $P^B(dx_{[0,t]})$ with respect to $P^A(dx_{[0,t]})$ is finite, and
$$\chi^2(P^B(dx_{[0,t]})||P^A(dx_{[0,t]})) \le 2t\, \|f_0(x_0) - v\|^2 + \widetilde{M} t^2\, e^{t\widetilde{M}_1(\|x_0\|^2 + \|v\|^2)}\, (1 + \|x_0\| + \|v\|)^{4(1 + \tilde{\alpha}_1 + \alpha_2 + \alpha_3)}.$$

Proof. Put $\Delta_t(x_t) = \sqrt{2} f_t(x_t) - \sqrt{2} v$. By an application of Girsanov's theorem (Liptser & Shiryayev, 1977, Example 3, Section 6.2.3), we have
$$D_t := \frac{dP^B}{dP^A}(X_{[0,t]}) = \exp\left(\int_0^t \langle \Delta_s(X_s), dW_s^A\rangle - \frac{1}{2}\int_0^t \|\Delta_s(X_s)\|^2\, ds\right)$$
where, for a vector-valued process $V_t$, the notation $\int_0^t \langle V_s, dW_s\rangle := \sum_{i=1}^d \int_0^t V_s^i\, dW_s^i$. As a preliminary step, we would like to bound $E^A[D_t^\alpha (1 + \|X_t\| + \|v\|)^n]$ for some $\alpha, n \ge 1$. Here, $c(\cdot)$ denotes a constant whose value might change from line to line and depends on the variables inside the round brackets. We also drop the subscript/superscript $A$ from $E^A$ and $W_t^A$ whenever there is no risk of confusion.
We have
$$E[D_t^\alpha (1 + \|X_t\| + \|v\|)^n] = E\left[\exp\left(\alpha \int_0^t \langle \Delta_s, dW_s\rangle - \frac{\alpha}{2}\int_0^t \|\Delta_s\|^2 ds\right)(1 + \|X_t\| + \|v\|)^n\right]$$
$$= E\left[\exp\left(\alpha \int_0^t \langle \Delta_s, dW_s\rangle - \alpha^2 \int_0^t \|\Delta_s\|^2 ds\right) \exp\left(\left(\alpha^2 - \frac{\alpha}{2}\right)\int_0^t \|\Delta_s\|^2 ds\right)(1 + \|X_t\| + \|v\|)^n\right]$$
$$\le e^{t(c(M_1,\alpha) + c(\alpha)\|v\|^2)}\, E\left[\exp\left(\alpha \int_0^t \langle \Delta_s, dW_s\rangle - \alpha^2 \int_0^t \|\Delta_s\|^2 ds\right) \exp\left(c(M_1,\alpha)\int_0^t \|X_s\|^2 ds\right)(1 + \|X_t\| + \|v\|)^n\right]$$
using $\|\Delta_s\|^2 \le c(M_1)(1 + \|X_s\|^2) + 8\|v\|^2$,
$$\le e^{t(c(M_1,\alpha) + c(\alpha)\|v\|^2)}\, E^{1/2}\left[\exp\left(2\alpha\int_0^t \langle \Delta_s, dW_s\rangle - 2\alpha^2\int_0^t \|\Delta_s\|^2 ds\right)\right] E^{1/4}\left[\exp\left(4c(M_1,\alpha)\int_0^t \|X_s\|^2 ds\right)\right] E^{1/4}\left[(1 + \|X_t\| + \|v\|)^{4n}\right]$$
using the double Cauchy-Schwarz inequality $E[XYZ] \le E^{1/2}[X^2] E^{1/4}[Y^4] E^{1/4}[Z^4]$. In the last line of the above display, the first expectation is equal to 1 by the same Girsanov argument as Liptser & Shiryayev (1977, Example 3, Section 6.2.3). To bound the second expectation, we note that under $P^A$ we have $X_s \sim N(\sqrt{1 - \lambda_s}(x_0 - 2v) + 2v, \lambda_s)$. Elementary calculations yield the bound
$$E[e^{k\|X_s\|^2}] = \left(\frac{1}{1 - 2k\lambda_s}\right)^{d/2} \exp\left(\frac{k\|\sqrt{1-\lambda_s}(x_0 - 2v) + 2v\|^2}{1 - 2k\lambda_s}\right) \le c(d)\, e^{16k(\|x_0\|^2 + \|v\|^2)}$$
for $0 < k < 1/4$. Write
$$E\left[\exp\left(4c(M_1,\alpha)\int_0^t \|X_s\|^2 ds\right)\right] \le \frac{1}{t}\int_0^t E\left[e^{4tc(M_1,\alpha)\|X_s\|^2}\right] ds \le c(1)\, e^{64tc(M_1,\alpha)(\|x_0\|^2 + \|v\|^2)} \quad \text{if } t \le \frac{1}{16c(M_1,\alpha)},$$
using Jensen's inequality and the above bound. The third expectation can be bounded by $c(n)(1 + \|x_0\|^{4n} + \|v\|^{4n})$. Putting everything together, we establish that there exist constants $c(n)$ and $c(M_1,\alpha)$ such that
$$E[D_t^\alpha (1 + \|X_t\| + \|v\|)^n] \le c(n)\, e^{tc(M_1,\alpha)(\|x_0\|^2 + \|v\|^2)}(1 + \|x_0\|^{4n} + \|v\|^{4n}), \qquad t \le \frac{1}{c(M_1,\alpha)}. \quad (32)$$
Now, to study $\chi^2(P^B(dx_{[0:t]})||P^A(dx_{[0:t]}))$, we apply Itô's formula to $D_t^2$ and get, under $P^A$,
$$D_t^2 = 1 + 2\int_0^t D_s^2 \langle \Delta_s, dW_s\rangle + \int_0^t D_s^2 \|\Delta_s\|^2\, ds. \quad (33)$$
Putting $\eta_t(x_t) = \|\Delta_t(x_t)\|^2$ and $\tilde{f}_t(x_t) = -x_t + 2v$, we have
$$D_t^2 \eta_t = \eta_0 + \int_0^t D_s^2 \left[\langle 2\eta_s \Delta_s + \sqrt{2}\nabla\eta_s,\, dW_s\rangle + \left(\eta_s \|\Delta_s\|^2 + \nabla\eta_s \cdot (\tilde{f}(X_s) + 2\sqrt{2}\Delta_s) + \partial_s \eta_s + \mathrm{Tr}\,\nabla^2\eta_s\right) ds\right]. \quad (34)$$
To study (33) and (34), we use (32) together with the following bounds, which are consequences of the lemma's assumptions:
$$\|\Delta_s(x_s)\| \le c(M_1)(1 + \|x_s\| + \|v\|),$$
$$\eta_s(x_s) \le c(M_1)(1 + \|x_s\| + \|v\|)^2,$$
$$\|\nabla\eta_s(x_s)\| \le c(M_1, M_2)(1 + \|x_s\| + \|v\|)^{1+\alpha_2},$$
$$\|\partial_s \eta_s(x_s)\| \le c(M_1, \tilde{M}_1)(1 + \|x_s\| + \|v\|)^{1+\tilde{\alpha}_1},$$
$$|\mathrm{Tr}\,\nabla^2\eta_s| \le c(M_1, M_2, M_3)(1 + \|x_s\| + \|v\|)^{1+\alpha_2+\alpha_3}.$$
The last inequality is justified by the fact that $\mathrm{Tr}\,\nabla^2\eta_s \le C\|\nabla^2\eta_s\|$, where, by considering $\nabla^2\eta_s(x_s)$ as the second differential of $\eta_s$ at $x_s$ (i.e.
a bilinear form from $R^d \times R^d$ to $R$), we have
$$\frac{\partial^2 \eta_s}{\partial x^2}(x)[h, k] = 2\left\langle \Delta_s(x), \frac{\partial^2 \Delta_s}{\partial x^2}(x)[h, k]\right\rangle + 2\left\langle \frac{\partial \Delta_s}{\partial x}(x)h, \frac{\partial \Delta_s}{\partial x}(x)k\right\rangle, \qquad (h, k) \in R^d \times R^d,$$
where $\langle \cdot, \cdot\rangle$ stands for the usual scalar product between two vectors in $R^d$. These bounds show that the stochastic integrals (w.r.t. $dW_s$) in (33) and (34) are true martingales (as opposed to merely local martingales). Moreover, there exist a constant $\widetilde{M}$ depending on $M_1$, $M_2$, $M_3$, $\tilde{M}_1$, $\tilde{\alpha}_1$, $\alpha_2$, and $\alpha_3$, and a constant $\widetilde{M}_1$ depending on $M_1$ only, such that
$$E\left[D_s^2\left(\eta_s\|\Delta_s\|^2 + \nabla\eta_s \cdot (\tilde{f}(X_s) + 2\sqrt{2}\Delta_s) + \partial_s\eta_s + \mathrm{Tr}\,\nabla^2\eta_s\right)\right] \le 4\widetilde{M}\, e^{t\widetilde{M}_1(\|x_0\|^2 + \|v\|^2)}(1 + \|x_0\| + \|v\|)^{4(1+\tilde{\alpha}_1+\alpha_2+\alpha_3)}, \qquad s \le t \le \frac{1}{\widetilde{M}_1}.$$
Taking the expectation of both sides of (34) and rearranging yields
$$E[D_t^2 \eta_t] \le \eta_0 + 4t\widetilde{M}\, e^{t\widetilde{M}_1(\|x_0\|^2 + \|v\|^2)}(1 + \|x_0\| + \|v\|)^{4(1+\tilde{\alpha}_1+\alpha_2+\alpha_3)}, \qquad t \le \frac{1}{\widetilde{M}_1}.$$
Then we have, by taking the expectation of both sides of (33),
$$\chi^2(P^B(dx_{[0:t]})||P^A(dx_{[0:t]})) - t\eta_0 = E[D_t^2] - 1 - t\eta_0 \le \int_0^t 2s\widetilde{M}\, e^{t\widetilde{M}_1(\|x_0\|^2 + \|v\|^2)}(1 + \|x_0\| + \|v\|)^{4(1+\tilde{\alpha}_1+\alpha_2+\alpha_3)}\, ds = t^2\widetilde{M}\, e^{t\widetilde{M}_1(\|x_0\|^2 + \|v\|^2)}(1 + \|x_0\| + \|v\|)^{4(1+\tilde{\alpha}_1+\alpha_2+\alpha_3)}.$$
The proof is completed.

We are now ready to give the proof of Proposition 3.3.

Proof. To make the arguments clearer, we shall assume that the proposal distribution is
$$\hat{\pi}(x_k|x_{k+1}) = N\left(x_k \,\middle|\, \sqrt{1 - \alpha_{k+1}}\, x_{k+1} + 2\left(1 - \sqrt{1 - \alpha_{k+1}}\right)\nabla \log \hat{g}_{k+1}(x_{k+1}),\, \alpha_{k+1} I\right)$$
which is slightly different from (16). As we will see, the proof also applies to the original discretization with minimal changes. For $0 < s < u < T$, the distributions $\hat{\pi}(x_s|x_u)$ and $\pi(x_s|x_u)$ can be obtained by respectively solving the following SDEs between times $T - u$ and $T - s$:
$$\hat{\pi}(x_s|x_u): \quad dY_t^A = \left(-Y_t^A + 2\nabla \log \hat{g}_u(x_u)\right) dt + \sqrt{2}\, dW_t^A, \qquad Y_{T-u}^A = x_u,$$
$$\pi(x_s|x_u): \quad dY_t^B = \left(-Y_t^B + 2\nabla \log g_{T-t}(Y_t^B)\right) dt + \sqrt{2}\, dW_t^B, \qquad Y_{T-u}^B = x_u.$$
The assumptions in Section C.4.1 and Lemma C.7 show that the conditions of Lemma C.8 are satisfied for this pair of SDEs. Thus
$$\chi^2(P^B(dy_{[T-u,T-s]})||P^A(dy_{[T-u,T-s]})) \le 2(u-s)\|\nabla \log g_u(x_u) - \nabla \log \hat{g}_u(x_u)\|^2 + \widetilde{M}(u-s)^2\, e^{(u-s)\widetilde{M}_1(\|x_u\|^2 + \|\nabla \log \hat{g}_u(x_u)\|^2)}\left(1 + \|x_u\| + \|\nabla \log \hat{g}_u(x_u)\|\right)^{\alpha_+}$$
for $\alpha_+ = 4(1 + \tilde{\alpha}_1 + \alpha_2 + \alpha_3)$ and $u - s \le 1/\widetilde{M}_1$.
This, together with the data processing inequality and the assumption on \(\nabla\log\hat g_t\), implies
\[
\chi^2\big(\pi(x_k|x_{k+1})\,\|\,\hat\pi(x_k|x_{k+1})\big) \le 2\delta\|\nabla\log g_{k+1} - \nabla\log\hat g_{k+1}\|^2(x_{k+1}) + \widetilde M\delta^2 e^{\delta\widetilde M_1(1+2C_2^2)(1+\|x_{k+1}\|)^2}(1+C_2)^{\alpha_+}(1+\|x_{k+1}\|)^{\alpha_+}, \quad \delta \le \tfrac{1}{\widetilde M_1}.
\]
Thus, for a sufficiently fine discretization,
\[
\sum_k \int \frac{\pi_{k+1}(x_{k+1})^2}{\hat\pi_{k+1}(x_{k+1})}\,\chi^2\big(\pi(x_k|x_{k+1})\,\|\,\hat\pi(x_k|x_{k+1})\big)\,\mathrm{d}x_{k+1} \le \sum_k \delta\int \frac{\pi_{k+1}(x_{k+1})^2}{\hat\pi_{k+1}(x_{k+1})}\,2\|\nabla\log g_{k+1}-\nabla\log\hat g_{k+1}\|^2(x_{k+1})\,\mathrm{d}x_{k+1} + \sum_k \int C_1\pi_{k+1}(x_{k+1})\,\widetilde M\delta^2 e^{\delta\widetilde M_1(1+2C_2^2)(1+\|x_{k+1}\|^2)}(1+C_2)^{\alpha_+}(1+\|x_{k+1}\|)^{\alpha_+}\mathrm{d}x_{k+1}. \quad (35)
\]
The first term is a Riemann sum and converges to \(\int_0^T \int \frac{\pi_t(x_t)^2}{\hat\pi_t(x_t)}\,2\|\nabla\log g_t - \nabla\log\hat g_t\|^2(x_t)\,\mathrm{d}x_t\,\mathrm{d}t\). To bound the second term, first note that
\[
\mathbb{E}_{\pi_{k+1}}\big[(1+\|X\|)^{2\alpha_+}\big] \le 2^{2\alpha_+-1}\,\mathbb{E}\big[1 + \|X\|^{2\alpha_+}\big] \le 2^{2\alpha_+-1}\Big(1 + \frac{\alpha_+!}{\vartheta^{\alpha_+}}\,\mathbb{E}\big[e^{\vartheta\|X\|^2}\big]\Big) \le 2^{2\alpha_+-1}\Big(1 + M_\vartheta\frac{\alpha_+!}{\vartheta^{\alpha_+}}\Big),
\]
where \(M_\vartheta\) and \(\vartheta\) appear in Assumption C.5. Thus, as long as \(\delta\widetilde M_1(1+2C_2^2) \le \vartheta/2\), it holds that
\[
\int \pi_{k+1}(x)\,e^{\delta\widetilde M_1(1+2C_2^2)\|x\|^2}(1+\|x\|)^{\alpha_+}\mathrm{d}x \le \mathbb{E}_{\pi_{k+1}}\big[e^{\vartheta\|X\|^2/2}(1+\|X\|)^{\alpha_+}\big] \le \mathbb{E}^{1/2}\big[e^{\vartheta\|X\|^2}\big]\,\mathbb{E}^{1/2}\big[(1+\|X\|)^{2\alpha_+}\big] \le C(M_\vartheta,\alpha_+,\vartheta)
\]
for some constant \(C(M_\vartheta,\alpha_+,\vartheta)\) depending on \(M_\vartheta\), \(\alpha_+\) and \(\vartheta\). Hence the second sum of (35) tends to 0 when \(\delta \to 0\). The proof is finished.

C.5. Proof of Proposition 4.1

The denoising score matching identity is standard and recalled here for convenience. We have \(\pi_k(x_k) = \int p(x_k|x_0)\pi_0(x_0)\mathrm{d}x_0\), so by using the log-derivative trick we obtain
\[
\nabla\log\pi_k(x_k) = \int \nabla\log p(x_k|x_0)\,\frac{\pi_0(x_0)p(x_k|x_0)}{\pi_k(x_k)}\,\mathrm{d}x_0 \quad (36)
\]
\[
= \int \nabla\log p(x_k|x_0)\,\pi(x_0|x_k)\,\mathrm{d}x_0. \quad (37)
\]
It can easily be verified that the interchange of differentiation and integration here does not require any regularity assumption on \(\pi_0(x_0)\) apart from differentiability. To prove the novel score identity, we first note that, under the condition \(\int \pi(x_0)\,e^{\eta\|x_0\|^2}\mathrm{d}x_0 < \infty\) for some \(\eta > 0\), we have
\[
\int \nabla_{x_0}\log\pi(x_0|x_k)\,\pi(x_0|x_k)\,\mathrm{d}x_0 = 0 \quad (38)
\]
according to Lemma C.9. Combining this identity with (37), we have, for any \(\alpha \in \mathbb{R}\),
\[
\nabla\log\pi(x_k) = \int \big[\nabla_{x_k}\log p(x_k|x_0) + \alpha\,\nabla_{x_0}\log\pi(x_0|x_k)\big]\,\pi(x_0|x_k)\,\mathrm{d}x_0.
\]
In particular:

- For \(\alpha = \frac{1}{\sqrt{1-\lambda_k}}\), we get
\[
\nabla\log\pi(x_k) = \frac{1}{\sqrt{1-\lambda_k}}\int \nabla\log\pi(x_0)\,\pi(x_0|x_k)\,\mathrm{d}x_0,
\]
which is the identity presented in Appendix C.1.3 (De Bortoli et al., 2021);

- For \(\alpha = \sqrt{1-\lambda_k}\), we get
\[
\nabla\log\pi(x_k) = \int \big[\sqrt{1-\lambda_k}\,\nabla\log g_0(x_0) - x_k\big]\,\pi(x_0|x_k)\,\mathrm{d}x_0,
\]
which is the identity we wanted to prove.

The verifications are straightforward by remarking that \(\nabla_{x_0}\log\pi(x_0|x_k) = \nabla\log g_0(x_0) + \nabla_{x_0}\log p(x_0|x_k)\). We also note that choosing \(\alpha = 0\) brings us back to the classical score matching loss. Therefore, different values of \(\alpha\) give losses with different properties. We finish this section with a technical lemma giving conditions for (38) to hold. The identity is a particular case of what is known in the literature as zero-variance control variates and Stein control variates (Assaraf & Caffarel, 1999; Mira et al., 2013; Anastasiou et al., 2023).

Lemma C.9. Let \(f: \mathbb{R}^d \to \mathbb{R}\) be a probability density, i.e. \(f \ge 0\) and \(\int f(x)\mathrm{d}x = 1\). Suppose that \(f\) is continuously differentiable and \(\int \|\nabla f(x)\|\mathrm{d}x < \infty\). Then \(\int \nabla f(x)\mathrm{d}x = 0\).

Remark C.10. The condition \(\int \|\nabla f(x)\|\mathrm{d}x < \infty\) is clearly the minimum necessary for \(\int \nabla f(x)\mathrm{d}x = 0\) to make sense. On the other hand, we do not explicitly require that \(f\) or \(\nabla f\) vanishes at infinity.

Proof. Without loss of generality, we only prove that \(\int \partial_1 f(x)\mathrm{d}x = 0\). Put \(g(x_1) := \int f(x_1, x_{2:d})\mathrm{d}x_{2:d}\). Fubini's theorem then implies that \(\int_{\mathbb{R}} g(x_1)\mathrm{d}x_1 = 1\). We have
\[
\int_{\mathbb{R}^d} \partial_1 f(x)\mathrm{d}x = \lim_{M\to\infty}\int_{\mathbb{R}^{d-1}}\int_{-M}^{M} \partial_1 f(x_1, x_{2:d})\,\mathrm{d}x_1\mathrm{d}x_{2:d} = \lim_{M\to\infty}\int_{\mathbb{R}^{d-1}} \big[f(M, x_{2:d}) - f(-M, x_{2:d})\big]\mathrm{d}x_{2:d} = \lim_{M\to\infty}\big[g(M) - g(-M)\big]. \quad (39)
\]
Put \(I(M) := \int_0^\infty |g(M+x) - g(-M-x)|\,\mathrm{d}x\). We have \(I(M) \le \int_0^\infty |g(M+x)| + |g(-M-x)|\,\mathrm{d}x\), thus \(\lim_{M\to\infty} I(M) = 0\) by the integrability of \(g\). Combining this with Fatou's lemma yields
\[
0 = \liminf_{M\to\infty} I(M) \ge \int_0^\infty \liminf_{M\to\infty}\big|g(M+x) - g(-M-x)\big|\,\mathrm{d}x = \int_0^\infty \Big|\int_{\mathbb{R}^d} \partial_1 f(y)\,\mathrm{d}y\Big|\,\mathrm{d}x
\]
by (39). This means that \(\int_{\mathbb{R}^d} \partial_1 f(y)\mathrm{d}y = 0\).

C.6. Proof of Proposition 4.2

Proof. Since we are in the Gaussian case with \(d = 1\), we have \(\nabla\log g_0(x_0) = ax_0 + b\) for some \(a, b \in \mathbb{R}\). Therefore \(\ell_{\mathrm{NSM}}(\theta)\) and \(\mathrm{Var}(\hat\nabla\ell_{\mathrm{NSM}}(\theta))\) are trivially bounded.
To study \(\ell_{\mathrm{DSM}}(\theta)\), we first note that
\[
\mathrm{Var}(X_0|X_t) = \frac{\lambda_t\sigma^2}{\lambda_t + (1-\lambda_t)\sigma^2} =: \rho_t.
\]
Then
\[
\mathbb{E}\big[\|s_\theta(t, X_t) - \nabla\log p(X_t|X_0)\|^2 \,\big|\, X_t\big] \ge \mathrm{Var}\big(s_\theta(t, X_t) - \nabla\log p(X_t|X_0)\,\big|\, X_t\big) = \mathrm{Var}\Big(\frac{X_t - \sqrt{1-\lambda_t}\,X_0}{\lambda_t}\,\Big|\, X_t\Big) = \frac{(1-\lambda_t)\rho_t}{\lambda_t^2},
\]
so that
\[
\ell_{\mathrm{DSM}}(\theta) \ge \int_0^T \frac{(1-\lambda_t)\rho_t}{\lambda_t^2}\,\mathrm{d}t = \infty
\]
since \(\rho_t \sim 2t\) as \(t \to 0\). Concerning \(\hat\nabla\ell_{\mathrm{DSM}}(\theta) = 2T\big(s_\theta(\tau, X_\tau) - \nabla\log p(X_\tau|X_0)\big)\nabla_\theta s_\theta(\tau, X_\tau)\), we have
\[
\mathbb{E}\big[\|\hat\nabla\ell_{\mathrm{DSM}}(\theta)\|^2 \,\big|\, X_t, \tau = t\big] = 4T^2\|\nabla_\theta s_\theta(t, X_t)\|^2\,\mathbb{E}\big[\|s_\theta(t, X_t) - \nabla\log p(X_t|X_0)\|^2 \,\big|\, X_t, \tau = t\big] \ge 4T^2\|\nabla_\theta s_\theta(t, X_t)\|^2\,\frac{(1-\lambda_t)\rho_t}{\lambda_t^2},
\]
hence
\[
\mathbb{E}\big[\|\hat\nabla\ell_{\mathrm{DSM}}(\theta)\|^2\big] = \frac{1}{T}\int_0^T \mathbb{E}\big[\|\hat\nabla\ell_{\mathrm{DSM}}(\theta)\|^2 \,\big|\, \tau = t\big]\mathrm{d}t \ge 4T\int_0^T \mathbb{E}\Big[\|\nabla_\theta s_\theta(t, X_t)\|^2\,\frac{(1-\lambda_t)\rho_t}{\lambda_t^2}\Big]\mathrm{d}t = 4T\,\mathbb{E}\Big[\int_0^T \|\nabla_\theta s_\theta(t, X_t)\|^2\,\frac{(1-\lambda_t)\rho_t}{\lambda_t^2}\,\mathrm{d}t\Big].
\]
The integral inside the last expectation is infinite whenever the event \(\nabla_\theta s_\theta(0, X_0) \ne 0\) holds, thanks to the continuity of \(\nabla_\theta s\) and of the path \(X_{[0,T]}\). Since \(\mathbb{E}\|\nabla_\theta s_\theta(0, X_0)\|^2 > 0\) by assumption, that event has non-zero probability, which concludes the proof.

D. Experimental Details

In this section we give additional details and ablations relating to our experimental results. We begin by providing details of the sampling tasks we considered. We then provide details of our implementation and of the baseline methods. Finally, we provide additional ablation studies and results which demonstrate the properties of our method.

D.1. Benchmarking targets

Gaussian Here we consider the target \(\pi(x) = \mathcal{N}(x; 2.75, 0.25^2)\).

Mixture This target was used in Arbel et al. (2021). It is an equally weighted mixture of 6 bivariate Gaussian distributions with means \(\mu_1 = (3.0, 0.0)\), \(\mu_2 = (-2.5, 0.0)\), \(\mu_3 = (2.0, 3.0)\), \(\mu_4 = (0.0, 3.0)\), \(\mu_5 = (0.0, -2.5)\), \(\mu_6 = (3.0, 2.0)\) and covariances
\[
\Sigma_1 = \Sigma_2 = \begin{pmatrix} 0.7 & 0.0 \\ 0.0 & 0.05 \end{pmatrix}, \quad \Sigma_4 = \Sigma_5 = \begin{pmatrix} 0.05 & 0.0 \\ 0.0 & 0.7 \end{pmatrix}, \quad \Sigma_3 = \Sigma_6 = \begin{pmatrix} 1.0 & 0.95 \\ 0.95 & 1.0 \end{pmatrix}.
\]
This target is symmetric about the line y = x.

Funnel This target was proposed by Neal (2003). Its density follows \(x_0 \sim \mathcal{N}(0, \sigma_f^2)\), \(x_{1:9}|x_0 \sim \mathcal{N}(0, \exp(x_0)I)\), with \(\sigma_f = 3\).

Logistic Regression The Bayesian logistic regression model is defined by the prior distribution \(\theta \sim \mathcal{N}(0, \sigma^2 I)\) and likelihood \(y|\theta, x \sim \mathrm{Bernoulli}(\sigma(\theta^T x))\), where \(\sigma\) is the sigmoid function.
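For concreteness, the unnormalized log posterior of this logistic regression model can be sketched as follows; this is an illustrative implementation in numpy (the paper's code is in JAX), and the function name is ours:

```python
import numpy as np

def log_unnormalized_posterior(theta, X, y, sigma_prior=1.0):
    """Unnormalized log posterior log gamma(theta) for Bayesian logistic
    regression: N(0, sigma_prior^2 I) prior and Bernoulli(sigmoid(theta^T x))
    likelihood, with labels y in {0, 1}."""
    z = X @ theta                        # logits, shape (n,)
    # log sigmoid(z) = -log(1 + exp(-z)), computed stably via logaddexp
    log_p = -np.logaddexp(0.0, -z)       # log p(y = 1 | x, theta)
    log_q = -np.logaddexp(0.0, z)        # log p(y = 0 | x, theta)
    log_lik = np.sum(y * log_p + (1 - y) * log_q)
    log_prior = -0.5 * np.sum(theta ** 2) / sigma_prior ** 2
    return log_lik + log_prior
```

The `logaddexp` formulation keeps the log likelihood finite even for extreme logits, which matters when samplers visit the tails of the posterior.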
We consider sampling the posterior \(\theta|y, x\) on the Ionosphere and Sonar datasets, which are of 35 and 61 dimensions respectively.

Brownian Motion In this task, we make noisy observations of a simple Brownian motion over 30 time steps. The model was introduced by Sountsov et al. (2020) and is defined by the prior \(\alpha_{\mathrm{inn}} \sim \mathrm{LogNormal}(0, 2)\), \(\alpha_{\mathrm{obs}} \sim \mathrm{LogNormal}(0, 2)\), \(x_1 \sim \mathcal{N}(0, \alpha_{\mathrm{inn}}^2)\) and \(x_i \sim \mathcal{N}(x_{i-1}, \alpha_{\mathrm{inn}}^2)\) for \(i = 2, \dots, 30\). The observation likelihood is given by \(y_i \sim \mathcal{N}(x_i, \alpha_{\mathrm{obs}}^2)\) for \(i = 1, \dots, 30\). The goal is to sample the posterior distribution of \(\alpha_{\mathrm{inn}}, \alpha_{\mathrm{obs}}, x_1, \dots, x_{30}\) given the observations \(\{y_i\}_{i=1}^{10} \cup \{y_i\}_{i=21}^{30}\). This task is in 32 dimensions.

Log Gaussian Cox Process The LGCP model (Møller et al., 1998) was developed for the analysis of spatial data. The Poisson rate parameter \(\lambda(x)\) is modelled on the grid using an exponentially-transformed Gaussian process, and observations come from a Poisson point process. The unnormalized posterior density is given directly by \(\gamma(x) = \mathcal{N}(x; \mu, K)\prod_{i \in [1:M]^2}\exp(x_i y_i - a\exp(x_i))\), where the \(x_i\) are the points of a regular \(M \times M\) grid. In our experiments, we fit this model on the Pines forest dataset where \(M = 40\), resulting in a problem in 1600 dimensions.

GMM The challenging Gaussian mixture model used in Midgley et al. (2023) is an unequally weighted mixture of 40 Gaussian components. The mean of each component is uniformly distributed in the range \([-40, 40]^d\), the covariance is \(\sigma^2 I_d\) where \(\sigma = \log(1 + \exp(0.1))\), and the unnormalized weight of each component is uniformly distributed in \([0, 1]\). We consider dimensions \(d \in \{1, 2, 5, 10, 20\}\).

D.2. Algorithmic details and hyperparameter settings

Here we give details of the algorithmic settings used in our experiments. We first describe the considerations taken to ensure a fair comparison between baselines, and then we detail exact hyperparameter settings. Our method was implemented in Python using the libraries JAX (Bradbury et al., 2018), Haiku and Optax. Our implementation is available on GitHub at https://github.com/angusphillips/particle_denoising_diffusion_sampler.
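As an illustration of the targets listed in Section D.1, the unnormalized GMM density can be sketched as below. This is a numpy sketch with an arbitrary seed; it reproduces the construction described above, not the paper's exact component draws:

```python
import numpy as np

def make_gmm_log_density(d, n_components=40, seed=0):
    """Unnormalized log-density of the GMM target: means uniform in
    [-40, 40]^d, covariance sigma^2 I with sigma = log(1 + exp(0.1)),
    unnormalized component weights uniform in [0, 1]."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(-40.0, 40.0, size=(n_components, d))
    sigma = np.log(1.0 + np.exp(0.1))
    log_w = np.log(rng.uniform(0.0, 1.0, size=n_components))

    def log_gamma(x):
        # log sum_j w_j N(x; mu_j, sigma^2 I), via a stable logsumexp
        sq = np.sum((x - means) ** 2, axis=-1)
        log_comp = -0.5 * sq / sigma ** 2 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
        a = log_w + log_comp
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    return log_gamma
```

With \(\sigma \approx 0.74\) and means spread over \([-40, 40]^d\), the modes are far apart relative to their width, which is what makes this target challenging for samplers.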
We used the open-source code-bases of Arbel et al. (2021) to run the SMC and CRAFT baselines and of Vargas et al. (2023) to run the PIS and DDS benchmarks, both of which are also implemented in JAX. In all experiments we used 2000 particles to estimate the normalizing constant.

         Gaussian (16)     Mixture (16)      Funnel (32)        Brownian (16)    Ion (32)          Sonar (32)        LGCP (128)
PDDS     37570/84/0.10     37764/84/0.13     39316/114/0.15     43584/90/0.19    44166/97/0.12     49210/105/0.12    347776/2492/1.86
CRAFT    32/4/0.02         176608/26/0.09    6077440/178/0.18   1024/27/0.21     2240/38/0.09      3904/40/0.09      409600/3072/14.9
PIS      37570/129/0.00    37764/167/0.01    39316/394/0.02     43854/176/0.01   44166/324/0.02    49210/326/0.01    347776/3931/0.64
DDS      37570/136/0.00    37764/187/0.01    39316/381/0.02     43854/183/0.01   44166/338/0.01    49210/332/0.02    347776/3941/0.68
SMC      0/0/0.02          0/0/0.07          0/0/0.17           0/0/0.20         0/0/0.09          0/0/0.09          0/0/14.6

Table 1: Number of trainable parameters / total training time (seconds) / sampling time per 2000 particles (seconds). Timings are averaged over 3 training seeds.

Variational approximation We used a variational approximation as the reference distribution for all methods. We found that this was required for numerical stability of the potential function \(g_t(x_t)\) in our method. We therefore used the same variational approximation for all methods to ensure a fair comparison. Note that PIS reverses a pinned Brownian motion and thus the reference distribution depends on the diffusion time span T and the noise coefficient σ. Since σ affects the performance of the PIS algorithm itself, we tune this parameter independently rather than setting it via the variational approximation. The variational approximation was a mean-field variational distribution, i.e.
a diagonal Gaussian distribution learnt by optimizing the ELBO. We used 20,000 optimisation steps (50,000 for the Funnel and Brownian tasks) with the Adam optimizer (Kingma & Ba, 2015) and learning rate 1e-3. We did not use a variational approximation in the Gaussian and GMM tasks, where we used N(0, 1) and N(0, 20²I_d) respectively.

Network architectures and optimizer settings For the CRAFT baseline, we followed the flow network architectures and optimizer settings given in Matthews et al. (2022), which are restated below for completeness. For the diffusion-based methods (PDDS, DDS and PIS) we use the same network architecture and optimizer settings for each method. The neural network follows the PISGRAD network of Zhang & Chen (2022) with minor adaptations. We use a sinusoidal embedding of 128 dimensions for the time input. We use a 3-layer MLP with 64 hidden units per layer for the smoothing network (r_η(t) in our notation and NN2(t) in Zhang & Chen (2022)). For the main potential/score network (N_γ in our notation and NN1(t, x) in Zhang & Chen (2022)), we use a 2-layer MLP of 64 hidden units per layer to encode the 128-dimensional time embedding. This is concatenated with the state input x before passing through a 3-layer MLP with 64 hidden units per layer, outputting a vector of dimension d. In PDDS, we take the scalar product of this output with the state input x to approximate the potential function, while PIS and DDS use the d-dimensional output to approximate the optimal control term. The activation function is GELU throughout. We train for 10,000 iterations of the Adam optimizer (Kingma & Ba, 2015) with batch size 300 and a learning rate of 1e-3, which decays exponentially at a rate of 0.95 every 50 iterations (with the exception of the Funnel task, where we did not use any learning rate decay).
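The mean-field variational reference described above can be sketched as follows. This is a deliberately minimal 1-D numpy version: plain stochastic gradient ascent on the reparameterized ELBO stands in for Adam, and the function name is ours; it is demonstrated on the Gaussian benchmark target N(2.75, 0.25²):

```python
import numpy as np

def fit_meanfield_gaussian(log_gamma_grad, steps=2000, batch=64, lr=0.01, seed=0):
    """Fit q = N(mu, exp(rho)^2) to an unnormalized 1-D target by stochastic
    gradient ascent on the reparameterized ELBO E_q[log gamma(x)] + H(q)."""
    rng = np.random.default_rng(seed)
    mu, rho = 0.0, 0.0                   # initialize at N(0, 1)
    for _ in range(steps):
        eps = rng.standard_normal(batch)
        x = mu + np.exp(rho) * eps       # reparameterization trick
        g = log_gamma_grad(x)            # d log gamma / dx at the samples
        mu += lr * np.mean(g)
        # entropy of q contributes rho (+ const) to the ELBO, hence the +1
        rho += lr * (np.mean(g * np.exp(rho) * eps) + 1.0)
    return mu, np.exp(rho)

# Gaussian benchmark target: d/dx log gamma(x) = -(x - 2.75) / 0.25^2
mu, sd = fit_meanfield_gaussian(lambda x: -(x - 2.75) / 0.25 ** 2)
```

For this conjugate example the fitted mean and standard deviation approach 2.75 and 0.25; in the experiments the same idea is applied coordinate-wise in d dimensions.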
The number of trainable parameters for each method and task can be found in Table 1, along with training time and sampling time (performed on an NVIDIA GeForce GTX 1080 Ti GPU).

Annealing and noise schedules For the annealing-based approaches (SMC and CRAFT) we used a geometric annealing schedule with the variational approximation described above as initial distribution. For the diffusion-based approaches (PDDS, DDS and PIS) we carefully considered the appropriate noise schedule for each method. Firstly, we fix the diffusion time span at T = 1 and adapt the discretization step size to the number of steps K of the experiment, i.e. δ = T/K. This choice is equivalent to the fixed discretization step and varying diffusion time T considered by Vargas et al. (2023), up to the choice of α_max.

For PIS, the original work of Zhang & Chen (2022) used by default a uniform noise schedule controlled by σ. Further, Vargas et al. (2023) were unable to find a noise schedule which improved performance over the default uniform noise schedule. As such, we stick with the uniform noise schedule and tune the noise coefficient σ by optimizing the ELBO objective over a grid search.

For DDS, it was observed by Vargas et al. (2023) that controlling the transition scale α_k such that it goes smoothly to zero as k → 0 is critical to performance. To achieve this they choose a cosine-based schedule \(\alpha_k^{1/2} = \alpha_{\max}^{1/2}\cos^2\big(\tfrac{\pi}{2}\,\tfrac{k/K + s}{1 + s}\big)\) for s small (0.008, following Nichol & Dhariwal (2021)), which we term the DDS cosine schedule. The parameter α_max is tuned such that the noise at the final step of the reverse process is sufficiently small. We found that this scheduler did indeed result in the best performance when compared to a linear noise schedule (\(\beta_t = \beta_0 + (\beta_T - \beta_0)\,t/T\)) or the popular cosine schedule of Nichol & Dhariwal (2021) (\(\lambda_t = 1 - \cos^2\big(\tfrac{\pi}{2}\,\tfrac{t/T + s}{1 + s}\big)\)). As such we use the DDS cosine schedule and tune α_max by optimizing the ELBO objective over a grid search.
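The cosine schedule of Nichol & Dhariwal (2021) referenced above can be sketched as follows; the exact placement of the small offset s is our reading of the schedule and should be treated as an assumption:

```python
import numpy as np

def cosine_lambda(t, T=1.0, s=0.008):
    """Nichol & Dhariwal-style cosine schedule for the cumulative noise level
    lambda_t = 1 - cos^2((pi/2) * (t/T + s) / (1 + s)): lambda_0 is nearly 0
    and lambda_T = 1, so the forward diffusion ends at the reference."""
    return 1.0 - np.cos(0.5 * np.pi * (t / T + s) / (1.0 + s)) ** 2

ts = np.linspace(0.0, 1.0, 11)
lam = cosine_lambda(ts)   # monotonically increasing from ~0 to 1
```

The slow initial growth of λ_t is what spreads the ESS drops evenly along the sampling path, as illustrated in the ablations below.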
For our method PDDS, we obtained the best performance using the cosine schedule of Nichol & Dhariwal (2021), which sets \(\lambda_t = 1 - \cos^2\big(\tfrac{\pi}{2}\,\tfrac{t/T + s}{1 + s}\big)\), where we recall that \(\lambda_t = 1 - \exp\big(-2\int_0^t \beta_s\,\mathrm{d}s\big)\), i.e. the variance of the transition from time 0 to time t. We provide an illustration of the benefits of this schedule in the following section. In particular, we found that the alternative DDS cosine schedule did not improve performance and added the complexity of tuning the parameter α_max. In summary, while each of the diffusion-based approaches used a different noise scheduler, each was chosen to provide the best performance for the individual approach, which ensures a fair comparison.

D.2.1. SMC AND CRAFT SETTINGS

We used 1 iteration of an HMC kernel with 10 leapfrog integrator steps as the proposal distribution in the SMC and CRAFT baselines. We tuned the HMC step sizes based on initial runs and obtained the step size schedules given below. We performed simple resampling when the ESS dropped below 30%. We trained CRAFT for 500 iterations (1000 on Funnel) with a batch size of 2000 and the learning rate schedule detailed below. We also list the flow architecture used in each task. Our parameter settings differ from those in Matthews et al. (2022) since we use the variational approximation, which results in larger MCMC step sizes and smaller learning rates.

Gaussian Step sizes [0.7, 0.7, 0.5, 0.4] linearly interpolated between times [0.0, 0.25, 0.5, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-2.

Mixture Step sizes [0.5, 0.5, 0.5, 0.3] linearly interpolated between times [0.0, 0.25, 0.5, 1.0]. CRAFT used a spline inverse autoregressive flow with 10 spline bins and a 3-layer autoregressive MLP of 30 hidden units per layer, with a learning rate of 1e-3.

Funnel Step sizes [1.0, 0.9, 0.8, 0.7, 0.6] linearly interpolated between times [0.0, 0.25, 0.5, 0.75, 1.0]. CRAFT used an affine inverse autoregressive flow, trained for 4000 iterations with a learning rate of 1e-3.
Brownian Step sizes [0.8, 0.8, 0.7, 0.6, 0.5] linearly interpolated between times [0.0, 0.25, 0.5, 0.75, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

Ion Step sizes [0.7, 0.7, 0.6, 0.5, 0.4] linearly interpolated between times [0.0, 0.1, 0.25, 0.5, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

Sonar Step sizes [0.7, 0.7, 0.6, 0.5, 0.35] linearly interpolated between times [0.0, 0.1, 0.25, 0.5, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

LGCP Step sizes [0.35, 0.35, 0.3, 0.2] linearly interpolated between times [0.0, 0.25, 0.5, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-4.

GMM1 Step sizes [5, 4, 3, 2.8, 2.5] linearly interpolated between times [0.0, 0.3, 0.5, 0.85, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

GMM2 Step sizes [4, 3, 2.5, 2.1, 2] linearly interpolated between times [0.0, 0.3, 0.5, 0.85, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

GMM5 Step sizes [5, 3.3, 2.3, 1.8, 1.6] linearly interpolated between times [0.0, 0.25, 0.5, 0.8, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

           Base step   x1     x2     x4     x8     x16
Gaussian       1      0.86   0.86   1.00   0.96   0.82
Mixture        1      0.28   0.28   0.36   0.52   0.54
Brownian       1      0.76   0.76   0.84   0.80   0.72
Funnel         2      0.60   0.68   0.68   0.60   0.64
Ion            2      0.68   0.80   0.74   0.64   0.52
Sonar          2      0.68   0.82   0.78   0.64   0.50
LGCP           8      0.74   0.62   0.60   0.44   0.26
GMM1           2      0.22   0.28   0.24   0.20   0.22
GMM2           2      0.14   0.18   0.18   0.16   0.16
GMM5           2      0.20   0.28   0.26   0.24   0.20
GMM10          4      0.36   0.34   0.30   0.26   0.22
GMM20          8      0.40   0.32   0.32   0.26   0.18

Table 2: Optimal settings for α_max. The number of steps for a given entry is the base step in the first column multiplied by the step multiplier in the top row.
           Base step    x1      x2      x4      x8      x16
Gaussian       1        NA     1.00    1.00    1.00    1.00
Mixture        1        NA     2.40    2.20    1.92    1.88
Brownian       1        NA     0.08    0.10    0.10    0.13
Funnel         2       1.50    1.00    1.00    1.00    1.00
Ion            2       0.37    0.40    0.40    0.46    0.46
Sonar          2       0.25    0.31    0.40    0.46    0.49
LGCP           8       1.36    1.64    1.78    1.99    2.06
GMM1           2      15.00   14.00   15.00   15.00   16.00
GMM2           2       4.40    1.30    1.30    7.30    8.70
GMM5           2       1.30    1.30    1.40    1.40    1.40
GMM10          4       1.30    1.30    1.30    1.30    1.30
GMM20          8       1.30    1.30    1.30    1.20    1.20

Table 3: Optimal settings for σ. The number of steps for a given entry is the base step in the first column multiplied by the step multiplier in the top row. We were unable to tune PIS with one step.

GMM10 Step sizes [3, 2, 1.5, 1.5] linearly interpolated between times [0.0, 0.5, 0.85, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

GMM20 Step sizes [3, 1.8, 1.4, 1.3] linearly interpolated between times [0.0, 0.5, 0.85, 1.0]. CRAFT used a diagonal affine flow, with a learning rate of 1e-3.

We used SMC with the above settings and 1000 steps to estimate the ground-truth normalizing constant on the Bayesian posterior targets.

D.2.2. DDS SETTINGS

Optimal settings for α_max are given in Table 2. Note that we do not tune σ as in Vargas et al. (2023) since we used a variational approximation. We also tuned DDS on the GMM tasks but do not present the results as they were not competitive with PDDS and CRAFT.

D.2.3. PIS SETTINGS

Optimal settings for σ are given in Table 3. Note that we were unable to obtain reasonable performance with PIS with only 1 step. We also tuned PIS on the GMM tasks but do not present the results as they were not competitive with PDDS and CRAFT.

D.2.4. PDDS SETTINGS

No tuning of the cosine noise schedule was required. We performed systematic resampling (Douc & Cappé, 2005) when the ESS dropped below 30%.
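The ESS-triggered systematic resampling step can be sketched as follows; a minimal numpy version with our own function names, shown on toy weights rather than PDDS particles:

```python
import numpy as np

def ess(log_w):
    """Effective sample size 1 / sum(w_i^2) of the normalized weights."""
    w = np.exp(log_w - np.max(log_w))
    w /= np.sum(w)
    return 1.0 / np.sum(w ** 2)

def systematic_resample(log_w, rng):
    """Systematic resampling: N evenly spaced positions with one shared
    uniform offset, inverted through the CDF of the normalized weights."""
    n = len(log_w)
    w = np.exp(log_w - np.max(log_w))
    w /= np.sum(w)
    positions = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(w), positions)

rng = np.random.default_rng(0)
log_w = np.log(np.array([0.97, 0.01, 0.01, 0.01]))   # badly degenerate weights
if ess(log_w) < 0.3 * len(log_w):                    # the paper's 30% threshold
    idx = systematic_resample(log_w, rng)            # ancestor indices
```

Compared with multinomial resampling, systematic resampling uses a single uniform draw, which lowers the variance introduced by the resampling step.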
PDDS-MCMC used 10 Metropolis-adjusted Langevin MCMC steps, with step sizes tuned based on initial runs with the initial simple approximation, targeting an acceptance rate of approximately 0.6. The resulting step sizes can be found below. We used 20 iterations of PDDS, each trained for 500 steps with a fresh instance of the learning rate schedule (1e-3 with exponential decay at a rate of 0.95 every 50 iterations). At each refinement we initialise the potential approximation at its previous state, rather than training from scratch. We found that, for a limited computational budget, better performance was obtained with a fast iteration rate (20 iterations with 500 training steps each). We also tested a slower iteration rate (2 iterations with 10,000 training steps each), which performed equivalently but required a larger overall training budget. The fast iteration schedule uses fewer density evaluations but has a larger training time due to requiring more frequent compilation of the PDDS sampler.

Gaussian Step sizes [0.1, 0.2, 0.5, 0.6] linearly interpolated between times [0, 0.5, 0.75, 1.0].

Mixture Step sizes [0.05, 0.15, 0.4, 0.6] linearly interpolated between times [0, 0.5, 0.75, 1.0].

Funnel Step sizes [0.4, 0.3, 0.5, 0.6, 0.6] linearly interpolated between times [0, 0.2, 0.5, 0.75, 1.0]. We found that the Langevin MCMC can become unstable on the Funnel task due to extreme gradients of the density function; therefore, at each iteration of PDDS we reduced the step sizes by 50% for the first 10 iterations.

Brownian Step sizes [0.2, 0.4, 0.5, 0.5] linearly interpolated between times [0, 0.5, 0.75, 1.0].

Ion Step sizes [0.15, 0.25, 0.5, 0.6] linearly interpolated between times [0, 0.5, 0.75, 1.0].

Sonar Step sizes [0.1, 0.1, 0.18, 0.32, 0.4] linearly interpolated between times [0, 0.25, 0.5, 0.75, 1.0].

LGCP Step sizes [0.1, 0.1, 0.15, 0.2] linearly interpolated between times [0, 0.5, 0.75, 1.0].
GMM1 Step sizes [10, 11, 20, 35, 100] linearly interpolated between times [0.0, 0.25, 0.5, 0.75, 1.0].

GMM2 Step sizes [2, 2.5, 4, 6, 12, 50] linearly interpolated between times [0.0, 0.25, 0.5, 0.6, 0.75, 1.0].

GMM5 Step sizes [1.5, 1.5, 2.5, 4, 8, 30] linearly interpolated between times [0.0, 0.25, 0.5, 0.6, 0.75, 1.0].

GMM10 Step sizes [1, 1.2, 1.8, 3, 6, 20] linearly interpolated between times [0.0, 0.25, 0.5, 0.6, 0.75, 1.0].

GMM20 Step sizes [0.8, 1.0, 2, 3, 6, 20] linearly interpolated between times [0.0, 0.25, 0.5, 0.6, 0.75, 1.0].

D.3. Ablation studies

Importance of SMC and iterative potential approximations Here we study the behaviour of PDDS on a Gaussian mixture task where the initial (here referred to as "naive") approximation, Equation (8), only captures one out of three modes when simulating the reverse SDE, Equation (7). Figure 8 shows that our method is able to recover the unknown modes after only two iterations of potential training. Following the discussion in Section 4.5, there are two mechanisms at play which allow our method to recover modes that are missed by the naive approximation. Firstly, our asymptotically correct SMC scheme (Algorithm 1) means that during training we do not simply re-learn the potential function of the previous iteration; instead we learn an improved potential function, since the training samples are improved by the SMC scheme. We evidence the improvement in potential functions by plotting, in Figure 9, the path of distributions induced by Equation (7) with the learnt potential at each PDDS iteration. We see that at each iteration of PDDS the sequence of distributions moves closer to the true denoising sequence for the given target. Furthermore, we show that our SMC scheme is critical by examining the behaviour of PDDS when the SMC scheme is ignored (i.e. all resampling steps are removed).
Figure 10 and Figure 11 show the resulting samples and sequence of distributions when we simply simulate the reverse SDE, Equation (7), using the previous potential approximation without applying the SMC correction scheme. We observe that very few particles still reach the missing modes, but this is not compounded at each iteration, and after the first iteration no further improvements are made.

Figure 8: Samples from PDDS on a 3-mode mixture of Gaussians.

Figure 9: Left to right: marginal distributions of the reverse SDE, Equation (7), using the potential approximation at each iteration of PDDS.

The second, more subtle, mechanism at play is that the learning of the potential approximation is not based on the samples alone, but also uses information from the target distribution via the way we parametrize our neural network (Section 4.3). This is true for both the original and NSM losses (although the advantage of the NSM loss is to provide a lower-variance regression target than the original DSM loss). We note, however, that this second mechanism alone is not sufficient without the help of SMC, as again illustrated in Figure 10 and Figure 11.

Cosine scheduler As stated above, we follow the cosine scheduler introduced by Nichol & Dhariwal (2021). We found, as demonstrated in Figure 12 and Figure 13, that the cosine schedule was effective in ensuring the forward SDE converges to the target distribution while regularly spacing the ESS drops across the sampling path.

Number of particles In Figure 14 we show that the normalizing constant estimation error with the naive potential approximation does decrease as the number of particles increases. However, we could not eliminate the error before exceeding the computer memory. We conclude that we must improve our initial potential approximation to obtain feasible results, hence motivating the iterative potential approximation scheme.
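To make the particle-count discussion concrete, the sketch below shows a one-step importance-sampling analogue of the normalizing constant estimate, \(\hat Z = \frac{1}{N}\sum_i \gamma(x_i)/q(x_i)\); PDDS accumulates such averages across a whole sequence of distributions, so this collapses the scheme to a single step purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def z_hat(n):
    """Importance-sampling estimate of Z = int gamma(x) dx with
    gamma(x) = exp(-x^2 / 2) (so Z = sqrt(2 pi)) and proposal N(0, 2^2)."""
    x = rng.normal(0.0, 2.0, size=n)
    log_q = -0.5 * x ** 2 / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)
    log_gamma = -0.5 * x ** 2
    return np.mean(np.exp(log_gamma - log_q))

print(z_hat(100_000))   # close to sqrt(2 * pi)
```

The estimator is unbiased, and its error shrinks like \(1/\sqrt{N}\) only when the proposal covers the target well; a naive potential approximation that misses modes corresponds to a poor proposal, which is why increasing N alone could not remove the bias-like error observed in Figure 14.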
Guidance path of NN potential approximation In Figure 15 we demonstrate that using a neural network to correct the naive approximation has the effect of correcting the path of the guidance SDE. Here we use a linear schedule for β_t since the cosine schedule does not allow for analytic roll-out of the SDE.

Figure 10: Samples from PDDS without SMC on a 3-mode mixture of Gaussians.

Figure 11: Left to right: marginal distributions of the reverse SDE, Equation (7), using the potential approximation at each iteration of PDDS without SMC.

Figure 12: Left: linear schedule, β_f = 8. Middle: linear schedule, β_f = 12. Right: cosine schedule.

Figure 13: Left: linear schedule, β_f = 8. Middle: linear schedule, β_f = 12. Right: cosine schedule.

Figure 14: Samples and estimated normalizing constant from PDDS on the N(2.75, 0.25²) target.

Figure 15: Guidance path using the analytic potential function (left), naive potential approximation (middle) and neural network potential approximation (right).

D.4. Additional results

In Figure 16 we display the normalising constant estimates on all tasks, including the Brownian and Ion tasks which were omitted from the main text for space.

Figure 16: Normalizing constant estimation results on all tasks. Outliers are hidden for clarity. Each box consists of 2000 estimates, coming from 20 training seeds each with 100 evaluation seeds.

In Figure 17 we display the normalising constant estimation results, and in Figure 18 the entropy-regularized Wasserstein-2 distances, for the GMM task in 1, 2, 5, 10 and 20 dimensions.

D.5. Uncurated normalizing constant estimates

In Figure 19 we display the normalising constant estimation results of Figure 16 with outliers present.
Note that all methods are susceptible to erroneous over-estimation of the normalising constant due to numerical errors and instability. It appears that all methods are equally susceptible to this issue, with no single method displaying more erroneous over-estimation than the others. Results on the Gaussian task are displayed separately in Figure 20.

Figure 17: Normalizing constant estimation results on the GMM task in 1, 2, 5, 10 and 20 dimensions. Outliers are hidden for clarity. Each box consists of 1000 estimates, coming from 10 training seeds and 100 evaluation seeds.

Figure 18: Entropy-regularized Wasserstein-2 distances between samples from the model and target distributions for CRAFT and PDDS-MCMC. Lower is better. Each box consists of 200 values, coming from 10 training seeds and 20 evaluation seeds.

Figure 19: Normalizing constant estimation results on all tasks. Each box consists of 2000 estimates, coming from 20 training seeds each with 100 evaluation seeds.

Figure 20: Normalizing constant estimation results on the Gaussian task. Each box consists of 2000 estimates, coming from 20 training seeds each with 100 evaluation seeds. Left: outliers removed; right: uncurated.