
Published as a conference paper at ICLR 2025

SEQUENTIAL CONTROLLED LANGEVIN DIFFUSIONS

Junhua Chen*,†,1, Lorenz Richter*,2,3, Julius Berner*,†,4, Denis Blessing*,5, Gerhard Neumann5, Anima Anandkumar6

1University of Cambridge, 2Zuse Institute Berlin, 3dida Datenschmiede GmbH, 4NVIDIA, 5Karlsruhe Institute of Technology, 6California Institute of Technology

An effective approach for sampling from unnormalized densities is based on the idea of gradually transporting samples from an easy prior to the complicated target distribution. Two popular methods are (1) Sequential Monte Carlo (SMC), where the transport is performed through successive annealed densities via prescribed Markov chains and resampling steps, and (2) recently developed diffusion-based sampling methods, where a learned dynamical transport is used. Despite the common goal, both approaches have different, often complementary, advantages and drawbacks. The resampling steps in SMC allow focusing on promising regions of the space, often leading to robust performance. While the algorithm enjoys asymptotic guarantees, the lack of flexible, learnable transitions can lead to slow convergence. On the other hand, diffusion-based samplers are learned and can potentially better adapt themselves to the target at hand, yet often suffer from training instabilities. In this work, we present a principled framework for combining SMC with diffusion-based samplers by viewing both methods in continuous time and considering measures on path space. This culminates in the new Sequential Controlled Langevin Diffusion (SCLD) sampling method, which is able to utilize the benefits of both methods and reaches improved performance on multiple benchmark problems, in many cases using only 10% of the training budget of previous diffusion-based samplers.

1 INTRODUCTION

We consider the task of sampling from densities of the form

$$p_{\mathrm{target}} = \frac{\rho_{\mathrm{target}}}{Z} \quad \text{with} \quad Z := \int_{\mathbb{R}^d} \rho_{\mathrm{target}}(x)\,\mathrm{d}x, \tag{1}$$

where $\rho_{\mathrm{target}} \in C(\mathbb{R}^d, \mathbb{R}_{\ge 0})$ can be evaluated pointwise, but the normalizing constant $Z$ is typically intractable. This task is of great practical interest, with numerous applications in the natural sciences (Zhang et al., 2023b), for instance, for Boltzmann distributions in molecular dynamics or lattice field theory in quantum physics, as well as posterior sampling in Bayesian statistics (Gelman et al., 2013).

Sampling problems vs. generative modeling. The sampling problem poses unique challenges not found in other areas of probabilistic modeling. For instance, while both generative modeling and sampling involve approximating a target distribution $p_{\mathrm{target}}$, they differ fundamentally in terms of the information available. In generative modeling, one has access to samples $X \sim p_{\mathrm{target}}$, whereas in sampling, we only have access to a pointwise oracle $\rho_{\mathrm{target}}$ (and, potentially, its pointwise gradients) and no samples. This distinction introduces obstacles for the sampling problem that do not exist in generative modeling. For example, a key challenge in modeling a distribution is identifying its regions of high probability, or modes. When samples are available, they can directly reveal the locations of these modes. In their absence, however, the sampling algorithm must include an exploration strategy to discover them and identify their shape. This exploration becomes exponentially more difficult as the dimensionality of the state space increases, making the sampling problem challenging even in moderate dimensions (e.g., 10 to 50).

Dynamical measure transport. A general idea to approach the sampling problem is to draw particles from an easy prior distribution and gradually move them toward the complicated target (sometimes termed dynamical measure transport). In this work, we focus on two popular paradigms:

*Equal contribution. †Work partially done at the California Institute of Technology.


[Figure 1 panel labels, left to right: Prior, Diffusion Steps, Resampling & MCMC, Diffusion Steps, Target]

Figure 1: Illustration of our SCLD algorithm, which combines controlled Langevin diffusions with Sequential Monte Carlo methods. The goal is to sample from a target distribution by learning a stochastic evolution (diffusion steps) that starts from a tractable prior and evolves along a prescribed annealed density to the target. We do not have access to samples from the target distribution but can only evaluate its density up to a normalizing factor. At intermediate timesteps, we resample according to the importance weights of each subtrajectory (black dots) and use MCMC steps for additional refinement (yellow dots).

In Annealed Importance Sampling (AIS) (Neal, 2001) and its extension Sequential Monte Carlo (SMC) (Chopin, 2002; Del Moral et al., 2006), particles are successively updated and reweighted so as to approach relevant regions in space, targeting an annealed sequence of intermediate distributions. This procedure is typically formulated in discrete time and does not require learning.

In diffusion-based sampling (Richter & Berner, 2024; Vargas et al., 2024), the idea is to learn a drift of a stochastic differential equation (SDE) to transport the samples from the prior to the desired target, typically formulated in continuous time. The absence of samples means that data-driven approaches such as those used in generative modeling (Song et al., 2021) are not possible, and training is instead done via variational inference, gaining information through evaluations of $\rho_{\mathrm{target}}$.

Each paradigm brings its own advantages and drawbacks. Traditional SMC methods rely on predefined rules for particle updates, such as Markov Chain Monte Carlo (MCMC) and resampling methods, which help to direct computational effort onto promising regions of the space and enjoy asymptotic guarantees. While they do not require learning, the employed MCMC methods can, in many cases, exhibit slow convergence to the target (Del Moral et al., 2006). Diffusion-based samplers, on the other hand, require a training phase, which enables them to automatically adapt to the given target. However, training can take significant time and often suffers from numerical instabilities as well as mode collapse (Richter & Berner, 2024).

Sequential Controlled Langevin Diffusions. In this work, we show that the two methods can complement each other. SMC can benefit from the flexible nature of the learnable transitions, and resampling and MCMC can help diffusion-based samplers converge faster and counteract numerical stability issues arising, for instance, from outlier particles. Motivated by this, we identify a principled and general framework to unify the two methods, culminating in our Sequential Controlled Langevin Diffusion (SCLD) algorithm, which alternates between SMC and diffusion steps as illustrated in Figure 1. In addition, we devise a family of loss functions that enables end-to-end training (i.e., for which the algorithm used during inference can be directly optimized). This becomes possible by viewing both methods in continuous time and considering measures of the underlying SDEs on the path space. Overall, our contributions can be summarized as follows:

Taking the continuous-time perspective, we can rigorously connect and unify SMC and diffusion-based sampling by performing importance sampling in path space.

The principled framework of path space measures allows us to readily propose suitable loss functions, which allow for off-policy training with replay buffers and provably scale better to high dimensions than previously used losses.

Building on those connections, we propose our new sampling method Sequential Controlled Langevin Diffusion (SCLD) as a special case of our framework.

We show that our method achieves competitive performance on 11 real-world and synthetic examples, improving over other baseline methods in almost every task, and in many cases only using 10% of the training budget. In two tasks based on robotics control, our method is the only one to approximately recover the true distribution.

1.1 RELATED WORK

We present an extensive comparison to related works in Appendix A.1. To summarize, our proposed SCLD sampler relies on three crucial building blocks:


Table 1: Comparison of different methods (see Appendix A.1 for details). By discretization-flexible, we describe the fact that we can include resampling and MCMC steps at arbitrary times. Finite-time convergence refers to the property that the target distribution can (theoretically, in the optimum) be reached in finite time. We note that stochastic transitions allow omitting or reducing (costly) MCMC steps in learned SMC methods.

Columns: Traditional SMC, CMCD, CRAFT, PDDS, SCLD (ours)

Learned Transition: (MCMC), (Neural SDE), (Neural ODE), (Neural SDE), (Neural SDE)
Stochastic Transition
End-to-end Training: -, (needs importance weights), (alternating), (incl. hyperparameters)
Particle Method
Discretization-flexible: (in theory)
Finite-time Convergence

Sequential Monte Carlo (SMC). SMC methods (Chopin, 2002; Del Moral et al., 2006) describe a general methodology to sample sequentially from a sequence of annealed distributions, using transition kernels (typically based on MCMC) and resampling steps. To mitigate drawbacks such as long mixing times and tedious tuning, previous works proposed to learn the kernels (Phillips et al., 2024; Matthews et al., 2022; Arbel et al., 2021). However, prior training objectives suffer from various shortcomings, either requiring importance sampling with potentially high variance, exhibiting bias, or relying on alternating methods that preclude end-to-end training (see also Table 1). Further, they need to place restrictions on their parameterizations or suffer from unfavorable computational costs. In particular, approaches with deterministic transitions, such as normalizing flows, require computations of divergences or Jacobian determinants, and MCMC steps to recover sample diversity after resampling. This is not needed for methods based on stochastic transitions like our method SCLD.

Diffusion-based samplers on subtrajectories. To overcome these shortcomings and flexibly parameterize the transition kernels, we draw ideas from recent work on controlled SDEs for sampling problems (Zhang et al., 2023a; Richter & Berner, 2024). This can be done by partitioning the SDE trajectories in time. However, to compute importance weights (in path space), which are necessary for resampling as well as for MCMC kernels in SMC, the SDE marginals after each subtrajectory need to be known. To cope with that, we identify the recently proposed Controlled Monte Carlo Diffusions (CMCD) (Vargas et al., 2024) as a suitable framework since it allows us to define a prescribed (and therefore known) target evolution of the SDE marginals. Building upon this, we develop an extension of SMC to continuous time, where resampling (and, optionally, MCMC steps) can be employed at arbitrary times.

Log-variance loss. However, the subtrajectories and discrete resampling steps make optimization challenging. Previous methods either relied on alternating schemes or approaches based on the reverse KL divergence and importance sampling, known to suffer from mode collapse and potentially high variance. We show that the log-variance loss (Nüsken & Richter, 2021) offers a way to obtain a principled, efficient, and low-variance objective such that we can optimize our sampler and parts of the hyperparameters in an end-to-end fashion using replay buffers.

2 SEQUENTIAL CONTROLLED LANGEVIN DIFFUSIONS

We start by giving an introduction to Sequential Monte Carlo methods. However, different from previous work, our focus is on a continuous-time perspective that can be readily integrated with diffusion-based samplers.

2.1 A PRIMER ON SEQUENTIAL MONTE CARLO IN CONTINUOUS TIME

Importance sampling (IS). The idea of utilizing samples from a prior distribution in order to compute statistics relying on samples from a target can be motivated by importance sampling. In its simplest case, one can compute unbiased estimates w.r.t. the target distribution via

$$\mathbb{E}_{X_T \sim p_{\mathrm{target}}}\big[\varphi(X_T)\big] = \mathbb{E}_{X_0 \sim p_{\mathrm{prior}}}\big[\varphi(X_0)\,w(X_0)\big] \approx \frac{1}{K}\sum_{k=1}^{K} \varphi\big(X_0^{(k)}\big)\,w\big(X_0^{(k)}\big), \tag{2}$$

where $\varphi \in C(\mathbb{R}^d, \mathbb{R})$ is a function of interest, the weight is defined¹ as $w := \frac{p_{\mathrm{target}}}{p_{\mathrm{prior}}}$, and $(X_0^{(k)})_{k=1}^{K}$ are i.i.d. samples from $p_{\mathrm{prior}}$. Since importance sampling becomes highly inefficient if the high-probability regions of prior and target do not overlap substantially, a key idea is to gradually transport $X_0$ to $X_T$.

¹ If the normalizing constant $Z$ is not available, we can compute unnormalized weights $\widetilde{w} := \frac{\rho_{\mathrm{target}}}{p_{\mathrm{prior}}}$ and normalize them by their sum, leveraging the identity $Z = \mathbb{E}_{X_0 \sim p_{\mathrm{prior}}}[\widetilde{w}(X_0)]$ (self-normalized importance sampling). While this introduces bias, the estimator is still consistent as $K \to \infty$ (Del Moral, 2013).
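The self-normalized estimator from footnote 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the one-dimensional Gaussian prior and target, and the names `log_rho_target`, `log_p_prior`, and `snis`, are assumptions chosen so that the true values ($\mathbb{E}[X] = 2$ and $Z = 0.5\sqrt{2\pi}$) are known.

```python
import math
import random


def log_rho_target(x):
    # Hypothetical unnormalized target: N(2, 0.5^2) without its normalizing constant,
    # so the true Z equals 0.5 * sqrt(2 * pi).
    return -0.5 * ((x - 2.0) / 0.5) ** 2


def log_p_prior(x):
    # Hypothetical normalized prior: N(0, 2^2).
    return -0.5 * (x / 2.0) ** 2 - math.log(2.0 * math.sqrt(2.0 * math.pi))


def snis(phi, num_samples, rng):
    """Self-normalized importance sampling estimate of E_target[phi] and of Z."""
    xs = [rng.gauss(0.0, 2.0) for _ in range(num_samples)]
    log_w = [log_rho_target(x) - log_p_prior(x) for x in xs]
    shift = max(log_w)  # subtract the maximum before exponentiating for stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * phi(x) for wi, x in zip(w, xs)) / total
    z_estimate = math.exp(shift) * total / num_samples  # mean of unnormalized weights
    return estimate, z_estimate
```

Working with log-weights and subtracting the maximum is a common numerical safeguard; it leaves the normalized weights unchanged.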

Annealed importance sampling (AIS). In particular, we may sequentially move particles from the prior to the target along a curve $(\pi(\cdot, t))_{t \in [0,T]}$, chosen such that $\pi(\cdot, 0) = p_{\mathrm{prior}}$ and $\pi(\cdot, T) = p_{\mathrm{target}}$, e.g., by linear interpolation in log-space (Dai et al., 2022). To this end, we consider two (time-dependent, forward and backward) Markov kernels $\overrightarrow{p}_{s|t}$ and $\overleftarrow{p}_{t|s}$. Given a time grid $0 = t_0 < t_1 < \dots < t_N = T$ (also referred to as annealing steps), we may now sample $X_{t_0} \sim p_{\mathrm{prior}}$ and iterate for each $n = 1, \dots, N$:

1. Sample $X_{t_n} \sim \overrightarrow{p}_{t_n|t_{n-1}}(\cdot \mid X_{t_{n-1}})$.

2. Compute the weights

$$w_{t_{n-1},t_n}(X_{t_{n-1}}, X_{t_n}) = \frac{\pi(X_{t_n}, t_n)\,\overleftarrow{p}_{t_{n-1}|t_n}(X_{t_{n-1}} \mid X_{t_n})}{\pi(X_{t_{n-1}}, t_{n-1})\,\overrightarrow{p}_{t_n|t_{n-1}}(X_{t_n} \mid X_{t_{n-1}})}.$$

We can then perform importance sampling on an augmented target distribution via the weights

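$$w(X_{t_0}, \dots, X_{t_N}) := \prod_{n=1}^{N} w_{t_{n-1},t_n}(X_{t_{n-1}}, X_{t_n}) = \frac{\overleftarrow{p}_{t_0,\dots,t_N}(X_{t_0}, \dots, X_{t_N})}{\overrightarrow{p}_{t_0,\dots,t_N}(X_{t_0}, \dots, X_{t_N})}, \tag{3}$$

where $\overrightarrow{p}_{t_0,\dots,t_N}$ and $\overleftarrow{p}_{t_0,\dots,t_N}$ are the joint densities of the forward and a corresponding backward operation. In particular, in analogy to (2), it holds that

$$\mathbb{E}_{X_{t_0},\dots,X_{t_N}}\big[\varphi(X_T)\,w(X_{t_0}, \dots, X_{t_N})\big] = \mathbb{E}_{X_T \sim p_{\mathrm{target}}}\big[\varphi(X_T)\big]. \tag{4}$$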
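For completeness, the identity (4) can be checked in one short calculation. The sketch below assumes that $\pi(\cdot, T) = p_{\mathrm{target}}$ is normalized and that the backward joint density factorizes as the target density times the backward kernels, each of which integrates to one in its first argument; the arrows mark forward and backward objects.

```latex
\begin{aligned}
\mathbb{E}\big[\varphi(X_T)\, w(X_{t_0},\dots,X_{t_N})\big]
&= \int \varphi(x_N)\,
   \frac{\overleftarrow{p}_{t_0,\dots,t_N}(x_0,\dots,x_N)}
        {\overrightarrow{p}_{t_0,\dots,t_N}(x_0,\dots,x_N)}\,
   \overrightarrow{p}_{t_0,\dots,t_N}(x_0,\dots,x_N)\,
   \mathrm{d}x_0 \cdots \mathrm{d}x_N \\
&= \int \varphi(x_N)\, \pi(x_N, T)
   \prod_{n=1}^{N} \overleftarrow{p}_{t_{n-1}|t_n}(x_{n-1} \mid x_n)\,
   \mathrm{d}x_0 \cdots \mathrm{d}x_N \\
&= \int \varphi(x_N)\, p_{\mathrm{target}}(x_N)\, \mathrm{d}x_N .
\end{aligned}
```

The last step integrates out $x_0, \dots, x_{N-1}$ one after another, which is why any normalized backward kernel yields an unbiased estimator.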

Resampling. In principle, any forward and backward Markov kernels lead to an unbiased estimator of the expectation of interest, as stated in (4). In practice, however, a notorious problem with importance sampling is its potentially high variance. Specifically, the variance might increase exponentially with the dimension, sometimes termed the curse of dimensionality; see, e.g., Chatterjee & Diaconis (2018); Hartmann & Richter (2024). To circumvent this issue, one idea is to sequentially update samples (also referred to as particles) during the course of the simulation according to their weights, so as to refocus computational effort on promising particles, a procedure referred to as resampling. For instance, we can select only certain (relevant) samples $X_0^{(k)}$ for the estimation of the expectation in (2). To this end, let $O^{(k)}$ be a random variable with values in $\{0, \dots, K\}$ and $\mathbb{E}[O^{(k)} \mid X_0^{(1)}, \dots, X_0^{(K)}] = K\,W(X_0^{(k)})$, where $W(X_0^{(k)}) := w(X_0^{(k)}) / \sum_{i=1}^{K} w(X_0^{(i)})$, defining how many times we select the $k$-th sample. Due to the tower property, we can then also obtain a consistent estimator of the expectation in (2) via

$$\mathbb{E}_{X_T \sim p_{\mathrm{target}}}\big[\varphi(X_T)\big] \approx \frac{1}{K}\sum_{k=1}^{K} \varphi\big(X_0^{(k)}\big)\,O^{(k)}. \tag{5}$$

A common choice is to consider $O \sim M_K\big(W(X_0^{(1)}), \dots, W(X_0^{(K)})\big)$ drawn from a multinomial distribution with $K$ trials, where the normalized weights determine the event probabilities (Gordon et al., 1993). We note that with this resampling step, we introduce additional stochasticity. However, at the same time, it can bring statistical advantages by focusing on relevant samples, e.g., stabilizing effects and variance reduction (Dai et al., 2022).

Remark 2.1 (SMC formulation in continuous vs. discrete time). We stress that, even though we evaluate our process $X$ on $N + 1$ discrete time instances, the formalism above includes time-continuous processes $(X_t)_{t \in [0,T]}$. While some transition kernels used in SMC, e.g., uncorrected Langevin kernels, can be interpreted in continuous time, SMC is typically stated for a fixed number of discrete steps. We will see in the sequel how the continuous-time formulation offers an elegant framework with certain advantages, in particular, allowing us to integrate learned SDE-based transition kernels and interleave them with resampling and MCMC steps at arbitrary times.
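The multinomial resampling step and the estimator (5) can be sketched as follows. This is a minimal illustration with hypothetical function names; `counts[k]` plays the role of the offspring count $O^{(k)}$.

```python
import random


def multinomial_resample(particles, weights, rng):
    """Draw K offspring with probabilities W^(k) = w^(k) / sum_i w^(i)."""
    num = len(particles)
    total = sum(weights)
    probs = [w / total for w in weights]
    indices = rng.choices(range(num), weights=probs, k=num)
    counts = [0] * num  # counts[k] is the number of times particle k was selected
    for i in indices:
        counts[i] += 1
    resampled = [particles[i] for i in indices]
    return resampled, counts


def resampled_estimate(phi, particles, counts):
    """Estimator (5): (1 / K) * sum_k phi(X^(k)) * O^(k)."""
    return sum(phi(x) * o for x, o in zip(particles, counts)) / len(particles)
```

By construction, the estimator (5) coincides with the plain average of `phi` over the resampled particles, which now carry equal weights.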

2.2 CONTROLLED SDES AND IMPORTANCE SAMPLING IN PATH SPACE

A central question in SMC is how to choose the forward and backward transition densities $\overrightarrow{p}_{s|t}$ and $\overleftarrow{p}_{t|s}$ defined above. Clearly, when the forward and backward joint densities stated in (3) agree, we achieve perfect sampling in the sense that no corrections with importance weights are necessary. However, it is typically not possible to obtain such transitions, and thus the choice of $\overrightarrow{p}_{s|t}$ and $\overleftarrow{p}_{t|s}$ to approximate this criterion is of critical importance to the success of SMC. Whereas, traditionally, MCMC steps have been employed as the transition kernel (Dai et al., 2022), they are known to require a large number of steps to achieve approximate transportation between densities. In recent years, there has been interest in employing learned transition densities to overcome the slow convergence times of fixed MCMC kernels (Matthews et al., 2022; Phillips et al., 2024). Advancing those attempts, we will show how transition densities corresponding to SDEs yield a principled solution that, moreover, allows us to leverage recent advancements of diffusion models.

Algorithm 1 Sequential Controlled Langevin Diffusion (SCLD). See Algorithm 2 for details.

Require: Annealing path $\pi$, learned control $u$, time grid $0 = t_0 < \dots < t_N = T$

Initialize: $X_0 := X_0^{(1:K)} \sim p_{\mathrm{prior}}$ and $w_0 := w_0^{(1:K)} = 1$
for $n = 1$ to $n = N$ do
    Transport: $X_{[t_{n-1},t_n]} = \text{simulate SDE}(X_{t_{n-1}}, u)$ (see (6) and (19))
    Compute RNDs: $w_{[t_{n-1},t_n]} = \frac{\mathrm{d}\overleftarrow{\mathbb{P}}_{[t_{n-1},t_n]}}{\mathrm{d}\overrightarrow{\mathbb{P}}_{[t_{n-1},t_n]}}\big(X_{[t_{n-1},t_n]}\big)$ (see (12) and (31))
    Update weights: $w_n = w_{n-1}\,w_{[t_{n-1},t_n]}$ (see (13))
    Resample: $X_{t_n}, w_n = \text{resample}(X_{t_n}, w_n)$ (see Algorithm 5)
return Samples $X_T := X_T^{(1:K)}$ approximately from $p_{\mathrm{target}}$

Diffusion bridges. To this end, let us consider the stochastic process $X^u = (X^u_t)_{t \in [0,T]}$, defined by the SDE

$$\mathrm{d}X^u_t = u(X^u_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}W_t, \qquad X^u_0 \sim p_{\mathrm{prior}}, \tag{6}$$

where $u \in C(\mathbb{R}^d \times [0,T], \mathbb{R}^d)$ is a control function, $\sigma \in C([0,T], \mathbb{R})$ the diffusion coefficient, and $W$ a standard Brownian motion. This process uniquely defines a forward transition density $\overrightarrow{p}_{s|t}$ and falls into the framework stated in Section 2.1 for any time steps $0 = t_0 < t_1 < \dots < t_N = T$. In fact, we can leverage the ideas from CMCD (Vargas et al., 2024) and learn $u$ such that the transport happens along a prescribed density in time, i.e., such that the density $p_{X^u}(\cdot, t)$ of $X^u_t$ is equal to a prescribed target density $\pi(\cdot, t)$, connecting the prior and the target, for every $t \in [0,T]$; cf. Lemma 2.2 below. We will see that the knowledge of the marginals allows for a natural integration within SMC frameworks. Now, similar to the importance sampling framework from Section 2.1, the general idea is to exploit a time-reversed dynamics that starts in the desired target density. To be precise, we may further define a related reverse-time SDE

$$\mathrm{d}Y^v_t = v(Y^v_t, t)\,\mathrm{d}t + \sigma(t)\,\overleftarrow{\mathrm{d}}W_t, \qquad Y^v_T \sim p_{\mathrm{target}}, \tag{7}$$

which depends on the control $v \in C(\mathbb{R}^d \times [0,T], \mathbb{R}^d)$ and where $\overleftarrow{\mathrm{d}}W_t$ denotes backward² integration of Brownian motion. Now, if $u$ and $v$ are learned such that $X^u$ and $Y^v$ are time-reversals of each other, then $p_{X^u} = p_{Y^v}$, i.e., the two processes transport the prior to the target and vice versa. However, in this general setting, there are infinitely many such bridging processes, all fulfilling Nelson's identity (Nelson, 1967), i.e.,

$$u - v = \sigma^2 \nabla \log p_{X^u} = \sigma^2 \nabla \log p_{Y^v}. \tag{8}$$

Since our goal is to satisfy $p_{X^u} = p_{Y^v} = \pi$, we can incorporate this constraint via the ansatz $v = u - \sigma^2 \nabla \log \pi$, leading to the SDE

$$\mathrm{d}Y^u_t = \big(u - \sigma^2 \nabla \log \pi\big)(Y^u_t, t)\,\mathrm{d}t + \sigma(t)\,\overleftarrow{\mathrm{d}}W_t, \qquad Y^u_T \sim p_{\mathrm{target}}, \tag{9}$$

as suggested in Vargas et al. (2024), noting that the process now also depends on the control u. Consequently, under mild conditions, this constraint leads to a unique gradient field representing the solution u to the time-reversal problem (Vargas et al., 2024, Proposition 3.2). We comment on more general, learnable density evolutions in Remark A.1.
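As a sanity check of the ansatz, consider the simplest special case in which the prescribed density does not depend on time, $\pi(\cdot, t) = \pi$, and $\sigma$ is constant. This stationary setting is an illustrative assumption (prior and target then coincide), not a case treated in the text. Choosing the overdamped Langevin drift for $u$ gives:

```latex
\begin{aligned}
u(x,t) &= \tfrac{\sigma^2}{2}\,\nabla\log\pi(x), \\
v(x,t) &= u(x,t) - \sigma^2\,\nabla\log\pi(x) = -\tfrac{\sigma^2}{2}\,\nabla\log\pi(x), \\
u - v  &= \sigma^2\,\nabla\log\pi , \\
\partial_t p\,\big|_{p=\pi}
       &= -\nabla\cdot\big(u\,\pi\big) + \tfrac{\sigma^2}{2}\,\Delta\pi
        = -\tfrac{\sigma^2}{2}\,\nabla\cdot\nabla\pi + \tfrac{\sigma^2}{2}\,\Delta\pi = 0 .
\end{aligned}
```

The last line is the Fokker-Planck equation of (6) evaluated at $p = \pi$: the marginal stays equal to $\pi$, so Nelson's identity (8) holds with $p_{X^u} = \pi$, and the reverse-time drift $v$ is again a Langevin drift when read in reversed time.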

Measures in path space. The task of learning the time-reversal can be approached via the perspective of measures on the space of continuous trajectories $C([0,T], \mathbb{R}^d)$, also called path space. Loosely speaking, a path space measure $\overrightarrow{\mathbb{P}} = \overrightarrow{\mathbb{P}}^{u,p_{\mathrm{prior}}}$ of the process (6) can be thought of as the joint density $\overrightarrow{p}_{t_0,\dots,t_N}(X^u_{t_0}, \dots, X^u_{t_N})$ in (3) when $N \to \infty$, i.e., evaluated along infinitely many time instances (Baldi, 2017, Corollary 11.1).

In analogy to importance sampling described in Section 2.1, we may now consider a change of measure in path space, i.e.,

$$\mathbb{E}_{X^u \sim \overrightarrow{\mathbb{P}}}\big[\varphi(X^u_T)\,w(X^u)\big] = \mathbb{E}_{Y^u \sim \overleftarrow{\mathbb{P}}}\big[\varphi(Y^u_T)\big] = \mathbb{E}_{x \sim p_{\mathrm{target}}}\big[\varphi(x)\big], \tag{10}$$

where $w = \frac{\mathrm{d}\overleftarrow{\mathbb{P}}}{\mathrm{d}\overrightarrow{\mathbb{P}}}$ and $\overleftarrow{\mathbb{P}} = \overleftarrow{\mathbb{P}}^{u,p_{\mathrm{target}}}$ is the path space measure associated to (9). Furthermore, we can formulate the time-reversal task as the minimization problem

$$u^* = \operatorname*{arg\,min}_{u \in \mathcal{U}}\; D\big(\overrightarrow{\mathbb{P}}^{u,p_{\mathrm{prior}}}, \overleftarrow{\mathbb{P}}^{u,p_{\mathrm{target}}}\big), \tag{11}$$

where $D$ is a divergence and $\mathcal{U} \subset C(\mathbb{R}^d \times [0,T], \mathbb{R}^d)$ the set of admissible controls, cf. Richter & Berner (2024). If we can bring the divergence to zero, we have indeed achieved time-reversal between the forward and backward transitions and, thus, perfect sampling. Both for (10) and typical divergences in (11), it is essential to have a tractable expression for the likelihood ratio $w$ between the measures of the forward and the reverse-time process, also called the Radon-Nikodym derivative (RND). This is given by the following lemma; see Vargas et al. (2024) for the proof.

² See Vargas et al. (2024, Appendix A) for details and assumptions.

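Lemma 2.2 (Likelihood ratio between path measures). Let $\overrightarrow{\mathbb{P}}_{[s,t]}$ and $\overleftarrow{\mathbb{P}}_{[s,t]}$ be the path space measures of the solutions to the SDEs in (6) and (9) on the time interval $[s,t] \subset [0,T]$, where we assume $X^u_s \sim \pi(\cdot, s)$ and $Y^u_t \sim \pi(\cdot, t)$. Then for a generic³ process $X$ it holds

$$w_{[s,t]}(X) = \frac{\mathrm{d}\overleftarrow{\mathbb{P}}_{[s,t]}}{\mathrm{d}\overrightarrow{\mathbb{P}}_{[s,t]}}(X) = \frac{\pi(X_t, t)}{\pi(X_s, s)} \exp\bigg( \int_s^t \frac{\|u\|^2 - \|u - \sigma^2 \nabla \log \pi\|^2}{2\sigma^2}(X_\tau, \tau)\,\mathrm{d}\tau + \int_s^t \frac{u - \sigma^2 \nabla \log \pi}{\sigma^2}(X_\tau, \tau) \cdot \overleftarrow{\mathrm{d}}X_\tau - \int_s^t \frac{u}{\sigma^2}(X_\tau, \tau) \cdot \mathrm{d}X_\tau \bigg). \tag{12}$$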
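To make the likelihood ratio concrete, the following sketch evaluates a time-discretized version of $\log w_{[s,t]}$ for a one-dimensional path. It is not the discretization (31) from the appendix, which is not reproduced here. It assumes Gaussian Euler-Maruyama kernels: a forward step with drift $u$ and a backward step with drift $v = u - \sigma^2 \nabla \log \pi$. All function and argument names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a one-dimensional normal distribution."""
    return -0.5 * ((x - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def log_rnd(path, times, u, grad_log_pi, log_pi, sigma):
    """Discretized log w_[s,t]: log pi(X_t) - log pi(X_s) plus the sum over
    steps of log backward-kernel minus log forward-kernel densities."""
    log_w = log_pi(path[-1], times[-1]) - log_pi(path[0], times[0])
    for i in range(1, len(path)):
        h = times[i] - times[i - 1]
        x_prev, x_next = path[i - 1], path[i]
        t_prev, t_next = times[i - 1], times[i]
        # Forward Euler-Maruyama kernel: X_next ~ N(x_prev + u * h, sigma^2 * h).
        fwd_mean = x_prev + u(x_prev, t_prev) * h
        fwd_var = sigma(t_prev) ** 2 * h
        # Backward kernel from the reverse-time SDE with v = u - sigma^2 * grad log pi:
        # X_prev ~ N(x_next - v * h, sigma^2 * h).
        v = u(x_next, t_next) - sigma(t_next) ** 2 * grad_log_pi(x_next, t_next)
        bwd_mean = x_next - v * h
        bwd_var = sigma(t_next) ** 2 * h
        log_w += gaussian_logpdf(x_prev, bwd_mean, bwd_var)
        log_w -= gaussian_logpdf(x_next, fwd_mean, fwd_var)
    return log_w
```

For a stationary standard normal $\pi$ with $\sigma = \sqrt{2}$ and $u(x) = -x$, the per-step terms telescope and the result reduces to $-\tfrac{h}{4}(X_t^2 - X_s^2)$, which vanishes as the step size $h \to 0$, consistent with perfect time-reversal in that case.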

As can be seen from Lemma 2.2, path space measures can be readily employed for sequential algorithms that operate on the time grid that we introduced before. In particular, we may divide our trajectories Xu and Y u into subtrajectories and thus our path space measure into multiple chunks. To be precise, we may write

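$$\frac{\mathrm{d}\overleftarrow{\mathbb{P}}}{\mathrm{d}\overrightarrow{\mathbb{P}}} = \frac{\mathrm{d}\overleftarrow{\mathbb{P}}_{[t_0,t_1]}}{\mathrm{d}\overrightarrow{\mathbb{P}}_{[t_0,t_1]}} \cdots \frac{\mathrm{d}\overleftarrow{\mathbb{P}}_{[t_{N-1},t_N]}}{\mathrm{d}\overrightarrow{\mathbb{P}}_{[t_{N-1},t_N]}} = w_{[t_0,t_1]} \cdots w_{[t_{N-1},t_N]}. \tag{13}$$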

Different from the framework in Section 2.1, we note that Lemma 2.2 offers an explicit formula for computing the weights $w_{[t_{n-1},t_n]}$ in continuous time. As can be seen in the importance sampling identity (10), the weights can be interpreted as correcting for a potentially imperfect time-reversal. For convenience, we state Algorithm 1 for a simplified, high-level overview of combining SMC with diffusion models and refer to Algorithm 2 in Appendix A.3 for a more detailed exposition. Further, we note that the suggested setting relates to the usual SMC algorithm (such as in Dai et al. (2022)) by taking a different forward transport step (where our Markov kernel is implemented by an SDE) and by adopting the weighting step (using the Radon-Nikodym derivative in place of the likelihood ratio). Using the target density $\pi(\cdot, t_n)$, we can also add MCMC refinements at each time $t_n$; see Section 2.4.
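The overall loop of transport, reweighting, and resampling can be sketched on a toy problem. The example below is a hedged illustration rather than the authors' implementation. It assumes a one-dimensional Gaussian prior and target, a geometric annealing path with $\beta(t) = t/T$, a constant $\sigma$, Gaussian Euler-Maruyama kernels for the discretized weights, and, in place of a trained network, the plain Langevin drift $u = \tfrac{\sigma^2}{2} \nabla \log \pi$ as an untrained stand-in for the learned control. All constants and names are hypothetical.

```python
import math
import random

T, SIGMA = 1.0, 3.0          # hypothetical horizon and constant diffusion coefficient
N_SUB, L_STEPS = 8, 16       # subtrajectories and Euler-Maruyama steps per subtrajectory
H = T / (N_SUB * L_STEPS)    # step size


def log_pi(x, t):
    # Unnormalized geometric path between N(0, 1) and N(1.5, 0.8^2), beta(t) = t / T.
    beta = t / T
    return (1.0 - beta) * (-0.5 * x * x) + beta * (-0.5 * ((x - 1.5) / 0.8) ** 2)


def grad_log_pi(x, t):
    beta = t / T
    return (1.0 - beta) * (-x) + beta * (-(x - 1.5) / 0.8 ** 2)


def control(x, t):
    # Untrained stand-in for the learned control u (an assumption of this sketch).
    return 0.5 * SIGMA ** 2 * grad_log_pi(x, t)


def log_normal(x, mean, var):
    return -0.5 * ((x - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def scld_sample(num_particles, rng, ess_fraction=0.3):
    xs = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    log_w = [0.0] * num_particles
    var = SIGMA ** 2 * H
    step = 0
    for _ in range(N_SUB):
        for _ in range(L_STEPS):  # transport along one subtrajectory
            t_prev, t_next = step * H, (step + 1) * H
            for k, x in enumerate(xs):
                fwd_mean = x + control(x, t_prev) * H
                x_new = fwd_mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
                v = control(x_new, t_next) - SIGMA ** 2 * grad_log_pi(x_new, t_next)
                bwd_mean = x_new - v * H
                # Accumulate the discretized log Radon-Nikodym derivative.
                log_w[k] += (log_pi(x_new, t_next) - log_pi(x, t_prev)
                             + log_normal(x, bwd_mean, var)
                             - log_normal(x_new, fwd_mean, var))
                xs[k] = x_new
            step += 1
        # Resample only if the effective sample size drops below the threshold.
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
        ess = sum(w) ** 2 / sum(wi * wi for wi in w)
        if ess < ess_fraction * num_particles:
            xs = rng.choices(xs, weights=w, k=num_particles)
            log_w = [0.0] * num_particles
    return xs, log_w
```

The returned log-weights are unnormalized; statistics of the target are obtained by self-normalizing them, for example a weighted mean of the particles. With a well-trained control the weights would be close to uniform, and resampling would rarely trigger.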

2.3 LOSS FUNCTIONS AND OFF-POLICY TRAINING

We can adapt the idea of learning the optimal control $u^*$ to our sequential setting by considering divergences on each subinterval $[t_{n-1}, t_n]$ separately, resulting in losses of the form

$$\mathcal{L}(u) = \sum_{n=1}^{N} D\Big(\overrightarrow{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]}, \overleftarrow{\mathbb{P}}^{u,\pi_n}_{[t_{n-1},t_n]}\Big), \tag{14}$$

where $\pi_n := \pi(\cdot, t_n)$. We stress that with (14) optimization can in principle be conducted globally in spite of the resampling happening sequentially. However, depending on the choice of the divergence, this comes with additional challenges.

KL divergence. A classical choice is the Kullback-Leibler (KL) divergence $D = D_{\mathrm{KL}}$, i.e.,

$$D_{\mathrm{KL}}\Big(\overrightarrow{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]} \,\Big|\, \overleftarrow{\mathbb{P}}^{u,\pi_n}_{[t_{n-1},t_n]}\Big) = -\,\mathbb{E}_{X^u \sim \overrightarrow{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]}}\Big[\log w_{[t_{n-1},t_n]}(X^u)\Big], \tag{15}$$

where $w_{[t_{n-1},t_n]}$ is defined as in Lemma 2.2 and the minus originates from the reciprocal importance weights in the logarithm. However, for computing the expectation we need $X^u_{t_{n-1}} \sim \pi_{n-1}$. If resampling has been employed in the previous iteration (at time $t_{n-1}$; see Algorithm 1), a potential mismatch in the expectation is automatically corrected. Alternatively, we may correct with importance sampling in path space. To this end, let $t_m$ (with $t_m < t_{n-1}$) be the last time resampling has been conducted, i.e., the last time the weights have been reset; see Algorithm 5. As suggested in Matthews et al. (2022), we can then consider the importance weight $w_{[t_m,t_{n-1}]}$ and compute

$$D_{\mathrm{KL}}\Big(\overrightarrow{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]} \,\Big|\, \overleftarrow{\mathbb{P}}^{u,\pi_n}_{[t_{n-1},t_n]}\Big) = -\,\mathbb{E}_{X^u \sim \overrightarrow{\mathbb{P}}^{u,\pi_m}_{[t_m,t_n]}}\Big[\log\big(w_{[t_{n-1},t_n]}(X^u)\big)\, w_{[t_m,t_{n-1}]}(X^u)\Big], \tag{16}$$

for which $X^u_{t_{n-1}}$ does not need to be distributed according to $\pi_{n-1}$ anymore. However, the importance weights potentially introduce additional variance into the loss, particularly in high dimensions. This observation is stated rigorously in the following proposition, cf. Nüsken & Richter (2021, Proposition 5.7), and proved in Appendix A.2.

³ Note that the Radon-Nikodym derivative is only defined almost surely w.r.t. $\overrightarrow{\mathbb{P}}_{[s,t]}$. In particular, it only depends on $X$ on the time interval $[s,t]$.

Proposition 2.3 (Relative error of KL divergence). Denote by $D_{\chi^2}$ the $\chi^2$-divergence and by $r^{(K)} := \mathrm{Var}\big(\widehat{D}^{(K)}_{\mathrm{KL}}\big)^{1/2} / D_{\mathrm{KL}}$ the relative error of the Monte Carlo estimator $\widehat{D}^{(K)}_{\mathrm{KL}}$ of the KL divergence in (16) with sample size $K$. Moreover, let $t_m$ be the last resampling time and let $\overrightarrow{\mathbb{P}}^{\otimes I}_{[t_{n-1},t_n]}$ and $\overleftarrow{\mathbb{P}}^{\otimes I}_{[t_{n-1},t_n]}$ be the $I$-fold product measures of identical copies of $\overrightarrow{\mathbb{P}}_{[t_{n-1},t_n]}$ and $\overleftarrow{\mathbb{P}}_{[t_{n-1},t_n]}$, respectively. Then there exists a constant $c > 0$, such that for any $I \ge 2$ it holds that

$$r^{(K)}\Big(\overrightarrow{\mathbb{P}}^{\otimes I}_{[t_{n-1},t_n]} \,\Big|\, \overleftarrow{\mathbb{P}}^{\otimes I}_{[t_{n-1},t_n]}\Big) \ge c\,\Big(D_{\chi^2}\big(\overleftarrow{\mathbb{P}}_{[t_m,t_{n-1}]} \,\big|\, \overrightarrow{\mathbb{P}}_{[t_m,t_{n-1}]}\big) + 1\Big)^{I/2}. \tag{17}$$

Given a path measure $\mathbb{P}$ of a $D$-dimensional process, we note that $\mathbb{P}^{\otimes I}$ is a measure on the product space $\bigotimes_{i=1}^{I} C([0,T], \mathbb{R}^D) \cong C([0,T], \mathbb{R}^{ID})$. In particular, for $D = 1$ (corresponding to independent components), we can clearly identify $d = I$ as the dimension of the considered problem. This means that the relative error of the estimator of the KL divergence (16) is expected to scale exponentially in the dimension, which is illustrated in Figure 11 in Appendix A.6.10. As shown in Nüsken & Richter (2021), the log-variance (LV) divergence does not exhibit this unfavorable property.
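The exponential growth in (17) reflects a general property of product measures. As a short worked illustration, assume the common convention $D_{\chi^2}(P \mid Q) := \mathbb{E}_Q[(\mathrm{d}P/\mathrm{d}Q)^2] - 1$ for generic measures $P$ and $Q$. Then the second moment of the likelihood ratio factorizes over independent components:

```latex
\begin{aligned}
D_{\chi^2}\big(P^{\otimes I} \mid Q^{\otimes I}\big) + 1
&= \mathbb{E}_{Q^{\otimes I}}\Bigg[\prod_{i=1}^{I}
   \Big(\frac{\mathrm{d}P}{\mathrm{d}Q}(X^{(i)})\Big)^{2}\Bigg]
 = \prod_{i=1}^{I} \mathbb{E}_{Q}\Bigg[\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\Big)^{2}\Bigg]
 = \big(D_{\chi^2}(P \mid Q) + 1\big)^{I}.
\end{aligned}
```

For example, a per-component $\chi^2$-divergence of only $0.1$ in dimension $I = 100$ gives $(1.1)^{I/2} = (1.1)^{50} \approx 117$ for the factor on the right-hand side of (17).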

LV divergence and off-policy training. An alternative divergence can be defined by

$$D^{\mathbb{Q}}_{\mathrm{LV}}\Big(\overrightarrow{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]} \,\Big|\, \overleftarrow{\mathbb{P}}^{u,\pi_n}_{[t_{n-1},t_n]}\Big) = \mathrm{Var}_{X \sim \mathbb{Q}}\Big[\log w_{[t_{n-1},t_n]}(X)\Big], \tag{18}$$

which, in fact, is a family of divergences parametrized by a reference measure $\mathbb{Q} = \overrightarrow{\mathbb{P}}^{\widetilde{u},\widetilde{\pi}_{n-1}}_{[t_{n-1},t_n]}$ that can be chosen with arbitrary controls $\widetilde{u}$ and initial distributions $\widetilde{\pi}_{n-1}$ (also called off-policy training, see Remark A.2 for details and connections to reinforcement learning). In particular, we do not need $X_{t_{n-1}} \sim \pi_{n-1}$ anymore, and thus reweighting such as in (16) is not necessary, irrespective of the fact that resampling at time $t_n$ might not have been conducted. We summarize the training procedure for both divergences in Algorithm 3 and present details in Appendix A.3.
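Given log-weights $\log w_{[t_{n-1},t_n]}(X^{(k)})$ evaluated on a batch of trajectories, Monte Carlo estimates of the two divergences can be sketched as below. The function names are hypothetical, and only loss values are shown; in practice the log-weights would be differentiable functions of the control parameters in an autodiff framework.

```python
import statistics


def kl_estimate(log_weights):
    """Estimate of (15): minus the mean log-weight over on-policy trajectories."""
    return -statistics.fmean(log_weights)


def log_variance_estimate(log_weights):
    """Estimate of (18): sample variance of the log-weights over trajectories
    drawn from an arbitrary reference measure Q."""
    return statistics.variance(log_weights)
```

One practical consequence visible here: adding a constant to all log-weights, as happens when the normalizing constants of $\pi_{n-1}$ and $\pi_n$ are unknown, shifts the KL estimate by that constant but leaves the log-variance estimate unchanged.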

2.4 ALGORITHMIC REFINEMENTS AND IMPLEMENTATIONAL DETAILS

In this section, we turn our theoretical considerations from Sections 2.1 to 2.3 into implementable algorithms. We collate these changes in Algorithm 2 in Appendix A.3, representing a practical version of Algorithm 1.

Loss Function. We focus on the log-variance divergence in the sequel and refer to Appendix A.6.10 for a comparison to the KL divergence. We choose $\widetilde{u} = u$ (or previous versions when using a buffer, see replay buffers below) and simulate $X$ in (18) starting from the prior, so $\widetilde{\pi}_n$ corresponds to the SDE marginal. However, since we do not take gradients w.r.t. the control $\widetilde{u}$ of the reference measures, we detach the trajectory $X$, in line with Richter & Berner (2024). In particular, we do not need to differentiate through the SDE integrator.

Time discretization. In practice, we choose $N$ equidistant resampling times, i.e., $t_n - t_{n-1} = \tau$ for every $n \in \{1, \dots, N\}$, where the number of subtrajectories $N$ may change across applications. We discretize the SDE (6) via the Euler-Maruyama scheme, containing $L$ evenly spaced steps per subtrajectory, i.e.,

$$\widehat{X}^u_i = \widehat{X}^u_{i-1} + u\big(\widehat{X}^u_{i-1}, (i-1)h\big)\,h + \sigma\big((i-1)h\big)\sqrt{h}\,\xi_i, \qquad \xi_i \sim \mathcal{N}(0, I_d), \tag{19}$$

for $i \in \{1, \dots, NL\}$ with $h = \tau / L$. We refer to (31) in Appendix A.3 for the resulting discretization of the Radon-Nikodym derivative from Lemma 2.2 for computing the importance weights $w_{[t_{n-1},t_n]}$.
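The Euler-Maruyama update (19) translates directly into code. The sketch below is one-dimensional and uses hypothetical argument names; `u` and `sigma` are user-supplied callables.

```python
import math
import random


def euler_maruyama(x0, u, sigma, step_size, num_steps, rng):
    """Simulate (19): X_i = X_{i-1} + u(X_{i-1}, t_{i-1}) h + sigma(t_{i-1}) sqrt(h) xi_i."""
    path = [x0]
    x = x0
    for i in range(1, num_steps + 1):
        t_prev = (i - 1) * step_size
        xi = rng.gauss(0.0, 1.0)  # standard normal increment
        x = x + u(x, t_prev) * step_size + sigma(t_prev) * math.sqrt(step_size) * xi
        path.append(x)
    return path
```

For instance, `euler_maruyama(0.0, lambda x, t: -x, lambda t: 1.0, 0.01, 100, random.Random(0))` simulates an Ornstein-Uhlenbeck-type process on $[0, 1]$ and returns the full path of 101 states.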

Table 2: Comparison of different methods in terms of ELBOs, i.e., lower bounds on the log-normalization constant log Z. We use this metric for all tasks where we do not have access to ground-truth metrics. We report NA if all considered hyperparameter choices diverged.

| ELBO (↑) | Seeds (26d) | Sonar (61d) | Credit (25d) | Brownian (32d) | LGCP (1600d) |
|---|---|---|---|---|---|
| SMC | 74.63 ± 0.14 | 111.50 ± 0.96 | 589.82 ± 5.72 | 2.21 ± 0.53 | 385.75 ± 7.65 |
| SMC-ESS | 74.07 ± 0.60 | 109.10 ± 0.17 | 505.57 ± 0.18 | 0.49 ± 0.19 | **497.85 ± 0.11** |
| SMC-FC | 74.07 ± 0.02 | 108.93 ± 0.02 | 505.30 ± 0.02 | 1.91 ± 0.04 | 878.10 ± 2.20 |
| CRAFT | 73.75 ± 0.02 | 108.97 ± 0.16 | 518.25 ± 0.52 | 0.90 ± 0.10 | 485.87 ± 0.37 |
| DDS | 75.21 ± 0.21 | 121.22 ± 5.99 | 514.74 ± 1.22 | 0.56 ± 0.23 | NA |
| PIS | 88.92 ± 2.05 | 142.87 ± 3.29 | 846.57 ± 2.42 | NA | 479.54 ± 0.40 |
| CMCD-KL | 73.51 ± 0.01 | 109.09 ± 0.01 | 507.23 ± 6.40 | 0.86 ± 0.01 | 478.75 ± 0.34 |
| CMCD-LV | 73.67 ± 0.01 | 109.50 ± 0.03 | 504.90 ± 0.02 | 0.54 ± 0.03 | 472.79 ± 0.44 |
| SCLD (ours) | **73.45 ± 0.01** | **108.17 ± 0.25** | **504.46 ± 0.09** | **1.00 ± 0.18** | 486.77 ± 0.70 |

Annealing path. For the prescribed density curve $\pi$ we consider

$$\pi(x, t) \propto p_{\mathrm{prior}}(x)^{1-\beta(t)}\,\rho_{\mathrm{target}}(x)^{\beta(t)}, \tag{20}$$

where $\beta : [0,T] \to [0,1]$ is a monotonically increasing function fulfilling $\beta(0) = 0$ and $\beta(T) = 1$. We choose to learn the function $\beta$ to attain a smoother transition; see (35) and Appendix A.6.5.
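In log-space, the annealing path (20) is a convex combination of the prior and target log-densities. The sketch below uses a fixed linear schedule $\beta(t) = t/T$ purely for illustration, whereas the paper learns $\beta$. The prior, the bimodal target, and all function names are hypothetical.

```python
import math


def log_prior(x):
    # Hypothetical normalized standard normal prior.
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_rho_target(x):
    # Hypothetical unnormalized bimodal target with modes at -3 and 3.
    return math.log(math.exp(-0.5 * (x - 3.0) ** 2) + math.exp(-0.5 * (x + 3.0) ** 2))


def beta_linear(t, horizon):
    # Fixed linear schedule; an assumption standing in for the learned schedule.
    return t / horizon


def log_pi_unnormalized(x, t, horizon=1.0, beta=beta_linear):
    """Log of (20) up to an additive, time-dependent normalizing constant."""
    b = beta(t, horizon)
    return (1.0 - b) * log_prior(x) + b * log_rho_target(x)
```

At $t = 0$ the function returns the prior log-density and at $t = T$ the unnormalized target log-density; intermediate normalizing constants are unknown, which is one reason losses insensitive to additive constants in the log-weights are convenient.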

Resampling. There is a wealth of literature (Webber, 2019; Doucet et al., 2001; Douc & Cappé, 2005) regarding designing SMC resampling schemes. However, for a fair comparison to CRAFT (Matthews et al., 2022), we utilize the common multinomial resampling scheme. Resampling can, however, reduce particle diversity by introducing identical particles in its output. As such, it is common to trigger resampling at a time $t_n$ only when the Effective Sample Size (ESS), a measure of particle quality defined by

$$\mathrm{ESS} = \frac{\big(\sum_{k=1}^{K} w_n^{(k)}\big)^2}{\sum_{k=1}^{K} \big(w_n^{(k)}\big)^2},$$

is below a certain threshold, where $w_n$ are the importance weights at time $t_n$ (as in Algorithm 2). In line with prior works (Matthews et al., 2022; Phillips et al., 2024), we pick the threshold to be $0.3K$ where $K$ is the number of particles.
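The ESS criterion is a two-line computation. The following sketch uses hypothetical function names; the default threshold fraction of 0.3 matches the value stated above.

```python
def effective_sample_size(weights):
    """ESS = (sum_k w_k)^2 / sum_k w_k^2 for nonnegative, possibly unnormalized weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def should_resample(weights, threshold_fraction=0.3):
    """Trigger resampling when the ESS falls below threshold_fraction * K."""
    return effective_sample_size(weights) < threshold_fraction * len(weights)
```

The ESS ranges from 1 (all mass on a single particle) to $K$ (uniform weights), so the criterion leaves well-balanced particle sets untouched and preserves their diversity.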

MCMC refinements. In order to cope with sub-optimal controls $u$ during the course of optimization, we add some MCMC refinement steps after each subtrajectory at time $t_n$, using a Markov kernel with invariant measure $\pi(\cdot, t_n)$. In line with Matthews et al. (2022), after every subtrajectory, we use one Hamiltonian Monte Carlo (HMC) step with 10 leapfrog steps.
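A single HMC refinement step with 10 leapfrog steps can be sketched as follows for a one-dimensional density. This is a textbook HMC step with unit mass and a Metropolis-Hastings correction, not the authors' tuned kernel; the step size and all names are assumptions.

```python
import math
import random


def hmc_step(x, log_density, grad_log_density, rng, step_size=0.15, num_leapfrog=10):
    """One HMC transition that leaves exp(log_density) invariant."""
    p0 = rng.gauss(0.0, 1.0)  # resample the momentum
    x_new, p = x, p0
    # Leapfrog integration: half momentum step, alternating full steps, half momentum step.
    p += 0.5 * step_size * grad_log_density(x_new)
    for i in range(num_leapfrog):
        x_new += step_size * p
        if i < num_leapfrog - 1:
            p += step_size * grad_log_density(x_new)
    p += 0.5 * step_size * grad_log_density(x_new)
    # Metropolis-Hastings correction based on the change in total energy.
    log_accept = (log_density(x_new) - 0.5 * p * p) - (log_density(x) - 0.5 * p0 * p0)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return x_new
    return x
```

In the setting above, `log_density` and `grad_log_density` would be the annealed density $\pi(\cdot, t_n)$ and its gradient, and the step would be applied to every particle after resampling.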

Replay buffers. Replay buffers are known to prevent mode collapse and improve sample efficiency for sampling tasks (Vemgal et al., 2023; Midgley et al., 2022; Sendera et al., 2024). As such, we utilize a prioritized replay buffer during training time. At a high level, we maintain a fixed-size rolling cache of paths generated by previous versions of the policy, i.e., the learned control $u$. For the gradient updates, we then take half of the samples from the current policy and the other half from the buffer, using Radon-Nikodym derivatives as weights for prioritization; see Algorithm 4 in Appendix A.3 for details. We note that this procedure is easily feasible with the log-variance divergence since this divergence does not rely on an evaluation along the current policy (see Section 2.3).

3 EXPERIMENTS

We empirically demonstrate the performance of the proposed SCLD sampler on a wide variety of sampling benchmarks.⁴ We consider a combination of practical and synthetic examples taken from Blessing et al. (2024), the full descriptions of which are contained in Appendix A.4:

Examples from Bayesian statistics: The Seeds, Sonar, Credit, Brownian, and LGCP tasks.

Synthetic targets: A 40-mode Gaussian mixture model in 50d (GMM40), a 32-mode Many-Well task (MW54) in 5d, the popular 10d Funnel benchmark, and a 50d Student mixture model (MoS). Many of these are in relatively high dimensions and with many well-separated modes.

The Robot1 and Robot4 tasks: Inspired by robotics control problems, these synthetic 10-dimensional targets model the distribution over the configurations of a 10-joint robotic arm in the plane. They have multiple well-separated and sharp modes.

As baselines, we consider a representative selection of related sampling methods and refer to Appendix A.1 for descriptions. We study two metrics used frequently by previous works, such as in Blessing et al. (2024); Vargas et al. (2023). When groundtruth samples are available, we report the Sinkhorn distance (an optimal transport distance) to a set of generated samples (Cuturi, 2013), and otherwise consider the ELBO metric (i.e., a lower bound on log Z).
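To make explicit why the ELBO is a lower bound on log Z, the following short derivation may help. It is written for a generic sampler with normalized density q on the state space, which is a simplifying assumption for illustration; for the path-space methods considered here, the ratio is replaced by the corresponding Radon-Nikodym derivative between path measures.

```latex
\log Z
= \log \mathbb{E}_{X \sim q}\!\left[ \frac{\rho_{\mathrm{target}}(X)}{q(X)} \right]
\;\ge\; \mathbb{E}_{X \sim q}\!\left[ \log \frac{\rho_{\mathrm{target}}(X)}{q(X)} \right]
=: \mathrm{ELBO},
\qquad
\log Z - \mathrm{ELBO} = D_{\mathrm{KL}}\big(q \,\|\, p_{\mathrm{target}}\big) \ge 0 .
```

The inequality is Jensen's inequality applied to the concave logarithm, and the gap vanishes exactly when q coincides with the target, so a larger ELBO indicates a better sampler.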

⁴ Our code can be found at https://github.com/anonymous3141/SCLD.

Table 3: Comparison of different methods in terms of Sinkhorn distances. We present all tasks where we have access to samples for the evaluation. We report NA if all considered hyperparameter choices diverged.

| Sinkhorn (↓) | Funnel (10d) | MW54 (5d) | Robot1 (10d) | Robot4 (10d) | GMM40 (50d) | MoS (50d) |
|---|---|---|---|---|---|---|
| SMC | 149.35 ± 4.73 | 20.71 ± 5.33 | 24.02 ± 1.06 | 24.08 ± 0.26 | 46370.34 ± 137.79 | 3297.28 ± 2184.54 |
| SMC-ESS | **117.48 ± 9.70** | 1.11 ± 0.15 | 1.82 ± 0.50 | 2.11 ± 0.31 | 24240.68 ± 50.52 | 1477.04 ± 133.80 |
| SMC-FC | 211.43 ± 30.08 | 2.03 ± 0.17 | 0.37 ± 0.08 | 1.23 ± 0.02 | 39018.27 ± 159.32 | 3200.10 ± 95.35 |
| CRAFT | 133.42 ± 1.04 | 11.47 ± 0.90 | 2.92 ± 0.01 | 4.14 ± 0.50 | 28960.70 ± 354.89 | 1918.14 ± 108.22 |
| DDS | 142.89 ± 9.55 | 0.63 ± 0.24 | 11.44 ± 12.50 | 5.38 ± 2.44 | 5435.18 ± 172.20 | 2154.88 ± 3.86 |
| PIS | NA | **0.42 ± 0.01** | 1.54 ± 0.72 | 2.02 ± 0.36 | 10405.75 ± 69.41 | 2113.17 ± 31.17 |
| CMCD-KL | 124.89 ± 8.95 | 0.57 ± 0.05 | 3.71 ± 1.00 | 2.62 ± 0.41 | 22132.28 ± 595.18 | 1848.89 ± 532.56 |
| CMCD-LV | 139.07 ± 9.35 | 0.51 ± 0.08 | 28.49 ± 0.07 | 27.00 ± 0.07 | 4258.57 ± 737.15 | 1945.71 ± 48.79 |
| SCLD (ours) | 134.23 ± 8.39 | 0.44 ± 0.06 | **0.31 ± 0.04** | **0.40 ± 0.01** | **3787.73 ± 249.75** | **656.10 ± 88.97** |

[Figure panels: CMCD-KL, CMCD-LV, CRAFT, SCLD (ours), Groundtruth]

Figure 2: Samples from our considered methods and the groundtruth for the GMM40 (50d) (top) and Robot4 (10d) (bottom) tasks. Our SCLD method accurately finds all modes and avoids low-probability regions.

We took great care to ensure the fairness of our experiments and refer the reader to Appendix A.5 for full experimental and reproducibility details and to Blessing et al. (2024) for a discussion on benchmarking samplers. We also include numerous additional experiments and metrics in the appendices, such as ablation studies in Appendices A.6.1 and A.6.2, runtime information in Appendix A.6.3, a study on log Z estimation in Appendix A.6.4, the effect of learning priors by variational inference in Appendix A.6.6, a comparison to PDDS in Appendix A.6.7, and a comparison of KL and LV training in Appendix A.6.10.

3.1 RESULTS

Our SCLD method exhibits strong performance on both ELBO and Sinkhorn benchmarks (Tables 2 and 3). Indeed, among all tasks except Funnel, we are able to achieve the top performance or come a close second when measuring performance by Sinkhorn distances (when it is available). For ELBO estimation, SCLD can utilize a large number of resampling steps to attain the strongest performances in all but one task. In particular, SCLD can surpass the outcomes of CMCD-KL and CMCD-LV with 40000 gradient steps using only 3000 steps. In the following, we comment on different aspects.

Avoiding mode collapse. We visualize the samples for GMM40 and the Robot4 task in Figure 2. For GMM40, we plot the first two dimensions of samples against the true marginal distribution. In all attempted hyperparameter settings, we found that CRAFT suffers from mode collapse (see also Appendix A.6.4) and that CMCD-KL gradually collapses to a few modes, covering low-probability regions. CMCD-LV and SCLD perform much better, and indeed the samples from SCLD are virtually indistinguishable from the groundtruth. For Robot4, we visualize the sampled robot arm positions. Observe that for the Robot4 task, CMCD-KL and CRAFT both collapse onto one mode. CMCD-LV does not experience mode collapse but nevertheless does not sample accurately from any mode. Only our SCLD method is able to identify and sample relatively precisely from all 8 modes.

Improved convergence properties of SCLD. We found that the SCLD algorithm demonstrates superior convergence properties. As SCLD is effectively initialized as an SMC sampler and is trained to improve upon it, we expect a good initial performance even before training and, thus, an improved starting point for optimization. As visualized in Figure 3, SCLD consistently attains better ELBOs for any given training time budget on all tasks when compared to CMCD-KL and CMCD-LV. While in some cases SCLD is initially worse than CRAFT, it always manages to catch up quickly and surpasses it. SCLD and CMCD steps require similar amounts of time for these tasks (see Appendix A.6.3), and thus SCLD offers a 10-fold decrease in training time as well as iteration count compared to CMCD. See Appendix A.6.9 for an alternative visualization.

[Plot: four panels with x-axis "Time Elapsed (s)"; legend: SCLD, CMCD-KL, CMCD-LV, CRAFT, Long Run CMCD (Best)]

Figure 3: ELBOs during training for several tasks. We visualize the ELBO estimates attained by 4 methods as a function of the training time elapsed (until SCLD finished after 3000 iterations), running 3 seeds for each task. We mark the long run CMCD ELBOs (best out of KL and LV loss), corresponding to running for 40000 gradient steps as for the main table. Methods leveraging Sequential Monte Carlo (SCLD and CRAFT) generally exhibit improved convergence speed, but whereas CRAFT plateaus quickly, our SCLD method often achieves state-of-the-art performance in about 5 minutes.

[Plot: heatmap panels with axes "# SMC Steps (Train)" and "# SMC Steps (Evaluation)", each ranging over 0, 1, 2, 4, 8, 16, 32, 64, 128; color scale labeled "Sinkhorn Distance"]

Figure 4: Performance of SCLD for different numbers of SMC steps at training and evaluation time for several tasks. Better results are shaded darker. We note that taking zero SMC steps corresponds to the CMCD method. Using more SMC steps generally has a beneficial effect during training. Our method allows us to select different numbers of SMC steps for training and inference.

Choice of number of SMC steps. Here, we study the effect of varying the number N of subtrajectories used in the SCLD sampler, i.e., SMC steps where we apply resampling (if ESS is lower than the threshold) and MCMC steps, and offer practical advice on choosing this value. For this study, we fix the number of gradient steps for training to 8000 but otherwise retain the same experimental design. The results are illustrated in Figure 4, where we visualize the relevant metric for four tasks and demonstrate the effect of varying the number of SMC steps used at training and evaluation. For most tasks, we found it advantageous to use as many SMC steps as possible at both training and evaluation time. Particularly for the Seeds and Sonar targets, the outcomes look strikingly similar. For these tasks, it is also shown that using a smaller number of SMC steps at training or even only adding SMC steps at evaluation already improves upon stand-alone diffusion-based samplers.

While it is well known that resampling can potentially lead to mode collapse and loss of sample diversity on highly multimodal tasks (Doucet et al., 2001), we found that even for such tasks, resampling, when used sparingly, was still beneficial during training. This is clearly reflected in the multimodal Robot4 task, where using SMC steps at training significantly improves sample quality. In line with the previous paragraph, this suggests that our SCLD training setup can help improve training convergence. Informed by our observations, we opt to use 4 subtrajectories only at training for all synthetic tasks except Funnel and MoS, and, for all other tasks, we utilize 128 subtrajectories at both training and evaluation time for the main experiments. These choices, while not necessarily optimal, are robust and work well across our diverse set of benchmarks.

4 CONCLUSION

We have developed a framework for combining diffusion-based samplers with Sequential Monte Carlo algorithms and propose simple yet effective methods for training. Our framework culminates in a novel sampler, termed Sequential Controlled Langevin Diffusion (SCLD), in principle offering a great amount of design freedom. In particular, SCLD allows for accelerated training, flexible parameterizations, end-to-end training with prioritized replay buffers, and injection of resampling and MCMC steps at arbitrary times in the generative process. We provide careful ablation studies of our design choices and empirically show state-of-the-art performance on a diverse range of benchmarks.

ACKNOWLEDGEMENTS

We thank Francisco Vargas, Lee Cheuk-Kit, and Dinghuai Zhang for very helpful discussions. J.C. was supported by a Summer Undergraduate Research Fellowship at the California Institute of Technology. D.B. is supported by funding from the pilot program Core Informatics of the Helmholtz Association (HGF). J.B. acknowledges support from the Wally Baer and Jeri Weiss Postdoctoral Fellowship. A.A. is supported in part by Bren endowed chair, ONR (MURI grant N00014-23-1-2654), and the AI2050 senior fellow program at Schmidt Sciences. The research of L.R. was partially funded by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 Scaling Cascades in Complex Systems (project A05, project number 235221301). We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.

REFERENCES

Michael S. Albergo and Eric Vanden-Eijnden. NETS: A non-equilibrium transport sampler, 2025. URL https://arxiv.org/abs/2410.02711.

Michael Arbel, Alex Matthews, and Arnaud Doucet. Annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 318–330, 2021.

Oleg Arenz, Mingjun Zhong, and Gerhard Neumann. Trust-region variational inference with Gaussian mixture models. Journal of Machine Learning Research, 21(163):1–60, 2020.

Paolo Baldi. Stochastic calculus. Springer, 2017.

Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research, 2024.

Espen Bernton, Jeremy Heng, Arnaud Doucet, and Pierre E Jacob. Schrödinger bridge samplers. arXiv preprint arXiv:1912.13170, 2019.

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, and Gerhard Neumann. Beyond ELBOs: A large-scale evaluation of variational methods for sampling. In Forty-first International Conference on Machine Learning, 2024.

Benjamin Boys, Mark Girolami, Jakiw Pidstrigach, Sebastian Reich, Alan Mosca, and O Deniz Akyildiz. Tweedie moment projected diffusions for inverse problems. arXiv preprint arXiv:2310.06721, 2023.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax.

Alexander Buchholz, Nicolas Chopin, and Pierre E Jacob. Adaptive tuning of Hamiltonian Monte Carlo within sequential Monte Carlo. Bayesian Analysis, 16(3):745–771, 2021.

Alberto Cabezas, Adrien Corenflos, Junpeng Lao, Rémi Louf, Antoine Carnec, Kaustubh Chaudhari, Reuben Cohn-Gordon, Jeremie Coullon, Wei Deng, Sam Duffield, et al. BlackJAX: Composable Bayesian inference in JAX. arXiv preprint arXiv:2402.10797, 2024.

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022.

Eungchun Cho, Moon Jung Cho, and John Eltinge. The variance of sample variance from a finite population. International Journal of Pure and Applied Mathematics, 21(3):389, 2005.

Nicolas Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.

Nicolas Chopin, Omiros Papaspiliopoulos, et al. An introduction to sequential Monte Carlo, volume 4. Springer, 2020.

Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022a.

Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696, 2022b.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26, 2013.

Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. Optimal transport tools (OTT): A JAX toolbox for all things Wasserstein. arXiv preprint arXiv:2201.12324, 2022.

Chenguang Dai, Jeremy Heng, Pierre E Jacob, and Nick Whiteley. An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association, 117(539):1587–1600, 2022.

Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization, 23(1):313–329, 1991.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.

Pierre del Moral. Mean field simulation for Monte Carlo integration. Monographs on Statistics & Applied Probability. Chapman & Hall, 2013.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411–436, 2006.

Kieran Didi, Francisco Vargas, Simon V Mathis, Vincent Dutordoir, Emile Mathieu, Urszula J Komorowska, and Pietro Lio. A framework for conditional diffusion modelling with applications in motif scaffolding for protein design. arXiv preprint arXiv:2312.09236, 2023.

Carles Domingo-Enrich. A taxonomy of loss functions for stochastic optimal control. arXiv preprint arXiv:2410.00345, 2024.

Carles Domingo-Enrich, Jiequn Han, Brandon Amos, Joan Bruna, and Ricky TQ Chen. Stochastic optimal control matching. arXiv preprint arXiv:2312.02027, 2023.

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. arXiv preprint arXiv:2409.08861, 2024.

Randal Douc and Olivier Cappé. Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp. 64–69, 2005.

Arnaud Doucet, Nando de Freitas, and Neil J. Gordon (eds.). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer, 2001.

Arnaud Doucet, Will Grathwohl, Alexander G de G Matthews, and Heiko Strathmann. Score-based diffusion meets annealed importance sampling. In Advances in Neural Information Processing Systems, 2022.

Marylou Gabrié, Grant M Rotskoff, and Eric Vanden-Eijnden. Efficient Bayesian sampling using normalizing flows to assist Markov Chain Monte Carlo methods. arXiv preprint arXiv:2107.08001, 2021.

Marylou Gabrié, Grant M Rotskoff, and Eric Vanden-Eijnden. Adaptive Monte Carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022.

Tomas Geffner and Justin Domke. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021.

Tomas Geffner and Justin Domke. Langevin diffusion variational inference. arXiv preprint arXiv:2208.07743, 2022.

A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.

Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proc. F Radar Signal Proc., volume 140, pp. 107–113, 1993.

Paul Lyonel Hagemann, Johannes Hertrich, and Gabriele Steidl. Generalized normalizing flows via Markov chains. Cambridge University Press, 2023.

Carsten Hartmann and Lorenz Richter. Nonasymptotic bounds for suboptimal importance sampling. SIAM/ASA Journal on Uncertainty Quantification, 12(2):309–346, 2024.

Jeremy Heng, Adrian N. Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. The Annals of Statistics, 48(5), 2017.

M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communication in Statistics - Simulation and Computation, 18:1059–1076, 1989.

Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. Advances in Neural Information Processing Systems, 35:24240–24253, 2022.

Takeshi Koshizuka and Issei Sato. Neural Lagrangian Schrödinger bridge: Diffusion modeling for population dynamics. In The Eleventh International Conference on Learning Representations, 2023.

Guan-Horng Liu, Tianrong Chen, Oswin So, and Evangelos Theodorou. Deep generalized Schrödinger bridge. Advances in Neural Information Processing Systems, 35:9374–9388, 2022.

Guan-Horng Liu, Yaron Lipman, Maximilian Nickel, Brian Karrer, Evangelos A Theodorou, and Ricky TQ Chen. Generalized Schrödinger bridge matching. arXiv preprint arXiv:2310.02233, 2023.

Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Cristian Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning GFlowNets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, pp. 23467–23483, 2023.

Alex Matthews, Michael Arbel, Danilo Jimenez Rezende, and Arnaud Doucet. Continual repeated annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 15196–15219, 2022.

Laurence Illing Midgley, Vincent Stimper, Gregor NC Simm, Bernhard Schölkopf, and José Miguel Hernández-Lobato. Flow annealed importance sampling bootstrap. arXiv preprint arXiv:2208.01893, 2022.

Volodymyr Mnih. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Jesper Møller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

Radford M Neal. Annealed importance sampling. Statistics and computing, 11:125–139, 2001.

Radford M Neal. Slice sampling. The annals of statistics, 31(3):705–767, 2003.

Kirill Neklyudov, Rob Brekelmans, Daniel Severo, and Alireza Makhzani. Action matching: Learning stochastic dynamics from samples. In International conference on machine learning, pp. 25858–25889. PMLR, 2023.

E Nelson. Dynamical theories of Brownian motion. Press, Princeton, NJ, 1967.

Nikolas Nüsken and Lorenz Richter. Solving high-dimensional Hamilton–Jacobi–Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. Partial differential equations and applications, 2(4):48, 2021.

Angus Phillips, Hai-Dang Dau, Michael John Hutchinson, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Particle denoising diffusion sampler. arXiv preprint arXiv:2402.06320, 2024.

Lorenz Richter and Julius Berner. Improved sampling via learned diffusions. In The Twelfth International Conference on Learning Representations, 2024.

Marcin Sendera, Minsu Kim, Sarthak Mittal, Pablo Lemos, Luca Scimeca, Jarrid Rector-Brooks, Alexandre Adam, Yoshua Bengio, and Nikolay Malkin. On diffusion models for amortized inference: Benchmarking and improving stochastic control and sampling. arXiv preprint arXiv:2402.05098, 2024.

Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching. Advances in Neural Information Processing Systems, 36, 2024.

Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2022.

Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. In International Conference on Machine Learning, pp. 32483–32498, 2023.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Jingtong Sun, Julius Berner, Lorenz Richter, Marius Zeinhofer, Johannes Müller, Kamyar Azizzadenesheli, and Anima Anandkumar. Dynamical measure transport and neural PDE solvers for sampling. arXiv preprint arXiv:2407.07873, 2024.

Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024.

Suriyanarayanan Vaikuntanathan and Christopher Jarzynski. Escorted free energy simulations: Improving convergence by reducing dissipation. Physical Review Letters, 100(19):190601, 2008.

Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.

Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In International Conference on Learning Representations, 2023.

Francisco Vargas, Shreyas Padhy, Denis Blessing, and Nikolas Nüsken. Transport meets variational inference: Controlled Monte Carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024.

Nikhil Vemgal, Elaine Lau, and Doina Precup. An empirical study of the effectiveness of using a replay buffer on mode discovery in GFlowNets. arXiv preprint arXiv:2307.07674, 2023.

Siddarth Venkatraman, Moksh Jain, Luca Scimeca, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, et al. Amortizing intractable inference in diffusion models for vision, language, and control. arXiv preprint arXiv:2405.20971, 2024.

Robert J Webber. Unifying sequential Monte Carlo with resampling matrices. arXiv preprint arXiv:1903.12583, 2019.

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 681–688, Madison, WI, USA, 2011.

Hao Wu, Jonas Köhler, and Frank Noé. Stochastic normalizing flows. Advances in Neural Information Processing Systems, 33:5933–5944, 2020.

Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. Advances in Neural Information Processing Systems, 36, 2024.

Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, and Yang Song. Improving diffusion inverse problem solving with decoupled noise annealing. arXiv preprint arXiv:2407.01521, 2024.

Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. arXiv preprint arXiv:2310.02679, 2023a.

Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022.

Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, et al. Artificial intelligence for science in quantum, atomistic, and continuum systems. arXiv preprint arXiv:2307.08423, 2023b.

A.1 Related works

A.2 Proofs and theoretical remarks

A.3 Algorithmic details and pseudocode

A.3.1 Computation of the Radon-Nikodym derivative

A.3.2 A practical algorithm

A.4 Benchmark target distributions

A.4.1 Bayesian statistics tasks

A.4.2 Synthetic targets

A.5 Experimental details

A.5.1 Metrics and evaluation

A.5.2 Design choices

A.5.3 Hyperparameter selection

A.6 Additional experiments

A.6.1 Ablation studies of SCLD

A.6.2 Removing MCMC components

A.6.3 Timings

A.6.4 Estimations of the normalizing constant

A.6.5 The learned annealing schedule

A.6.6 Mean field prior for SCLD

A.6.7 Comparison to PDDS

A.6.8 Comparison with advanced SMC schemes

A.6.9 Convergence of different methods by iteration count

A.6.10 KL-based training of SCLD

A.1 RELATED WORKS

Adding to Section 1.1, this section provides additional related works.

SMC. SMC methods (Chopin, 2002; Del Moral et al., 2006) describe a general methodology to sample sequentially from a sequence of (annealed) distributions. They rely on forward and backward kernels in order to move from one distribution to another and leverage resampling steps in between. Popular choices for the kernels include MCMC. However, while enjoying theoretical guarantees, they suffer from drawbacks such as long mixing times and tedious tuning (Dai et al., 2022).

SMC with learned kernels. To make the transition kernels more flexible and reduce the amount of manual tuning, previous approaches have been proposed to learn them (Wu et al., 2020; Geffner & Domke, 2021). Combinations with SMC include the works by Bernton et al. (2019); Heng et al. (2017). While they propose learned SMC transitions, they do not utilize neural networks (partially due to tractability issues). Bernton et al. (2019) build on the prior work of Heng et al. (2017), which uses ideas from optimal control to iteratively modify the prior distribution and transition kernels through an approximate dynamic programming approach. However, this requires the prior distribution to be conjugate with respect to the policy of the underlying optimal control problem, among other drawbacks discussed in Bernton et al. (2019). The latter work, in turn, proposes the Sequential Schrödinger Bridge Sampler (SSB), which produces a trained SMC sampler by applying sequential approximate iterative proportional fitting (IPF) to learn the forward and backward kernels.

Whereas that paper works in discrete time, we take a continuous-time perspective and, in doing so, obtain a family of simpler, unbiased training procedures and reveal additional design choices, such as the ability to choose the integrator. We also note that our objective is fundamentally different from IPF and, in particular, yields a different solution for a finite number of steps (see Vargas et al. (2024, Proposition 3.4)).

Methods combining SMC with neural networks include Annealed Flow Transport Monte Carlo (AFT) (Arbel et al., 2021), as well as its improved version Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT) (Matthews et al., 2022). Those works use normalizing flows to transition between adjacent annealing steps. While achieving improved performance, the deterministic nature of the transitions requires MCMC steps after the resampling steps to avoid particles collapsing to the same location. Moreover, the log-determinant of the Jacobian (or the divergence of the drift in continuous time) is required. To avoid costly computations in high dimensions, one needs to either place restrictions on the architecture or use noisy estimators (such as Hutchinson's trace estimator (Hutchinson, 1989) for the divergence). We remark that there is also a series of works that combines normalizing flows with MCMC methods (Midgley et al., 2022; Gabrié et al., 2021; 2022; Hagemann et al., 2023).
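To illustrate the kind of noisy estimator referred to above, the following minimal sketch estimates the trace of a matrix A from matrix-vector products with Rademacher probe vectors, which is the idea behind Hutchinson's estimator. The function name and interface are assumptions for this example; in the flow setting, `matvec` would be a Jacobian-vector product of the drift, so that the estimate approximates its divergence.

```python
import random


def hutchinson_trace(matvec, dim, n_probes=1000, rng=random):
    """Estimate tr(A) as the average of v^T A v over random sign vectors v.

    `matvec(v)` must return the product A v as a list of length `dim`.
    The estimator is unbiased because E[v v^T] is the identity matrix.
    """
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # Rademacher probe
        Av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, Av))
    return total / n_probes
```

Only matrix-vector products are needed, so the full Jacobian is never formed; the price is variance that comes from the off-diagonal entries of A and shrinks with the number of probes.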

Diffusion-based samplers. Works on diffusion-based samplers such as Path Integral Samplers (PIS), Denoising Diffusion Samplers (DDS), Time-reversed Diffusion Samplers (DIS), and others introduced by Zhang & Chen (2022); Berner et al. (2024); Vargas et al. (2023; 2024); Sendera et al. (2024); Sun et al. (2024) have focused on transporting a prior to the target distribution using controlled stochastic differential equations (SDEs), where the control is learned by minimizing suitable divergences between induced measures on the SDE trajectories; see the framework described in Section 2.2. As a historical note, some of the ideas of diffusion sampling were anticipated in earlier works such as Vaikuntanathan & Jarzynski (2008). In this work, we aim to harness their flexibility together with the power of SMC. Orthogonal to our work, techniques from diffusion models have been employed to approximate the extended target distribution needed in AIS methods (Doucet et al., 2022; Geffner & Domke, 2022).

Subtrajectories. In our work, we utilize the idea of dividing a path measure into sequential sections. This bears resemblance to the concept of subtrajectories as introduced in a discrete-time setting in the context of GFlowNets (Zhang et al., 2023a; Madan et al., 2023), and thus we will also use this term. While conceptually similar, the latter work only proposed subtrajectories as an alternative training loss, whereas we use them to facilitate integration with SMC methods. Additionally, their formulation requires learning the evolution of the SDE marginals, whereas we adapt recent Controlled Monte Carlo Diffusions (CMCD) (Vargas et al., 2024) to get rid of this requirement.

SMC with diffusion-based samplers. To the best of our knowledge, the only prior work on combining diffusion-based methods with SMC steps is the Particle Denoising Diffusion Sampler (PDDS) (Phillips et al., 2024), where the backward kernel is chosen to be the noising diffusion and the forward kernel the approximate (learned) time-reversal. While also inspired by diffusion-based samplers, PDDS significantly differs from our approach. First, we take a more general continuous-time perspective, allowing us more freedom in design choices while still recovering the (discrete-time) setup of PDDS as a special case (i.e., where we use one Euler-Maruyama step per subtrajectory). Next, their setup requires learning potential functions and relies on automatic differentiation to compute the control, which can be unstable and challenging to optimize. Indeed, PDDS was empirically found to require variational approximations for the prior distribution to train stably, which has certain drawbacks (see Appendices A.6.6 and A.6.7). Moreover, it uses an alternating training setup that uses (approximate) samples from the partially trained model, whereas we train our model end-to-end, i.e., our setup is the same during training and inference. We empirically compare the methods and discuss the impact of this difference in training methodology in Appendix A.6.7. We additionally compare different SMC-based methods in Table 1. Lastly, we note that the recent concurrent work by Albergo & Vanden-Eijnden (2025) also utilizes resampling steps as part of a learned SDE-based sampling algorithm.

Diffusion-based generative modeling. As outlined in our introduction, sampling problems are substantially different from problems in generative modeling, where samples from the target distribution are provided. However, many successful techniques from diffusion-based generative modeling, such as SDE integrators, noise schedules, and probability flow ODEs, can be translated to diffusion-based samplers. Loosely related to CMCD, and thus SCLD, are (entropic) action-matching approaches (Neklyudov et al., 2023), where the intermediate distributions are prescribed via samples as compared to (unnormalized) densities in our setting. In both settings, there exist unique gradient fields representing the optimal controls, which can be characterized as solutions to infinitesimal Schrödinger bridge problems at the intermediate distributions, i.e., minimizers of the kinetic energy (see Vargas et al. (2024, Proposition 3.4) and Neklyudov et al. (2023, Appendix B.3)). As described in Remark A.1, we could replace the CMCD framework with more general bridges, i.e., arbitrary, learnable density evolutions as considered in Richter & Berner (2024), at the cost of learning the (unnormalized) marginals with a separate model. A corresponding objective in generative modeling has been considered by Chen et al. (2022). While such approaches do not exhibit unique solutions, one can additionally minimize the KL divergence of the learned path measure to a reference measure, typically given by a Brownian motion, which leads to dynamic Schrödinger bridge problems (i.e., entropy-regularized optimal transport). This has been explored by, e.g., Vargas et al. (2021); De Bortoli et al. (2021); Shi et al. (2024) in the context of generative modeling, and we refer to Koshizuka & Sato (2023); Liu et al. (2022; 2023) for extensions beyond kinetic energy minimization (related to mean-field games).

Finally, we mention that generative modeling frameworks that allow likelihood computations can also be used for sampling problems. Specifically, one can optimize objectives from generative modeling (e.g., score-matching objectives) using approximate samples from the target distribution obtained from the partially trained model together with importance sampling based on the likelihoods of the samples. This can be viewed as a version of the cross-entropy method and is used, e.g., in Jing et al. (2022, Section 3.6) for diffusion models5 and in Tong et al. (2024, Appendix C.2) for flow matching. However, a mismatch of the high-probability regions of the proposal (given by the partially trained model) and target distributions often leads to high variance in high-dimensional settings. We note that PDDS can be viewed as a very elaborate version of such an approach, counteracting the aforementioned problems by incorporating SMC steps into the proposal as well as training with a combination of target-matching and score-matching objectives. We compare to PDDS in Appendix A.6.7.

Diffusion-based posterior sampling and stochastic optimal control. For our considered sampling problems, we only assume minimal to no prior knowledge of the properties of the target distribution. However, for sampling from posterior distributions arising from Bayesian inference problems, one can decompose the target as

$$p_{\mathrm{target}} = p_{X|Y}(\cdot, y) = \frac{p_X \, p_{Y|X}(y|\cdot)}{Z},$$

where $y$ is a given measurement and $p_X$ and $p_{Y|X}$ are the prior and likelihood, respectively. In our Bayesian statistics tasks (see Appendix A.4), the prior $p_X$ is given by a simple, tractable distribution, and we do not incorporate knowledge about the prior into our framework.

However, for certain problems, the prior can also be more complex, e.g., in inverse problems on image, audio, or video distributions. Assuming, in contrast to our setting, that samples from the prior $p_X$ are given, recent methods leverage diffusion priors, i.e., diffusion models pre-trained on $p_X$, to simplify sampling from $p_{\mathrm{target}}$; see, e.g., Chung et al. (2022a;b); Song et al. (2022; 2023); Boys et al. (2023); Zhang et al. (2024). Using the decomposition of $p_{\mathrm{target}}$, they draw approximate samples from $p_{\mathrm{target}}$ based on approximations of the likelihood score (i.e., the difference of the scores of the noised posterior and prior distributions) during inference. For instance, the common reconstruction guidance approximates this score by the (scaled) gradient of the log-likelihood evaluated at the denoised sample obtained via Tweedie's formula and the pre-trained model. While such plug-and-play approaches can yield impressive results for high-dimensional distributions without additional training, they typically lack theoretical guarantees and suffer from instabilities and mode collapse.

At the cost of simulating multiple particles during the generative process, the bias originating from approximating the likelihood score can be eliminated (in the limit of infinitely many particles) by leveraging ideas from SMC, i.e., by computing importance weights and interleaving the generative process with resampling steps (Wu et al., 2024). Taking into account an additional training phase, one can also obtain theoretical guarantees by writing the likelihood score as a solution to a stochastic optimal control (SOC) problem (as in DDS, however, with the pre-trained diffusion model as a reference process); see Didi et al. (2023, Section 2.4) and also Venkatraman et al. (2024). The SOC problem can then be solved using, e.g., the log-variance divergence. While such posterior sampling

5While Jing et al. (2022) use the probability flow ODE to obtain likelihoods, one could alternatively obtain importance weights in path space; see the references on diffusion-based samplers above.


[Figure 5 diagram: chain $p_{\mathrm{prior}} \to p_1 \to p_2 \to p_{\mathrm{target}}$ with transition kernels $p_{t_1|0}$, $p_{t_2|t_1}$, $p_{T|t_2}$.]

Figure 5: Illustration of annealed importance sampling along a geometric path, where we consider either one (top arrow) or three (bottom arrows) transition steps from the prior to the target.

approaches assume more structure than our considered sampling problem and rely on a pre-trained diffusion prior, one could also adapt the idea of SCLD to such settings (see also Remark A.1), which we leave to future work. This would essentially correspond to a combination of the approaches by Wu et al. (2024) and Didi et al. (2023), where the likelihood score is learned but training is facilitated by leveraging SMC steps.

We note that ideas similar to Didi et al. (2023) have recently also been used for fine-tuning diffusion models (where $p_{Y|X}(y|\cdot)$ corresponds to a reward function), using adjoint matching to minimize the KL divergence instead of, e.g., the log-variance divergence, to solve the SOC problem (Domingo-Enrich et al., 2024). While we propose to use the log-variance divergence to allow off-policy training and reduce variance (see Section 2.3), we note that adjoint matching and related approaches (Domingo-Enrich et al., 2023; Domingo-Enrich, 2024) could also be used for SCLD to solve the SOC problems in each subtrajectory.

A.2 PROOFS AND THEORETICAL REMARKS

In this section, we provide additional remarks on our theory and the proof of Proposition 2.3.

Remark A.1 (Generalizations). We note that, in principle, prescribing an annealing, i.e., $p_{X^u} = \pi$, is not strictly necessary, and one could instead consider general bridges allowing for arbitrary density evolutions from the prior to the target. This, however, would come with the additional challenge of learning the (unnormalized) log-density $\log p_{X^u}$ of the controlled process, see, e.g., Richter & Berner (2024, Appendix A.7), which could make optimization more difficult. Moreover, we can only use the approximate densities for the MCMC refinements as compared to using the target density $\pi$ in case of a prescribed annealing. While the general bridges do not exhibit unique solutions, one can consider the case studied in diffusion models, where the control $v$ of the reverse-time process in (7) is fixed such that $Y^v_0$ is approximately distributed as $p_{\mathrm{prior}}$. Nelson's identity in (8) shows that it is sufficient to learn the log-density $\log p_{Y^v}$ and the optimal control can be computed using automatic differentiation, as leveraged in Phillips et al. (2024); Richter & Berner (2024). However, this can potentially be unstable and computationally more expensive.

Remark A.2 (Connections to reinforcement learning). The objectives of diffusion-based samplers can be viewed as stochastic optimal control problems; see, e.g., Dai Pra (1991); Zhang & Chen (2022); Berner et al. (2024). More generally, stochastic optimal control problems can be understood as versions of maximum entropy reinforcement learning in continuous time and space; see, e.g., Domingo-Enrich et al. (2024, Appendix C). Specifically, the prior distribution pprior together with the control u define policies and transitions via the SDE (6) (or, in discrete time, via the transition kernels in (32) given by the Euler-Maruyama scheme). This allows the transfer of successful ideas from reinforcement learning to diffusion-based samplers. Motivated by previous work (Zhang et al., 2023a; Richter & Berner, 2024; Sendera et al., 2024), we propose to use off-policy training with prioritized replay buffers for SCLD, which is enabled by the log-variance loss (see Section 2.3).


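Proof of Proposition 2.3. We follow the proof ideas from Nüsken & Richter (2021, Proposition 5.7); however, we need to be careful since the reweighting of the measure $\vec{\mathbb{P}}_{[t_{n-1},t_n]} = \vec{\mathbb{P}}^{u,\pi_{n-1}}_{[t_{n-1},t_n]}$ is done w.r.t. a measure on the previous time interval $[t_m, t_{n-1}]$. Let us recall the KL divergence (16), namely

$$D := D_{\mathrm{KL}}\big(\vec{\mathbb{P}}_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}_{[t_{n-1},t_n]}\big) = \mathbb{E}_{X \sim \vec{\mathbb{P}}_{[t_{n-1},t_n]}}\big[\log w_{[t_{n-1},t_n]}(X)\big] = \mathbb{E}_{X \sim \vec{\mathbb{P}}_{[t_m,t_n]}}\left[\frac{\log w_{[t_{n-1},t_n]}(X)}{w_{[t_m,t_{n-1}]}(X)}\right],$$

where we abbreviate $w_{[s,t]} := \frac{d\vec{\mathbb{P}}^{u,\pi(\cdot,s)}_{[s,t]}}{d\overleftarrow{\mathbb{P}}^{u,\pi(\cdot,t)}_{[s,t]}}$. Using the analogous abbreviation $w^I_{[s,t]}$ for the product measures, we note that

$$\mathrm{Var}\Big[\widehat{D}^{(K)}_{\mathrm{KL}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big)\Big] = \frac{1}{K} \mathrm{Var}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\log w^I_{[t_{n-1},t_n]}(X)}{w^I_{[t_m,t_{n-1}]}(X)}\right] = \frac{M_I - D_I^2}{K}, \tag{21}$$

where

$$M_I := \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\log^2 w^I_{[t_{n-1},t_n]}(X)}{w^I_{[t_m,t_{n-1}]}(X)^2}\right],$$

$$D_I := \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\log w^I_{[t_{n-1},t_n]}(X)}{w^I_{[t_m,t_{n-1}]}(X)}\right] = \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_{n-1},t_n]}}\big[\log w^I_{[t_{n-1},t_n]}(X)\big] = D_{\mathrm{KL}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big) = ID. \tag{22}$$

Moreover, we can compute

$$M_I = \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\Big(\sum_{i=1}^{I} \log w^{(i)}_{[t_{n-1},t_n]}(X)\Big)^2}{w^I_{[t_m,t_{n-1}]}(X)^2}\right]$$

$$= \sum_{i=1}^{I} \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\log^2 w^{(i)}_{[t_{n-1},t_n]}(X)}{w^I_{[t_m,t_{n-1}]}(X)^2}\right] + \sum_{\substack{i,j=1 \\ i \neq j}}^{I} \mathbb{E}_{X \sim \vec{\mathbb{P}}^I_{[t_m,t_n]}}\left[\frac{\log w^{(i)}_{[t_{n-1},t_n]}(X) \, \log w^{(j)}_{[t_{n-1},t_n]}(X)}{w^I_{[t_m,t_{n-1}]}(X)^2}\right]$$

$$= I M C^{I-1} + I(I-1) D^2 C^{I-2}, \tag{23}$$

where $w^{(i)}_{[s,t]}$ denotes the weight for the $i$-th factor of the product measure and we abbreviate

$$M := \mathbb{E}_{X \sim \vec{\mathbb{P}}_{[t_m,t_n]}}\left[\frac{\log^2 w_{[t_{n-1},t_n]}(X)}{w_{[t_m,t_{n-1}]}(X)^2}\right] \geq D^2, \tag{24}$$

$$C := \mathbb{E}_{X \sim \vec{\mathbb{P}}_{[t_m,t_n]}}\left[\frac{1}{w_{[t_m,t_{n-1}]}(X)^2}\right] = \mathbb{E}_{X \sim \vec{\mathbb{P}}_{[t_m,t_{n-1}]}}\left[\frac{1}{w_{[t_m,t_{n-1}]}(X)^2}\right] = \chi^2\big(\overleftarrow{\mathbb{P}}_{[t_m,t_{n-1}]} \,\big|\, \vec{\mathbb{P}}_{[t_m,t_{n-1}]}\big) + 1 \geq 1. \tag{25}$$

Combining the definition of the relative error with (21), (22), and (23), we obtain that

$$r^{(K)}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big) = \sqrt{\frac{M_I - D_I^2}{K D_I^2}} = C^{I/2} \sqrt{\frac{MC + D^2(I-1)}{C^2 I K D^2} - \frac{1}{K C^{I}}},$$

which, in view of (24) and (25), proves the claim.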
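The growth of this relative error can be illustrated numerically. The following is only a sketch: it plugs hypothetical one-dimensional constants `M`, `D`, and `C` (chosen so that M ≥ D² and C ≥ 1, not taken from any experiment) into the expressions M_I = I M C^(I-1) + I(I-1) D² C^(I-2) and D_I = I D from the proof; the function name is ours.

```python
import math


def kl_relative_error(I, K, M, D, C):
    """Relative error sqrt(M_I - D_I^2) / (sqrt(K) * D_I) for an I-fold product measure."""
    M_I = I * M * C ** (I - 1) + I * (I - 1) * D ** 2 * C ** (I - 2)
    D_I = I * D
    return math.sqrt((M_I - D_I ** 2) / K) / D_I


# Hypothetical constants; they only need to satisfy M >= D**2 and C >= 1.
M, D, K = 2.0, 1.0, 1000
for C in (1.0, 1.5):
    errs = [kl_relative_error(I, K, M, D, C) for I in (1, 10, 50, 100)]
    print(f"C = {C}: " + ", ".join(f"{e:.3g}" for e in errs))
```

For C = 1 (no mismatch on the previous interval) the relative error decays like I^(-1/2), whereas any C > 1 eventually leads to growth of order C^(I/2).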

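As already stated in the main text, the log-variance divergence defined in (18) does not scale exponentially in the dimension, as proved in Nüsken & Richter (2021, Proposition 5.7). For the convenience of the reader, let us explicitly verify that this statement also holds in our setting. To this end, first note that

$$D^{Q^I}_{\mathrm{LV}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big) = \mathrm{Var}_{X \sim Q^I}\big[\log w^I_{[t_{n-1},t_n]}(X)\big] \tag{26a}$$

$$= \sum_{i=1}^{I} \mathrm{Var}_{X \sim Q}\big[\log w^{(i)}_{[t_{n-1},t_n]}(X)\big] = I \, D^{Q}_{\mathrm{LV}}\big(\vec{\mathbb{P}}_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}_{[t_{n-1},t_n]}\big), \tag{26b}$$

where we recall that $Q$ is an arbitrary reference measure. Following Cho et al. (2005), the sample variance satisfies

$$\mathrm{Var}\Big[\widehat{D}^{Q^I,(K)}_{\mathrm{LV}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big)\Big] = \frac{1}{K}\left(\mu_4 - \frac{K-3}{K-1} \, D^{Q^I}_{\mathrm{LV}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big)^2\right), \tag{27}$$

where

$$\mu_4 = \mathbb{E}_{X \sim Q^I}\Big[\Big(\log w^I_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q^I}\big[\log w^I_{[t_{n-1},t_n]}(X)\big]\Big)^4\Big]. \tag{28}$$

We can calculate

$$\mu_4 = \mathbb{E}_{X \sim Q^I}\left[\left(\sum_{i=1}^{I} \Big(\log w^{(i)}_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q}\big[\log w^{(i)}_{[t_{n-1},t_n]}(X)\big]\Big)\right)^4\right] \tag{29a}$$

$$= I \, \mathbb{E}_{X \sim Q}\Big[\big(\log w_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q}[\log w_{[t_{n-1},t_n]}(X)]\big)^4\Big] \tag{29b}$$

$$\quad + 3 I (I-1) \, \mathbb{E}_{X \sim Q}\Big[\big(\log w_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q}[\log w_{[t_{n-1},t_n]}(X)]\big)^2\Big]^2, \tag{29c}$$

where we have used the fact that, for instance,

$$\mathbb{E}_{X \sim Q^I}\Big[\Big(\log w^{(i)}_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q}\big[\log w^{(i)}_{[t_{n-1},t_n]}(X)\big]\Big)\Big(\log w^{(j)}_{[t_{n-1},t_n]}(X) - \mathbb{E}_{X \sim Q}\big[\log w^{(j)}_{[t_{n-1},t_n]}(X)\big]\Big)^3\Big] = 0$$

for $i \neq j$. Combining this with (26), it follows that

$$\mathrm{Var}\Big[\widehat{D}^{(K)}_{\mathrm{LV}}\big(\vec{\mathbb{P}}^I_{[t_{n-1},t_n]} \,\big|\, \overleftarrow{\mathbb{P}}^I_{[t_{n-1},t_n]}\big)\Big] = \mathcal{O}(I^2).$$

Recalling the definition of the relative error, $r^{(K)} := \mathrm{Var}(\widehat{D}^{(K)}_{\mathrm{LV}})^{1/2} / D_{\mathrm{LV}}$, we see that it does not scale exponentially in $I$.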
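The polynomial scaling rests on the fourth central moment of a sum of I independent, identically distributed centered terms, which equals I μ₄ + 3 I (I-1) σ⁴. As a sanity check, the sketch below verifies this identity exactly by enumeration; the three-point distribution is a hypothetical stand-in for the centered log-weight of a single factor, and all names are ours.

```python
from fractions import Fraction
from itertools import product

# Hypothetical centered distribution standing in for log w - E[log w] of one factor.
values = [Fraction(-1), Fraction(0), Fraction(2)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]


def moment(p):
    """p-th moment of a single factor."""
    return sum(q * v ** p for v, q in zip(values, probs))


def fourth_moment_of_sum(I):
    """Exact fourth moment of the sum of I independent copies, by enumeration."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=I):
        prob, s = Fraction(1), Fraction(0)
        for j in idx:
            prob *= probs[j]
            s += values[j]
        total += prob * s ** 4
    return total


def closed_form(I):
    """I * mu_4 + 3 * I * (I - 1) * sigma^4 for centered i.i.d. factors."""
    return I * moment(4) + 3 * I * (I - 1) * moment(2) ** 2
```

Since the closed form is quadratic in I, the variance of the log-variance estimator grows at most polynomially in I.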

A.3 ALGORITHMIC DETAILS AND PSEUDOCODE

We first provide formulas to compute the Radon-Nikodym derivative (RND) and the forward and backward kernels in discrete time. Then, we give an implementable method in Algorithm 2 and provide details on the resampling step and training with a buffer. Note that we can also use nonuniform discretizations within subtrajectories by adapting the times hi, i = 0, . . . , N, accordingly.

A.3.1 COMPUTATION OF THE RADON-NIKODYM DERIVATIVE

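As in Vargas et al. (2024), we obtain an approximate, computable formula for the Radon-Nikodym derivative in Lemma 2.2 between the $(n-1)$-th and $n$-th time step, given by

$$w_{[t_{n-1},t_n]}(X) = \frac{d\overleftarrow{\mathbb{P}}_{[t_{n-1},t_n]}}{d\vec{\mathbb{P}}_{[t_{n-1},t_n]}}(X) \approx \frac{\pi(X_{t_n}, t_n)}{\pi(X_{t_{n-1}}, t_{n-1})} \prod_{i=(n-1)L+1}^{nL} \frac{p_{(i-1)h|ih}(X_{(i-1)h} | X_{ih})}{p_{ih|(i-1)h}(X_{ih} | X_{(i-1)h})}, \tag{31}$$

where the transition densities for the forward and reverse-time SDEs, coming from the Euler-Maruyama discretization as in (19), are given as

$$p_{t|s}(X_t | X_s) = \mathcal{N}\big(X_t; \, X_s + u(X_s, s)(t - s), \, \sigma^2(s)(t - s)\big),$$

$$p_{s|t}(X_s | X_t) = \mathcal{N}\big(X_s; \, X_t + (\sigma^2 \nabla \log \pi - u)(X_t, t)(t - s), \, \sigma^2(t)(t - s)\big). \tag{32}$$

In practice, in line with Vargas et al. (2024), we parameterize the control as

$$u = \sigma^2 \widetilde{u}_\theta + \frac{\sigma^2}{2} \nabla \log \pi, \tag{33}$$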


where $\widetilde{u}_\theta$ is parametrized by a neural network. When $\widetilde{u}_\theta$ is initialized as the zero function, we recover an annealed form of Langevin dynamics (Welling & Teh, 2011), providing an improved starting point for optimization.
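To make the discrete-time Radon-Nikodym derivative concrete, the following sketch evaluates its logarithm for a scalar path under the Gaussian Euler-Maruyama kernels in (32). It is a simplified, one-dimensional illustration rather than the implementation used in the experiments; the function names and the callables `log_pi`, `grad_log_pi`, `control`, and `sigma` are hypothetical placeholders.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_rnd(path, times, log_pi, grad_log_pi, control, sigma):
    """Log of the discretized RND for a scalar path X_{t_0}, ..., X_{t_L}.

    log_pi is the (unnormalized) annealed log-density, control is the drift u.
    """
    out = log_pi(path[-1], times[-1]) - log_pi(path[0], times[0])
    for i in range(len(path) - 1):
        x0, x1, s, t = path[i], path[i + 1], times[i], times[i + 1]
        dt = t - s
        fwd_mean = x0 + control(x0, s) * dt
        bwd_mean = x1 + (sigma(t) ** 2 * grad_log_pi(x1, t) - control(x1, t)) * dt
        out += normal_logpdf(x0, bwd_mean, sigma(t) ** 2 * dt)  # backward kernel
        out -= normal_logpdf(x1, fwd_mean, sigma(s) ** 2 * dt)  # forward kernel
    return out


# Example with a time-independent standard Gaussian pi and a zero network output,
# so that the control reduces to the Langevin drift (sigma^2 / 2) * grad log pi.
log_pi = lambda x, t: -0.5 * x * x
grad_log_pi = lambda x, t: -x
sigma = lambda t: 1.0
control = lambda x, t: 0.5 * sigma(t) ** 2 * grad_log_pi(x, t)
value = log_rnd([1.0, 0.5], [0.0, 0.01], log_pi, grad_log_pi, control, sigma)
```

In this example the log-RND is of the order of the step size, reflecting that Langevin dynamics leaves a fixed density invariant up to discretization error.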

A.3.2 A PRACTICAL ALGORITHM

In Algorithm 2, we give a practical and detailed version of Algorithm 1.

Algorithm 2 SCLD-ForwardPass

Require: Target $\rho_{\mathrm{target}}$, (learnable) prior $p_{\mathrm{prior}} = \mathcal{N}(\mu_\theta, \mathrm{diag}(\exp(2\ell_\theta)))$, number of subtrajectories $N$, steps per subtrajectory $L$ and step size $h$, annealing schedule $\beta_\theta$ as in (35), noise schedule $\sigma$, control $u$ given by neural network $\widetilde{u}_\theta$ as in (33), number of particles $K$

1: Sample from prior (by reparametrization): $\widehat{X}^{(1:K)}_0 \sim p_{\mathrm{prior}}$ ▷ Independent for each particle
2: Initialize (unnormalized) importance weights: $w^{(1:K)}_0 = 1$
3: Evaluate control and prior: $u(\widehat{X}^{(1:K)}_0, 0)$ and $p_{\mathrm{prior}}(\widehat{X}^{(1:K)}_0)$
4: for $n = 1$ to $N$ do ▷ Note that $t_n = nLh$
5:     for $i = (n-1)L + 1$ to $nL$ do ▷ Consider the time interval $[(i-1)h, ih]$
6:         Euler-Maruyama simulation: $\widehat{X}^{(1:K)}_i \sim p_{ih|(i-1)h}(\cdot \,|\, \widehat{X}^{(1:K)}_{i-1})$ as in (32) ▷ See (19)
7:         Evaluate control: $u(\widehat{X}^{(1:K)}_i, ih)$
8:     Evaluate (unnormalized) annealing: $\pi(\widehat{X}^{(1:K)}_{nL}, t_n) = \big(p_{\mathrm{prior}}^{1-\beta_\theta(t_n)} \rho_{\mathrm{target}}^{\beta_\theta(t_n)}\big)(\widehat{X}^{(1:K)}_{nL})$ ▷ See (20)
9:     Compute RNDs: $w^{(1:K)}_{[t_{n-1},t_n]}$ as in (31) ▷ For every $k$, we use $X_{ih} = \widehat{X}^{(k)}_i$
10:    Update weights: $w^{(1:K)}_n = w^{(1:K)}_{n-1} \, w^{(1:K)}_{[t_{n-1},t_n]}$
11:    Resample: $\widehat{X}^{(1:K)}_{nL}, w^{(1:K)}_n = \mathrm{resample}(\widehat{X}^{(1:K)}_{nL}, w^{(1:K)}_n)$ ▷ See Algorithm 5
12:    MCMC step: Update $\widehat{X}^{(1:K)}_{nL}$ with $\pi(\cdot, t_n)$-invariant kernel
13: return RNDs $(w^{(1:K)}_{[t_{n-1},t_n]})_{n=1}^{N}$, weights $(w^{(1:K)}_n)_{n=0}^{N}$, trajectories $\widehat{X}^{(1:K)}$, $\log Z$ estimate $\sum_{n=1}^{N} \log \sum_{k=1}^{K} w^{(k)}_{n-1} w^{(k)}_{[t_{n-1},t_n]}$, ELBO $\sum_{n=1}^{N} \sum_{k=1}^{K} w^{(k)}_{n-1} \log w^{(k)}_{[t_{n-1},t_n]}$
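The two estimators returned at the end of the forward pass can be sketched as follows. This is a minimal illustration under the assumption that the weights from the previous step are normalized to sum to one before they are used (the function normalizes them explicitly); the function name and the list-of-lists layout are ours.

```python
import math


def log_z_and_elbo(prev_weights, rnds):
    """Accumulate the log Z estimate and the ELBO over subtrajectories.

    prev_weights[n][k]: weight of particle k before subtrajectory n (any positive scale).
    rnds[n][k]: RND of particle k on subtrajectory n.
    """
    log_z, elbo = 0.0, 0.0
    for w_prev, w_inc in zip(prev_weights, rnds):
        total = sum(w_prev)
        norm = [w / total for w in w_prev]  # assumption: weights are normalized here
        log_z += math.log(sum(W * w for W, w in zip(norm, w_inc)))
        elbo += sum(W * math.log(w) for W, w in zip(norm, w_inc))
    return log_z, elbo


log_z, elbo = log_z_and_elbo([[1.0, 1.0], [1.0, 3.0]], [[2.0, 2.0], [1.0, 4.0]])
```

By Jensen's inequality, the ELBO computed this way never exceeds the corresponding log Z estimate.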

Prioritized replay buffer. We give the exact algorithm of our replay buffer in Algorithm 4. We note that there are many alternative possibilities for choosing the buffer priority (including by importance weight), which we leave to future exploration. Moreover, as in traditional replay buffers (Mnih, 2013), there is an option to perform multiple gradient steps per simulation to reduce computation costs.

Resampling. The work of Webber (2019) shows that there is great scope to design resampling methods. However, in line with prior work, we opt to use the simple adaptive multinomial resampling for which pseudocode is provided in Algorithm 5.

Annealing schedule. While we choose to learn the annealing schedule, an alternative approach that takes advantage of the flexibility of our continuous-time perspective is to set $\beta(t) = t$ and instead adaptively choose the time discretization. For example, the constant-ESS annealing scheme (Buchholz et al., 2021) is commonly used in the SMC literature and can be easily adapted to the SCLD setting for both training and sampling. We leave the investigation of this idea to future work.

A.4 BENCHMARK TARGET DISTRIBUTIONS

Here, we introduce the target densities considered in our experiments more formally. Most of these are standard benchmarks taken from, e.g., Heng et al. (2017); Arbel et al. (2021); Geffner & Domke (2022); Richter & Berner (2024); Blessing et al. (2024).

A.4.1 BAYESIAN STATISTICS TASKS

For these tasks, no groundtruth samples are available.


Algorithm 3 SCLD-Training ▷ See Algorithm 4 for training with buffers.

Require: Number of iterations $I$, initial parameters $\theta^{(0)}$, optimizer update $\mathrm{update}$, inputs for Algorithm 2

1: for $i = 0$ to $I - 1$ do
2:     Run Algorithm 2: $(w^{(1:K)}_{[t_{n-1},t_n]})_{n=1}^{N}, (w^{(1:K)}_n)_{n=0}^{N} = \text{SCLD-ForwardPass}(\theta^{(i)})$
3:     if LV then ▷ Trajectories $\widehat{X}^{(1:K)}$ are detached during forward pass
4:         Compute loss: $\mathcal{L} = \sum_{n=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \Big(\log w^{(k)}_{[t_{n-1},t_n]} - \frac{1}{K} \sum_{i=1}^{K} \log w^{(i)}_{[t_{n-1},t_n]}\Big)^2$
5:     else if KL then
6:         Compute loss: $\mathcal{L} = -\sum_{n=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \mathrm{detach}(w^{(k)}_{n-1}) \log w^{(k)}_{[t_{n-1},t_n]}$
7:     Compute gradient w.r.t. parameters: $G^{(i)} = \nabla_{\theta^{(i)}} \mathcal{L}$
8:     Optimizer step: $\theta^{(i+1)} = \mathrm{update}(\theta^{(i)}, (G^{(j)})_{j=0}^{i})$ ▷ We use Adam
9: return Optimized parameters $\theta^{(I)}$
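The log-variance loss of the training loop is the empirical variance of the log-RNDs over particles, summed over subtrajectories. A minimal sketch without automatic differentiation is given below (in practice the loss would be computed inside a differentiable framework); the function name and the input layout are ours.

```python
def log_variance_loss(log_rnds):
    """Sum over subtrajectories of the empirical variance of the log-RNDs.

    log_rnds[n][k] is the log-RND of particle k on subtrajectory n.
    """
    loss = 0.0
    for row in log_rnds:
        mean = sum(row) / len(row)
        loss += sum((v - mean) ** 2 for v in row) / len(row)
    return loss


example = [[0.0, 2.0], [1.0, 1.0, 1.0]]
shifted = [[v + 5.0 for v in row] for row in example]
```

Adding a constant to all log-RNDs of a subtrajectory leaves the loss unchanged, so unknown normalizing constants of the annealed densities do not affect it.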

Algorithm 4 SCLD-Buffer-Training

Require: Buffer $(B_n)_{n=1}^{N}$ for every subtrajectory, inputs for Algorithm 3

1: for $i = 0$ to $I - 1$ do
2:     Run Algorithm 2: $(w^{(1:K)}_{[t_{n-1},t_n]})_{n=1}^{N}, \widehat{X}^{(1:K)} = \text{SCLD-ForwardPass}(\theta^{(i)})$ with $K$ particles
3:     for $n = 1$ to $N$ do
4:         Store subtrajectories: $(\widehat{X}^{(1:K)}_i)_{i=(n-1)L}^{nL}$ into $B_n$ with weights $w^{(1:K)}_{[t_{n-1},t_n]}$, replacing oldest entries
5:         Sample from buffer: $\widetilde{X}^{(1:K/2)} \sim B_n$ with probability proportional to buffer weights
6:         Recompute RNDs: $\widetilde{w}^{(1:K/2)}_{[t_{n-1},t_n]}$ for detached $\widetilde{X}^{(1:K/2)}$ using (31) and current parameters $\theta^{(i)}$
7:         Update buffer: Set $\widetilde{w}^{(1:K/2)}_{[t_{n-1},t_n]}$ as weights for $\widetilde{X}^{(1:K/2)}$ ▷ Updating all $B$ particles is too slow
8:         Sample other half from simulation: $\widetilde{w}^{(K/2+1:K)}_{[t_{n-1},t_n]}$ from $w^{(1:K)}_{[t_{n-1},t_n]}$ uniformly without replacement
9:     Compute log-variance loss: $\mathcal{L} = \sum_{n=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \Big(\log \widetilde{w}^{(k)}_{[t_{n-1},t_n]} - \frac{1}{K} \sum_{i=1}^{K} \log \widetilde{w}^{(i)}_{[t_{n-1},t_n]}\Big)^2$
10:    Compute gradient w.r.t. parameters: $G^{(i)} = \nabla_{\theta^{(i)}} \mathcal{L}$
11:    Optimizer step: $\theta^{(i+1)} = \mathrm{update}(\theta^{(i)}, (G^{(j)})_{j=0}^{i})$ ▷ We use Adam
12: return Optimized parameters $\theta^{(I)}$

Bayesian Logistic Regression (Sonar and Credit). We used two binary classification problems in our benchmark, which have also been used in various other works to compare different state-of-the-art methods in variational inference and MCMC. Specifically, we assess the performance of a Bayesian logistic model with

$$\rho_{\mathrm{target}}(x) = p(x) \prod_{i=1}^{n} \mathrm{Bernoulli}\big(y_i; \mathrm{sigmoid}(x \cdot u_i)\big)$$

on two standardized datasets $((u_i, y_i))_{i=1}^{n}$, namely Sonar ($d = 61$) and German Credit ($d = 25$) with $n = 208$ and $n = 1000$ data points, respectively. We choose $p = \mathcal{N}(0, I)$ for Sonar and $p \equiv 1$ for Credit (in line with the code of Blessing et al. (2024), which omitted the prior).
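For illustration, the unnormalized log-density of such a Bayesian logistic regression target can be sketched as below. This is not the benchmark code; the function names, the plain-list data layout, and the `gaussian_prior` flag (switching between a standard Gaussian prior and a flat prior) are assumptions made for the example.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def logistic_log_density(x, features, labels, gaussian_prior=True):
    """log rho_target(x) = log p(x) + sum_i log Bernoulli(y_i; sigmoid(x . u_i))."""
    if gaussian_prior:  # standard Gaussian prior N(0, I)
        lp = -0.5 * (len(x) * math.log(2.0 * math.pi) + sum(xi * xi for xi in x))
    else:  # flat prior p = 1
        lp = 0.0
    for u, y in zip(features, labels):
        z = sum(xi * ui for xi, ui in zip(x, u))
        lp += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return lp


# Toy data (hypothetical): three points in two dimensions.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [1, 0, 1]
```

At the origin every Bernoulli factor equals 1/2, which gives a quick check of the implementation.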

Random Effect Regression (Seeds). The Seeds ($d = 26$) target uses a random effect regression model given by:

$$\begin{aligned}
\tau &\sim \mathrm{Gamma}(0.01, 0.01), \\
a_0, a_1, a_2, a_{12} &\sim \mathcal{N}(0, 10), \\
b_i &\sim \mathcal{N}\Big(0, \tfrac{1}{\sqrt{\tau}}\Big), && i = 1, \dots, 21, \\
\mathrm{logits}_i &= a_0 + a_1 x_i + a_2 y_i + a_{12} x_i y_i + b_i, && i = 1, \dots, 21, \\
r_i &\sim \mathrm{Binomial}(\mathrm{logits}_i, N_i), && i = 1, \dots, 21.
\end{aligned}$$

The goal is to do inference over the variables $\tau$, $a_0$, $a_1$, $a_2$, $a_{12}$, and $b_i$ for $i = 1, \dots, 21$, given observed values for $x_i$, $y_i$, and $N_i$ from a dataset modeling the germination proportion of seeds; see Geffner & Domke (2022) for details.


Algorithm 5 Adaptive Multinomial Resampling

Require: particles $X^{(1:K)}$, unnormalized weights $w^{(1:K)}$

1: Normalize: $W^{(k)} = w^{(k)} / \sum_{i=1}^{K} w^{(i)}$, $k = 1, \dots, K$
2: Compute ESS: $\mathrm{ESS} = 1 / \sum_{k=1}^{K} (W^{(k)})^2$
3: if $\mathrm{ESS} < \alpha K$ then ▷ We take $\alpha = 0.3$
4:     for $k = 1$ to $K$ do
5:         Sample index from categorical distribution: $i \in \{1, \dots, K\}$ with probabilities $W^{(1:K)}$
6:         Define resampled particle: $\widetilde{X}^{(k)} = X^{(i)}$
7:     Reset weights: $W^{(1:K)} = 1/K$
8: else
9:     Keep particles: $\widetilde{X}^{(1:K)} = X^{(1:K)}$
10: return resampled particles $\widetilde{X}^{(1:K)}$, updated and normalized weights $W^{(1:K)}$
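The resampling pseudocode above translates almost line by line into code. The sketch below uses only the Python standard library; the function name, the optional `rng` argument, and returning the ESS as a third value are our additions for illustration.

```python
import random


def adaptive_multinomial_resample(particles, weights, alpha=0.3, rng=None):
    """Resample only if the effective sample size drops below alpha * K."""
    rng = rng or random.Random()
    K = len(particles)
    total = sum(weights)
    norm = [w / total for w in weights]          # normalized weights W
    ess = 1.0 / sum(W * W for W in norm)         # effective sample size
    if ess < alpha * K:
        idx = rng.choices(range(K), weights=norm, k=K)  # multinomial resampling
        return [particles[i] for i in idx], [1.0 / K] * K, ess
    return list(particles), norm, ess


# Degenerate weights trigger resampling; uniform weights do not.
resampled, new_w, ess_low = adaptive_multinomial_resample(
    ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 0.0], rng=random.Random(0)
)
kept, kept_w, ess_high = adaptive_multinomial_resample(["a", "b", "c", "d"], [2.0, 2.0, 2.0, 2.0])
```

With all mass on one particle, the ESS equals one and every resampled particle is a copy of that particle; with uniform weights the ESS equals K and the particles are kept.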

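Time Series Models (Brownian). The Brownian ($d = 32$) model corresponds to the time discretization of a Brownian motion with Gaussian observation noise:

$$\begin{aligned}
\alpha_{\mathrm{inn}} &\sim \mathrm{LogNormal}(0, 2), \\
\alpha_{\mathrm{obs}} &\sim \mathrm{LogNormal}(0, 2), \\
x_1 &\sim \mathcal{N}(0, \alpha_{\mathrm{inn}}), \\
x_i &\sim \mathcal{N}(x_{i-1}, \alpha_{\mathrm{inn}}), && i = 2, \dots, 30, \\
y_i &\sim \mathcal{N}(x_i, \alpha_{\mathrm{obs}}), && i = 1, \dots, 30.
\end{aligned}$$

Inference is performed over the variables $\alpha_{\mathrm{inn}}$, $\alpha_{\mathrm{obs}}$, and $\{x_i\}_{i=1}^{30}$ given the observations $\{y_i\}_{i=1}^{10}$ and $\{y_i\}_{i=20}^{30}$ (i.e., the middle observations are missing); see Geffner & Domke (2022).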

Spatial Statistics (LGCP). The Log Gaussian Cox process (LGCP) is a popular high-dimensional task in spatial statistics (Møller et al., 1998), which models the position of pine saplings. Using a $d = 40 \times 40 = 1600$ grid, we obtain the unnormalized target density by

$$\rho_{\mathrm{target}} = \mathcal{N}(x; \mu, \Sigma) \prod_{i=1}^{d} \exp\big(x_i y_i - \exp(x_i)\big),$$

where $y$ is a given dataset and $\mu$ and $\Sigma$ are the mean and covariance matrix of the given prior. We use the more challenging unwhitened version; see Heng et al. (2017); Arbel et al. (2021) for details.

A.4.2 SYNTHETIC TARGETS

For these tasks, groundtruth samples are available.

Robot. The Robot targets (Arenz et al., 2020) (Robot1, Robot4) aim at learning joint configurations of a 10 degrees-of-freedom planar robot, parameterized by

$$\alpha = (\alpha_1, \dots, \alpha_{10}),$$

such that it reaches a desired goal position while enforcing smooth configurations. The target density is given by

$$\rho_{\mathrm{target}}(\alpha) = p_{\mathrm{conf}}(\alpha) \, p_{\mathrm{cart}}(\alpha),$$

where $p_{\mathrm{conf}}$ enforces smooth configurations and $p_{\mathrm{cart}}$ penalizes deviations from the goal position. $p_{\mathrm{conf}}$ is modeled as a zero-mean Gaussian distribution with a diagonal covariance matrix, where the angle $\alpha_1$ of the first joint has a variance of $1$ and the remaining joint angles $\alpha_2, \dots, \alpha_{10}$ have a variance of $4 \times 10^{-2}$.

Formally, we define the locations of the robot joints by

$$x_i(\alpha) = \sum_{j=1}^{i} \cos(\alpha_j), \quad i = 0, \dots, 10,$$

$$y_i(\alpha) = \sum_{j=1}^{i} \sin(\alpha_j), \quad i = 0, \dots, 10.$$


In the Robot1 task there is one goal at $(7, 0)$, and we specify

$$p_{\mathrm{cart}}(\alpha) = \mathcal{N}\left(\begin{pmatrix} x_{10}(\alpha) \\ y_{10}(\alpha) \end{pmatrix}; \begin{pmatrix} 7 \\ 0 \end{pmatrix}, 10^{-4} I\right), \tag{34}$$

i.e., a Gaussian distribution centered at the Cartesian coordinates of the goal position, with a variance of $10^{-4}$ in both directions.

In the Robot4 task there are 4 goals at $(\pm 7, 0)$ and $(0, \pm 7)$, and so $p_{\mathrm{cart}}$ is given by the maximum over the four respective Gaussian distributions as in (34) (up to a constant of proportionality). Groundtruth samples are generated by long slice sampling runs (Neal, 2003) and taken from the repository of Arenz et al. (2020).

Mixture distributions (GMM and MoS). For the GMM and MoS tasks, we define a mixture distribution with $m$ components as

$$p_{\mathrm{target}} = \frac{1}{m} \sum_{i=1}^{m} p_i.$$

The Gaussian Mixture Model (GMM), taken from Blessing et al. (2024), consists of $m = 40$ mixture components with

$$p_i = \mathcal{N}(\mu_i, I), \quad \mu_i \sim \mathcal{U}_d(-40, 40),$$

where $\mathcal{U}_d(l, u)$ refers to a uniform distribution on $[l, u]^d$. We take $d = 50$ for the main experiments.

The Mixture of Student's t-distributions (MoS), taken from Blessing et al. (2024), comprises $m = 10$ Student's t-distributions $t_2$, where the 2 refers to the degrees of freedom. Specifically, we use

$$p_i = t_2 + \mu_i, \quad \mu_i \sim \mathcal{U}_d(-10, 10),$$

where $\mu_i$ refers to the translation of the individual components, and take $d = 50$. For both the GMM and MoS tasks, the $\mu_i$ are fixed throughout experiments, i.e., selected with the same random seed.

Funnel. The Funnel target introduced in Neal (2003) is a challenging funnel-shaped distribution given by

$$p_{\mathrm{target}}(x) = \mathcal{N}(x_1; 0, \sigma^2) \, \mathcal{N}(x_2, \dots, x_{10}; 0, \exp(x_1) I),$$

with $\sigma^2 = 9$ for any number of dimensions $d \geq 2$. We take $d = 10$ in our main experiments.
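Because the funnel density factorizes into a Gaussian for the first coordinate and conditionally Gaussian remaining coordinates, both its log-density and exact samples are easy to write down. The sketch below assumes that exp(x₁) is the conditional variance of the remaining coordinates; the function names are ours.

```python
import math
import random


def funnel_log_density(x, sigma2=9.0):
    """Log-density of the funnel: x[0] ~ N(0, sigma2), x[1:] | x[0] ~ N(0, exp(x[0]) I)."""
    lp = -0.5 * (math.log(2.0 * math.pi * sigma2) + x[0] ** 2 / sigma2)
    var = math.exp(x[0])
    for xi in x[1:]:
        lp += -0.5 * (math.log(2.0 * math.pi * var) + xi ** 2 / var)
    return lp


def sample_funnel(d=10, sigma2=9.0, rng=None):
    """Draw one exact sample by ancestral sampling."""
    rng = rng or random.Random()
    x1 = rng.gauss(0.0, math.sqrt(sigma2))
    scale = math.exp(0.5 * x1)  # standard deviation of the remaining coordinates
    return [x1] + [rng.gauss(0.0, scale) for _ in range(d - 1)]


sample = sample_funnel(rng=random.Random(0))
```

Ancestral sampling of this kind is one way to obtain groundtruth samples for computing sample-based metrics on this target.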

Many-Well (MW). A typical problem in molecular dynamics considers sampling from the stationary distribution of Langevin dynamics. In our example we consider a $d$-dimensional many-well potential, corresponding to the (unnormalized) density

$$\rho_{\mathrm{target}}(x) = \exp\left(-\sum_{i=1}^{m} (x_i^2 - \delta)^2 - \frac{1}{2} \sum_{i=m+1}^{d} x_i^2\right).$$

In line with Berner et al. (2024); Sun et al. (2024), we take $d = 5$, $m = 5$, and $\delta = 4$, leading to $2^m = 32$ well-separated modes. Groundtruth $\log Z$ and samples can be obtained by noting that the distribution factors over dimensions.
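As a small illustration of the mode structure, the sketch below evaluates a many-well log-density and enumerates its modes. It assumes the commonly used form with double-well terms (x_i² - δ)² in the first m coordinates and a standard Gaussian term in the remaining ones; the function names are ours.

```python
import math
from itertools import product


def many_well_log_density(x, m=5, delta=4.0):
    """Unnormalized log-density: double wells in the first m coordinates, Gaussian in the rest."""
    wells = sum((xi ** 2 - delta) ** 2 for xi in x[:m])
    gauss = 0.5 * sum(xi ** 2 for xi in x[m:])
    return -(wells + gauss)


def many_well_modes(d=5, m=5, delta=4.0):
    """All 2^m modes, located at +/- sqrt(delta) in each double-well coordinate."""
    r = math.sqrt(delta)
    return [list(signs) + [0.0] * (d - m) for signs in product((-r, r), repeat=m)]


modes = many_well_modes()
```

Each mode attains the maximal unnormalized log-density of zero, while the origin, which lies on the barrier between the wells, has log-density -m δ².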

A.5 EXPERIMENTAL DETAILS

In this section, we describe the experimental setup and evaluation protocol. We also discuss design choices for our main experiments as well as how our hyperparameters are selected.

A.5.1 METRICS AND EVALUATION

Maximization of the ELBO. The ELBO refers to a lower bound on log Z. This is a classic benchmark for samplers, and higher ELBOs are usually associated with precise sampling from discovered modes. However, the ELBO is not necessarily indicative of mode collapse; see Blessing et al. (2024) and Appendix A.6.4 for details.

Minimization of the Sinkhorn distance. The Sinkhorn distance $W_2$ is an optimal transport (OT) distance. When computed between a set of generated samples and a groundtruth set of samples from the target (when the latter is available), this gives an estimate of the OT distance from the distribution generated by the sampler to $p_{\mathrm{target}}$. As discussed further in Blessing et al. (2024), low OT distances are associated with good mode coverage (i.e., avoiding mode collapse).

For both ELBO and optimal transport evaluation, we follow the protocol of Blessing et al. (2024). In particular, we use the Sinkhorn distance as implemented in Cuturi et al. (2022) and use standard formulas for the ELBO computations of our baselines. For SCLD, the ELBO computation is stated in Algorithm 2. We compute all performance criteria 100 times during training using 2000 samples, applying a running average with a length of 5 over these evaluations to obtain robust results within a single run. To ensure robustness across runs, we use four different random seeds and average the best results from each run. As we use the same evaluation protocol as Blessing et al. (2024), we re-use their results for DDS and PIS whenever available.
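The aggregation described above (running average over evaluations, best value per run, mean over seeds) can be sketched as follows. The exact windowing convention is an assumption on our side: here only full windows of length five are averaged, and the function names are ours.

```python
def running_average(values, window=5):
    """Average over every full window of consecutive evaluations."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def aggregate(runs, higher_is_better=True, window=5):
    """Best smoothed value per run (seed), averaged across runs."""
    pick = max if higher_is_better else min
    best = [pick(running_average(run, window)) for run in runs]
    return sum(best) / len(best)


# Toy example with two seeds and six evaluations each (hypothetical numbers).
runs = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.0, 0.0, 0.0, 0.0, 0.0, 10.0]]
```

For the ELBO one would use `higher_is_better=True`, for the Sinkhorn distance `higher_is_better=False`; smoothing prevents a single lucky evaluation (such as the final value of the second toy run) from dominating the reported result.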

As discussed in Blessing et al. (2024), ELBO metrics are insensitive to mode collapse and, as such, may not accurately reflect the quality of samples on multimodal tasks. As groundtruth samples are available for the synthetic tasks considered and due to their generally multimodal nature, we report Sinkhorn distances for these tasks.

A.5.2 DESIGN CHOICES

We adhere to the following principles:

• SCLD follows design choices of other methods when these are shared.

• SCLD reuses the hyperparameter choices of baseline methods when shared such that it is not tuned excessively.

• Baseline methods should be given as much or more computational budget compared to SCLD.

General remarks. For SCLD, CMCD, DDS, and PIS we take the convention that T = 1 as rescaling time is equivalent to rescaling the noise level. Since the objectives of DDS and the Time-Reversed Diffusion Sampler (DIS) (Berner et al., 2024) only differ by choice of the reference process (see also Berner et al. (2024, Appendix A.10.1), Richter & Berner (2024, Section 3), and Vargas et al. (2024, Appendix C.3)), we do not explicitly compare against DIS in this work.

CMCD and SCLD. As SCLD and CMCD share numerous design choices, we mostly follow the choices of CMCD as in Vargas et al. (2024). In particular, we opt to learn the prior as well as the annealing schedule. For the former, we define $p_{\mathrm{prior}} := \mathcal{N}(\mu_\theta, \mathrm{diag}(\exp(2\ell_\theta)))$. In other words, we parameterize the Gaussian prior through its mean $\mu_\theta \in \mathbb{R}^d$ and logarithmic standard deviations $\ell_\theta \in \mathbb{R}^d$, initialized to $\mathcal{N}(0, \sigma^2 I)$, i.e., $\mu_\theta = 0$ and $(\ell_\theta)_i = \log(\sigma)$, for some $\sigma > 0$ (referred to as initial scale) to be tuned. We update $\mu_\theta$ and $\ell_\theta$ via the reparameterization trick as training progresses. For learning the annealing, we parameterize the schedule in (20) for every $j \in \{1, \dots, NL\}$ by

$$\beta_\theta(jh) = \frac{\sum_{i=1}^{j} \mathrm{softplus}(\theta_i)}{\sum_{i=1}^{NL} \mathrm{softplus}(\theta_i)}, \tag{35}$$

where $\theta_i \in \mathbb{R}$ are learnable parameters. We choose the buffer size to be 20 times the training batch size, i.e., $B = 20K$. Moreover, we parametrize the control $u$ as in (33). For SCLD, we use the subtrajectory settings from Section 3.
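A learnable annealing schedule of this kind can be sketched as follows, under the assumption that the schedule value at step j is the cumulative sum of the first j softplus-transformed parameters, normalized by the total sum, so that it increases monotonically to one. The function names are ours.

```python
import math


def softplus(x):
    """Numerically stable softplus log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def annealing_schedule(theta):
    """Monotone schedule: normalized cumulative sums of softplus(theta_i)."""
    increments = [softplus(t) for t in theta]
    total = sum(increments)
    betas, acc = [], 0.0
    for inc in increments:
        acc += inc
        betas.append(acc / total)
    return betas


uniform = annealing_schedule([0.0, 0.0, 0.0, 0.0])
skewed = annealing_schedule([3.0, 0.0, -3.0, -3.0])
```

Equal parameters give a linear schedule, while larger early parameters move most of the annealing to the beginning of the trajectory; positivity of softplus guarantees monotonicity for any parameter values.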

CRAFT. We use the implementation by Blessing et al. (2024), following the standard settings of Matthews et al. (2022). Specifically, we employ diagonal affine flows as the transport maps.

SMC operations. We use the same resampling strategy and MCMC kernel for CRAFT, SMC, and SCLD. In particular, every SMC step consists of adaptive resampling with a threshold of 0.3K, followed by one Hamiltonian Monte Carlo (HMC) step with 10 leapfrog steps. For details on the advanced SMC schemes (SMC-ESS and SMC-FC), we refer to Buchholz et al. (2021) and Appendix A.6.8.

Optimization and batch size. We utilize the Adam optimizer for all methods that require learning. We also found that clipping gradients to 1 was important for stable training on all diffusion-based methods. We use batch size 2000 for training except for LGCP, where batch size 300 is used. We always evaluate with K = 2000 particles.


Number of annealing / diffusion steps. For SMC, DDS, PIS, CMCD, and SCLD in the main experiments, we fix 128 steps. In particular, we have L = 128/N for SCLD. For CRAFT, we sweep over [4, 8, 128] annealing steps (which also define the number of SMC operations).

Number of training iterations. We select the number of training iterations such that all methods are given roughly the same number of target function evaluations (NFEs) for a given number of SMC operations or subtrajectories $N$, evaluations per SMC operation $M$, and annealing or diffusion steps per subtrajectory $L$. In our setup, $M = 10$ due to the 10 leapfrog steps in HMC, $(N, L) = (1, 128)$ for DDS, PIS, and CMCD, and $N \in \{4, 8, 128\}$ for CRAFT (with $L = 1$) and SCLD (with $L = 128/N$). As a reference value, we use 40000 iterations for DDS and PIS as in Blessing et al. (2024). We report the chosen number of iterations for each method in Table 4.

Table 4: Number of training iterations for our considered method depending on the number of SMC operations or subtrajectories N. The last rows show the approximate number of target function evaluations (NFEs) per particle in each iteration w.r.t. the number of evaluations per SMC operation M and annealing or diffusion steps per subtrajectory L.

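| | CRAFT | SCLD | DDS, PIS, CMCD-KL, CMCD-LV |
|---|---|---|---|
| N = 1 | | | $4 \cdot 10^4$ |
| N = 4 | $10^5$ | $2.5 \cdot 10^4$ | |
| N = 8 | $5 \cdot 10^4$ | | |
| N = 128 | $3 \cdot 10^3$ | $3 \cdot 10^3$ | |
| Approx. NFEs per particle | MN | MN + L | L |
| Our setup | L = 1, M = 10 | LN = 128, M = 10 | N = 1, L = 128 |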

We note that all baselines converged satisfactorily within the given iteration budget. Moreover, the generous budget of 40000 iterations for DDS, PIS, and CMCD required running for 4 to 20 times as long as SCLD's training process on an equivalent architecture for our considered tasks (see also Table 9).

A.5.3 HYPERPARAMETER SELECTION

General remarks. We follow the spirit of the experimental design in Blessing et al. (2024) to fairly compare SCLD with our diverse range of baselines. We describe the search space and selection procedure below. We select the best configuration based on the target metric and a single seed. We note that alternative experimental setups, such as the one in Vargas et al. (2024), are possible, leveraging the ability of CMCD and SCLD to learn further hyperparameters end-to-end or to use variational mean field approximations (see Appendix A.6.6) instead of a grid search.

Prior scale. For all methods that require a $\mathcal{N}(0, \sigma^2 I)$ prior, we sweep over σ in [0.1, 1, 10] for tasks where we have no information about the target. For the GMM40 and MoS tasks, we know that the initial scale should be around 40 and 15, respectively, by construction of the problem, so we fix these values for all methods. Similarly, for the Robot tasks, we know that the coordinates correspond to radial angles, so we set the initial scale to 2 to cover the $[-\pi, \pi]$ range.

Diffusion noise schedule. For diffusion-based samplers, a noise schedule σ as in (6) needs to be specified. For PIS, we use a linear noise schedule as in Zhang & Chen (2022), and for DDS, CMCD, and SCLD we use a cosine schedule as in Vargas et al. (2023). Both noise schedules are parameterized by a minimum and a maximum diffusion coefficient. We set the minimum diffusion noise level to 0.01 for all tasks and methods except the Robot tasks, where we set it to 0.001. For all methods and tasks, we perform grid searches over the maximum diffusion parameter. For all tasks except the Robot and GMM40 tasks, we search in [0.1, 1, 10]. Due to the large initial scale of GMM40, we search the maximum diffusion parameter over [5, 10, 20]. For Robot, we search it in [0.003, 0.03, 0.3] instead of the usual grid due to the constructed sharpness of the modes.

Architecture. For hyperparameter selection on CMCD-KL, CMCD-LV and SCLD, we use the PISGradNet architecture (with detached score and 2 hidden layers of 64 units) for all diffusion-based methods as in Vargas et al. (2023). However, for CMCD-KL, we found that using the simpler MLP architecture described in Vargas et al. (2024) (which we term PISNet) gave significantly better performance than PISGradNet on most tasks. As such, to ensure strong baselines for CMCD-KL and CMCD-LV, we also select the best architecture among PISGradNet and PISNet (with 2 hidden layers of 90 units to ensure similar parameter counts), re-sweeping learning rates as necessary. For SCLD we use PISGradNet on all tasks.

CMCD-KL, CMCD-LV, DDS, and PIS. We jointly grid search the initial scale and maximum diffusion along with the learning rates. We use one learning rate for the model $\widetilde{u}_\theta$ and prior $p_{\mathrm{prior}}$, and another for the annealing schedule β. We sweep over the learning rate of the model in [$10^{-3}$, $10^{-4}$, $10^{-5}$] and the learning rate of the annealing schedule in [$10^{-2}$, $10^{-3}$]. We perform model selection using 8000 gradient steps instead of 40000 due to the large grid.
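To give a sense of the grid size, the joint search can be enumerated as in the following sketch. It assumes a CMCD-style method on a task without prior information, so that the default grids for the initial scale and the maximum diffusion apply; the names `configurations`, `select_best`, and the `evaluate` callback are hypothetical and only stand in for a short training run followed by evaluation of the target metric.

```python
import itertools

# Default grids quoted in this appendix (task-specific grids differ, e.g. for Robot and GMM40).
grid = {
    "initial_scale": [0.1, 1.0, 10.0],
    "max_diffusion": [0.1, 1.0, 10.0],
    "model_lr": [1e-3, 1e-4, 1e-5],
    "schedule_lr": [1e-2, 1e-3],
}


def configurations(grid):
    """Yield every combination of the grid as a dictionary."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callback (higher is better, e.g. an ELBO or a
    negated Sinkhorn distance obtained after a shortened training run).
    """
    return max(configs, key=evaluate)


configs = list(configurations(grid))
print(len(configs))  # 3 * 3 * 3 * 2 = 54 configurations per method and task
```

With 54 configurations per method and task, shortening each selection run from 40000 to 8000 gradient steps reduces the cost of the sweep by a factor of five.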

SMC. For all tasks not present in Blessing et al. (2024) (namely the Robot tasks and MW54), we search the same parameter grid used for other methods for the scale of the prior, jointly with HMC step sizes. For all tasks present in the benchmark, we re-use their results and SMC configuration for SCLD and CRAFT. We tuned the step size of HMC, using different step sizes for t < T/2 and t > T/2 (where time corresponds to annealing steps in CRAFT) in the same fashion as Blessing et al. (2024). We search step sizes in the set [0.001, 0.01, 0.5, 0.1, 0.2].

CRAFT. We sweep over [4, 8, 128] for the number of annealing steps, jointly with the prior scale and the learning rate (also in [$10^{-3}$, $10^{-4}$, $10^{-5}$]), and choose the best value. As in Blessing et al. (2024), we re-use the HMC step sizes that were tuned for SMC. Our results uniformly reproduce or improve upon those presented in the aforementioned paper due to the extended search space.

SCLD. To ensure a fair comparison with baseline methods, we reuse the chosen scale and diffusion parameters of CMCD-LV as well as the HMC step sizes tuned for SMC. The only grid search we perform for SCLD is over the learning rate of the model in [$10^{-3}$, $10^{-4}$] and the learning rate of the annealing schedule in [$10^{-2}$, $10^{-3}$]. However, as reflected in Table 5, setting all learning rates to $10^{-3}$ typically turned out to be a robust choice.

Table of hyperparameter choices. In Tables 5 and 6 we present the tuned hyperparameters we obtained. Please note that PGN refers to the PISGradNet architecture, whereas PN refers to the PISNet architecture. We refer to Blessing et al. (2024) for further details and design choices for PIS and DDS. In Table 6, we specify the hyperparameters for DDS, PIS, and SMC on tasks not present in Blessing et al. (2024).

Table 5: Hyperparameter choices of our considered methods for the tasks in Blessing et al. (2024).

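| Method | Hyperparameter | Funnel (10d) | MW54 (5d) | Robot1 (10d) | Robot4 (10d) | GMM40 (50d) | MoS (50d) | Seeds (26d) | Sonar (61d) | Credit (25d) | Brownian (32d) | LGCP (1600d) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMCD-KL | Initial scale | 1.0 | 1.0 | 2.0 | 2.0 | 40.0 | 15.0 | 1.0 | 0.1 | 10.0 | 0.1 | 1.0 |
| CMCD-KL | Maximum diffusion | 10.0 | 1.0 | 0.03 | 0.03 | 10.0 | 10.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 |
| CMCD-KL | Architecture | PN | PGN | PGN | PGN | PN | PGN | PN | PN | PGN | PN | PGN |
| CMCD-KL | Model LR | 0.001 | 0.0001 | 0.001 | 0.001 | 0.0001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0001 |
| CMCD-KL | Annealing schedule LR | 0.01 | 0.001 | 0.001 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.01 | 0.01 |
| CMCD-LV | Initial scale | 1.0 | 1.0 | 2.0 | 2.0 | 40.0 | 15.0 | 1.0 | 1.0 | 0.1 | 0.1 | 1.0 |
| CMCD-LV | Maximum diffusion | 1.0 | 10.0 | 0.03 | 0.03 | 20.0 | 1.0 | 1.0 | 1.0 | 0.1 | 1.0 | 10.0 |
| CMCD-LV | Architecture | PN | PN | PGN | PGN | PGN | PN | PGN | PN | PN | PGN | PGN |
| CMCD-LV | Model LR | 0.001 | 0.0001 | 0.0001 | 0.001 | 0.0001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0001 |
| CMCD-LV | Annealing schedule LR | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.01 | 0.001 |
| SCLD | Model LR | 0.001 | 0.0001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| SCLD | Annealing schedule LR | 0.01 | 0.01 | 0.01 | 0.001 | 0.001 | 0.001 | 0.01 | 0.01 | 0.01 | 0.001 | 0.001 |
| CRAFT | Number of steps | 128 | 128 | 8 | 4 | 4 | 128 | 128 | 128 | 8 | 128 | 128 |
| CRAFT | LR | 0.001 | 0.00001 | 0.001 | 0.001 | 0.00001 | 0.001 | 0.0001 | 0.0001 | 0.0001 | 0.001 | 0.001 |
| CRAFT | Initial scale | 1.0 | 1.0 | 2.0 | 2.0 | 40.0 | 15.0 | 0.1 | 1.0 | 1.0 | 1.0 | 1.0 |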

Experimental details. Here, we provide additional details on the experiments in the main part of the paper.

Improved convergence (Figure 3). All experiments were performed on a single Nvidia RTX4090 GPU using the same settings as the main experiments.


Table 6: Hyperparameter choices of DDS, PIS, and SMC for the tasks not present in Blessing et al. (2024).

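| Method | Hyperparameter | Robot1 | Robot4 | MW54 |
|---|---|---|---|---|
| SMC | Initial scale | 2.0 | 2.0 | 1.0 |
| SMC | HMC step sizes | [0.001, 0.01] | [0.01, 0.001] | [0.01, 0.001] |
| DDS | Initial scale | 2.0 | 2.0 | 0.1 |
| DDS | Maximum diffusion | 0.3 | 0.3 | 10.0 |
| DDS | LR | 0.001 | 0.001 | 0.00001 |
| PIS | Maximum diffusion | 0.3 | 0.3 | 10.0 |
| PIS | LR | 0.00001 | 0.00001 | 0.00001 |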

Varying the number of SMC steps (Figure 4). For this study, we train for 8000 gradient steps in all instances and vary the number of subtrajectories at training and evaluation time. Apart from that, we use the same hyperparameters and procedures as in the main experiments. In particular, the total number of annealing steps is fixed to 128.

A.6 ADDITIONAL EXPERIMENTS

In this section, we present additional experiments.

A.6.1 ABLATION STUDIES OF SCLD

In Figure 6, we study the effect of removing various parts of SCLD on several tasks. We investigate the use of the buffer, resampling, and MCMC steps. For this experiment, all other design choices are kept the same as in the main experiments. In particular, the reported results for the full SCLD algorithm here coincide with those in the main experiments up to variation due to seeds. On the other hand, the No (Buffer,Resampling,MCMC) Algorithm corresponds to CMCD-LV with subtrajectories.

In all studied cases except the Seeds task, the addition of each component (MCMC, resampling, and buffer) improves performance (we use a logarithmic scale for clarity on the Robot task). In the case of the Seeds task, the performances of all choices are effectively the same (note the small range of the y-axis). In summary, this study shows that none of our components are redundant.

A.6.2 REMOVING MCMC COMPONENTS

Here, we investigate the effect of not using MCMC steps during training. This is an interesting question because, unlike SMC methods with deterministic transitions like CRAFT, where MCMC steps are needed to remove the particle degeneracy caused by resampling steps, our stochastic transitions do this automatically. As such, it is possible to remove MCMC steps from the SCLD training procedure, and we investigate the effect of doing so here, as it offers potentially accelerated training.
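The degeneracy argument can be illustrated with a small toy experiment. The following is a sketch under simplifying assumptions (one-dimensional particles, arbitrary illustrative weights, an uncontrolled Langevin step for a standard normal target, and hypothetical function names); it is not the SCLD transition itself.

```python
import math
import random


def count_unique(xs):
    return len(set(xs))


def deterministic_step(x, dt=0.1):
    # Deterministic transport (e.g. a flow map): duplicated particles stay duplicated.
    return x - dt * x


def langevin_step(x, rng, dt=0.1, sigma=1.0):
    # Euler-Maruyama step of dX = (sigma^2 / 2) * grad log pi(X) dt + sigma dW
    # for pi = N(0, 1), i.e. grad log pi(x) = -x. The injected noise separates duplicates.
    return x - 0.5 * sigma**2 * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(1000)]
weights = [math.exp(-abs(x)) for x in particles]  # arbitrary weights, for illustration only

# Multinomial resampling duplicates some particles and drops others.
resampled = rng.choices(particles, weights=weights, k=len(particles))

after_flow = [deterministic_step(x) for x in resampled]
after_sde = [langevin_step(x, rng) for x in resampled]

print(count_unique(resampled), count_unique(after_flow), count_unique(after_sde))
```

After the deterministic map, the number of distinct particles is no larger than after resampling, whereas a single stochastic step makes all particles distinct again without any MCMC move.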

Table 7: ELBOs attained by SCLD when removing MCMC steps during training and evaluation.

| ELBO (↑) | Seeds (26d) | Sonar (61d) | Credit (25d) | Brownian (32d) | LGCP (1600d) |
|---|---|---|---|---|---|
| SCLD | **-73.45 ± 0.01** | **-108.17 ± 0.25** | **-504.46 ± 0.09** | **1.00 ± 0.18** | **486.77 ± 0.70** |
| SCLD-No MCMC | -73.48 ± 0.03 | -109.39 ± 1.10 | -504.72 ± 0.34 | 0.82 ± 0.09 | 415.83 ± 19.53 |

Table 8: Sinkhorn distances attained by SCLD when removing MCMC steps during training and evaluation.

| Sinkhorn (↓) | Funnel (10d) | MW54 (5d) | Robot1 (10d) | Robot4 (10d) | GMM40 (50d) | MoS (50d) |
|---|---|---|---|---|---|---|
| SCLD | **134.23 ± 8.39** | **0.44 ± 0.06** | **0.31 ± 0.04** | **0.40 ± 0.01** | **3787.73 ± 249.75** | **656.10 ± 88.97** |
| SCLD-No MCMC | 147.38 ± 7.84 | **0.44 ± 0.05** | **0.31 ± 0.04** | **0.41 ± 0.01** | 3929.52 ± 753.27 | 1252.87 ± 183.95 |

Using the same experimental setting as the main experiments, we compare the effect of omitting MCMC steps during training and evaluation in Tables 7 and 8. Unsurprisingly, removing MCMC steps has an adverse effect on performance. However, in many cases, the difference is small. In particular, on tasks where only 4 subtrajectories were used (Robot1, Robot4, GMM40, MW54), the effect was negligible, as MCMC steps did not feature prominently in the training process in the first place. On the other tasks, where 128 SMC steps were employed, the impact on performance was larger. However, the performance was still competitive with other approaches, noting that we did not increase the number of gradient steps. Overall, using SCLD without MCMC steps is a viable option. It is also plausible that increased noise levels could help compensate for the lack of additional randomness.

[Figure 6 plot: four panels; axis labels include Sinkhorn Distance; compared variants include No (Buffer,Resampling,MCMC) and No (Buffer,Resampling).]

Figure 6: Ablation study of the different components of SCLD on four tasks. We sequentially add MCMC steps, resampling, and a prioritized replay buffer to CMCD-LV with subtrajectories (corresponding to the No (Buffer,Resampling,MCMC) method) to arrive at our proposed SCLD method. We observe that on most tasks, each of these components improves performance.

A.6.3 TIMINGS

In Table 9, we report the time taken per gradient step on each task for each of the methods in the main table (except SMC, which does not require training), using the same hyperparameters as for the main experiments. We worked in the JAX framework (Bradbury et al., 2018) and used jitting, discarding the first iteration. We average across 3 seeds on a single Nvidia RTX4090 GPU for 500 iterations. Dynamic memory allocation via XLA_PYTHON_CLIENT_ALLOCATOR=platform was required for CMCD-KL on GMM40 to fit within the memory limit, resulting in slower runtimes.
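The timing protocol (discard the first, compilation-heavy iteration, then average over the remaining ones) can be sketched in plain Python as follows. The helper name `average_step_time` and the `step_fn` callback are hypothetical; this is not the benchmarking code used for Table 9.

```python
import time


def average_step_time(step_fn, n_iters=500, n_warmup=1):
    """Average wall-clock time per call of `step_fn`, excluding warm-up calls.

    With a jitted JAX function, the first call includes compilation, and
    `step_fn` should block on its outputs (e.g. by calling
    `.block_until_ready()` on a returned array), since dispatch is asynchronous.
    """
    for _ in range(n_warmup):
        step_fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    return (time.perf_counter() - start) / n_iters


# Toy usage with a dummy step; a real gradient step would be passed instead.
seconds_per_step = average_step_time(lambda: sum(range(1000)), n_iters=500)
```

Averaging the result over several seeds, as done for Table 9, further reduces the influence of run-to-run noise.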

Table 9: Average time per gradient step for all considered methods and tasks.

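| Time (s) | Brownian | Credit | LGCP | Seeds | Sonar | Funnel | GMM40 | MW54 | Robot1 | Robot4 | MoS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CMCD-KL | 0.21 | 0.14 | 0.39 | 0.12 | 0.13 | 0.14 | 0.58 | 0.10 | 0.24 | 0.24 | 0.20 |
| CMCD-LV | 0.34 | 1.41 | 0.42 | 0.13 | 0.18 | 0.10 | 0.14 | 0.09 | 0.11 | 0.12 | 0.11 |
| SCLD | 0.13 | 0.15 | 1.48 | 0.07 | 0.11 | 0.07 | 0.12 | 0.07 | 0.08 | 0.08 | 0.09 |
| CRAFT | 0.06 | 0.004 | 0.89 | 0.03 | 0.04 | 0.02 | 0.01 | 0.02 | 0.01 | 0.05 | 0.04 |
| DDS | 0.04 | 0.03 | 0.11 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.04 | 0.04 | 0.03 |
| PIS | 0.04 | 0.03 | 0.10 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.04 | 0.04 | 0.03 |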


The dimension of the target, the number of SMC operations, and the difficulty of evaluating the target all significantly influence the computation time. It may seem strange that SCLD, with the added complexity of SMC steps, was generally faster than the CMCD variants. This can be attributed to two points. First, SCLD detaches the trajectory due to the use of the off-policy log-variance loss, unlike CMCD-KL, which results in a simplified computation graph, saving both time and memory. We refer to Richter & Berner (2024) for a full discussion on detaching in the log-variance loss. Second, due to our use of the subtrajectory-based LV loss, the gradients for each subtrajectory can be computed independently and in parallel, improving speed over CMCD-LV. Please note, however, that timings are highly dependent on implementation details.

A.6.4 ESTIMATIONS OF THE NORMALIZING CONSTANT

When the true normalizing constant Z for a density is known, another benchmark often used to evaluate a sampler is to study how accurately it can estimate Z or log Z. It is, however, known that for multimodal tasks, methods that achieve good log Z estimates often do so at the expense of mode collapse. Conversely, methods that avoid mode collapse sometimes yield poor log Z estimates (Blessing et al., 2024). Indeed, applying (tuned) CRAFT to the GMM40 (50d) task achieves a log Z estimation error of 3.63 (the true value is log 1 = 0), making it one of the better-performing methods for the task. While this may sound impressive, note that 3.63 ≈ log 40 corresponds to sampling perfectly from exactly 1 of the 40 modes (as evidenced by Figure 7). Thus, while CRAFT achieves relatively good estimates of the true log Z, it performs poorly as a sampler. Likewise, when SCLD is optimized for Sinkhorn distances, it often has worse estimation errors but achieves significantly better sample quality.
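The link between a log Z error of about log 40 and complete mode collapse can be checked on a toy problem. The sketch below assumes a one-dimensional, equal-weight mixture of 40 well-separated unit-variance Gaussians (instead of the 50-dimensional GMM40 task) and a sampler that draws exactly from a single component; all names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, std=1.0):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


n_modes = 40
means = [100.0 * k for k in range(n_modes)]  # well-separated modes (toy assumption)


def log_target(x):
    # Normalized equal-weight mixture, so the true log Z is log 1 = 0.
    return logsumexp([log_normal_pdf(x, m) for m in means]) - math.log(n_modes)


rng = random.Random(0)
collapsed_mode = means[7]  # the sampler only ever visits this mode
samples = [rng.gauss(collapsed_mode, 1.0) for _ in range(2000)]

# Importance-sampling estimate of log Z with the collapsed sampler as proposal.
log_w = [log_target(x) - log_normal_pdf(x, collapsed_mode) for x in samples]
log_z_hat = logsumexp(log_w) - math.log(len(log_w))

print(log_z_hat, -math.log(n_modes))  # both approximately -3.689
```

Every importance weight is essentially 1/40, so the estimator has almost no variance and an absolute error of log 40 ≈ 3.69, even though 39 of the 40 modes are never sampled.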

Figure 7: CRAFT only samples from one mode of GMM40 (50d).

Acknowledging this trade-off between log Z estimation and mode collapse, we present two sets of results for CRAFT, CMCD-KL, CMCD-LV, and SCLD, corresponding to the log Z estimation error when methods are optimized for Sinkhorn distances (named CRAFT-SD, SCLD-SD, CMCD-KL-SD, CMCD-LV-SD) and when methods are optimized for log Z estimation (named CRAFT-log Z, SCLD-log Z, CMCD-KL-log Z, CMCD-LV-log Z). In Table 10, we present errors of normalizing constant estimations on a selection of tasks where true log Z values are available, averaged over 4 seeds and using the same evaluation protocol as the main experiments. For this experiment, results for DDS and PIS are also taken from Blessing et al. (2024) when available.

Table 10: log Z estimations for different tasks.

| log Z error (↓) | Funnel (10d) | MW54 (5d) | GMM40 (2d) | GMM40 (50d) | MoS (50d) |
|---|---|---|---|---|---|
| SMC | 0.19 ± 0.09 | 1.45 ± 1.53 | 0.08 ± 0.03 | 761.93 ± 21.55 | 3.88 ± 1.76 |
| PIS | 0.92 ± 0.60 | 0.36 ± 0.07 | 0.27 ± 0.01 | 7.12 ± 0.63 | 12.25 ± 0.33 |
| DDS | 0.19 ± 0.08 | 3.34 ± 0.08 | 0.01 ± 0.01 | 1.74 ± 0.44 | 7.95 ± 0.30 |
| CRAFT-SD | 0.10 ± 0.02 | 0.16 ± 0.05 | 0.02 ± 0.02 | 6295.25 ± 144.71 | 0.75 ± 0.19 |
| CRAFT-log Z | 0.10 ± 0.02 | 0.16 ± 0.05 | 0.02 ± 0.01 | 3.63 ± 0.05 | 0.75 ± 0.19 |
| CMCD-KL-SD | **0.04 ± 0.01** | 1.65 ± 0.10 | 0.01 ± 0.00 | 3.53 ± 0.12 | 2.72 ± 0.45 |
| CMCD-KL-log Z | **0.04 ± 0.01** | 1.65 ± 0.10 | 0.01 ± 0.00 | 3.53 ± 0.12 | 2.19 ± 0.36 |
| CMCD-LV-SD | 0.24 ± 0.10 | **0.01 ± 0.01** | 0.01 ± 0.00 | 1.45 ± 0.35 | 3.04 ± 0.41 |
| CMCD-LV-log Z | 0.18 ± 0.05 | **0.01 ± 0.01** | **0.00 ± 0.00** | 1.45 ± 0.35 | 3.04 ± 0.41 |
| SCLD-SD | 0.09 ± 0.01 | 0.14 ± 0.03 | 0.02 ± 0.01 | 7.10 ± 4.05 | **0.05 ± 0.03** |
| SCLD-log Z | 0.09 ± 0.01 | **0.01 ± 0.00** | 0.02 ± 0.01 | **0.77 ± 0.66** | **0.05 ± 0.03** |


We found that using SMC at evaluation time (with the same configuration as during training) consistently improved log Z estimate quality for SCLD and consequently used it for all tasks. We maintain the same subtrajectory settings as we did for the main experiments. SCLD significantly outperforms all other methods on the GMM40 (50d) and MoS tasks and is best or a close second on the other tasks. This illustrates that our method can also be adjusted to target better log Z estimates.

A.6.5 THE LEARNED ANNEALING SCHEDULE

For CMCD-KL, CMCD-LV, and SCLD, we found that using a learned annealing schedule as in (35) is crucial to obtaining good results. We illustrate this in Figure 8 with a case study on SCLD, visualizing the linearly interpolated annealing schedule, i.e., (20) with β(t) = t/T, and the learned annealing schedule in (35) for the 2-dimensional GMM40 task.

Figure 8: We compare the uniform annealing schedule with the annealing schedule learned by SCLD for 0 ≤ t ≤ T/2. SCLD is able to learn a more gradual annealing schedule, which potentially allows transitions between adjacent densities to be learned more easily.

A.6.6 MEAN FIELD PRIOR FOR SCLD

While we opt for a prior of the form $\mathcal{N}(0, \sigma^2 I)$ for SCLD in our main experiments, an alternative approach is to initialize it using a diagonal Gaussian trained using Mean Field Variational Inference (MFVI) (Bishop, 2006). We study this design choice experimentally here.

We use 50000 iterations of MFVI with batch size 2000 and constant learning rate $10^{-3}$, initializing with $\mathcal{N}(0, I)$. We retain the same experimental setup and hyperparameter settings as for the main experiments, except for the max diffusion coefficient, where we divide the values from the main experiments by 10. This is because MFVI is mode seeking, and so aims to cover a high probability region of the target distribution tightly, leading to a prior with smaller support. We compare the attained ELBOs in Table 11 and also report results for SCLD-MFVI at initialization (i.e., without training the control), termed "No Train".

Table 11: Performance of SCLD when fitting the diagonal of the prior covariance matrix using MFVI at initialization ("No Train") and after training ("SCLD-MFVI").

| ELBOs (↑) | Brownian | Credit | LGCP | Seeds | Sonar |
|---|---|---|---|---|---|
| No Train | 1.07 ± 0.23 | -513.70 ± 0.70 | **500.42 ± 0.37** | -73.48 ± 0.05 | -114.89 ± 1.35 |
| SCLD | 1.00 ± 0.18 | **-504.46 ± 0.09** | 486.77 ± 0.70 | **-73.45 ± 0.01** | **-108.17 ± 0.25** |
| SCLD-MFVI | **1.14 ± 0.05** | -504.59 ± 0.15 | **500.56 ± 0.12** | **-73.44 ± 0.01** | -108.93 ± 0.34 |

Impressively, SCLD often achieves near-state-of-the-art results even without training when initialized with MFVI, such as on the LGCP task. We can attribute this to SCLD being initialized as an SMC sampler with Unadjusted Langevin Annealing (ULA) transition kernels as well as MCMC steps, which, in conjunction with the mode-seeking behavior of MFVI, leads to high ELBO values. SCLD-MFVI attains competitive performances on all tasks. Given that we performed no re-tuning on SCLD-MFVI, it is probable that with more careful setting and hyperparameter choices, even higher ELBOs could be attained.

However, using MFVI-fitted priors in practice often carries serious drawbacks. In line with the experiments of Blessing et al. (2024), we found that using MFVI priors leads to mode collapse (due to the mode-seeking nature of MFVI training restricting the sampling to a subset of the target modes), and thus potentially poor sample quality. We illustrate this in Figure 9 on the GMM40 (50d) target, where we use the same hyperparameters as in the main experiment except for the prior.

Figure 9: The samples drawn by SCLD when an MFVI-fitted prior is used. MFVI obtains a prior density that covers exactly one mode of the GMM40 distribution. As such, SCLD is unable to discover the other modes and experiences complete mode collapse. This is in contrast to Figure 2 where SCLD samples are visually indistinguishable from the target density.

A.6.7 COMPARISON TO PDDS

In this section, we empirically compare the PDDS and SCLD methods. We employ the exact experimental methodology of Phillips et al. (2024). In particular, we train for 20000 gradient steps, refreshing the model every 500 steps. We employed 50000 gradient steps to train the mean field prior. We note that this corresponds to a significantly higher iteration budget than was allocated to SCLD. In line with the findings of Phillips et al. (2024), we found that sweeping over the prior scale as opposed to using a variational approximation significantly degraded performance on all tasks (and indeed, on several tasks, such as Robot and GMM40, PDDS could not train at all). One reason for the degraded performance might be that PDDS is unable to further optimize the prior during training (as is done in SCLD). We thus opt to use variational approximations (by mean field Gaussians) to initialize the prior for all tasks. Benchmarking was done exactly as in the main experiments, and we analyzed the performance of PDDS with and without MCMC steps.

For all tasks present in the benchmark of Phillips et al. (2024) (including the Gaussian mixture tasks), we used the pre-tuned MCMC step sizes. For the other tasks, we chose a linearly interpolated step size schedule from t = 0 to t = T where step sizes at times 0 and T are taken from the grid [0.1, 0.3, 1, 3, 10] since the method for tuning MCMC step sizes was not specified. We select the best parameters directly based on the target metric and present the results in Tables 12 and 13.

Table 12: Comparison of SCLD against PDDS (Phillips et al., 2024) in terms of ELBOs.

| ELBOs (↑) | Brownian | Credit | LGCP | Seeds | Sonar |
|---|---|---|---|---|---|
| PDDS | **1.12 ± 0.23** | **-502.80 ± 0.72** | 499.35 ± 0.65 | -73.48 ± 0.21 | -108.61 ± 0.06 |
| PDDS-MCMC | 1.04 ± 0.04 | **-502.90 ± 0.28** | 499.83 ± 0.08 | -73.47 ± 0.19 | -108.67 ± 0.04 |
| SCLD (ours) | 1.00 ± 0.18 | -504.46 ± 0.09 | 486.77 ± 0.70 | **-73.45 ± 0.01** | **-108.17 ± 0.25** |
| SCLD-MFVI (ours) | **1.14 ± 0.05** | -504.59 ± 0.15 | **500.56 ± 0.12** | **-73.44 ± 0.01** | -108.93 ± 0.34 |

Table 13: Comparison of SCLD against PDDS (Phillips et al., 2024) in terms of Sinkhorn distances.

| Sinkhorn (↓) | Funnel | GMM40 | MW54 | Robot1 | Robot4 | MoS |
|---|---|---|---|---|---|---|
| PDDS | 145.81 ± 13.28 | 42157.92 ± 346.21 | 1.28 ± 0.18 | 3.36 ± 0.08 | 3.09 ± 0.16 | 3119.83 ± 98.64 |
| PDDS-MCMC | 151.02 ± 28.00 | 42157.92 ± 346.21 | 1.07 ± 0.25 | 3.35 ± 0.08 | 3.08 ± 0.14 | 3108.75 ± 98.61 |
| SCLD (ours) | **134.23 ± 8.39** | **3787.73 ± 249.75** | **0.44 ± 0.06** | **0.31 ± 0.04** | **0.40 ± 0.01** | **656.10 ± 88.97** |

PDDS attains comparable ELBOs to SCLD on the Bayesian statistics tasks. This is due to both methods being initialized as SMC samplers with a prior obtained by the same variational approximation (for SCLD-MFVI). We also observed, in line with the findings of Phillips et al. (2024) and similar to Appendix A.6.6, that often relatively little training is required to achieve optimal performance, so the gap in performance between the initial, untrained SMC scheme and the trained sampler is small.

However, PDDS consistently presents significantly worse Sinkhorn distances (on all tasks where this is available) than SCLD. This is due to the reliance of PDDS on using an MFVI prior, which, as discussed in Appendix A.6.6, is prone to mode collapse. On the other hand, SCLD is able to operate stably without relying on using the MFVI prior, avoiding mode collapse.

A.6.8 COMPARISON WITH ADVANCED SMC SCHEMES

In this section, we compare SCLD against two advanced SMC schemes implemented in the framework by Cabezas et al. (2024). We consider adaptive tempered SMC, which utilizes the constant-ESS method for choosing the annealing schedule as seen in Buchholz et al. (2021). We term this method SMC-ESS. In line with SCLD, we utilize a single HMC step for the SMC kernel with 10 leapfrog integration steps, and apply the same tuning procedure for the HMC step size as we did for our own SMC method. We additionally sweep over the ESS threshold α ∈ {0.3, 0.5, 0.75, 0.9, 0.95, 0.99}. Due to the large search grid, we run 10 seeds per task to mitigate outliers. Unlike SCLD, which uses multinomial resampling (for a fair comparison to our other baselines), we use systematic resampling (see, e.g., Chopin et al. (2020, Chapter 9)) for SMC-ESS, which we found led to the best performance. We consider another method from Buchholz et al. (2021), utilizing the full-covariance tuning approach for Independent Rosenbluth Metropolis-Hastings (IRMH) proposals (on top of using adaptive tempered SMC). We use 100 MCMC steps per SMC step and term this method SMC-FC.
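The constant-ESS rule for choosing the next annealing temperature can be sketched with a simple bisection. This is a generic illustration under stated assumptions, not the implementation of Cabezas et al. (2024) or Buchholz et al. (2021): the names `ess_fraction` and `next_temperature` are hypothetical, `log_ratio` denotes per-particle values of log ρ_target(x) − log p_prior(x) for a geometric annealing path, and the particles are held fixed during the search.

```python
import math
import random


def ess_fraction(log_w):
    """ESS / K for unnormalized log-weights."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))


def next_temperature(beta, log_ratio, alpha, tol=1e-8):
    """Largest step found by bisection such that the incremental weights
    exp((beta_next - beta) * log_ratio) keep ESS / K at (about) alpha."""

    def frac(b):
        return ess_fraction([(b - beta) * l for l in log_ratio])

    if frac(1.0) >= alpha:
        return 1.0  # the target can be reached in a single step
    lo, hi = beta, 1.0  # invariant: frac(lo) >= alpha > frac(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if frac(mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Toy usage with synthetic log-ratios (illustrative values only).
rng = random.Random(0)
log_ratio = [5.0 * rng.gauss(0.0, 1.0) for _ in range(2000)]
beta_next = next_temperature(0.0, log_ratio, alpha=0.5)
```

Because the step size depends on the current particles, the number of annealing steps is not known in advance, which is the source of the variable sampling times mentioned for these schemes.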

We report results in Tables 2 and 3, using the same evaluation protocol (in particular, using 2000 particles). For reference, we also compare all SMC methods with SCLD in Tables 14 and 15.

Table 14: Comparison of SCLD against advanced SMC methods (Buchholz et al., 2021) in terms of ELBOs.

| ELBOs (↑) | Brownian | Credit | LGCP | Seeds | Sonar |
|---|---|---|---|---|---|
| SMC | -2.21 ± 0.53 | -589.82 ± 5.72 | 385.75 ± 7.65 | -74.63 ± 0.14 | -111.50 ± 0.96 |
| SMC-ESS | 0.49 ± 0.19 | -505.57 ± 0.18 | **497.85 ± 0.11** | -74.07 ± 0.60 | -109.10 ± 0.17 |
| SMC-FC | -1.91 ± 0.04 | -505.30 ± 0.02 | -878.10 ± 2.20 | -74.07 ± 0.02 | -108.93 ± 0.02 |
| SCLD (ours) | **1.00 ± 0.18** | **-504.46 ± 0.09** | 486.77 ± 0.70 | **-73.45 ± 0.01** | **-108.17 ± 0.25** |

Table 15: Comparison of SCLD against advanced SMC methods (Buchholz et al., 2021) in terms of Sinkhorn distances.

| Sinkhorn (↓) | Funnel | GMM40 | MW54 | Robot1 | Robot4 | MoS |
|---|---|---|---|---|---|---|
| SMC | 149.35 ± 4.73 | 46370.34 ± 137.79 | 20.71 ± 5.33 | 24.02 ± 1.06 | 24.08 ± 0.26 | 3297.28 ± 2184.54 |
| SMC-ESS | **117.48 ± 9.70** | 24240.68 ± 50.52 | 1.11 ± 0.15 | 1.82 ± 0.50 | 2.11 ± 0.31 | 1477.04 ± 133.80 |
| SMC-FC | 211.43 ± 30.08 | 39018.27 ± 159.32 | 2.03 ± 0.17 | 0.37 ± 0.08 | 1.23 ± 0.02 | 3200.10 ± 95.35 |
| SCLD (ours) | 134.23 ± 8.39 | **3787.73 ± 249.75** | **0.44 ± 0.06** | **0.31 ± 0.04** | **0.40 ± 0.01** | **656.10 ± 88.97** |

The full-covariance tuning and the ESS-based scheme for selecting the annealing schedule significantly outperform our baseline implementation of SMC at the expense of longer and variable (possibly unbounded) sampling times. Nevertheless, all considered SMC methods are superseded by SCLD in performance on all but two tasks. While SCLD uses a relatively simple version of SMC for fair comparisons to our baselines, our framework enables the usage of more advanced techniques, such as those used for SMC-ESS and SMC-FC. Thus, we expect that the performance of SCLD can be even further improved.

A.6.9 CONVERGENCE OF DIFFERENT METHODS BY ITERATION COUNT

In Figure 10, we visualize the same data as in Section 3.1 but plotting by the number of elapsed gradient steps. In this perspective, the same conclusions hold that SCLD exhibits superior convergence properties, attaining the best ELBOs on each task for all numbers of gradient steps. Note that CRAFT was not competitive on the Credit task in this perspective.


[Figure 10 plot: four panels with x-axis "# Gradient Steps" (0 to 2000). Legend: SCLD, CMCD-KL, CMCD-LV, CRAFT, Long Run CMCD (Best).]

Figure 10: The same experiments as in Figure 3 plotted instead by iterations.

A.6.10 KL-BASED TRAINING OF SCLD

We compare KL and LV-based training of the SCLD algorithm, using the family of Funnel distributions with d ∈ {10, 20, 30, 40, 50} as a case study. We train SCLD using KL and LV losses with 4 and 128 subtrajectories as described in Section 2.3 for 3000 gradient steps using the same hyperparameters and settings (including learning the annealing schedule and prior) as in the d = 10 case for the main experiments. In Figure 11, we visualize the ELBOs attained (using the same settings during evaluation as for training) alongside CMCD-KL and CMCD-LV.

[Figure 11 plot: "ELBOs Across Varying Funnel Dimensions for SCLD Variants", x-axis "Funnel Dimension" (10 to 50). Legend: CMCD-kl, CMCD-lv, SCLD-KL-4-Subtraj, SCLD-LV-4-Subtraj, SCLD-KL-128-Subtraj, SCLD-LV-128-Subtraj.]

Figure 11: ELBOs across Varying Funnel Dimensions for different SCLD-variants

All methods except SCLD with LV loss and 128 subtrajectories experience some form of performance degradation as dimensions scale. For the LV loss, adding subtrajectories reduces the amount of performance degradation. This may be due to SMC steps counteracting increased dimensionality by focusing computation on high-density regions. For the KL loss, however, increasing the number of subtrajectories resulted in worse performance, especially as the dimension increased. Indeed, SCLD-KL with 128 subtrajectories scales the most poorly of the methods tried as d increases. As discussed in Section 2.3, this may be due to the use of importance sampling to estimate the loss function. Indeed, a set of importance weights is required for each subtrajectory to estimate the loss, and thus, using more subtrajectories demands a greater reliance on importance sampling. As the variance of importance sampling can increase significantly with dimension, this may account for the decreased performance of KL-based subtrajectory losses. In summary, this supports the hypothesis that losses avoiding importance sampling, such as the log-variance loss, are more suited to the training of SCLD on higher-dimensional tasks.
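The growth of importance sampling variance with dimension can be made concrete in a toy setting. The sketch below assumes a target p = N(δ·1, I_d) and a proposal q = N(0, I_d), which are not the densities used in SCLD; for this pair, E_q[w²] = exp(d·δ²), so the large-sample ESS fraction is exp(−d·δ²). The function names are hypothetical.

```python
import math
import random


def expected_ess_fraction(d, delta):
    """Large-sample limit of ESS / K, i.e. 1 / E_q[w^2] = exp(-d * delta^2)."""
    return math.exp(-d * delta**2)


def empirical_ess_fraction(d, delta, k, rng):
    """Monte Carlo estimate of ESS / K from k samples of the proposal q."""
    log_w = []
    for _ in range(k):
        s = sum(rng.gauss(0.0, 1.0) for _ in range(d))
        # log p(x) - log q(x) = delta * sum_i x_i - d * delta^2 / 2
        log_w.append(delta * s - 0.5 * d * delta**2)
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    return sum(w) ** 2 / (k * sum(x * x for x in w))


for d in (10, 20, 30, 40, 50):
    print(d, expected_ess_fraction(d, 0.2))
```

Even for a fixed per-coordinate mismatch δ, the ESS fraction decays exponentially in d, which is consistent with the hypothesis that losses relying on importance weights for every subtrajectory degrade more quickly as the dimension grows.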