Published as a conference paper at ICLR 2023

DENOISING DIFFUSION SAMPLERS

Francisco Vargas¹, Will Grathwohl² & Arnaud Doucet²
¹University of Cambridge, ²DeepMind

ABSTRACT

Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. One gradually adds noise to data using a diffusion to transform the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion initialized by Gaussian samples. Practically, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. We explore here a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results from denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schrödinger bridges and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.

1 INTRODUCTION

Let π be a probability density on ℝ^d of the form

π(x) = γ(x)/Z,  Z = ∫_{ℝ^d} γ(x)dx,  (1)

where γ : ℝ^d → ℝ⁺ can be evaluated pointwise but the normalizing constant Z is intractable. We are here interested in both estimating Z and obtaining approximate samples from π. A large variety of Monte Carlo techniques has been developed to address this problem.
In particular, Annealed Importance Sampling (AIS) (Neal, 2001) and its Sequential Monte Carlo (SMC) extensions (Del Moral et al., 2006) are often regarded as the gold standard to compute normalizing constants. Variational techniques are a popular alternative to Markov chain Monte Carlo (MCMC) and SMC where one considers a flexible family of easy-to-sample distributions q_θ whose parameters are optimized by minimizing a suitable metric, typically the reverse Kullback-Leibler discrepancy KL(q_θ||π). Typical choices for q_θ include mean-field approximations (Wainwright & Jordan, 2008) or normalizing flows (Papamakarios et al., 2021). To be able to model complex variational distributions, it is often useful to model q_θ(x) as the marginal of an auxiliary extended distribution, i.e. q_θ(x) = ∫ q_θ(x, u)du. As this marginal is typically intractable, θ is then learned by minimizing a discrepancy measure between q_θ(x, u) and an extended target p_θ(x, u) = π(x)p_θ(u|x) where p_θ(u|x) is an auxiliary conditional distribution (Agakov & Barber, 2004).

Over recent years, Monte Carlo techniques have also been fruitfully combined with variational techniques. For example, AIS can be thought of as a procedure where q_θ(x, u) is the joint distribution of a Markov chain defined by a sequence of MCMC kernels whose final state is x, while p_θ(x, u) is the corresponding AIS extended target (Neal, 2001). The parameters θ of these kernels can then be optimized by minimizing KL(q_θ||p_θ) using stochastic gradient descent (Wu et al., 2020; Geffner & Domke, 2021; Thin et al., 2021; Zhang et al., 2021; Doucet et al., 2022; Geffner & Domke, 2022).

Instead of following an AIS-type approach to define a flexible variational family, we follow here an approach inspired by Denoising Diffusion Probabilistic Models (DDPM), a powerful class of generative models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021c).

(Work done while at DeepMind.)
In this context, one adds noise progressively to data using a diffusion to transform the complex data distribution into a Gaussian distribution. The time-reversal of this diffusion can then be used to transform a Gaussian sample into a sample from the target. While superficially similar to Langevin dynamics, this process mixes fast even in high dimensions as it inherits the mixing properties of the forward diffusion (De Bortoli et al., 2021, Theorem 1). However, as the time-reversal involves the derivatives of the logarithms of the intractable marginal densities of the forward diffusion, these so-called scores are in practice approximated using score matching techniques. If the score estimation error is small, the approximate time-reversal still enjoys remarkable theoretical properties (De Bortoli, 2022; Chen et al., 2022; Lee et al., 2022).

These results motivate us to introduce Denoising Diffusion Samplers (DDS). Like DDPM, we consider a forward diffusion which progressively transforms the target π into a Gaussian distribution. This defines an extended target distribution p(x, u) = π(x)p(u|x). DDS are obtained by approximating the time-reversal of this diffusion using a process with distribution q_θ(x, u). What distinguishes DDS from DDPM is that we cannot simulate sample paths from the diffusion we want to time-reverse, as we cannot sample its initial state x from π. Hence score matching ideas cannot be used to approximate the score terms. We focus on minimizing KL(q_θ||p), equivalently maximizing an Evidence Lower Bound (ELBO), as in variational inference. We leverage a representation of this KL discrepancy based on the introduction of a suitable auxiliary reference process that provides low-variance estimates of this objective and its gradient. We can exploit the many similarities between DDS and DDPM to leverage some of the ideas developed in generative modeling for Monte Carlo sampling.
This includes using the probability flow ordinary differential equation (ODE) (Song et al., 2021c) to derive novel normalizing flows and the use of underdamped Langevin diffusions as a forward noising diffusion (Dockhorn et al., 2022). The implementation of these samplers requires designing numerical integrators for the resulting stochastic differential equations (SDE) and ODE. However, simple integrators such as the standard Euler-Maruyama scheme do not yield a valid ELBO in discrete time. So as to guarantee one obtains a valid ELBO, DDS relies instead on an integrator for an auxiliary stationary reference process which preserves its invariant distribution, as well as an integrator for the approximate time-reversal inducing a distribution absolutely continuous w.r.t. the distribution of the discretized reference process. Finally, we compare DDS experimentally to AIS, SMC and other state-of-the-art Monte Carlo methods on a variety of sampling tasks.

2 DENOISING DIFFUSION SAMPLERS: CONTINUOUS TIME

We start here by formulating DDS in continuous time to gain insight on the structure of the time-reversal we want to approximate. We introduce C = C([0, T], ℝ^d), the space of continuous functions from [0, T] to ℝ^d, and B(C), the Borel sets on C. We will consider in this section path measures, which are probability measures on (C, B(C)); see Léonard (2014a) for a formal definition. Numerical integrators are discussed in the following section.

2.1 FORWARD DIFFUSION AND ITS TIME-REVERSAL

Consider the forward noising diffusion given by an Ornstein-Uhlenbeck (OU) process¹

dx_t = −β_t x_t dt + σ√(2β_t) dB_t,  x_0 ∼ π,  (2)

where (B_t)_{t∈[0,T]} is a d-dimensional Brownian motion and t ↦ β_t is a non-decreasing positive function. This diffusion induces the path measure P on the time interval [0, T] and the marginal density of x_t is denoted p_t. The transition density of this diffusion is given by p_{t|0}(x_t|x_0) = N(x_t; √(1 − λ_t) x_0, σ²λ_t I) where λ_t = 1 − exp(−2∫_0^t β_s ds).
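The closed-form transition density above is easy to sanity-check numerically. The following sketch (plain NumPy; the constant schedule β_t ≡ β and the values of σ, T, x_0 are illustrative assumptions, not from the paper) simulates (2) on a fine Euler-Maruyama grid in one dimension and compares the empirical mean and variance of x_T against N(√(1 − λ_T) x_0, σ²λ_T).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, beta, T = 1.5, 1.0, 2.0           # illustrative constants, beta_t = beta
n_steps, n_paths, x0 = 2000, 200_000, 3.0
dt = T / n_steps

x = np.full(n_paths, x0)
for _ in range(n_steps):
    # Euler-Maruyama step for dx_t = -beta x_t dt + sigma sqrt(2 beta) dB_t
    x += -beta * x * dt + sigma * np.sqrt(2.0 * beta * dt) * rng.standard_normal(n_paths)

lam = 1.0 - np.exp(-2.0 * beta * T)      # lambda_T = 1 - exp(-2 int_0^T beta_s ds)
mean_exact = np.sqrt(1.0 - lam) * x0     # mean of p_{T|0}: sqrt(1 - lambda_T) x_0
var_exact = sigma**2 * lam               # variance of p_{T|0}: sigma^2 lambda_T
print(x.mean(), mean_exact)
print(x.var(), var_exact)
```

For ∫_0^T β_s ds = 2 the contraction factor √(1 − λ_T) = e^{−2} is already small, illustrating the transport of π towards N(0, σ²I) discussed next.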
From now on, we will always consider a scenario where ∫_0^T β_s ds ≫ 1 so that p_T(x_T) ≈ N(x_T; 0, σ²I). We can thus think of (2) as transporting approximately the target density π to this Gaussian density.

¹This is referred to as a Variance Preserving diffusion by Song et al. (2021c).

From (Haussmann & Pardoux, 1986), its time-reversal (y_t)_{t∈[0,T]} = (x_{T−t})_{t∈[0,T]}, where equality is here in distribution, satisfies

dy_t = β_{T−t}{y_t + 2σ²∇ln p_{T−t}(y_t)}dt + σ√(2β_{T−t}) dW_t,  y_0 ∼ p_T,  (3)

where (W_t)_{t∈[0,T]} is another d-dimensional Brownian motion. By definition this time-reversal starts from y_0 ∼ p_T(y_0) ≈ N(y_0; 0, σ²I) and is such that y_T ∼ π. This suggests that if we could approximately simulate the diffusion (3), then we would obtain approximate samples from π. However, putting this idea in practice requires being able to approximate the intractable scores (∇ln p_t(x))_{t∈[0,T]}. To achieve this, DDPM rely on score matching techniques (Hyvärinen, 2005; Vincent, 2011) as it is easy to sample from (2). This is impossible in our scenario as sampling from (2) requires sampling x_0 ∼ π, which is impossible by assumption.

2.2 REFERENCE DIFFUSION AND VALUE FUNCTION

In our context, it is useful to introduce a reference process defined by the diffusion following (2) but initialized at p_0^ref(x_0) = N(x_0; 0, σ²I) rather than π(x_0), thus ensuring that the marginals of the resulting path measure P^ref all satisfy p_t^ref(x_t) = N(x_t; 0, σ²I). From the chain rule for KL for path measures (see e.g. (Léonard, 2014b, Theorem 2.4) and (De Bortoli et al., 2021, Proposition 24)), one has KL(Q||P^ref) = KL(q_0||p_0^ref) + E_{x_0∼q_0}[KL(Q(·|x_0)||P^ref(·|x_0))]. Thus P can be identified as the path measure minimizing the KL discrepancy w.r.t. P^ref over the set of path measures Q with marginal q_0(x_0) = π(x_0) at time t = 0, i.e. P = arg min_Q {KL(Q||P^ref) : q_0 = π}.
A time-reversal representation of P^ref is given by (y_t)_{t∈[0,T]} = (x_{T−t})_{t∈[0,T]} satisfying

dy_t = −β_{T−t} y_t dt + σ√(2β_{T−t}) dW_t,  y_0 ∼ p_0^ref.  (4)

As β_{T−t}{y_t + 2σ²∇ln p_{T−t}^ref(y_t)} = −β_{T−t} y_t, we can rewrite the time-reversal (3) of P as

dy_t = β_{T−t}{−y_t + 2σ²∇ln φ_{T−t}(y_t)}dt + σ√(2β_{T−t}) dW_t,  y_0 ∼ p_T,  (5)

where φ_t(x) = p_t(x)/p_t^ref(x) is a so-called value function which can be shown to satisfy a Kolmogorov equation such that φ_t(x_t) = E_{P^ref}[φ_0(x_0)|x_t], the expectation being w.r.t. P^ref.

2.3 LEARNING THE TIME-REVERSAL

To approximate the time-reversal (3) of P, consider a path measure Q^θ whose time-reversal is induced by

dy_t = β_{T−t}{y_t + 2σ² s_θ(T−t, y_t)}dt + σ√(2β_{T−t}) dW_t,  y_0 ∼ N(0, σ²I),  (6)

so that y_t ∼ q_{T−t}^θ. To obtain s_θ(t, x) ≈ ∇ln p_t(x), we parameterize s_θ(t, x) by a neural network whose parameters are obtained by minimizing

KL(Q^θ||P) = KL(N(0, σ²I)||p_T) + E_{y_0∼N(0,σ²I)}[KL(Q^θ(·|y_0)||P(·|y_0))]  (7)
           = KL(N(0, σ²I)||p_T) + σ² E_{Q^θ}[∫_0^T β_{T−t} ||s_θ(T−t, y_t) − ∇ln p_{T−t}(y_t)||² dt],

where we have used the chain rule for KL and then Girsanov's theorem (see e.g. (Klebaner, 2012)). This expression of the KL is reminiscent of the expression obtained in (Song et al., 2021b, Theorem 1) in the context of DDPM. However, the expectation appearing in (7) is here w.r.t. Q^θ and not w.r.t. P, and we cannot get rid of the intractable scores (∇ln p_t(x))_{t∈[0,T]} using score matching ideas. Instead, taking inspiration from (5), we reparameterize the time-reversal of Q^θ using

dy_t = β_{T−t}{−y_t + 2σ² f_θ(T−t, y_t)}dt + σ√(2β_{T−t}) dW_t,  y_0 ∼ N(0, σ²I),  (8)

i.e. f_θ(t, x) = s_θ(t, x) − ∇ln p_t^ref(x) = s_θ(t, x) + x/σ². This reparameterization allows us to express KL(Q^θ||P) in a compact form.

Proposition 1. The Radon-Nikodym derivative dQ^θ/dP^ref(y_{[0,T]}) satisfies under Q^θ

ln (dQ^θ/dP^ref)(y_{[0,T]}) = σ² ∫_0^T β_{T−t} ||f_θ(T−t, y_t)||² dt + σ ∫_0^T √(2β_{T−t}) f_θ(T−t, y_t) · dW_t.  (9)

From the identity KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T∼q_0^θ}[ln(p_0^ref(y_T)/p_0(y_T))], it follows that

KL(Q^θ||P) = E_{Q^θ}[σ² ∫_0^T β_{T−t} ||f_θ(T−t, y_t)||² dt + ln(N(y_T; 0, σ²I)/π(y_T))].  (10)

For θ minimizing (10), approximate samples from π can be obtained by simulating (8) and returning y_T ∼ q_0^θ. We can obtain an unbiased estimate of Z via the following importance sampling identity

Ẑ = (γ(y_T)/N(y_T; 0, σ²I)) (dP^ref/dQ^θ)(y_{[0,T]}),  (11)

where the second term can be computed directly from (9) and y_{[0,T]} is obtained by solving (8).

2.4 CONTINUOUS-TIME NORMALIZING FLOW

To approximate the log-likelihood of DDPM, it was proposed by Song et al. (2021c) to rely on the probability flow ODE. This is an ODE admitting the same marginal distributions (p_t)_{t∈[0,T]} as the noising diffusion, given by dx_t = −β_t(x_t + σ²∇ln p_t(x_t))dt. We use here this ODE to design a continuous-time normalizing flow to sample from π. For θ such that f_θ(t, x) ≈ ∇ln φ_t(x) (obtained by minimization of the KL in (10)), we have x_t + σ²∇ln p_t(x_t) ≈ σ² f_θ(t, x_t). So it is possible to sample approximately from π by integrating the following ODE over [0, T]

dy_t = σ² β_{T−t} f_θ(T−t, y_t)dt,  y_0 ∼ N(0, σ²I).  (12)

If we denote by q_{T−t}^θ the distribution of y_t, then y_T ∼ q_0^θ is an approximate sample from π. We can use this normalizing flow to obtain an unbiased estimate of Z using importance sampling, Ẑ = γ(y_T)/q_0^θ(y_T) for y_T ∼ q_0^θ. Indeed, contrary to the proposal q_0^θ induced by (8), we can compute q_0^θ pointwise using the instantaneous change of variables formula (Chen et al., 2018) such that ln q_0^θ(y_T) = ln N(y_0; 0, σ²I) − σ² ∫_0^T β_{T−t} ∇·f_θ(T−t, y_t)dt, where (y_t)_{t∈[0,T]} arises from the ODE (12). The expensive divergence term can be estimated using the Hutchinson estimator (Grathwohl et al., 2018; Song et al., 2021c).

2.5 EXTENSION TO UNDERDAMPED LANGEVIN DYNAMICS

For DDPM, it has been proposed to extend the original state x ∈ ℝ^d by a momentum variable m ∈ ℝ^d.
One then diffuses the data distribution, augmented by a Gaussian distribution on the momentum, using an underdamped Langevin dynamics targeting N(x; 0, σ²I)N(m; 0, M) where M is a positive definite mass matrix (Dockhorn et al., 2022). It was demonstrated empirically that the resulting scores are smoother, hence easier to estimate, and this leads to improved performance. We adapt here this approach to Monte Carlo sampling; see Section B for more details. We diffuse π̄(x, m) = π(x)N(m; 0, M) using an underdamped Langevin dynamics, i.e.

dx_t = M⁻¹m_t dt,  dm_t = −(x_t/σ²)dt − β_t m_t dt + √(2β_t) M^{1/2} dB_t,  (x_0, m_0) ∼ π̄.  (13)

The resulting path measure on [0, T] is denoted P and the marginal of (x_t, m_t) is denoted η_t. Here, the reference process P^ref is defined by the diffusion (13) initialized using x_0 ∼ N(0, σ²I), m_0 ∼ N(0, M). This ensures that η_t^ref(x_t, m_t) = N(x_t; 0, σ²I)N(m_t; 0, M) for all t, and the time-reversal process (y_t, n_t)_{t∈[0,T]} = (x_{T−t}, m_{T−t})_{t∈[0,T]} (where equality is in distribution) of this stationary diffusion satisfies

dy_t = −M⁻¹n_t dt,  dn_t = (y_t/σ²)dt − β_{T−t} n_t dt + √(2β_{T−t}) M^{1/2} dW_t.  (14)

Using manipulations identical to Section 2.2, the time-reversal of P can also be written, for φ_t(x, m) := η_t(x, m)/η_t^ref(x, m), as dy_t = −M⁻¹n_t dt and

dn_t = (y_t/σ²)dt − β_{T−t} n_t dt + 2β_{T−t} M ∇_{n_t} ln φ_{T−t}(y_t, n_t)dt + √(2β_{T−t}) M^{1/2} dW_t.  (15)

To approximate P, we will consider a parameterized path measure Q^θ whose time-reversal is defined for (y_0, n_0) ∼ N(y_0; 0, σ²I)N(n_0; 0, M) by dy_t = −M⁻¹n_t dt and

dn_t = (y_t/σ²)dt − β_{T−t} n_t dt + 2β_{T−t} M f_θ(T−t, y_t, n_t)dt + √(2β_{T−t}) M^{1/2} dW_t.  (16)

We can then express the KL of interest in a compact way similar to Proposition 1; see Appendix B.2. A normalising flow formulation is presented in Appendix B.3.

Algorithm 1 DDS Training
1: Input: σ > 0, γ : ℝ^d → ℝ⁺, (β_k)_{k=1}^K ∈ (ℝ⁺)^K, θ ∈ ℝ^p, λ > 0
2: for i = 1, ..., number of training iterations do
3:   for n = 1, ..., N do  # Iterate over training batch size.
4:     Sample y_{0,n} ∼ N(0, σ²I) and set r_{0,n} = 0
5:     for k = 0, ..., K − 1 do  # Iterate over integration steps.
6:       λ_{K−k} := 1 − √(1 − α_{K−k})
7:       y_{k+1,n} = √(1 − α_{K−k}) y_{k,n} + 2σ²λ_{K−k} f_θ(K−k, y_{k,n}) + σ√(α_{K−k}) ε_{k,n},  ε_{k,n} i.i.d. ∼ N(0, I)
8:       r_{k+1,n} = r_{k,n} + 2σ²(λ_{K−k}²/α_{K−k}) ||f_θ(K−k, y_{k,n})||²
9:     end for
10:  end for
11:  θ ← θ − λ∇_θ (1/N) Σ_{n=1}^N [r_{K,n} + ln(N(y_{K,n}; 0, σ²I)/γ(y_{K,n}))]
12: end for
13: Return: θ

3 DENOISING DIFFUSION SAMPLERS: DISCRETE TIME

We introduce here discrete-time integrators for the SDEs and ODEs introduced in Section 2. Contrary to DDPM, we not only need an integrator for Q^θ but also for P^ref to be able to compute an approximation of the Radon-Nikodym derivative dQ^θ/dP^ref. Additionally, this integrator needs to be carefully designed, as explained below, to preserve an ELBO. For the sake of simplicity, we consider a constant discretization step δ > 0 such that K = T/δ is an integer. In the indices, we write k for t_k = kδ to simplify notation, e.g. x_k for x_{kδ}.

3.1 INTEGRATOR FOR P^ref

Proposition 1 shows that learning the time-reversal in continuous time can be achieved by minimising the objective KL(Q^θ||P) given in (10). This expression is obtained by using the identity KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T∼q_0^θ}[ln(p_0^ref(y_T)/p_0(y_T))]. This is also equivalent to maximizing the corresponding ELBO, E_{Q^θ}[ln Ẑ], for the unbiased estimate Ẑ of Z given in (11). An Euler-Maruyama (EM) discretisation of the corresponding SDEs is obviously applicable. However, it is problematic, as established below.

Proposition 2. Consider integrators of P^ref and Q^θ leading to approximations p^ref(x_{0:K}) and q^θ(x_{0:K}). Then KL(q^θ||p^ref) + E_{y_K∼q_0^θ}[ln(p_0^ref(y_K)/π(y_K))] is guaranteed to be non-negative and E_{Q^θ}[ln Ẑ] is an ELBO for Ẑ = (γ(y_K)p^ref(y_{0:K}))/(N(y_K; 0, σ²I)q^θ(y_{0:K})) if one has p_K^ref(y_K) = p_0^ref(y_K). The EM discretisation of P^ref does not satisfy this condition.

A direct consequence of this result is that the estimator for ln Z that uses the EM discretisation can be such that E_{Q^θ}[ln Ẑ] > ln Z, as observed empirically in Table 5.
To resolve this issue, we derive in the next section an integrator for P^ref which ensures that p_k^ref(y) = p_0^ref(y) for all k.

3.2 ORNSTEIN-UHLENBECK

We can integrate (4) exactly. This is given by y_0 ∼ N(0, σ²I) and

y_{k+1} = √(1 − α_{K−k}) y_k + σ√(α_{K−k}) ε_k,  ε_k i.i.d. ∼ N(0, I),  (17)

for α_k = 1 − exp(−2∫_{(k−1)δ}^{kδ} β_s ds).² This defines the discrete-time reference process. We propose the following exponential-type integrator (De Bortoli, 2022) for (8), initialized using y_0 ∼ N(0, σ²I),

y_{k+1} = √(1 − α_{K−k}) y_k + 2σ²(1 − √(1 − α_{K−k})) f_θ(K−k, y_k) + σ√(α_{K−k}) ε_k,  (18)

²Henceforth, it is assumed that α_k can be computed exactly.

Figure 1: Training loss per hyperparameter: PIS (left) vs DDS (right), on Ionosphere with K = 256 steps.

where ε_k i.i.d. ∼ N(0, I). Equations (17) and (18) define the time-reversals of the reference process p^ref(x_{0:K}) and proposal q^θ(x_{0:K}). For its time-reversal, we write q^θ(y_{0:K}) = N(y_0; 0, σ²I) ∏_{k=1}^K q_{k−1|k}^θ(y_{K−k+1}|y_{K−k}), slightly abusing notation, and similarly for p^ref(y_{0:K}). We will be relying on the discrete-time counterpart of (9).

Proposition 3. The log density ratio between q^θ(y_{0:K}) and p^ref(y_{0:K}) satisfies for y_{0:K} ∼ q^θ(y_{0:K})

ln (q^θ(y_{0:K})/p^ref(y_{0:K})) = Σ_{k=1}^K [2σ²(λ_k²/α_k) ||f_θ(k, y_{K−k})||² + 2σ(λ_k/√α_k) f_θ(k, y_{K−k}) · ε_k],  (19)

where λ_k := 1 − √(1 − α_k) and ε_k defined through (18) is such that ε_k i.i.d. ∼ N(0, I). In particular, one obtains from KL(q^θ||p) = KL(q^θ||p^ref) + E_{q_0^θ}[ln(N(y_K; 0, σ²I)/π(y_K))] that

KL(q^θ||p) = E_{q^θ}[Σ_{k=1}^K 2σ²(λ_k²/α_k) ||f_θ(k, y_{K−k})||² + ln(N(y_K; 0, σ²I)/π(y_K))].  (20)

We compute an unbiased gradient of this objective using the reparameterization trick and the JAX software package (Bradbury et al., 2018). The training procedure is summarized in Algorithm 1 (see also Appendix D).
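Two properties of this construction are easy to verify numerically: the exact step (17) leaves N(0, σ²I) invariant, and when f_θ ≡ 0 the proposal (18) coincides with the reference, so every term of (19) vanishes and Ẑ reduces to the plain importance-sampling ratio γ(y_K)/N(y_K; 0, σ²I). The sketch below (one-dimensional NumPy; the α_k values, σ and the toy target γ are illustrative assumptions) checks both.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, K, n = 2.0, 16, 400_000
alphas = np.linspace(0.05, 0.4, K)   # illustrative alpha_k values in (0, 1)

y = sigma * rng.standard_normal(n)   # y_0 ~ N(0, sigma^2) = p_0^ref
for k in range(K):
    a = alphas[K - 1 - k]
    # Exact OU step (17); with f_theta = 0 the proposal step (18) is identical.
    y = np.sqrt(1.0 - a) * y + sigma * np.sqrt(a) * rng.standard_normal(n)
# N(0, sigma^2) is preserved exactly: (1 - a) sigma^2 + a sigma^2 = sigma^2.
print(y.var())  # close to sigma^2 = 4

# With f_theta = 0, the log ratio (19) is zero term by term, so the estimate
# becomes Z_hat = gamma(y_K) / N(y_K; 0, sigma^2), ordinary importance sampling.
Z_true, mu, s = 2.0, 0.5, 1.0        # toy target: gamma = Z_true * N(mu, s^2)
gamma = lambda x: Z_true * np.exp(-0.5 * ((x - mu) / s) ** 2) / np.sqrt(2 * np.pi * s**2)
ref_pdf = lambda x: np.exp(-0.5 * (x / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)
Z_hat = (gamma(y) / ref_pdf(y)).mean()
print(Z_hat)  # unbiased estimate of Z_true = 2
```

Training a non-zero f_θ tilts the proposal towards the target, which reduces the variance of this estimator; the invariance check is exactly the property Proposition 2 requires of the integrator.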
Unfortunately, contrary to DDPM, q_{k|K}^θ is not available in closed form for k < K − 1, so we can neither use mini-batches over the time index k without having to simulate the process until the minimum sampled time, nor reparameterize x_k as in Ho et al. (2020). Once we obtain the parameter θ minimizing (20), DDS samples from q^θ using (18). The final sample y_K has a distribution q_0^θ approximating π by design. By using importance sampling, we obtain an unbiased estimate of the normalizing constant Ẑ = (γ(y_K)p^ref(y_{0:K}))/(N(y_K; 0, σ²I)q^θ(y_{0:K})) for y_{0:K} ∼ q^θ(y_{0:K}). Finally, Appendices B.4 and B.6 extend this approach to provide a similar way to discretize the underdamped dynamics proposed in Section 2.5. In this context, the proposed integrators rely on a leapfrog scheme (Leimkuhler & Matthews, 2016).

3.3 THEORETICAL GUARANTEES

Motivated by DDPM, bounds on the total variation between the target distribution and the distribution of the samples generated by a time-discretization of an approximate time-reversal of a forward noising diffusion were first obtained in (De Bortoli et al., 2021), then refined in (De Bortoli, 2022; Chen et al., 2022; Lee et al., 2022) and extended to the Wasserstein metric. These results are directly applicable to DDS because their proofs rely on assumptions on the score approximation error but not on the way these approximations are learned, i.e. via score matching for DDPM and reverse KL for DDS. For example, the main result in Chen et al. (2022) shows that if the true scores are L-Lipschitz, the L²(p_t) error on the scores is bounded and other mild integrability assumptions hold, then DDS outputs samples ε-close in total variation to π in O(L²d/ε²) time steps. As pointed out by the authors, this matches state-of-the-art complexity bounds for the Langevin Monte Carlo algorithm for sampling targets satisfying a log-Sobolev inequality, without having to make any log-concavity assumption on π.
However, the assumption on the approximation error for score estimates is less realistic for DDS than DDPM, as we do not observe realizations of the forward diffusion.

Figure 2: ln Z estimate (median plus upper/lower quartiles) as a function of the number of steps K: a) Funnel, b) LGCP, c) Logistic regression, Ionosphere dataset. The yellow dotted line is MF-VI and the dashed magenta line is the gold standard.

3.4 INTERPRETATION, RELATED WORK AND EXTENSIONS

DDS as KL and path integral control. The reverse KL we minimize can be expressed as

KL(q^θ||p) = E_{q^θ}[ln( (N(x_0; 0, σ²I) ∏_{k=1}^K q_{k−1|k}^θ(x_{k−1}|x_k)) / (π(x_0) ∏_{k=1}^K p_{k−1|k}^ref(x_{k−1}|x_k)) )].  (21)

This objective is a specific example of a KL control problem (Kappen et al., 2012). In continuous time, (10) corresponds to a path integral control problem; see e.g. (Kappen & Ruiz, 2016). Heng et al. (2020) use KL control ideas so as to sample from a target π and estimate Z. However, their algorithm relies on p^ref(x_{0:K}) being defined by a discretized non-homogeneous Langevin dynamics such that p_K(x_K) is typically not approximating a known distribution. Additionally, it approximates the value functions (φ_k)_{k=0}^K by regression on simple linear/quadratic functions. Finally, it relies on a good initial estimate of p(x_{0:K}) obtained through SMC. This limits the applicability of this methodology to a restricted class of models.

Connections to Schrödinger bridges. The Schrödinger bridge (SB) problem (Léonard, 2014a; De Bortoli et al., 2021) takes the following form in discrete time. Given a reference density p^ref(x_{0:K}), we want to find the density p^sb(x_{0:K}) s.t. p^sb = arg min_q {KL(q||p^ref) : q_0 = µ_0, q_K = µ_K} where µ_0, µ_K are prescribed distributions.
This problem can be solved using iterative proportional fitting (IPF), which is defined by the following recursion with initialization p¹ = p^ref

p^{2n} = arg min_q {KL(q||p^{2n−1}) : q_0 = µ_0},  p^{2n+1} = arg min_q {KL(q||p^{2n}) : q_K = µ_K}.  (22)

Consider the SB problem where µ_0(x_0) = π(x_0), µ_K(x_K) = N(x_K; 0, σ²I) and the time-reversal of p^ref(x_{0:K}) is defined through (17). In this case, p² = p corresponds to the discrete-time version of the noising process and p³ to the time-reversal of p but initialized at µ_K instead of p_K. This is the process DDS is approximating. As p_K ≈ µ_K for K large enough, we have approximately p^sb ≈ p³ ≈ p². We can thus think of DDS as approximating the solution to this SB problem.

Consider now another SB problem where µ_0(x_0) = π(x_0), µ_K(x_K) = δ_0(x_K) and p^ref(x_{0:K}) = δ_0(x_K) ∏_{k=0}^{K−1} N(x_k; x_{k+1}, δσ²I), i.e. p^ref is a pinned Brownian motion running backwards in time. This SB problem was discussed in discrete time in (Beghi, 1996) and in continuous time in (Föllmer, 1984; Dai Pra, 1991; Tzen & Raginsky, 2019). In this case, p²(x_{0:K}) = π(x_0)p^ref(x_{1:K}|x_0) is a modified noising process that transports π to the degenerate measure δ_0, and it is easily shown that p^sb = p². Sampling from the time-reversal of this measure would generate samples from π starting from δ_0. Algorithmically, Zhang et al. (2021) proposed approximating this time-reversal by a projection on some eigenfunctions. In parallel, Barr et al. (2020), Vargas et al. (2021) and Zhang & Chen (2022) approximated this SB by using a neural network parameterization of the gradient of the logarithm of the corresponding value function, trained by minimizing a reverse KL. We will adopt the Path Integral Sampler (PIS) terminology proposed by Zhang & Chen (2022) for this approach. DDS can thus be seen as an alternative to PIS which relies on a reference dynamics corresponding to an overdamped or underdamped OU process instead of a pinned Brownian motion.
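The IPF recursion (22) has a transparent finite-state illustration: when only the two end-point marginals are constrained, each KL half-projection keeps the conditionals of the current joint and simply rescales its rows (resp. columns) to match µ_0 (resp. µ_K), i.e. Sinkhorn-type scaling. The sketch below (NumPy; the reference joint and the marginals are random illustrative choices, not from the paper) runs the recursion until both constraints hold.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
p_ref = rng.random((n, n)); p_ref /= p_ref.sum()   # reference joint over (x_0, x_K)
mu0 = rng.random(n); mu0 /= mu0.sum()              # prescribed marginal at time 0
muK = rng.random(n); muK /= muK.sum()              # prescribed marginal at time K

p = p_ref.copy()
for _ in range(200):
    # p^{2n}: KL projection onto {q : q_0 = mu0} keeps conditionals, rescales rows
    p *= (mu0 / p.sum(axis=1))[:, None]
    # p^{2n+1}: KL projection onto {q : q_K = muK} rescales columns
    p *= (muK / p.sum(axis=0))[None, :]

print(np.abs(p.sum(axis=1) - mu0).max(), np.abs(p.sum(axis=0) - muK).max())
```

Both marginal errors are driven to numerical zero, giving the (two-marginal) SB coupling; the full path-space recursion discussed above has the same structure with path measures in place of matrices.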
Theoretically, the drift of the resulting time-reversal for DDS is not as steep as for PIS (see Appendix A.2) and empirically this significantly improves the numerical stability of the training procedure; see Figure 1. The use of a pinned Brownian motion is also not amenable to the construction of normalizing flows.

Figure 3: ln Z estimate as a function of the number of steps K: a) Logistic regression, Sonar dataset, b) Brownian motion, c) NICE. The yellow dotted line is MF-VI and the dashed magenta line is the gold standard.

Forward KL minimization. In Jing et al. (2022), diffusion model ideas are also used to sample unnormalized probability densities. The criterion being minimized therein is the forward KL as in DDPM, that is KL(p||q^θ). As samples from p are not available, an importance sampling approximation of p based on samples from q^θ is used to obtain an estimate of this KL and its gradient w.r.t. θ. The method is shown to perform well in low-dimensional examples but is expected to degrade significantly in high dimension as the importance sampling approximation will typically be poor in the first training iterations.

4 EXPERIMENTS

We present here experiments for Algorithm 1. In our implementation, f_θ follows the PIS-GRAD network proposed in (Zhang & Chen, 2022): f_θ(k, x) = NN₁(k, x; θ) + NN₂(k; θ) ∇ln π(x). Across all experiments we use a two-layer architecture with 64 hidden units each (for both networks), as in Zhang & Chen (2022), with the exception of the NICE (Dinh et al., 2014) target where we use 3 layers with 512, 256, and 64 units respectively. The final layers are initialised to 0 in order to make the path regularisation term null. We use α_k^{1/2} = α_max^{1/2} cos²((π/2)(1 − k/K + s)/(1 + s)) with s = 0.008 as in (Nichol & Dhariwal, 2021).
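The cosine schedule quoted above can be sketched as follows (the function name and the α_max value are our illustrative assumptions; s = 0.008 as stated in the text):

```python
import numpy as np

def cosine_alphas(K, alpha_max=0.9, s=0.008):
    # alpha_k^{1/2} = alpha_max^{1/2} * cos^2( (pi/2) * (1 - k/K + s) / (1 + s) )
    k = np.arange(1, K + 1)
    sqrt_alpha = np.sqrt(alpha_max) * np.cos(0.5 * np.pi * (1.0 - k / K + s) / (1.0 + s)) ** 2
    return sqrt_alpha ** 2

alphas = cosine_alphas(64)
print(alphas[0], alphas[-1])  # tiny noise increments at k = 1, largest near k = K
```

The schedule increases smoothly from near 0 to just below α_max, so the per-step noise injected by (17)/(18) stays within (0, 1) as required.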
We found that detaching the target score stabilised optimization in both approaches without affecting the final result. We adopt this across experiments; an ablation of this feature can be seen in Appendix C.9.1.

Across all tasks we compare DDS to SMC (Del Moral et al., 2006; Zhou et al., 2016), PIS (Barr et al., 2020; Vargas et al., 2021; Zhang & Chen, 2022), and mean-field VI (MF-VI) with a Gaussian variational distribution. We also compare DDS to AIS (Neal, 2001) and optimized variants of AIS using score matching (MCD) (Doucet et al., 2022; Geffner & Domke, 2022) for two standard Bayesian models. Finally, we explore a task introduced in (Doucet et al., 2022) that uses a pre-trained normalising flow as a target. Within this setting we propose a benchmarking criterion that allows us to assess mode collapse in high dimensions and explore the benefits of incorporating inductive biases into f_θ. We carefully tuned the hyper-parameters of all algorithms (e.g. step size, diffusion coefficient, and such); details can be found in Appendix C.2. Training times can be found in Appendix C.5. Additional experiments for the normalizing flows are presented in Appendix C.11 and for the underdamped approach in Appendix C.12. We note that these extensions did not bring any benefit compared to Algorithm 1.

4.1 BENCHMARKING TARGETS

We first discuss two standard target distributions which are often used to benchmark methods; see e.g. (Neal, 2003; Arbel et al., 2021; Heng et al., 2020; Zhang & Chen, 2022). Results are presented in Figure 2.

Funnel distribution: This challenging 10-dimensional distribution is given by γ(x_{1:10}) = N(x_1; 0, σ_f²)N(x_{2:10}; 0, exp(x_1)I), where σ_f² = 9 (Neal, 2003).³

³Zhang & Chen (2022) inadvertently considered the more favourable σ_f = 1 scenario for PIS but used σ_f = 3 for other methods. This explains the significant differences between their results and ours.
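For reference, the funnel's unnormalized log-density transcribes directly into code (a minimal sketch; the function name is ours):

```python
import numpy as np

def log_gamma_funnel(x, sigma_f2=9.0):
    # gamma(x_{1:10}) = N(x_1; 0, sigma_f^2) * N(x_{2:10}; 0, exp(x_1) I)
    x1, rest = x[0], x[1:]
    log_n1 = -0.5 * (np.log(2.0 * np.pi * sigma_f2) + x1**2 / sigma_f2)
    log_n2 = -0.5 * np.sum(np.log(2.0 * np.pi * np.exp(x1)) + rest**2 / np.exp(x1))
    return log_n1 + log_n2

print(log_gamma_funnel(np.zeros(10)))  # -> about -10.29
```

The difficulty for samplers is visible here: the conditional scale of x_{2:10} is exp(x_1/2), so the density narrows into a funnel as x_1 decreases.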
Log Gaussian Cox process: This model arises in spatial statistics (Møller et al., 1998). We use a d = M × M = 1600 grid, resulting in the unnormalized target density γ(x) = N(x; µ, K) ∏_{i∈[1:M]²} exp(x_i y_i − a exp(x_i)).

4.2 BAYESIAN MODELS

We explore two standard Bayesian models and compare them with standard AIS and MCD benchmarks (see Appendix C.8), in addition to the SMC and VI benchmarks presented so far. Results for Ionosphere can be found in the third pane of Figure 2 whilst Sonar and Brownian are in Figure 3.

Logistic regression: We set x ∼ N(0, σ_w² I), y_i ∼ Bernoulli(sigmoid(xᵀu_i)). This Bayesian logistic model is evaluated on two datasets, Ionosphere (d = 32) and Sonar (d = 61).

Brownian motion: We consider a discretised Brownian motion with a Gaussian observation model and a latent volatility as a time series model, d = 32. This model, proposed in the software package developed by Sountsov et al. (2020), is specified in more detail in Appendix C.3.

4.3 MODE COLLAPSE IN HIGH DIMENSIONS

Normalizing flow evaluation: Following Doucet et al. (2022), we train NICE (Dinh et al., 2014) on a down-sampled d = 14 × 14 variant of MNIST (LeCun & Cortes, 2010) and use the trained model as our target. As we can generate samples from our target, we evaluate the methods' samples by measuring the Sinkhorn distance between true and estimated samples. This evaluation criterion allows us to assess mode collapse for samplers in high-dimensional settings. Results can be seen in Table 1 and the third pane of Figure 3.

| Method | DDS | PIS | SMC | MCD |
|---|---|---|---|---|
| ln Z estimate | 3.204 ± 0.645 | 3.933 ± 0.754 | 4.255 ± 2.043 | 6.25 |
| W²_{γ=0.01}(π_true, π̂) | 658.079 | 658.778 | 750.245 | NA |

Table 1: Results on NICE target; we performed 30 runs with different seeds for each approach. For MCD we used the results from (Doucet et al., 2022).

5 DISCUSSION

We have explored the use of DDPM ideas to sample from unnormalized probability distributions and estimate their normalizing constants.
The DDS in Algorithm 1 is empirically competitive with state-of-the-art SMC and numerically more stable than PIS. This comes at the cost of a non-negligible training time compared to SMC. When accounting for it, SMC often provides better performance on simple targets. However, in the challenging multimodal NICE example, even a carefully tuned SMC sampler using Hamiltonian Monte Carlo transitions was not competitive with DDS. This is despite the fact that DDS (and PIS) are prone to mode dropping, as is any method relying on the reverse KL.

We have also investigated normalizing flows based on the probability flow ODE as well as DDS based on an underdamped dynamics. Our experimental results were disappointing in both cases in high-dimensional scenarios. We conjecture that more sophisticated numerical integrators need to be developed for the normalizing flows to be competitive, and that the neural network parameterization used in the underdamped scenario should be improved to better leverage the structure of the logarithmic derivative of the value functions.

Overall, DDS are a class of algorithms worth investigating and further developing. Much work has been devoted to successfully improving DDPM over the past two years including, among many others, modified forward noising mechanisms (Hoogeboom & Salimans, 2022), denoising diffusion implicit models (Song et al., 2021a) and sophisticated numerical integrators (Karras et al., 2022). It is highly likely that some of these techniques will lead to more powerful DDS. Advances in KL and path integral control might also be adapted to DDS to provide better training procedures; see e.g. Thalmeier et al. (2020).

REFERENCES

Felix V. Agakov and David Barber. An auxiliary variational method. In Advances in Neural Information Processing Systems, 2004.

Michael Arbel, Alex Matthews, and Arnaud Doucet. Annealed flow transport Monte Carlo. In International Conference on Machine Learning, 2021.
Ariel Barr, Willem Gispen, and Austen Lamacraft. Quantum ground states from reinforcement learning. In Mathematical and Scientific Machine Learning, 2020.

Alessandro Beghi. On the relative entropy of discrete-time Markov processes with given end-point densities. IEEE Transactions on Information Theory, 42(5):1529–1535, 1996.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022.

Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329, 1991.

Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, 2021.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped Langevin diffusion. In International Conference on Learning Representations, 2022.
Arnaud Doucet, Will Grathwohl, Alexander G de G Matthews, and Heiko Strathmann. Score-based diffusion meets annealed importance sampling. In Advances in Neural Information Processing Systems, 2022.

Hans Föllmer. An entropy approach to the time reversal of diffusion processes. Lecture Notes in Control and Information Sciences, 69:156–163, 1984.

Tomas Geffner and Justin Domke. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021.

Tomas Geffner and Justin Domke. Langevin diffusion variational inference. arXiv preprint arXiv:2208.07743, 2022.

Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2018.

Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, 2019.

Ulrich G Haussmann and Etienne Pardoux. Time reversal of diffusions. The Annals of Probability, 14(3):1188–1205, 1986.

Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. The Annals of Statistics, 48(5):2904–2929, 2020.

Matteo Hessel, David Budden, Fabio Viola, Mihaela Rosca, Eren Sezener, and Tom Hennigan. Optax: composable gradient transformation and optimisation, in JAX, 2020.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. arXiv preprint arXiv:2209.05557, 2022.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching.
The Journal of Machine Learning Research, 6:695–709, 2005.

Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. arXiv preprint arXiv:2206.01729, 2022.

Hilbert J Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, 2016.

Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Fima C Klebaner. Introduction to Stochastic Calculus with Applications. Imperial College Press, 2012.

Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. arXiv preprint arXiv:2209.12381, 2022.

Ben Leimkuhler and Charles Matthews. Molecular Dynamics with Deterministic and Stochastic Numerical Methods. Springer, 2016.

Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems-Series A, 34(4):1533–1574, 2014a.

Christian Léonard. Some properties of path measures. In Séminaire de Probabilités XLVI, pp. 207–230. Springer, 2014b.
Luke Metz, Niru Maheswaranathan, Ruoxi Sun, C Daniel Freeman, Ben Poole, and Jascha Sohl-Dickstein. Using a thousand optimization tasks to learn hyperparameter search strategies. arXiv preprint arXiv:2002.11887, 2020.

Jesper Møller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Radford M. Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617–2680, 2021.

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning, 2018.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, 2021b.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.

Tommi Sottinen and Simo Särkkä.
Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayesian Analysis, 3(3):555–584, 2008.

Pavel Sountsov, Alexey Radul, and contributors. Inference gym, 2020. URL https://pypi.org/project/inference_gym.

Dominik Thalmeier, Hilbert J Kappen, Simone Totaro, and Vicenç Gómez. Adaptive smoothing for path integral control. Journal of Machine Learning Research, 21:1–37, 2020.

Achille Thin, Nikita Kotelevskii, Alain Durmus, Eric Moulines, Maxim Panov, and Arnaud Doucet. Monte Carlo variational auto-encoders. In International Conference on Machine Learning, 2021.

Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, 2019.

Francisco Vargas, Andrius Ovsianas, David Fernandes, Mark Girolami, Neil Lawrence, and Nikolas Nüsken. Bayesian learning via neural Schrödinger–Föllmer flows. arXiv preprint arXiv:2111.10510, 2021.

Paul Vicol, Luke Metz, and Jascha Sohl-Dickstein. Unbiased gradient estimation in unrolled computation graphs with persistent evolution strategies. In International Conference on Machine Learning, 2021.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Hao Wu, Jonas Köhler, and Frank Noé. Stochastic normalizing flows. In Advances in Neural Information Processing Systems, 2020.

Benjamin Zhang, Tuhin Sahai, and Youssef Marzouk. Sampling via controlled stochastic dynamical systems. In I (Still) Can't Believe It's Not Better! NeurIPS 2021 Workshop, 2021.

Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, and Roger Grosse.
Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021.

Qinsheng Zhang and Yongxin Chen. Path integral sampler: a stochastic control approach for sampling. In International Conference on Learning Representations, 2022.

Yan Zhou, Adam M Johansen, and John AD Aston. Toward automatic model comparison: an adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701–726, 2016.

A.1 PROOF OF PROPOSITION 1

The Radon–Nikodym derivative expression (9) follows directly from an application of Girsanov's theorem (see e.g. Klebaner (2012)). The path measures P and P^ref are induced by two diffusions following the same dynamics but with different initial conditions, so that

P(ω) = P(ω|y_T) π(y_T) = P^ref(ω|y_T) π(y_T) = P^ref(ω|y_T) N(y_T; 0, σ²I) · π(y_T)/N(y_T; 0, σ²I) = P^ref(ω) · π(y_T)/N(y_T; 0, σ²I).

It thus follows directly that KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T ∼ q_0^θ}[ln( p_0^ref(y_T)/p_0(y_T) )]. Note that we apply Girsanov's theorem to the time reversals of the path measures Q^θ and P, using the fact that the Radon–Nikodym derivative between two path measures coincides with that between their respective time reversals. As W_t is a Brownian motion under Q^θ, the final expression (10) for KL(Q^θ||P) follows.

A.2 COMPARING DDS AND PIS DRIFTS FOR GAUSSIAN TARGETS

The optimal drifts for both PIS and DDS can be expressed in terms of the logarithmic derivative of the corresponding value function plus additional terms:

b^DDS(x, t) = β_{T−t}( −x + 2σ² ∇_x ln φ_{T−t}(x) ),  b^PIS(x, t) = σ² ∇_x ln φ_{T−t}(x),  (23)

where φ_t is the value function of DDS or PIS respectively. Recall that ln φ_t(x) = ln p_t(x) − ln p_t^ref(x). For a target π(x) = N(x; µ, Σ), with p_t^ref(x) = N(x; µ_t^ref, b_t^ref I) and p_{t|0}(x_t|x_0) = N(x_t; a_t x_0, b_t I), we obtain

p_t(x) = ∫ π(x_0) p_{t|0}(x|x_0) dx_0 = N(x; a_t µ, a_t² Σ + b_t I),

so

∇_x ln φ_t(x) = −(a_t² Σ + b_t I)^{−1}(x − a_t µ) + (b_t^ref)^{−1}(x − µ_t^ref).  (24)

Corollary 1.
For DDS, we have µ_t^ref = 0, b_t^ref = σ², a_t = √(1 − λ_t), b_t = σ² λ_t with λ_t = 1 − exp(−2 ∫_0^t β_s ds). So, for example, for σ = β_t = 1 and Σ = I, we have

∇_x ln φ_t(x) = µ exp(−t),  b^DDS(x, t) = −x + 2µ exp(−(T − t)).  (25)

Corollary 2. For PIS, the reference process is a pinned Brownian motion running backwards in time, with µ_t^ref = 0, b_t^ref = σ²(T − t), a_t = (T − t)/T, b_t = σ² t(T − t)/T. For σ = 1, Σ = I, we obtain

b^PIS(x, t) = −(a_{T−t}² + b_{T−t})^{−1}(x − a_{T−t} µ) + x/t.

In particular b^PIS(x, t) ≈ x/t − (x − µ) as t → 0.

Hence for PIS the drift function explodes close to the origin, in contrast to DDS. This explosion holds for any target π as it is only related to the fact that p_t^ref concentrates to δ_0 as t → T. This makes the drift harder to approximate with a neural network. Additionally, when discretizing the resulting diffusion, this means that smaller discretization steps should be used close to t = 0.

A.3 PROOF OF PROPOSITION 2

Proof. Let us express the discrete-time version of the reference process as

p^ref(y_{0:K}) = p_0^ref(y_0) ∏_{k=0}^{K−1} p^ref_{k+1|k}(y_{k+1}|y_k)  (27)

and denote by p_k^ref the corresponding marginal density of y_k, which satisfies

p^ref_{k+1}(y_{k+1}) = ∫ p^ref_{k+1|k}(y_{k+1}|y_k) p_k^ref(y_k) dy_k.  (28)

The backward decomposition of this joint distribution is given by

p^ref(y_{0:K}) = p_K^ref(y_K) ∏_{k=0}^{K−1} p^ref_{k|k+1}(y_k|y_{k+1}),  (29)
p^ref_{k|k+1}(y_k|y_{k+1}) = p^ref_{k+1|k}(y_{k+1}|y_k) p_k^ref(y_k) / p^ref_{k+1}(y_{k+1}).  (30)

If our chosen integrator induces a transition kernel p^ref_{k+1|k}(y_{k+1}|y_k) which is such that p_K^ref(y_K) = p_0^ref(y_K), then

p^ref(y_{0:K}) π(y_K)/p_0^ref(y_K) = π(y_K) ∏_{k=0}^{K−1} p^ref_{k|k+1}(y_k|y_{k+1})  (31)

is a valid (normalised) probability density. Hence it follows that

KL( q^θ(y_{0:K}) || p^ref(y_{0:K}) ) + E_{y_K ∼ q_0^θ}[ ln( p_0^ref(y_K)/π(y_K) ) ] = KL( q^θ(y_{0:K}) || p^ref(y_{0:K}) π(y_K)/p_0^ref(y_K) ).  (32)

If the integrator does not preserve the marginals, we have that

p^ref(y_{0:K}) π(y_K)/p_0^ref(y_K) = [ p_K^ref(y_K)/p_0^ref(y_K) ] π(y_K) ∏_{k=0}^{K−1} p^ref_{k|k+1}(y_k|y_{k+1}).
(33) This is not a probability density, thus the objective is no longer guaranteed to be non-negative, and consequently the expectation of our estimator of ln Z is not necessarily a lower bound on ln Z. Finally, simple calculations show that the Euler discretisation does not preserve the invariant distribution of P^ref in DDS.

A.4 PROOF OF PROPOSITION 3

Proof. We have

ln( q^θ(y_{0:K}) / p(y_{0:K}) ) = ln( q^θ(y_{0:K}) / p^ref(y_{0:K}) ) + ln( N(y_K; 0, σ²I) / π(y_K) ).  (34)

Now by construction, we have from (17) that

p^ref_{k−1|k}(y_{K−k+1}|y_{K−k}) = N(y_{K−k+1}; √(1−α_k) y_{K−k}, σ²α_k I)

and from (18) we obtain

q^θ_{k−1|k}(y_{K−k+1}|y_{K−k}) = N(y_{K−k+1}; √(1−α_k) y_{K−k} + 2σ²(1−√(1−α_k)) f_θ(k, y_{K−k}), σ²α_k I).

It follows that

ln( q^θ(y_{0:K}) / p^ref(y_{0:K}) ) = ln( q_K(y_0)/p_K^ref(y_0) ) + Σ_{k=1}^K ln( q^θ_{k−1|k}(y_{K−k+1}|y_{K−k}) / p^ref_{k−1|k}(y_{K−k+1}|y_{K−k}) )
= Σ_{k=1}^K (1/(2σ²α_k)) [ ||y_{K−k+1} − √(1−α_k) y_{K−k}||² − ||y_{K−k+1} − √(1−α_k) y_{K−k} − 2σ²(1−√(1−α_k)) f_θ(k, y_{K−k})||² ],  (35)

where we have exploited the fact that q_K(y_0) = p_K^ref(y_0) = N(y_0; 0, σ²I). Now using

ε_k := (1/(σ√α_k)) ( y_{K−k+1} − √(1−α_k) y_{K−k} − 2σ²(1−√(1−α_k)) f_θ(k, y_{K−k}) ),

we can rewrite (35) as

ln( q^θ(y_{0:K}) / p^ref(y_{0:K}) ) = Σ_{k=1}^K (1/2) [ ||ε_k + (2σ(1−√(1−α_k))/√α_k) f_θ(k, y_{K−k})||² − ||ε_k||² ]
= Σ_{k=1}^K [ (2σ²(1−√(1−α_k))²/α_k) ||f_θ(k, y_{K−k})||² + (2σ(1−√(1−α_k))/√α_k) f_θ(k, y_{K−k}) · ε_k ].  (36)

Now (19) follows directly from (34) and (36). Note that for δ ≪ 1 we have 2σ²(1−√(1−α_k))²/α_k ≈ σ²β_kδ and 2σ(1−√(1−α_k))/√α_k ≈ σ√(2β_kδ), as expected from (9). Finally, the final expression (20) of the KL now follows from the fact that

E_{q^θ}[ f_θ(k, y_{K−k}) · ε_k ] = E_{q_k^θ}[ E_{q^θ_{k−1|k}}[ f_θ(k, y_{K−k}) · ε_k | y_{K−k} ] ] = E_{q_k^θ}[ f_θ(k, y_{K−k}) · E_{q^θ_{k−1|k}}[ ε_k | y_{K−k} ] ] = 0.

A.5 ALTERNATIVE KULLBACK–LEIBLER DECOMPOSITION

A KL decomposition similar in spirit to the one developed for DDPM (Ho et al., 2020) can be derived. It leverages the fact that

p_{k−1|k,0}(x_{k−1}|x_k, x_0) = N(x_{k−1}; µ_k(x_k, x_0), σ² β̃_k I)  (37)

for

µ_k(x_k, x_0) = ( √(ᾱ_{k−1}) β_k / (1−ᾱ_k) ) x_0 + ( √(α_k)(1−ᾱ_{k−1}) / (1−ᾱ_k) ) x_k,  β̃_k = ( (1−ᾱ_{k−1}) / (1−ᾱ_k) ) β_k.

Proposition 4. The reverse Kullback–Leibler discrepancy KL(q^θ||p) satisfies

KL(q^θ||p) = E_{q^θ}[ KL(q_K(x_K)||p_{K|0}(x_K|x_0)) + Σ_{k=2}^K KL(q^θ_{k−1|k}(x_{k−1}|x_k)||p_{k−1|0,k}(x_{k−1}|x_0, x_k)) + KL(q^θ_{0|1}(x_0|x_1)||π(x_0)) ].
So for q^θ_{k−1|k}(x_{k−1}|x_k) = N(x_{k−1}; √(1−β_k) x_k + σ²β_k f_θ(k, x_k), σ²β_k I), the terms KL(q^θ_{k−1|k}(x_{k−1}|x_k)||p_{k−1|0,k}(x_{k−1}|x_0, x_k)) are KL divergences between two Gaussian distributions and can be calculated analytically. We found this decomposition to be numerically unstable and prone to diverging in our reverse KL setting.

Proof. The reverse KL can be decomposed as follows:

KL(q^θ||p) = E_{q^θ}[ ln( q^θ(x_{0:K}) / p(x_{0:K}) ) ] = E_{q^θ}[ ln( q^θ(x_{0:K}) / (π(x_0) p(x_{1:K}|x_0)) ) ] = E_{q^θ}[ ln( q^θ(x_{0:K}) / p(x_{1:K}|x_0) ) ] − E_{q^θ}[ ln π(x_0) ] = L(θ) − E_{q^θ}[ ln π(x_0) ],

where

L(θ) = E_{q^θ}[ ln( q^θ(x_{0:K}) / p(x_{1:K}|x_0) ) ] = E_{q^θ}[ ln q_K(x_K) + Σ_{k=1}^K ln( q^θ_{k−1|k}(x_{k−1}|x_k) / p_{k|k−1}(x_k|x_{k−1}) ) ].

Now using, for k ≥ 2, the identity

p_{k−1,k|0}(x_{k−1}, x_k|x_0) = p_{k−1|0}(x_{k−1}|x_0) p_{k|k−1}(x_k|x_{k−1}) = p_{k−1|0,k}(x_{k−1}|x_0, x_k) p_{k|0}(x_k|x_0),

we can rewrite L(θ) as

L(θ) = E_{q^θ}[ ln q_K(x_K) + Σ_{k=2}^K ln( q^θ_{k−1|k}(x_{k−1}|x_k) / p_{k−1|0,k}(x_{k−1}|x_0, x_k) · p_{k−1|0}(x_{k−1}|x_0)/p_{k|0}(x_k|x_0) ) + ln( q^θ_{0|1}(x_0|x_1) / p_{1|0}(x_1|x_0) ) ]
= E_{q^θ}[ ln( q_K(x_K)/p_{K|0}(x_K|x_0) ) + Σ_{k=2}^K ln( q^θ_{k−1|k}(x_{k−1}|x_k) / p_{k−1|0,k}(x_{k−1}|x_0, x_k) ) + ln q^θ_{0|1}(x_0|x_1) ]
= E_{q^θ}[ KL(q_K(x_K)||p_{K|0}(x_K|x_0)) ] + E_{q^θ}[ Σ_{k=2}^K KL(q^θ_{k−1|k}(x_{k−1}|x_k)||p_{k−1|0,k}(x_{k−1}|x_0, x_k)) + ln q^θ_{0|1}(x_0|x_1) ].

The result now follows directly.

B UNDERDAMPED LANGEVIN DYNAMICS

In the generative modeling context, it has been proposed to extend the original state x ∈ R^d with a momentum variable m ∈ R^d. One then diffuses in this extended space using underdamped Langevin dynamics (Dockhorn et al., 2022) targeting N(x; 0, σ²I)N(m; 0, M). It was demonstrated empirically that the resulting scores one needs to estimate are smoother, which led to improved performance. We adapt this approach to Monte Carlo sampling. This adaptation is non-trivial and in particular requires carefully designed numerical integrators.

B.1 CONTINUOUS TIME

We now consider an augmented target distribution π(x)N(m; 0, M) where M is a positive definite mass matrix. We then diffuse this extended target using the following underdamped Langevin dynamics, i.e.
dx_t = M^{−1} m_t dt,  (38)
dm_t = −(x_t/σ²) dt − β_t m_t dt + √(2β_t) M^{1/2} dB_t,

where x_0 ∼ π, m_0 ∼ N(0, M). The resulting path measure on [0, T] is again denoted P. From (Haussmann & Pardoux, 1986), the time-reversal process is also a diffusion, satisfying

dy_t = −M^{−1} n_t dt,  (39)
dn_t = (y_t/σ²) dt + β_{T−t} n_t dt + 2β_{T−t} M ∇_n ln η_{T−t}(y_t, n_t) dt + √(2β_{T−t}) M^{1/2} dW_t,

for (y_0, n_0) ∼ η_T, where η_t denotes the density of (x_t, m_t) under (38). Now consider a reference process P^ref on [0, T] defined by the forward process (38) initialized using x_0 ∼ N(0, σ²I), m_0 ∼ N(0, M). In this case one can check that η_t^ref(x_t, m_t) = N(x_t; 0, σ²I)N(m_t; 0, M) and the time-reversal process of P^ref satisfies

dy_t = −M^{−1} n_t dt,  (40)
dn_t = (y_t/σ²) dt + β_{T−t} n_t dt − 2β_{T−t} n_t dt + √(2β_{T−t}) M^{1/2} dW_t = (y_t/σ²) dt − β_{T−t} n_t dt + √(2β_{T−t}) M^{1/2} dW_t,

as ∇_n ln η_t^ref(y, n) = ∇_n ln( N(y; 0, σ²I)N(n; 0, M) ) = −M^{−1} n. Hence it follows that the time-reversal (39) of P can also be written as

dy_t = −M^{−1} n_t dt,  (41)
dn_t = (y_t/σ²) dt − β_{T−t} n_t dt + 2β_{T−t} M ∇_n ln φ_{T−t}(y_t, n_t) dt + √(2β_{T−t}) M^{1/2} dW_t,

where φ_t(x, m) := η_t(x, m)/η_t^ref(x, m). To approximate P, we consider a parameterized path measure Q^θ whose time reversal is defined for (y_0, n_0) ∼ N(y_0; 0, σ²I)N(n_0; 0, M) by dy_t = −M^{−1} n_t dt and

dn_t = (y_t/σ²) dt − β_{T−t} n_t dt + 2β_{T−t} M f_θ(T−t, y_t, n_t) dt + √(2β_{T−t}) M^{1/2} dW_t.  (42)

B.2 LEARNING THE TIME-REVERSAL THROUGH KL MINIMIZATION

To approximate P, we thus consider the parameterized diffusion whose time reversal is defined by

dy_t = −M^{−1} n_t dt,  (43)
dn_t = (y_t/σ²) dt − β_{T−t} n_t dt + 2β_{T−t} M f_θ(T−t, y_t, n_t) dt + √(2β_{T−t}) M^{1/2} dW_t,

for (y_0, n_0) ∼ N(y_0; 0, σ²I)N(n_0; 0, M), inducing a path measure Q^θ on the time interval [0, T]. We can now compute the Radon–Nikodym derivative between Q^θ and P^ref using an extension of Girsanov's theorem (Theorem A.3 in Sottinen & Särkkä (2008)):

ln( dQ^θ/dP^ref ) = ∫_0^T √(2β_{T−t}) M^{1/2} f_θ(T−t, y_t, n_t) · dW_t + (1/2) ∫_0^T ||2β_{T−t} M f_θ(T−t, y_t, n_t)||²_{(2β_{T−t} M)^{−1}} dt.  (44)

To summarize, we have the following proposition.

Proposition 5.
The Radon–Nikodym derivative dQ^θ/dP^ref(y_{[0,T]}, n_{[0,T]}) satisfies, under Q^θ,

ln( dQ^θ/dP^ref ) = ∫_0^T β_{T−t} ||f_θ(T−t, y_t, n_t)||²_M dt + ∫_0^T √(2β_{T−t}) M^{1/2} f_θ(T−t, y_t, n_t) · dW_t.  (45)

From KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{Q^θ}[ ln( p_0^ref(y_T, n_T)/η_0(y_T, n_T) ) ], it follows that

KL(Q^θ||P) = E_{Q^θ}[ ∫_0^T β_{T−t} ||f_θ(T−t, y_t, n_t)||²_M dt + ln( N(y_T; 0, σ²I)/π(y_T) ) ].  (46)

The second term on the r.h.s. of (46) follows from the fact that ln( p_0^ref(y_T, n_T)/p_0(y_T, n_T) ) = ln( N(y_T; 0, σ²I)/π(y_T) ), as the momentum factors cancel.

B.3 NORMALIZING FLOW THROUGH ORDINARY DIFFERENTIAL EQUATION

The following ODE gives exactly the same marginals η_t as the SDE (38) defining P:

dx_t = M^{−1} m_t dt,  (47)
dm_t = −(x_t/σ²) dt − β_t m_t dt − β_t M ∇_m ln η_t(x_t, m_t) dt = −(x_t/σ²) dt − β_t M ∇_m ln φ_t(x_t, m_t) dt.

Thus if we integrate its time reversal from 0 to T starting from (y_0, n_0) ∼ η_T,

dy_t = −M^{−1} n_t dt,  (48)
dn_t = (y_t/σ²) dt + β_{T−t} M ∇_n ln φ_{T−t}(y_t, n_t) dt,

then we would obtain at time T a sample (y_T, n_T) ∼ π(y_T)N(n_T; 0, M). In practice, this suggests that once we have learned an approximation f_θ*(t, x, m) ≈ ∇_m ln φ_t(x, m) by minimization of the ELBO, we can construct a proposal using

dy_t = −M^{−1} n_t dt,  (49)
dn_t = (y_t/σ²) dt + β_{T−t} M f_θ*(T−t, y_t, n_t) dt,

for (y_0, n_0) ∼ N(y_0; 0, σ²I)N(n_0; 0, M). The resulting sample (y_T, n_T) will then have a distribution close to π ⊗ N(0, M). Again it is possible to compute pointwise the density of this sample so as to perform an importance sampling correction.

B.4 FROM CONTINUOUS TIME TO DISCRETE TIME

We now need to come up with discrete-time integrators for the time-reversal of P^ref given by (40) and the time-reversal of Q^θ given by (43). Let us start with (40). We split it into the two components

dy_t = −M^{−1} n_t dt,  dn_t = (y_t/σ²) dt,  (50)

and

dy_t = 0,  dn_t = −β_{T−t} n_t dt + √(2β_{T−t}) M^{1/2} dW_t.  (51)

We will compose these transitions. To obtain (y_{k+1}, n_{k+1}) from (y_k, n_k), we first integrate (50). To do this, consider the Hamiltonian dynamics dy_t = M^{−1} n_t dt, dn_t = −(y_t/σ²) dt, which preserve N(y; 0, σ²I)N(n; 0, M) as invariant distribution.
We can integrate this ODE exactly over an interval of length δ and denote its solution Φ(y, n); see Section B.5 for details. We use its inverse Φ^{−1}(y, n) = Φ_flip ∘ Φ ∘ Φ_flip(y, n), where Φ_flip(y, n) = (y, −n), so that (y_{k+1}, n̄_k) = Φ^{−1}(y_k, n_k). Then we integrate (51) exactly using

n_{k+1} = √(1−α_{K−k}) n̄_k + √(α_{K−k}) M^{1/2} ε_k.  (52)

We have thus designed a transition of the form

p^ref(y_{k+1}, n_{k+1}, n̄_k | y_k, n_k) = δ_{Φ^{−1}(y_k, n_k)}(y_{k+1}, n̄_k) N(n_{k+1}; √(1−α_{K−k}) n̄_k, α_{K−k} M).  (53)

Now to integrate (43), we split it into three parts. We first integrate (50) using exactly the same integrator (y_{k+1}, n̄_k) = Φ^{−1}(y_k, n_k). Then we integrate

dn_t = 2β_{T−t} M f_θ(T−t, y_t, n_t) dt  (54)

using

n̄'_k = n̄_k + 2(1−√(1−α_{K−k})) M f_θ(K−k, y_{k+1}, n̄_k),  (55)

where we abuse notation and write f_θ(K−k, y_{k+1}, n̄_k) instead of f_θ((K−k)δ, y_{k+1}, n̄_k). Finally, we integrate the OU part using (52), with n̄_k replaced by n̄'_k. So the final transition we get is

q^θ(y_{k+1}, n_{k+1}, n̄_k, n̄'_k | y_k, n_k) = δ_{Φ^{−1}(y_k, n_k)}(y_{k+1}, n̄_k) δ_{n̄_k + 2(1−√(1−α_{K−k})) M f_θ(K−k, y_{k+1}, n̄_k)}(n̄'_k) N(n_{k+1}; √(1−α_{K−k}) n̄'_k, α_{K−k} M).

However, we can integrate out n̄'_k analytically to obtain

q^θ(y_{k+1}, n_{k+1}, n̄_k | y_k, n_k) = δ_{Φ^{−1}(y_k, n_k)}(y_{k+1}, n̄_k) N(n_{k+1}; √(1−α_{K−k}) ( n̄_k + 2(1−√(1−α_{K−k})) M f_θ(K−k, y_{k+1}, n̄_k) ), α_{K−k} M).  (56)

B.5 EXACT SOLUTION OF THE HAMILTONIAN EQUATION

Consider the Hamiltonian equation with M = I; the solution of dx_t = m_t dt, dm_t = −(x_t/σ²) dt can be written exactly as (x_t, m_t) = Φ_t(x_0, m_0), defined through

x_t = x_0 cos(t/σ) + m_0 σ sin(t/σ),  m_t = −(x_0/σ) sin(t/σ) + m_0 cos(t/σ).  (58)

This is the so-called harmonic oscillator. Now the inverse of the Hamiltonian flow satisfies Φ^{−1}(x, m) = Φ_flip ∘ Φ ∘ Φ_flip(x, m) (see e.g. Leimkuhler & Matthews (2016)), so we can simply integrate

dy_t = −n_t dt,  dn_t = (y_t/σ²) dt,  (59)

using (y_{k+1}, n̄_k) = Φ_τ^{−1}(y_k, n_k):

y_{k+1} = y_k cos(τ/σ) − n_k σ sin(τ/σ),  (60)
n̄_k = (y_k/σ) sin(τ/σ) + n_k cos(τ/σ).  (61)
Obviously, if we have M = αI, then for

dx_t = (m_t/α) dt,  dm_t = −(x_t/σ²) dt,  (62)

we use the reparameterization m̃_t = m_t/α, so that

dx_t = m̃_t dt,  dm̃_t = −(1/α)(x_t/σ²) dt,  (63)

and the solution is as above with σ replaced by σ√α. So for ñ_k = n_k/α, we have

y_{k+1} = y_k cos(τ/(√α σ)) − ñ_k σ√α sin(τ/(√α σ)),
n̄_k/α = (y_k/(√α σ)) sin(τ/(√α σ)) + ñ_k cos(τ/(√α σ)),

so, as n_k = α ñ_k, we finally obtain

y_{k+1} = y_k cos(τ/(√α σ)) − n_k (σ/√α) sin(τ/(√α σ)),  (64)
n̄_k = (√α y_k/σ) sin(τ/(√α σ)) + n_k cos(τ/(√α σ)).  (65)

B.6 DERIVATION OF THE UNDERDAMPED KL IN DISCRETE TIME

In this section we derive the discrete-time KL for the underdamped noising dynamics.

Proposition 6. The log density ratio lr = ln( q^θ(y_{0:K}, n_{0:K}, n̄_{0:K}) / p^ref(y_{0:K}, n_{0:K}, n̄_{0:K}) ) for y_{0:K}, n_{0:K}, n̄_{0:K} ∼ q^θ(·) equals

lr = Σ_{k=1}^K 2 [ (κ_k²/α_k) ||f_θ(k, y_{K−k+1}, n̄_{K−k})||²_M + (κ_k/√α_k) M^{1/2} f_θ(k, y_{K−k+1}, n̄_{K−k}) · ε_k ],  (66)

where κ_k := √(1−α_k)(1−√(1−α_k)) and ε_k is obtained as a function of n_{K−k+1} and n̄_{K−k} as described when deriving the integrator for Q^θ. In particular, we have ε_k i.i.d. ∼ N(0, I), and one obtains

KL(q^θ||p) = E_{q^θ}[ Σ_{k=1}^K 2(κ_k²/α_k) ||f_θ(k, y_{K−k+1}, n̄_{K−k})||²_M + ln( N(y_K; 0, σ²I)/π(y_K) ) ].  (67)

The integrator has been designed so that the ratio between the transitions of the proposal (56) and of the reference process (53) is well-defined: the deterministic parts are identical in the two transitions and cancel, so

q^θ(y_{k+1}, n_{k+1}, n̄_k | y_k, n_k) / p^ref(y_{k+1}, n_{k+1}, n̄_k | y_k, n_k)  (68)
= N(n_{k+1}; √(1−α_{K−k}) ( n̄_k + 2(1−√(1−α_{K−k})) M f_θ(K−k, y_{k+1}, n̄_k) ), α_{K−k} M) / N(n_{k+1}; √(1−α_{K−k}) n̄_k, α_{K−k} M).

The calculations to compute the Radon–Nikodym derivative are now very similar to those in the proof of Proposition 3:

ln( q^θ(y_{0:K}, n_{0:K}, n̄_{0:K}) / p^ref(y_{0:K}, n_{0:K}, n̄_{0:K}) )
= ln( q_K(y_0, n_0)/p_K^ref(y_0, n_0) ) + Σ_{k=1}^K ln( q^θ_{k−1|k}(y_{K−k+1}, n_{K−k+1}, n̄_{K−k} | y_{K−k}, n_{K−k}) / p^ref_{k−1|k}(y_{K−k+1}, n_{K−k+1}, n̄_{K−k} | y_{K−k}, n_{K−k}) )
= Σ_{k=1}^K [ ||n_{K−k+1} − √(1−α_k) n̄_{K−k}||²_{(2α_k M)^{−1}} − ||n_{K−k+1} − √(1−α_k)( n̄_{K−k} + 2(1−√(1−α_k)) M f_θ(k, y_{K−k+1}, n̄_{K−k}) )||²_{(2α_k M)^{−1}} ],  (69)

where we have exploited the fact that q_K(y_0, n_0) = p_K^ref(y_0, n_0) = N(y_0; 0, σ²I)N(n_0; 0, M).
Now let us introduce

ε_k := (M^{−1/2}/√α_k) ( n_{K−k+1} − √(1−α_k) n̄_{K−k} − 2√(1−α_k)(1−√(1−α_k)) M f_θ(k, y_{K−k+1}, n̄_{K−k}) ),  (70)

hence

(M^{−1/2}/√α_k)( n_{K−k+1} − √(1−α_k) n̄_{K−k} ) = ε_k + (2κ_k/√α_k) M^{1/2} f_θ(k, y_{K−k+1}, n̄_{K−k}).

We can rewrite (69) as

ln( q^θ(y_{0:K}, n_{0:K}, n̄_{0:K}) / p^ref(y_{0:K}, n_{0:K}, n̄_{0:K}) )
= Σ_{k=1}^K (1/2) [ ||ε_k + (2κ_k/√α_k) M^{1/2} f_θ(k, y_{K−k+1}, n̄_{K−k})||² − ||ε_k||² ]
= Σ_{k=1}^K 2 [ (κ_k²/α_k) ||f_θ(k, y_{K−k+1}, n̄_{K−k})||²_M + (κ_k/√α_k) M^{1/2} f_θ(k, y_{K−k+1}, n̄_{K−k}) · ε_k ],  (71)

where we note that κ_k/√α_k = √(1−α_k)(1−√(1−α_k))/√α_k ≈ √α_k/2 ≈ √(β_kδ/2) for δ ≪ 1.

C EXPERIMENTS

In this section we report additional experiments and ablations as well as further experimental detail. For both PIS and DDS we created a grid with δ = T/K = 0.05 and values of T ∈ {3.4, 6.4, 12.8, 25.6} with corresponding numbers of steps K ∈ {64, 128, 128, 256}. When tuning PIS we explore different values of σ, which in practice, when discretised, amounts to the same effect as changing δ. For DDS we apply the cosine schedule directly on the number of steps; however, we ensure that Σ_k α_k = α_max T so that we have a similar scaling as in PIS. Both PIS and DDS were trained with Adam (Kingma & Ba, 2015) for at most 11000 iterations, although in most cases both converged in fewer than 6000 iterations. Across all experiments we used a sample size of 300 to estimate the ELBO at train time and of 2000 for the reported IS estimator of Z for all methods. Finally, where available we use the true normalizing constant as ground truth, as is the case for the Funnel distribution and the normalising flow used in the NICE (Dinh et al., 2014) task. For the remaining tasks we follow prior work (Zhang & Chen, 2022; Arbel et al., 2021) and use a long-run SMC chain (1000 temperatures, 30 seeds with 2000 samples each).
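The normalized cosine schedule described above can be sketched as follows; this is a minimal numpy illustration, and the exact phase and offset of the cosine profile are assumptions here (cf. Nichol & Dhariwal, 2021), not the paper's exact implementation:

```python
import numpy as np

def cosine_alphas(num_steps, alpha_max, T):
    """Cosine-squared profile over the step index, rescaled so that
    sum(alphas) = alpha_max * T as required for the DDS grid.
    The precise cosine parameterization is an illustrative assumption."""
    k = np.arange(1, num_steps + 1)
    # cos(pi/2 * (1 - k/K))^2 = sin(pi/2 * k/K)^2: increasing in k.
    raw = np.sin(0.5 * np.pi * k / num_steps) ** 2
    return raw * (alpha_max * T) / raw.sum()

# Example: K = 128 steps with alpha_max = 0.1 and T = 6.4.
alphas = cosine_alphas(128, 0.1, 6.4)
```

The rescaling by α_max T matches the normalization Σ_k α_k = α_max T stated in the text, so that the discrete schedule has a scaling comparable to PIS.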
C.1 NETWORK PARAMETRISATION

In order to improve numerical stability we re-scale the neural network drift parametrisation as follows:

f_θ(k, y) = 2^{−1} λ_{K−k}^{−1} f̃_θ(k, y),  f̃_θ(k, y) = NN₁(k, y; θ) + NN₂(k; θ) ∇_y ln π(y).

This was done so that the reciprocal term α_{K−k}^{−1} in the DDS objective is cancelled, as this term reaches very small values at the boundaries, causing the overall objective to be large. Finally, as λ_k = (1−√(1−α_k))/α_k → 1/2, it follows that our proposed re-parametrisation converges to the same SDE as before, but now with stabler and simpler updates:

y_{k+1,n} = √(1−α_{K−k}) y_{k,n} + σ² α_{K−k} f̃_θ(K−k, y_{k,n}) + σ √(α_{K−k}) ε_{k,n},  ε_{k,n} i.i.d. ∼ N(0, I),  (72)
r_{k+1,n} = r_{k,n} + 2^{−1} σ² α_{K−k} ||f̃_θ(K−k, y_{k,n})||².  (73)

C.2 TUNING HYPER-PARAMETERS

For both DDS and PIS we explore a grid of 25 hyper-parameter values. For PIS we search for the best performing value of the volatility coefficient over 25 values; depending on the task we vary the end points of the grid, however we noticed that PIS leads to numerical instabilities for γ > 4, thus we never search over values larger than this. For DDS we searched across 5 values for σ and 5 values for α_max, leading to a total of 25 combinations explored for each experiment. Finally, for SMC we searched over its 3 different step sizes, leading to a total of 5³ = 125 combinations.

C.3 BROWNIAN MOTION MODEL WITH GAUSSIAN OBSERVATION NOISE

The model is given by

α_inn ∼ LogNormal(0, 2),  α_obs ∼ LogNormal(0, 2),
x₁ ∼ N(0, α_inn),  x_i ∼ N(x_{i−1}, α_inn), i = 2, ..., 30,
y_i ∼ N(x_i, α_obs), i = 1, ..., 30.

The goal is to perform inference over the variables α_inn, α_obs and {x_i}_{i=1}^{30} given the observations {y_i}_{i=1}^{10} ∪ {y_i}_{i=20}^{30}.

Figure 4: Magnitude of the learnt neural net approximation of the drift ∇_x ln φ_t(x) (see (23)) as a function of the step number, for PIS and DDS.
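The stabilised sampling recursion (72)–(73) above can be sketched as follows; this is a minimal numpy illustration in which the learned re-scaled drift network f̃_θ is replaced by a user-supplied callable (a placeholder, not the paper's architecture):

```python
import numpy as np

def dds_sample(f_tilde, alphas, sigma, dim, n_particles, seed=0):
    """Sketch of the DDS updates (72)-(73):
      y_{k+1} = sqrt(1 - a) y_k + sigma^2 a f~(K-k, y_k) + sigma sqrt(a) eps,
      r_{k+1} = r_k + 0.5 sigma^2 a ||f~(K-k, y_k)||^2,
    where a = alpha_{K-k}. f_tilde stands in for the learned drift."""
    rng = np.random.default_rng(seed)
    K = len(alphas)
    y = sigma * rng.standard_normal((n_particles, dim))  # y_0 ~ N(0, sigma^2 I)
    r = np.zeros(n_particles)
    for k in range(K):
        a = alphas[K - 1 - k]  # alpha_{K-k}
        drift = f_tilde(K - 1 - k, y)
        r += 0.5 * sigma**2 * a * np.sum(drift**2, axis=-1)
        eps = rng.standard_normal((n_particles, dim))
        y = np.sqrt(1.0 - a) * y + sigma**2 * a * drift + sigma * np.sqrt(a) * eps
    return y, r
```

With f_tilde ≡ 0 the recursion leaves the reference N(0, σ²I) invariant and r stays at zero, which gives a quick correctness check of the exponential-integrator update.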
Training times per iteration, in seconds (LGCP):

| | k = 64 | k = 128 | k = 256 | k = 512 |
| DDS | 0.104 ± 0.0004 | 0.205 ± 0.0004 | 0.406 ± 0.0004 | 0.809 ± 0.0004 |
| PIS | 0.104 ± 0.0004 | 0.205 ± 0.0003 | 0.406 ± 0.0005 | 0.808 ± 0.0006 |

Table 2: Training time per ELBO gradient update on 300 samples.

C.4 DRIFT MAGNITUDE

In Figure 4 we compare the magnitudes of the neural network drifts learned by PIS and DDS on a scalar N(6, σ²) target. The drifts were evaluated on trajectories sampled at random from each of the learned samplers. We observe that the PIS drift has a large magnitude with higher variance, making it more prone to numerical instabilities, whilst the OU-based DDS drift has a notably lower magnitude with significantly less variance. We conjecture this is the reason that PIS becomes unstable at training time for a large number of steps across several of our PIS experiments, many of which did not converge due to numerical instabilities. For simplicity, in this experiment we set α_k to be uniform.

C.5 TRAINING TIME

We evaluate the training times of DDS and PIS on an 8-chip TPU circuit, averaged over 99 runs. Table 2 shows running times per training iteration. In total we trained for 11000 iterations, thus the total training times are 19 minutes, 37 minutes, 1 hour 14 minutes, and 2 hours 26 minutes for k = 64, 128, 256, 512 respectively.

C.6 VARIATIONAL ENCODER IN LATENT SPACE

Following Arbel et al. (2021) we explore estimating the posterior of a pre-trained Variational Auto-Encoder (VAE) (Kingma & Welling, 2013). We found all approaches reach a very small error for ln Z very quickly; however, as seen with PIS across experiments, results became unstable for large T, and thus many hyper-parameter settings when tuning the volatility for PIS led to the loss exploding.

C.7 GENERATED IMAGES

In this section we provide some of the generated samples for the normalising flow evaluation task. In Figure 6 we can observe how both PIS and DDS are able to generate more image classes than SMC, which suffers from mode collapse.
Out of the 3 approaches, DDS mode-collapses the least whilst SMC generates the highest quality images.

Figure 5: Results on the pretrained VAE from Arbel et al. (2021): ln Z estimates for DDS, PIS and SMC as a function of the number of steps (64, 128, 256, 512).

Figure 6: Samples from a) DDS, b) PIS, c) SMC. We can see how SMC mode-collapses significantly as it only seems to produce 5 image classes. For PIS we identified 7 classes whilst for DDS we found 9 out of 10.

Using a neural network with inductive biases, such as a convolutional neural network (LeCun et al., 1998), should improve the image quality over the simple feed-forward networks we have used; we believe this is the reason behind the low quality of the images, as both minimising the path KL and sampling high-quality images is a difficult task for a small feed-forward network.

C.8 COMPARISON TO AIS AND MCD

We compare our results to the AIS and MCD baselines presented in Geffner & Domke (2022). Results can be seen in Tables 3 and 4; both DDS and PIS outperform LDVI across all numbers of steps and tasks, with DDS outperforming PIS for higher values of k due to the inconsistent and unstable behaviour exhibited by PIS.

C.9 EULER MARUYAMA VS EXPONENTIAL INTEGRATOR

In this section we demonstrate empirically how using the Euler–Maruyama integrator directly on the DDS objective leads to overestimating ln Z. We compare on the Funnel dataset, for which we know ln Z to be 0, and on the Ionosphere dataset, which is small and for which we have a reliable estimate of ln Z via a long-run SMC.
Results in Table 5 showcase how, for lower values of k = 64, 128, the Euler-Maruyama (EM) based approach significantly overestimates ln Z and overall does not perform well.

| Ionosphere | ULA | MCD | UHA | LDVI | DDS | PIS |
|---|---|---|---|---|---|---|
| k = 64 | -113.8 | -112.5 | -112.8 | -112.1 | -111.7 | -111.6 |
| k = 128 | -113.1 | -112.2 | -112.3 | -111.9 | -111.6 | -112.0 |
| k = 256 | -112.7 | -112.1 | -112.1 | -111.7 | -111.6 | -113.0 |

| Sonar | ULA | MCD | UHA | LDVI | DDS | PIS |
|---|---|---|---|---|---|---|
| k = 64 | -115.3 | -111.1 | -111.9 | -109.7 | -109.4 | -108.9 |
| k = 128 | -113.5 | -110.2 | -110.6 | -109.1 | -108.9 | -109.3 |
| k = 256 | -112.1 | -109.7 | -109.7 | -108.9 | -108.9 | -108.9 |

Table 3: Comparison of DDS and PIS to AIS based approaches on logistic regression models.

| Brownian Motion | ULA | MCD | UHA | LDVI | DDS | PIS |
|---|---|---|---|---|---|---|
| k = 64 | -0.7 | 0.1 | 0.1 | 0.5 | 1.0 | 1.0 |
| k = 128 | -0.3 | 0.2 | 0.4 | 0.7 | 1.1 | 1.0 |
| k = 256 | -0.1 | 0.5 | 0.6 | 0.9 | 1.0 | 0.7 |

Table 4: Comparison of DDS and PIS to AIS based approaches on a Brownian motion time series target.

| | Funnel (Exp. Int.) | Funnel (EM) | Funnel (Ground Truth) | Ionosphere (Exp. Int.) | Ionosphere (EM) | Ionosphere (Ground Truth) |
|---|---|---|---|---|---|---|
| k = 64 | 0.206 ± 0.059 | 2.425 ± 0.266 | 0.000 | -111.693 ± 0.169 | -106.364 ± 0.371 | -111.560 |
| k = 128 | 0.176 ± 0.099 | 2.011 ± 0.532 | 0.000 | -111.587 ± 0.136 | -111.158 ± 0.399 | -111.560 |
| k = 256 | 0.221 ± 0.076 | 0.086 ± 0.124 | 0.000 | -111.575 ± 0.163 | -112.271 ± 0.366 | -111.560 |
| k = 512 | 0.176 ± 0.068 | 0.154 ± 0.066 | 0.000 | -111.582 ± 0.136 | -112.912 ± 0.353 | -111.560 |

Table 5: Comparing Euler-Maruyama (EM) vs exponential integrators on the Funnel and Ionosphere datasets. We can see how EM significantly overestimates ln Z.

C.9.1 DETACHED GRADIENT ABLATIONS

In this section we perform an ablation over our proposed modification of the PIS-Grad network architecture. We compare the same architecture with and without the gradient attached. We found that detaching the gradient led to favourable performance as well as more stable training. Results are presented in Table 7 and were carried out for k = 128, using the best found diffusion coefficient for each task.
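The detaching described above can be implemented with a stop-gradient. A minimal JAX sketch (the function name `drift_features` is ours; this mirrors, but is not, the exact PIS-Grad architecture):

```python
import jax
import jax.numpy as jnp

def drift_features(x, grad_log_pi):
    # Feed the target score to the network as an input feature only:
    # stop_gradient blocks backprop through the (possibly chaotic)
    # unrolled score evaluations.
    score_feat = jax.lax.stop_gradient(grad_log_pi(x))
    return jnp.concatenate([x, score_feat], axis=-1)
```

Differentiating any loss built on these features treats the score as a constant: with grad_log_pi(z) = z³, the gradient of jnp.sum(drift_features(x, ...)) w.r.t. x is 1 rather than 1 + 3x².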
We chose to explore this feature as it is known that optimisation through unrolled computation graphs can be chaotic and can introduce exploding/vanishing gradients (Parmas et al., 2018; Metz et al., 2020), which can numerically introduce bias. Detaching has been successfully applied in related prior work (Greff et al., 2019), where detached scores are provided as features to the recognition networks. Alternative approaches exist based on smoothing the chaotic loss landscape (Vicol et al., 2021), but we leave this for future work.

C.10 COSINE BASED DECAY FOR PIS

Unlike DDS, with PIS it is not immediately clear how to schedule the step sizes in a manner similar to Nichol & Dhariwal (2021). The simplest approach is to schedule δ (the discretisation step). We perform an ablation with the cosine schedule on the discretisation step of PIS. Results are displayed in Table 6: overall the cosine discretisation decreases or preserves the performance of PIS, with the exception of the Sonar dataset, on which it improves it. Whilst there may be a way to improve PIS with a bespoke discretisation, we were unable to find one in this work. Additionally, we can see from Table 6 how DDS improves significantly across every task when using the cosine squared schedule compared to the uniform one.

C.11 PROBABILITY FLOW ODE

Figures 7 and 8 show samples obtained from the probability flow ODE of a trained DDS model. We can see that the probability flow ODE is able to perfectly sample from a uni-modal Gaussian; matters became more challenging with multi-modal distributions. Initially we discretised the ODE (12) using the same type of integrators as for the SDEs, obtaining

y_{k+1} = y_k + δσ²(1 − √(1 − α_{K−k})) f_θ(K − k, y_k), with y_0 ∼ N(0, σ²I).

Unfortunately, under this discretisation we found the probability flow ODE to become stiff for more complex distributions such as the mixture of Gaussians, resulting in strange effects in the samples (matching the modes but with completely wrong shapes).
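Stiffness of this kind is typically tackled with a higher-order scheme. The Heun integrator of Karras et al. (2022) is the explicit trapezoidal rule: take an Euler step, re-evaluate the drift at the tentative point, and average the two slopes. A generic sketch for dy/dt = f(t, y) (the exponential-decay test function below is illustrative only, not the DDS drift):

```python
import numpy as np

def heun_step(f, t, y, dt):
    # Explicit trapezoidal (Heun) step: 2nd-order accurate,
    # at the cost of one extra drift evaluation per step vs Euler.
    d1 = f(t, y)
    y_tilde = y + dt * d1          # Euler predictor
    d2 = f(t + dt, y_tilde)        # corrector slope at the tentative point
    return y + 0.5 * dt * (d1 + d2)

def integrate(f, y0, t0, t1, n):
    y, t, dt = y0, t0, (t1 - t0) / n
    for _ in range(n):
        y = heun_step(f, t, y, dt)
        t += dt
    return y
```

On dy/dt = −y, 100 Heun steps over [0, 1] reproduce e⁻¹ to roughly 10⁻⁵, versus roughly 2 × 10⁻³ for plain Euler with the same number of steps.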
By using a Heun integrator, as proposed in Karras et al. (2022), we were able to obtain better results; however, we also found we needed to increase the network size as well as train with a learning rate decay of 0.99. With these extra features we were able to simulate a probability flow ODE that roughly matched the marginal densities of the SDE. However, even with these fixes we still found the probability flow based estimator of ln Z to drastically overestimate. This is not entirely surprising, as in discrete time the expectation of this estimator is not guaranteed to be an ELBO.

| Method | Funnel | LGCP | VAE | Sonar | Ionosphere | Brownian | NICE |
|---|---|---|---|---|---|---|---|
| PIS cos decay | 0.256 ± 0.094 | 501.060 ± 0.881 | -110.025 ± 0.0720 | -108.843 ± 0.242 | -111.618 ± 0.127 | 1.035 ± 0.097 | 4.212 ± 0.620 |
| PIS uniform | 0.268 ± 0.07 | 502.002 ± 0.957 | -110.025 ± 0.070 | -109.349 ± 0.356 | -111.602 ± 0.145 | 0.960 ± 0.145 | 3.985 ± 0.349 |
| DDS | 0.176 ± 0.098 | 497.944 ± 0.994 | -110.012 ± 0.071 | -108.903 ± 0.226 | -111.587 ± 0.136 | 1.047 ± 0.156 | 3.490 ± 0.568 |
| DDS uniform | 0.275 ± 0.127 | 477.680 ± 1.338 | -110.245 ± 0.188 | -109.556 ± 0.468 | -111.736 ± 0.161 | 0.475 ± 0.166 | 7.047 ± 0.691 |

Table 6: Ablation for cos² scheduling.

| Method | Funnel | LGCP | VAE | Sonar | Ionosphere | Brownian | NICE |
|---|---|---|---|---|---|---|---|
| PIS-Grad detached | 0.268 ± 0.07 | 502.002 ± 0.957 | -110.025 ± 0.0704 | -109.349 ± 0.356 | -111.602 ± 0.145 | 0.960 ± 0.145 | 3.985 ± 0.349 |
| PIS-Grad | 0.268 ± 0.09 | 501.273 ± 0.798 | -110.033 ± 0.0647 | -109.377 ± 0.361 | -111.705 ± 0.190 | 5.793 ± 0.499 | 3.927 ± 0.666 |

Table 7: Results for PIS-Grad with and without detaching the gradient.

C.12 UNDERDAMPED OU RESULTS

We perform some additional experiments with the underdamped OU reference process. Similar to the damped setting, we parametrise the network as

f_θ(k, x, p) = NN₁(k, x, p; θ) + NN₂(k; θ) ∇ ln π(x). (74)

Unfortunately, unlike in DDS and PIS, we cannot directly aid the update/proposal for x_t with ∇ ln π(x), thus this naive parametrisation does not have the ideal inductive bias. Results for this approach are shown in Figures 9 and 10.
These results are worse than those of DDS with a standard OU reference process, and than PIS; however, they are still better than VI. We believe that future work exploring better inductive biases for the network f_θ(k, x, p) could narrow the performance gap of this approach. Additionally, we highlight that in Figure 9 (Funnel distribution) the underdamped approach overestimates ln Z for k = 64; we believe this may be due to numerical error when generating a sample.

C.13 BOX PLOTS FOR MAIN RESULTS

For completeness, in this section we also present our results via box plots, as can be seen in Figures 11 and 12.

Figure 7: a) Target distribution (MoG), b) DDS SDE samples, c) trained probability flow ODE.

Figure 8: a) Target distribution N(6, 1), b) DDS SDE, c) trained DDS probability flow ODE.

D DENOISING DIFFUSION SAMPLERS IN DISCRETE-TIME: FURTHER DETAILS

D.1 FORWARD PROCESS AND ITS TIME REVERSAL

In discrete-time, we consider the following forward Markov process

p(x_{0:K}) = π(x_0) ∏_{k=1}^K p_{k|k−1}(x_k | x_{k−1}). (75)

Following (17), we use p_{k|k−1}(x_k | x_{k−1}) = N(x_k; √(1 − α_k) x_{k−1}, σ²α_k I). The parameters (α_k)_{k=1}^K are such that p_K(x_K) ≈ N(x_K; 0, σ²I). We propose here to sample approximately from π by approximating the ancestral sampling scheme for (75) corresponding to the backward decomposition

p(x_{0:K}) = p_K(x_K) ∏_{k=1}^K p_{k−1|k}(x_{k−1} | x_k), with p_{k−1|k}(x_{k−1} | x_k) = p_{k−1}(x_{k−1}) p_{k|k−1}(x_k | x_{k−1}) / p_k(x_k), (76)

where p_0(x_0) = π(x_0). If we could sample x_K ∼ p_K(·) and then x_{k−1} ∼ p_{k−1|k}(· | x_k) for k = K, ..., 1, then x_0 would indeed be a sample from π. However, as the marginal densities (p_k)_{k=1}^K are not available in closed form, we cannot implement this ancestral sampling procedure exactly. As we have by design p_K(x_K) ≈ N(x_K; 0, σ²I), we can simply initialize the ancestral sampling procedure with a Gaussian sample. The approximation of the backward Markov kernels p_{k−1|k}(x_{k−1} | x_k) is, however, more involved.
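The forward recursion above is straightforward to simulate. The sketch below pushes a toy bimodal stand-in for π through the kernels p_{k|k−1} and checks that the terminal samples are close to the N(0, σ²I) Gaussian; the cos²-style schedule is one plausible construction adapted from Nichol & Dhariwal (2021), an assumption rather than the paper's exact choice:

```python
import numpy as np

def cosine_alphas(K, s=0.008):
    # cos^2 schedule on bar_alpha_k = prod_{j<=k} (1 - alpha_j),
    # adapted from Nichol & Dhariwal (2021); mapping to DDS steps assumed.
    t = np.arange(K + 1) / K
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    bar = f / f[0]                                   # bar_alpha_0 = 1
    return np.clip(1.0 - bar[1:] / bar[:-1], 0.0, 0.999)

rng = np.random.default_rng(0)
sigma, K, n = 1.0, 128, 50_000
alphas = cosine_alphas(K)

# x0 ~ a toy bimodal "target" standing in for pi
x = np.where(rng.random(n) < 0.5, -3.0, 3.0) + 0.1 * rng.standard_normal(n)

# Apply p_{k|k-1}(x_k | x_{k-1}) = N(sqrt(1 - alpha_k) x_{k-1}, sigma^2 alpha_k)
for a in alphas:
    x = np.sqrt(1.0 - a) * x + sigma * np.sqrt(a) * rng.standard_normal(n)

# The contribution of x0 is scaled by prod_k sqrt(1 - alpha_k), so the
# terminal marginal is close to N(0, sigma^2).
```

Note that each kernel also leaves N(0, σ²I) exactly invariant, since (1 − α_k)σ² + α_kσ² = σ²; this is precisely the property defining the reference process of Section D.2.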
D.2 REFERENCE PROCESS AND BELLMAN RECURSION

We introduce the simple reference process p^ref(x_{0:K}) defined by

p^ref(x_{0:K}) = N(x_0; 0, σ²I) p(x_{1:K} | x_0) = N(x_0; 0, σ²I) ∏_{k=1}^K p_{k|k−1}(x_k | x_{k−1}), (77)

which by construction admits the marginal distributions p^ref_k(x_k) = N(x_k; 0, σ²I) for all k. It can easily be verified using the chain rule for KL that the extended target process p is the distribution minimizing the reverse (or forward) KL discrepancy w.r.t. p^ref over the set of path measures q(x_{0:K}) with marginal q_0(x_0) = π(x_0) at the initial time, i.e.

p = argmin_q { KL(q || p^ref) : q_0 = π }.

The intractable backward Markov densities p_{k−1|k} one wants to approximate can also be rewritten as twisted versions of the tractable backward Markov densities p^ref_{k−1|k} by some value functions (φ_k)_{k=0}^K.

Figure 9: ln Z estimate as a function of the number of steps K (64 to 512) - a) Funnel, b) LGCP, c) Logistic Ionosphere dataset. Yellow dotted line is MF-VI and dashed magenta is the gold standard.

Figure 10: ln Z estimate as a function of the number of steps K (64 to 512) - a) Logistic Sonar dataset, b) Brownian motion, c) NICE. Yellow dotted line is MF-VI and dashed magenta is the gold standard.

Proposition 7. We have for k = 1, ..., K

p_{k−1|k}(x_{k−1} | x_k) = φ_{k−1}(x_{k−1}) p^ref_{k−1|k}(x_{k−1} | x_k) / φ_k(x_k), for φ_k(x_k) = p_k(x_k) / p^ref_k(x_k). (78)

The value functions (φ_k)_{k=1}^K satisfy a forward Bellman-type equation

φ_k(x_k) = ∫ φ_{k−1}(x_{k−1}) p^ref_{k−1|k}(x_{k−1} | x_k) dx_{k−1}, φ_0(x_0) = π(x_0) / p^ref_0(x_0). (79)

It follows that

φ_k(x_k) = ∫ φ_0(x_0) p^ref_{0|k}(x_0 | x_k) dx_0. (80)

Proof.
To establish (78), we use Bayes' rule:

p_{k−1|k}(x_{k−1} | x_k) = p_{k−1}(x_{k−1}) p_{k|k−1}(x_k | x_{k−1}) / p_k(x_k)
= φ_{k−1}(x_{k−1}) p^ref_{k−1}(x_{k−1}) p_{k|k−1}(x_k | x_{k−1}) / (φ_k(x_k) p^ref_k(x_k))
= φ_{k−1}(x_{k−1}) p^ref_{k−1|k}(x_{k−1} | x_k) / φ_k(x_k),

where we have used the facts that p_k(x_k) = φ_k(x_k) p^ref_k(x_k) and p^ref_{k|k−1}(x_k | x_{k−1}) = p_{k|k−1}(x_k | x_{k−1}), and Bayes' rule again. Now we have ∫ p_{k−1|k}(x_{k−1} | x_k) dx_{k−1} = 1 for any x_k, so it follows directly from the expression of this transition kernel that the value functions satisfy

φ_k(x_k) = ∫ φ_{k−1}(x_{k−1}) p^ref_{k−1|k}(x_{k−1} | x_k) dx_{k−1}, φ_0(x_0) = π(x_0) / p^ref_0(x_0). (81)

By iterating this recursion, it follows that

φ_k(x_k) = ∫ φ_0(x_0) p^ref_{0|k}(x_0 | x_k) dx_0. (82)

Figure 11: ln Z estimate as a function of the number of steps K (64 to 512) - a) Funnel, b) LGCP, c) Logistic Ionosphere dataset. Yellow dotted line is MF-VI and dashed magenta is the gold standard.

Figure 12: ln Z estimate as a function of the number of steps K (64 to 512) - a) Logistic Sonar dataset, b) Brownian motion, c) NICE. Yellow dotted line is MF-VI and dashed magenta is the gold standard.

D.3 APPROXIMATING THE BACKWARD KERNELS

We want to approximate the backward Markov transitions p_{k−1|k}. From (78), this would require not only approximating the value functions (Monte Carlo estimates based on (80) are high variance) but also sampling from the resulting twisted kernel, which is difficult. However, if α_k ≈ 0, then we have φ_{k−1}(x) ≈ φ_k(x) and, by a Taylor expansion of ln φ_k around x_k, we obtain

p_{k−1|k}(x_{k−1} | x_k) = p^ref_{k−1|k}(x_{k−1} | x_k) exp(ln φ_{k−1}(x_{k−1}) − ln φ_k(x_k))
≈ N(x_{k−1}; √(1 − α_k) x_k + σ²(1 − √(1 − α_k)) ∇ ln φ_k(x_k), σ²α_k I). (83)

By differentiating the identity (80) w.r.t.
x_k, we obtain the following expression for ∇ ln φ_k:

∇ ln φ_k(x_k) = ∫ ∇_{x_k} ln p^ref_{0|k}(x_0 | x_k) p_{0|k}(x_0 | x_k) dx_0. (84)

Alternatively, we can also use ∇ ln φ_k(x_k) = ∇ ln p_k(x_k) + x_k/σ², where ∇ ln p_k(x_k) can easily be shown to satisfy

∇ ln p_k(x_k) = ∫ ∇_{x_k} ln p_{k|0}(x_k | x_0) p_{0|k}(x_0 | x_k) dx_0. (85)

For DDPM, the conditional expectation in (85) is reformulated as the solution to a regression problem, leveraging the fact that we can easily obtain samples from π(x_0) p_{k|0}(x_k | x_0) as π is the data distribution in this context. In the Monte Carlo sampling context considered here, we could sample approximately from p_{0|k}(x_0 | x_k) ∝ π(x_0) p_{k|0}(x_k | x_0) using MCMC so as to approximate the conditional expectations in (84) and (85), but this would defeat the purpose of the proposed methodology. We propose instead to approximate ∇ ln φ_k by minimizing a suitable criterion. In practice, we will consider a distribution q_θ(x_{0:K}) approximating p(x_{0:K}) of the form

q_θ(x_{0:K}) = N(x_K; 0, σ²I) ∏_{k=1}^K q_θ^{k−1|k}(x_{k−1} | x_k), (86)

i.e. we select q_θ^K(x_K) = N(x_K; 0, σ²I) since p_K(x_K) ≈ N(x_K; 0, σ²I). We want q_θ^{k−1|k}(x_{k−1} | x_k) to approximate p_{k−1|k}(x_{k−1} | x_k), i.e. f_θ(k, x_k) ≈ ∇ ln φ_k(x_k). Inspired by (83), we will consider an approximation of the form

q_θ^{k−1|k}(x_{k−1} | x_k) = N(x_{k−1}; √(1 − α_k) x_k + σ²(1 − √(1 − α_k)) f_θ(k, x_k), σ²α_k I), (87)

where f_θ(k, x_k) is differentiable w.r.t. θ.

D.4 HYPERPARAMETERS

D.4.1 FITTED HYPERPARAMETERS

To aid reproducibility we report all fitted hyperparameters for each of our methods and for PIS across all experiments in Tables 8 and 9.

D.4.2 OPTIMISATION HYPERPARAMETERS

As mentioned in the experimental section, across all experiments except the Funnel we use the Adam optimiser with a learning rate of 0.0001, no learning rate decay and 11000 training iterations; for the rest of the optimisation parameters we use the default settings provided by the Optax library (Hessel et al., 2020), namely b1 = 0.9, b2 = 0.999 and ε = 10⁻⁸, with naming as per Kingma & Ba (2015).
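Ancestral sampling from (86)-(87) is then immediate. A minimal sketch follows; the network f_θ is a placeholder argument, and with f_θ ≡ 0 the sampler reduces exactly to the reference process, whose marginals remain N(0, σ²I):

```python
import numpy as np

def dds_ancestral_sample(f_theta, alphas, sigma, n, d, rng):
    """Sample x_0 from q_theta in (86)-(87): start at N(0, sigma^2 I)
    and apply the learned backward kernels for k = K, ..., 1."""
    K = len(alphas)
    x = sigma * rng.standard_normal((n, d))          # x_K ~ N(0, sigma^2 I)
    for k in range(K, 0, -1):
        a = alphas[k - 1]
        mean = (np.sqrt(1.0 - a) * x
                + sigma**2 * (1.0 - np.sqrt(1.0 - a)) * f_theta(k, x))
        x = mean + sigma * np.sqrt(a) * rng.standard_normal((n, d))
    return x

rng = np.random.default_rng(0)
# With f_theta = 0 the backward kernel is p_ref_{k-1|k}, so the x_0
# samples should again have the N(0, sigma^2 I) marginal.
x0 = dds_ancestral_sample(lambda k, x: 0.0 * x,
                          np.full(64, 0.05), 1.0, 50_000, 1, rng)
```

A learned f_θ(k, x_k) ≈ ∇ ln φ_k(x_k) would instead tilt each backward step toward π, as described above.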
From the GitHub repository of Zhang & Chen (2022) we were only able to find hyperparameters reported for the Funnel distribution. In order to first reproduce their results we used a learning rate of 0.005 and a learning rate decay of 0.95 as per their implementation. Their results were initially not reproducible due to a bug setting σ_f = 1 whilst comparing to other methods at the less favourable value σ_f = 3; with σ_f = 1 we were able to reproduce their results. However, we report results at σ_f = 3 as this is the traditional value used for this loss. As no other optimisation configuration files were reported, we used the more conservative learning rate of 0.0001, since PIS was very unstable at 0.005 with decay 0.95 across many of our tasks. Finally, we would like to clarify that the exact same optimiser settings were used for both PIS and DDS in order to ensure a fair comparison.

D.4.3 DRIFT AND GRADIENT CLIPPING

We follow the same gradient clipping as in Zhang & Chen (2022), that is:

f_θ(k, x) = clip( NN₁(k, x; θ) + NN₂(k; θ) clip(∇ ln π(x), −10², 10²), −10², 10² ).

| | Funnel DDS | Funnel PIS | Funnel UDMP | LGCP DDS | LGCP PIS | LGCP UDMP | Ion DDS | Ion PIS | Ion UDMP |
|---|---|---|---|---|---|---|---|---|---|
| K = 64 | σ = 1.075, α = 1.075 | σ = 1.068 | σ = 1.85, α = 1.67, m = 0.9 | σ = 2.1, α = 1.50 | σ = 1.068 | σ = 1.1, α = 2.5, m = 0.4 | σ = 0.688, α = 1.463 | σ = 0.253 | σ = 0.6, α = 3.85, m = 0.600 |
| K = 128 | σ = 1.075, α = 0.6875 | σ = 0.416 | σ = 1.85, α = 3.7, m = 0.9 | σ = 2.1, α = 0.75 | σ = 0.742 | σ = 1.4, α = 2.5, m = 0.4 | σ = 0.3, α = 1.075 | σ = 0.09 | σ = 0.6, α = 3.85, m = 0.600 |
| K = 256 | σ = 1.85, α = 0.3 | σ = 0.742 | σ = 1.075, α = 2.5, m = 0.9 | σ = 2.1, α = 0.900 | σ = 0.579 | σ = 1.4, α = 4.5, m = 0.4 | σ = 0.3, α = 0.688 | σ = 0.416 | σ = 0.6, α = 3.85, m = 1.0 |
| K = 512 | σ = 1.463, α = 0.3 | σ = 0.253 | σ = 0.688, α = 3.7, m = 0.9 | σ = 2.1, α = 1.500 | σ = 0.416 | σ = 1.7, α = 4.5, m = 0.4 | σ = 0.688, α = 0.688 | σ = 0.253 | σ = 0.6, α = 3.85, m = 1.0 |

Table 8: Fitted hyperparameters for the Funnel, LGCP and Ionosphere experiments.
| | Sonar DDS | Sonar PIS | Sonar UDMP | VAE DDS | VAE PIS | VAE UDMP | Brownian DDS | Brownian PIS | Brownian UDMP | NICE DDS | NICE PIS | NICE UDMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K = 64 | σ = 0.3, α = 1.650 | σ = 0.253 | σ = 1.15, α = 1.7, m = 2.2 | σ = 0.61, α = 2.2 | σ = 0.253 | NA | σ = 0.1, α = 2.35 | σ = 0.084 | σ = 0.115, α = 4.8, m = 2.2 | σ = 1.5, α = 2.125 | σ = 0.75 | σ = 1.2, α = 1.0, m = 0.9 |
| K = 128 | σ = 0.3, α = 1.2 | σ = 0.253 | σ = 0.55, α = 1.7, m = 3.1 | σ = 0.61, α = 1.670 | σ = 0.2523 | NA | σ = 0.1, α = 1.8 | σ = 0.093 | σ = 0.115, α = 4.8, m = 2.2 | σ = 1.5, α = 1.75 | σ = 0.588 | σ = 1.2, α = 1.0, m = 1.65 |
| K = 256 | σ = 0.3, α = 0.75 | σ = 0.253 | σ = 0.55, α = 2.9, m = 2.2 | σ = 0.61, α = 1.140 | σ = 0.506 | NA | σ = 0.1, α = 2.35 | σ = 0.043 | σ = 0.115, α = 3.75, m = 2.2 | σ = 1.5, α = 1.75 | σ = 0.425 | σ = 1.2, α = 2.5, m = 0.9 |
| K = 512 | σ = 0.3, α = 0.75 | σ = 0.253 | σ = 0.55, α = 2.9, m = 3.1 | σ = 0.61, α = 1.670 | σ = 0.416 | NA | σ = 0.1, α = 1.8 | σ = 0.0408 | σ = 0.115, α = 4.8, m = 2.2 | σ = 1.5, α = 2.5 | σ = 0.263 | σ = 1.2, α = 2.5, m = 1.65 |

Table 9: Fitted hyperparameters for the Sonar (logistic regression), VAE, Brownian motion and NICE experiments.
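The double clipping of D.4.3 can be written as a small helper (a NumPy sketch; `nn1` and `nn2` stand for the outputs of NN₁(k, x; θ) and NN₂(k; θ), and the threshold 10² follows Zhang & Chen (2022)):

```python
import numpy as np

def clipped_drift(nn1, nn2, grad_log_pi, c=1e2):
    # Inner clip tames extreme target scores; outer clip bounds the drift.
    inner = np.clip(grad_log_pi, -c, c)
    return np.clip(nn1 + nn2 * inner, -c, c)
```

For example, an exploding score of 10⁶ is first clipped to 10², so the overall drift stays bounded even in ill-conditioned regions of the target.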