Published as a conference paper at ICLR 2024

REVERSE DIFFUSION MONTE CARLO

Xunpeng Huang, Hanze Dong, Yifan Hao, Yi-An Ma, Tong Zhang
The Hong Kong University of Science and Technology; Salesforce AI Research; University of California, San Diego; University of Illinois Urbana-Champaign

We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates (the score functions) are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (RDMC), which is distinct from Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution at any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with RDMC can be significantly faster than sampling with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, RDMC greatly improves over Langevin-style MCMC sampling methods both theoretically and in practice. The proposed RDMC method offers a new perspective and solution beyond classical MCMC algorithms for challenging complex distributions.

1 INTRODUCTION

The recent success of diffusion models has shown great promise for reverse diffusion processes in generating samples from a complex distribution (Song et al., 2020; Rombach et al., 2022). In the existing line of works, one is given samples from the target distribution and aims to generate more samples from the same target. One diffuses the target distribution into a standard normal one, and uses score matching to learn the transitions between the consecutive intermediary distributions (Ho et al., 2020; Song et al., 2020). Reversing the learned diffusion process leads back to the target distribution. The benefit of the reverse diffusion process lies in the efficiency of convergence from any complex distribution to a normal one (Chen et al., 2022a; Lee et al., 2023). For example, diffusing a target multi-modal distribution into a normal one is not harder than diffusing a single mode. Backtracking the process from the normal distribution directly yields the desired multi-modal target. If one instead adopts the forward diffusion process from a normal distribution to the multi-modal one, there is the classical challenge of mixing among the modes, as illustrated in Figure 1. This observation motivates us to ask:

Can we create an efficient, general purpose Monte Carlo sampler from reverse diffusion processes?

For Monte Carlo sampling, while we have access to an unnormalized density function $p_*(x) \propto \exp(-f_*(x))$, samples from the target distribution are unavailable (Neal, 1993; Jerrum & Sinclair, 1996; Robert et al., 1999). As a result, we need a different and yet efficient method of score estimation to perform the reverse SDE. This leads to the first contribution of this paper. We leverage the fact that the diffusion process from our target distribution $p_*$ towards a standard normal one $p_\infty$ is an Ornstein-Uhlenbeck (OU) process, which admits explicit solutions. We thereby transform the score matching problem into a non-parametric mean estimation one, without training a parameterized diffusion model.
We name this new algorithm reverse diffusion Monte Carlo (RDMC). (Equal contribution; random author order.)

Figure 1: Langevin dynamics (first row) versus reverse SDE (second row). The two rows depict the intermediate states of the Langevin algorithm and the reverse SDE, respectively, illustrating the transition from a standard normal distribution to a Gaussian mixture $p_*$. Due to the local nature of the information contained in $\nabla \ln p_*$, the Langevin algorithm tends to get stuck in modes close to its initialization. In contrast, the reverse SDE excels at transporting particles to different modes in proportion to the target densities.

To implement RDMC and solve the aforementioned mean estimation problem, we propose two approaches. One is to sample from a normal distribution $\rho_t$, determined at each time $t$ by the transition kernel of the OU process, and then compute mean estimates weighted by the target $p_*$. This approach translates all the computational challenge into the sample complexity of the importance sampling estimator. The iteration complexity required to achieve an overall $\epsilon$ accuracy in TV distance is $O(\epsilon^{-2})$, under minimal assumptions. Another approach is to use the Unadjusted Langevin Algorithm (ULA) to generate samples from the product distribution of the target density $p_*$ and the normal one $\rho_t$. This approach greatly reduces the sample complexity in high dimensions, and is also a better conditioned algorithm than ULA over $p_*$, due to the multiplication by the normal distribution. In our experiments, we find that a combination of the two approaches excels at distributions with multiple modes.

We then analyze the efficacy and efficiency of our RDMC method. We study the benefits of the reverse diffusion approach both for multi-modal distributions and for high-dimensional heavy-tailed distributions that break the isoperimetric properties (Gross, 1975; Poincaré, 1890; Vempala & Wibisono, 2019). For multi-modal distributions, Langevin-based MCMC approaches suffer an exponential cost from the need for barrier crossing, which makes the mixing time extremely long. The RDMC approach can circumvent this hurdle. For high-dimensional heavy-tailed distributions, our RDMC method circumvents the often-required isoperimetric properties of the target distributions, thereby avoiding the curse of dimensionality.

Contributions. We propose a non-parametric sampling algorithm that leverages the reverse SDE of the OU process. Our approach estimates the score function via a mean estimation subproblem for posteriors, enabling the efficient generation of samples through the reverse SDE. We focus on the complexity of the subproblems and establish the convergence of our algorithm. We find that our approach effectively tackles sampling tasks with ill-conditioned log-Sobolev constants; for example, it excels in challenging scenarios characterized by multi-modal and long-tailed distributions. Our analysis sheds light on the varying complexity of the subproblems at different time points, providing a fresh perspective for revisiting score estimation in diffusion-based models.

2 PRELIMINARIES

In this section, we begin by introducing related work from the perspectives of Markov chain Monte Carlo (MCMC) and diffusion models, and we discuss the connection between these works and the present paper.
Next, we set up notation for the reverse process of diffusion models, which specifies the stochastic differential equation (SDE) that particles follow in RDMC. We first introduce the related work below.

Langevin-based MCMC. The mainstream gradient-based sampling algorithms are mainly based on the continuous Langevin dynamics (LD) for sampling from a target distribution $p_* \propto \exp(-f_*)$. The Unadjusted Langevin Algorithm (ULA) discretizes LD with the Euler-Maruyama scheme and obtains a biased stationary distribution. Due to its simplicity, ULA is widely used in machine learning. Its convergence has been investigated under different criteria, including total variation (TV) distance, Wasserstein distances (Durmus & Moulines, 2019), and Kullback-Leibler (KL) divergence (Dalalyan, 2017; Cheng & Bartlett, 2018; Ma et al., 2019; Freund et al., 2022), and in different settings, such as strong log-concavity, the log-Sobolev inequality (LSI) (Vempala & Wibisono, 2019), and the Poincaré inequality (PI) (Chewi et al., 2021). Several works have accelerated the convergence of ULA, either by decreasing the discretization error with higher-order SDEs, e.g., underdamped Langevin dynamics (Cheng et al., 2018; Ma et al., 2021; Mou et al., 2021), or by aligning the discretized stationary distribution with the target $p_*$, e.g., the Metropolis-adjusted Langevin algorithm (Dwivedi et al., 2018) and proximal samplers (Lee et al., 2021b; Chen et al., 2022b; Altschuler & Chewi, 2023). Liu & Wang (2016) and Dong et al. (2023) also perform sampling tasks with deterministic algorithms whose limiting ODE is derived from the Langevin dynamics. Regarding convergence guarantees, most of these works make landscape assumptions on the target distribution $p_*$, e.g., strong log-concavity, LSI, or PI. For more general distributions, Erdogdu & Hosseinzadeh (2021) and Mousavi-Hosseini et al. (2023) consider KL convergence under modified LSI and weak PI. These extensions allow slower tail growth of the negative log-density $f_*$, compared to the quadratic or even linear case. Although these works extend ULA to more general distributions, the computational burden of ill-conditioned isoperimetry still exists, e.g., exponential dependence on the dimension (Raginsky et al., 2017). In this paper, we introduce another SDE to guide the sampling algorithm, which is non-Markovian and time-inhomogeneous. Our algorithm discretizes this SDE and can reduce the isoperimetric and dimension dependence with respect to TV convergence.

Diffusion Models and Stochastic Localization. In recent years, diffusion models have gained significant attention due to their ability to generate high-quality samples (Ho et al., 2020; Rombach et al., 2022). The core idea of diffusion models is to parameterize the score, i.e., the gradient of the log-density, along the entire forward OU process started from the target distribution $p_*$. The reverse SDE is then associated with the inference process of diffusion models, performing unnormalized sampling from intricate target distributions. In contrast with conventional MCMC trajectories, the most desirable property of this process is that, if the score function can be well approximated, it can sample from a general distribution (Chen et al., 2022a; Lee et al., 2022; 2023). This implies that the reverse SDE has a lower dependency on the isoperimetric requirements of the target distribution, which inspires the design of new sampling algorithms.
On the other hand, the stochastic localization framework (Eldan, 2013; 2020; 2022; El Alaoui et al., 2022; Montanari, 2023) formalizes the general reverse diffusion and discusses its relationship with current diffusion models. Along this line, Montanari & Wu (2023) consider unnormalized sampling for the symmetric spiked model. However, the properties of general unnormalized sampling in this framework remain largely uninvestigated. This work employs the backward path of diffusion models to design a sampling strategy which we can prove to be more effective than the forward path of the Langevin algorithm under suitable conditions. Several other recent studies (Vargas et al., 2023a; Berner et al., 2022; Zhang et al., 2023; Vargas et al., 2023c;b) have also utilized the diffusion path in their posterior samplers. Instead of the closed-form MC sampling approach, these works learn an approximate score function via a parametrized model, following the standard practice of diffusion generative models. The resulting approximate distribution following the backward diffusion path is akin to the ones obtained from variational inference, and contains errors that depend on the expressiveness of the parametric models for the score functions. In addition, the convergence of a sampling process using such an approach depends on the generalization error of the learned score functions (Tzen & Raginsky, 2019) (see the Appendix for more details). It remains open how to bound the sample complexity of such approaches, due to the need for bounding the error of the learned score function along the entire trajectory. In contrast, the RDMC algorithm proposed in this paper can be directly analyzed: it draws samples from the unnormalized distribution without training data, and it is designed to avoid learning-based score function estimation. We are interested in comparisons between RDMC and conventional MCMC algorithms. Thus, we analyze the complexity of estimating the score by drawing samples from the posterior distribution, and we find that our proposed algorithm is much more efficient in certain cases where the log-Sobolev constant of the target distribution is small.

Reverse SDE in Diffusion Models. In this part, we introduce the notation and formulation of the reverse SDE associated with the inference process of diffusion models, which will be used throughout the following sections. We begin with an OU process starting from $p_*$, formulated as

$$dx_t = -x_t\,dt + \sqrt{2}\,dB_t, \qquad x_0 \sim p_0 \propto e^{-f_*}, \tag{1}$$

where $(B_t)_{t\ge0}$ is a Brownian motion and $p_0$ is assigned to be $p_*$. This is the forward process of diffusion models (Song et al., 2020), whose stationary distribution is $\mathcal{N}(0, I)$, and we can track the random variable $x_t$ at time $t$ with density function $p_t$. According to Cattiaux et al. (2021), under mild conditions on a forward process $dx_t = b_t(x_t)\,dt + \sqrt{2}\,dB_t$, the reverse process also admits an SDE description. If we fix the terminal time $T > 0$ and set $\hat{x}_t = x_{T-t}$ for $t\in[0,T]$, the process $(\hat{x}_t)_{t\in[0,T]}$ satisfies the SDE $d\hat{x}_t = \hat{b}_t(\hat{x}_t)\,dt + \sqrt{2}\,dB_t$, where the reverse drift satisfies the relation $b_t + \hat{b}_{T-t} = 2\nabla\ln p_t$ with $x_t \sim p_t$. In this condition, the reverse process of Eq. (1) is

$$d\hat{x}_t = \left(\hat{x}_t + 2\nabla\ln p_{T-t}(\hat{x}_t)\right)dt + \sqrt{2}\,dB_t, \qquad \hat{x}_t \sim p_{T-t}.$$

Thus, once the score function $\nabla\ln p_{T-t}$ is obtained, the reverse SDE induces a sampling algorithm.
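To make the two displays above concrete, the following is a minimal sketch (not the paper's code) of the forward OU transition and one Euler-Maruyama step of the reverse SDE; the `score` argument is a stand-in for the oracle $\nabla\ln p_t$ that Section 3 constructs without neural networks, and all function names are illustrative.

```python
import numpy as np

def ou_forward_sample(x0, t, rng):
    """Sample x_t | x_0 from the OU transition kernel of Eq. (1):
    x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) z,  z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return np.exp(-t) * x0 + np.sqrt(1.0 - np.exp(-2.0 * t)) * z

def reverse_sde_step(x, t, T, dt, score, rng):
    """One Euler-Maruyama step of the reverse SDE
    d x_t = (x_t + 2 grad log p_{T-t}(x_t)) dt + sqrt(2) dB_t."""
    drift = x + 2.0 * score(x, T - t)
    return x + drift * dt + np.sqrt(2.0 * dt) * rng.standard_normal(x.shape)
```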
Discretization and realization of the reverse process. To numerically solve the previous SDE, let $k := \lfloor t/\eta\rfloor$ for $t\in[0,T]$; we approximate $2\nabla\ln p_{T-t}$ by a constant drift $v_k$ on each interval $t\in[k\eta,(k+1)\eta]$. This modification results in a new SDE, given by

$$d\tilde{x}_t = \left(\tilde{x}_t + v_k(\tilde{x}_{k\eta})\right)dt + \sqrt{2}\,dB_t, \qquad t\in[k\eta,(k+1)\eta]. \tag{2}$$

Specifically, when $k = 0$, we set $\tilde{x}_0$ to be sampled from $\hat{p}_0$, which approximates $p_T$. To find the closed form of the solution, we set the auxiliary random variable $r(\tilde{x}_t, t) := \tilde{x}_t e^{-t}$; by Itô's lemma,

$$dr(\tilde{x}_t, t) = e^{-t}\,v_k(\tilde{x}_{k\eta})\,dt + \sqrt{2}\,e^{-t}\,dB_t.$$

Integrating both sides over $[t, t+s]\subseteq[k\eta,(k+1)\eta]$ and multiplying by $e^{t+s}$, we have

$$\tilde{x}_{t+s} = e^{s}\tilde{x}_t + \left(e^{s}-1\right)v_k + \xi, \qquad \xi\sim\mathcal{N}\left(0, \left(e^{2s}-1\right)I_d\right). \tag{3}$$

For the specific construction of $v_k$ to approximate the score, we defer to Section 3.

3 THE REVERSE DIFFUSION MONTE CARLO APPROACH

As shown in SDE (2), we introduce an estimator $v_k(x) \approx 2\nabla_x\ln p_{T-t}(x)$ to implement the reverse diffusion. This section is dedicated to exploring viable estimators and the benefits of the reverse diffusion process. We find that we can reformulate the score as an expectation under the transition kernel of the forward OU process. We also derive the intuition for why RDMC can reduce the isoperimetric dependence on the target distribution compared with the conventional Langevin algorithm. Lastly, we introduce the detailed implementation of RDMC in practice with different score estimators.

3.1 THE SCORE FUNCTION IS THE EXPECTATION OF THE POSTERIOR

We start with the formulation of $\nabla_x\ln p_t(x)$. For general SDEs, the score functions $\nabla\ln p_t$ do not have an analytic form. However, our forward process is an OU process (SDE (1)), whose transition kernel is

$$p_{t|0}(x|x_0) = \left(2\pi\left(1-e^{-2t}\right)\right)^{-d/2}\exp\left(-\frac{\|x - e^{-t}x_0\|^2}{2\left(1-e^{-2t}\right)}\right).$$

This conditional density gives the probability of obtaining $x_t = x$ given $x_0$ in SDE (1). Noting that $p_t(x) = \mathbb{E}_{p_0(x_0)}\left[p_{t|0}(x|x_0)\right]$, we can use this property to derive other score formulations. Bayes' theorem shows that the score can be rewritten as an expectation, as stated in the following lemma.

Lemma 1. Assume that Eq. (1) defines the forward process. The score function can be rewritten as

$$\nabla_x\ln p_{T-t}(x) = \mathbb{E}_{x_0\sim q_{T-t}(\cdot|x)}\left[\frac{e^{-(T-t)}x_0 - x}{1-e^{-2(T-t)}}\right], \quad q_{T-t}(x_0|x) \propto \exp\left(-f_*(x_0) - \frac{\|x - e^{-(T-t)}x_0\|^2}{2\left(1-e^{-2(T-t)}\right)}\right). \tag{4}$$

Algorithm 1 RDMC: reverse diffusion Monte Carlo
1: Input: initial particle $\tilde{x}_0$ sampled from $\hat{p}_0$; terminal time $T$; step sizes $\eta$, $\eta'$; sample size $n$.
2: for $k = 0$ to $\lfloor T/\eta\rfloor - 1$ do
3: Set $v_k = 0$;
4: create $n$ Monte Carlo samples to estimate $v_k \approx 2\,\mathbb{E}_{x\sim q_{T-k\eta}(\cdot|\tilde{x}_{k\eta})}\left[\frac{e^{-(T-k\eta)}x-\tilde{x}_{k\eta}}{1-e^{-2(T-k\eta)}}\right]$, where $q_{T-k\eta}(x|\tilde{x}_{k\eta}) \propto \exp\left(-f_*(x)-\frac{\|\tilde{x}_{k\eta}-e^{-(T-k\eta)}x\|^2}{2(1-e^{-2(T-k\eta)})}\right)$;
5: $\tilde{x}_{(k+1)\eta} = e^{\eta}\tilde{x}_{k\eta} + (e^{\eta}-1)\,v_k + \xi$, where $\xi\sim\mathcal{N}\left(0,(e^{2\eta}-1)I_d\right)$.
6: end for
7: Return: $\tilde{x}_{\lfloor T/\eta\rfloor\eta}$.

Figure 2: Illustrations of $p_t$ and $q_t$ (at $t = 0, 0.01, 0.1, 0.4$) and of their log-Sobolev (LSI) constants. The target distribution $p_*$ is a Gaussian mixture; we plot $q_t(\cdot|x=0)$ for illustration. As $t$ increases, the modes of $p_t$ collapse toward zero rapidly, corresponding to an improving LSI constant. For $q_t$, the barrier height remains small, resulting in a relatively large LSI constant as long as $T = O(1)$. Thus initializing with $p_T$ and running RDMC reduces the computation complexity for multi-modal $p_*$.

The proof of Lemma 1 is presented in Appendix C.1.
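Lemma 1 and the exponential-integrator update of Eq. (3) together give the outer loop of Algorithm 1. The following is a minimal sketch, assuming a `sample_posterior(x, t_rem)` routine that returns approximate draws from $q_{T-k\eta}(\cdot|x)$ (e.g., the inner loop of Section 3.3); names are illustrative, not the paper's code.

```python
import numpy as np

def score_estimate(x, t_rem, samples):
    """Empirical mean estimator of Eq. (5): with a = e^{-t_rem},
    v_k(x) = (2/n) sum_i (a x_i - x) / (1 - a^2), approximating
    2 grad log p_{T - k eta}(x) by Lemma 1."""
    a = np.exp(-t_rem)
    return 2.0 * (a * samples.mean(axis=0) - x) / (1.0 - a ** 2)

def rdmc(x0, T, eta, sample_posterior, rng):
    """Outer loop of Algorithm 1 with the update of Eq. (3):
    x_{(k+1) eta} = e^eta x_{k eta} + (e^eta - 1) v_k + N(0, (e^{2 eta} - 1) I)."""
    x = x0                                    # x0 ~ p_hat_0, approximating p_T
    for k in range(int(T / eta)):
        t_rem = T - k * eta                   # remaining forward time T - k*eta
        samples = sample_posterior(x, t_rem)  # (n, d) draws from q_{T-k eta}(.|x)
        v = score_estimate(x, t_rem, samples)
        xi = rng.standard_normal(x.shape)
        x = (np.exp(eta) * x + (np.exp(eta) - 1.0) * v
             + np.sqrt(np.exp(2.0 * eta) - 1.0) * xi)
    return x
```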
For any $t > 0$, observe that $\log q_{T-t}$ incorporates an additional quadratic term. In scenarios where $p_*$ satisfies the log-Sobolev inequality (LSI), this term improves the LSI constant of $q_{T-t}$, thereby accelerating convergence. Conversely, for heavy-tailed $p_*$ (where the growth of $f_*$ is slower than quadratic), the extra term still imposes quadratic growth, yielding sub-Gaussian tails and log-Sobolev properties. Notably, as $t$ approaches $T$, the quadratic component becomes predominant, rendering $q_{T-t}$ strongly log-concave and easy to sample. In summary, every $q_{T-t}$ exhibits a larger LSI constant than $p_*$; as $t$ increases, this constant grows, ultimately driving $q_{T-t}$ toward strong log-concavity. Consequently, this provides a sequence of distributions with LSI constants surpassing that of $p_*$, enabling efficient score estimation for $\nabla\ln p_{T-t}$.

By Lemma 1, the expectation under $q_{T-t}(\cdot|x)$ can be approximated by an empirical mean estimator with sufficiently many samples from $q_{T-t}(\cdot|x)$; the gradient complexity required by this sampling subproblem thus becomes the bottleneck of our algorithm. Suppose $\{x_k^{(i)}\}_{i=1}^{n_k}$ are samples drawn from $q_{T-k\eta}(\cdot|x)$ when $\tilde{x}_{k\eta} = x$, for any $x\in\mathbb{R}^d$; the construction of $v_k(x)$ in Eq. (2) is

$$v_k(x) = \frac{1}{n_k}\sum_{i=1}^{n_k} v_k^{(i)}(x), \qquad v_k^{(i)}(x) = \frac{2\left(e^{-(T-k\eta)}x_k^{(i)} - x\right)}{1-e^{-2(T-k\eta)}}. \tag{5}$$

3.2 REVERSE SDE VS. LANGEVIN DYNAMICS: INTUITION

From Figure 1, we observe that the RDMC method deviates significantly from conventional gradient-based MCMC techniques such as Langevin dynamics. The figure visualizes the paths from a Gaussian distribution to a mixture of Gaussians. Langevin dynamics, being solely driven by the gradient information of the target density $p_*$, tends to get trapped in local regions, resulting in uneven sampling of the Gaussian mixture. More precisely, $p_*$ admits a small LSI constant due to the separation of the modes (Ma et al., 2019; Schlichting, 2019; Menz & Schlichting, 2014). Consequently, the convergence of conventional MCMC methods becomes notably challenging in such scenarios (Tosh & Dasgupta, 2014).

To better demonstrate the effect of our proposed SDE, we compute LSI constant estimates for a 1-d case in Figure 2. Due to the shrinkage property of the forward process, the LSI constant of $p_t$ can be exponentially better than the original one. Meanwhile, for $T = O(1)$, the LSI constant of $q_t$ is also well-behaved. In our algorithm, well-conditioned LSI constants for both $p_T$ and $q_t$ reduce the computational overhead. Thus, we can choose an intermediate $T$ to connect the local modes and run the reverse SDE back toward $p_*$. RDMC can distribute samples more evenly across different modes and enjoys faster convergence in those challenging cases. Moreover, if the growth rate of $-\log p_*$ is slower than quadratic (heavy tails), we can choose a large $T$ and use $p_\infty$ to approximate $p_T$; then all $-\log q_t$ have quadratic tail growth, which implies the log-Sobolev property. These intuitions also explain why diffusion models excel at modeling complex distributions in high-dimensional spaces. We provide the quantitative explanation in Section 4.

3.3 ALGORITHMS FOR SCORE ESTIMATION WITH $q_{T-t}(\cdot|x)$

Given the expectation form of the score in Lemma 1, a detailed reverse sampling algorithm follows naturally, as presented in Algorithm 1; the tilted posterior that every score estimate targets is sketched in code below.
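A minimal sketch of the tilted posterior $q_{T-t}(\cdot|x)$ of Eq. (4), assuming `f`/`grad_f` compute $f_*$ and $\nabla f_*$ for the target $p_*\propto e^{-f_*}$; the names are illustrative. The gradient below is exactly the drift used by the ULA inner loop (Algorithm 2).

```python
import numpy as np

def neg_log_q(x0, x, t_rem, f):
    """-log q_{T-t}(x0 | x) up to an additive constant (Eq. (4)):
    f(x0) + ||x - a x0||^2 / (2 (1 - a^2)), with a = e^{-t_rem}."""
    a = np.exp(-t_rem)
    return f(x0) + np.sum((x - a * x0) ** 2) / (2.0 * (1.0 - a ** 2))

def grad_neg_log_q(x0, x, t_rem, grad_f):
    """Gradient in x0; the quadratic tilt contributes a^2 / (1 - a^2) extra
    curvature, which is what improves the LSI constant of q over p_*."""
    a = np.exp(-t_rem)
    return grad_f(x0) + a * (a * x0 - x) / (1.0 - a ** 2)
```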
Specifically, Algorithm 1 can be summarized in the following steps: (1) choose a proper $T$ and $\hat{p}_0$ such that $p_T \approx \hat{p}_0$; this can be done either by setting $\hat{p}_0 = p_\infty$ for large $T$ or by approximating $p_T$ with ULA (Algorithm 3 in the Appendix); (2) sample from a distribution that approximates $q_{T-t}$ (Step 4 of Algorithm 1); (3) advance the particle using the approximate $q_{T-t}$ samples (Step 5 of Algorithm 1); (4) repeat until $t \to T$. After (4), we can also run the Langevin algorithm to fine-tune the output when the gradient complexity budget is not yet exhausted.

The key to implementing Algorithm 1 is estimating the scores $\nabla\ln p_{T-t}$ via Step 4, using samples from $q_{T-t}$. In what follows, we discuss implementations that combine importance sampling with the unadjusted Langevin algorithm (ULA).

Importance weighted score estimator. We first consider an importance sampling approach for estimating the score $\nabla\ln p_{T-t}$. From Eq. (4), the key is to estimate

$$\nabla_x\ln p_{T-t}(x) = \mathbb{E}_{x_0\sim q_{T-t}(\cdot|x)}\left[\frac{e^{-(T-t)}x_0-x}{1-e^{-2(T-t)}}\right] = \frac{1}{Z}\,\mathbb{E}_{x_0\sim\rho_{T-t}(\cdot|x)}\left[\frac{e^{-(T-t)}x_0-x}{1-e^{-2(T-t)}}\,e^{-f_*(x_0)}\right],$$

where $Z = \mathbb{E}_{x_0\sim\rho_{T-t}(\cdot|x)}\left[\exp(-f_*(x_0))\right]$ and $\rho_{T-t}(x_0|x) \propto \exp\left(-\frac{\|x-e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\right)$, so that sampling from $\rho_{T-t}$ takes negligible computational resources. The main challenge is the sample complexity of estimating the two expectations. Since $\rho_{T-t}(x_0|x)$ is Gaussian with variance $\sigma_t^2 = \frac{1-e^{-2(T-t)}}{e^{-2(T-t)}}$, as long as $\frac{e^{-(T-t)}x_0-x}{1-e^{-2(T-t)}}\exp(-f_*(x_0))$ is $G$-Lipschitz in $x_0$, a sample size of $N = O\left(\sigma_t^2 G^2\epsilon^{-2}\right)$ makes the errors of the two mean estimators less than $\epsilon$ (Wainwright, 2019). However, the sample size required for the importance sampling method to achieve a small overall error is known to scale exponentially with the KL divergence between $\rho_{T-t}$ and $q_{T-t}$ (Chatterjee & Diaconis, 2018), which can depend on the dimension. In our formulation, this is because the true denominator $Z = \mathbb{E}_{x_0\sim\rho_{T-t}(\cdot|x)}\left[e^{-f_*(x_0)}\right]$ can be as small as $\exp(-d)$. As a result, to make the overall score estimation error small, the error tolerance, and in turn the sample size required for estimating $Z$, can scale exponentially with the dimension.

ULA score estimator. An alternative score estimator requires that the mean of the distribution $\hat{q}_{T-t}(\cdot|x)$ underlying the generated samples sufficiently approach that of $q_{T-t}(\cdot|x)$, which can be achieved by closing the gap in KL divergence or Wasserstein distance between $\hat{q}_{T-t}(\cdot|x)$ and $q_{T-t}(\cdot|x)$. Since the additional quadratic term in Eq. (4) endows $q_{T-t}(\cdot|x)$ with quadratic tail growth, the isoperimetric condition is established, and we expect convergence by running ULA on the sampling subproblem whose target distribution is $q_{T-t}(\cdot|x)$. We provide the detailed implementation in Algorithm 2.

ULA score estimator with importance sampling initialization. Inspired by the previous estimators, we can combine the importance sampling approach with ULA. In particular, we first run the importance sampling method to form a rough score estimate; we then run ULA initialized at the resulting mean estimate and obtain a refined, accurate score estimate. Via this combination, we efficiently obtain accurate score estimates through ULA when $t$ is close to $T$, and quickly obtain rough score estimates via importance sampling when $t$ is close to $0$. We find empirically that this combination generally performs well when $t$ interpolates between the two regimes (Figure 3 in Section 4).
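A sketch of this combined estimator under the assumptions above: a self-normalized importance-sampling mean under the Gaussian proposal $\rho_{T-t}(\cdot|x)$ seeds a short ULA run on $q_{T-t}(\cdot|x)$, mirroring Algorithm 2 below. Everything here is an illustrative reconstruction, not the paper's implementation.

```python
import numpy as np

def is_posterior_mean(x, t_rem, f, n, rng):
    """Self-normalized importance sampling estimate of E_{q_{T-t}(.|x)}[x0]:
    proposals x0 ~ rho_{T-t}(.|x) = N(e^{t_rem} x, (e^{2 t_rem} - 1) I),
    with weights proportional to e^{-f(x0)}."""
    a = np.exp(-t_rem)
    mu, std = x / a, np.sqrt(1.0 - a ** 2) / a
    x0 = mu + std * rng.standard_normal((n,) + x.shape)
    logw = -np.array([f(s) for s in x0])
    w = np.exp(logw - logw.max())             # stabilized self-normalized weights
    return (w[:, None] * x0).sum(axis=0) / w.sum()

def ula_refine(x_init, x, t_rem, grad_f, n, K, step, rng):
    """Algorithm 2: K ULA steps for n particles targeting q_{T-t}(.|x),
    started near the rough importance-sampling mean x_init."""
    a = np.exp(-t_rem)
    xs = x_init + 0.01 * rng.standard_normal((n,) + x_init.shape)
    for _ in range(K):
        g = np.stack([grad_f(s) + a * (a * s - x) / (1.0 - a ** 2) for s in xs])
        xs = xs - step * g + np.sqrt(2.0 * step) * rng.standard_normal(xs.shape)
    return xs                                 # feed into score_estimate above
```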
Algorithm 2 ULA inner loop for the $q_t(\cdot|x)$ sampler (Step 4 of Algorithm 1)
1: Input: condition $x$ and time $t$; sample size $n$; initial particles $\{x_0^i\}_{i=1}^n$; iterations $K$; step size $\eta'$.
2: for $k = 1$ to $K$ do
3: for $i = 1$ to $n$ do
4: $x_k^i = x_{k-1}^i - \eta'\left(\nabla f_*(x_{k-1}^i) + \frac{e^{-t}\left(e^{-t}x_{k-1}^i - x\right)}{1-e^{-2t}}\right) + \sqrt{2\eta'}\,\xi_k$, where $\xi_k\sim\mathcal{N}(0, I_d)$
5: end for
6: end for
7: Return: $\{x_K^i\}_{i=1}^n$.

4 ANALYSES AND EXAMPLES OF THE RDMC APPROACH

In this section, we analyze the overall complexity of RDMC with ULA inner-loop estimation. Since the complexity of the importance sampling estimator was discussed in Section 3.3, we consider Algorithm 1 with direct ULA sampling of $q_{T-t}(\cdot|x)$, rather than the importance sampling initialization, to keep the analysis clean. To establish the convergence of RDMC, we make the following assumptions.

[A1] For all $t \ge 0$, the score $\nabla\ln p_t$ is $L$-Lipschitz.
[A2] The second moment of the target distribution $p_*$ is upper bounded, i.e., $\mathbb{E}_{p_*}\left[\|\cdot\|^2\right] = m_2^2$.

These assumptions are standard in diffusion analysis to guarantee convergence to the target distribution (Chen et al., 2022a). Specifically, Assumption [A1] governs the smoothness of the forward process, which ensures the feasibility of the numerical discretization used to approximate the solution of the continuous SDE. Assumption [A2] constrains the moments of the target distribution; it prevents an excessive accumulation of probability mass in the tail region, thereby ensuring a well-distributed target.

4.1 OUTER LOOP COMPLEXITY

According to Algorithm 1, the overall gradient complexity depends on the number of outer-loop iterations $k$, as well as on the complexity of achieving accurate score estimates (Step 4 of Algorithm 1). When the score is well approximated, i.e., $\mathbb{E}_{p_{T-k\eta}}\left[\|v_k - 2\nabla\ln p_{T-k\eta}\|^2\right] \le \epsilon^2$, the overall error in TV distance, $D_{TV}(p_*, \tilde{p}_T)$, can be made $O(\epsilon)$ by choosing a small $\eta$. Under this condition, the number of outer-loop iterations satisfies $k = \Omega(L^2 d\,\epsilon^{-2})$, which matches the complexity in diffusion analyses (Chen et al., 2022a; Lee et al., 2022; 2023). This complexity of the score estimation oracle is independent of the log-Sobolev (LSI) constant of the target density $p_*$, which means the isoperimetric dependence of RDMC is dominated by the subproblem of sampling from $q_{T-t}$. Specifically, the following theorem gives conditions under which $D_{TV}(p_*, \tilde{p}_T) = O(\epsilon)$.

Theorem 1. For any $\epsilon > 0$, assume that $D_{KL}(p_T\|\hat{p}_0) < \epsilon$ for some $T > 0$ with $\hat{p}_0$ as suggested in Algorithm 1, and let $\eta = C_1(d + m_2^2)^{-1}\epsilon^2$. Suppose the OU process induced by $p_*$ satisfies Assumptions [A1], [A2], and each $q_{T-k\eta}$ satisfies the log-Sobolev inequality with constant $\mu_k$ ($k\eta \le T$). Set

$$n_k = 64\,T d\,\mu_k^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1}, \qquad E_k = 2^{-13}\,T^{-4}d^{-2}\mu_k^2\,\eta^8\epsilon^4\delta^4, \tag{6}$$

where $n_k$ is the number of samples used to estimate the score, and $E_k$ is the KL error tolerance for the inner loop. Then, with probability at least $1-\delta$, Algorithm 1 converges in TV distance with $O(\epsilon)$ error.

Therefore, the key requirements for the convergence of RDMC can be summarized as follows:
- the LSI of the target distributions of the sampling subproblems, i.e., $q_{T-k\eta}$, is maintained;
- the initialization $\hat{p}_0$ of the reverse process is sufficiently close to $p_T$.
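Read as code, the parameter choices of Theorem 1 amount to the following order-of-magnitude helper; the numeric factors are taken literally from the statement, and $C_1$ is the unspecified absolute constant of the theorem, set to 1 here as a placeholder assumption.

```python
def theorem1_params(eps, delta, d, m2_sq, T, mu_k, C1=1.0):
    """Order-of-magnitude hyperparameters from Theorem 1 / Eq. (6)."""
    eta = C1 * eps ** 2 / (d + m2_sq)                          # outer step size
    n_k = 64.0 * T * d / (mu_k * eta ** 3 * eps ** 2 * delta)  # inner sample size
    E_k = (2.0 ** -13) * mu_k ** 2 * eta ** 8 * eps ** 4 * delta ** 4 \
          / (T ** 4 * d ** 2)                                  # inner KL tolerance
    return eta, n_k, E_k
```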
4.2 OVERALL COMPUTATION COMPLEXITY

In this section, we consider the overall computation complexity of RDMC. Note that the LSI constants of $q_{T-k\eta}$ depend on the properties of $p_*$; we therefore consider more specific assumptions that bound the LSI constants of $q_{T-k\eta}$. In particular, we demonstrate the benefits of using $q_{T-k\eta}$ for targets $p_*$ whose LSI constants vanish or are exponentially small. Compared with ULA, RDMC improves the gradient complexity, owing to the improved LSI constants of $q_{T-k\eta}$ over that of $p_*$ for finite $T$. Even for heavy-tailed $p_*$ that do not satisfy the Poincaré inequality, the target distributions of the sampling subproblems, i.e., $q_{T-k\eta}$, still preserve the LSI property, which helps RDMC alleviate the exponential dimension dependence of the gradient complexity for TV-distance convergence. Specifically, Section 4.2.1 considers the case where the LSI constant of $p_*$ depends exponentially on a radius $R$, as is typical for mixture models; our proposed RDMC reduces the exponent by a constant factor. Section 4.2.2 considers the case where $p_*$ does not satisfy LSI, yet RDMC creates a sequence of LSI subproblems, which makes the dimension dependence polynomial. We first provide an estimate of the LSI constant of $q_t$ under the general Lipschitzness assumption [A1].

Lemma 2. Under [A1], the LSI constant of $q_t$ in the forward OU process is at least $\frac{e^{-2t}}{2(1-e^{-2t})}$ when $0 \le t \le \frac{1}{2}\ln\left(1+\frac{1}{2L}\right)$.

This estimate indicates that when the quadratic term dominates the log-density of $q_t$, the log-Sobolev property is guaranteed.

4.2.1 IMPROVING THE LSI CONSTANT DEPENDENCE FOR MIXTURE MODELS

In addition to Assumptions [A1] and [A2], we study the case where $p_*$ satisfies the following assumption:

[A3] There exists $R > 0$ such that $f_*(x)$ is $m$-strongly convex for $\|x\| \ge R$.

In this case, the target density $p_*$ admits an LSI constant that scales exponentially with the radius $R$ of the region of non-convexity, i.e., $\frac{m}{2}\exp(-16LR^2)$, as shown in Ma et al. (2019). This implies that if we draw samples from $p_*$ with ULA, the gradient complexity depends exponentially on $R^2$. In RDMC, however, with suitable choices of $T = O(\log R)$ and $\hat{p}_0$, the exponential dependence on $R$ is removed, which is the bottleneck for mixture models. In Proposition 1 below, we select values of $T$ and $\hat{p}_0$ that achieve the desired accuracy. Notably, Lemma 2 suggests that we can choose an $O(1)$ terminal time that keeps the LSI constants of $q_t$ well-behaved, while $p_T$ is much easier to sample from than $p_0$. That is to say, RDMC exhibits lower isoperimetric dependence than conventional MCMC techniques; thus, we find that the overall computation complexity of RDMC reduces the dependence on $R$.

Proposition 1. Assume that the OU process induced by $p_*$ satisfies [A1], [A2], [A3]. We approximate $p_T$ with $\hat{p}_0$ via ULA, whose iteration count with respect to the LSI constant is $\Omega\left(m^{-1}\exp\left(16LR^2e^{-2T}-2T\right)\right)$. For any $\epsilon > 0$ and $T \le \frac{1}{2}\ln\left(1+\frac{1}{2L}\right)$, Algorithm 1 started from $\hat{p}_0$ converges to the target distribution with probability at least $1-\delta$ and total gradient complexity polynomial in $(\epsilon^{-1}, \delta^{-1})$, independent of $R$.

Example: Gaussian Mixture Model. We consider the example $p_*(x) \propto \exp\left(-\frac{1}{2}\|x\|^2\right) + \exp\left(-\frac{1}{2}\|x-y\|^2\right)$ with $\|y\| \gg 1$. The LSI constant is $\Theta\left(e^{-C_0\|y\|^2}\right)$ (Ma et al., 2019; Schlichting, 2019; Menz & Schlichting, 2014), reflecting the complexity of the target distribution. Tosh & Dasgupta (2014) prove that the iteration complexity of MCMC-based algorithms sampling from this Gaussian mixture is lower bounded exponentially in the squared distance $\|y\|^2$ between the two modes: $\Omega\left(\exp\left(\|y\|^2/8\right)\right)$. Note that RDMC is not a conventional MCMC method. With the importance sampling score estimator, $O(\epsilon^{-2})$ iterations and $O(\epsilon^{-2})$ samples at each step make the TV distance converge with $O(\epsilon)$ error (Appendix B). The computational overhead of the outer loop does not depend on $y$. For the inner-loop score estimation, we can choose $T = \frac{1}{2}\ln\frac{3}{2}$ to make the LSI constants of $q_t$ be $O(1)$. We can run ULA to initialize from $p_T$, which reduces the barrier between modes significantly: the LSI constant of $p_T$ is $\Theta\left(e^{-C_0 e^{-2T}\|y\|^2}\right)$, improving the dependence on $\|y\|^2$ by a factor $e^{-2T} = \frac{2}{3}$. Since this factor sits in the exponent, the reduction is exponential.

Figure 3 demonstrates this example. We choose different $r$ to represent the change of $R$ in [A3], and compare the convergence of Langevin Monte Carlo (LMC), underdamped LMC (ULMC), and RDMC in terms of gradient complexity. As $r$ increases, LMC fails to converge within the allotted time; ULMC, with the introduction of velocity, alleviates this situation; notably, RDMC still ensures fast mixing for large $r$. The inner loop is initialized with the importance sampling mean estimator. Hyper-parameters and more results are provided in Appendix F.

Figure 3: Maximum mean discrepancy (MMD) convergence of LMC, ULMC, and RDMC versus gradient complexity, for mode separations $r = 0.5, 1.0, 2.0$. The first row shows the target distributions, with increasing mode separation $r$ and barrier heights leading to smaller log-Sobolev (LSI) constants. The second row displays the algorithms' convergence, revealing RDMC's pronounced advantage over ULMC/LMC, especially for large $r$.
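For concreteness, a minimal sketch of the two-mode target of this example and the gradient oracle it exposes to the samplers; the mode separation `y` is the knob driving the $\Theta(e^{-C_0\|y\|^2})$ LSI constant, and the functions plug into the `f`/`grad_f` slots of the estimator sketches in Section 3.3.

```python
import numpy as np

def gmm_f(x, y):
    """Negative log-density (up to a constant) of the two-mode target
    p_*(x) proportional to exp(-||x||^2/2) + exp(-||x - y||^2/2)."""
    a = -0.5 * np.sum(x ** 2)
    b = -0.5 * np.sum((x - y) ** 2)
    return -np.logaddexp(a, b)

def gmm_grad_f(x, y):
    """grad f_*: a convex combination of the two modes' gradients, with the
    shifted mode's weight w = e^b / (e^a + e^b) computed stably."""
    a = -0.5 * np.sum(x ** 2)
    b = -0.5 * np.sum((x - y) ** 2)
    w = np.exp(b - np.logaddexp(a, b))
    return (1.0 - w) * x + w * (x - y)
```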
4.2.2 IMPROVING THE DIMENSION DEPENDENCE IN HEAVY-TAILED TARGET DISTRIBUTIONS

In this subsection, we study the case where $p_*$ satisfies Assumptions [A1], [A2], and

[A4] For any $r > 0$, we can find some $R(r)$ such that $f_*(x) + r\|x\|^2$ is convex for $\|x\| \ge R(r)$. Without loss of generality, we suppose $R(r) = c_R/r^n$ for some $n > 0$ and $c_R > 0$.

Assumption [A4] can be considered a soft version of [A3]. Specifically, it permits the tail growth of the negative log-density $f_*$ to be slower than quadratic, which encompasses certain target distributions that fall outside the constraints of LSI and PI. Furthermore, since the additional quadratic term in Eq. (4) dominates the tail, $q_{T-k\eta}$ satisfies LSI for all $t > 0$.

Lemma 3. Under [A1] and [A4], the LSI constant of $q_t$ in the forward OU process is at least $\frac{e^{-2t}}{6\left(1-e^{-2t}\right)}\exp\left(-16\cdot3L\,R\!\left(\frac{e^{-2t}}{6\left(1-e^{-2t}\right)}\right)^{2}\right)$ for any $t \ge 0$, where $R(\cdot)$ is given by [A4].

The tail property guarantees a uniform LSI constant, which enables us to estimate the score for any $p_t$. We can thus consider cases free from constraints on the properties of $p_T$ and let $T$ be sufficiently large: by setting $T$ at the $O(\ln(1/\epsilon))$ level, we can approximate $p_T$ with $p_\infty$, the stationary distribution of the forward process. Furthermore, since all $q_t$ are log-Sobolev, we can run RDMC to sample from heavy-tailed $p_*$ in the absence of a log-Sobolev inequality. The explicit computational complexity of RDMC needed to converge to any specified accuracy is detailed in the following proposition.

Proposition 2. Assume that the target distribution $p_*$ satisfies Assumptions [A1], [A2], and [A4], and take $p_\infty$ as $\hat{p}_0$. For any $\epsilon > 0$, by running Algorithm 1 with the ULA inner loop and the hyperparameters of Theorem 1, with probability at least $1-\delta$ we have $D_{TV}\left(\tilde{p}_T, p_*\right) \le O(\epsilon)$ with $\Omega\left(d^{18}\epsilon^{-16n-83}\delta^{-6}\exp\left(\epsilon^{-16n}\right)\right)$ gradient complexity.

Example: Potentials with Sub-Linear Tails. We consider the example $p_*(x) \propto \exp\left(-\left(\|x\|^2+1\right)^a\right)$ with $a\in(0, 0.5)$. Lemma 5 shows that this $p_*$ satisfies Assumption [A4] with $n = (2-2a)^{-1}$. Moreover, such target distributions with sub-linear tails also satisfy the weak Poincaré inequality (wPI) introduced in Mousavi-Hosseini et al. (2023), under which the dimension dependence of ULA to achieve convergence in TV distance is proven to be $\Omega\left(d^{4a^{-1}+3}\right)$. Compared with this result, the complexity of RDMC shown in Proposition 2 has a lower dimension dependence when $a \le 4/15$.
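Written as a sketch compatible with the estimator code above, the sub-linear-tail potential of this example is as follows; per Lemma 5 it satisfies [A4] with $c_R = 1$ and $n = 1/(2-2a)$.

```python
import numpy as np

def sublinear_f(x, a=0.25):
    """Heavy-tailed potential f_*(x) = (||x||^2 + 1)^a with a in (0, 0.5)."""
    return (np.sum(x ** 2) + 1.0) ** a

def sublinear_grad_f(x, a=0.25):
    """grad f_* = 2 a (||x||^2 + 1)^{a-1} x: the curvature vanishes at
    infinity, so p_* is heavy-tailed and fails the log-Sobolev inequality."""
    return 2.0 * a * (np.sum(x ** 2) + 1.0) ** (a - 1.0) * x
```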
Conclusion. This paper presents a novel sampling algorithm based on the reverse SDE of the OU process. The algorithm efficiently generates samples by using the mean estimation of a subproblem to estimate the score function. It converges in total variation distance and proves effective for general sampling tasks with or without isoperimetric conditions. The algorithm exhibits lower isoperimetric dependence than conventional MCMC techniques, making it well suited to challenging multi-modal and high-dimensional sampling. The analysis provides insights into the complexity of score estimation within the OU process, given the conditional posterior distribution.

ACKNOWLEDGEMENTS

The research is partially supported by NSF awards SCALE MoDL-2134209 and CCF-2112665 (TILOS). It is also supported in part by the DARPA AIE program, the U.S. Department of Energy, Office of Science, the Facebook Research Award, and CDC-RFA-FT-23-0069 from the CDC's Center for Forecasting and Outbreak Analytics.

REFERENCES

Jason M. Altschuler and Sinho Chewi. Faster high-accuracy log-concave sampling via algorithmic warm starts. arXiv preprint arXiv:2302.10249, 2023.

Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and Geometry of Markov Diffusion Operators, volume 103. Springer, 2014.

Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. arXiv preprint arXiv:2211.01364, 2022.

Patrick Cattiaux, Giovanni Conforti, Ivan Gentil, and Christian Léonard. Time reversal of diffusion processes under a finite entropy condition. arXiv preprint arXiv:2104.07708, 2021.

Djalil Chafaï. Entropies, convexity, and functional inequalities: on Phi-entropies and Phi-Sobolev inequalities. Journal of Mathematics of Kyoto University, 44(2):325-363, 2004.

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. Annals of Applied Probability, 28(2), 2018.

Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pp. 4735-4763. PMLR, 2023.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022a.

Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp. 2984-3014. PMLR, 2022b.

Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pp.
186-211. PMLR, 2018.

Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pp. 300-323. PMLR, 2018.

Sinho Chewi, Murat A. Erdogdu, Mufan Bill Li, Ruoqi Shen, and Matthew Zhang. Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev. arXiv preprint arXiv:2112.12662, 2021.

Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pp. 678-689. PMLR, 2017.

Hanze Dong, Xi Wang, Yong Lin, and Tong Zhang. Particle-based variational inference with preconditioned functional gradient flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6OphWWAE3cS.

Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. 2019.

Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pp. 793-797. PMLR, 2018.

Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Sampling from the Sherrington-Kirkpatrick Gibbs measure via algorithmic stochastic localization. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 323-334. IEEE, 2022.

Ronen Eldan. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geometric and Functional Analysis, 23(2):532-569, 2013.

Ronen Eldan. Taming correlations through entropy-efficient measure decompositions with applications to mean-field approximation. Probability Theory and Related Fields, 176(3-4):737-755, 2020.

Ronen Eldan. Analysis of high-dimensional distributions using pathwise methods. Proceedings of the ICM, to appear, 2022.

Murat A. Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: The interplay between tail growth and smoothness. In Conference on Learning Theory, pp. 1776-1822. PMLR, 2021.

Yoav Freund, Yi-An Ma, and Tong Zhang. When is the convergence time of Langevin algorithms dimension independent? A composite optimization viewpoint. Journal of Machine Learning Research, 23(214), 2022.

Leonard Gross. Logarithmic Sobolev inequalities. American Journal of Mathematics, 97(4):1061-1083, 1975.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Jian Huang, Yuling Jiao, Lican Kang, Xu Liao, Jin Liu, and Yanyan Liu. Schrödinger-Föllmer sampler: sampling without ergodicity. arXiv preprint arXiv:2106.10880, 2021.

Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard Problems, pp. 482-520, 1996.

Jean-François Le Gall. Brownian Motion, Martingales, and Stochastic Calculus. Springer, 2016.

Holden Lee, Chirag Pabbaraju, Anish Sevekari, and Andrej Risteski. Universal approximation for log-concave distributions using well-conditioned normalizing flows. arXiv preprint arXiv:2107.02951, 2021a.

Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv preprint arXiv:2206.06227, 2022.

Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pp. 946-985.
PMLR, 2023.

Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted Gaussian oracle. In Conference on Learning Theory, pp. 2993-3050. PMLR, 2021b.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29, 2016.

Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881-20885, 2019.

Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3), 2021.

Georg Menz and André Schlichting. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. 2014.

Andrea Montanari. Sampling, diffusions, and stochastic localization. arXiv preprint arXiv:2305.10690, 2023.

Andrea Montanari and Yuchen Wu. Posterior sampling from the spiked models via diffusion processes. arXiv preprint arXiv:2304.11449, 2023.

Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. The Journal of Machine Learning Research, 22(1):1919-1959, 2021.

Alireza Mousavi-Hosseini, Tyler Farghly, Ye He, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Towards a complete analysis of Langevin Monte Carlo: Beyond Poincaré inequality. arXiv preprint arXiv:2303.03589, 2023.

Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1993.

Henri Poincaré. Sur les équations aux dérivées partielles de la physique mathématique. American Journal of Mathematics, pp. 211-294, 1890.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674-1703. PMLR, 2017.

Christian P. Robert, George Casella, and George Casella. Monte Carlo Statistical Methods, volume 2. Springer, 1999.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

O. S. Rothaus. Diffusion on compact Riemannian manifolds and logarithmic Sobolev inequalities. Journal of Functional Analysis, 42(1):102-109, 1981.

André Schlichting. Poincaré and log-Sobolev inequalities for mixtures. Entropy, 21(1):89, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Christopher Tosh and Sanjoy Dasgupta. Lower bounds for the Gibbs sampler over mixtures of Gaussians. In International Conference on Machine Learning, pp. 1467-1475. PMLR, 2014.

Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pp. 3084-3114. PMLR, 2019.

Francisco Vargas, Will Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. arXiv preprint arXiv:2302.13834, 2023a.

Francisco Vargas, Andrius Ovsianas, David Fernandes, Mark Girolami, Neil D. Lawrence, and Nikolas Nüsken.
Bayesian learning via neural Schrödinger-Föllmer flows. Statistics and Computing, 33(1):3, 2023b.

Francisco Vargas, Teodora Reu, and Anna Kerekes. Expressiveness remarks for denoising diffusion based sampling. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023c.

Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. Advances in Neural Information Processing Systems, 32, 2019.

Cédric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. arXiv preprint arXiv:2310.02679, 2023.

A PROOF SKETCH

For a better understanding of our paper, we provide a proof sketch below. Lemma 1 is a direct result of Bayes' theorem (Appendix C.1); it is the main motivation of our algorithm, which estimates the score with samples from $q$. Our main contribution is to prove the convergence and evaluate the complexity of Algorithm 1, where the inner loop is performed by Algorithm 2. Our analysis is based on the TV distance¹, where we use the data-processing inequality, the triangle inequality, and Pinsker's inequality (refer to Eq. (19) for details) to obtain the upper bound

$$D_{TV}\left(\tilde{p}_T, p_*\right) \le \underbrace{\sqrt{\tfrac{1}{2}D_{KL}\left(\tilde{p}_0\,\|\,p_T\right)}}_{\text{Term 1}} + \underbrace{\sqrt{\tfrac{1}{2}D_{KL}\left(\hat{P}^T\,\|\,P^{p_T}_T\right)}}_{\text{Term 2}},$$

where Term 1 is the error between $\tilde{p}_0$ and $p_T$, and Term 2 is the score estimation loss along the whole trajectory. Theorem 1 considers the case where every $q_{T-k\eta}$ is log-Sobolev with constant $\mu_k$ ($k\eta \le T$); these log-Sobolev constants can further be estimated under additional assumptions on $p_*$. By definition,

$$2\left(\text{Term 2}\right)^2 = D_{KL}\left(\hat{P}^T\,\|\,P^{p_T}_T\right) = \frac{1}{4}\,\mathbb{E}_{\hat{P}^T}\left[\sum_k\int_{k\eta}^{(k+1)\eta}\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\right\|^2dt\right],$$

so Term 2 is defined via an integral. To bound the gap between the integral and the discretized algorithm, Lemma 7 shows that when $\eta = O(\epsilon^2)$,

$$\mathbb{E}\left[\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\right\|^2\right] \le 4\epsilon^2 + \underbrace{\mathbb{E}\left[\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\right\|^2\right]}_{\epsilon_{\rm score}}. \tag{7}$$

The term $\epsilon_{\rm score}$ can be controlled by

$$\underbrace{\left\|v_k(x) - 2\,\mathbb{E}_{x_0'\sim\hat{q}_{T-k\eta}(\cdot|x)}\left[\frac{e^{-(T-k\eta)}x_0'-x}{1-e^{-2(T-k\eta)}}\right]\right\|}_{\text{Term 2.1}} + \underbrace{\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\left\|\mathbb{E}_{x_0'\sim\hat{q}_{T-k\eta}(\cdot|x)}\left[x_0'\right] - \mathbb{E}_{x_0\sim q_{T-k\eta}(\cdot|x)}\left[x_0\right]\right\|}_{\text{Term 2.2}},$$

where $\hat{q}$ is the distribution of the inner-loop ULA samples. Lemma 8 provides the concentration bound for Term 2.1. We also have $\left(\text{Term 2.2}\right)^2 \le 8\eta^{-2}C_{LSI,k}^{-1}\,D_{KL}\left(\hat{q}_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\right)$, where the KL divergence is controlled by the convergence of ULA (Lemma 9). Thus the desired convergence is obtained.

Note that Theorem 1 relies on all $q_{T-k\eta}$ being log-Sobolev, so we aim to estimate the log-Sobolev constant of every $q_{T-k\eta}$; Lemmas 2 and 3 provide two such estimates for different time regimes.

¹Please refer to Table 1 for the notation definitions.

When $p_*$ is strongly convex outside a ball of radius $R$, the forward process shrinks this radius and hence improves the log-Sobolev constant of $p_T$. Thus, we can choose a proper $T$ at which $p_T$ has a larger log-Sobolev constant than $p_0$ and the $q_t$ are strongly log-concave, so that the log-Sobolev constants are independent of $R$, which makes the algorithm mix fast. For more general distributions without a log-Sobolev constant, Lemma 10 provides log-Sobolev constants along the whole trajectory, from which the computation complexity can be obtained.
B NOTATIONS AND DISCUSSIONS

B.1 ALGORITHM

Algorithm 3 Initialization of $\hat{p}$ if $\hat{p} \ne p_\infty$
1: Input: initial particle $x_0$; OU process terminal time $T$; iterations $T_0$; step size $\eta_0$; sample size $n$.
2: for $k = 1$ to $T_0$ do
3: create $n$ Monte Carlo samples from $q_T(x|x_{k-1}) \propto \exp\left(-f_*(x) - \frac{\|x_{k-1} - e^{-T}x\|^2}{2\left(1-e^{-2T}\right)}\right)$;
4: compute the corresponding score estimator $v_k$;
5: $x_k = x_{k-1} + \eta_0 v_k + \sqrt{2\eta_0}\,\xi$, where $\xi\sim\mathcal{N}(0, I_d)$.
6: end for
7: Return: $x_{T_0}$.

B.2 NOTATIONS

According to Cattiaux et al. (2021), under mild conditions on the forward process $dx_t = b_t(x_t)\,dt + \sqrt{2}\,dB_t$, the reverse process also admits an SDE description. If we fix the terminal time $T > 0$ and set $\hat{x}_t = x_{T-t}$ for $t\in[0,T]$, the process $(\hat{x}_t)_{t\in[0,T]}$ satisfies the SDE $d\hat{x}_t = \hat{b}_t(\hat{x}_t)\,dt + \sqrt{2}\,dB_t$, where the reverse drift satisfies $b_t + \hat{b}_{T-t} = 2\nabla\ln p_t$ with $x_t\sim p_t$. In this condition, the reverse process of SDE (1) is

$$d\hat{x}_t = \left(\hat{x}_t + 2\nabla\ln p_{T-t}(\hat{x}_t)\right)dt + \sqrt{2}\,dB_t. \tag{8}$$

Thus, once the score function $\nabla\ln p_{T-t}$ is obtained, the reverse SDE induces a sampling algorithm. To obtain particles along SDE (8), the first step is to initialize the particle with a tractable starting distribution. In practice, it is usually hard to sample from the ideal initialization $p_T$ directly due to its unknown properties; instead, we sample from an approximating distribution $\hat{p}$. For large $T$, $p_\infty$ is chosen to approximate $p_T$, as their gap can be controlled. For the iterations, we use the numerical discretization of DDPM (Ho et al., 2020), widely used in the diffusion models literature. Different from ULA, DDPM divides SDE (8) into time segments and considers, for each segment, the SDE

$$d\bar{x}_t = \left(\bar{x}_t + 2\nabla\ln p_{T-k\eta}(\bar{x}_{k\eta})\right)dt + \sqrt{2}\,dB_t, \quad t\in[k\eta,(k+1)\eta], \quad \bar{x}_0\sim p_T \tag{9}$$

to discretize SDE (8). Suppose we obtain $v_k$ to approximate $2\nabla\ln p_{T-k\eta}$ at each iteration; we then obtain SDE (2) for the practical updates of particles.

Reiterate of Algorithm 1. Once the score function can be estimated, the sampling algorithm can be presented; the detailed algorithm is described in Algorithm 1. By following these steps, our proposed algorithm efficiently addresses the given problem and demonstrates its effectiveness in practical applications. Specifically, we summarize our algorithm as follows: (1) choose a proper $T$ and a proper $\hat{p}$ such that $p_T \approx \hat{p}$; this can be done either by setting $\hat{p} = p_\infty$ for large $T$ or by running Langevin Monte Carlo toward $p_T$; (2) sample from $\hat{p}$; (3) sample from a distribution that approximates $q_{T-t}$; (4) update $\tilde{p}_t$ with the approximate $q_{T-t}$ samples; this can be done by a Langevin Monte Carlo inner loop, as $\log q_{T-t}$ is explicit; (5) repeat (3) and (4) until $t \to T$. After (5), we can also run the Langevin algorithm to fine-tune the output when the gradient complexity limit is not reached. The main algorithm is Algorithm 1.

Here, we reiterate our notation in Table 1, list the constants used in our proofs in Table 2, and give the definition of the log-Sobolev inequality as follows.

Definition 1 (Logarithmic Sobolev inequality). A distribution with density function $p$ satisfies the log-Sobolev inequality with a constant $\mu > 0$ if, for all smooth functions $g: \mathbb{R}^d\to\mathbb{R}$ with $\mathbb{E}_p[g^2] < \infty$,

$$\mathbb{E}_p\left[g^2\ln g^2\right] - \mathbb{E}_p\left[g^2\right]\ln\mathbb{E}_p\left[g^2\right] \le \frac{2}{\mu}\,\mathbb{E}_p\left[\|\nabla g\|^2\right]. \tag{10}$$
Table 1: Notation List
$\varphi_{\sigma^2}$: the density of the centered Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$.
$p_*, p_0$: the target density function (initial distribution of the forward process).
$(x_t)_{t\in[0,T]}$: the forward process, i.e., SDE (1).
$p_t$: the density function of $x_t$, i.e., $x_t\sim p_t$.
$p_\infty$: the density function of the stationary distribution of the forward process.
$(\hat{x}_t)_{t\in[0,T]}$: the ideal reverse process, i.e., SDE (8).
$\hat{p}_t$: the density function of $\hat{x}_t$, i.e., $\hat{x}_t\sim\hat{p}_t$, with $p_t = \hat{p}_{T-t}$.
$\hat{P}^T$: the law of the ideal reverse process (8) over the path space $C\left([0,T];\mathbb{R}^d\right)$.
$(\bar{x}_t)_{t\in[0,T]}$: the reverse process following SDE (9).
$\bar{p}_t$: the density function of $\bar{x}_t$, i.e., $\bar{x}_t\sim\bar{p}_t$.
$(\tilde{x}_t)_{t\in[0,T]}$: the practical reverse process following SDE (2) with initial distribution $\hat{p}$.
$\tilde{p}_t$: the density function of $\tilde{x}_t$, i.e., $\tilde{x}_t\sim\tilde{p}_t$.
$\tilde{P}^T$: the law of the reverse process $(\tilde{x}_t)_{t\in[0,T]}$ over the path space $C\left([0,T];\mathbb{R}^d\right)$.
$(\tilde{x}^{p_T}_t)_{t\in[0,T]}$: the reverse process following SDE (2) with initial distribution $p_T$.
$\tilde{p}^{p_T}_t$: the density function of $\tilde{x}^{p_T}_t$, i.e., $\tilde{x}^{p_T}_t\sim\tilde{p}^{p_T}_t$.
$P^{p_T}_T$: the law of the reverse process $(\tilde{x}^{p_T}_t)_{t\in[0,T]}$ over the path space $C\left([0,T];\mathbb{R}^d\right)$.

Besides, there are many constants used in our proofs; we collect them here to prevent confusion.

Table 2: Constant List
$C_0 = D_{KL}\left(p_0\|p_\infty\right)$; $C_1 = 2^{-16}L^{-2}$; $C_2 = 48L\cdot6^{2n}\cdot2^{8n}C_0^{8n}$; $C_3 = 64\,C_0C_1^{-3}C_6$; $C_4 = 2^{-13}C_0^{-4}C_1^{8}C_6^{-2}$; $C_5 = 2^8\cdot3^4\cdot5^2\,L^2\,C_2C_4^{-1}C_6^2\ln\left(2^4L^2C_0^4C_4^{-1}C_6\right)$; $C_6 = 6^2\cdot4\,C_0^4$; $C_5' = 2^{22}\cdot3^4\cdot5\,L^2C_0^4C_1^{-8}\ln\left(2^8LC_0^8C_1^{-8}\right)$; $C_3' = 64L^{-1}C_0C_1^{-3}$.

B.3 EXAMPLES

Lemma 4 (Gaussian mixture example). The iteration and sample complexity of RDMC with the importance sampling estimator is $O(\epsilon^{-2})$.

Proof. Similar to Eq. (19) in Theorem 1, using the data-processing inequality, the triangle inequality, and Pinsker's inequality, we have

$$D_{TV}\left(\tilde{p}_T, p_*\right) \le \underbrace{\sqrt{\tfrac{1}{2}D_{KL}\left(\tilde{p}_0\,\|\,p_T\right)}}_{\text{Term 1}} + \underbrace{\sqrt{\tfrac{1}{2}D_{KL}\left(\hat{P}^T\,\|\,P^{p_T}_T\right)}}_{\text{Term 2}}. \tag{11}$$

To make Term 1 controllable, we choose $T = \frac{1}{2}\ln\frac{D_{KL}(p_*\|p_\infty)}{2\epsilon^2}$. For Term 2,

$$D_{KL}\left(\hat{P}^T\,\|\,P^{p_T}_T\right) = \frac{1}{4}\,\mathbb{E}_{\hat{P}^T}\left[\sum_k\int_{k\eta}^{(k+1)\eta}\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\right\|^2dt\right].$$

By Lemma 7,

$$\mathbb{E}_{\hat{P}^T}\left[\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\right\|^2\right] \le 4\epsilon^2 + \underbrace{\mathbb{E}\left[\left\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\right\|^2\right]}_{\epsilon_{\rm score}}. \tag{12}$$

Thus, the iteration complexity of the RDMC algorithm is $O(\epsilon^{-2})$ when the score estimator has $O(\epsilon)$ $L_2$ error, which is controlled by concentration inequalities. For time $t$, since $D_{KL}\left(\tilde{p}_{T-t}\|p_{T-t}\right) \le \epsilon$ and $p_{T-t}$ is sub-Gaussian, the density $\tilde{p}_{T-t}$ is also sub-Gaussian with variance proxy $\sigma_t'^2$, and we have

$$\mathbb{P}\left(|x - \mathbb{E}x| > 2\sigma_t'\sqrt{\ln(3/\delta)}\right) \le \delta.$$

For the Gaussian mixture model, $\exp(-f_*(x_0))$ is 2-Lipschitz and $f_*$ is 1-Lipschitz smooth, so $\frac{x-e^{-(T-t)}x_0}{1-e^{-2(T-t)}}\exp(-f_*(x_0))$ is $G_{x,t}$-Lipschitz in $x_0$. For time $t$, the proposal variance is $\sigma_t^2 = \frac{1-e^{-2(T-t)}}{e^{-2(T-t)}}$. Let $\mu_X$ and $\bar{X}$ denote the expectation and the empirical estimator of $\frac{x-e^{-(T-t)}x_0}{1-e^{-2(T-t)}}\exp(-f_*(x_0))$, and $\mu_Y$ and $\bar{Y}$ those of $\exp(-f_*(x_0))$. Then

$$\mathbb{P}\left(\left|\bar{X}-\mu_X\right| > \epsilon\right) \le \exp\left(-\frac{n\epsilon^2}{2\sigma_t^2 G_{x,t}^2}\right), \qquad \mathbb{P}\left(\left|\bar{Y}-\mu_Y\right| > \epsilon\right) \le \exp\left(-\frac{n\epsilon^2}{8\sigma_t^2}\right).$$

We choose $x_+ = \mathbb{E}x + 2\sigma_t'\sqrt{\ln(3/\delta)}$ and $G = G_{x_+,t}$. If $n = \max\left(G^2, 4\right)\sigma_t^2\epsilon^{-2}\ln\left(3\delta^{-1}\right)$, then with probability $1-\delta$ the error of the estimator is at most $\epsilon$.

Moreover, if we use the inner loop to estimate the posterior, we can start from some $T > 0$ and estimate $p_T$ initially. Specifically, for the Gaussian mixture we can get a tighter log-Sobolev bound: for time $t$, we have $p_t \propto e^{-\frac{1}{2}\|x\|^2} + e^{-\frac{1}{2}\|x-e^{-t}y\|^2}$, which implies the log-Sobolev constant satisfies $C_{LSI,y,t}^{-1} \le 1 + \frac{1}{4}\left(e^{\|e^{-t}y\|^2}+1\right)$. Considering the smoothness, in the 1-d case with $\tilde{y} = e^{-t}y$ we have

$$\frac{d^2}{dx^2}\log p_t(x) = -1 + \tilde{y}^2\,\frac{e^{-\frac{1}{2}x^2}\,e^{-\frac{1}{2}(x-\tilde{y})^2}}{\left(e^{-\frac{1}{2}x^2} + e^{-\frac{1}{2}(x-\tilde{y})^2}\right)^2}.$$

When we choose $T = \frac{1}{2}\ln\frac{2L+1}{2L} = \frac{1}{2}\ln\frac{3}{2}$ (taking $L = 1$), the estimation of $p_T$ needs $O\left(e^{\frac{2}{3}y^2}\right)$ iterations, which improves on the original $O\left(e^{y^2}\right)$.

Lemma 5. Suppose the negative log-density of the target distribution, $f_* = -\ln p_*$, satisfies $f_*(x) = \left(\|x\|^2+1\right)^a$ with $a\in[0, 0.5]$. Then $f_*$ satisfies [A4] with $c_R = 1$ and $n = \frac{1}{2-2a}$.
Proof. Consider the Hessian of $f_*$ in the 1-d case:

$$f_*''(x) = 2a\left(x^2+1\right)^{a-2}\left((2a-1)x^2 + 1\right).$$

If we require $|x| \ge r^{-1/(2-2a)}$, then

$$|x| \ge r^{-1/(2-2a)} \implies |x|^{2-2a} \ge r^{-1} \implies \left(x^2+1\right)^{a-1} \le |x|^{2a-2} \le r,$$

and since $2a(1-2a) \le \frac{1}{4}$,

$$f_*''(x) \ge -2a(1-2a)\,x^2\left(x^2+1\right)^{a-2} \ge -2a(1-2a)\left(x^2+1\right)^{a-1} \ge -r,$$

so that $\left(f_*(x) + r|x|^2\right)'' = f_*''(x) + 2r \ge 0$. It means we can choose $c_R = 1$ and $n = 1/(2-2a)$.

B.4 MORE DISCUSSION OF PREVIOUS WORKS

Recent studies have underscored the potential of diffusion models, exploring their integration across various domains, including approximate Bayesian computation. One line of research (Vargas et al., 2023a; Berner et al., 2022; Zhang et al., 2023; Vargas et al., 2023c;b) applies reverse diffusion in the VI framework to create posterior samplers parameterized by neural networks. These studies have examined the conditional expectation form of the score function, similar to our Lemma 1. Such score-based VI algorithms have been shown to offer improvements over traditional VI. However, upon adopting a neural network estimator, VI-based algorithms are subject to an inherent, unknown error. Other research (Tzen & Raginsky, 2019; Chen et al., 2022a) has delved into the characteristics of parameterized diffusion-like processes under assumed error conditions. Yet the comparative advantages of diffusion models against MCMC methods, and the computational complexity of learning the score function, are not well investigated; this gap hinders a straightforward comparison with Langevin-based dynamics.

Another related work is the Schrödinger-Föllmer Sampler (SFS) (Huang et al., 2021), which also estimates the drift term with non-parametric estimators. The main difference is that SFS leverages the Schrödinger-Föllmer process, in which the target distribution $p_*$ is transformed into a Dirac delta distribution. This transformation often makes the gradient $\nabla\log p_t$ problematic when $p_t$ closely resembles a delta distribution, posing challenges for maintaining the Lipschitz continuity of the drift term. Huang et al. (2021) and Tzen & Raginsky (2019) note that the required assumption holds when both $p_*\exp\left(\|x\|^2/2\right)$ and its gradient are Lipschitz continuous, and the former is bounded below by a positive value. However, this condition may fail when the variance of $p_*$ exceeds 1, limiting the general applicability of the Schrödinger-Föllmer process; the comparison of SFS with traditional MCMC methods under general conditions remains an open question. In contrast, since the $p_\infty$ of the OU process is a smooth distribution (a standard Gaussian), the requirement of Lipschitz continuity for $p_t$ is much weaker, as diffusion analyses suggest (Chen et al., 2022a; 2023); additionally, Lee et al. (2021a) showed that $L$-smoothness of a log-concave $p_*$ implies smoothness of $p_t$. Moreover, the SFS algorithm relies on the importance sampling estimator, and its error analysis is mainly based on the Gaussian mixture model. As mentioned in Section 3.3, the importance sampling estimator suffers from the curse of dimensionality in practice. Our ULA-based analysis can be adapted to more general distributions, covering both ill-conditioned-LSI and non-LSI targets. In a word, our proposed RDMC provides complexity bounds for general distributions, whereas SFS is more task-specific. It remains open to further investigate the complexity of SFS-like algorithms, which would be interesting future work; it is also possible to adapt the ULA estimator idea to SFS. Our approach stands as an asymptotically exact sampling algorithm, akin to traditional MCMC methods.
It allows us to determine an overall convergence rate that diminishes with increasing computational complexity, enabling a direct comparison of complexity with MCMC approaches. The main technique of our analysis is to quantify the complexity of score estimation with a non-parametric algorithm, and we found merits of the proposed method compared with MCMC. Our theory can also support diffusion-based VI against Langevin-based VI.

C MAIN PROOFS

C.1 PROOF OF LEMMA 1

Proof. When the OU process, i.e., Eq. (1), is selected as our forward path, the transition kernel of $(x_t)_{t\ge0}$ has a closed form:
$$p(x, t \mid x_0, 0) = \big(2\pi(1-e^{-2t})\big)^{-d/2}\exp\left[-\frac{\|x - e^{-t}x_0\|^2}{2(1-e^{-2t})}\right].$$
In this condition, we have
$$p_{T-t}(x) = \int_{\mathbb{R}^d} p_0(x_0)\,p_{T-t|0}(x\mid x_0)\,dx_0 = \int_{\mathbb{R}^d} p_0(x_0)\,\big(2\pi(1-e^{-2(T-t)})\big)^{-d/2}\exp\left[-\frac{\|x - e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\right]dx_0.$$
Plugging this formulation into $\nabla_x\ln p_{T-t}(x) = \nabla_x p_{T-t}(x)/p_{T-t}(x)$ and differentiating under the integral,
$$\nabla_x\ln p_{T-t}(x) = \frac{\int_{\mathbb{R}^d} p_0(x_0)\exp\Big[-\frac{\|x-e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\Big]\cdot\Big(-\frac{x - e^{-(T-t)}x_0}{1-e^{-2(T-t)}}\Big)dx_0}{\int_{\mathbb{R}^d} p_0(x_0)\exp\Big[-\frac{\|x-e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\Big]dx_0} = \mathbb{E}_{x_0\sim q_{T-t}(\cdot\mid x)}\left[-\frac{x - e^{-(T-t)}x_0}{1 - e^{-2(T-t)}}\right],$$
where the density function $q_{T-t}(\cdot\mid x)$ is defined as
$$q_{T-t}(x_0\mid x) = \frac{p_0(x_0)\exp\Big[-\frac{\|x - e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\Big]}{\int_{\mathbb{R}^d} p_0(x_0)\exp\Big[-\frac{\|x - e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\Big]dx_0} \;\propto\; p_0(x_0)\exp\left[-\frac{\|x - e^{-(T-t)}x_0\|^2}{2(1-e^{-2(T-t)})}\right].$$
Hence, the proof is completed.
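The identity just proved can be checked numerically on a target whose diffused score is known in closed form. The snippet below, which reuses the `is_score_estimate` sketch from Appendix B.3 and whose tolerance is an arbitrary assumption, verifies it for $p_0 = N(\mu, I)$: under the OU process, $p_s = N(e^{-s}\mu, I)$, so $\nabla\ln p_s(x) = -(x - e^{-s}\mu)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])
neg_log_p = lambda z: 0.5 * np.sum((z - mu) ** 2)   # f_* for N(mu, I)

x, s = np.array([0.5, 0.2]), 0.7
est = is_score_estimate(x, s, neg_log_p, n=50000, rng=rng)
exact = -(x - np.exp(-s) * mu)                      # closed-form diffused score
assert np.allclose(est, exact, atol=0.1)            # agreement up to Monte Carlo error
```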
C.2 PROOF OF LEMMA 2 AND 3

Lemma 6 (Proposition 1 in Ma et al. (2019)). For $p \propto e^{-U}$, where $U$ is $m$-strongly convex outside a region of radius $R$ and $L$-Lipschitz smooth, the log-Sobolev constant of $p$ satisfies $C_{LSI} \ge \frac{m}{2}e^{-16LR^2}$.

Proof. By Lemma 6, for any $t = T - k\eta$, the LSI constant of $q_{T-k\eta}$ satisfies
$$C_{LSI,k} \ge \frac{e^{-2(T-k\eta)}}{6\big(1-e^{-2(T-k\eta)}\big)}\exp\left(-48L\,R^2\!\left(\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\right)\right).$$
When $\frac{2L}{1+2L} \le e^{-2(T-k\eta)} \le 1$, we have
$$-L + \frac{e^{-2(T-k\eta)}}{2\big(1-e^{-2(T-k\eta)}\big)} \ge 0,$$
which implies
$$\Sigma_{k,\max}\,I := \frac{3e^{-2(T-k\eta)}}{2\big(1-e^{-2(T-k\eta)}\big)}\,I \;\succeq\; \nabla^2 g_{T-k\eta}(x) \;\succeq\; \frac{e^{-2(T-k\eta)}}{2\big(1-e^{-2(T-k\eta)}\big)}\,I =: \Sigma_{k,\min}\,I. \tag{14}$$
Since $g_{T-k\eta}$ is $\Sigma_{k,\min}$-strongly convex and $\Sigma_{k,\max}$-smooth, Lemma 20 yields $C_{LSI,k} \ge \Sigma_{k,\min}$.

C.3 PROOF OF MAIN THEOREM

Proof. We have
$$D_{TV}(\tilde p_T, p_*) \le D_{TV}\big(\tilde P_T, \hat P_T\big) \le D_{TV}\big(\tilde P_T, P^{p_T}_T\big) + D_{TV}\big(P^{p_T}_T, \hat P_T\big) \le D_{TV}(\tilde p_0, p_T) + D_{TV}\big(P^{p_T}_T, \hat P_T\big) \le \underbrace{\sqrt{\tfrac12 D_{KL}(\tilde p_0\|p_T)}}_{\text{Term 1}} + \underbrace{\sqrt{\tfrac12 D_{KL}\big(\hat P_T\|P^{p_T}_T\big)}}_{\text{Term 2}}, \tag{15}$$
where the first and third inequalities follow from the data-processing inequality, the second from the triangle inequality, and the last from Pinsker's inequality. For Term 2, Girsanov's theorem (Corollary 2) gives
$$D_{KL}\big(\hat P_T\|P^{p_T}_T\big) = \frac14\,\mathbb{E}_{\hat P_T}\left[\sum_k\int_{k\eta}^{(k+1)\eta}\big\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\big\|^2\,dt\right].$$
By Lemma 7,
$$\frac14\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\|^2\big] \le 4\epsilon^2 + \frac12\,\underbrace{\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big]}_{\epsilon_{score}}. \tag{16}$$
According to Eq. (4), for each term of the summation we have
$$\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] = \mathbb{E}\left[\Big\|v_k(x_{k\eta}) - 2\,\mathbb{E}_{x_0\sim q_{T-k\eta}(\cdot|x_{k\eta})}\Big[-\frac{x_{k\eta} - e^{-(T-k\eta)}x_0}{1-e^{-2(T-k\eta)}}\Big]\Big\|^2\right].$$
For each $x_{k\eta} = x$, we have
$$\Big\|v_k(x) - 2\,\mathbb{E}_{x_0\sim q_{T-k\eta}(\cdot|x)}\Big[-\frac{x - e^{-(T-k\eta)}x_0}{1-e^{-2(T-k\eta)}}\Big]\Big\|^2 \le 2\underbrace{\Big\|v_k(x) - 2\,\mathbb{E}_{x_0'\sim\tilde q_{T-k\eta}(\cdot|x)}\Big[-\frac{x - e^{-(T-k\eta)}x_0'}{1-e^{-2(T-k\eta)}}\Big]\Big\|^2}_{\text{Term 2.1}} + 2\underbrace{\Big\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\Big(\mathbb{E}_{x_0'\sim\tilde q_{T-k\eta}(\cdot|x)}[x_0'] - \mathbb{E}_{x_0\sim q_{T-k\eta}(\cdot|x)}[x_0]\Big)\Big\|^2}_{\text{Term 2.2}}, \tag{17}$$
where $\tilde q_{T-k\eta}(\cdot|x)$ denotes the underlying distribution of the output particles of the auxiliary sampling task. For Term 2.2, we denote an optimal coupling between $q_{T-k\eta}(\cdot|x)$ and $\tilde q_{T-k\eta}(\cdot|x)$ by $\gamma\in\Gamma_{opt}\big(q_{T-k\eta}(\cdot|x), \tilde q_{T-k\eta}(\cdot|x)\big)$. Hence, we have
$$\Big\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\Big(\mathbb{E}_{x_0'\sim\tilde q}[x_0'] - \mathbb{E}_{x_0\sim q}[x_0]\Big)\Big\|^2 = \Big\|\mathbb{E}_{(x_0,x_0')\sim\gamma}\Big[\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}(x_0 - x_0')\Big]\Big\|^2 \le \frac{4e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2}\,\mathbb{E}_\gamma\big[\|x_0 - x_0'\|^2\big] = \frac{4e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2}\,W_2^2\big(q_{T-k\eta}(\cdot|x), \tilde q_{T-k\eta}(\cdot|x)\big) \le \frac{4e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2}\cdot\frac{2}{C_{LSI,k}}\,D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le 8\eta^{-2}\,C^{-1}_{LSI,k}\,D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big), \tag{18}$$
where the first inequality follows from Jensen's inequality, the second from the Talagrand inequality, and the last from
$$1 - e^{-2(T-k\eta)} \ge 1 - e^{-2\eta} \ge \eta \;\Longrightarrow\; \frac{e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2} \le \eta^{-2}, \qquad \text{when } \eta \le 1/2.$$
By Lemmas 8 and 9, the desired convergence is obtained.

C.4 PROOF OF THE MAIN PROPOSITIONS

Proof. By Eq. (19), we have the upper bound
$$D_{TV}(\tilde p_T, p_*) \le \underbrace{\sqrt{\tfrac12 D_{KL}(\tilde p_0\|p_T)}}_{\text{Term 1}} + \underbrace{\sqrt{\tfrac12 D_{KL}\big(\hat P_T\|P^{p_T}_T\big)}}_{\text{Term 2}},$$
and we aim to upper bound these two terms.

Errors from the forward process. For Term 1 of Eq. (19), we can either control $D_{KL}(\hat p\|p_T)$ directly or choose a large $T$ with $\hat p = p_\infty$. If we choose $\hat p = p_\infty$, we have
$$D_{KL}(\tilde p_0\,\|\,p_T) = D_{KL}(p_T\,\|\,p_\infty) \le C_0\,e^{-T/2},$$
where the inequality follows from Lemma 21. By requiring $C_0e^{-T/2} \le 2\epsilon^2$, i.e.,
$$T \ge 2\ln\frac{C_0}{2\epsilon^2},$$
we have Term 1 $\le \epsilon$. To simplify the proof, we choose $T$ as its lower bound. If we instead choose a general $\hat p$, the iteration complexity depends on the log-Sobolev constant of $p_T$. In Ma et al. (2019), it is demonstrated that any distribution satisfying Assumptions [A1] and [A3] has a log-Sobolev constant of $\frac{m}{2}\exp(-16LR^2)$, which scales exponentially with the radius $R$. Considering $p_T$, we have
$$X_T = e^{-T}X_0 + \sqrt{1 - e^{-2T}}\,\xi, \qquad \xi\sim N(0, I).$$
Let $h$ be the density of $e^{-T}X_0$; then $\log h(e^{-T}x) + \log|e^{-T}I| = \log p_0(x)$, i.e., $\log h(e^{-T}x) = \log p_0(x) + dT$. Setting $y = e^{-T}x$, Assumption [A3] gives
$$-\nabla^2\log h(y) = -e^{2T}\,\nabla^2\log p_0\big(e^{T}y\big) \succeq e^{2T}m\,I,$$
so outside a ball of radius $e^{-T}R$ the negative log-density of $h$ is $e^{2T}m$-strongly convex. By Lemma 16, the final log-Sobolev constant is
$$\left(\frac{2}{m}\exp\big(16LR^2e^{-T} - 2T\big) + \big(1 - e^{-2T}\big)\right)^{-1} = O\big(m\exp(-16LR^2e^{-T} + 2T)\big).$$

Errors from the backward process. Without loss of generality, we consider the Assumption [A4] case, where $t$ is split into two intervals; the Assumption [A3] case can be recognized as the first interval of the [A4] case. For Term 2, we first give the proof assuming Novikov's condition holds, for simplicity; a rigorous analysis without Novikov's condition follows from the techniques in Chen et al. (2022a). Considering Corollary 2, we have
$$D_{KL}\big(\hat P_T\|P^{p_T}_T\big) = \frac14\,\mathbb{E}_{\hat P_T}\left[\sum_{k=0}^{N-1}\int_{k\eta}^{(k+1)\eta}\big\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\big\|^2\,dt\right], \tag{20}$$
where $N = \lceil T/\eta\rceil$. We also have
$$\frac14\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\|^2\big] \le 4\epsilon^2 + \frac12\,\underbrace{\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big]}_{\epsilon_{score}} \tag{21}$$
following Lemma 7, by choosing the step size of the backward path to satisfy
$$\eta \le C_1\big(d + m_2^2\big)^{-1}\epsilon^2. \tag{22}$$
To simplify the proof, we choose $\eta$ as its upper bound. Plugging Eq. (32) into Eq. (20), we have
$$D_{KL}\big(\hat P_T\|P^{p_T}_T\big) \le 8\epsilon^2\ln\frac{C_0}{2\epsilon^2} + \frac12\sum_{k=0}^{N-1}\eta\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big].$$
Besides, due to Lemma 10,
$$\sum_{k=0}^{N-1}\eta\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] \le 20\epsilon^2\ln\frac{C_0}{2\epsilon^2}$$
with probability at least $1-\epsilon$ (taking $\delta = \epsilon$), by requiring an
$$O\left(\max\big(C_3C_5, C_3'C_5'\big)\,C_1^{-1}C_0\,\big(d + m_2^2\big)^{18}\,\epsilon^{-16n-88}\exp\big(5C_2\,\epsilon^{-16n}\big)\right)$$
gradient complexity. Hence,
$$\text{Term 2} \le \sqrt{\frac12\big(8\epsilon^2 + 10\epsilon^2\big)\ln\frac{C_0}{2\epsilon^2}} = 3\epsilon\sqrt{\ln\frac{C_0}{2\epsilon^2}},$$
which implies
$$D_{TV}(\tilde p_T, p_*) \le \epsilon + 3\epsilon\sqrt{\ln\big(C_0/2\epsilon^2\big)} = \tilde O(\epsilon). \tag{23}$$
Hence, the proof is completed.

D IMPORTANT LEMMAS

Lemma 7 (Errors from the discretization). With Algorithm 1 and the notation of Table 1, if we choose the outer-loop step size $\eta \le C_1(d+m_2^2)^{-1}\epsilon^2$, then for $t\in[k\eta,(k+1)\eta]$ we have
$$\mathbb{E}_{\hat P_T}\left[\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x_{k\eta})\Big\|^2\right] + L^2\,\mathbb{E}_{\hat P_T}\big[\|x_{k\eta} - x_t\|^2\big] \le \epsilon^2,$$
and consequently
$$\frac14\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\|^2\big] \le 4\epsilon^2 + \frac12\,\underbrace{\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big]}_{\epsilon_{score}}.$$

Proof. According to the choice of $t$, we have $T - k\eta \ge T - t$. With the transition kernel of the forward process,
$$p_{T-k\eta}(x) = \int p_{T-t}(y)\,P\big(x, T-k\eta \mid y, T-t\big)\,dy = \int p_{T-t}(y)\,\big[2\pi(1-e^{-2(t-k\eta)})\big]^{-d/2}\exp\left[-\frac{\|x - e^{-(t-k\eta)}y\|^2}{2(1-e^{-2(t-k\eta)})}\right]dy = \big(p'_{T-t} * \varphi_{1-e^{-2(t-k\eta)}}\big)(x), \tag{24}$$
where the last equality sets $z = e^{-(t-k\eta)}y$ and $p'_{T-t}(z) := e^{(t-k\eta)d}\,p_{T-t}\big(e^{(t-k\eta)}z\big)$; note that $p'_{T-t}$ is also a density function. For each $x_{k\eta} = x$, the triangle inequality and Eq. (24) give
$$\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x)\Big\| \le \big\|\nabla\ln p_{T-t}(x) - \nabla\ln p'_{T-t}(x)\big\| + \big\|\nabla\ln p'_{T-t}(x) - \nabla\ln\big(p'_{T-t}*\varphi_{1-e^{-2(t-k\eta)}}\big)(x)\big\|.$$
For the first term, using $\nabla\ln p'_{T-t}(x) = e^{(t-k\eta)}\nabla\ln p_{T-t}(e^{(t-k\eta)}x)$,
$$\big\|\nabla\ln p_{T-t}(x) - \nabla\ln p'_{T-t}(x)\big\| \le \big(e^{(t-k\eta)}-1\big)\big\|\nabla\ln p_{T-t}(x)\big\| + e^{(t-k\eta)}\big(e^{(t-k\eta)}-1\big)L\|x\|. \tag{25}$$
For the second term, the score $\nabla\ln p'_{T-t}$ is $e^{2(t-k\eta)}L$-smooth; hence, with Lemma 13 and the requirement $2e^{2(t-k\eta)}L\big(1-e^{-2(t-k\eta)}\big)\le1$,
$$\big\|\nabla\ln p'_{T-t}(x) - \nabla\ln\big(p'_{T-t}*\varphi\big)(x)\big\| \le 6e^{2(t-k\eta)}L\sqrt{1-e^{-2(t-k\eta)}}\,d^{1/2} + 2Le^{(t-k\eta)}\big(e^{2(t-k\eta)}-1\big)\Big(L\big(e^{(t-k\eta)}-1\big)\|x\| + \big\|\nabla\ln p_{T-t}(x)\big\|\Big). \tag{26}$$
Due to the range $\eta\le1/2$, we have $e^{2(t-k\eta)}\le e^{2\eta}\le1+4\eta$, $1-e^{-2(t-k\eta)}\le2(t-k\eta)\le2\eta$, and $e^{(t-k\eta)}\le e^\eta\le1+2\eta$. Thus Eqs. (25) and (26) can be reformulated as
$$\big\|\nabla\ln p_{T-t}(x) - \nabla\ln p'_{T-t}(x)\big\|^2 \le 8\eta^2\big\|\nabla\ln p_{T-t}(x)\big\|^2 + 32\eta^2L^2\|x\|^2 \tag{27}$$
and, since the right-hand side of Eq. (26) is at most $18L\sqrt{2\eta d} + 32L^2\eta^2\|x\| + 16L\eta\|\nabla\ln p_{T-t}(x)\|$,
$$\big\|\nabla\ln p'_{T-t}(x) - \nabla\ln\big(p'_{T-t}*\varphi\big)(x)\big\|^2 \le 3\big(2^{11}L^2\eta d + 2^{10}L^4\eta^4\|x\|^2 + 2^{8}L^2\eta^2\|\nabla\ln p_{T-t}(x)\|^2\big) \le 2^{13}L^2\eta d + 2^{12}L^4\eta^4\|x\|^2 + 2^{10}L^2\eta^2\big\|\nabla\ln p_{T-t}(x)\big\|^2. \tag{28}$$
Without loss of generality suppose $L\ge1$; combining Eqs. (27) and (28),
$$\mathbb{E}_{\hat P_T}\left[\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x_{k\eta})\Big\|^2\right] \le 2^{14}L^2\eta d + 2^{8}L^2\eta^2\,\mathbb{E}_{\hat P_T}\big[\|x_{k\eta}\|^2\big] + 2^{13}L^2\eta^2\,\mathbb{E}_{\hat P_T}\big[\|\nabla\ln p_{T-t}(x_t)\|^2\big] + 2^{13}L^4\eta^2\,\mathbb{E}_{\hat P_T}\big[\|x_{k\eta}-x_t\|^2\big]. \tag{29}$$
Adding the term $L^2\,\mathbb{E}_{\hat P_T}[\|x_{k\eta}-x_t\|^2]$ and invoking the moment and score bounds of Lemmas 14 and 15,
$$\mathbb{E}_{\hat P_T}\left[\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x_{k\eta})\Big\|^2\right] + L^2\,\mathbb{E}_{\hat P_T}\big[\|x_{k\eta}-x_t\|^2\big] \le 2^{16}L^2\eta d + 2^{10}L^2\eta^2\big(d+m_2^2\big) + 2^{15}L^3\eta^2 d + 2^{8}L^2\big(2(m_2^2+d)\eta^2 + 4d\eta\big).$$
To diminish the discretization error, it suffices to make each term at most $\epsilon^2/4$, which holds under
$$\eta \le 2^{-16}L^{-2}\big(d+m_2^2\big)^{-1}\epsilon^2 = C_1\big(d+m_2^2\big)^{-1}\epsilon^2,$$
in which case
$$\mathbb{E}_{\hat P_T}\left[\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x_{k\eta})\Big\|^2\right] + L^2\,\mathbb{E}_{\hat P_T}\big[\|x_{k\eta}-x_t\|^2\big] \le \epsilon^2. \tag{31}$$
Hence, the proof is completed. Thus, for $t\in[k\eta,(k+1)\eta]$,
$$\frac14\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\|^2\big] \le 2\,\mathbb{E}\big[\|\nabla\ln p_{T-k\eta}(x_{k\eta}) - \nabla\ln p_{T-t}(x_t)\|^2\big] + \frac12\,\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] \le 4\,\mathbb{E}\left[\Big\|\nabla\ln\frac{p_{T-t}}{p_{T-k\eta}}(x_{k\eta})\Big\|^2\right] + 4L^2\,\mathbb{E}\big[\|x_{k\eta}-x_t\|^2\big] + \frac12\,\epsilon_{score} \le 4\epsilon^2 + \frac12\,\underbrace{\mathbb{E}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big]}_{\epsilon_{score}}, \tag{32}$$
where the second inequality follows from Assumption [A1].
Lemma 8. For each inner loop, denote by $q_z(\cdot|x)$ and $q(\cdot|x)$ the underlying distribution of the output particles and the target distribution, respectively, where $q$ satisfies LSI with constant $\mu$. When the outer-loop step size is $\eta$, by requiring
$$n \ge 64Td\mu^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1} \qquad\text{and}\qquad D_{KL}(q_z\|q) \le 2^{-13}T^{-4}d^{-2}\mu^2\eta^8\epsilon^4\delta^4,$$
we have
$$\mathbb{P}_{\{x_0^{(i)}\}_{i=1}^n\sim q_z^{(n)}(\cdot|x)}\left(\Big\|\frac1n\sum_{i=1}^n v_i(x) - \mathbb{E}[v_1(x)]\Big\| \le \epsilon\right) \ge 1 - \frac{\delta}{2\lceil T/\eta\rceil} - \frac{\delta}{2\lceil T/\eta\rceil},$$
where
$$v_i(x) := -\frac{2\big(x - e^{-(T-k\eta)}x_0^{(i)}\big)}{1 - e^{-2(T-k\eta)}}, \qquad i\in\{1,\ldots,n\}.$$

Proof. For each inner loop, we abbreviate the target distribution as $q$ and the initial distribution as $q_0$. The iteration of the inner loop is
$$x_{z+1} = x_z + \tau\,\nabla\ln q(x_z|x) + \sqrt{2\tau}\,\xi_z, \qquad \xi_z\sim N(0,I),$$
and we write $q_z(\cdot|x)$ for the law of the $z$-th iterate. Writing $\bar v(x) = \frac1n\sum_{i=1}^n v_i(x)$, note that
$$\mathbb{P}_{q_z^{(n)}(\cdot|x)}\Big(\big\|\bar v(x) - \mathbb{E}[v_1(x)]\big\| \ge \epsilon\Big) = \mathbb{P}_{q_z^{(n)}(\cdot|x)}\left(\Big\|\sum_{i=1}^n x_0^{(i)} - n\,\mathbb{E}[x_0]\Big\| \ge \frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}\,n\epsilon\right). \tag{33}$$
To simplify the notation, set $b_z := \mathbb{E}_{q_z(\cdot|x)}[x_0]$, $v_z := \mathbb{E}_{q_z^{(n)}(\cdot|x)}\big[\|\sum_i x_0^{(i)} - nb_z\|\big]$, and analogously $b := \mathbb{E}_{q(\cdot|x)}[x_0]$, $v := \mathbb{E}_{q^{(n)}(\cdot|x)}\big[\|\sum_i x_0^{(i)} - nb\|\big]$. Since $D_{TV}\big(q^{(n)}(\cdot|x), q_z^{(n)}(\cdot|x)\big) \le n\,D_{TV}\big(q(\cdot|x), q_z(\cdot|x)\big)$, we have
$$\mathbb{P}_{q_z^{(n)}(\cdot|x)}\left(\Big\|\sum_i x_0^{(i)} - nb_z\Big\| \ge v_z + \frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}n\epsilon\right) \le \mathbb{P}_{q^{(n)}(\cdot|x)}\left(\Big\|\sum_i x_0^{(i)} - nb_z\Big\| \ge v_z + \frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}n\epsilon\right) + n\,D_{TV}\big(q(\cdot|x), q_z(\cdot|x)\big). \tag{34}$$
For the first term, the triangle inequality gives
$$\Big\|\sum_i x_0^{(i)} - nb_z\Big\| \le \Big\|\sum_i x_0^{(i)} - nb\Big\| + n\|b - b_z\|, \tag{35}$$
and for any coupling $\gamma$ of $q(\cdot|x)$ and $q_z(\cdot|x)$, Cauchy-Schwarz yields
$$\|b - b_z\| = \Big\|\int (x_0 - x_z)\,d\gamma(x_0,x_z)\Big\| \le \left(\int\|x_0 - x_z\|^2\,d\gamma(x_0,x_z)\right)^{1/2} = W_2\big(q(\cdot|x), q_z(\cdot|x)\big), \tag{36}$$
while Lemma 23 implies $v \le \sqrt{n\,\mathrm{var}(x_0^{(1)})} \le \sqrt{nd/\mu}$. By requiring
$$W_2\big(q(\cdot|x), q_z(\cdot|x)\big) \le \frac{1-e^{-2(T-k\eta)}}{8e^{-(T-k\eta)}}\,\epsilon \qquad\text{and}\qquad n \ge \frac{64\,e^{-2(T-k\eta)}\,d}{\big(1-e^{-2(T-k\eta)}\big)^2\,\mu\,\epsilon^2} \tag{37}$$
in Eq. (35), we obtain
$$\mathbb{P}_{q^{(n)}(\cdot|x)}\left(\Big\|\sum_i x_0^{(i)} - nb_z\Big\| \ge v_z + \frac{1-e^{-2(T-k\eta)}}{2e^{-(T-k\eta)}}n\epsilon\right) \le \mathbb{P}_{q^{(n)}(\cdot|x)}\left(\Big\|\sum_i x_0^{(i)} - nb\Big\| \ge v + \frac{1-e^{-2(T-k\eta)}}{4e^{-(T-k\eta)}}n\epsilon\right). \tag{38}$$
According to Lemma 16, the LSI constant of the $n$-fold product $q\otimes\cdots\otimes q$ is $\mu/n$. Moreover, the function $F(x) = \|x\|:\mathbb{R}^{nd}\to\mathbb{R}$ is 1-Lipschitz, since $\|F\|_{Lip} = \sup_{x\ne y}\frac{|F(x)-F(y)|}{\|x-y\|} \le 1$; hence Lemma 18 yields
$$\mathbb{P}_{q^{(n)}(\cdot|x)}\left(\Big\|\sum_i x_0^{(i)} - nb\Big\| \ge v + \frac{1-e^{-2(T-k\eta)}}{4e^{-(T-k\eta)}}n\epsilon\right) \le \exp\left(-\frac{\big(1-e^{-2(T-k\eta)}\big)^2\,n\,\epsilon^2\,\mu}{32\,e^{-2(T-k\eta)}}\right). \tag{39}$$
Plugging Eqs. (39) and (38) into Eqs. (34) and (33), we have
$$\mathbb{P}_{q_z^{(n)}(\cdot|x)}\Big(\big\|\bar v(x) - \mathbb{E}[v_1(x)]\big\| \ge \epsilon\Big) \le \exp\left(-\frac{\big(1-e^{-2(T-k\eta)}\big)^2 n\epsilon^2\mu}{32e^{-2(T-k\eta)}}\right) + n\,D_{TV}\big(q(\cdot|x), q_z(\cdot|x)\big). \tag{40}$$
Besides, we have
$$\mathbb{E}_{q_z^{(n)}(\cdot|x)}\Big[\big\|\bar v(x) - \mathbb{E}[v_1(x)]\big\|\Big] \le \frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\frac{v_z}{n}.$$
Suppose the optimal coupling of $q_z(\cdot|x)$ and $q(\cdot|x)$ is $\gamma_z\in\Gamma_{opt}\big(q_z(\cdot|x), q(\cdot|x)\big)$. Then
$$\mathrm{var}_{q_z(\cdot|x)}(x_0) = \int q_z(x_z|x)\,\|x_z - b_z\|^2\,dx_z = \int\|x_z - x_0 + x_0 - b + b - b_z\|^2\,d\gamma_z \le 3\int\|x_z - x_0\|^2\,d\gamma_z + 3\int\|x_0 - b\|^2\,d\gamma_z + 3\|b - b_z\|^2 \le 6\,W_2^2\big(q, q_z\big) + 3d\mu^{-1},$$
where the last inequality follows from Eq. (36) and Lemma 23. By requiring $W_2^2(q, q_z) \le d/(6\mu)$, we have $v_z \le \sqrt{n\,\mathrm{var}_{q_z(\cdot|x)}(x_0)} \le 2\sqrt{nd/\mu}$, hence
$$\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\cdot\frac{v_z}{n} \le \frac{4e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\sqrt{\frac{d}{n\mu}}. \tag{41}$$
Combining this result with Eq. (40), we have
$$\mathbb{P}_{q_z^{(n)}(\cdot|x)}\left(\big\|\bar v(x) - \mathbb{E}[v_1(x)]\big\| \ge \frac{4e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\sqrt{\frac{d}{n\mu}} + \epsilon\right) \le \exp\left(-\frac{\big(1-e^{-2(T-k\eta)}\big)^2 n\epsilon^2\mu}{32e^{-2(T-k\eta)}}\right) + n\,D_{TV}\big(q(\cdot|x), q_z(\cdot|x)\big).$$
By requiring
$$\frac{4e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\sqrt{\frac{d}{n\mu}} \le \epsilon \;\Longleftarrow\; n \ge \frac{16\,e^{-2(T-k\eta)}\,d}{\big(1-e^{-2(T-k\eta)}\big)^2\mu\epsilon^2},$$
$$\exp\left(-\frac{\big(1-e^{-2(T-k\eta)}\big)^2 n\epsilon^2\mu}{32e^{-2(T-k\eta)}}\right) \le \frac{\delta}{2\lceil T/\eta\rceil} \;\Longleftarrow\; n \ge \frac{64\,\lceil T/\eta\rceil\,e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2\mu\epsilon^2\delta},$$
and $D_{TV}(q, q_z) \le \frac{\delta}{2n\lceil T/\eta\rceil}$, we obtain
$$\mathbb{P}_{q_z^{(n)}(\cdot|x)}\left(\Big\|\frac1n\sum_{i=1}^n v_i(x) - \mathbb{E}[v_1(x)]\Big\| \le 2\epsilon\right) \ge 1 - \frac{\delta}{2\lceil T/\eta\rceil} - \frac{\delta}{2\lceil T/\eta\rceil}.$$
Combining the choice of $n$ and the gap between $q(\cdot|x)$ and $q_z(\cdot|x)$ in Eqs. (37) and (41), we require
$$n \ge \max\left\{\frac{64\,e^{-2(T-k\eta)}\,d}{\big(1-e^{-2(T-k\eta)}\big)^2\mu\epsilon^2},\; \frac{64\,\lceil T/\eta\rceil\,e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2\mu\epsilon^2\delta},\; \frac{16\,e^{-2(T-k\eta)}\,d}{\big(1-e^{-2(T-k\eta)}\big)^2\mu\epsilon^2}\right\}$$
together with
$$W_2\big(q(\cdot|x), q_z(\cdot|x)\big) \le \frac{1-e^{-2(T-k\eta)}}{8e^{-(T-k\eta)}}\,\epsilon, \qquad W_2^2\big(q(\cdot|x), q_z(\cdot|x)\big) \le \frac{d}{6\mu}, \qquad D_{TV}\big(q(\cdot|x), q_z(\cdot|x)\big) \le \frac{\delta}{2n\lceil T/\eta\rceil}. \tag{42}$$
Without loss of generality, suppose $\eta \le 1/2$; due to the range
$$1 - e^{-2(T-k\eta)} \ge 1 - e^{-2\eta} \ge \eta \;\Longrightarrow\; \frac{e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2} \le \eta^{-2},$$
a sufficient condition for achieving Eq. (42) is
$$n \ge 64Td\mu^{-1}\eta^{-3}\epsilon^{-2}\delta^{-1} \ge \max\big\{64d\mu^{-1}\eta^{-2}\epsilon^{-2},\; 64T\eta^{-3}\mu^{-1}\epsilon^{-2}\delta^{-1},\; 16d\mu^{-1}\eta^{-2}\epsilon^{-2}\big\}$$
and, via Pinsker's inequality, $D_{TV}(q, q_z) \le \frac12\delta\eta n^{-1}T^{-1}$, for which
$$D_{KL}(q_z\|q) \le \frac12\,\delta^2\eta^2 n^{-2}T^{-2} = 2^{-13}\,T^{-4}d^{-2}\mu^2\eta^8\epsilon^4\delta^4$$
is sufficient. Hence, the proof is completed.

Lemma 9. With Algorithm 1 and the notation of Table 1, if we choose the initial distribution of the $k$-th inner loop to be
$$q_{T-k\eta,0}(x_0|x) \propto \exp\left(-\frac{\|x - e^{-(T-k\eta)}x_0\|^2}{2\big(1-e^{-2(T-k\eta)}\big)}\right),$$
then, writing $C_{LSI,k}$ for the LSI constant of $q_{T-k\eta}$, the KL divergence can be upper bounded as
$$D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le \frac{L^2}{2\,C_{LSI,k}}\,e^{2(T-k\eta)}\big(d + \|x\|^2\big).$$

Proof. According to the fact that the LSI constant of $q_{T-k\eta}$ is $C_{LSI,k}$, we have
$$D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le \big(2C_{LSI,k}\big)^{-1}\int q_{T-k\eta,0}(x_0|x)\,\big\|\nabla f_*(x_0)\big\|^2\,dx_0 \le L^2\big(2C_{LSI,k}\big)^{-1}\,\mathbb{E}_{x_0\sim q_{T-k\eta,0}(\cdot|x)}\big[\|x_0\|^2\big] = L^2\big(2C_{LSI,k}\big)^{-1}\Big(\mathrm{var}(x_0) + \big\|\mathbb{E}[x_0]\big\|^2\Big).$$
Because $q_{T-k\eta,0}(\cdot|x)$ is a high-dimensional Gaussian, its mean value satisfies $\mathbb{E}[x_0] = e^{(T-k\eta)}x$, and besides,
$$-\nabla^2\ln q_{T-k\eta,0}(x_0) = \frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}}\,I.$$
According to Lemma 20, $q_{T-k\eta,0}(\cdot|x)$ satisfies the log-Sobolev inequality (and the Poincaré inequality), and Lemma 23 gives
$$\mathrm{var}_{x_0\sim q_{T-k\eta,0}(\cdot|x)}(x_0) \le d\,\big(1-e^{-2(T-k\eta)}\big)\,e^{2(T-k\eta)} \le d\,e^{2(T-k\eta)}.$$
Hence we have
$$D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le \frac{L^2}{2\,C_{LSI,k}}\,e^{2(T-k\eta)}\big(d + \|x\|^2\big),$$
and the proof is completed.
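In code, the inner loop controlled by Lemmas 8 and 9 is a plain ULA chain on the regularized posterior, started from its Gaussian factor. The sketch below is our own illustration: the step size `tau` and the chain counts are unspecified assumptions, with Lemma 19 dictating admissible values.

```python
import numpy as np

def ula_posterior_mean(x, s, grad_f, n=100, steps=50, tau=1e-2, rng=None):
    """Estimate E[x0 | x] under q(x0|x) ~ exp(-f_*(x0) - ||x - e^{-s}x0||^2 / (2(1-e^{-2s})))
    by running n parallel ULA chains, initialized from the Gaussian factor
    N(e^{s} x, (e^{2s}-1) I) as in Lemma 9; s = T - k*eta."""
    rng = np.random.default_rng() if rng is None else rng
    a, var = np.exp(-s), 1.0 - np.exp(-2.0 * s)
    x0 = x / a + (np.sqrt(var) / a) * rng.standard_normal((n, x.shape[0]))
    for _ in range(steps):
        # grad log q(x0|x) = -grad f_*(x0) + a (x - a x0) / (1 - e^{-2s})
        grad = -np.array([grad_f(z) for z in x0]) + a * (x - a * x0) / var
        x0 = x0 + tau * grad + np.sqrt(2.0 * tau) * rng.standard_normal(x0.shape)
    return x0.mean(axis=0)
```

Note that the Gaussian factor dominates when $s$ is small, which is why the inner problem becomes well conditioned near the end of the reverse process.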
Lemma 10 (Errors from the inner loop sampling task). Suppose Assumptions [A1], [A2], [A4] hold. With Algorithm 1, the notation of Table 1, and a suitable $\eta = C_1(d+m_2^2)^{-1}\epsilon^2$, we have
$$\sum_{k=0}^{N-1}\eta\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] \le 20\epsilon^2\ln\frac{C_0}{2\epsilon^2}$$
with probability at least $1-\delta$. The gradient complexity of achieving this result is
$$\max\big(C_3C_5, C_3'C_5'\big)\,C_1^{-1}C_0\,\big(d+m_2^2\big)^{18}\,\epsilon^{-16n-83}\exp\big(5C_2\,\epsilon^{-16n}\big)\,\delta^{-6},$$
where the constants $C_i$ and $C_i'$ are independent of $d$ and $\epsilon$.

Proof. To upper bound the error more precisely, we divide the backward process into two stages.

Stage 1: $e^{-2(T-k\eta)} \le \frac{2L}{1+2L}$. This implies that the iteration $k$ satisfies
$$k \le \frac{2T - \ln\frac{1+2L}{2L}}{2\eta} := N_1. \tag{43}$$
In this condition, we set
$$q_{T-k\eta}(x_0|x) \propto \exp\big(-g_{T-k\eta}(x_0|x)\big) := \exp\left(-f_*(x_0) - \frac{\|x - e^{-(T-k\eta)}x_0\|^2}{2\big(1-e^{-2(T-k\eta)}\big)}\right).$$
Hence, we can reformulate $g_{T-k\eta}(x_0|x)$ as
$$g_{T-k\eta}(x_0|x) = \underbrace{f_*(x_0) + \frac{e^{-2(T-k\eta)}}{3\big(1-e^{-2(T-k\eta)}\big)}\|x_0\|^2}_{\text{part 1}} + \underbrace{\frac{e^{-2(T-k\eta)}}{6\big(1-e^{-2(T-k\eta)}\big)}\|x_0\|^2 - \frac{e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\,x_0^\top x + \frac{\|x\|^2}{2\big(1-e^{-2(T-k\eta)}\big)}}_{\text{part 2}}.$$
According to Assumption [A4], part 1 and part 2 are both strongly convex outside the ball with radius $R\big(e^{-2(T-k\eta)}/(6(1-e^{-2(T-k\eta)}))\big)$. With Lemma 22, the function $g_{T-k\eta}(\cdot|x)$ is $\frac{e^{-2(T-k\eta)}}{3(1-e^{-2(T-k\eta)})}$-strongly convex outside that ball. Besides, its gradient Lipschitz constant can be upper bounded as
$$\big\|\nabla^2 g_{T-k\eta}(x_0|x)\big\| \le \big\|\nabla^2 f_*(x_0)\big\| + \frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}} \le L + \frac{e^{-2(T-k\eta)}}{1-e^{-2(T-k\eta)}} \le 3L,$$
where the last inequality follows from the choice of $e^{-2(T-k\eta)}$ in this stage.

When the total time satisfies $e^{-T/2} = 2\epsilon^2/C_0$, we have
$$C_{LSI,k}^{-1} \le 6\big(1-e^{-2(T-k\eta)}\big)e^{2(T-k\eta)}\exp\left(48L\,R^2\!\left(\frac{e^{-2(T-k\eta)}}{6(1-e^{-2(T-k\eta)})}\right)\right) \le 6\,e^{2(T-k\eta)}\exp\big(48L\cdot 6^{2n}\,e^{4n(T-k\eta)}\big),$$
where the second inequality follows from Assumption [A4] and from setting $n, L \ge 1$ without loss of generality. Hence
$$C_{LSI,k}^{-1} \le 6\cdot2^{-4}\,C_0^4\,\epsilon^{-8}\exp\big(48L\cdot6^{2n}\cdot2^{-8n}\,C_0^{8n}\,\epsilon^{-16n}\big) = C_6\,\epsilon^{-8}\exp\big(C_2\,\epsilon^{-16n}\big). \tag{45}$$
For Term 1, due to Lemma 8, if we set the outer-loop step size $\eta = C_1(d+m_2^2)^{-1}\epsilon^2$, the sample number of each iteration $k$ satisfies
$$n_k = C_3\big(d+m_2^2\big)^4\epsilon^{-18}\exp\big(C_2\epsilon^{-16n}\big)\delta^{-1} = 64\,C_0C_1^{-3}C_6\big(d+m_2^2\big)^4\epsilon^{-18}\exp\big(C_2\epsilon^{-16n}\big)\delta^{-1} \ge 64\,Td\,\eta^{-3}\epsilon^{-2}\delta^{-1}\,C_{LSI,k}^{-1},$$
and the accuracy of the inner loop must meet
$$D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le C_4\big(d+m_2^2\big)^{-10}\epsilon^{44}\exp\big(-2C_2\epsilon^{-16n}\big)\delta^4 = 2^{-13}C_0^{-4}C_1^8\big(d+m_2^2\big)^{-10}\epsilon^{28}\delta^4\cdot C_6^{-2}\epsilon^{16}\exp\big(-2C_2\epsilon^{-16n}\big) \le 2^{-13}\,T^{-4}d^{-2}\epsilon^4\delta^4\eta^8\,C_{LSI,k}^2 \tag{46}$$
when $\epsilon^2 \le 1/2$. In this condition, we have
$$\mathbb{P}\left(\Big\|v_k(x) - 2\,\mathbb{E}_{x_0'\sim\tilde q_{T-k\eta}(\cdot|x)}\Big[-\frac{x - e^{-(T-k\eta)}x_0'}{1-e^{-2(T-k\eta)}}\Big]\Big\| \le \epsilon\right) \ge 1 - \frac{\delta}{2\lceil T/\eta\rceil} - \frac{\delta}{2\lceil T/\eta\rceil} = 1 - \frac{\delta}{\lceil T/\eta\rceil},$$
where $\tilde q_{T-k\eta}(\cdot|x)$ denotes the underlying distribution of the output particles of the $k$-th inner loop. To achieve
$$D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x_{k\eta})\,\|\,q_{T-k\eta}(\cdot|x_{k\eta})\big) \le C_4\big(d+m_2^2\big)^{-10}\epsilon^{44}\exp\big(-2C_2\epsilon^{-16n}\big)\delta^4 := \delta_{KL},$$
Lemma 19 requires the inner step size to satisfy
$$\tau_k \le \frac{C_{LSI,k}}{4(3L)^2}\,\min\Big\{1, \frac{\delta_{KL}}{4d}\Big\} = 2^{-4}3^{-2}L^{-2}\,C_4C_6^{-1}\big(d+m_2^2\big)^{-11}\epsilon^{52}\exp\big(-3C_2\epsilon^{-16n}\big)\delta^4$$
and the iteration number $Z_k$ to meet
$$Z_k \ge \frac{1}{C_{LSI,k}\,\tau_k}\,\ln\frac{2\,D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big)}{\delta_{KL}},$$
where $q_{T-k\eta,0}(\cdot|x)$ denotes the initial distribution of the $k$-th inner loop. Choosing $\tau_k$ at its upper bound and the initialization
$$q_{T-k\eta,0}(x_0|x) \propto \exp\left(-\frac{\|x - e^{-(T-k\eta)}x_0\|^2}{2\big(1-e^{-2(T-k\eta)}\big)}\right),$$
Lemma 9 and Eq. (45) give
$$D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le \frac{L^2}{2C_{LSI,k}}\,e^{2(T-k\eta)}\big(d+\|x\|^2\big) \le 2^{-1}L^2\,C_6\,\epsilon^{-8}\exp\big(C_2\epsilon^{-16n}\big)\,e^{2(T-k\eta)}\big(d+\|x\|^2\big).$$
It implies that the inner loops require
$$Z_k = C_5\big(d+m_2^2\big)^{12}\,\epsilon^{-16n-61}\exp\big(4C_2\epsilon^{-16n}\big)\big(d+\|x\|^2\big)\,\delta^{-5}$$
iterations, for an explicit constant $C_5$ independent of $d$ and $\epsilon$, when $\ln(1/\epsilon)\ge2$ and $\ln d\ge2$ without loss of generality. A sufficient condition to obtain Term 2 $\le\epsilon^2$ is
$$D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le 2^{-3}\eta^2\epsilon^2\,C_6^{-1}\epsilon^8\exp\big(-C_2\epsilon^{-16n}\big),$$
which is dominated by Eq. (46) in almost all cases. Hence, combining Eq. (17), Eq. (47) and Eq. (18), we have
$$\sum_{k=0}^{N_1}\eta\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] \le 10\,N_1\eta\,\epsilon^2 \tag{48}$$
with probability at least $1 - N_1\delta/\lceil T/\eta\rceil$, obtained by a union bound. The gradient complexity in this stage is
$$\sum_{k=0}^{N_1} n_k\,\mathbb{E}_{\hat P_T}[Z_k] = \sum_{k=0}^{N_1} C_3\big(d+m_2^2\big)^4\epsilon^{-18}\exp\big(C_2\epsilon^{-16n}\big)\delta^{-1}\cdot\mathbb{E}_{\hat P_T}\Big[C_5\big(d+m_2^2\big)^{12}\epsilon^{-16n-61}\exp\big(4C_2\epsilon^{-16n}\big)\big(d+\|x_{k\eta}\|^2\big)\delta^{-5}\Big] \le N_1\,C_3C_5\,\big(d+m_2^2\big)^{17}\,\epsilon^{-16n-79}\exp\big(5C_2\epsilon^{-16n}\big)\,\delta^{-6}, \tag{49}$$
where the last inequality follows from Lemma 14.

Stage 2: $\frac{2L}{1+2L} \le e^{-2(T-k\eta)} \le 1$. In this stage, Eq. (14) gives a constant-level LSI constant, so we choose the sample and iteration numbers similarly to Stage 1.
Therefore, for Term 1, we require
$$n_k = L\,C_0C_1^{-3}\big(d+m_2^2\big)^4\,\epsilon^{-10}\,\delta^{-1} \ge 64\,Td\,\eta^{-3}\epsilon^{-2}\delta^{-1}\,C_{LSI,k}^{-1}$$
and
$$D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le 2^{-13}L^{-2}C_0^{-4}C_1^8\big(d+m_2^2\big)^{-10}\epsilon^{28}\delta^4 \le 2^{-13}\,T^{-4}d^{-2}\epsilon^4\delta^4\eta^8\,C_{LSI,k}^2. \tag{50}$$
In this condition, we have
$$\mathbb{P}\left(\Big\|v_k(x) - 2\,\mathbb{E}_{x_0'\sim\tilde q_{T-k\eta}(\cdot|x)}\Big[-\frac{x - e^{-(T-k\eta)}x_0'}{1-e^{-2(T-k\eta)}}\Big]\Big\| \le \epsilon\right) \ge 1 - \frac{\delta}{\lceil T/\eta\rceil},$$
where $\tilde q_{T-k\eta}(\cdot|x)$ denotes the underlying distribution of the output particles of the $k$-th inner loop. To achieve $D_{KL}(\tilde q_{T-k\eta}(\cdot|x_{k\eta})\,\|\,q_{T-k\eta}(\cdot|x_{k\eta})\big) \le 2^{-13}L^{-2}C_0^{-4}C_1^8(d+m_2^2)^{-10}\epsilon^{28}\delta^4 := \delta_{KL}$, Lemma 19 requires the step size
$$\tau_k \le \frac{C_{LSI,k}}{4\Sigma_{k,\max}^2}\min\Big\{1, \frac{\delta_{KL}}{4d}\Big\} = 2^{-17}3^{-1}\,\Sigma_{k,\max}^{-1}\,L^{-2}C_0^{-4}C_1^8\big(d+m_2^2\big)^{-11}\epsilon^{28}\delta^4$$
and the iteration number
$$Z_k \ge \frac{1}{C_{LSI,k}\,\tau_k}\,\ln\frac{2\,D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big)}{\delta_{KL}},$$
where $q_{T-k\eta,0}(\cdot|x)$ denotes the initial distribution of the $k$-th inner loop. Choosing $\tau_k$ at its upper bound and the initialization
$$q_{T-k\eta,0}(x_0|x) \propto \exp\left(-\frac{\|x - e^{-(T-k\eta)}x_0\|^2}{2\big(1-e^{-2(T-k\eta)}\big)}\right),$$
Lemma 9 and Eq. (14) give
$$D_{KL}\big(q_{T-k\eta,0}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le \frac{L^2}{2C_{LSI,k}}\,e^{2(T-k\eta)}\big(d+\|x\|^2\big) \le \frac{L}{2}\,e^{2(T-k\eta)}\big(d+\|x\|^2\big).$$
It implies that the inner loops require
$$Z_k = C_5'\big(d+m_2^2\big)^{12}\,\epsilon^{-29}\big(d+\|x\|^2\big)\,\delta^{-5}$$
iterations, for an explicit constant $C_5'$ independent of $d$ and $\epsilon$, when $\ln(1/\epsilon)\ge2$ and $\ln d\ge2$ without loss of generality. For Term 2, we denote an optimal coupling between $q_{T-k\eta}(\cdot|x)$ and $\tilde q_{T-k\eta}(\cdot|x)$ by $\gamma\in\Gamma_{opt}\big(q_{T-k\eta}(\cdot|x), \tilde q_{T-k\eta}(\cdot|x)\big)$. Hence, exactly as in Eq. (18),
$$\Big\|\frac{2e^{-(T-k\eta)}}{1-e^{-2(T-k\eta)}}\Big(\mathbb{E}_{x_0'\sim\tilde q}[x_0'] - \mathbb{E}_{x_0\sim q}[x_0]\Big)\Big\|^2 \le \frac{4e^{-2(T-k\eta)}}{\big(1-e^{-2(T-k\eta)}\big)^2}\,W_2^2\big(q_{T-k\eta}(\cdot|x), \tilde q_{T-k\eta}(\cdot|x)\big) \le 8\eta^{-2}\,C_{LSI,k}^{-1}\,D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big), \tag{52}$$
where the first inequality follows from Jensen's inequality, the second from the Talagrand inequality, and the last from $\frac{e^{-2(T-k\eta)}}{(1-e^{-2(T-k\eta)})^2}\le\eta^{-2}$ when $\eta\le1/2$. Therefore a sufficient condition to obtain Term 2 $\le\epsilon^2$ is
$$D_{KL}\big(\tilde q_{T-k\eta}(\cdot|x)\,\|\,q_{T-k\eta}(\cdot|x)\big) \le 2^{-3}\eta^2\epsilon^2 L^{-1},$$
which is dominated by Eq. (50) in almost all cases. Hence, combining Eq. (17), Eq. (51) and Eq. (52), we have
$$\sum_{k=N_1+1}^{N-1}\eta\,\mathbb{E}_{\hat P_T}\big[\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-k\eta}(x_{k\eta})\|^2\big] \le 10\big(\lceil T/\eta\rceil - N_1\big)\eta\,\epsilon^2 \tag{53}$$
with probability at least $1 - (\lceil T/\eta\rceil - N_1)\delta/\lceil T/\eta\rceil$, obtained by a union bound. The gradient complexity in this stage is
$$\sum_{k=N_1+1}^{N-1} n_k\,\mathbb{E}_{\hat P_T}[Z_k] = \sum_{k=N_1+1}^{N-1} L\,C_0C_1^{-3}\big(d+m_2^2\big)^4\epsilon^{-10}\delta^{-1}\cdot\mathbb{E}_{\hat P_T}\Big[C_5'\big(d+m_2^2\big)^{12}\epsilon^{-29}\big(d+\|x_{k\eta}\|^2\big)\delta^{-5}\Big] \le \big(\lceil T/\eta\rceil - N_1\big)\,C_3'C_5'\,\big(d+m_2^2\big)^{17}\,\epsilon^{-39}\,\delta^{-6}, \tag{54}$$
where the last inequality follows from Lemma 14. Combining Eq. (49) and Eq. (54), the total gradient complexity is at most
$$\max\big(C_3C_5, C_3'C_5'\big)\,C_1^{-1}C_0\,\big(d+m_2^2\big)^{18}\,\epsilon^{-16n-83}\exp\big(5C_2\,\epsilon^{-16n}\big)\,\delta^{-6}.$$
Hence the proof is completed.

E AUXILIARY LEMMAS

Lemma 11 (Lemma 11 of Vempala & Wibisono (2019)). Assume $\nu = e^{-f}$ is $L$-smooth. Then $\mathbb{E}_\nu\big[\|\nabla f\|^2\big] \le dL$.
Lemma 12 (Girsanov's theorem, Theorem 5.22 in Le Gall (2016)). Let $P_T$ and $Q_T$ be two probability measures on the path space $C([0,T];\mathbb{R}^d)$. Suppose that under $P_T$ the process $(x_t)_{t\in[0,T]}$ follows
$$dx_t = b_t\,dt + \sigma_t\,dB_t,$$
while under $Q_T$ the process $(\hat x_t)_{t\in[0,T]}$ follows
$$d\hat x_t = \hat b_t\,dt + \sigma_t\,d\hat B_t, \qquad \hat x_0 = x_0.$$
We assume that for each $t\ge0$, $\sigma_t\in\mathbb{R}^{d\times d}$ is a non-singular diffusion matrix. Then, provided that Novikov's condition holds,
$$\mathbb{E}_{P_T}\left[\exp\left(\frac12\int_0^T\big\|\sigma_t^{-1}\big(b_t - \hat b_t\big)\big\|^2\,dt\right)\right] < \infty,$$
we have
$$\frac{dP_T}{dQ_T} = \exp\left(\int_0^T \sigma_t^{-1}\big(b_t - \hat b_t\big)\,dB_t - \frac12\int_0^T\big\|\sigma_t^{-1}\big(b_t - \hat b_t\big)\big\|^2\,dt\right).$$

Corollary 2. Plugging the settings
$$P_T := P^{p_T}_T, \quad Q_T := \hat P_T, \quad b_t := x_t + v_k(x_{k\eta}), \quad \hat b_t := x_t + 2\nabla\ln p_{T-t}(x_t), \quad \sigma_t \equiv \sqrt2\,I, \quad t\in[k\eta,(k+1)\eta],$$
into Lemma 12 and assuming Novikov's condition holds, we have
$$D_{KL}\big(\hat P_T\,\|\,P^{p_T}_T\big) = \frac14\,\mathbb{E}_{\hat P_T}\left[\sum_{k}\int_{k\eta}^{(k+1)\eta}\big\|v_k(x_{k\eta}) - 2\nabla\ln p_{T-t}(x_t)\big\|^2\,dt\right].$$

Lemma 13 (Lemma C.11 in Lee et al. (2022)). Suppose that $p(x)\propto e^{-f(x)}$ is a probability density function on $\mathbb{R}^d$, where $f$ is $L$-smooth, and let $\varphi_{\sigma^2}(x)$ be the density function of $N(0,\sigma^2 I_d)$. Then for $L\le\frac{1}{2\sigma^2}$,
$$\big\|\nabla\ln p(x) - \nabla\ln\big(p*\varphi_{\sigma^2}\big)(x)\big\| \le 6L\sigma d^{1/2} + 2L\sigma^2\,\|\nabla f(x)\|.$$

Lemma 14 (Lemma 9 in Chen et al. (2022a)). Suppose that Assumptions [A1] and [A2] hold, and let $(x_t)_{t\in[0,T]}$ denote the forward process (1).
1. (moment bound) For all $t\ge0$: $\mathbb{E}\big[\|x_t\|^2\big] \le d\vee m_2^2$.
2. (score function bound) For all $t\ge0$: $\mathbb{E}\big[\|\nabla\ln p_t(x_t)\|^2\big] \le Ld$.

Lemma 15 (Variant of Lemma 10 in Chen et al. (2022a)). Suppose that Assumption [A2] holds, and let $(x_t)_{t\in[0,T]}$ denote the forward process (1). For $0\le s<t$ with $t-s\le1$,
$$\mathbb{E}\big[\|x_t - x_s\|^2\big] \le 2\big(m_2^2 + d\big)(t-s)^2 + 4d\,(t-s).$$

Proof. According to the forward process, we have
$$\mathbb{E}\big[\|x_t - x_s\|^2\big] = \mathbb{E}\left[\Big\|-\int_s^t x_r\,dr + \sqrt2\,(B_t - B_s)\Big\|^2\right] \le 2\,\mathbb{E}\left[\Big\|\int_s^t x_r\,dr\Big\|^2\right] + 4d\,(t-s) \le 2\,(t-s)\int_s^t\mathbb{E}\big[\|x_r\|^2\big]\,dr + 4d\,(t-s) \le 2\big(m_2^2 + d\big)(t-s)^2 + 4d\,(t-s),$$
where the second inequality follows from Hölder's inequality and the last one from Lemma 14. Hence, the proof is completed.

Lemma 16 (Corollary 3.1 in Chafaï (2004)). If $\nu, \tilde\nu$ satisfy LSI with constants $\alpha, \tilde\alpha > 0$, respectively, then $\nu * \tilde\nu$ satisfies LSI with constant $\big(\frac1\alpha + \frac1{\tilde\alpha}\big)^{-1}$.

Lemma 17. Let $x$ be a real random variable. If there exist constants $C, A < \infty$ such that $\mathbb{E}\big[e^{\lambda x}\big] \le C\,e^{A\lambda^2}$ for all $\lambda>0$, then
$$\mathbb{P}(x\ge t) \le C\exp\left(-\frac{t^2}{4A}\right).$$
Proof. By the non-decreasing property of the exponential function $e^{\lambda x}$, we have
$$\mathbb{P}(x\ge t) = \mathbb{P}\big(e^{\lambda x}\ge e^{\lambda t}\big) \le \frac{\mathbb{E}\big[e^{\lambda x}\big]}{e^{\lambda t}} \le C\exp\big(A\lambda^2 - \lambda t\big),$$
where the first inequality follows from Markov's inequality and the second from the given condition. Minimizing the right-hand side, i.e., choosing $\lambda = t/(2A)$, completes the proof.

Lemma 18. If $\nu$ satisfies a log-Sobolev inequality with constant $\mu$, then every 1-Lipschitz function $f$ is integrable with respect to $\nu$ and satisfies the concentration inequality
$$\nu\big(f \ge \mathbb{E}_\nu[f] + t\big) \le \exp\left(-\frac{\mu t^2}{2}\right).$$
Proof. By Lemma 17, it suffices to prove that for any 1-Lipschitz function $f$ with $\mathbb{E}_\nu[f]=0$,
$$\mathbb{E}\big[e^{\lambda f}\big] \le e^{\lambda^2/(2\mu)}.$$
By a routine truncation and smoothing argument, it suffices to prove this for bounded, smooth, compactly supported $f$ with $\|\nabla f\|\le1$. For every $\lambda\ge0$, the log-Sobolev inequality applied to $e^{\lambda f/2}$ implies
$$\mathbb{E}_\nu\big[\lambda f\,e^{\lambda f}\big] - \mathbb{E}_\nu\big[e^{\lambda f}\big]\log\mathbb{E}_\nu\big[e^{\lambda f}\big] \le \frac{\lambda^2}{2\mu}\,\mathbb{E}_\nu\big[\|\nabla f\|^2\,e^{\lambda f}\big].$$
With the notation $\varphi(\lambda) = \mathbb{E}\big[e^{\lambda f}\big]$ and $\psi(\lambda) = \log\varphi(\lambda)$, this can be reformulated as
$$\lambda\,\varphi'(\lambda) \le \varphi(\lambda)\log\varphi(\lambda) + \frac{\lambda^2}{2\mu}\,\mathbb{E}_\nu\big[\|\nabla f\|^2 e^{\lambda f}\big] \le \varphi(\lambda)\log\varphi(\lambda) + \frac{\lambda^2}{2\mu}\,\varphi(\lambda),$$
where the last step follows from $\|\nabla f\|\le1$. Dividing both sides by $\lambda^2\varphi(\lambda)$ gives
$$\left(\frac{\psi(\lambda)}{\lambda}\right)' \le \frac{1}{2\mu}.$$
Since the limiting value satisfies $\lim_{\lambda\to0^+}\frac{\psi(\lambda)}{\lambda} = \mathbb{E}_\nu[f] = 0$, integrating implies $\frac{\psi(\lambda)}{\lambda}\le\frac{\lambda}{2\mu}$, i.e.,
$$\varphi(\lambda) \le \exp\left(\frac{\lambda^2}{2\mu}\right).$$
The proof is then completed by a direct application of Lemma 17.
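Corollary 2 above compares the ideal reverse drift with a drift frozen at the left endpoint of each interval; the corresponding practical update is a single Euler-Maruyama step. A minimal sketch of ours, where `vk` stands for the estimate of $2\nabla\ln p_{T-k\eta}$:

```python
import numpy as np

def reverse_sde_step(xk, vk, eta, rng):
    """One Euler-Maruyama step of the practical reverse SDE
    d x_t = (x_t + v_k(x_{k*eta})) dt + sqrt(2) dB_t, with the drift
    estimate v_k frozen over the interval of length eta."""
    return xk + eta * (xk + vk) + np.sqrt(2.0 * eta) * rng.standard_normal(xk.shape)
```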
Lemma 19 (Theorem 1 in Vempala & Wibisono (2019)). Suppose $p_*$ satisfies LSI with constant $\mu>0$ and is $L$-smooth. For any $x_0\sim p_0$ with $D_{KL}(p_0\|p_*)<\infty$, the iterates $x_k\sim p_k$ of ULA with step size $0<\tau\le\frac{\mu}{4L^2}$ satisfy
$$D_{KL}(p_k\,\|\,p_*) \le e^{-\mu\tau k}\,D_{KL}(p_0\,\|\,p_*) + \frac{8\tau dL^2}{\mu}.$$
Thus, for any $\delta>0$, to achieve $D_{KL}(p_k\|p_*)\le\delta$ it suffices to run ULA with step size $0<\tau\le\frac{\mu}{4L^2}\min\big\{1,\frac{\delta}{4d}\big\}$ for
$$k \ge \frac{1}{\mu\tau}\,\log\frac{2\,D_{KL}(p_0\,\|\,p_*)}{\delta}$$
iterations.

Lemma 20 (Variant of Lemma 10 in Cheng & Bartlett (2018)). Suppose $-\ln p_*$ is an $m$-strongly convex function. Then for any distribution with density function $p$, we have
$$D_{KL}(p\,\|\,p_*) \le \frac{1}{2m}\int p(x)\,\Big\|\nabla\ln\frac{p(x)}{p_*(x)}\Big\|^2\,dx.$$
By choosing $p(x) = g^2(x)\,p_*(x)/\mathbb{E}_{p_*}[g^2(x)]$ for a test function $g:\mathbb{R}^d\to\mathbb{R}$ with $\mathbb{E}_{p_*}[g^2(x)]<\infty$, we have
$$\mathbb{E}_{p_*}\big[g^2\ln g^2\big] - \mathbb{E}_{p_*}\big[g^2\big]\ln\mathbb{E}_{p_*}\big[g^2\big] \le \frac{2}{m}\,\mathbb{E}_{p_*}\big[\|\nabla g\|^2\big],$$
which implies that $p_*$ satisfies the $m$-log-Sobolev inequality.

Lemma 21. Using the notation of Table 1, for each $t\in[0,\infty)$, the underlying distribution $p_t$ of the forward process satisfies
$$D_{KL}(p_t\,\|\,p_\infty) \le 4\big(dL + m_2^2\big)\,e^{-t/2}.$$
Proof. Consider the Fokker-Planck equation of the forward process $dx_t = -x_t\,dt + \sqrt2\,dB_t$, $x_0\sim p_0\propto e^{-f_*}$:
$$\partial_t p_t(x) = \nabla\cdot\big(p_t(x)\,x\big) + \Delta p_t(x) = \nabla\cdot\left(p_t(x)\,\nabla\ln\frac{p_t(x)}{p_\infty(x)}\right).$$
It implies that the stationary distribution is
$$p_\infty(x) \propto \exp\Big(-\frac12\|x\|^2\Big). \tag{55}$$
Then, considering the KL convergence of $(x_t)_{t\ge0}$, we have
$$\frac{d\,D_{KL}(p_t\|p_\infty)}{dt} = \frac{d}{dt}\int p_t(x)\ln\frac{p_t(x)}{p_\infty(x)}\,dx = \int \partial_t p_t(x)\,\ln\frac{p_t(x)}{p_\infty(x)}\,dx = -\int p_t(x)\,\Big\|\nabla\ln\frac{p_t(x)}{p_\infty(x)}\Big\|^2\,dx, \tag{56}$$
where the last equality follows from integration by parts. According to Proposition 5.5.1 of Bakry et al. (2014), if $p_\infty$ is a centered Gaussian measure on $\mathbb{R}^d$ with covariance matrix $\Sigma$, then for every smooth function $f$ on $\mathbb{R}^d$,
$$\mathbb{E}_{p_\infty}\big[f^2\log f^2\big] - \mathbb{E}_{p_\infty}\big[f^2\big]\log\mathbb{E}_{p_\infty}\big[f^2\big] \le 2\,\mathbb{E}_{p_\infty}\big[\langle\Sigma\nabla f, \nabla f\rangle\big].$$
For the forward stationary distribution (55) we have $\Sigma = I$; hence, by choosing $f^2(x) = p_t(x)/p_\infty(x)$,
$$D_{KL}(p_t\,\|\,p_\infty) \le \frac12\int p_t(x)\,\Big\|\nabla\ln\frac{p_t(x)}{p_\infty(x)}\Big\|^2\,dx.$$
Plugging this inequality into Eq. (56), we have
$$\frac{d\,D_{KL}(p_t\|p_\infty)}{dt} \le -2\,D_{KL}(p_t\,\|\,p_\infty).$$
Integrating implies the desired bound, i.e.,
$$D_{KL}(p_t\,\|\,p_\infty) \le e^{-2t}\,D_{KL}(p_0\,\|\,p_\infty) \le C_0\,e^{-t/2}, \qquad C_0 := D_{KL}(p_0\,\|\,p_\infty) \le 4\big(dL + m_2^2\big).$$

Lemma 22. Suppose $f_1:\mathbb{R}^d\to\mathbb{R}$ and $f_2:\mathbb{R}^d\to\mathbb{R}$ are $\mu$-strongly convex for $\|x\|\ge R$. That means $v_1(x) := f_1(x) - \frac\mu2\|x\|^2$ (and $v_2(x) := f_2(x) - \frac\mu2\|x\|^2$) is convex on $\Omega = \mathbb{R}^d\setminus B(0,R)$. Specifically, we require that for $x\in\Omega$, any convex combination $x = \sum_{i=1}^k\lambda_ix_i$ with $x_1,\ldots,x_k\in\Omega$ satisfies $v_1(x)\le\sum_{i=1}^k\lambda_iv_1(x_i)$. Then $f_1+f_2$ is $2\mu$-strongly convex for $\|x\|\ge R$.
Proof. Define $v(x) = f_1(x) + f_2(x) - \mu\|x\|^2$. For $x\in\Omega$ and a convex combination $x = \sum_{i=1}^k\lambda_ix_i$ with $x_1,\ldots,x_k\in\Omega$, we have
$$v(x) = v_1(x) + v_2(x) \le \sum_{i=1}^k\lambda_iv_1(x_i) + \sum_{i=1}^k\lambda_iv_2(x_i) = \sum_{i=1}^k\lambda_iv(x_i).$$
Hence, the proof is completed.

Lemma 23. Suppose $q$ is a distribution satisfying LSI with constant $\mu$. Then its variance satisfies
$$\int q(x)\,\big\|x - \mathbb{E}_q[x]\big\|^2\,dx \le \frac{d}{\mu}.$$
Proof. It is known that LSI implies the Poincaré inequality with the same constant (Rothaus, 1981; Villani, 2021; Vempala & Wibisono, 2019), which can be derived by taking $\rho\to(1+\eta g)\nu$ in Eq. (10). Thus, for a $\mu$-LSI distribution $q$, we have
$$\mathrm{var}_q\big(g(x)\big) \le \frac1\mu\,\mathbb{E}_q\big[\|\nabla g(x)\|^2\big]$$
for all smooth functions $g:\mathbb{R}^d\to\mathbb{R}$. In this condition, we write $b = \mathbb{E}_q[x]$ and $g_i(x) := \langle x, e_i\rangle$, where $e_i$ is the one-hot vector whose $i$-th element is 1 and whose other elements are 0, and obtain
$$\int q(x)\,\|x - b\|^2\,dx = \sum_{i=1}^d\int q(x)\big(\langle x, e_i\rangle - \mathbb{E}_q[\langle x, e_i\rangle]\big)^2\,dx = \sum_{i=1}^d\mathrm{var}_q\big(g_i(x)\big).$$
Combining this with the Poincaré inequality, for each $i$ we have $\mathrm{var}_q(g_i(x)) \le \frac1\mu\,\mathbb{E}_q\big[\|e_i\|^2\big] = \frac1\mu$, whence
$$\int q(x)\,\big\|x - \mathbb{E}_q[x]\big\|^2\,dx = \sum_{i=1}^d\mathrm{var}_q\big(g_i(x)\big) \le \frac{d}{\mu}.$$
Hence, the proof is completed.
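Before turning to the experiments, the pieces above assemble into an end-to-end sketch of RDMC. This reuses the `is_score_estimate` and `reverse_sde_step` sketches defined earlier; all defaults here are illustrative assumptions rather than the tuned values used in Appendix F.

```python
import numpy as np

def rdmc_sample(neg_log_p, d, T=2.0, eta=0.05, n_particles=1000, rng=None):
    """End-to-end RDMC sketch: initialize particles from N(0, I) as a proxy
    for p_T, then integrate the reverse SDE, estimating the drift at each
    outer step from the posterior mean (here via importance sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_particles, d))   # q = p_infty = N(0, I)
    for k in range(int(T / eta)):
        s = T - k * eta                         # remaining forward time (>= eta)
        for i in range(n_particles):
            score = is_score_estimate(X[i], s, neg_log_p, rng=rng)
            X[i] = reverse_sde_step(X[i], 2.0 * score, eta, rng)
    return X
```

In practice the importance-sampling estimator can be swapped for the ULA-based `ula_posterior_mean` when the dimension is large, matching the combination of the two approaches used in our experiments.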
We choose T { ln 0.99, ln 0.95, ln 0.9, ln 0.8, ln 0.7}. We use 10, 50, or 100 iterations to approximate ˆp chosen by the corresponding problem. The inner loop is initialized with importance sampling mean estimator by 100 particles. The inner iteration and inner loop sample-size are chosen from {1, 5, 10, 100}. The outer learning rate is chosen from {T/20, T/10, T/5}. When the algorithm converges, we further perform LMC until the limit of gradient complexity. Note that the gradient complexity is evaluated by the product of outer loop and inner loop. 0 50 100 150 200 grad complexity LMC ULMC rd MC 0 50 100 150 200 grad complexity LMC ULMC rd MC 0 50 100 150 200 grad complexity LMC ULMC rd MC r = 0.5 r = 1.0 r = 2.0 Figure 4: Maximum Mean Discrepancy (MMD) convergence of LMC, ULMC, RDMC. 0 50 100 150 200 grad complexity LMC ULMC rd MC 0 50 100 150 200 grad complexity LMC ULMC rd MC 0 50 100 150 200 grad complexity LMC ULMC rd MC r = 1.0 r = 2.0 r = 4.0 Figure 5: Maximum Mean Discrepancy (MMD) convergence of LMC, ULMC, RDMC. F.2 MORE INVESTIGATION ON ILL-BEHAVED GAUSSIAN CASE Figure 6 demonstrate the differences between Langevin dynamics and the OU process in terms of their trajectories. The former utilizes the gradient information of the target distribution ln p , Published as a conference paper at ICLR 2024 to facilitate optimization. However, it converges slowly in directions with small gradients. On the other hand, the OU process constructs paths with equal velocities in each direction, thereby avoiding the influence of gradient vanishing directions. Consequently, leveraging the reverse process of the OU process is advantageous for addressing the issue of uneven gradient sampling across different directions. 10 0 10 20 30 40 50 10 10 0 10 20 30 40 50 10 10 0 10 20 30 10 t = 0 t = 0.1 t = 1 t = 2 t = 4 t = 10 0 10 20 30 10 t = 0 t = 0.1 t = 1 t = 2 t = 4 N(0, Id) p Langevin µt OU process µt Figure 6: Langevin dynamics vs (reverse) OU process. We consider the sampling path between the 2-dimensional normal distributions N(0, I2) and N((20, 20), diag(400, 1)). The mean µt of Langevin dynamics show varying convergence speeds in different directions, while the OU process demonstrates more uniform changes. In order to further demonstrate the effectiveness of our algorithm, we conducted additional experiments comparing the Langevin dynamics with our proposed method in our sample scenarios. To better highlight the impacts of different components, we chose the 2-dimensional ill-conditioned Gaussian distribution N (20, 20), 400 0 0 1 (shown in Figure 7) as the target distribution to showcase this aspect. In this setting, we obtained an oracle sampler for qt, enabling us to conduct a more precise experimental analysis of the sample complexity. 10 0 10 20 30 40 50 10 10 0 10 20 30 40 50 10 N(0, I) N((20, 20), diag(400, 1)) Initial distribution Target distribution Figure 7: Initial and Target distribution sample illustration. Figure 8 illustrates the distributions of samples generated by the LMC algorithm at iterations 10, 100, 1000, and 10000. It can be observed that these distributions deviate from the target distribution. Published as a conference paper at ICLR 2024 10 0 10 20 30 40 50 10 10 0 10 20 30 40 50 10 T = 10 T = 100 10 0 10 20 30 40 50 10 10 0 10 20 30 40 50 10 T = 1000 T = 10000 Figure 8: Illustration of LMC with different iterations. 
Figure 9 demonstrates the performance of the RDMC algorithm in the infinite-sample regime, revealing significantly better convergence than the LMC algorithm.

[Figure 9: Illustration of RDMC (oracle sampler, infinite samples) with different iteration counts T/(kη) = 10, 100, 1000, 10000.]

Figure 10 showcases the convergence of the RDMC algorithm when estimating the score with different sample sizes. It can be observed that our algorithm is not sensitive to the sample size, demonstrating its robustness in practical applications.

[Figure 10: Illustration of RDMC (oracle sampler, finite samples) with sample sizes 1, 10, 100, and 1000.]

Figure 11 illustrates the convergence behavior of the RDMC algorithm with an inexact solver. Even when employing LMC as the inner-loop solver, the final convergence of our algorithm surpasses that of the original LMC. This is attributed to the insensitivity of the algorithm to the precision of the inner loop when $t$ is large. Additionally, when $t$ is small, the log-Sobolev constant of the inner problem is relatively large, which simplifies the problem as a whole and guides the samples towards the target distribution along the diffusion path.

[Figure 11: Illustration of RDMC (inexact sampler, finite samples) with inner-loop iteration counts 50, 100, 500, and 1000.]

F.3 SAMPLING FROM HIGH-DIMENSIONAL DISTRIBUTIONS

To validate the dimension dependency of the RDMC algorithm, we extend our Gaussian mixture model experiments.

In the log-Sobolev distribution context (see Section 4.2.1), both the Langevin-based algorithms and our proposed method exhibit polynomial dependence on dimensionality. However, the log-Sobolev constant grows exponentially with the radius $R$, which is the primary influencing factor. This leads to behavior akin to the 2-D example presented earlier. A significant limitation of the Langevin-based algorithms is their inability to converge within finite time for large values of $R$, in contrast to the robustness of our algorithm.

In higher-dimensional scenarios (e.g., $r=2$ for $d=50$ and $d=100$), we observe a notable decrease in measured RDMC performance after approximately 100 computations. This decline may stem from the kernel-based computation of MMD, which tends to be less sensitive in higher dimensions. For these large-$r$ cases, LMC and ULMC fail to converge in finite time. Overall, Figure 12 exhibits trends consistent with Figure 3, corroborating our theoretical findings.

[Figure 12: Illustration of RDMC and MCMC (LMC, ULMC) for a high-dimensional Gaussian mixture model (6 modes), d = 5, 10, 50, 100. The first row shows the case r = 1; the second row shows r = 2.]
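For reference, the MMD metric reported in Figures 4, 5, and 12 can be computed as below. This is a biased (V-statistic) sketch of squared MMD with an RBF kernel; the bandwidth `sigma` is an assumption, with the median heuristic a common default. Its kernel saturates as squared distances grow with $d$, consistent with the reduced sensitivity we observe in high dimensions.

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def sq_dists(A, B):
        return ((A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :]
                - 2.0 * A @ B.T)
    k = lambda A, B: np.exp(-sq_dists(A, B) / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```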
For heavy-tailed distributions (refer to Section 4.2.2), the complexity of both RDMC and MCMC is quite high, so it is not feasible to sample from them in limited time. Nonetheless, since our algorithm has only polynomial dependency on $d$ while the Langevin-based algorithms have exponential dependency on $d$, we may retain an advantage. For example, we can use the $n$-th moments to demonstrate the similarity of different distributions. To illustrate the phenomenon, we consider the extreme case of the Cauchy distribution.² According to Figure 13, RDMC approximates the true distribution better, and the approximation gap grows with the dimension.

[Figure 13: 1st, 2nd, and 3rd moment estimators for the Cauchy distribution with n = 1000, reflecting the closeness between distributions. For high-dimensional moments, we compute the sum across all dimensions (trace) for proper plotting. The algorithms are evaluated with 1K gradient complexity.]

² The mean of the Cauchy distribution does not exist. We use this case for a better heavy-tail demonstration.

F.4 DISCUSSION ON THE CHOICE OF p0

Note that different choices of $p_0$ may affect the algorithm. In this subsection, we discuss the impact of the choice of $p_0$ and demonstrate its influence in practice. Practically, selecting $T$ determines the initial distribution for reverse diffusion. While any choice of $T$ can achieve convergence, the computational complexity varies with $T$. According to Figure 14, as $T$ increases, the modes of $p_T$ tend to merge into a single mode, and the merging is exponentially fast (Lemma 21). Even with $T = -\ln 0.95$, the disconnectedness of the modes is alleviated significantly. Thus, the choice of $T$ is not sensitive in the range from $-\ln 0.95$ to $-\ln 0.7$.

[Figure 14: Choice of p_T and its approximation by p_0, for T from -ln 0.95 to -ln 0.7. p_0 approximates p_T properly, with the modes connected so that the approximating distribution is well mixed; RDMC can then be performed properly.]

However, choosing $T$ too small or too large has consequences. If $T$ is too small, approximating $p_T$ by $p_0$ becomes extremely hard; for example, as $T\to0$ the algorithm degenerates towards the Langevin algorithm. If $T$ is too large, it is wasteful to transport from $T$ down to $-\ln 0.7$, since the distribution in this interval is highly homogeneous.³ In summary, aside from the $T\to0$ scenario, our algorithm is insensitive to the choice of $T$ (or $p_0$); selecting an appropriate $T$ can reduce computational demands when using a constant step-size schedule.
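The mode-merging in Figure 14 can be inspected in closed form: under the OU forward process, a unit-variance Gaussian mixture stays a unit-variance mixture whose means shrink by $e^{-T}$. A one-dimensional sketch of ours:

```python
import numpy as np

def mixture_pT(x, T, means, weights):
    """Forward-diffused density p_T of a 1-D Gaussian mixture with
    unit-variance components: each mean m shrinks to e^{-T} m while the
    component variance stays e^{-2T} * 1 + (1 - e^{-2T}) = 1."""
    a = np.exp(-T)
    return sum(w * np.exp(-0.5 * (x - a * m) ** 2) / np.sqrt(2 * np.pi)
               for w, m in zip(weights, means))
```

Evaluating this for $T$ between $-\ln 0.95$ and $-\ln 0.7$ reproduces the gradual mode connection shown in Figure 14.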
F.5 NEAL'S FUNNEL

Neal's funnel is a classic demonstration of how traditional MCMC methods struggle to converge unless specific parameterization strategies are employed. We further investigate the performance of our algorithm in this scenario. As indicated in Figure 15, our method converges more rapidly than the Langevin-based MCMC approaches. Additionally, Figure 16 reveals that while LMC lacks efficient exploration capability, ULMC fails to represent the density accurately; this discrepancy stems from the incorporated momentum/velocity. Our algorithm strikes an improved balance between exploration and precision, primarily owing to the efficacy of the reverse diffusion path, thereby enhancing convergence speed.

[Figure 15: Convergence of Neal's funnel for RDMC, LMC, and ULMC, with d = 2, 5, 10.]

[Figure 16: Samples from Neal's funnel: ground truth, LMC, ULMC, and RDMC.]

³ It is possible to consider a varied step-size schedule, which can be interesting future work.
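For completeness, a sketch of the target used here, under the standard parameterization of Neal's funnel (an assumption, as the text does not fix the constants): $v\sim N(0, 3^2)$ and $x_i\mid v\sim N(0, e^v)$.

```python
import numpy as np

def funnel_neg_log_density(z, sigma_v=3.0):
    """Negative log density (up to an additive constant) of Neal's funnel:
    z = (v, x_1, ..., x_{d-1}), v ~ N(0, sigma_v^2), x_i | v ~ N(0, e^v)."""
    v, x = z[0], z[1:]
    return (0.5 * v ** 2 / sigma_v ** 2
            + 0.5 * x.size * v                 # per-coordinate log-normalizer of N(0, e^v)
            + 0.5 * np.exp(-v) * np.sum(x ** 2))
```

The exponentially narrowing neck for negative $v$ is what defeats fixed-step Langevin methods and motivates the reverse-diffusion path evaluated above.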