# stochastic_localization_via_iterative_posterior_sampling__34f020d5.pdf

Stochastic Localization via Iterative Posterior Sampling

Louis Grenioux * 1 Maxence Noble * 1 Marylou Gabri e 1 Alain Durmus 1

Building upon score-based learning, new interest in stochastic localization techniques has recently emerged. In these models, one seeks to noise a sample from the data distribution through a stochastic process, called observation process, and progressively learns a denoiser associated to this dynamics. Apart from specific applications, the use of stochastic localization for the problem of sampling from an unnormalized target density has not been explored extensively. This work contributes to fill this gap. We consider a general stochastic localization framework and introduce an explicit class of observation processes, associated with flexible denoising schedules. We provide a complete methodology, Stochastic Localization via Iterative Posterior Sampling (SLIPS), to obtain approximate samples of this dynamics, and as a by-product, samples from the target distribution. Our scheme is based on a Markov chain Monte Carlo estimation of the denoiser and comes with detailed practical guidelines. We illustrate the benefits and applicability of SLIPS on several benchmarks of multi-modal distributions, including Gaussian mixtures in increasing dimensions, Bayesian logistic regression and a high-dimensional field system from statisticalmechanics.

1. Introduction

We consider in this paper the problem of sampling from a probability density known up to a normalization constant. This problem finds its origin in various tasks, ranging from Bayesian statistics (Kroese et al., 2011) to statistical mechanics (Krauth, 2006), including now generative modeling (Turner et al., 2019; Grenioux et al., 2023a).

*Equal contribution 1CMAP, CNRS, Ecole polytechnique, Institut Polytechnique de Paris, 91120 Palaiseau, France. Correspondence to: Louis Grenioux <louis.grenioux@polytechnique.edu>, Maxence Noble <maxence.noble-bourillot@polytechnique.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Markov chain Monte Carlo (MCMC) samplers are among the most common approaches for this task with a wide span of applicability. In addition, under appropriate conditions on the target, theoretical guarantees can be derived (Dalalyan, 2017; Durmus & Moulines, 2017). However, for complex distributions, simple MCMC algorithms have some limitations. As a workaround, it has been suggested to solve the problem progressively, targeting intermediate smoothed distributions. The resulting algorithms include Replica Exchange (Swendsen & Wang, 1986), Annealed Importance Sampling (Neal, 2001) and Sequential Monte Carlo (Del Moral et al., 2006). Nevertheless, these methods can still largely struggle in high-dimensional settings. To address this issue, approximate inference methods such as Variational Inference (VI) (Wainwright & Jordan, 2008) have emerged, in connection with deep generative models such as Variational Auto-Encoders (Rezende et al., 2014; Kingma & Welling, 2014) or Normalizing Flows (Rezende & Mohamed, 2015).

In contrast to density-based sampling , which is the setting considered in the present work, generative modeling assumes the availability of training samples and aims to generate similar realizations. Henceforth, we refer to generative modeling as data-based sampling as opposed to density-based sampling . Diffusion-based generative models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) now constitute the state-of-the-art of data-based sampling. In these models, noise is progressively added to the training samples via a forward stochastic process. The derivatives of the logarithm of the forward marginal densities, called scores, are learned via score matching techniques (Hyv arinen & Dayan, 2005; Vincent, 2011), and new samples are obtained by simulating the backward process using the learned scores. This approach to generative modeling has been found to scale well with dimension in a large variety of applications (Rombach et al., 2022; Chen et al., 2021; Nichol et al., 2022) and comes with theoretical guarantees (Chen et al., 2023).

Extending diffusion-based model techniques to densitybased sampling proves challenging as it requires an efficient estimator of the score without samples of the target. Relying on the link made with stochastic optimal control (Tzen & Raginsky, 2019; De Bortoli et al., 2021; Holdijk et al., 2023; Pavon, 2022), several VI methods using deep

Stochastic Localization via Iterative Posterior Sampling

neural networks have recently been proposed for this task (Zhang & Chen, 2022; Berner et al., 2023; Vargas et al., 2023a;c; 2024; Richter et al., 2023). On the other hand, (Huang et al., 2024) take a non-parametric approach and proposed a scheme based on a MCMC estimation of the scores and therefore can potentially mitigate the numerical bias intrinsic in using neural networks or more broadly any parametric estimation method. This is also our approach.

Closely related to denoising diffusion models, Stochastic Localization (SL) techniques have been first employed as a tool to establish results in geometric measure theory (Eldan, 2013; 2020; 2022; Chen & Eldan, 2022), and more recently have been proposed as a density-based sampling approach. (Alaoui et al., 2022; Montanari & Wu, 2023) demonstrate the potential of SL for specific challenging distributions and (Ghio et al., 2023) provide a conjecture on the type of distributions which can be efficiently sampled with these techniques. (Montanari, 2023) provides a blueprint on using SL for density-based sampling but does not discuss a practical strategy for an arbitrary target distribution.

Contributions. Building on these prior works, we bring the following contributions.

We investigate a general framework of SL that allows to reflect on the role of the signal-to-noise scheduling when using SL for sampling.

We identify the challenges in its practical implementation and propose a learning-free sampling methodology using MCMC estimation, which requires few tuning. We elucidate our algorithmic design with theoretical and numerical considerations applicable to a certain class of non log-concave distributions.

We provide numerical evidence of the robustness of the proposed approach in large dimension, beyond the class of distributions amenable to theoretical guarantees. Those results show that the proposed algorithm is on par or superior to modern sampling methods in a wide variety of settings.

The code to reproduce our experiments is available at https://github.com/h2o64/slips.

Notation. For any distribution µ with finite first moment, its expectation is denoted by mµ = R

Rd xdµ(x). The density of the Gaussian distribution with mean m Rd and covariance Σ Rd d is denoted by x 7 N(x; m, Σ).

2. Background on SL for sampling

Consider a target distribution π defined on (Rd, B(Rd)), with B(Rd) Borel sets of Rd endowed with the Euclidean norm. Using SL to sample from π consists in identifying

and simulating a stochastic process that converges almost surely to a random variable distributed as π (Alaoui et al., 2022; Montanari & Wu, 2023; Montanari, 2023). A specific example is to consider, given X π, the stochastic process (Yt)t 0, called the observation process, defined by

Yt = t X + σWt , (1)

where (Wt)t 0 is a standard Brownian motion on Rd, independent from X, and σ > 01. The time-rescaled Yt/t converges almost surely to X as t . Moreover, if π admits a finite second order moment, we can show that the 2-Wasserstein distance between the distribution of Yt/t and π is bounded by σ p

d/t 2, see Appendix B.2. Thus, for T large, YT /T is approximately distributed according to π.

Sampling from (Yt)t 0 using directly (1) cannot be done in practice, precisely since it requires to first sample from π. Nevertheless, this issue can be overcome under the assumption that π admits a finite first moment. Indeed, in this case, (Yt)t 0 solves the Stochastic Differential Equation (SDE),

d Yt = ut(Yt)dt + σd Bt , Y0 = 0 , (2)

where (Bt)t 0 is a standard Brownian motion on Rd and ut(y) = R

Rd xqt(x|y)dx for any y Rd, see (Liptser & Shiryaev, 1977, Theorem 7.12.). The drift function ut involves the conditional density of X given Yt = y Rd

defined for x Rd as

q0(x|y) π(x) , (3)

qt(x|y) π(x)N(x; y/t, σ2/t Id) , t > 0 .

In the literature, the random conditional expectation ut(Yt) = E[X|Yt] is often referred to as the Bayes estimator of X given Yt (Robbins, 1992; Saremi & Hyv arinen, 2019) or the optimal denoiser of Yt (Montanari, 2023; Ghio et al., 2023). Moreover, if π has finite second moment, one can show that the SDE (2) admits unique weak solutions (Liptser & Shiryaev, 1977, Theorem 7.6. & Remark 7.2.7.), see Appendix B.2. As a result, one can simulate (Yt)t [0,T ] for any T > 0 by integrating the SDE (2), while avoiding to first sample from π, if the drift function (ut)t [0,T ] (or an approximation of it) is known.

Yet, the challenge of this sampling approach lies in the estimation of (ut)t [0,T ] without samples from π available a priori. At time t = 0, we have u0(y) = mπ. For any t > 0, ut is linked to the score of the marginal distribution of Yt, given by pt(y) = R

Rd N(y; tx, σ2t Id)dπ(x), following Tweedie s formula given in Appendix A,

ut(y) = y/t + σ2 log pt(y) .

1Note that the authors originally considered σ = 1. 2This bound will be referred to as the localization rate.

Stochastic Localization via Iterative Posterior Sampling

Therefore, sampling via SL can be reduced to a score estimation task, as noted by (Montanari, 2023), who also establishes a direct connection between the SL induced by the process (1) and variance-preserving diffusions (Song et al., 2021), see Appendix B. In data-based sampling, i.e., when samples {Xi}N i=1 from π are at the disposal of the user, ut can be estimated using a parameterized model through a least-squares regression (Saremi & Hyv arinen, 2019; Montanari, 2023) following score matching techniques.

A few recent works started to discuss the density-based sampling task using SL or diffusions (Montanari & Wu, 2023; Huang et al., 2024; Vargas et al., 2023a). Here, we focus on the SL formalism and propose in Section 4 a sampling algorithm for arbitrary distributions based on a learning-free estimator of the denoiser. Before, we provide in Section 3 a working definition of SL with flexible denoising schedules.

3. SL with flexible denoising scheduling

Inspired by recent developments studying optimal noise scheduling in diffusion models (Nichol & Dhariwal, 2021; Kingma et al., 2021; Karras et al., 2022; Kingma & Gao, 2023), we consider a new class of explicit observation processes associated with flexible denoising schedules.

3.1. A more general observation process

For a possibly finite time horizon Tgen (0, ], we now consider the observation process (Y α t )t [0,Tgen) defined by

Y α t = α(t)X + σWt , (4)

where (Wt)t 0 is a standard Brownian motion on Rd, independent from X π, and α(t) = t1/2g(t) with the following assumptions on g:

(a) g C0([0, Tgen), R+) C1((0, Tgen), R+);

(b) g(t) Ctβ/2 as t 0, for some β 1, C > 0;

(c) g is strictly increasing and limt Tgen g(t) = .

Under these assumptions, the observation process (4) verifies Y α 0 = 0 and Y α t /α(t) X almost surely as t Tgen. Moreover, by denoting πα t the distribution of Y α t /α(t), one can show that for any t (0, Tgen), the localization rate is now function of g(t):

W2(π, πα t ) σ

d/g(t) , (5)

where W2 denotes the 2-Wasserstein distance. Proofs and refinements in the case where π is a Gaussian distribution are postponed to Appendix B.2. We recover the process (1) discussed in Section 2 with g(t) = t1/2 and Tgen = .

As in the previous section, sampling from the observation process via (4) requires to first sample from π. We again

Table 1: Stochastic Localization schemes.

SL scheme Hyper-parameters g(t)

Geom- (α1) α1 1 tα1/2

Geom(α1, α2) α1 1, α2 > 0 tα1/2(1 t) α2/2

bypass this issue by introducing an equivalent SDE. In Appendix B.2, we indeed show that (Y α t )t [0,Tgen) solves

d Y α t = α(t)uα t (Y α t )dt + σd Bt , Y α 0 = 0 , (6)

where (Bt)t 0 is a standard Brownian motion on Rd and the drift function uα t is defined for any y Rd by

uα t (y) = R

Rd xqα t (x|y)dx . (7)

Here, the conditional density of X given Yt = y is

qα 0 (x|y) π(x) , (8)

qα t (x|y) π(x)N(x; y/α(t), σ2/g(t)2 Id) , t (0, Tgen) .

Assuming further that π has finite second moment and that g satisfies technical conditions, (6) admits unique weak solutions, see Appendix B.2. Then again, if uα t is known, integrating (6) up to time T (0, Tgen) is equivalent to sampling from (Y α t )t [0,T ]. Then, Y α T /α(T) is approximately distributed according to π for any T close enough to Tgen.

The challenge still lies in the estimation of the drift function uα t which verifies uα 0 (y) = mπ at time t = 0, and which can be expressed, for any t (0, Tgen), using the score of the marginal distribution pα t (y) given by pα t (y) = R

Rd N(y; α(t)x, σ2t Id)dπ(x). In Appendix B.2, we indeed show that

uα t (y) = y/α(t) + {σ2t/α(t)} log pα t (y) , (9)

recovering the previously mentioned link between score estimation and denoiser estimation.

Moreover, the score of pα t can also be written as an expectation over the same posterior distribution qα t but with a different observable

log pα t (y) = 1 α(t) R

Rd log π(x)qα t (x|y)dx . (10)

Hence, this suggests to consider an alternative expression of the SDE defined in (6), involving the expectation given in (10) instead of the denoiser function uα t

3. We refer to Appendix B.3 for more details. However, we experimentally observed this approach was less stable than our original method, which is why we decided to consider the SDE (6).

3Equation (10) and its consequence have been independently derived in the concurrent works (Bortoli et al., 2024; Akhound Sadegh et al., 2024). Both have been publicly available almost at the same time than the present paper.

Stochastic Localization via Iterative Posterior Sampling

0.00 0.25 0.50 0.75 1.00 t/T

Standard Geom- (2) Geom- (3)

0.00 0.25 0.50 0.75 1.00 t/T

Geom(1,1) Geom(1,2) Geom(2,1)

Figure 1: Display of log SNR for the localization schemes Geom- (left) and Geom (right).

3.2. Hyper-parameter selection and SNR profile

At first sight, determining appropriate choices for g, σ and Tgen, without introducing further complexity, can be challenging. Below, we propose interpretable choices, relying on the analysis of the signal-to-noise ratio of the observation process, which determines the speed at which the denoising process localizes on the target distribution.

An intuitive definition of the Signal-to-Noise Ratio (SNR) of the observation process (4) is, for any t [0, Tgen),

SNR(t) = E h α(t)(X E[X]) 2i

E h σWt 2i = R2 πg(t)2

where R2 π = R

Rd x mπ 2 dπ(x) is the scalar variance of π. The monotonicity of g ensures that the SNR is strictly increasing on [0, Tgen), with limt 0 SNR(t) = 0 (no information on π at all) and limt Tgen SNR(t) = (no noise at all). Further, the scaling in time of the SNR schedule is solely determined by the function g, which also controls the time dependence in the localization rate (5). This observation leads us to fix σ = Rπ/

d, assuming that Rπ is known, to obtain a SNR profile SNR(t) = g2(t) independent of the target distribution or the dimension4. In cases where the scalar variance of the target π is unknown, an estimator ˆRπ can be used instead.

The practical implementation of SL-based sampling requires also the choice of an effective time horizon T < Tgen up to which we integrate the SDE (6), similarly to the maximum denoising time in diffusion models. It needs to be close enough to Tgen to consider Y α T /α(T) as approximately distributed according to π and as small as possible to ensure computational efficiency. This trade-off is accounted for by selecting T based on reaching a predefined level of the logarithm of the SNR (log-SNR), which is preferable to use in practice. Denoting by η a positive log-SNR threshold, we set T = Tη, where Tη is such that log SNR(Tη) = 2 log g(Tη) = η. In practice, the choice of η does not have a significant impact on performance (see Appendix D.3 Figure 7).

4Note that our framework would be unchanged by taking σ = 1 and defining g up to a multiplicative constant.

0.0 0.2 0.4 0.6 0.8 1.0 t/T

Standard Geom(1,1)

Uniform discretization SNR-adapted discretization

Figure 2: SNR-adapted vs uniform time discretization for the schemes Standard and Geom(1, 1). The uniform discretization leads to larger log-SNR differences between timesteps where the SNR increases rapidly.

Having reduced the set of hyper-parameters to tune, we investigate two main settings for the denoising schedule g(t). Their main characteristics are summarized in Table 1 and examples of corresponding log-SNR profiles are plotted in Figure 1.

(a) Asymptotic geometric schedule (Geom- ), with Tgen = and g(t) = tα1/2 for α1 1. This is a natural extension of the observation process discussed in Section 2 for which α1 = 1. We refer to it as the standard scheme.

(b) Non-asymptotic geometric schedule (Geom), with Tgen < and g(t) = (t/Tgen)α1/2(1 t/Tgen) α2/2, for α1 1 and α2 > 0. In practice, we only consider Tgen = 1. If α1 = α2, the log-SNR profile is similar to the one obtained in diffusion models via cosine noise scheduling (Nichol & Dhariwal, 2021). Moreover, by taking α2 < α1, the profile becomes flatter near t = 1, similarly to the scheduling profiles presented in (Kingma et al., 2021; Kingma & Gao, 2023). Following the framework of (Albergo et al., 2023), {(1 t)α2/2Yt}t [0,1] is a stochastic interpolant between the Dirac mass at 0 and π.

The possibility to adapt the denoising schedule has not yet received attention in the application of SL for sampling. In the following section, we detail the design of our algorithm which adapts to any denoising schedule.

4. SLIPS sampling algorithm

In this section, we consider hyper-parameters σ, T and a denoising schedule g as given in Section 3.2 and expose our strategy for SL-based sampling. As already mentioned, sampling from π via SL consists in integrating the SDE (6) to obtain a realization {Y α t }t [0,T ] and set Y α T /α(T) as an approximate sample from π. However, the SDE integration faces two main hindrances. First, the drift function (uα t )t [0,T ] defined in (7) is not tractable in general. Second, even if the drift function were known exactly, numerical integration incurs unavoidable discretization errors. We first tackle the latter issue in Section 4.1. We then present a Monte Carlo-based approach to tackle the estimation of uα t in Section 4.2. Finally, we present the key idea of our algorithm SLIPS in Section 4.3.

Stochastic Localization via Iterative Posterior Sampling

4.1. Handling the discretization error

Consider a time discretization of the interval [0, T] defined by an increasing sequence of timesteps (tk)K k=0 where t0 0, t K = T and K 1. We define a sequence of random variables { Y α tk}K k=0 approximating (6), defined for any k {0, . . . , K 1} by the recursion

Y α tk+1 = Y α tk + wkuα tk( Y α tk) + σ p

δk Zk+1 , (12)

where Y α t0 = 0, δk = tk+1 tk, wk = α(tk+1) α(tk) and (Zk)K k=1 is distributed according to the centered standard Gaussian distribution. Note that (12) results from an Euler Maruyama (EM) discretization scheme applied to (6), that corresponds for any k {0, . . . , K 1} to the exact integration of the SDE

d Y α t = α(t)uα tk( Y α tk)dt + σd Bt , t [tk, tk+1] .

In practice, we rather start the integration from a timestep t0 > 0 for reasons that we detail in the next section. We also adopt a time discretization adapted to the variations of the log-SNR on [t0, T]. Define SNR = {log SNR(T) log SNR(t0)}/K, and for any k {0, . . . , K}, choose tk such that

log SNR(tk) = log SNR(t0) + SNRk . (13)

This discretization defines smaller step-sizes when the variation of log SNR is high, i.e., where the denoising process accelerates (see Figure 2). We observed that using uniform discretization leads to significantly greater numerical errors in our experiments, see Appendix D.

As described in the next section, we suggest to estimate uα tk( Y α tk) by Monte Carlo methods. Denoting by MC-Est(uα t (Y )) an estimator of the drift function at time t evaluated at Y , we come to consider the joint sequence {( U α tk, Y α tk)}K k=0 given by

U α tk = MC-Est(uα tk( Y α tk)) , (14)

Y α tk+1 = Y α tk + wk U α tk + σ p

where (tk)K k=0 is the SNR-adapted time discretization of [t0, T] defined in (13). The initialization of Y α t0 is crucial and will be discussed in Section 4.3. We now detail our Monte Carlo estimation scheme defining U α tk.

4.2. Estimating the denoiser via MCMC

Monte Carlo estimation with posterior sampling. We aim to build accurate estimators of the quantities

uα tk( Y α tk) = R

Rd x qα tk(x| Y α tk)dx .

For ease of notation, we will denote the random posterior density qtk( | Y α tk) by µk in this section and recall that for any k {0, . . . , K}

µk(x) π(x)N(x; Y α tk/α(tk), σ2/g(tk)2 Id) . (15)

One possible approach is to estimate U α tk with Importance Sampling (IS), by choosing as instrumental distribution N( Y α tk/α(tk), σ2/g(tk)2 Id). However, IS is known to suffer from large variance in high-dimension unless the proposal is closely adapted to π (Ceriotti et al., 2012). This variance issue translates into sample-size requirements which may grow exponentially with d, see e.g., (Chatterjee & Diaconis, 2018).

Instead, we propose to approximately sample from µk with a MCMC algorithm, in the same fashion as (Huang et al., 2024), and compute U α tk as the empirical mean of the obtained samples. More precisely, given a Markov chain {Xj k}M j=1 targeting µk, we define U α tk = (1/M) PM j=1 Xj k.

In practice, we turn to the largely used Metropolis-Adjusted Langevin Algorithm (MALA) (Roberts & Tweedie, 1996) leveraging gradient-information of the target log-density. Nonetheless, to justify the use of MCMC here, sampling from µk has to be easier than the original problem of sampling from π. In particular, we establish in the next paragraph, conditions ensuring that µk is log-concave, implying a reasonable mixing time for MALA (Dwivedi et al., 2019).

Main issue in posterior sampling. To illustrate the main challenge in the sampling of µk, we focus on a particular setting. In the same spirit as (Saremi et al., 2024, Theorem 1), we consider the following assumption on π.

A0 (Log-concavity outside a compact). There exist R > 0 and τ > 0 such that π is the convolution of µ and N(0, τ 2 Id), where µ is a distribution compactly supported on B = B(mπ, R

d), i.e., µ(Rd \ B) = 0.

This assumption can be equivalently formulated as follows: for any random vector X π, we have X = U + G, where U mπ 2 d R2 holds almost surely and G N(0, τ 2 Id) is independent of U. In this case, d(R2 + τ 2) is an upper bound on R2 π, defined as the scalar variance of π in Section 3.2. In particular, this setting includes non log-concave distributions such as Gaussian mixtures, see Appendix C for more details. Under Assumption A0, we obtain the following result whose formal statement is given in Appendix E.

Theorem 1. Assume A0. There exists tq > 0 (explicit in Appendix E), depending on g, d, R and τ such that if tk > tq, µk is strongly log-concave. In addition, µk is more log-concave as k increases.

Stochastic Localization via Iterative Posterior Sampling

Y α t /α(t)

Sample from the stochastic process

qα t ( |yα t )

Y α t /α(t)

Sample from the stochastic process

qα t ( |yα t )

Geometric(1,1)

Figure 3: Duality of log-concavity: distribution of Y α t /α(t) (up) and qα t ( |yα t ) (down) for t (0, Tgen), where yα t is a realisation of the observation process (red line), for the standard scheme (left) and the Geom(1, 1) scheme (right). The target distribution π is a mixture of two 1D-Gaussian distributions N( 2/3, (0.05)2) and N(4/3, (0.05)2) with respective weights 2/3 and 1/3, which density is given by the blue line. The heat map represents the likelihood of the distributions and the green line on the right edge stands for the distributions taken at the time given by the dotted green line.

This result can be explained by the fact that, for large tk, the Gaussian term dominates the target term in the posterior (15), since σ2/g(tk)2 0 - which intuitively ensures increasing log-concavity. On the other hand, if tk is close to 0, then µk is expected to be close to π, recalling that qα 0 = π. In this case, for non log-concave target distributions satisfying A0, µk is non log-concave for small tk. We illustrate this behaviour in Figure 3 for a bimodal Gaussian mixture: as depicted by the bottom row, the posterior distribution starts multi-modal (non log-concave) and eventually becomes unimodal (log-concave).

Therefore, for a wide variety of target distributions, our MALA-based posterior sampling approach on the time interval [t0, T] will fail at the very first steps if t0 is chosen too small. We now explain how to bypass this issue in SLIPS.

4.3. Duality of log-concavity is all you need

Initialisation of the integration. Following the previous discussion, a reliable computation of our MCMC-based estimator given U α tk requires to have tk > tq in order to ensure log-concavity of the random posterior µk.

Thus, we aim to start the integration from a t0 > tq. To initialize the recursion (14) from t0, one needs to sample the first iterate Y α t0 distributed according to pα t0. Given that U α t0 is a reliable estimator of the denoiser since t0 > tq, we can derive an approximation of the score log pα t0, via Tweedie s formula (9). We use this estimate to sample from pα t0 via the Unadjusted Langevin Algorithm (ULA) 5

(Roberts & Tweedie, 1996). In this initialization procedure, the successive evaluations of the score log pα t0 in ULA can be computed by applying an inner loop of MALA on qα t0. This results in a Langevin-within-Langevin scheme, which is presented in Algorithm 2, and detailed in Appendix F.

5We do not use MALA since only the score of pα t0 is available.

For efficient sampling, ULA however requires a condition of log-concavity on pα t0 (Dalalyan, 2017; Durmus & Moulines, 2017). In Theorem 2, we prove that this property is actually ensured for small values of t0 under Assumption A0.

Theorem 2. Assume A0. There exists tp > 0 (explicit in Appendix E), depending on g, d, R and τ such that if t < tp, pα t is strongly log-concave.

We provide a formal version of this result in Appendix E. Intuitively, when t is small, the Gaussian noise in the observation process (4) overwhelms α(t)X, which makes pα t log-concave. Note however that this log-concavity property is not verified for large t as pα t becomes as log-concave as π by the localization property, see (5). We illustrate this behaviour in the upper row of Figure 3.

Duality of log-concavity. Ideally, we would like to have (i) t0 > tq (to ensure that the estimation from U α tk is correct) and (ii) t0 < tp (to ensure that pα t0 is log-concave). Under an additional assumption to A0, Theorem 3 demonstrates that such t0 can exist. A proof of this result is given in Appendix E.

Theorem 3. Assume A0, with d R2/τ 2 < 2. Then, tq < tp, where tq and tp are given in Theorem 1 and Theorem 2.

Note however that this extra assumption is restrictive and could be slightly improved, as done for Gaussian mixtures in Appendix E. Then, choosing t0 boils down to finding a sweet spot where pα t0 and qα t0 are both approximately log-concave. Together, these requirements form what we call the duality of log-concavity . In Figure 3, we show that t0 can be found for two different localization schemes when considering a Gaussian mixture that does not fit the assumption made in Theorem 3. Furthermore, we highlight in Appendix F.2 that a sweet spot generally exists for a wider variety of target distributions independently of the localization scheme.

Stochastic Localization via Iterative Posterior Sampling

In practice, we ensure the log-concavity of qα t0 first, as the Monte Carlo estimation of uα t0 is the starting point of the SLIPS algorithm.

4.4. Implementation of SLIPS.

By combining all the previous observations made above, we propose the Stochastic Localization via Iterative Posterior Sampling (SLIPS) algorithm, summarized in Algorithm 1, which we detail now.

The SLIPS algorithm has five inputs :

(a) the noising schedule in the observation process t 7 α(t) defined on a time interval [0, Tgen),

(b) the initial integration time in (6): t0 (0, Tgen),

(c) the final integration time in (6): T (t0, Tgen), or equivalently η > 0 by log-SNR correspondence,

(d) the number of discretization steps in (6): K 1,

(e) the number of samples for posterior sampling M 1.

The algorithm can be decomposed in two main parts : (i) its initialization (summarized in Algorithm 2) and (ii) the run of the discretized SDE (14) via posterior sampling.

(i) The initialization of SLIPS consists in approximately sampling from pα t0. Since its score can be expressed via the posterior distribution qα t0, see (7) and (9), and qα t0 can itself be sampled from with MCMC methods (see Section 4.2), we propose to use a Langevin-within-Langevin procedure : each step of the outer loop corresponds to an iteration of the Langevin algorithm to sample from pα t0, and requires local MALA steps to estimate the score by sampling from the posterior qα t0. The exact computations are detailed in Appendix F. The final iterate of this algorithm is used as the initialization of the second part of SLIPS.

(ii) Once the initialization of SLIPS is done, we turn to the actual core part of SLIPS, which consists in running the discretized SDE given in (14). At step k + 1 of this recursion (corresponding to the time tk+1 in the initial SDE), the denoiser term is approximated by running the MALA algorithm on the posterior distribution qα tk conditioned on the k-th iterate of the recursion. Finally, the last iterate of this recursion is considered as an approximate sample from the target distribution.

Complexity of SLIPS. Note that running this procedure is meant to produce one sample from the target distribution. In practice, one may want to produce several samples. In this case, SLIPS can be easily parallelised by running simultaneously independent realizations of the procedure described above. Moreover, we emphasize that the initialization of Y α t0 is the most challenging step in SLIPS as posterior sampling only gets easier afterwards, by Theorem 1.

The computational cost of this initialization, involving a Langevin-within-Langevin procedure, remains however reasonable thanks to a persistent initialization of the ULA and MALA chains (see Appendix F). As such, only a few steps of each is needed in practice (see Appendix H.4).

Limitation of SLIPS. We highlight that t0 is the only hyper-parameter in our algorithm that requires careful tuning. However, Figure 10 and Figure 11 in Appendix F.2 highlight moderate sensitivity to this hyper-parameter.

Algorithm 1 SLIPS

Input: α, t0, η, K, M Set T = Tη and σ = ˆRπ/

d, see Section 3.2 Set (tk)K k=0 as the SNR-adapted disc. of [t0, T], see (13) Initialize Y α t0 with Algorithm 2 for k = 0 to K 1 do

Define δk = tk+1 tk, wk = α(tk+1) α(tk) Simulate {Xk j }M j=1 µk with MALA, see (15) Estimate the denoiser by U α tk = (1/M) PM j=1 Xk j Simulate Y α tk+1 N( Y α tk + wk U α tk, σ2δk Id) end for Output: Y α t K/α(t K)

Algorithm 2 Langevin-within-Langevin initialization

Input: α(t0), t0, σ, N, M Set Y (0) N(0, σ2t0 Id) and λ = σ2t0/2 for n = 0 to N 1 do

Simulate {X(n) j }M j=1 qα t0( |Y (n)) with MALA

Estimate the denoiser by U (n) = (1/M) PM j=1 X(n) j Set ˆsα t0(Y (n)) = α(t0)U (n) Y (n) /(σ2t0), see (6) Simulate Y (n+1) N(Y (n) + λˆsα t0(Y (n)), 2λ Id) end for Output: Y (N)

5. Related work

Score-based sampling with VI. Building upon scorebased generative models and VI, recent works have proposed deep-learning approaches for density-based sampling. Those sampling schemes amount to discretized versions of denoising processes with a parameterized drift function. Two main settings may be distinguished. On one hand, the VI framework is seen as a stochastic optimal control problem involving the time-reversal of the denoising process, see e.g., (Zhang & Chen, 2022; Berner et al., 2023; Vargas et al., 2023b;a). We refer to (Richter et al., 2023) for a global overview of this approach and its extensions. On the other hand, another line of work has proposed parameterized extensions of Annealed Importance Sampling (AIS) (Doucet et al., 2022; Geffner & Domke, 2023; Vargas et al., 2024). Since SLIPS is learning-free, these algorithms are not included in our numerical tests as it would be difficult to draw comparisons at equal computational budget.

Stochastic Localization via Iterative Posterior Sampling

Score-based sampling with Monte Carlo. Preliminary study of Monte Carlo score estimation was done by (Huang et al., 2021), followed by (Vargas et al., 2023b), who considered the F ollmer diffusion (F ollmer, 2005; 2006) bridging δ0 to π in a finite-time setting. However, their method relying on IS does not scale well with dimension, and suffers from numerical unstability. Note that it can be recasted as a particular example of our scheme Geom(1,1), via the stochastic interpolant analogy, see Section 3.2.

Closely related to this work, (Huang et al., 2024) recently proposed Reverse Diffusion Monte Carlo (RDMC), a sampling algorithm based on a Monte Carlo estimation of the drift of the time-reversal of a variance preserving diffusion. Many algorithmic choices in RDMC are similar to SLIPS, turning to Langevin for the drift estimation and using Langevin-within-Langevin for initialization. These authors also discuss the choice of the initial integration time, which plays a role similar to our t0. Notably, under Lipschitz assumption on the score, they derive an upper bound on the overall complexity of RDMC, depending on this time. The present work complements the approach of (Huang et al., 2024). We formalize the crucial trade-off in choosing t0 according to our notion of duality of log-concavity (see Section 4.3), which is only briefly mentioned in (Huang et al., 2024). This focal point naturally arises from our numerical tests in high-dimension that allows us to assess the potential of SL-based sampling in practice, where (Huang et al., 2024) remained mostly theoretical. A precise comparison is provided in Appendix G.

Multi-Noise Measurements sampling. Recently, (Saremi & Srivastava, 2022) exploited a non-Markovian stochastic process to propose a novel data-based sampling scheme. The Multi-Noise Measurements (MNM) process (Y m)M m=1 is defined by Y m = X + σZm where σ > 0, X π and Zm are independently sampled from N(0, Id). Here, the measurement Y m plays the same role as the observation in standard SL6. In (Saremi et al., 2024), the authors suggest to use this process in the density-based setting by computing Y 1:M sequentially through the sampling of Y m conditioned on Y 1:m 1. They show that those conditional distributions are increasingly log-concave, see (Saremi et al., 2024, Theorem 1), making their Once-At-a-Time (OAT) algorithm efficient. In contrast to our framework, (Saremi et al., 2024) mainly assume that the score of the measurement process is analytically available. Although a Monte Carlo-based approach is proposed to estimate the score in realistic settings, results significantly degrade compared to the analytical case. As we detail in Appendix G, this can be explained by the fact that OAT faces but does not tackle the challenge of duality of log-concavity .

6Indeed, both denoising processes have the same localization rate when M = T, see Appendix G for more details.

6. Numerical experiments

In this section, we compare SLIPS against SMC, AIS, RDMC (Huang et al., 2024) and OAT (Saremi et al., 2024). Note that we deliberately omit an exhaustive comparison with standard MCMC methods, as they notoriously fail to sample from multi-modal distributions, see Appendix H.1. For SLIPS, we consider three different SL schemes : Standard, Geom(1,1) and Geom(2,1). Except for RDMC, all the algorithms are informed by the scalar variance R2 π of the target distribution (or an estimation). We tuned the hyperparameters of each algorithm with coarse grid searches of similar size and similar computational budgets assuming access to an oracle distance metric to the target distribution (see details in Appendix H). For OAT, we estimated the intermediate scores with IS.

Toy target distributions. We first discuss standard target distributions including the 8-Gaussians (d = 2), the Rings (d = 2) and the Funnel (d = 10) distributions. Details about their respective definitions are provided in Appendix H. Apart from Funnel, for which we provide results based on the sliced Kolmogorov-Smirnov (KS) distance as per (Grenioux et al., 2023b), we compare the samples obtained with the algorithms against the ground truth using the entropic regularized 2-Wasserstein distance. The first four columns of Table 2 show that SLIPS is on par with most of its competitors on those toy distributions.

Bayesian Logistic Regression. Beyond toy distributions, we sample from the posterior of a Bayesian logistic regression model on two popular datasets : Ionosphere (d = 34) and Sonar (d = 61). More details on the design of the model are available in Appendix H. We evaluate the quality of sampling by computing the average predictive log-likelihood of the obtained samples. The last two columns of Table 2 show that, in these higher dimensions, SLIPS is slightly superior to its counterparts, especially OAT and RDMC.

High-dimensional Gaussian Mixtures. As a challenging task, we seek to estimate the relative weight of a bimodal Gaussian mixture with modes N (x; (2/3)1d, Σ) and N (x; (4/3)1d, Σ), where Σ = 0.05 Id and 1d is the d-dimensional vector with all components equal to 1, and respective weights 2/3 and 1/3. In our experiments, these weights are computed as ˆW and 1 ˆW, where ˆW is a Monte Carlo estimation of R

Rd 1( ,0)d(x)dπ(x). Here, we consider increasing values of d. Figure 4 (top) shows that SLIPS recovers the relative weight of the target at a 1% accuracy, even in high dimensions, whereas the other methods fail to give accurate estimates in most dimensions. In Figure 4 (bottom), we also illustrate the superiority of SLIPS to capture the local properties of the distribution by displaying the sliced Wasserstein distance to the target distribution. Due to high estimation error, AIS and SMC results are omitted in Figure 4 and displayed in Appendix H.

Stochastic Localization via Iterative Posterior Sampling

Table 2: Metrics when sampling toy distributions and posteriors from Bayesian logistic regression models. Bold font indicates best results. The metric for 8-Gaussians and Rings (d = 2) is the entropic regularized 2-Wasserstein distance (with regularization hyper-parameter 0.05) (the lowest, the best), the metric for Funnel (d = 10) is the sliced Kolmogorov-Smirnov distance (the lowest, the best) and the metric for the Bayesian logistic regression on Sonar (d = 34) and Ionosphere (d = 61) datasets corresponds to the average predictive posterior log-likelihood (the highest, the best).

ALGORITHM 8-GAUSSIANS ( ) RINGS ( ) FUNNEL ( ) SONAR ( ) IONOSPHERE ( )

AIS 1.10 0.09 0.19 0.02 0.037 0.004 111.04 0.08 87.92 0.13 SMC 0.99 0.16 0.28 0.06 0.035 0.005 111.02 0.13 87.82 0.16 OAT (SAREMI ET AL., 2024) 0.91 0.09 0.18 0.02 0.105 0.005 280.92 1.34 205.49 0.61 RDMC (HUANG ET AL., 2024) 1.01 0.05 0.30 0.01 0.082 0.006 129.88 0.12 109.84 0.10 SLIPS STANDARD 0.76 0.05 0.19 0.01 0.024 0.003 109.25 0.07 86.65 0.04 SLIPS GEOM(1,1) 0.74 0.12 0.20 0.01 0.032 0.002 109.14 0.09 86.32 0.10 SLIPS GEOM(2,1) 0.75 0.10 0.22 0.02 0.040 0.007 110.24 0.05 86.78 0.08

0 2 4 6 8 10 Mode weight estimation error (in %)

8 16 32 64 Dimension

10 3 10 2 10 1 100 101 Sliced Wasserstein Distance

SLIPS Geom(1,1) SLIPS Geom(2,1)

SLIPS Standard

Figure 4: Metrics when sampling a bimodal Gaussian mixture with d growing. Top: Relative weight estimation error. Bottom: Sliced Wasserstein distance.

Field system ϕ4 from statistical mechanics. Lastly, we sample the 1D ϕ4 model, which was recently used as a benchmark in (Gabri e et al., 2022). At the chosen temperature, the distribution has two well distinct modes with relative weight that can be adjusted through a local-field parameter h. We discretize this continuous model with a grid size of 100 (i.e., d = 100). We estimate the relative weight between the two modes and compare the results with a Laplace approximation.

0.00 0.02 0.04 0.06 0.08 0.10 h

Laplace approximation Laplace approximation (2nd order) SLIPS Geom(1,1) SLIPS Geom(2,1) SLIPS Standard

Figure 5: Estimation of the mode weight ratio of ϕ4 with increasing h - Only SLIPS produced correct samples.

Figure 5 shows that the relative weight estimated by SLIPS lie between the 0-th and 2-nd order Laplace approximations. Due to high estimation error, the results from concurrent algorithms are omitted in Figure 5. We refer to Appendix H for more details on the setting of this numerical experiment.

7. Discussion

In this paper, we introduced a Stochastic Localization (SL) scheme, that features flexible denoising schedule, in order to sample from any unnormalized target density. Relying on this framework, we proposed our algorithm, Stochastic Localization via Iterative Posterior Sampling (SLIPS) which leverages Monte Carlo estimation of the SL denoiser. Its design notably reveals the so-called duality of log-concavity , that can be seen as a trade-off between sampling from the observation process and sampling from the SL posterior. For various localization schemes, we illustrate the performance of SLIPS in high-dimension. In future work, we would like to derive more theoretical guarantees on the phenomenon of duality, investigate improved designs of denoising schedule and consider more challenging target distributions.

Stochastic Localization via Iterative Posterior Sampling

Acknowledgements

The authors would like to thank Saeed Saremi and Francis Bach for insightful discussions during the course of this project. The work of LG and MG is supported by Hi! Paris. Part of the work of AD is funded by the European Union (ERC, Ocean, 101071601). Views and opinions expressed here are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Impact Statement

This paper presents work whose goal is to advance the field of sampling from unnormalized densities. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Akhound-Sadegh, T., Rector-Brooks, J., Bose, A. J., Mittal, S., Lemos, P., Liu, C.-H., Sendera, M., Ravanbakhsh, S., Gidel, G., Bengio, Y., Malkin, N., and Tong, A. Iterated denoising energy matching for sampling from boltzmann densities, 2024.

Alaoui, A. E., Montanari, A., and Sellke, M. Sampling from the Sherrington-Kirkpatrick Gibbs measure via algorithmic stochastic localization. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 323 334, 2022. doi: 10.1109/ FOCS54457.2022.00038.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions, 2023. URL https://arxiv.org/ abs/2303.08797.

Berner, J., Richter, L., and Ullrich, K. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research, 2023. URL https://openreview.net/ forum?id=o YIjw37p TP.

Bonneel, N., Rabin, J., Peyr e, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22 45, 2015.

Bortoli, V. D., Hutchinson, M., Wirnsberger, P., and Doucet, A. Target score matching, 2024.

Cattiaux, P., Conforti, G., Gentil, I., and L eonard, C. Time reversal of diffusion processes under a finite entropy condition. Annales de l Institut Henri Poincar e, Probabilit es et Statistiques, 59(4):1844 1881, 2023. doi:

10.1214/22-AIHP1320. URL https://doi.org/ 10.1214/22-AIHP1320.

Ceriotti, M., Brain, G. A. R., Riordan, O., and Manolopoulos, D. E. The inefficiency of re-weighted sampling and the curse of system size in high-order path integration. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 468 (2137):2 17, 2012. doi: 10.1098/rspa.2011.0413. URL https://royalsocietypublishing.org/ doi/abs/10.1098/rspa.2011.0413.

Chatterjee, S. and Diaconis, P. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099 1135, 2018.

Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https:// openreview.net/forum?id=Ns MLjc Fa O8O.

Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/ forum?id=zy LVMgs Z0U .

Chen, Y. and Eldan, R. Localization schemes: A framework for proving mixing bounds for Markov chains. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 110 122. IEEE, 2022.

Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651 676, 2017.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schr odinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021.

Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411 436, 2006.

Doucet, A., Grathwohl, W., Matthews, A. G., and Strathmann, H. Score-based diffusion meets annealed importance sampling. Advances in Neural Information Processing Systems, 35:21482 21494, 2022.

Durmus, A. and Moulines, E. Quantitative bounds of convergence for geometrically ergodic Markov chain in the Wasserstein distance with application to the Metropolis Adjusted Langevin Algorithm. Statistics and Computing, 25:5 19, 2015.

Stochastic Localization via Iterative Posterior Sampling

Durmus, A. and Moulines, E. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. 2017.

Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. Logconcave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183): 1 42, 2019.

Eldan, R. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geometric and Functional Analysis, 23(2):532 569, 2013.

Eldan, R. Taming correlations through entropy-efficient measure decompositions with applications to mean-field approximation. Probability Theory and Related Fields, 176(3-4):737 755, 2020.

Eldan, R. Analysis of high-dimensional distributions using pathwise methods. Proceedings of ICM, to appear, 2022.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

F ollmer, H. An entropy approach to the time reversal of diffusion processes. In Stochastic Differential Systems Filtering and Control: Proceedings of the IFIP-WG 7/1 Working Conference Marseille-Luminy, France, March 12 17, 1984, pp. 156 163. Springer, 2005.

F ollmer, H. Time reversal on Wiener space. In Stochastic Processes Mathematics and Physics: Proceedings of the 1st Bi Bo S-Symposium held in Bielefeld, West Germany, September 10 15, 1984, pp. 119 129. Springer, 2006.

Gabri e, M., Rotskoff, G. M., and Vanden-Eijnden, E. Adaptive Monte Carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10): e2109420119, 2022.

Gabri e, M., Rotskoff, G. M., and Vanden-Eijnden, E. Adaptive Monte Carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022. doi: 10.1073/ pnas.2109420119. URL https://www.pnas.org/ doi/abs/10.1073/pnas.2109420119.

Geffner, T. and Domke, J. Langevin diffusion variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 576 593. PMLR, 2023.

Ghio, D., Dandi, Y., Krzakala, F., and Zdeborov a, L. Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective. In Neur IPS 2023 Workshop on Diffusion Models. ar Xiv, August 2023. URL http://arxiv.org/abs/ 2308.14085. ar Xiv:2308.14085 [cond-mat].

Grenioux, L., Moulines, E., and Gabri e, M. Balanced training of energy-based models with adaptive flow sampling. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023a. URL https: //openreview.net/forum?id=Aw J2Nqx Wlk.

Grenioux, L., Oliviero Durmus, A., Moulines, E., and Gabri e, M. On sampling with approximate transport maps. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 11698 11733. PMLR, 23 29 Jul 2023b. URL https://proceedings.mlr.press/v202/ grenioux23a.html.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840 6851, 2020.

Holdijk, L., Du, Y., Hooft, F., Jaini, P., Ensing, B., and Welling, M. PIPS: Path integral stochastic optimal control for path sampling in molecular dynamics, 2023. URL https://openreview.net/ forum?id=Tn IZf XSFJAh.

Huang, J., Jiao, Y., Kang, L., Liao, X., Liu, J., and Liu, Y. Schr odinger-F ollmer sampler: sampling without ergodicity. ar Xiv preprint ar Xiv:2106.10880, 2021.

Huang, X., Dong, H., HAO, Y., Ma, Y., and Zhang, T. Reverse Diffusion Monte Carlo. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/ forum?id=k IPEy MSd FV.

Hyv arinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35: 26565 26577, 2022.

Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in neural information processing systems, 34:21696 21707, 2021.

Stochastic Localization via Iterative Posterior Sampling

Kingma, D. P. and Gao, R. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. Proc. Int. Conf. on Learning Representations, 2014.

Krauth, W. Statistical mechanics: algorithms and computations. OUP Oxford, 13, 2006.

Kroese, D. P., Taimre, T., and Botev, Z. I. Handbook of Monte Carlo methods. Number 706 in Wiley series in probability and statistics. Wiley, Hoboken, 2011. ISBN 978-0-470-17793-8.

Liptser, R. S. and Shiryaev, A. N. Statistics of random processes: General theory, volume 394. Springer, 1977.

Montanari, A. Sampling, diffusions, and stochastic localization, 2023.

Montanari, A. and Wu, Y. Posterior sampling from the spiked models via diffusion processes. ar Xiv preprint ar Xiv:2304.11449, 2023.

Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. Advances in Neural Information Processing Systems, 32, 2019.

Neal, R. M. Annealed importance sampling. Statistics and computing, 11:125 139, 2001.

Neal, R. M. Slice sampling. The Annals of Statistics, 31(3): 705 767, 2003. doi: 10.1214/aos/1056562461. URL https://doi.org/10.1214/aos/1056562461.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162 8171. PMLR, 2021.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16784 16804. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/ nichol22a.html.

Olver, F. W. NIST handbook of mathematical functions. Cambridge university press, 2010.

Pavon, M. On local entropy, stochastic control, and deep neural networks. IEEE Control Systems Letters, 7:437 441, 2022.

Peyr e, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530 1538. PMLR, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of The 31st International Conference in Machine Learning, Beijing China, 32:1278 , 2014. URL http://jmlr.org/proceedings/ papers/v32/rezende14.html%5Cnpapers3: //publication/uuid/F2747569-77194EAC-A5A7-9ECA9D6A8FE6. ar Xiv: 1401.4082v3 ISBN: 9781634393973.

Richter, L., Berner, J., and Liu, G.-H. Improved sampling via learned diffusions. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023. URL https://openreview.net/ forum?id=u Lg YD7ie0O.

Robbins, H. E. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and basic theory, pp. 388 394. Springer, 1992.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. ISSN 13507265. URL http://www.jstor.org/ stable/3318418.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684 10695, 2022.

Sachs, M., Leimkuhler, B., and Danos, V. Langevin dynamics with variable coefficients and nonconservative forces: from stationary states to numerical methods. Entropy, 19 (12):647, 2017.

Saremi, S. and Hyv arinen, A. Neural empirical Bayes. Journal of Machine Learning Research, 20(181):1 23, 2019. URL http://jmlr.org/papers/v20/19216.html.

Saremi, S. and Srivastava, R. K. Multimeasurement generative models. In The Tenth International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=QRX0n CX gk.

Stochastic Localization via Iterative Posterior Sampling

Saremi, S., Srivastava, R. K., and Bach, F. Universal smoothed score functions for generative modeling, 2023.

Saremi, S., Park, J. W., and Bach, F. Chain of log-concave Markov chains. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=yi MB2DOjs R.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256 2265. PMLR, 2015.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=Px TIG12RRHS.

Swendsen, R. H. and Wang, J.-S. Replica Monte Carlo Simulation of Spin-Glasses. Physical Review Letters, 57(21):2607 2609, November 1986. doi: 10.1103/Phys Rev Lett.57.2607. URL https://link.aps.org/doi/10.1103/ Phys Rev Lett.57.2607. Publisher: American Physical Society.

Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. Metropolis-Hastings generative adversarial networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6345 6353. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/ v97/turner19a.html.

Tzen, B. and Raginsky, M. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pp. 3084 3114. PMLR, 2019.

Vargas, F., Grathwohl, W. S., and Doucet, A. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2023a. URL https: //openreview.net/forum?id=8pvnf TAbu1f.

Vargas, F., Ovsianas, A., Fernandes, D., Girolami, M., Lawrence, N. D., and N usken, N. Bayesian learning via neural Schr odinger F ollmer flows. Statistics and Computing, 33(1):3, 2023b.

Vargas, F., Reu, T., and Kerekes, A. Expressiveness remarks for denoising diffusion based sampling. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023c. URL https://openreview.net/ forum?id=l-32qmxphio.

Vargas, F., Padhy, S., Blessing, D., and N usken, N. Transport meets variational inference: Controlled monte carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=PP1rudnxi W.

Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011.

Wainwright, M. J. and Jordan, M. I. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1:1 305, 2008. doi: 10.1561/2200000001. URL http: //discovery.ucl.ac.uk/185880/.

Zhang, Q. and Chen, Y. Path Integral Sampler: a stochastic control approach for sampling. In The Tenth International Conference on Learning Representations, 2022.

Stochastic Localization via Iterative Posterior Sampling

Organization of the supplementary

The appendix is organized as follows. Appendix A summarizes general facts that will be useful for proofs. Appendix B provides details on the Stochastic Localization (SL) frameworks introduced in Section 2 and Section 3. In Appendix C, we derive detailed computations on the application of SL to Gaussian mixtures. Appendix D presents an implementation of SL for sampling in the case where the score of the observation process is available. Appendix E dispenses theoretical results on the phenomenon of duality of log-concavity highlighted in Section 4.2 and Section 4.3. In Appendix F, we detail the implementation of our sampling algorithm called SLIPS and provide an ablation study of its hyper-parameters. A detailed comparison between our approach and works from (Huang et al., 2024) and (Saremi et al., 2024) is given in Appendix G. Finally, Appendix H provides more details on the numerical experiments presented in Section 6.

Notation. To alleviate the computations derived below, for a probability distribution µ defined on Rd, and any measurable function φ : Rd Rd, we will denote the expectation R

Rd φ(x)dµ(x) Rd by E[φ(X)] and the covariance R

Rd(φ(x) µ(φ))(φ(x) µ(φ)) dµ(x) Rd d by Cov[φ(X)], where X is a random vector distributed according to µ. For any T (0, + ], two functions f and g defined on [0, T) are said to be asymptotically equivalent if f(t)/g(t) 1 and g(t)/f(t) 1 as t T. This is denoted by f(t) g(t).

A. Preliminaries

Metrics. We recall that the 2-Wasserstein distance between two distributions µ and ν is given by

W2(µ, ν) = inf{ R

Rd Rd x1 x0 2dπ(x0, x1) : π0 = µ, π1 = ν}1/2 ,

where πi denotes the i-th marginal of π for i {0, 1}. While it is not tractable in general, we have an explicit expression when comparing two Gaussian distributions. Consider µ = N(m1, γ2 1 Id) and ν = N(m2, γ2 2 Id), where (m1, m2) Rd Rd

and (γ1, γ2) (0, )2. Then, following (Peyr e et al., 2019, Equation 2.41), we have

W2(µ, ν)2 = m1 m2 2 + d(γ1 γ2)2 . (16)

We also recall that the entropic regularized 2-Wasserstein distance between two distributions µ and ν is defined by

W2,ε(µ, ν) = inf{ R

Rd Rd x1 x0 2dπ(x0, x1) H (π) : π0 = µ, π1 = ν}1/2 ,

where ε > 0 is a regularization hyper-parameter and H (π) = R

Rd Rd log π(x0, x1)dπ(x0, x1) refers to as the entropy of π.

Finally, we recall that the Kolmogorov-Smirnov distance between two distributions µ and ν is defined by

KS(µ, ν) = sup x Rd |Fµ(x) Fν(x)| ,

where Fµ (respectively Fν) denotes the cumulative distribution function of µ (respectively ν).

Below, we provide a first result, known as the Tweedie s formula, which gives the expression of the score of any distribution that writes as a marginalization, and its derivative.

Lemma 4 (Tweedie s formula and extension). Let p be a positive probability density on Rd such that for any y Rd

Rd p(y|x)q(x)dx, where (x, y) 7 p(y|x) is a positive transition density on Rd Rd, and q is a positive density on Rd. Assume that for any x Rd, y 7 log p(y|x) is twice continuously differentiable and that, for any k {1, 2}, there exists φk : Rd R+ such that R

Rd φk(x)q(x)dx < and for any (x, y) Rd Rd, we have k yp(y|x) φk(x).

For any y Rd, define the posterior density x 7 p(x|y) by p(x|y) = p(y|x)p(x)/p(y).

Then, we have for any y Rd

y log p(y) = Z

Rd y log p(y|x)p(x|y)dx . (17)

Stochastic Localization via Iterative Posterior Sampling

This is often referred to as the Tweedie s formula (Robbins, 1992). We also have for any y Rd

2 y log p(y) = Z

Rd 2 y log p(y|x)p(x|y)dx + Z

Rd y log p(y|x){ y log p(y|x)} p(x|y)dx (18)

Rd y log p(y|x)p(x|y)dx Z

Rd y log p(y|x)p(x|y)dx .

For any y Rd, these two results can be reformulated as

y log p(y) = E[ y log p(y|X)] , 2 y log p(y) = E[ 2 y log p(y|X)] + Cov[ y log p(y|X)] ,

where X is a random vector distributed according to the posterior distribution defined by dµy(x) = p(x|y)dx.

Proof. First observe that for any x Rd, y 7 p(y|x) is twice continuously differentiable since p(y|x) = exp(log p(y|x)), where y 7 log p(y|x) is twice continuously differentiable.

We begin with the proof of (17). By combining the assumptions of the lemma with the dominated convergence theorem, we obtain that p C2(Rd, (0, )), and that for any y Rd, we have

y log p(y) = yp(y)

Rd p(y|x)q(x)dx R

Rd p(y|x)q(x)dx =

Rd y log p(y|x)p(y|x)q(x)dx R

Rd p(y|x)q(x)dx = Z

Rd y log p(y|x)p(x|y)dx ,

We now turn to the proof of (18). Note that we have y log p(y) = p(y) 1 R

Rd y log p(y|x)p(y|x)q(x)dx. Then, using once again the dominated convergence theorem, we obtain that for any y Rd,

2 y log p(y) = 1 p(y)

Rd 2 y log p(y|x)p(y|x)q(x)dx + 1 p(y)

Rd y log p(y|x){ y log p(y|x)} p(y|x)q(x)dx

Rd y log p(y|x)p(y|x)q(x)dx

Rd 2 y log p(y|x)p(x|y)dx + Z

Rd y log p(y|x){ y log p(y|x)} p(x|y)dx

y log p(y) 1

Rd y log p(y|x)p(y|x)q(x)dx ,

which gives the result.

We now dispense a useful lemma to compute exact integration in SDEs with linear drift.

Lemma 5. Let T > 0. Consider the SDE defined on [0, T] by d Yt = βt(Yt + b)dt + σd Bt, where β is integrable on [0, T], b Rd, and σ > 0. For any T t > s 0, the conditional density of Yt given Ys = ys, denoted by pt|s( |ys), verifies

pt|s(yt|ys) = N(yt; exp( R t s βudu)ys + (exp( R t s βudu) 1)b, σ2 R t s exp(2 R t u βrdr)du Id) .

Proof. Define γ(t) = exp( R t 0 βudu). Consider the stochastic process Zt = γ(t)Yt. By ˆIto s formula, we have d Zt = βtγ(t)bdt + γ(t)σd Bt = γ(t)bdt + γ(t)σd Bt. By integrating this SDE between s and t, we have

γ(t)Yt γ(s)Ys = {γ(s) γ(t)}b + σ R t s γ(u)d Bu ,

Yt = exp( R t s βudu)Ys + (exp( R t s βudu) 1)b + σ R t s exp( R t u βrdr)d Bu ,

which gives the result using ˆIto s isometry and the fact that Ys is independent from (Bt Bs)t [s,T ].

Stochastic Localization via Iterative Posterior Sampling

B. Details on Stochastic Localization and our extension

Vocabulary of Stochastic Localization (SL). For sake of clarity, we precise below some terms employed in our paper:

Y α t : observation process,

g: denoising schedule,

uα t : denoiser function,

uα t (Y α t ): denoiser or Bayes estimator,

qα t ( |Y α t ) : random posterior density. In the geometric measure theory literature, this is referred to as the SL process.

B.1. Reminders and intuition on generalized stochastic localization

Connection between standard stochastic localization and diffusion models. Stochastic localization as defined in (1) is equivalent to Variance-Preserving (VP) diffusion models (Song et al., 2021) under change of time. Indeed, if we define Φ(t) = log(1 + σ2t 1)/2 for any t 0, with the convention that Φ(0) = , and consider the Ornstein-Uhlenbeck process (Xt)t 0 solving the SDE

d Xs = Xsds +

2d Bs , X0 π , (19)

then, (Yt)t 0 and ( p

t(t + σ2)XΦ(t))t 0 have the same distribution by application of Dubins-Schwartz theorem (Montanari, 2023). In other words, the localization process (1) can be identified as a non-linear time-reversal transformation of the noising VP process (19).

Comments on our framework. We recall that α(t) = t1/2g(t). We now justify the assumptions given on g in Section 3.1.

Rationale behind (a) and (b): together, these requirements ensure that (i) α is continuously derivable at t = 0, without imposing this same condition on g and that (ii) we have α(0) = 0. With (i), the SDE defined in (6) does not have a singularity at time t = 0. With (ii), it allows to fix Y α 0 to a deterministic value. This is equivalent to full independence with X, i.e., full noise looking at the expression of the SNR in (11), which is natural to start the denoising process. Note that taking g instead of g would be equivalent in the denoising procedure, since the SNR considers g(t)2. Nonetheless, considering g with positive values is arbitrary and allows us to simplify our framework.

Rationale behind (c): as t Tgen, we would like to denoise increasingly the observation process, i.e., gaining progressively information about X, such that we obtain complete denoising at the end of the process. Looking at the expression of the SNR in (11), this requires to naturally take g(t) as t Tgen and g strictly increasing, recalling that g takes non negative values.

Note that it includes the case of standard stochastic localization by taking g(t) = t1/2. In this case, all properties are verified: in particular, consider C = 1 and β = 1 in (b).

B.2. Theoretical results.

Here, we provide details on the theoretical claims made in Section 3.1 on our extended framework of stochastic localization.

Link between score and denoiser function. The formula given in (9) is an immediate corollary of Lemma 4 applied to the observation process defined in (4). Indeed, the conditional distribution of Y α t given X = x Rd is pα t ( |x) = N(α(t)x, σ2t Id). Then, it comes that y log pα t (y|x) = {α(t)x y}/σ2t, and we obtain the result using Lemma 4-(17).

Localization rate. We begin with the localization rate given in Equation (5). We define the following quantity Loc R(t) = σ

d/g(t). We emphasize that the assumption of finite second order moment on the target distribution, that is used below, guarantees the proper definition of the 2-Wasserstein distance.

Stochastic Localization via Iterative Posterior Sampling

Proposition 6. Assume that π has finite second order moment. Denote by πα t the probability distribution of Y α t /α(t) where (Y α t )t [0,Tgen) is the stochastic observation process defined in (4). Then, for any t (0, Tgen), we have

W2(π, πα t ) Loc R(t) .

Proof. Let t (0, Tgen). Consider the following coupling (X, ˆY α t ): (i) X π, (ii) ˆY α t = α(t)X + σ

t Z, where Z N(0, Id). Denote Xα t = ˆY α t /α(t). We have Xα t = X + {σ/g(t)} Z and ˆXα t πα t . Therefore, it comes that

W2(π, πα t )2 E[ σ/g(t) Z 2] = σ2d/g(t)2 ,

which gives the result.

In particular, we recover the localization rate of the standard setting given in Section 2 with g(t) = t1/2 and Tgen = . Note that this upper bound applies on general distributions that may be non log-concave, as considered in A0, and may be improved by assuming further regularity. We notably show a strong refinement in the Gaussian case by a factor O(g(t)).

Proposition 7. Consider the target distribution given by π = N(m, γ2 Id) where m Rd and γ > 0. Then, as t Tgen, we have

W2(π, πα t ) = γ

d σ 2γg(t) Loc R(t) .

Proof. In this setting, the distribution of the observation process at time t is tractable, and we have

pα t = N(α(t)m, {α(t)2γ2 + σ2t} Id) .

Then, we get that

πα t = N(m, {γ2 + σ2/g(t)2} Id) .

In this particular, using (16), the 2-Wasserstein distance is given by

W2(π, πα t ) = γ

from which we deduce the result by a simple asymptotic equivalent.

Interestingly, the same bound applies on the denoiser uα t (Y α t ) = R

Rd xqα t (x|Y α t ). The proof relies on the same structure.

Proposition 8. Assume that π has finite second order moment. Denote by πα t the probability distribution of uα t (Y α t ) where (Y α t )t [0,Tgen) is the stochastic observation process defined in (4). Then, for any t (0, Tgen), we have

W2(π, πα t ) Loc R(t) .

Proof. Let t (0, Tgen). Consider the following coupling (X, ˆY α t ): (i) X π, (ii) ˆY α t = α(t)X + σ

t Z, where Z N(0, Id). Denote Xα t = uα t (Y α t ). We have Xα t πα t . Therefore, it comes that

W2(π, πα t )2 EX,Xα t [ Xα t X 2] = EX, ˆY α t [ E[X| ˆY α t ] X 2] = EX, ˆY α t [ E[X| ˆY α t /α(t)] X 2] .

Since conditional expectations are orthogonal projections in L2, we have

W2(π, πα t )2 EX, ˆY α t [ ˆY α t /α(t) X 2] = E[ σ/g(t) Z 2] = σ2d/g(t)2 ,

which gives the result.

Similarly, we also show a strong refinement of Proposition 8 by a factor O(g(t)) in the Gaussian case.

Stochastic Localization via Iterative Posterior Sampling

Proposition 9. Consider the target distribution given by π = N(m, γ2Id) where m Rd and γ > 0. Then, as t Tgen, we have

W2(π, πα t ) = γ

d σ 2γg(t) Loc R(t) .

Proof. Here again, the distribution of the observation process at time t is tractable, we have

pα t = N(α(t)m, {α(t)2γ2 + σ2t} Id) . (20)

In particular, the score is tractable and given by

y log pα t (y) = α(t)m y α(t)2γ2 + σ2t .

Using relation (9), we thus have

uα t (y) = α(t)γ2

α(t)2γ2 + σ2ty + σ2t α(t)2γ2 + σ2tm . (21)

Recalling that πα t is the distribution of uα t (Y α t ), by combining (20) and (21), it comes that

πα t = N m, γ2 n 1 + σ2 γ2g(t)2 o 1 Id

In this particular, using (16), the 2-Wasserstein distance is given by

W2(π, πα t ) = γ

from which we deduce the result with a simple asymptotic equivalent.

General remark on denoising in Stochastic Localization. In our presentation of standard stochastic localization where the observation process is defined in (1), we take YT /T as approximate sample from π, where YT is obtained by running the (discretized) SDE (2) up to time T. Yet, in standard stochastic localization, (Montanari, 2023) proposed to take the last denoiser of the procedure u T (YT ) = R

Rd xq T (x|YT )dx. We highlight here that this is equivalent asymptotically.

The two approaches have the same upper bound on their rate of convergence, see Proposition 6 and Proposition 8 in the standard setting. Note that we also obtain the same rate asymptotically in the Gaussian case, see Proposition 7 and Proposition 9.

When T is large, the Gaussian term dominates the target term in the expression of the posterior qt given in (3). Noting that σ2/T 1, it comes that q T (x|YT ) δYT /T , and therefore u T (YT ) YT /T.

As shown above, this equivalence also occurs in generalized stochastic localization. In addition to the similarity of results in Proposition 6 and Proposition 8, we observe that, as T is large if Tgen = or as T is close to Tgen otherwise, the Gaussian term dominates the target term in the expression of the posterior qt given in (8). Since σ2/g(T)2 1, it comes that qα T (x|Y α T ) δY α T /α(T ), and therefore, we have again uα T (Y α T ) Y α T /α(T).

In the main document of our paper, we choose to consider the distribution of Y α T /α(T) rather than the the distribution of the denoiser uα T (Y α T ) as an approximation of the target distribution. This choice seemed to us to be easier to understand for non-expert readers. In practice, we however align with the original methodology and compute the denoiser uα T (Y α T ) instead of Y α T /α(T). This allows to us to have a fair comparison with the approach from (Saremi et al., 2024), who also consider such Bayes estimator for sampling.

Stochastic Localization via Iterative Posterior Sampling

Markovian projection of the observation process. Following (Liptser & Shiryaev, 1977), we now turn to existence and uniqueness results on the SDE (6), under assumptions on π. These are corollaries of the general results: (Liptser & Shiryaev, 1977, Theorem 7.12), restated in Proposition 10, and (Liptser & Shiryaev, 1977, Theorem 7.6), restated in Proposition 12.

Proposition 10 (Theorem 7.12 in (Liptser & Shiryaev, 1977)). Let (Ω, F, P) be a complete probability space on Rd, let (Ft)t 0 be a nondecreasing family of sub-σ-algebras and let W = (Wt, Ft)t 0 be a Wiener process. Let T > 0, σ > 0. Consider the ˆIto process Y = (Yt, Ft)t [0,T ] satisfying the following SDE

d Yt = βtdt + σd Wt, Y0 = 0,

where the stochastic process β = (βt, Ft)t [0,T ] verifies P( R T 0 βt dt < ) = 1.

Assume that R T 0 E[ βt ]dt < and define h : [0, T] Rd Rd by

ht(y) = E[βt|Yt = y] .

Also define the stochastic process (W t)t [0,T ] as

0 hs(Ys)ds .

Then, W is a Wiener process and Y is a diffusion process satisfying the following SDE

d Yt = ht(Yt)dt + σd W t , Y0 = 0 .

Since the original result considers the case where σ = 1 and d = 1, we provide below the proof of Proposition 10 for completeness.

Proof. Let t [0, T]. We have that Yt = Y0 + R t 0 βsds + σ R t 0 d Ws = R t 0 βsds + σWt. Hence, we get that

0 [βs hs(Ys)]ds + Wt .

In particular, W 0 = 0. Using Itˆo s formula, we have that

d(exp(iz W t)) = i exp(iz W t)z d W t z 2

2 exp(iz W t)dt .

Integrating this formula between s and t, where 0 s t T, we obtain that

exp(iz (W t W s)) = 1 + iz Z t

s exp(iz (W u W s))d Wu

s exp(iz (W u W s))[βu hu(Yu)]du

s exp(iz (W u W s))du .

Using E[ R t s exp(iz (W u W s))d Wu|Ys] = 0 and

s exp(iz (W u W s))[βu hu(Yu)]du|Ys

s exp(iz (W u W s))E[βu hu(Yu)|Yu]du|Ys

by definition of hu, we get that

E[exp(iz (W t W s))|Ys] = 1 z 2

s E[exp(iz (W u W s))|Ys]du .

Stochastic Localization via Iterative Posterior Sampling

Define f(t) = E[exp(iz (W t W s))|Ys]. Since f(t) = 1 z 2

2 R t s f(u)du, f is continuously differentiable on [s, t].

Following our computations, we have f(t) = ( z 2 /2)f(t) with f(s) = 1, which implies that f(t) = exp( z 2

2 (t s)). Consequently, W is a Wiener process with the filtration (Ft)t [0,T ]. By definition of W, this means that Y satisfies the following SDE d Yt = ht(Yt)dt + σd W t , Y0 = 0 .

Corollary 11. Assume that π has finite first order moment. Consider the observation process Y α defined in (4). Then it is a solution to the SDE

d Y α t = α(t)uα t (Y α t )dt + σd Bt, Y α 0 = 0 .

Proof. Consider T (0, Tgen). We recall that Y α t = α(t)X + σWt. Then, (Y α t )t [0,T ] is an ˆIto process satisfying the SDE

d Y α t = βtdt + σd Bt, Y α 0 = 0 ,

where βt = α(t)X. Note that β is well defined on [0, T] due to the assumptions (a) and (b) considered on g and that we immediately have P( R T 0 βt dt < ) = 1. Moreover, since α is strictly increasing on [0, T], | α| = α and we get that

0 E[ βt ]dt = E[ X ] Z T

0 α(t)dt = E[ X ]{α(T) α(0)} .

Since E[ X ] < by assumption on π, we thus have R T 0 E[ βt ]dt < . By defining ht(y) = R

Rd α(t)xqα t (x|y)dx = α(t)uα t (y), we finally obtain the result by applying Proposition 10.

Proposition 12 (Theorem 7.6 & Remark 7.2.7. in (Liptser & Shiryaev, 1977)). Let T > 0, σ > 0. Consider the following SDE

d Yt = ht(Yt)dt + σd Wt, Y0 = 0 , (22)

where P( R T 0 ht(Yt) 2 dt < ) = 1. Then, (22) admits at most one weak solution.

The proof of this result lies in the fact that the distributions of solutions to the SDE (22) have the same density with respect to the distribution of the Brownian motion. We refer to (Liptser & Shiryaev, 1977) for the complete proof.

Corollary 13. Assume that π has finite second order moment and that t α(t)2 is integrable at time t = 0. Then, the SDE defined in (6) by

d Y α t = α(t)uα t (Y α t )dt + σd Bt , Y α 0 = 0 ,

has a unique weak solution.

Proof. Consider T (0, Tgen). Since π has its first order moment that is finite, we have existence of solutions by Corollary 11. Consider a solution Y α and denote ht = α(t)uα t . We have ht(Y α t ) 2 = α(t)2 E[X|Y α t ] 2 α(t)2E[ X 2 |Y α t ] by Jensen s inequality. Then, using this inequality, we obtain that

E[ R T 0 ht(Y α t ) 2 dt] = R T 0 E[ ht(Y α t ) 2]dt R T 0 α(t)2E[E[ X 2 |Y α t ]]dt = E[ X 2] R T 0 α(t)2dt

Combining the assumptions on π and α, it comes that E[ R T 0 ht(Y α t ) 2 dt] < using the inequality above, and therefore P( R T 0 ht(Y α t ) 2 dt < ) = 1. We finally obtain the result by applying Proposition 12.

We emphasize that the extra assumption made on α in Corollary 13 is verified for the localization schemes Geom and Geompresented in Section 3.2.

Stochastic Localization via Iterative Posterior Sampling

B.3. Alternative denoising approach to SLIPS

We note here the score of the observation process can be expressed via an expectation over the posterior of the model, in a different manner than (9). Define vα t (y) = R

Rd x log π(x)qα t (x|y)dx, where qα t is the posterior density given in (8).

Lemma 14. Consider the observation process defined in (4) with marginal distribution at time t given by pα t . Assume that log π is continuously differentiable on Rd and that there exists φ : Rd R+ such that R

Rd φ(z)N(z; 0, σ2t Id)dz < and for any (z, y) Rd Rd, we have yπ({z + y}/α(t)) φ(z). Then, we have for any y Rd

y log pα t (y) = vα t (y)/α(t) .

Proof. We recall that pα t (y) = R

Rd N(y; α(t)x, σ2t Id)dπ(x). Then, by change of variable z = α(t)x y, we have

Rd π({z + y}/α(t))N(z; 0, σ2t Id)dz ,

where the multiplicative constant does not depend on y. For any y Rd, denote by qα t ( |y) the density defined up to a normalizing constant by qα t (z|y) π({z + y}/α(t))N(z; 0, σ2t Id). By combining the assumptions of the lemma with the result of Lemma 4-(17) for this new expression of pα t , we obtain that

y log pα t (y) = Z

Rd y log π({z + y}/α(t)) qα t (z|y)dz

Rd log π({z + y}/α(t)) qα t (z|y)dz

Rd x log π(x)qα t (x|y)dx ,

where we re-applied the change of variable z = α(t)x y in the last equality.

Therefore, the SDE (6) is strictly equivalent to the SDE

d Y α t = α(t) α(t){Y α t + σ2t

α(t)vα t (Y α t )}dt + σd Bt , Y α 0 = 0 .

Hence, to simulate from the observation process in a learning-free fashion, we can adopt a similar strategy to SLIPS, that rather involves the SDE derived above. Given a SNR-adapted discretization (tk)K k=1 of a time interval [t0, T], where t0 > 0 and K 1, it amounts to consider a sequence {( Y α tk, V α tk)}K k=1, where V α tk is a Monte Carlo estimator of vα tk( Y α tk) and Y α tk is obtained by solving the SDE

d Y α t = α(t) α(t){ Y α t + σ2tk

α(tk)Vtk}dt + σd Bt , t [tk, tk+1] . (23)

In theory, this approach leads to the same level difficulty in the Monte Carlo estimation as SLIPS, since the involved random poster densities are the same. In particular, this formulation does not bypass the duality of log-concavity explained in Section 4.3. The main difference with the approach presented in the main of our paper however lies in the integration of the SDE (23), which is not tractable at first sight. By combining the result of Lemma 5 and the use of the stochastic Exponential Integrator scheme (Durmus & Moulines, 2015), we show in Appendix D.1 how to solve this SDE for our localization schemes. We emphasize that this implementation is also available in our code.

Unfortunately, we found in our early experiments that this approach suffered from numerical unstability and showed higher variance than SLIPS. We think that this is due to the evaluation of log π in our MC estimation for the early steps of the algorithm, and could be overcome by using the SLIPS recursion given in (14) instead. We leave this study for future work.

Stochastic Localization via Iterative Posterior Sampling

C. Detailed computations for Gaussian mixtures

In this section, we consider the special case where π is a mixture of N Gaussians with weights (wi)N i=1, means (mi)N i=1 and covariance matrices (γ2 i Id)N i=1. Under this assumption, pα t can be explicitly written for any y Rd and t (0, Tgen) as

i=1 wi N(y; α(t)mi, t(g2(t)γ2 i + σ2) Id) .

This means that the distribution of the observation process is itself a mixture of Gaussians with the same weights as π but with means (α(t)mi)N i=1 and covariance matrices (t(g2(t)γ2 i + σ2) Id)N i=1. This elementary result is obtained by applying the rule of linear combination of independent Gaussian random variables. The score of the observation process can also be computed by noticing that for all y Rd

log pα t (y) = log pα t (y) pα t (y) = PN i=1 wit 1(g2(t)γ2 i + σ2) 1(y α(t)mi)N(y; α(t)mi, t(g2(t)γ2 i + σ2) Id) PN i=1 wi N(y; α(t)mi(g2(t)γ2 i + σ2) Id) .

One can also compute the posterior distribution qα t for any t (0, Tgen) and x, y Rd as

i=1 wi N(x; mi, γ2 i Id)N(y;

tg(t)x, σ2t Id)

tg(t)) d N(x; mi, γ2 i Id)N x; y

i=1 wiα(t) d N mi; y α(t), γ2 i + σ2

| {z } = wα t,y,i

σ2 + g2(t)γ2 i

γ2 i + yg(t)

| {z } =mα t,y,i

σ2 + g2(t)γ2 i

| {z } =(γα t,y,i)2

This shows that the posterior is itself a mixture of Gaussian distributions with weights (wα t,y,i)N i=1 where wα t,y,i = wα t,y,i/ PN j=1 wα t,y,j, means (mα t,y,i)N i=1 and covariance matrices ((γα t,y,i)2 Id)N i=1.

Additionally, we can derive tight expressions for the constants R and τ introduced in A0 in the case where π is a Gaussian mixture parameterized by N = 2, w1 = 1 w2 = w, with w (0, 1), m2 = m1 = a 1d7, with a > 0, and γ1 = γ2 = γ, with γ > 0. In this case, π verifies A0, where (i) µ is a mixture of two Dirac masses at a 1d and +a 1d with respective weights w and 1 w and (ii) τ = γ. Moreover, for any random vector U µ, it holds almost surely that

U mπ = U a(1 2w) 1d 2 max(w, 1 w)a

Therefore, we obtain that R = 2 max(w, 1 w)a in A0. Note also that π, defined as the distribution of X mπ where X π, is still a Gaussian mixture that verifies A0 with same constants R and τ.

In the experiments conducted in Section 6, Appendix D and Appendix F.2, we will consider a rolling example given by the target distribution π where w = 1/3, a = 1.0 and γ2 = 0.05. In this case, we have R = 4/3 and τ 2 = 0.05 and the corresponding density is defined as

3 1d, 0.05 Id + 1

3 1d, 0.05 Id . (24)

We recall that d(R2 + τ 2) is an upper bound on the scalar variance of this target distribution.

7We recall that 1d stands for the d-dimensional vector with all components equal to 1.

Stochastic Localization via Iterative Posterior Sampling

D. Sampling via Stochastic Localization in an ideal setting

The goal of this section is to validate the claims from Section 4.1 about the minimization of the integration error. In this section, we work under the assumption that the score log pα t is a known function and do not consider any MCMC method at all (even for the initialization). This setting removes entirely the estimation error that we deal with in Section 4.2 to solely focus on the integration error. For the initialization, we now consider an arbitrary t0 > 0 with Y α t0 distributed as N(0, σ2t0 Id) (the best estimation that we have at the beginning of the SDE). We will clarify 4 different points in this section.

(a) Exploring Exponential Integration (EI) based schemes for SDE discretization;

(b) Analyzing the impact of the SNR-adapted discretization on the integration error;

(c) Exploring the limits of the Gaussian approximation Y α t0 N(0, σ2t0 Id);

(d) Exploring the impact of the computational budget.

In the numerical examples presented below, we consider the target distribution defined in (24) where d = 10. We compute the exact score using the analytical formulas from Appendix C. We choose to display two complementary results based on (i) the empirical Sliced Wasserstein distance (Bonneel et al., 2015; Nadjahi et al., 2019), which tells how local information on π is recovered, and (ii) the error in estimating the weight of the first mode, which tells about global properties of the estimation. Moreover, for any SL scheme α, the values of t0 and T will be taken so that the log-SNR evaluated at times t0 and t K has the same value (the initial log-SNR is taken as 4.0 and the last log-SNR is taken as 5.0). This enables fair comparison across different schedules.

D.1. Exploring new integration schemes

By exploiting the relation given in (9) between the denoiser function and the score of the observation process, which is now assumed to be tractable, the SDE (6) is equivalently defined on [t0, T] by

d Y α t = α(t) α(t){Y α t + σ2t log pα t (Y α t )}dt + σd Bt , Y α 0 = 0 . (25)

Through this formulation, we have fully removed the problem of score estimation, but the issue of discretization error still remains to sample from (Y α t )t [t0,T ] with (25). Due to the divergence of the coefficient t 7 α(t)/α(t) at time 0 (and time Tgen in the finite-time setting), we propose to use the stochastic Exponential Integrator (EI) scheme (Durmus & Moulines, 2015) in this setting. Consider a time discretization of the interval [t0, T] defined by an increasing sequence of timesteps (tk)K k=0 where t K = T and K 1. Then, the EI scheme applied on the SDE (25) amounts to define a sequence of random variables { Y α tk}K k=0 obtained by exactly integrating the SDE defined for any k {0, . . . , K 1} by

d Y α t = α(t) α(t){ Y α t + σ2tk log pα tk( Y α tk)}dt + σd Bt , t [tk, tk+1] .

Although the exact integration is not guaranteed for a general schedule α as defined in Section 3.1, our specific design of the denoising schedule g(t) = t 1/2α(t) in the schemes Geom and Geomprovides tractable computations following the result of Lemma 5. We treat these cases separately below.

EI scheme combined with Geomlocalization scheme. In this setting, we recall that g(t) = tα1/2. Then, we have

α(t) α(t) = α1 + 1

Note that α(t)/α(t) 0 as t 0. Following Lemma 5, the sequence { Y α tk}K k=0 obtained by the EI scheme is defined by the recursion

Y α tk+1 = tk+1

σ2tk log pα tk( Y α tk) + σ

tk+1 α1tα1 k (tα1 k+1 tα1 k )Zk+1 .

where (Zk)K k=1 is distributed according to the standard centered Gaussian distribution. In the standard case, i.e., α1 = 1, this simplifies as

Y α tk+1 = tk+1

tk Y α tk + { tk+1

tk 1}σ2tk log pα tk( Y α tk) + σ q

tk (tk+1 tk)Zk+1 .

Stochastic Localization via Iterative Posterior Sampling

EI scheme combined with Geom localization scheme. In this second setting, we recall that g(t) = tα1/2(1 t) α2/2

(Tgen = 1). Then, we have

α(t) α(t) = α1 + 1

2t + α2 2(1 t) .

Note that α(t)/α(t) 0 as t 0 and t 1. Following Lemma 5, the sequence { Y α tk}K k=0 obtained by the EI scheme is defined by the recursion

Y α tk+1 = tk+1

σ2tk log pα tk( Y α tk)

2 k+1 (1 tk+1) α2

2F1( α1, α2, 1 α1, tk)t α1 k 2F1( α1, α2, 1 α1, tk+1)t α1 k+1Zk+1 ,

where 2F1 denotes the hypergeometric function, see (Olver, 2010, Chapter 15), and (Zk)K k=1 is distributed according to the standard centered Gaussian distribution.

When (α1, α2) = (1, 1), this simplifies as

Y α tk+1 = ( tk+1

1 tk+1 ) 1 2 Y α tk

1 tk+1 ) 1 2 1}σ2tk log pα tk( Y α tk)

t2 k+1 1 tk+1 log( tk

tk+1 ) + tk+1(tk+1 tk)

tk(1 tk+1) Zk+1 .

When (α1, α2) = (2, 1), this simplifies as

Y α tk+1 = ( tk+1

tk ) 3 2 ( 1 tk

1 tk+1 ) 1 2 Y α tk

tk ) 3 2 ( 1 tk

1 tk+1 ) 1 2 1}σ2tk log pα tk( Y α tk)

+ σtk+1( tk+1/tk 1

1 tk+1 ) 1 2 q

2tktk+1 1Zk+1 .

When (α1, α2) = (1, 2), this simplifies as

Y α tk+1 = ( tk+1

tk ) 1 2 ( 1 tk

1 tk+1 ) 3 2 Y α tk

tk ) 1 2 ( 1 tk

1 tk+1 ) 3 2 1}σ2tk log pα tk( Y α tk)

(tk+1 tk){t2 k+1 + tk+1

tk } + 2t2 k+1 log( tk

tk+1 )}Zk+1 .

Stochastic Localization via Iterative Posterior Sampling

0 200 400 600 800 1000 Solver iteration

Sliced Wasserstein

0 200 400 600 800 1000 Solver iteration

Uniform discretization SNR-adapted discretization

Figure 6: Sliced Wasserstein distance between the true samples from pα tk and the empirical distribution of Y α tk obtained by EI scheme combined with uniform time discretization (green) and SNR-adapted discretization (red) for the Standard (left) and Geom(1,1) (right) schemes.

10 6 10 4 10 2

Sliced Wasserstein

Weight estimation error

Figure 7: Impact of t0 with different schedules measured with the relative weight estimation error and the sliced Wasserstein distance. Left: Impact of t0 in the standard scheme. Right: Impact of t0 and (Tgen T) in the Geom(1,1) scheme.

D.2. Studying the impact of the SNR-adapted time discretization

In Section 4.1, we suggested to take a SNR-adapted time discretization by (tk)K k=0 such that

log SNR(tk) = log SNR(t0) + SNRk ,

with t0 > 0 and SNR = (log SNR(T) log SNR(t0))/K. Figure 6 shows that the SNR-adapted discretization efficiently reduces the integration error for both schedules Geom and Geom- . Note, the impact on the schedule Geom(1,1) is more moderated as the uniform initialization already splits the curve into moderated log-SNR increments (see Figure 2) because of a slower rate near t0.

D.3. Studying the impact of the Gaussian approximation in the initialization

In this section only, we replace the target distribution considered in (24) by the mixture of two Gaussian distributions N( 3 110, Σ) and N(3 110, Σ), where Σ = 0.05 I10, with weights respectively given by 2/3 and 1/3. Therefore, π has non-zero mean and high variance, which provides a challenging setting for our Gaussian approximation at initialization given by Y α t0 N(0, σ2t0 Id). Without information on the target, this is the best estimation that we have when t0 is close to 0. Figure 7 shows that only high values of t0 (i.e. above 10 2) degrades the performance in our localization schemes. This underlines the need of running the Langevin-within-Langevin correction procedure of this approximation as explained in Section 4.3. Additionally, note that Figure 7 (right) shows that taking (Tgen T) low in the finite time setting does not degrade performance, which was not obvious from the log-SNR shape near Tgen in Figure 1.

D.4. Studying the impact of the number of discretization steps

The bottom row of Figure 8 shows that, in the ideal case where the score is known, using the SNR-adapted discretization makes all the different schemes equivalent8. Moreover, we see that the sampling error quickly stabilizes with medium budgets which highlights that the integration error was successfully minimized.

8We recall that we choose t0 and T so that the starting and ending levels of SNR are the same across different schemes.

Stochastic Localization via Iterative Posterior Sampling

25 26 27 28 29 210 211 212 213

Computational budget

Sliced Wasserstein distance

25 26 27 28 29 210 211 212 213

Computational budget

Mode weight estimation error (in %)

Geom(1.0, 1.0) Geom(1.0, 2.0) Geom(2.0, 1.0) Standard

Figure 8: Impact of the computational budget K with different schemes - The sampling error is computed either with the Sliced Wasserstein distance (left) or the relative weight error (right).

E. Theoretical results on the duality of log-concavity

In this section, we detail the results on log-concavity provided in Section 4.2 and Section 4.3. For sake of readability, those are stated with a general σ > 0. Our first result provides uniform upper bounds on the Hessian of the log-density of the observation process and its corresponding log-posterior density.

Lemma 15. Assume A0. Let t (0, Tgen). We recall that pα t stands for the marginal distribution at time t of the observation process defined by (4), while qα t stands for the corresponding posterior density. Under regularity assumptions on π detailed in the proof of the lemma, we have for any (x, y) Rd Rd that

2 y log pα t (y) ζp(t) Id , where ζp(t) = α(t)2d R2

(α(t)2τ 2+σ2t)2 1 α(t)2τ 2+σ2t , (26)

2 x log qα t (x|y) ζq(t) Id , where ζq(t) = d R2

τ 4 1 τ 2 g(t)2

In particular, ζq is a strictly decreasing function on (0, Tgen).

Proof. We begin by proving (26). Consider the stochastic process (Y α t )t [0,Tgen), given in (4), and defined by Y α t = α(t)X + σWt where X π and (Wt)t 0 is a standard Brownian motion. Due to A0, X can be written as X = U + G where U µ, with µ compactly supported on B = B(mπ, R

d), and G N(0, τ 2 Id). In particular, it holds almost surely that U mπ 2 d R2. We thus have the following decomposition

pα t (y) = R

B pα t (y|u)dµ(u) ,

where pα t ( |u) is the conditional density of Y α t given U = u B defined as

pα t (y|u) = N(y; α(t)u, {α(t)2τ 2 + σ2t} Id) . (28)

Then, it comes that

y log pα t (y|u) = y + α(t)u α(t)2τ 2 + σ2t , 2 y log pα t (y|u) = 1 α(t)2τ 2 + σ2t Id .

We now make the technical assumption that for any k {1, 2}, there exists φp,k : B R+ such that R

B φp,k(u)dµ(u) < and for any (u, y) B Rd, we have k ypα t (y|u) φp,k(u). By combining this assumption with the result of Lemma 4- (18), we obtain that

2 y log pα t (Y α t ) = 1 α(t)2τ 2 + σ2t Id + Cov Y α t + α(t)U α(t)2τ 2 + σ2t

= 1 α(t)2τ 2 + σ2t Id + α(t)2

(α(t)2τ 2 + σ2t)2 Cov[U|Y α t ] .

Stochastic Localization via Iterative Posterior Sampling

Since U mπ 2 d R2, we have Cov[U|Y α t ] d R2 Id9. Then, we obtain the bound (26) as 2 log pα t is continuous and pα t is positive on Rd.

We now prove (27). We recall from (8) that qα t (x|y) π(x)N(x; y/α(t), σ2/g(t)2 Id). Therefore

2 x log qα t (X|Y α t ) = 2 x log π(X) g(t)2

In particular, we have π(x) = R

B N(x; u, τ 2 Id)dµ(u).

We now make the technical assumption that for any k {1, 2}, there exists φq,k : B R+ such that R

B φq,k(u)dµ(u) < and for any (u, x) B Rd, we have k x N(x; u, τ 2 Id) φp,k(u). Then, by using again the result of Lemma 4-(18), we obtain that

2 x log π(X) = 1

τ 4 Cov[U|X] ,

Since U mπ 2 d R2, we also have Cov[U|X] d R2 Id10. We obtain (27) with similar reasoning as before.

Note that the upper bounds obtained in Lemma 15 can be made tighter in the case where the target distribution is a Gaussian mixture, as shown in Lemma 16.

Lemma 16. Let a > 0, γ > 0 and w (0, 1). Consider the target distribution π defined as the mixture of the Gaussian distributions N( a 1d, γ2 Id) and N(+a 1d, γ2 Id), with respective weights w and 1 w. Then for any (x, y) Rd Rd, we have

2 y log pα t (y) = 2α(t)2a2

(α(t)2γ2+σ2t)2(1+cosh g(t,y)) 1d1 d 1 α(t)2γ2+σ2t Id , with g(t, y) = 2α(t)ay 1d

α(t)2γ2+σ2t + log( 1

2 x log qα t (x|y) = 2a2 γ4(1+cosh f(t,x)) 1d1 d 1 γ2 Id g(t)2

σ2 Id , with f(t, x) = 2ax 1d

γ2 + log( 1

In particular, tighter expressions of ζp and ζq can be obtained in Lemma 15 by replacing R by R/{2 max(w, 1 w)} R.

The proof of this result is inspired from the derivation of (Saremi et al., 2024, Eq. (4.6)).

Proof. As explained in Appendix C, π verifies A0, where τ = γ and µ is a mixture of two Dirac masses at a 1d and a 1d with respective weights w and 1 w, whose density is given by p(u) = w1 a 1d(u) + (1 w)1a 1d(u).

We recall that the stochastic process (Y α t )t [0,Tgen), given in (4), is defined by Y α t = α(t)X + σWt where X π and (Wt)t 0 is a standard Brownian motion. Due to A0, X can be written as X = U + G where U µ and G N(0, γ2 Id).

We begin by proving (29). Following the first part of the proof of Lemma 15, we have

2 y log pα t (y) = 1 α(t)2γ2 + σ2t Id + α(t)2

(α(t)2γ2 + σ2t)2 Cov[Uy] ,

where Uy is a random vector distributed according to pα t ( |y), the conditional distribution of U given Y α t = y Rd, whose density is given by

pα t (u|y) p(u)pα t (y|u) ,

where pα t (y|u) is defined in (28). Therefore, we obtain that

pα t (u|y) w exp

y + α(t)a1d 2

2{α(t)2γ2 + σ2t}

| {z } w1(t,y)

1 a 1d(u) + (1 w) exp

y α(t)a1d 2

2{α(t)2γ2 + σ2t}

| {z } w2(t,y)

9Note that this upper bound may be loose, since it does not depend on y. 10Note that this upper bound may be loose, since it does not depend on x.

Stochastic Localization via Iterative Posterior Sampling

Hence, pα t ( |y) is a mixture of two Dirac masses at a 1d and a 1d with respective weights w1(t, y) = w1(t, y)/{w1(t, y) + w2(t, y)} and w2(t, y) = w2(t, y)/{w1(t, y) + w2(t, y)}. Therefore, we get that

Cov[Uy] = E[Uy U y ] E[Uy]E[Uy]

= w1(t, y)a21d1 d + w2(t, y)a21d1 d ( w2(t, y) w1(t, y))2a21d1 d = {1 ( w2(t, y) w1(t, y))2}a21d1 d .

In particular, we have

w2(t, y) w1(t, y) = w2(t, y) w1(t, y)

w2(t, y) + w1(t, y) = exp(log w2(t, y)) exp(log w1(t, y))

exp(log w2(t, y)) exp(log w1(t, y)) ,

where log w1(t, y) = y+α(t)a1d 2

2{α(t)2γ2+σ2t} + log(w) and log w2(t, y) = y α(t)a1d 2

2{α(t)2γ2+σ2t} + log(1 w). Then, it comes that

1 ( w2(t, y) w1(t, y))2 = 1 tanh log w2(t, y) log w1(t, y)

= 2 1 + cosh(log w2(t, y) log w1(t, y)) .

Moreover, we get that

log w2(t, y) log w1(t, y) = 2α(t)ay 1d α(t)2γ2 + σ2t + log 1

Denote g(t, y) = log w2(t, y) log w1(t, y). We finally obtain (29) by combining previous computations.

We now prove (30). Following the second part of the proof of Lemma 15, we have

2 x log qα t (x|y) = 1

γ2 Id g(t)2

γ4 Cov[Ux] ,

where Ux is a random vector distributed according to µ( |x), the conditional distribution of U given X = x Rd, whose density is given by

µ(u|x) N(x; u, γ2 Id)µ(u) .

Therefore, we obtain that

µ(u|x) w exp

| {z } w1(t,x)

1 a 1d(u) + (1 w) exp

| {z } w2(t,x)

Hence, µ( |x) is a mixture of two Dirac masses at a 1d and a 1d with respective weights w1(t, x) = w1(t, x)/{w1(t, x) + w2(t, x)} and w2(t, x) = w2(t, x)/{w1(t, x) + w2(t, x)}. Therefore, we get that

Cov[Ux] = E[Ux U x ] E[Ux]E[Ux]

= w1(t, x)a21d1 d + w2(t, x)a21d1 d ( w2(t, x) w1(t, x))2a21d1 d = {1 ( w2(t, x) w1(t, x))2}a21d1 d .

Similarly to the computations derived above, we obtain that

1 ( w2(t, x) w1(t, x))2 = 2 1 + cosh f(t, x) ,

f(t, x) = log w2(t, x) log w1(t, x) = 2ax 1d

Stochastic Localization via Iterative Posterior Sampling

which finally leads to (30).

We now derive tighter expressions for ζp and ζq in this setting. Since 2/(1 + cosh) 1 and 1d1 d d Id, we have

2 y log pα t (y) α(t)2a2d (α(t)2γ2+σ2t)2 Id 1 α(t)2γ2+σ2t Id ,

2 x log qα t (x|y) a2d

γ4 Id 1 γ2 Id g(t)2

In this particular setting, we recall that R = 2 max(w, 1 w)a, see Appendix C, and thus, the upper bounds on the Hessians derived above are equivalent to

2 y log pα t (y) α(t)2d R2

(α(t)2γ2+σ2t)2 Id 1 α(t)2γ2+σ2t Id ,

2 x log qα t (x|y) d R2 γ4 Id 1 γ2 Id g(t)2

where R = R/{2 max(w, 1 w)} R. Therefore, we can obtain tighter expressions of ζp and ζq in Lemma 15 by replacing R by R.

We provide below a result, that combines formal versions of Theorem 1 and Theorem 2.

Proposition 17. Assume A0. Denote by g 1 the inverse function of the denoising schedule g.

If d R2 > τ 2, define

d R2 τ 2 and tq = g 1 σ

otherwise, define tp = Tgen and tq = 0. Then,

(a) for any y Rd, qα t ( |y) is strongly log-concave for t (tq, Tgen) and gets more log-concave as t increases,

(b) pα t is strongly log-concave for t (0, tp).

Proof. Consider the expressions of tp and tq given above. Note that they are well defined in the case where d R2 > τ 2 since g is a bijection from [0, Tgen) to R+. We begin with the proof of the result (a). Let t (0, Tgen). Following Lemma 15-(27), for any y Rd, qα t ( |y) is strongly log concave if

ζq(t) < 0 g(t)2 > σ2(d R2 τ 2)

τ 4 t > tq .

Moreover, qα t gets more log-concave as t increases as ζq is a strictly decreasing function, see Lemma 15. We now give the proof of the result (b). Let t (0, Tgen). Following Lemma 15-(26), pα t is strongly log concave if

ζp(t) < 0 α(t)2d R2

α(t)2τ 2 + σ2t < 1 g(t)2(d R2 τ 2) < σ2 t < tp .

Hence, this result shows that the condition d R2 > τ 2 is restrictive on the log-concavity of the marginal distribution and the posterior of the localization model. In terms of Gaussian mixtures, this condition can be interpreted as having the distance between the modes that is larger than the variance of the modes. We now restate Theorem 3 and give its proof.

Proposition 18. Assume A0, where d R2 < 2τ 2. Then, tq < tp, where tq and tp are defined in Proposition 17.

Proof. If d R2 τ 2, this result is directly obtained since tq = 0 and tp = Tgen. Assume that d R2 > τ 2. We have

tq < tp d R2 τ 2

τ 4 < 1 d R2 τ 2 (d R2 τ 2)2 < τ 4 d R2 < 2τ 2 ,

which gives the result.

Stochastic Localization via Iterative Posterior Sampling

Hence, Proposition 18 shows that a sweet spot for the hyper-parameter t0 in SLIPS exists under a restrictive condition on R and τ in Assumption A0, enabling the duality of log-concavity. Elaborating on the result of Lemma 16, this condition may be slightly alleviated in the case of an unbalanced Gaussian mixture as proved in Proposition 19. Proposition 19. Let a > 0, γ > 0 and w (0, 1). Consider the target distribution π defined as the mixture of the Gaussian distributions N( a 1d, γ2 Id) and N(+a 1d, γ2 Id), with respective weights w and 1 w. Assume that d R2 < 2τ 2 4 max(w, 1 w)2. Then, the duality of log-concavity is ensured to exist in SLIPS.

Proof. This result can be seen as an extension of Proposition 17 and Proposition 18, in the case where R is replaced by R/{2 max(w, 1 w)}, see Lemma 16.

F. Details on SLIPS algorithm

F.1. Details on the implementation of SLIPS

Langevin-within-Langevin initialization. Consider a timestep t0 well chosen such that pα t0 and qα t0 are both approximately log-concave. The goal of our initialization is to sample from pα t0 at lowest cost. Thanks to log-concavity of pα t0, we may consider to apply ULA to sample from this distribution. Given N 1 and a step-size λ > 0, this amounts to consider a sequence of random variables {Y (n)}N n=0, defined by the following recursion

Y (n+1) = Y (n) + λ log pα t0(Y (n)) +

2λZ(n+1) , (31)

where {Z(n)}N n=1 are independently distributed according to the standard centered Gaussian distribution. We recall that the score log pα t0 is however not tractable but can be re-expressed with (9) such that for any y Rd, we have

log pα t0(y) = {α(t0)uα t0(y) y}/σ2t0 ,

where uα t0(y) is the expectation of the posterior qα t0( |y) given in (8). Although this term is intractable too, it can be estimated by approximately sampling from qα t0 via MALA, since qα t0 is also ensured to be approximately log-concave.

Building upon this relation, at each step n {0, . . . , N 1}, we propose to approximate log pα t0(Y (n)) in the recursion (31) with a MCMC-based estimator of uα t0(Y (n)) obtained by sampling from qα t0( |Y (n)). This results in the Langevinwithin-Langevin procedure summarized in Algorithm 2. We finally highlight three main choices in the design of this algorithm: (i) we choose the Gaussian approximation Y (0) N(0, σ2t0 Id) for ULA initialization, relying on the study conducted in Appendix D.3; (ii) consequently, we set the step-size λ to be slightly smaller than the variance of this distribution, i.e., λ = σ2t0/2; (iii) the MALA-based posterior sampling is initialized with the best estimation of uα t0(Y (n)) that we have, i.e., Y (n)/α(t0), recalling that Y α t /α(t) and uα t (Y α t ) have the same localization behaviour, see Appendix B.2.

Algorithmic technicalities. We now present three algorithmic subtleties featured in SLIPS.

(a) The step-size of MALA used to sample from the posterior is adapted. Using the acceptance ratio of the Metropolis Hasting filter, we geometrically decrease (respectively increase) the step-size when the acceptance ratio is below (respectively above) a target ratio of 75%;

(b) The last MCMC samples obtained when sampling the posterior at step k will be the first samples when sampling at step k + 1. This persistent trick is motivated by the fact that the posterior is expected to change little from one iteration to the next. This behaviour is notably illustrated in Figure 3 (bottom row). (c) Lastly, as mentioned and justified in Appendix B.2, we use the estimated denoiser U α T at the final integration timestep T as an approximate sample from π rather than Y α T /α(T) to align with related work.

F.2. Ablation study

In this section, we investigate the impact of the different hyper-parameters of SLIPS. Similarly to Appendix D, we run our study by applying our algorithm on the unbalanced bimodal Gaussian mixture defined in (24) with d = 10. This section will clarify three points:

(a) The behaviour of SLIPS within the assumptions of Theorem 3;

(b) The existence of a sweet spot for t0 outside the assumptions of Theorem 3 and the impact of the estimation of Rπ;

(c) The tuning of hyper-parameters in the Langevin-within-Langevin initialization (see Algorithm 2).

Stochastic Localization via Iterative Posterior Sampling

Behaviour of SLIPS within the restrictive assumptions. In this part only, we consider the case of a mixture of two Gaussian distributions in dimension 5 given by N( (2/3)a 15, τ 2I5) and N(+(4/3)a 15, τ 2I5), where τ 2 = 0.1 and a > 0 is not fixed, with respective weights 2/3 and 1/3. By varying a, we vary R, recalling that R = (4/3)a, see Appendix C. In Figure 9, we run SLIPS with different schemes while setting t0 as the mean value between tp and tq derived in Proposition 17, in accordance with Theorem 3 (left), or derived by combining the results of Proposition 17 and Proposition 19, which is less restrictive (right). When pα t and qα t are both log-concave independently of t, i.e., when d R2/τ 2 1 for the general case (see Proposition 17) or d R2/τ 2 16/9 in the refined Gaussian mixture setting (see Lemma 16), we always set t0 = 10 2. This choice ensures that the Gaussian approximation which corresponds to the initialization of the Langevin-within-Langevin algorithm is accurate (see Figure 7 for details). This figure shows that choosing t0 according to Proposition 17 leads to accurate sampling below the threshold and degraded (or equal) performance above this threshold.

Existence of a sweet spot for duality outside of the restrictive assumptions. Given the target distribution described in (24) with d = 10, we have that R = 4/3 and τ 2 = 0.05, which means that R2d/τ 2 350. Hence, the distribution at stake does not fit the additional assumption of Theorem 3, or even the refined assumption of Proposition 19. However, Figure 10 shows that there still exists a sweet spot for t0 in this case. Moreover, we observe in Figure 11 that a poor estimation of Rπ shifts the sweet spot. This makes sense as the sweet spot likely corresponds to a SNR level. Recalling that log SNR(t) = 2 log(g(t)) + log(R2 π/(σ2d)), if σ > Rπ/

d (i.e., we overestimate Rπ) the log-SNR is shifted downwards and the resulting t0 should be larger (as we see in the right column of Figure 11); if σ < Rπ/

d (i.e., we underestimate Rπ), the resulting t0 should be smaller (as we see in the left column of Figure 11).

Choice of hyper-parameters in the Langevin-within-Langevin initialization. Figure 12 shows that only a few steps in the Langevin-within-Langevin algorithm (see Algorithm 2) are needed to reach a stationary sampling error.

0 1 2 3 4 5 6

Sliced Wasserstein Distance

Standard Geom(1,1) Geom(2,1)

General threshold (Th. 3) Full log-concavity threshold

0 1 2 3 4 5 6

Standard Geom(1,1) Geom(2,1)

Tight threshold (Prop. 19) Full log-concavity threshold

Figure 9: Sliced Wasserstein distance depending on the ratio R2d/τ 2. Left: R2d/τ 2 = 2 is the threshold of duality of log-concavity in Theorem 3 (general case). Right: R2d/τ 2 = 32/9 is the threshold of duality of log-concavity in Proposition 19 (refinement). In both cases, the grey line corresponds to the threshold where the log-concavity conditions start to be restrictive on the marginal distribution of the observation process and the corresponding posterior distribution.

Stochastic Localization via Iterative Posterior Sampling

0.20 0.25 0.30 0.35 0.40 t0

Sliced Wasserstein Distance

Weight estimation error (in %)

0.1 0.2 0.3 t0

Sliced Wasserstein Distance

Weight estimation error (in %)

Sliced Wasserstein Distance Weight estimation error

0.3 0.4 0.5 t0

Sliced Wasserstein Distance

Weight estimation error (in %)

Figure 10: Sliced Wasserstein distance and estimation error on the relative weight depending on t0 with different schemes. Left: Standard. Middle: Geom(1,1). Right: Geom(2,1).

0.05 0.06 0.07 0.08 0.09 0.10 t0

σ = 0.50 σr

0.20 0.25 0.30 0.35 0.40 t0

σ = 1.00 σr

0.5 0.6 0.7 0.8 0.9 t0

Sliced Wasserstein

σ = 1.50 σr

0.00 0.05 0.10 0.15 0.20 0.25 t0

σ = 0.50 σr

0.10 0.15 0.20 0.25 0.30 0.35 t0

σ = 1.00 σr

0.30 0.35 0.40 0.45 0.50 0.55 t0

Sliced Wasserstein

σ = 1.50 σr

0.05 0.10 0.15 0.20 0.25 0.30 t0

σ = 0.50 σr

0.25 0.30 0.35 0.40 0.45 0.50 t0

σ = 1.00 σr

0.45 0.50 0.55 0.60 0.65 0.70 t0

Sliced Wasserstein

σ = 1.50 σr

Figure 11: Sliced Wasserstein distance depending on t0 for different values of σ. We denote σr = Rπ/

10 20 30 40 50 Number of ULA warmup steps

Sliced Wasserstein

10 20 30 40 50 Number of ULA warmup steps

10 20 30 40 50 Number of ULA warmup steps

Figure 12: Sliced Wasserstein distance depending on the number of Langevin-within-Langevin steps.

Stochastic Localization via Iterative Posterior Sampling

G. Details on related work

In this appendix, we provide further details on related works, highlighting similarities and identifying the main limitations.

G.1. Reverse Diffusion Monte Carlo (Huang et al., 2024)

Given T > 0, the authors consider the denoising process (Yt)t [0,T ] that is solution to the following SDE

d Yt = {Yt + 2 log p T t(Yt)} +

2d Bt , Y0 p T , (32)

where (Bt)t 0 is a Brownian motion in Rd and ps is the intractable marginal distribution defined for any s > 0 and any y Rd by ps(y) = R

Rd N(y; e sx, (1 e 2s) Id)dπ(x). Under mild conditions on π (Cattiaux et al., 2023), (Yt)t [0,T ] is the time-reversed process of the standard Ornstein-Uhlenbeck process defined in (19) and corresponds to a Variance-Preserving diffusion model (Song et al., 2021). For any t [0, T], given X π, we therefore have

Yt = e (T t)X + p

1 e 2(T t)Z ,

where Z is distributed according to the standard centered Gaussian distribution. In this case, in a similar fashion to the SL posterior given in (8), the conditional density of X given Yt = y Rd can be defined as

qt(x|y) π(x)N(x; e T ty, (e2(T t) 1) Id), t [0, T) ,

and the denoiser function as ut(y) = R

Rd xqt(x|y)dx. Similarly to our framework, the random vector ut(Yt) can be seen as the denoiser of Yt. Following Tweedie s formula given in Lemma 4, the denoiser function is related to the score of p T t for any t [0, T) by

log p T t(y) = y 1 e 2(T t) + e (T t)

1 e 2(T t) ut(y) .

Therefore, the SDE (32) describing the denoising process is strictly equivalent to the following SDE

d Yt = e 2(T t) + 1

e 2(T t) 1Yt + 2e (T t)

1 e 2(T t) ut(Yt) dt +

2d Bt, Y0 p T . (33)

This last SDE has to be compared with the SDE describing our observation process in (6): in both cases, the drift involves the denoiser function, which is not tractable in practice. Here, the target distribution is approximated by the distribution of YT , while it is approximated by the distribution of Y α T /α(T) in our framework.

The approach of (Huang et al., 2024) to sample from the SDE (33), while handling the estimation of the denoiser ut, is close to ours. Indeed, after discretizing this SDE with a certain time discretization (tk)K k=1 of [0, T], they propose to estimate the denoiser at time tk with a Markov Chain Monte Carlo estimation. To do so, at each step tk, they approximately sample from the random posterior qtk( | Ytk) with ULA (while we use MALA), where Ytk is the k-th realization of their discretized process. Besides this, they also propose a Langevin-within-Langevin initialization to approximately sample from p T , that involves the corresponding posterior q0. Although they briefly discuss the difficulty of this initialization by considering the log-Sobolev constants of p T and q0, their analysis lacks readability and does not emphasize any crucial trade-off on T (which corresponds to our t0).

Their scheme can nonetheless be analyzed as a localization scheme under the scope of what we call duality of log-concavity , see Section 4.3. Assume that π verifies A0 and is not log-concave. Since we have q0(x|y) π(x)N(x; e T y, (e2T 1) Id), one can show that

(a) if T is large: q0 will be close to π (not log-concave), while p T will be close to N(0, Id) (log-concave),

(b) if T is small: q0 will localize as a Dirac mass (log-concave), while p T will be close to π (not log-concave).

Stochastic Localization via Iterative Posterior Sampling

Figure 13: Sampling the 8-Gaussians distribution via RDMC with different values of T.

Therefore, the setting of T is crucial in RDMC to ensure a proper initialization via Langevin-within-Langevin algorithm. However, the authors claim that setting T large would only incur computational waste and that RDMC shows insensitivity to T, aside from the case where T is too close to 011. As explained above, these claims are not realistic, since sampling from the posterior q0 becomes as hard as sampling from π when T is large. We illustrate this fundamental limitation in Figure 13.

In contrast, we pay a particular attention to point out the importance of our hyper-parameter t0 throughout our analysis, with practical and theoretical perspectives. Finally, we highlight that RDMC relies on a uniform time discretization (while we propose a SNR-adapted discretization), and do not use persistent initialization of their Markov chains in posterior sampling.

G.2. Multi-Noise Measurements sampling (Saremi & Srivastava, 2022; Saremi et al., 2023; 2024)

Given M 1, the Multi-Noise Measurements (MNM) model introduced by (Saremi & Srivastava, 2022) defines a sequence of random variables (Y m)M m=1 as

Y m = X + σZm ,

where X π, σ > 0 and (Zm)M m=1 are independently distributed according to the standard centered Gaussian distribution. This non-Markovian stochastic process can be seen as a denoising process. The rationale behind this formulation is that simulating an increasing number of equally noised measurements of a sample from π helps to obtain more information on this sample, and finally approximate it. We now explain how it can be interpreted as a non-Markovian analog to the standard localization scheme given in (1).

For any m {1, . . . , M}, the conditional density of X given the m-tuple Y 1:m = y1:m (Rd)m is defined as

qm(x|y1:m) π(x)N(x; y1:m, σ2/m Id) ,

where y1:m denotes the empirical mean of the noised measurements, i.e., y1:m = (1/m) Pm i=1 yi. By defining um(y1:m) = R

Rd xqm(x|y1:m)dx, one can derive the Bayes estimator of X given Y 1:m as the random vector um(Y 1:m) (Saremi & Srivastava, 2022). Interestingly, this denoiser shares the same rate of convergence as the denoiser of standard stochastic localization given in Proposition 8 when t = m. Indeed, (Saremi et al., 2024, Proposition 2) states that for any target distribution π, we have W2(π, πm) σ p

d/m, where πm denotes the distribution of um(Y 1:m). In other words, the MNM denoiser localizes to X with the same localization rate as the standard SL denoiser, defined in Section 2, by taking T = M.

To sample from π, the MNM approach consists in first sampling Y 1:M, and then computing the denoiser u M(Y 1:M), in the same fashion as in the SL framework. Recently, (Saremi et al., 2024) proposed to tackle the sampling of the M-tuple Y 1:M

by first simulating Y 1 and then sequentially sampling Y m given Y 1:m 1 for m {2, . . . , M} using a Monte Carlo Markov Chain method - in this case, Underdamped Langevin Algorithm (Sachs et al., 2017). At step m, the Langevin procedure then involves pointwise evaluations of the conditional score of the distribution of Y m given Y 1:m 1 (or simply the score of the distribution of Y1 when m = 1). This introduces the Once-At-a-Time (OAT) algorithm. The interest of such method lies in the fact that, under Assumption A0 that may include multi-modal distributions, the distributions to sample from are increasingly log-concave with m, see (Saremi et al., 2024, Theorem 1). Hence, the most challenging step of this approach lies in the sampling of Y 1, in the same spirit as in SLIPS.

11See (Huang et al., 2024, Appendix F.4).

Stochastic Localization via Iterative Posterior Sampling

In their work, (Saremi et al., 2024) mainly consider the case where the scores are analytically available. To handle realistic settings, they implement an IS estimator, see (Saremi et al., 2024, Section 4.2.1), but their numerical results show that its performance significantly degrades compared to setting of perfect knowledge of the score, see (Saremi et al., 2024, Appendix H). We include this approach in our numerical benchmark. Alternatively, for high-dimensional settings, they propose an estimator based on posterior sampling, see (Saremi et al., 2024, Appendix F.2.), in the same fashion as in SLIPS. However, they do not pay attention to the limitation of this approach, in particular at the initialization of their algorithm. Indeed, while (Saremi et al., 2024, Theorem 1) suggests to take σ very large to ensure that the distribution of Y 1 is approximately log-concave (i.e., taking t0 small in SLIPS), the corresponding posterior becomes unfortunately closer to π, and then hard to sample with standard MCMC methods when π is not log-concave. This fundamental constraint is well illustrated in (Saremi et al., 2024, Figure 7, left), where the posterior sampling approach is shown to systematically fail for an arbitrary choice of σ, independently of the computational budget allocated to the MCMC sampling. Therefore, the combination of OAT with posterior sampling requires a trade-off on the hyper-parameter σ, similarly to t0 in SLIPS, that reflects once again the duality of log-concavity .

H. Details on numerical experiments

H.1. The failure of MCMC methods when targeting multi-modal distributions

As we show in Figure 14, local MCMC samplers such as MALA, HMC, NUTS or ESS tend to produce Markov chains that get trapped in modes while our methodology SLIPS generates samples reaching both modes.

Chain 1 Chain 2 Chain 3

SLIPS (Standard)

SLIPS (Geom(1,1))

SLIPS (Geom(2,1))

Figure 14: Samples from the Gaussian mixture defined in Section 6 (d = 8) obtained using different algorithms. Here, we display the first two coordinates of n = 128 samples. For SLIPS, we used the values of t0 and η given in Table 5, K = 10 discretization steps and n MCMC = 16 MCMC steps. On the other hand, the standard MCMC algorithms produce 3 chains, each one being of size K n MCMC n for fairness of comparison on the number of target density evaluations; we only display the n last MCMC samples of each chain. The chains were initialized in N(0, σ2Id). For HMC, we grid-searched an optimal trajectory length in [4, 8, 12, 16] which ended up being 8 (the step-size was automatically tuned). For ESS, we used N(0, σ2 Id) as a prior.

Stochastic Localization via Iterative Posterior Sampling

Table 3: Hyper-parameter grids used for SLIPS on different targets.

MIXTURE OF GAUSSIANS AND BAYESIAN ϕ4 MODEL OTHERS

STANDARD η {5.0} {5.7, 6.1} {5.0, 5.7} t0 {0.03, 0.05, 0.1, 0.2, 0.4} {0.8, 1.0, 1.2, 1.4, 1.8} {0.1, 0.2, 0.4, 1.0, 1.2}

GEOM(1,1) η {5.0} {5.7, 6.1} {4.6, 5.0} t0 {0.03, 0.05, 0.1, 0.15, 0.25} {0.30, 0.35, 0.40, 0.45} {0.1, 0.15, 0.20}

GEOM(2,1) η {5.0} {5.7, 6.1} {4.6, 5.0} t0 {0.15, 0.20, 0.25, 0.35, 0.45} {0.40, 0.45, 0.50, 0.55} {0.30, 0.35, 0.45}

H.2. Algorithms and hyper-parameters

Using the information from Rπ and adapting the algorithms. As our definition of σ in SLIPS requires to have an estimate of Rπ (see Section 3.2), we informed other algorithms with such knowledge when possible. More precisely, we use A0 and consider the approximation R2 π = d(R2 + τ 2) where R and τ are assumed to be known. For SMC and AIS, we use a starting distribution ρ0 as the centered Gaussian with variance R2d Id; For OAT, we use σ as per (Saremi et al., 2024, Equation 4.3) by setting σ2 = R2d τ 2 (ensuring log-concavity); For RDMC, we do not include this information as the authors do not give any heuristic on T with respect to the variance of the target distribution.

Moreover, the implementations of the algorithms were slightly adjusted from their general definition. SMC and AIS use a MALA kernel (which leaves the target distribution invariant) as transition kernel. Additionally, RDMC also uses a MALA kernel for posterior sampling instead of ULA. This reduces the estimation bias and also enables automatic tuning of the Langevin step-size by leveraging the acceptance ratio (here we adapt the step-size to maintain the acceptance ratio at 75%). Still on RDMC, we drop the first 50% of the MCMC samples to ignore the warm-up period in the estimation. We also ignore the warm-up period with the same proportion in SLIPS. In SLIPS, we reuse the step-size from the MCMC sampling of qtk when sampling qtk+1 and we also initialize the chains of the later with the last samples of the chain from the former. This is an intuitive initialization when looking at the bottom row of Figure 3 as the modes of the posterior seem to be stable and qtk is expected to be close to qtk+1.

Estimating the scores in OAT. In OAT, denoting p(y) the likelihood of the measurement process Y = X + σZ with X π and Z N(0, Id), the score can be written using the following identity from (Saremi et al., 2024, Appendix F) : log p(y) = σ 1EZ q( |y;σ)[Z] where the posterior q( |y) is defined by q(z|y) π(y + σz)N(z; 0, Id). This means that

the score log p(y) can be estimated with the following IS estimator log p(y) σ 1 PN i=1 wi Zi where Zi i.i.d. N(0, Id) and wi = π(y + σZi)/ PN j=1 π(y + σZj). We reduce the variance of this estimator by applying the antithetic trick.

Selecting the hyper-parameters of the algorithms. For each algorithm, we search its hyper-parameters within a predetermined grid. The selection is based on the metrics which will be later detailed. The metrics were computed by comparing 4096 samples against true samples. We globally fixed the computational budget by setting the SDE discretization of SLIPS as K = 1024. Moreover, we define the number of MCMC steps of SLIPS, denoted by L, to be equal to 32 except with the mixture of Gaussian in high dimensions where it is equal to 48, 64 and 96 in dimensions 32, 64 and 128 respectively and in the ϕ4 experiments were it is set to 64. The selected hyper-parameters for each algorithm are summarized in Table 4. Below, we detail how the grids were built for each one.

The SMC and AIS algorithms define a sequence of annealed distributions ρk for k {0, . . . , K} from ρ0 (defined above) to ρK = π as ρk exp((1 βk) log ρ0 + βk log π)) where (βk)K k=0 is a linear schedule of size K between 0 and 1. Both algorithms used N = 4096 particles;

The RDMC algorithm has a single hyper-parameter T which we search within the grid T { log 0.99, log 0.95, log 0.9, log 0.8, log 0.7} given in (Huang et al., 2024, Appendix F.1). However, we also find that large T may not work systematically (see Figure 13). We use 16 steps of (Huang et al., 2024, Algorithm 3) with 4 MCMC chains. The chains are initialized with an IS approximation of the posterior powered by 128 particles.

Stochastic Localization via Iterative Posterior Sampling

The SDE is discretized over K steps and the expectation leading to the drift is estimated with L Langevin steps. The initial sample is distributed according to N(0, (1 exp( 2T)) Id), the best estimation of p T that we have, and the initial step-size is taken according to its variance ;

The OAT algorithm has its parameter M set as K/2 and uses as many MCMC steps per noise level as SLIPS or RDMC. This choice of K ensures that the computational complexity of OAT is on par with the other algorithms. The Langevin steps are done using the underdamped Langevin algorithm as suggested by the authors. Its step-size is searched in {0.03, 1.0} and the efficient friction is searched in {0.0625, 0.05} as recommended by the authors. These prescriptions were extracted from (Saremi et al., 2024, Appendix G). The OAT algorithm has slightly shorter grid sizes because of its prohibitive computational cost;

The SLIPS algorithm is declined in three flavors depending on the choice of the schedule α. Since the choice of t0 is sensitive to the accuracy of the estimation of Rπ (different values of Rπ will shift the log-SNR upwards or downwards), we decided to search the hyper-parameters in different areas depending on the target distribution. Those grids can be found in Table 3. The values for t0 were chosen by approximate equal log-SNR spacing in [ 3.5, 1.0] for Gaussian mixtures and Bayesian logistic regression, in [ 1.0, 0.2] for ϕ4 and [ 2.0, 0.0] for the others. The values for η were chosen to be around 5.0.

Table 4: Hyper-parameters selected for the experiments for each algorithm and target density.

TARGET RDMC OAT SLIPS STANDARD SLIPS GEOM(1,1) SLIPS GEOM(2,1)

TIME T STEP-SIZE δ FRICT. δγ η t0 η t0 η t0

8-GAUSSIANS log(0.80) 0.05 0.03 5.7 0.60 5.7 0.35 5.0 0.35 RINGS log(0.80) 0.0625 1.0 4.6 1.20 4.6 0.10 4.6 0.30 FUNNEL log(0.90) 0.05 0.03 5.0 1.00 4.6 0.30 4.6 0.40 MIXTURE (d = 8) log(0.70) 0.0625 0.03 5.0 0.40 5.0 0.25 5.0 0.45 MIXTURE (d = 16) log(0.70) 0.0625 0.03 5.0 0.20 5.0 0.15 5.0 0.35 MIXTURE (d = 32) log(0.70) 0.05 0.03 5.0 0.10 5.0 0.10 5.0 0.25 MIXTURE (d = 64) log(0.70) 0.0625 0.03 5.0 0.05 5.0 0.05 5.0 0.20 IONOSPHERE log(0.95) 0.0625 0.03 5.0 0.03 5.0 0.03 5.0 0.15 SONAR log(0.95) 0.0625 0.03 5.0 0.03 5.0 0.03 5.0 0.15 ϕ4 (b = 0) log(0.95) 0.0625 0.03 5.7 0.80 5.7 0.30 5.7 0.40 ϕ4 (b = 0.025) log(0.95) 0.0625 1.0 5.7 1.80 5.7 0.35 6.1 0.45 ϕ4 (b = 0.05) log(0.70) 0.05 1.0 6.1 1.00 5.7 0.30 5.7 0.40 ϕ4 (b = 0.075) log(0.70) 0.0625 0.03 5.7 1.80 5.7 0.35 5.7 0.40 ϕ4 (b = 0.1) log(0.90) 0.05 0.03 5.7 1.40 5.7 0.45 5.7 0.40

Stochastic Localization via Iterative Posterior Sampling

H.3. Target distributions and metrics

8 Gaussians, Rings and Funnel distributions and their respective metrics.

(a) The 8 Gaussians distribution consists of 8 equally weighted Gaussian distributions with mean mi = 10 (cos(2πi/8), sin(2πi/8)) for i {0, . . . , 7} and covariance 0.7 I2. This distribution satisfies A0 with R = 10/

2 and τ 2 = 0.7.

(b) The Rings distribution is the inverse polar reparameterization of a distribution pz which has itself a decomposition into two univariate marginals pr and pθ: pr is a mixture of 4 Gaussian distributions N(i + 1, 0.152) with i {0, . . . , 3} describing the radial position and pθ is a uniform distribution over [0, 2π], which describes the angular position of the samples. This distribution satisfies A0 with R = 4/

2 and τ = 0.15.

(c) The Funnel distribution (Neal, 2003) has its density given by x1:10 7 N(x1; 0, σ2I10)N(x2:10; 0, exp(x1)I10) where σ2 = 9. We approximate R by the maximum scalar standard deviation of the distribution, i.e., R = 2.12, and set τ = 0.

For 8 Gaussians and Rings, we evaluate the quality of sampling by approximating the entropy-regularized 2-Wasserstein distance defined in Appendix A, with regularization ε = 0.05 via POT library (Flamary et al., 2021). Regarding the Funnel, we evaluate the quality of sampling using the Kolmogorov-Smirnov distance, see Appendix A, and adopt the sliced version from (Grenioux et al., 2023b, Appendix D.1).

Bayesian Logistic Regression models. Consider a training dataset D = {(xj, yj)}M j=1 where xj Rd and yj {0, 1} for all j {1, . . . , M}. We evaluate the likelihood of a pair (x, y) as given by p(y|x; w, b) = Bernoulli(y; σ(x T x + b)) where w Rd is a weight vector, b R is an intercept and σ is the sigmoid function. Given a prior distribution p(w, b), we sample from the posterior distribution p(w, b|D) p(D|w, b)p(w, b) = QM j=1 p(yj|xj; w, b)p(w, b). The prior is built as p(w, b) = N(w; 0, Id)N(b; 0, (2.5)2). In our experiments, we approximate R by the maximum scalar standard deviation of the prior distribution, i.e., R = 2.5/

d + 1 and set τ = 0. The quality of the samples are obtained by computing the mean predictive log-likelihood (i.e., computing p(w, b|Dtest) with Dtest a separate test dataset).

High dimension Gaussian mixture. We consider the mixture of two Gaussian distributions defined in (24). Following the computations of Appendix C, we set R = 4/3 and τ 2 = 0.05. Complementary to the estimation error on the relative weight given in the main paper, we also report the Sliced Wasserstein distance (Bonneel et al., 2015; Nadjahi et al., 2019) on Figure 15 (left). This figure shows that SLIPS also recovers the local structure of the target distribution quite well. Moreover, Figure 15 (right) shows that SMC and AIS are unable to scale with dimension and collapse into a single mode.

8 16 32 64 Dimension

Sliced Wasserstein Distance

AIS OAT RDMC

SMC SLIPS Geom(1,1)

SLIPS Geom(2,1) SLIPS Standard

8 16 32 64 Dimension

0 10 20 30 40 50 60 Mode weight estimation error (in %)

AIS OAT RDMC

SMC SLIPS Geom(1,1)

SLIPS Geom(2,1) SLIPS Standard

Figure 15: Left: Sliced Wasserstein distance computed on bimodal Gaussian mixtures with increasing dimension. Right: Estimation error when computing the relative weight of the corresponding two modes.

Stochastic Localization via Iterative Posterior Sampling

0.0 0.2 0.4 0.6 0.8 1.0 s

Figure 16: Modes of the ϕ4 distribution (ϕh + in shades of red and ϕh in shades of blue) for different values of h (corresponding to the transparency levels). The time interval [0, 1] on the x-axis is discretized into a grid of size d = 100, which corresponds to the dimensionality of the samples.

The ϕ4 distribution. The ϕ4 model is a toy model defined as a continuous relaxation of the Ising model that serves the study of phase transitions in statistical mechanics. Following (Gabri e et al., 2022), we consider a version of the model discretized on a 1-dimensional grid of size d = 100. One configuration is therefore a d-dimensional vector (ϕi)d i=1. We additionally clip the field to 0 at both extremities by defining the extra ϕ0 = ϕd+1 = 0. The negative log-density of the distribution writes

ln πh(ϕ) = β

i=1 (ϕi ϕi 1)2 + 1 4ad

i=1 (1 ϕ2 i )2 + hϕi

We chose parameter values for which the system is bimodal, a = 0.1 and inverse temperature β = 20, and vary the value of h. We denote by w+ the statistical occurrence of configurations such that ϕd/2 > 0 and w the statistical occurrence of configurations such that ϕd/2 < 0. At h = 0, the measure is invariant under the symmetry ϕ ϕ, such that we expect w+ = w . For h > 0, the negative mode dominates. We plot the two modes on Figure 16.

When d is large, the relative probability of the modes can be estimated by a Laplace approximations at 0-th and 2-nd order. Denoting by ϕh + and ϕh the local maxima of (34), these approximations yield respectively,

w w+ πh(ϕh ) πh(ϕh +) , w w+ πh(ϕh ) | det Hh(ϕh )| 1/2

πh(ϕh +) | det Hh(ϕh +)| 1/2 ,

where Hh is the Hessian of the function ϕ ln πh(ϕ). In our experiments, we considered R = 4.5 and τ = 10 2 by running MALA chains started in each mode : τ 2 is set as the variance within the modes while R corresponds to the distance between the modes and 0100. In this setting, we observed that SMC and AIS suffered from mode collapse, while OAT and RDMC produced degenerate samples; in contrast, our methodology SLIPS recovers accurate samples from the target distribution, with correct relative weight, as shown in Figure 5.

H.4. Empirical complexity of SLIPS

The results of Figure 17 (d = 32) show that under a much lower computational budget (K = 20 here) than in Section 6 (K = 1024), SLIPS maintains the same performance, while still outperforming its competitors. Moreover, these results demonstrate that SLIPS has very reasonable execution times (below 1 minute to obtain 8192 samples) making it completely practical. Lastly, Figure 17 emphasizes that, even using high computational budgets, competitors cannot solve such multimodal tasks in high-dimension. Note that those observations stay valid in problems with lower dimension (see Figure 18 where d = 4).

Stochastic Localization via Iterative Posterior Sampling

0 5000 10000 15000 20000 25000 30000 35000 Number of target evaluation

Mode weight estimation error (in %)

0 50 100 150 200 250 Mean computation time (in s)

AIS OAT RDMC SMC

SLIPS Geom(1,1) SLIPS Geom(2,1) SLIPS Standard

0 5000 10000 15000 20000 25000 30000 35000 Number of target evaluation

Sliced Wasserstein Distance

0 50 100 150 200 250 Mean computation time (in s)

AIS OAT RDMC SMC

SLIPS Geom(1,1) SLIPS Geom(2,1) SLIPS Standard

Figure 17: Metrics when sampling from the Gaussian mixture defined in Section 6 (d = 32) using different algorithms. Top: Weight estimation error. Bottom: Sliced Wasserstein Distance. Left: Metric depending on the number of target density evaluations. Right: Metric depending on wall time. The computational budgets are computed by evolving K linearly from 20 to 90. The number of MCMC steps is fixed at 32. The computations were run on the same Nvidia V100 GPU.

1000 1250 1500 1750 2000 2250 2500 2750 Number of target evaluation

Mode weight estimation error (in %)

AIS OAT RDMC SMC

SLIPS Geom(1,1) SLIPS Geom(2,1) SLIPS Standard

2 3 4 5 6 7 8 Mean computation time (in s)

1000 1250 1500 1750 2000 2250 2500 2750 Number of target evaluation

Sliced Wasserstein Distance

2 3 4 5 6 7 8 Mean computation time (in s)

AIS OAT RDMC SMC

SLIPS Geom(1,1) SLIPS Geom(2,1) SLIPS Standard

Figure 18: Metrics when sampling from the Gaussian mixture defined in Section 6 (d = 4) using different algorithms. Top: Weight estimation error. Bottom: Sliced Wasserstein Distance. Left: Metric depending on the number of target density evaluations. Right: Metric depending on wall time. The computational budgets are computed by evolving K linearly from 20 to 90. The number of MCMC steps is fixed at 16. The computations were run on the same Nvidia V100 GPU. The computational budgets were aligned to match SLIPS s.