# Chain of Log-Concave Markov Chains

Published as a conference paper at ICLR 2024

Saeed Saremi¹, Ji Won Park¹, Francis Bach²

¹Frontier Research, Prescient Design, Genentech, South San Francisco, CA
²Inria, École Normale Supérieure, Université PSL, Paris, France

We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove that one can decompose sampling from a density (with minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via the accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyvärinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to tunnel between modes of a distribution.

1 INTRODUCTION

Markov chain Monte Carlo (MCMC) is an important class of general-purpose algorithms for sampling from an unnormalized probability density of the form $p(x) = e^{-f(x)}/Z$ in $\mathbb{R}^d$. This is a fundamental problem that appears in a variety of fields, e.g., statistical physics going back to 1953 (Metropolis et al., 1953), Bayesian inference (Neal, 1995), and molecular dynamics simulations (Leimkuhler & Matthews, 2015). The biggest challenge facing MCMC is that the distributions of interest lie in very high dimensions and are far from being log-concave: the probability mass is concentrated in small pockets separated by vast empty spaces, and these large regions with small probability mass make navigating the space using Markov chains very slow. The second important challenge facing MCMC is that the log-concave pockets themselves are typically ill-conditioned and highly elongated, spanning different directions for different pockets, which only adds to the complexity of sampling.

The framework we develop in this paper aims at addressing these problems. The general philosophy here is that of smoothing, by which we expand the space from $\mathbb{R}^d$ to $\mathbb{R}^{md}$ for some integer $m$ and fill up the empty space iteratively with probability mass in an approximately isotropic manner, the degree of which we can control using a single smoothing (noise) hyperparameter $\sigma$. The map from (noisy samples in) $\mathbb{R}^{md}$ back to (clean samples in) $\mathbb{R}^d$ is based on the empirical Bayes formalism (Robbins, 1956; Miyasawa, 1961; Saremi & Hyvärinen, 2019; Saremi & Srivastava, 2022). In essence, a single jump using the empirical Bayes estimator removes the masses that were created during sampling. We prove a general result that, for any large $m$, the problem of sampling in $\mathbb{R}^d$ can be reduced to sampling from a sequence of log-concave densities: once log-concave, always log-concave. The trade-off here is the linear time cost of accumulating noisy measurements over $m$ iterations.

More formally, instead of sampling from $p(x)$, we sample from the density $p(y_{1:m})$ that is associated with $Y_{1:m} := (Y_1, \dots, Y_m)$, where $Y_t = X + N_t$, $t \in [m]$, $N_t \perp X$, and $N_t \sim \mathcal{N}(0, \sigma^2 I)$, all independent. As we show in the paper, there is a duality between sampling from $p(x)$ and sampling from $p(y_{1:m})$ in the regime where $m^{-1/2}\sigma$ is small, irrespective of how large $\sigma$ is. This is related to the notion of universality class underlying the smoothed densities. Crucial to our formalism is keeping track of the history of all the noisy samples generated along the way, using the factorization

$p(y_{1:m}) = p(y_1) \prod_{t=2}^{m} p(y_t \mid y_{1:t-1}).$  (1.1)

Figure 1: Chain of log-concave Markov chains. Here, $(y_t^{(i)})_{i \in [n_t]}$ are samples from a Markov chain, which is used to generate independent draws from $p(y_t \mid y_{1:t-1})$ for $t \in [m]$. The blue arrows indicate the non-Markov aspect of our sampling scheme: the accumulation of noisy measurements. The wiggly arrows indicate the denoising "jumps". In this example, $p(y_t \mid y_{1:t-1})$ is log-concave for all $t$, but the jumps asymptotically sample the target density (a mixture of two Gaussians) as $t$ increases.

An important element of this sampling scheme is therefore non-Markovian. However, related to our universality results, this history only needs to be tracked in the form of an empirical mean, so the memory footprint is minimal from an algorithmic perspective. See Fig. 1 for a schematic.
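To make the bookkeeping concrete, the following is a minimal NumPy sketch of the measurement model and of the running-mean update; the variable names are ours, and the fixed point `x` merely stands in for a single draw from $p(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 2, 1000, 1.0

# A fixed "clean" point standing in for a single draw X ~ p(x) (illustration only).
x = rng.normal(size=d)

y_bar = np.zeros(d)                # running empirical mean: the only history kept
for t in range(1, m + 1):
    y_t = x + sigma * rng.normal(size=d)   # Y_t = X + N_t,  N_t ~ N(0, sigma^2 I)
    y_bar += (y_t - y_bar) / t             # same update as line 11 of Algorithm 1

# y_bar equals X plus effective noise of scale m^{-1/2} sigma
print(np.linalg.norm(y_bar - x), sigma / np.sqrt(m))
```

The running mean is all the "history" the algorithm stores; its effective noise scale $m^{-1/2}\sigma$ is the quantity that indexes the universality classes introduced in Sec. 2.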
A more technical summary of our contributions and the outline of the paper are as follows:

- In Sec. 2, we prove universality results underlying the smoothed densities $p(y_{1:m})$.
- We study anisotropic Gaussians in Sec. 3, proving a negative result regarding the condition number of $p(y_{1:m})$ in comparison to $p(y_1)$ in the same universality class. This analysis becomes a segue to our factorization (1.1), where in remarkable contrast we show that the condition number monotonically improves upon accumulation of measurements.
- Sec. 4 is at the heart of the paper, where we prove several results culminating in Theorem 1, which shows that a broad class of sampling problems can be transformed into a sequence of sampling from strongly log-concave distributions using our measurement accumulation scheme. (This is a feasibility result; in particular, we do not prove here that the log-concave sampling strategy is optimal.) We examine the theorem by algebraically studying an example of a mixture of Gaussians in detail. In Sec. 4, we also outline our general sampling algorithm.
- We validate our algorithm on carefully designed test densities in Sec. 5. In particular, our algorithm results in a lower 2-Wasserstein distance compared to sampling from $p(x)$ using Langevin MCMC (without any smoothing). We also qualitatively report the capacity of our log-concave sampling scheme to tunnel to a mode of a distribution with a small probability mass in a small number of steps when it is initialized at a mode with much higher mass.

1.1 RELATED WORK

Our solution, sketched above, has its roots in walk-jump sampling (Saremi & Hyvärinen, 2019) and its recent generalization (Saremi & Srivastava, 2022). Both papers were framed within the context of generative modeling, i.e., sampling from an unknown distribution from which one has access to independent samples. In contrast, this work lays the theoretical foundation for the fundamental problem of sampling from an unnormalized density when there are no samples available. In addition, regarding the recent development, we show analytically that the intuition expressed by Saremi & Srivastava (2022) regarding the distribution $p(y_{1:m})$ being well-conditioned is not correct. This nontrivial negative result motivates our analysis of the non-Markovian scheme for sampling $p(y_{1:m})$.
Our methodology is agnostic to the algorithm used for sampling from $p(y_t \mid y_{1:t-1})$ in (1.1). However, we have been particularly motivated by the research on Langevin MCMC, a class of gradient-based sampling algorithms obtained by discretizing the Langevin diffusion (Parisi, 1981). There is a growing body of work on the analysis of Langevin MCMC algorithms of various complexity (overdamped, Metropolis-adjusted, underdamped, higher-order) for sampling from log-concave distributions (Dalalyan, 2017; Durmus & Moulines, 2017; Cheng et al., 2018; Dwivedi et al., 2018; Shen & Lee, 2019; Cao et al., 2021; Mou et al., 2021; Li et al., 2022).

There is a significant body of work on sequential methods for sampling, rooted in annealing methods in optimization (Kirkpatrick et al., 1983), which became popular in the MCMC literature due to Neal's seminal paper on annealed importance sampling (Neal, 2001). Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a related class of sequential methods for generative modeling. Our sequential scheme is distinguished from earlier methods from several angles: (I) Although we sample the conditional densities with Markov chains, we condition on all the previous samples that were generated; as a whole, our scheme is strictly non-Markovian. (II) In our sequential scheme, we are able to guarantee that we sample from (progressively more) log-concave densities. To our knowledge, no other sampling framework can make such guarantees. (III) Compared to diffusion models, the noise level in our framework is held fixed. This is an important feature of our sampling algorithm and it underlies many of its theoretical properties. (IV) All prior sequential schemes rely on a noising/annealing schedule which is hard to tune, and their performance is sensitive to the choice of the schedule (Karras et al., 2022; Syed et al., 2022). In contrast, our sequential scheme is free of scheduling and relies on only two parameters: the noise level $\sigma$ and the number of measurements $m$.¹

¹Our method can be viewed as a discretization scheme for the stochastic localization method, which we plan to formalize in future research.

Notation. We use $p$ to denote probability density functions and adopt the convention where we drop the random-variable subscript to $p$ when the arguments are present, e.g., $p(x) := p_X(x)$, $p(y_2|y_1) := p_{Y_2|Y_1=y_1}(y_2)$. We reserve $f$ for the energy function associated with $p(x) \propto e^{-f(x)}$. We use $\lambda$ to denote the spectrum of a matrix, e.g., $\lambda_{\max}(C)$ is the largest eigenvalue of $C$. We use the shorthand notations $[m] = \{1, \dots, m\}$, $y_{1:m} = (y_1, \dots, y_m)$, and $\bar y_{1:m} = \frac{1}{m}\sum_{t=1}^m y_t$.

2 UNIVERSAL (σ, m)-DENSITIES

Consider the multimeasurement (factorial kernel) generalization by Saremi & Srivastava (2022) of the kernel density, for $m$ isotropic Gaussian kernels with equal noise level (kernel bandwidth) $\sigma$:

$p(y_{1:m}) \propto \int_{\mathbb{R}^d} e^{-f(x)} \exp\Big({-\frac{1}{2\sigma^2}\sum_{t=1}^m \|x - y_t\|^2}\Big)\, dx.$  (2.1)

We refer to $p(y_{1:m})$ as the (σ, m)-density. Equivalently, $Y_t \mid x \overset{\text{iid}}{\sim} \mathcal{N}(x, \sigma^2 I)$, $t \in [m]$. Clearly, $p(y_{1:m})$ is permutation invariant: $p(y_1, \dots, y_m) = p(y_{\pi(1)}, \dots, y_{\pi(m)})$, where $\pi : [m] \to [m]$ is a permutation of the $m$ measurements. We set the stage for the remainder of the paper with a calculation that shows the permutation invariance takes the following form (see Appendix A):

$\log p(y_{1:m}) = \varphi(\bar y_{1:m};\, m^{-1/2}\sigma) + \frac{m\|\bar y_{1:m}\|^2 - \sum_{t=1}^m \|y_t\|^2}{2\sigma^2} + \mathrm{cst},$  (2.2)

where

$\varphi(y; \sigma) := \log \int e^{-f(x)} \exp\Big({-\frac{1}{2\sigma^2}\|x - y\|^2}\Big)\, dx.$  (2.3)

The calculation is straightforward by grouping the sums of squares in (2.1):

$\sum_{t=1}^m \|x - y_t\|^2 = m\|x - \bar y_{1:m}\|^2 - m\|\bar y_{1:m}\|^2 + \sum_{t=1}^m \|y_t\|^2.$
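Since this grouping identity does all the work in (2.2), a quick numerical check may be useful; a sketch with arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 5
x = rng.normal(size=d)
ys = rng.normal(size=(m, d))
y_bar = ys.mean(axis=0)

lhs = np.sum((x - ys) ** 2)                               # sum_t ||x - y_t||^2
rhs = m * np.sum((x - y_bar) ** 2) - m * np.sum(y_bar ** 2) + np.sum(ys ** 2)
assert np.isclose(lhs, rhs)   # only the ||x - y_bar||^2 term couples x to the data
```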
In addition, the Bayes estimator of $X$ given $Y_{1:m} = y_{1:m}$ simplifies as follows (see Appendix A):

$\mathbb{E}[X|y_{1:m}] = \bar y_{1:m} + m^{-1}\sigma^2\, \nabla\varphi(\bar y_{1:m};\, m^{-1/2}\sigma).$  (2.4)

These calculations bring out a notion of universality class associated with $p(y_{1:m})$, formalized by the following definition and proposition.

Definition 1 (Universality Class). We define the universality class $[\bar\sigma]$ as the set of densities $p(y_{1:m})$, in the family of (σ, m)-densities, such that for all $(\sigma, m) \in [\bar\sigma]$ the following holds: $m^{-1/2}\sigma = \bar\sigma$.

Proposition 1. If $Y_{1:m} \sim p(y_{1:m})$, let $\hat p_{\sigma,m}$ be the distribution of $\mathbb{E}[X|Y_{1:m}]$, and define $\hat p_\sigma = \hat p_{\sigma,1}$. Then $\hat p_{\sigma,m} = \hat p_{m^{-1/2}\sigma}$. In other words, $\hat p_{\sigma,m}$ is identical for densities in the same universality class.

Proof. We are given $X \sim e^{-f(x)}$, $Y_t = X + \varepsilon_t$, $\varepsilon_t \sim \mathcal{N}(0, \sigma^2 I)$ independently for $t \in [m]$. It follows that $\bar Y_{1:m} = X + \bar\varepsilon$, where $\bar\varepsilon \sim \mathcal{N}(0, \bar\sigma^2 I)$ with $\bar\sigma^2 = m^{-1}\sigma^2$. Using (2.4), $\mathbb{E}[X|Y_{1:m}]$ is distributed as $X + \bar\varepsilon + \bar\sigma^2\, \nabla\varphi(X + \bar\varepsilon;\, \bar\sigma)$, which is identical for all densities $p(y_{1:m})$ in $[\bar\sigma]$.

2.1 DISTRIBUTION OF E[X|y_{1:m}] VS. p_X: UPPER BOUND ON THE 2-WASSERSTEIN DISTANCE

Our goal is to obtain samples from $p_X$, but in walk-jump sampling the samples are given by $\mathbb{E}[X|y_{1:m}]$, where $y_{1:m} \sim p(y_{1:m})$ (Saremi & Hyvärinen, 2019; Saremi & Srivastava, 2022). Next, we address how far $\hat p_{\sigma,m}$ is from the density of interest $p_X$.

Proposition 2. The squared 2-Wasserstein distance between $p_X$ and $\hat p_{\sigma,m}$ is bounded by

$W_2(p_X, \hat p_{\sigma,m})^2 \le \frac{\sigma^2 d}{m}.$

The proof is given in Appendix B. As expected, the upper bound is expressed in terms of $\bar\sigma^2 = \sigma^2/m$. A close inspection of the proof shows that the bound above is loose, as it follows from the rate obtained by replacing the empirical Bayes estimator $\mathbb{E}[X|y_{1:m}]$ with the empirical mean $\bar Y_{1:m}$. Note, however, that when the prior $p(x)$ is strong (e.g., low entropy), the dependence on $\sigma^2/m$ can be significantly improved.

3 THE GEOMETRY OF (σ, m)-DENSITIES

In this section we analyze the problem of sampling from $p(y_{1:m})$ where we take $p(x)$ to be an anisotropic Gaussian, $X \sim \mathcal{N}(0, C)$, with a diagonal covariance matrix:

$C = \mathrm{diag}(\tau_1^2, \dots, \tau_d^2).$  (3.1)

The density $p_X$ is strongly log-concave with the property $\tau_{\max}^{-2} I \preceq \nabla^2 f(x) \preceq \tau_{\min}^{-2} I$; its condition number is therefore $\kappa = \tau_{\max}^2 / \tau_{\min}^2$. Log-concave densities with $\kappa \gg 1$ are considered ill-conditioned. Since $Y_1 \sim \mathcal{N}(0, C + \sigma^2 I)$, the condition number of the (single-measurement) smoothed density, which we denote by $\kappa_{\sigma,1}$, is given by:

$\kappa_{\sigma,1} = \frac{1 + \sigma^{-2}\tau_{\max}^2}{1 + \sigma^{-2}\tau_{\min}^2}.$  (3.2)

Next, we give the full spectrum of the precision matrix associated with (σ, m)-densities.

Proposition 3. Consider an anisotropic Gaussian density $X \sim \mathcal{N}(0, C)$ in $\mathbb{R}^d$, where $C_{ij} = \tau_i^2 \delta_{ij}$. Then the (σ, m)-density is a centered Gaussian in $\mathbb{R}^{md}$: $Y_{1:m} \sim \mathcal{N}(0, F_{\sigma,m}^{-1})$. For $m \ge 2$, the precision matrix $F_{\sigma,m}$ is block diagonal with $d$ blocks (indexed by $i$) of size $m \times m$, each with the following spectrum: (i) there are $m - 1$ degenerate eigenvalues equal to $\sigma^{-2}$; (ii) the remaining eigenvalue equals $(\sigma^2 + m\tau_i^2)^{-1}$. The condition number $\kappa_{\sigma,m}$ associated with the (σ, m)-density is given by:

$\kappa_{\sigma,m} = \frac{\lambda_{\max}(F_{\sigma,m})}{\lambda_{\min}(F_{\sigma,m})} = 1 + m\sigma^{-2}\tau_{\max}^2.$
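Proposition 3 is easy to probe numerically by building a single $m \times m$ block of the precision matrix and inspecting its spectrum; a sketch, with illustrative values of $\sigma$ and $\tau$:

```python
import numpy as np

sigma, m = 1.0, 8
taus = [0.1, 1.0]                                   # tau_min, tau_max (illustrative)

for tau in taus:
    A = 1.0 / (m + sigma**2 / tau**2)               # A_{mi} from (C.2)
    block = (np.eye(m) - A * np.ones((m, m))) / sigma**2   # one block of F_{sigma,m}
    eig = np.sort(np.linalg.eigvalsh(block))
    # m - 1 eigenvalues equal sigma^{-2}; the remaining one is (sigma^2 + m tau^2)^{-1}
    assert np.isclose(eig[0], 1 / (sigma**2 + m * tau**2))
    assert np.allclose(eig[1:], 1 / sigma**2)

print(1 + m * max(taus)**2 / sigma**2)              # kappa_{sigma,m} from Proposition 3
```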
Remark 1 (The curse of sampling all measurements at once). The above proposition is a negative result regarding sampling from $p(y_{1:m})$ if all $m$ measurements $y_{1:m}$ are sampled in parallel (at the same time). This is because $m\sigma^{-2} = \bar\sigma^{-2}$ remains constant for all $(\sigma, m) \in [\bar\sigma]$; even worse, the condition number $\kappa_{\sigma,m}$ is strictly greater than $\kappa_{\bar\sigma,1}$ for $m > 1$.

This negative result regarding the sampling scheme of Saremi & Srivastava (2022), which we call joint multimeasurement sampling (JMS), leads to our investigation below into sampling from $p(y_{1:m})$ sequentially using the factorization (1.1), which we call sequential multimeasurement sampling (SMS). Now, we perform the analysis in Proposition 3 for the spectrum of the conditional densities in (1.1).

Proposition 4. Assume $X \sim \mathcal{N}(0, C)$ is the anisotropic Gaussian of Proposition 3. Given the factorization of $p(y_{1:m})$ in (1.1), for $t > 1$ the conditional density $p(y_t|y_{1:t-1})$ is a Gaussian with a shifted mean and a diagonal covariance matrix:

$-2\sigma^2 \log p(y_t|y_{1:t-1}) = \sum_{i=1}^d (1 - A_{ti})\Big(y_{ti} - \frac{A_{ti}}{1 - A_{ti}}\sum_{k=1}^{t-1} y_{ki}\Big)^2 + \mathrm{cst},$

where $A_{ti}$ is short for $A_{ti} = (t + \sigma^2\tau_i^{-2})^{-1}$. The precision matrix associated with $p(y_t|y_{1:t-1})$, denoted by $F_{t|1:t-1}$, has the following spectrum:

$\sigma^2 \lambda_i(F_{t|1:t-1}) = 1 - (t + \sigma^2\tau_i^{-2})^{-1},$

with the following condition number:

$\kappa_{t|1:t-1} = \frac{1 - (t + \sigma^2\tau_{\min}^{-2})^{-1}}{1 - (t + \sigma^2\tau_{\max}^{-2})^{-1}}.$

Lastly, the condition number $\kappa_{t|1:t-1}$ is monotonically decreasing as $t$ increases (for any $m > 1$):

$1 < \kappa_{m|1:m-1} < \cdots < \kappa_{3|1:2} < \kappa_{2|1} < \kappa_1,$  (3.3)

where $\kappa_1 := \kappa_{\sigma,1}$ is given by (3.2).

The proofs of Proposition 3 and Proposition 4 are given in Appendix C. These two propositions stand in clear contrast to each other: in the SMS setting of Proposition 4, sampling becomes easier with increasing $t$ as one accumulates the measurements $y_{1:t}$ sequentially, and in addition $\kappa_1$ can itself be decreased by increasing $\sigma$. Next, we analyze the SMS scheme in more general settings.

4 CHAIN OF LOG-CONCAVE MARKOV CHAINS

Can we devise a sampling scheme where we are guaranteed to always sample log-concave densities? This section is devoted to several results in that direction. We start with the following two lemmas.

Lemma 1. Assume that for all $x \in \mathbb{R}^d$, $\nabla^2 f(x) \preceq LI$ and $\|\nabla f(x)\| \ge \mu\|x - x_0\|$ for some $x_0$. Then, for all $y \in \mathbb{R}^d$:

$\nabla^2(\log p)(y) \preceq \frac{1}{\sigma^2}\Big({-1} + \frac{3Ld}{\mu^2\sigma^2} + \frac{3\|x_0 - y\|^2}{\mu^2\sigma^6}\Big)\, I.$

The proof is given in Appendix D.

Lemma 2. Consider the density $p(x)$ associated with the random variable $X$ in $\mathbb{R}^d$ and the (σ, m)-density given by (2.1). Then, in expectation, for any $m \ge 1$ the conditional densities become more log-concave upon accumulation of measurements:²

$\mathbb{E}_{y_1} \nabla^2_{y_1} \log p(y_1) \;\succeq\; \mathbb{E}_{y_{1:2}} \nabla^2_{y_2} \log p(y_2|y_1) \;\succeq\; \cdots \;\succeq\; \mathbb{E}_{y_{1:m}} \nabla^2_{y_m} \log p(y_m|y_{1:m-1}).$

Proof. The full proof of the lemma is given in Appendix D, where we derive the following identity:

$\nabla^2_{y_m} \log p(y_m|y_{1:m-1}) = -\sigma^{-2} I + \sigma^{-4}\, \mathrm{cov}(X|y_{1:m}).$

The proof follows through since, by the law of total covariance, the mean of the posterior covariance $\mathbb{E}_{y_{1:m}}\mathrm{cov}(X|y_{1:m})$ can only go down upon accumulation of measurements.

²Note that here no assumption is made on the smoothness of $p(x)$.

These two lemmas paint an intuitive picture that we expand on in the remainder of this section: (i) by increasing $\sigma$ we can transform a density into a strongly log-concave one (Lemma 1), from which we can sample our first measurement, and (ii) by accumulation of measurements we expect sampling to become easier, which Lemma 2 formalizes by showing that on average the conditional densities become more log-concave as one conditions on previous measurements. Next, we generalize these results with our main theorem, followed by an example on a mixture of Gaussians.
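For the anisotropic Gaussian of Sec. 3, the identity in the proof of Lemma 2 can be checked in closed form, since the posterior covariance $\mathrm{cov}(X|y_{1:t}) = (C^{-1} + t\sigma^{-2}I)^{-1}$ does not depend on the realized measurements; a sketch with illustrative values:

```python
import numpy as np

sigma = 2.0
taus = np.array([0.5, 1.0, 2.0])       # diag(C) = taus**2 (illustrative values)

prev = np.full_like(taus, np.inf)
for t in range(1, 6):
    post_var = 1.0 / (1.0 / taus**2 + t / sigma**2)   # cov(X | y_{1:t}), per coordinate
    hess = -1.0 / sigma**2 + post_var / sigma**4      # conditional Hessian eigenvalues
    assert np.all(hess < prev)                        # more log-concave at every step
    prev = hess
    print(t, hess)                                    # decreases toward -1/sigma^2
```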
Figure 2: (a, b) The negative conditional Hessian $-\nabla^2_{y_m} \log p(y_m|y_{1:m-1})$ for two values of $\sigma$, plotted as a function of $\bar y_{1:m}$ and $m$, assuming $X$ is distributed according to (4.4) in 1D, where we set $\mu = 3$, $\tau = 1$ (see (4.5)). (c) The upper bound in (4.1) is sharp for this example; $\sigma^2\zeta(m)$ is plotted vs. $m$ for different $\sigma$.

Theorem 1 (Once log-concave, always log-concave). Consider $Z$ to be a random variable in $\mathbb{R}^d$ with compact support, i.e., almost surely $\|Z\|^2 \le R^2$, and take $X = Z + N_0$, $N_0 \perp Z$, $N_0 \sim \mathcal{N}(0, \tau^2 I)$. Then, for any $m \ge 1$, the conditional Hessian is upper bounded:

$\nabla^2_{y_m} \log p(y_m|y_{1:m-1}) \preceq \zeta(m)\, I,$  (4.1)

where

$\zeta(m) = \frac{\tau^2}{\sigma^2(m\tau^2 + \sigma^2)} - \frac{1}{\sigma^2} + \frac{R^2}{(m\tau^2 + \sigma^2)^2}$  (4.2)

is a decreasing function of $m$; in particular:

$\zeta'(m) = -\frac{\tau^2(2R^2\sigma^2 + \sigma^2\tau^2 + m\tau^4)}{\sigma^2(\sigma^2 + m\tau^2)^3} \le 0.$

As a corollary, $p(y_1)$ associated with $Y_1 = X + N_1$, $N_1 \sim \mathcal{N}(0, \sigma^2 I)$ is strongly log-concave if

$\sigma^2 > R^2 - \tau^2,$  (4.3)

and it stays strongly log-concave upon accumulation of measurements.

Proof. The full proof is given in Appendix D; it is a direct consequence of the following identity, which we derive there,

$\nabla^2_{y_m} \log p(y_m|y_{1:m-1}) = \Big(\frac{\tau^2}{\sigma^2(m\tau^2 + \sigma^2)} - \frac{1}{\sigma^2}\Big) I + \frac{1}{(m\tau^2 + \sigma^2)^2}\, \mathrm{cov}(Z|y_{1:m}),$

combined with $\mathrm{cov}(Z|y_{1:m}) \preceq R^2 I$, due to our compactness assumption.

Remark 2. Theorem 1 spans a broad class of sampling problems, especially since $\tau$ can in principle be set to zero. The only property we lose in the setting $\tau = 0$ is that the upper bound $\zeta(m) I$ no longer monotonically goes down as measurements are accumulated.

4.1 EXAMPLE: MIXTURE OF TWO GAUSSIANS

In this section we examine Theorem 1 by studying the following mixture of Gaussians for $\alpha = 1/2$:

$p(x) = \alpha\, \mathcal{N}(x; \mu, \tau^2 I) + (1 - \alpha)\, \mathcal{N}(x; -\mu, \tau^2 I).$  (4.4)

This is an instance of the setup in Theorem 1, where $p(z) = \frac{1}{2}\delta(z - \mu) + \frac{1}{2}\delta(z + \mu)$ and $R^2 = \mu^\top\mu$. By differentiating (2.2) twice we arrive at the following expression for $\nabla^2_{y_m} \log p(y_m|y_{1:m-1})$:

$\nabla^2_{y_m} \log p(y_m|y_{1:m-1}) = \nabla^2_{y_m} \log p(y_{1:m}) = \sigma^{-2}(m^{-1} - 1)\, I + m^{-2} H(\bar y_{1:m};\, m^{-1/2}\sigma),$  (4.5)

where $H(y; \sigma) := \nabla^2\varphi(y; \sigma)$; see (2.3) for the definition of $\varphi$. In Appendix D we show that for the mixture of Gaussians considered here, (4.4) with $\alpha = 1/2$, we have

$H(y; \sigma) = -\frac{1}{\sigma^2 + \tau^2}\Big(I - \frac{2\mu\mu^\top}{\sigma^2 + \tau^2}\Big(1 + \cosh\Big(\frac{2\mu^\top y}{\sigma^2 + \tau^2}\Big)\Big)^{-1}\Big),$  (4.6)

which takes its maximum at $y = 0$. Using (4.5), it is then straightforward to show that (4.1), (4.2), and (4.3) all hold in this example, with the additional result that the upper bound is now tight. In Fig. 2, these calculations are visualized in 1D for $\mu = 3$, $\tau = 1$, and different values of $\sigma$; in panel (c) we also plot $1/m - 1$, which is the large-$m$ behavior of $\sigma^2\zeta(m)$. This can be seen via two different routes: (4.2) and (4.5).

Remark 3 (Monotonicity). The monotonic decrease of the upper bound in Theorem 1, together with the monotonicity result in Lemma 2, may lead one to investigate whether the stronger result

$\nabla^2_{y_1} \log p(y_1) \succeq \nabla^2_{y_2} \log p(y_2|y_1) \succeq \dots \succeq \nabla^2_{y_m} \log p(y_m|y_{1:m-1})$  (4.7)

could hold, e.g., for the mixture of Gaussians we studied here, especially since the upper bound (4.1) is sharp for this example. For (4.7) to hold, we would need $\mathrm{cov}(Z|y_{1:m}) \preceq \mathrm{cov}(Z|y_{1:m-1})$ almost surely. However, we can imagine a scenario where $y_1 + \dots + y_{m-1}$ is very large, so that $\mathrm{cov}(Z|y_{1:m-1}) \approx 0$, while $y_m$ is such that $y_1 + \dots + y_{m-1} + y_m$ is close to $m\,\mathbb{E}[Z]$, in which case $\mathrm{cov}(Z|y_{1:m})$ will be large.
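The tightness of (4.1) in this example can be verified directly: the supremum of (4.5) over $\bar y_{1:m}$, attained at $\bar y_{1:m} = 0$, coincides with $\zeta(m)$ from (4.2) with $R^2 = \mu^\top\mu$. A sketch in 1D, with $\mu$ and $\tau$ as in Fig. 2 and an illustrative $\sigma$:

```python
import numpy as np

mu, tau, sigma = 3.0, 1.0, 6.0   # 1D mixture, mu and tau as in Fig. 2

def hessian_sup(m):
    # sup over y_bar of (4.5), attained at y_bar = 0, using (4.6)
    s = sigma**2 / m + tau**2                       # (m^{-1/2} sigma)^2 + tau^2
    H0 = -1.0 / s + mu**2 / s**2                    # H(0; m^{-1/2} sigma) in 1D
    return (1.0 / m - 1.0) / sigma**2 + H0 / m**2

def zeta(m):                                        # upper bound (4.2), R^2 = mu^2
    s = m * tau**2 + sigma**2
    return tau**2 / (sigma**2 * s) - 1.0 / sigma**2 + mu**2 / s**2

for m in [1, 2, 5, 10]:
    assert np.isclose(hessian_sup(m), zeta(m))      # the bound is attained at 0
    print(m, sigma**2 * zeta(m))                    # approaches 1/m - 1 for large m
```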
4.2 ALGORITHM: NON-MARKOVIAN CHAIN OF (LOG-CONCAVE) MARKOV CHAINS

Below, we give the pseudo-code for our sampling algorithm. In the inner loop, MCMC_σ is any MCMC method, but our focus in this paper is on Langevin MCMC algorithms³ that use $\nabla_{y_t} \log p(y_t|y_{1:t-1})$ to sample the new measurement $Y_t$ conditioned on the previously sampled ones $Y_{1:t-1}$.

Algorithm 1: Sequential multimeasurement walk-jump sampling, referred to as SMS. See Fig. 1 for the schematic. A version of MCMC_σ is given in Appendix E.
1: Parameter: noise level σ
2: Input: number of measurements m, number of steps for each measurement n_t
3: Output: X̂
4: Initialize Ȳ_{1:0} = 0
5: for t = 1, …, m do
6:   Initialize Y_t^{(0)}
7:   for i = 1, …, n_t do
8:     Y_t^{(i)} = MCMC_σ(Y_t^{(i−1)}, Ȳ_{1:t−1})
9:   end for
10:  Y_t = Y_t^{(n_t)}
11:  Ȳ_{1:t} = Ȳ_{1:t−1} + (Y_t − Ȳ_{1:t−1})/t
12: end for
13: return X̂ ← E[X|Y_{1:m}] according to (2.4)

4.2.1 ESTIMATING ∇ log p(y)

So far we have assumed that we know the smoothed score function $g(y; \sigma) := \nabla(\log p)(y) = \nabla\varphi(y; \sigma)$, and in the experiments below we consider cases where $g(y; \sigma)$ is known in closed form. In general settings, we would like to estimate $g$ in terms of the unnormalized $p(x) \propto e^{-f(x)}$. Given (2.3) and (2.4), we write an expression for $g(y; \sigma)$ which can be turned into various estimators:

$g(y; \sigma) = \frac{1}{\sigma^2}\big(\mathbb{E}[\check X] - y\big), \quad \check X \sim e^{-\check f(\cdot;\, y, \sigma)}, \quad \text{where } \check f(x; y, \sigma) := f(x) + \frac{1}{2\sigma^2}\|x - y\|^2.$  (4.8)

See Appendix F, where we also give two estimators for $g$ depending on how $\mathbb{E}[\check X]$ in (4.8) is estimated: $\hat g_{\mathrm{plugin}}$ (F.1) is a plug-in estimator obtained by importance sampling; $\hat g_{\mathrm{langevin}}$ (F.2) is obtained by sampling $\check X$ using Langevin MCMC. Finally, we use (2.2) to write $\nabla_{y_t} \log p(y_t|y_{1:t-1})$ in terms of $g$:

$\nabla_{y_t} \log p(y_t|y_{1:t-1}) = \nabla_{y_t} \log p(y_{1:t}) = \frac{1}{t}\, g(\bar y_{1:t};\, t^{-1/2}\sigma) + \frac{1}{\sigma^2}(\bar y_{1:t} - y_t).$  (4.9)

The above expression is used in the inner loop of Algorithm 1 to sample from $p(y_t|y_{1:t-1})$ sequentially. We conduct experiments investigating this aspect of the problem (replacing $g$ with $\hat g$) in the appendix.

³We experimented with a variety of Langevin MCMC algorithms to sample from $p(y_t|y_{1:t-1})$ in the inner loop of Algorithm 1. The results are reported in the appendix due to space constraints. After extensive tuning, we found the algorithm by Sachs et al. (2017) to be the best performing for the test densities we considered.
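For concreteness, here is a minimal sketch of Algorithm 1 with a plain unadjusted Langevin step in the inner loop; the paper's experiments instead use the underdamped integrator of Sachs et al. (2017) (Algorithm 2), and `grad_phi`, the step size, and the initialization are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sms_sample(grad_phi, d, sigma, m, n_t, delta):
    """Sketch of Algorithm 1 (SMS) with an unadjusted Langevin inner loop.

    grad_phi(y, s) should return the smoothed score g(y; s) = grad phi(y; s).
    """
    y_bar = np.zeros(d)                          # running mean, bar Y_{1:t-1}
    for t in range(1, m + 1):
        y = y_bar + sigma * rng.normal(size=d)   # warm-ish initialization
        for _ in range(n_t):
            cur_bar = y_bar + (y - y_bar) / t    # bar y_{1:t} including current y
            # conditional score (4.9)
            score = grad_phi(cur_bar, sigma / np.sqrt(t)) / t \
                    + (cur_bar - y) / sigma**2
            y = y + delta * score + np.sqrt(2 * delta) * rng.normal(size=d)
        y_bar = y_bar + (y - y_bar) / t          # accumulate the measurement
    # jump step (2.4)
    return y_bar + sigma**2 / m * grad_phi(y_bar, sigma / np.sqrt(m))

# Example: standard Gaussian target, where g(y; s) = -y / (1 + s^2) in closed form.
x_hat = sms_sample(lambda y, s: -y / (1 + s**2), d=2, sigma=4.0, m=500, n_t=4, delta=0.1)
print(x_hat)
```

For the standard Gaussian used in the example, the smoothed score is available in closed form, which makes the sketch self-contained.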
Figure 3: (a, b) Sliced 2-Wasserstein distance vs. $d$ and (c, d) sliced 2-Wasserstein distance vs. $\sigma$ for varying $d$, for the elliptical Gaussian and Gaussian-mixture target densities using SMS.

5 EXPERIMENTS

We evaluate the performance of Algorithm 1 alongside related sampling schemes on carefully designed test densities. We compare the following sampling schemes:
- Sequential multimeasurement walk-jump sampling (SMS), Algorithm 1 with m = 1000,
- Joint multimeasurement walk-jump sampling (JMS),
- Single-measurement walk-jump sampling ("m = 1"),
- Langevin MCMC by Sachs et al. (2017) without any smoothing ("σ = 0"),
- Exact samples from $\hat p_{\sigma,m}$ ("Oracle").⁴

Metric. We use the sliced 2-Wasserstein metric (Bonneel et al., 2015; Peyré & Cuturi, 2019) to quantify the consistency of the obtained samples with the target density $p_X$. We use 1,000 projection directions drawn from the Gaussian distribution (Nadjahi et al., 2021, Eq. 9).

MCMC algorithms. For all the results in this section, we implement MCMC sampling based on underdamped Langevin diffusion (ULD). The particular algorithm used for the results shown in this section extends the BAOAB integration scheme using multiple time steps for the O-part (Sachs et al., 2017). In Appendix H, we present the full comparison across other MCMC algorithms, including other recent ULD variants (Cheng et al., 2018; Shen & Lee, 2019) as well as the Metropolis-adjusted Langevin algorithm (MALA) (Roberts & Tweedie, 1996; Dwivedi et al., 2018).

Score estimation. In Appendix I, we compare sampling with the analytic score function, the plug-in estimator of the score function given in (F.1) with varying numbers of MC samples n, and the Langevin estimator of the score function given in (F.2).

Hyperparameter search. The hyperparameters were tuned for each sampling scheme, and the total number of iterations was kept fixed in the comparisons reported here. See Appendix G for details.

5.1 ELLIPTICAL GAUSSIAN

The elliptical Gaussian features a poorly conditioned covariance: $X \sim \mathcal{N}(0, C)$, where we set $\tau_1^2 = 0.1$, $\tau_2^2 = \dots = \tau_d^2 = 1$ in (3.1). For each $d$, the noise level $\sigma$ and the other hyperparameters of the sampling algorithm, such as step size and friction, were tuned. Fig. 3(a) plots the 2-Wasserstein distance for varying $d$. Our main observation here is that SMS outperforms JMS, which is expected from our theoretical analysis in Sec. 3. In addition, as Fig. 3(c) shows, SMS was robust to the choice of $\sigma$ in our experiments, particularly for large $d$. Finally, the underdamped Langevin MCMC by Sachs et al. (2017) does quite well when the friction is carefully tuned; we would like to investigate this for larger condition numbers in future research.

⁴"Oracle" is short for the SMS oracle. For the test densities considered here we can draw exact samples from $\hat p_{\sigma,m}$, the distribution of $\mathbb{E}[X|Y_{1:m}]$. This baseline is used to separate the issue of the closeness of $\hat p_{\sigma,m}$ to $p_X$ from the problem of sampling $\hat p_{\sigma,m}$ itself.

Figure 4: Tunneling phenomenon. Trajectories of three walkers (a) under our SMS sampling scheme using the Langevin MCMC of Sachs et al. (2017) in the inner loop of Algorithm 1 and (b) under the same Langevin MCMC without any smoothing. Purple and green indicate the beginning and end of trajectories, respectively. (c, d) The final samples for 100 random trajectories (using identical seeds). Fewer samples reach the smaller mode with Langevin MCMC than with SMS. Panels: (a) SMS, 3 trajectories; (b) Langevin MCMC; (c) SMS, 100 samples; (d) Langevin, 100 samples.

5.2 MIXTURE OF GAUSSIANS

To evaluate the mixing of multiple modes, we consider the test density (4.4) with $\alpha = 1/5$, $\tau = 1$, and $\mu = 3 \cdot \mathbf{1}_d$, where $\mathbf{1}_d$ is the $d$-dimensional vector $(1, \dots, 1)^\top$. As Fig. 3(b) shows, SMS achieves a consistently low (sliced) 2-Wasserstein distance with increasing $d$, whereas the other sampling schemes deteriorate in performance. We observe, in Fig. 3(d), that SMS outperforms the best-performing underdamped Langevin MCMC (σ = 0) in our experiments for at least one $\sigma$ value for all $d$; higher $d$ requires larger $\sigma$. In addition, we would like to highlight the following: (I) Vanilla walk-jump sampling (m = 1) becomes highly ineffective as the dimension increases. This is in contrast to sampling from the anisotropic Gaussian (already log-concave) in Sec. 5.1. (II) The optimal $\sigma$ here is in fact larger than the noise level needed to make $p(y_1)$ log-concave. This is related to the benefits of sampling from better-conditioned log-concave distributions, which are well known in the literature. In Appendix J, we include results for a mixture of correlated Gaussians supporting the same conclusion.
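For reference, the sliced 2-Wasserstein metric used throughout this section can be estimated in a few lines; a sketch assuming equal sample sizes, in which case each one-dimensional distance reduces to matching sorted samples (the function name and defaults are ours):

```python
import numpy as np

def sliced_w2(xs, ys, n_proj=1000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance (a sketch)."""
    d = xs.shape[1]
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
    w2_sq = 0.0
    for u in dirs:
        a = np.sort(xs @ u)                # 1D projections, sorted
        b = np.sort(ys @ u)
        w2_sq += np.mean((a - b) ** 2)     # squared 1D W2 via sorted matching
    return np.sqrt(w2_sq / n_proj)

# Sanity check: two samples from the same Gaussian should be close.
rng = np.random.default_rng(1)
print(sliced_w2(rng.normal(size=(4096, 8)), rng.normal(size=(4096, 8))))
```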
5.3 TUNNELING PHENOMENA

Fig. 4 illustrates the trajectories of three walkers (a) under our SMS sampling scheme and (b) using Langevin MCMC. Each walker has the same random seed between (a) and (b) and was initialized at $(-3, -3)$, the dominant mode with 80% of the mass. With SMS, a walker is able to tunnel to the smaller mode fairly quickly, whereas for Langevin MCMC (without smoothing) all three walkers stay stuck around the dominant mode. In panels (c) and (d) we also show the histogram of final samples in the same setup, with initialization at $(-3, -3)$, for 100 walkers after 100K steps. In summary, we have consistently observed that the same (Langevin) MCMC algorithm, when used in the inner loop of Algorithm 1, is more effective than when used without Gaussian smoothing.

6 CONCLUSION

In this paper, we established a theoretical framework that reduces the general problem of sampling from an unnormalized distribution to that of log-concave sampling defined by a single noise parameter. We conclude with two main limitations of this work at the present time: (I) Our results do not make it clear whether the log-concave sampling strategy is optimal. The issue of optimality is challenging as it is inherently problem-dependent and additionally depends on the MCMC algorithm used in the inner loop of Algorithm 1. (II) Related to the issue of optimality is the fact that, for general sampling problems, the smoothed score functions required to sample from $p(y_t|y_{1:t-1})$, $t \in [m]$, using gradient-based methods need to be estimated. This is a complex problem and should be investigated in future research. Finally, an immediate application of the machinery we developed here is the problem of generative modeling, as the $m$ smoothed score functions needed to run Algorithm 1 can indeed be learned (approximated) using empirical Bayes least-squares denoising objectives. This approach is similar to training diffusion models; however, our sampling scheme is fundamentally different, as it relies on the accumulation of measurements, controlled by a single noise parameter, σ.

REFERENCES

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22–45, 2015.

Yu Cao, Jianfeng Lu, and Lihan Wang. Complexity of randomized algorithms for underdamped Langevin dynamics. Communications in Mathematical Sciences, 19(7):1827–1853, 2021.

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.

Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pp. 300–323, 2018.

Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 651–676, 2017.

Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pp. 793–797, 2018.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.

Scott Kirkpatrick, C. Daniel Gelatt Jr., and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

Ben Leimkuhler and Charles Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Interdisciplinary Applied Mathematics, 39:443, 2015.

Ruilin Li, Hongyuan Zha, and Molei Tao. Sqrt(d) dimension dependence of Langevin Monte Carlo. In International Conference on Learning Representations, 2022.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Koichi Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181–188, 1961.

Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(1):1919–1959, 2021.

Kimia Nadjahi, Alain Durmus, Pierre E. Jacob, Roland Badeau, and Umut Simsekli. Fast approximation of the sliced-Wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34:12411–12424, 2021.

Radford M. Neal. Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto, 1995.

Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.

Giorgio Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Herbert Robbins. An empirical Bayes approach to statistics. In Proc. Third Berkeley Symp., volume 1, pp. 157–163, 1956.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363, 1996.

Matthias Sachs, Benedict Leimkuhler, and Vincent Danos. Langevin dynamics with variable coefficients and nonconservative forces: from stationary states to numerical methods. Entropy, 19(12):647, 2017.

Saeed Saremi and Aapo Hyvärinen. Neural empirical Bayes. Journal of Machine Learning Research, 20(181):1–23, 2019.

Saeed Saremi and Rupesh Kumar Srivastava. Multimeasurement generative models. In International Conference on Learning Representations, 2022.

Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. Advances in Neural Information Processing Systems, 32, 2019.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265, 2015.
Saifuddin Syed, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet. Non-reversible parallel tempering: a scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):321–350, 2022.

Cédric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

A DERIVATIONS FOR SEC. 2

A.1 DERIVATION FOR (2.2)

In this section we derive a general expression for the (σ, m)-densities $p(y_{1:m})$, short for $p_\sigma(y_{1:m})$:

$p(y_{1:m}) = \int_{\mathbb{R}^d} p(x) \prod_{t=1}^m p(y_t|x)\, dx = \int_{\mathbb{R}^d} \frac{1}{Z} e^{-f(x)} \prod_{t=1}^m \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\Big({-\frac{1}{2\sigma^2}\|x - y_t\|^2}\Big)\, dx \propto \int_{\mathbb{R}^d} e^{-f(x)} \exp\Big({-\frac{1}{2\sigma^2}\sum_{t=1}^m \|x - y_t\|^2}\Big)\, dx.$

The (σ, m)-density $p(y_{1:m})$ is permutation invariant under permutations of the measurement indices, $\pi : [m] \to [m]$. In the calculation below we derive a general form for $p(y_{1:m})$ where this permutation invariance becomes apparent in terms of the empirical mean of the $m$ measurements and the empirical mean of $\{\|y_t\|^2\}_{t=1}^m$. We start by rewriting $\log p(y_{1:m}|x)$:

$-2\sigma^2 \log p(y_{1:m}|x) = \sum_{t=1}^m \|y_t - x\|^2 + \mathrm{cst} = m\|x\|^2 - 2m\langle x, \bar y_{1:m}\rangle + \sum_{t=1}^m \|y_t\|^2 + \mathrm{cst} = m\|x - \bar y_{1:m}\|^2 - m\|\bar y_{1:m}\|^2 + \sum_{t=1}^m \|y_t\|^2 + \mathrm{cst},$

where cst is a constant that does not depend on $y_{1:m}$. Using the above expression, we arrive at:

$\log p(y_{1:m}) = \log \int e^{-f(x)}\, p(y_{1:m}|x)\, dx + \mathrm{cst} = \log \int e^{-f(x)} \exp\Big({-\frac{m}{2\sigma^2}\|x - \bar y_{1:m}\|^2}\Big)\, dx + \frac{\|\bar y_{1:m}\|^2 - m^{-1}\sum_{t=1}^m \|y_t\|^2}{2 m^{-1}\sigma^2} + \mathrm{cst}.$

The equation above reduces to (2.2) with the definition

$\varphi(y; \sigma) = \log \int e^{-f(x)} \exp\Big({-\frac{1}{2\sigma^2}\|x - y\|^2}\Big)\, dx.$

A.2 DERIVATION FOR (2.4)

Next, we derive the expression for $\hat x(y_{1:m}) = \mathbb{E}[X|y_{1:m}]$ given in (2.4):

$\mathbb{E}[X|y_{1:m}] = y_t + \sigma^2\, \nabla_{y_t} \log p(y_{1:m}) = y_t + \sigma^2\big(m^{-1}\nabla\varphi(\bar y_{1:m};\, m^{-1/2}\sigma) + \sigma^{-2}(\bar y_{1:m} - y_t)\big) = \bar y_{1:m} + m^{-1}\sigma^2\, \nabla\varphi(\bar y_{1:m};\, m^{-1/2}\sigma).$

The first equation above comes from the generalization of the Bayes estimator to factorial kernels (Saremi & Srivastava, 2022), and for the second equation we used (2.2).

B SMOOTHED SCORE FUNCTIONS

In this section, we give several different expressions related to the smoothed densities used in the paper. We consider $Y \sim \mathcal{N}(X, \sigma^2 I)$, where $X \sim e^{-f(x)}$ in $\mathbb{R}^d$, thus with the density

$p(y) = \int \frac{1}{Z} e^{-f(x)} \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx \propto \int e^{-f(x)} e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx.$

We have the following expressions for $\log p(y)$ and its gradient:

$\log p(y) = \log \int e^{-f(x)} e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx + \mathrm{cst}, \qquad \log p(y) = \log \int e^{-f(x + y)} e^{-\frac{1}{2\sigma^2}\|x\|^2}\, dx + \mathrm{cst},$

$\nabla(\log p)(y) = \frac{\int e^{-f(x)} e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, \frac{x - y}{\sigma^2}\, dx}{\int e^{-f(x)} e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx} = \frac{1}{\sigma^2}\big(\mathbb{E}[X|y] - y\big),$  (B.1)

$\nabla(\log p)(y) = -\frac{\int e^{-f(x + y)}\, \nabla f(x + y)\, e^{-\frac{1}{2\sigma^2}\|x\|^2}\, dx}{\int e^{-f(x + y)}\, e^{-\frac{1}{2\sigma^2}\|x\|^2}\, dx} = -\frac{\int e^{-f(x)}\, \nabla f(x)\, e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx}{\int e^{-f(x)}\, e^{-\frac{1}{2\sigma^2}\|x - y\|^2}\, dx} = -\mathbb{E}[\nabla f(X)|y].$  (B.2)

This in turn leads to three expressions for the Hessian. Differentiating (B.1),

$\nabla^2(\log p)(y) = -\frac{1}{\sigma^2} I + \frac{1}{\sigma^4}\big(\mathbb{E}[XX^\top|y] - \mathbb{E}[X|y]\,\mathbb{E}[X|y]^\top\big) = -\frac{1}{\sigma^2} I + \frac{1}{\sigma^4}\, \mathrm{cov}(X|y),$  (B.3)

while differentiating (B.2) gives

$\nabla^2(\log p)(y) = -\mathbb{E}[\nabla^2 f(X)|y] + \mathbb{E}[\nabla f(X)\nabla f(X)^\top|y] - \mathbb{E}[\nabla f(X)|y]\,\mathbb{E}[\nabla f(X)|y]^\top = -\mathbb{E}[\nabla^2 f(X)|y] + \mathrm{cov}(\nabla f(X)|y),$  (B.4)

and differentiating (B.2) using the first representation of $p(y)$ gives the mixed form

$\nabla^2(\log p)(y) = -\frac{1}{\sigma^2}\, \mathrm{cov}(X, \nabla f(X)|y).$

B.1 GAUSSIAN EXAMPLE

If $X \sim \mathcal{N}(\mu, C)$, then $Y \sim \mathcal{N}(\mu, C + \sigma^2 I)$, and

$\mathbb{E}[X|Y] = \mu + C(C + \sigma^2 I)^{-1}(Y - \mu), \qquad \mathrm{cov}(X|Y) = C - C(C + \sigma^2 I)^{-1} C.$

Therefore, $\mathbb{E}[X|Y]$ is Gaussian with mean $\mu$ and covariance matrix $C(C + \sigma^2 I)^{-1} C$. We have $\nabla^2(\log p)(y) = -(C + \sigma^2 I)^{-1}$.
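The Gaussian example provides an end-to-end check of (B.1): the empirical Bayes identity $\mathbb{E}[X|y] = y + \sigma^2\nabla(\log p)(y)$ must reproduce the Gaussian posterior mean above. A sketch, with an illustrative diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.7
mu = rng.normal(size=d)
C = np.diag(rng.uniform(0.5, 2.0, size=d))     # diagonal covariance for simplicity

y = rng.normal(size=d)
S = C + sigma**2 * np.eye(d)                   # covariance of Y

score = -np.linalg.solve(S, y - mu)            # grad log p(y) for Y ~ N(mu, S)
tweedie = y + sigma**2 * score                 # (B.1): E[X|y] = y + sigma^2 grad log p(y)
direct = mu + C @ np.linalg.solve(S, y - mu)   # posterior mean from Sec. B.1
assert np.allclose(tweedie, direct)
print(tweedie)
```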
B.2 DISTRIBUTION OF E[X|Y]

If $Y$ is sampled from $p_Y$, let $\hat p_\sigma$ be the distribution of $\mathbb{E}[X|Y] = Y + \sigma^2\, \nabla(\log p)(Y)$. We can bound the 2-Wasserstein distance between $p$ and $\hat p_\sigma$ by sampling $X$ from $p_X$ and taking $\hat x = Y + \sigma^2 \nabla\log p(Y)$, where $Y \sim \mathcal{N}(X, \sigma^2 I)$. This leads to a particular coupling (Villani, 2021) between $p$ and $\hat p_\sigma$. It follows that

$W_2(p, \hat p_\sigma)^2 \le \mathbb{E}_{X,Y}\,\|\mathbb{E}[X|Y] - X\|^2.$

Given that the conditional expectation $\mathbb{E}[X|Y]$ is the optimal estimator for the square loss, the last expression is bounded by the risk of the trivial estimator $Y$, that is,

$W_2(p, \hat p_\sigma)^2 \le \mathbb{E}_{X,Y}\,\|Y - X\|^2 = \sigma^2 d.$  (B.5)

B.3 PROPOSITION 2

The calculations in Sec. B.2 that resulted in (B.5) lead to the following:

Proof of Proposition 2. Using Proposition 1, we have $W_2(p, \hat p_{\sigma,m}) = W_2(p, \hat p_{m^{-1/2}\sigma})$, which is then combined with (B.5):

$W_2(p, \hat p_{\sigma,m})^2 \le \frac{\sigma^2 d}{m}.$

C PROOFS FOR SEC. 3

C.1 PROPOSITION 3

Proof of Proposition 3. We start with an expression for $\log p(y_{1:m})$:

$\log p(y_{1:m}) = \log \int \prod_{k=1}^m \mathcal{N}(x; y_k, \sigma^2 I)\, \mathcal{N}(x; 0, C)\, dx = \sum_{i=1}^d \gamma_i + \sum_{i=1}^d \log \int \exp\Big({-\frac{(x_i - \alpha_i)^2}{2\beta_i^2}}\Big)\, dx_i + \mathrm{cst}.$  (C.1)

The expressions for $\alpha_i$, $\beta_i$, and $\gamma_i$ are obtained by completing the square: we match the second, first, and zeroth derivatives (in that order) of the left- and right-hand sides of

$-\sum_{t=1}^m \frac{(x_i - y_{ti})^2}{2\sigma^2} - \frac{x_i^2}{2\tau_i^2} = -\frac{(x_i - \alpha_i)^2}{2\beta_i^2} + \gamma_i,$

evaluated at $x_i = 0$. The following three equations follow:

$\frac{1}{\beta_i^2} = \frac{m}{\sigma^2} + \frac{1}{\tau_i^2}, \qquad \alpha_i = \frac{1}{m + \sigma^2\tau_i^{-2}}\sum_{t=1}^m y_{ti}, \qquad -\frac{\alpha_i^2}{2\beta_i^2} + \gamma_i = -\sum_{t=1}^m \frac{y_{ti}^2}{2\sigma^2},$

so that

$\gamma_i = -\sum_{t=1}^m \frac{y_{ti}^2}{2\sigma^2} + \frac{1}{2\sigma^2(m + \sigma^2\tau_i^{-2})}\Big(\sum_{t=1}^m y_{ti}\Big)^2.$

It is convenient to define:

$A_{ti} = \frac{1}{t + \sigma^2\tau_i^{-2}}.$  (C.2)

Using the above expressions, (C.1) simplifies to:

$\log p(y_{1:m}) = \sum_{i=1}^d \gamma_i + \mathrm{cst} = -\frac{1}{2\sigma^2}\sum_{t=1}^m \|y_t\|^2 + \frac{1}{2\sigma^2}\sum_{i=1}^d A_{mi}\Big(\sum_{t=1}^m y_{ti}\Big)^2 + \mathrm{cst}.$  (C.3)

The energy function can be written more compactly by introducing the matrix $F_{\sigma,m}$:

$\log p(y_{1:m}) = -\frac{1}{2}\, y_{1:m}^\top F_{\sigma,m}\, y_{1:m} + \mathrm{cst}, \qquad \sigma^2 [F_{\sigma,m}]_{ti, t'i'} = \big((1 - A_{mi})\delta_{tt'} - A_{mi}(1 - \delta_{tt'})\big)\delta_{ii'}.$

In words, the $md \times md$ matrix $F_{\sigma,m}$ is block diagonal with $d$ blocks of size $m \times m$; the blocks capture the interactions between different measurements indexed by $t$ and $t'$. The $m \times m$ blocks of $F_{\sigma,m}$, indexed by $i \in [d]$, thus have the form

$\sigma^2 F^{(i)}_{\sigma,m} = (1 - A_{mi})\, I_m + A_{mi}\,(I_m - \mathbf{1}_m \mathbf{1}_m^\top),$

where $\mathbf{1}_m^\top = (1, 1, \dots, 1)$ is $m$-dimensional. It is straightforward to find the $m$ eigenvalues of the $m \times m$ matrix $F^{(i)}_{\sigma,m}$ for $i \in [d]$:
- $m - 1$ degenerate eigenvalues equal to $\sigma^{-2}$, corresponding to the eigenvectors $\{(1, -1, 0, \dots, 0)^\top, (1, 0, -1, \dots, 0)^\top, \dots, (1, 0, 0, \dots, -1)^\top\}$,
- one eigenvalue equal to $\sigma^{-2}(1 - m A_{mi})$, corresponding to the eigenvector $(1, 1, \dots, 1)^\top$.

Since $1 - m A_{mi} > 0$, we have $\lambda_{\max}(F_{\sigma,m}) = \sigma^{-2}$, which is $(m-1)d$-fold degenerate. The remaining $d$ eigenvalues are $\{\sigma^{-2}(1 - m A_{mi})\}_{i \in [d]}$, the smallest of which is given by

$\lambda_{\min}(F_{\sigma,m}) = \sigma^{-2}\Big(1 - \frac{m}{m + \sigma^2\tau_{\max}^{-2}}\Big) = \frac{1}{\sigma^2 + m\tau_{\max}^2}.$

Thus we have $\kappa_{\sigma,m} = \lambda_{\max}(F_{\sigma,m})/\lambda_{\min}(F_{\sigma,m}) = 1 + m\sigma^{-2}\tau_{\max}^2$.

C.2 PROPOSITION 4

Proof of Proposition 4. Using (C.3) and (C.2) we have:

$-2\sigma^2 \log p(y_{1:t}) = \sum_{k=1}^t \|y_k\|^2 - \sum_{i=1}^d A_{ti}\Big(\sum_{k=1}^t y_{ki}\Big)^2 + \mathrm{cst}.$

Since $\log p(y_t|y_{1:t-1}) = \log p(y_{1:t}) - \log p(y_{1:t-1})$, we have:

$-2\sigma^2 \log p(y_t|y_{1:t-1}) = \|y_t\|^2 - \sum_{i=1}^d A_{ti}\, y_{ti}\Big(y_{ti} + 2\sum_{k=1}^{t-1} y_{ki}\Big) + \mathrm{cst} = \sum_{i=1}^d (1 - A_{ti})\Big(y_{ti} - \frac{A_{ti}}{1 - A_{ti}}\sum_{k=1}^{t-1} y_{ki}\Big)^2 + \mathrm{cst}.$

Therefore the conditional density $p(y_t|y_{1:t-1})$ is the Gaussian $\mathcal{N}(\mu_{t|1:t-1}, F_{t|1:t-1}^{-1})$, with the shifted mean given coordinate-wise by

$[\mu_{t|1:t-1}]_i = \frac{A_{ti}}{1 - A_{ti}}\sum_{k=1}^{t-1} y_{ki},$

and with an anisotropic, diagonal covariance/precision matrix whose spectrum is given by:

$\sigma^2 \lambda_i(F_{t|1:t-1}) = 1 - (t + \sigma^2\tau_i^{-2})^{-1}, \qquad \kappa_{t|1:t-1} = \frac{\lambda_{\max}(F_{t|1:t-1})}{\lambda_{\min}(F_{t|1:t-1})} = \frac{1 - (t + \sigma^2\tau_{\min}^{-2})^{-1}}{1 - (t + \sigma^2\tau_{\max}^{-2})^{-1}}.$
Lastly, to prove the monotonicity result (3.3), we analytically continue $\kappa_{t|1:t-1}$ to continuous values of $t$ by defining $\eta(t) = \kappa_{t|1:t-1}$ and taking its derivative. Below, $R = \sigma^2\tau_{\min}^{-2}$ and $r = \sigma^2\tau_{\max}^{-2}$, thus $r < R$:

$\eta'(t) = \frac{(t + R)^{-2}\big(1 - (t + r)^{-1}\big) - \big(1 - (t + R)^{-1}\big)(t + r)^{-2}}{\big(1 - (t + r)^{-1}\big)^2} = \frac{(t + r)(t + r - 1)}{(t + R)^2 (t + r - 1)^2} - \frac{(t + R - 1)(t + R)}{(t + r - 1)^2 (t + R)^2} = \frac{(t + r)^2 - r - (t + R)^2 + R}{(t + r - 1)^2 (t + R)^2} = \frac{(r - R)(2t + r + R - 1)}{(t + r - 1)^2 (t + R)^2} < 0.$

D PROOFS FOR SEC. 4

D.1 LEMMA 1

Proof of Lemma 1. We would like to find an upper bound for $\nabla^2(\log p)(y)$. Using $\nabla^2(\log p)(y) = -\sigma^{-2} I + \sigma^{-4}\mathrm{cov}(X|y)$, derived in Appendix B, we need to upper bound $\mathrm{cov}(X|y)$. It suffices to study $\mathbb{E}[\|X - x_0\|^2\,|\,y]$, since $\mathrm{cov}(X|y) = \mathrm{cov}(X - x_0|y) \preceq \mathbb{E}[\|X - x_0\|^2|y]\, I$. We have, by a convexity argument⁵:

$\mathbb{E}[\|X - x_0\|^2|y] = \frac{\int e^{-f(x) - \frac{1}{2\sigma^2}\|x - y\|^2}\, \|x - x_0\|^2\, dx}{\int e^{-f(x) - \frac{1}{2\sigma^2}\|x - y\|^2}\, dx} \le \frac{\int e^{-f(x) + \frac{1}{\sigma^2}(x - x_0)^\top(y - x_0)}\, \|x - x_0\|^2\, dx}{\int e^{-f(x) + \frac{1}{\sigma^2}(x - x_0)^\top(y - x_0)}\, dx}.$  (D.1)

Next, we find an upper bound for $\|x - x_0\|^2$ itself. Using the assumption in the lemma we have:

$\Big\|\nabla f(x) + \frac{x_0 - y}{\sigma^2}\Big\| \ge \|\nabla f(x)\| - \frac{\|x_0 - y\|}{\sigma^2} \ge \mu\|x - x_0\| - \frac{\|x_0 - y\|}{\sigma^2},$

leading to $\mu\|x - x_0\| \le \|\nabla f(x) + \sigma^{-2}(x_0 - y)\| + \sigma^{-2}\|x_0 - y\|$, and thus

$\mu^2\|x - x_0\|^2 \le 3\Big\|\nabla f(x) + \frac{x_0 - y}{\sigma^2}\Big\|^2 + \frac{3\|x_0 - y\|^2}{\sigma^4}.$

Finally, using (D.1), we only need to find an upper bound for $\|\nabla f(x) + \sigma^{-2}(x_0 - y)\|^2$ under the distribution $\tilde p(x) \propto e^{-\tilde f(x)}$, where $\tilde f(x) = f(x) - \sigma^{-2}(x - x_0)^\top(y - x_0)$. This is achieved with:

$\int \tilde p(x)\Big\|\nabla f(x) + \frac{x_0 - y}{\sigma^2}\Big\|^2\, dx = \int \tilde p(x)\, \|\nabla \tilde f(x)\|^2\, dx = \int \tilde p(x)\, \mathrm{tr}\,\nabla^2 \tilde f(x)\, dx \le Ld,$

where the second equality is obtained using integration by parts akin to score matching (Hyvärinen, 2005), and for the last inequality we used our assumption $\nabla^2 \tilde f = \nabla^2 f \preceq LI$. Putting it all together, we arrive at:

$\nabla^2(\log p)(y) \preceq \frac{1}{\sigma^2}\Big({-1} + \frac{3Ld}{\mu^2\sigma^2} + \frac{3\|x_0 - y\|^2}{\mu^2\sigma^6}\Big)\, I.$

⁵We consider the exponential family $p(x|\nu) = \exp\big({-f(x)} + \frac{1}{\sigma^2}(x - x_0)^\top(y - x_0) - \frac{\nu}{2\sigma^2}\|x - x_0\|^2 - \frac{\nu}{2\sigma^2}\|y - x_0\|^2 - a(\nu)\big)$. Since $a(\nu)$ is convex (Wainwright & Jordan, 2008), we have $a'(1) \ge a'(0)$, which is exactly the desired statement.

D.2 LEMMA 2

Since $\log p(y_t|y_{1:t-1}) = \log p(y_{1:t}) - \log p(y_{1:t-1})$, we have $\nabla^k_{y_t} \log p(y_t|y_{1:t-1}) = \nabla^k_{y_t} \log p(y_{1:t})$. Start with the score function ($k = 1$):

$\nabla_{y_t} \log p(y_{1:t}) = \frac{\int \sigma^{-2}(x - y_t)\, p(x) \prod_{i=1}^t p(y_i|x)\, dx}{\int p(x) \prod_{i=1}^t p(y_i|x)\, dx} = \sigma^{-2}\big(\mathbb{E}[X|y_{1:t}] - y_t\big).$

Next, we derive the Hessian $\nabla^2_{y_t} \log p(y_{1:t}) = -\sigma^{-2} I + A_t - B_t$, where $A_t$ and $B_t$ are given by:

$A_t = \sigma^{-4}\, \frac{\int (x - y_t)(x - y_t)^\top\, p(x) \prod_{i=1}^t p(y_i|x)\, dx}{\int p(x) \prod_{i=1}^t p(y_i|x)\, dx} = \sigma^{-4}\, \mathbb{E}\big[(X - y_t)(X - y_t)^\top\,\big|\, y_{1:t}\big],$

$B_t = \sigma^{-4}\Big(\frac{\int (x - y_t)\, p(x) \prod_{i=1}^t p(y_i|x)\, dx}{\int p(x) \prod_{i=1}^t p(y_i|x)\, dx}\Big)\Big(\frac{\int (x - y_t)\, p(x) \prod_{i=1}^t p(y_i|x)\, dx}{\int p(x) \prod_{i=1}^t p(y_i|x)\, dx}\Big)^{\!\top} = \sigma^{-4}\big(\mathbb{E}[X|y_{1:t}] - y_t\big)\big(\mathbb{E}[X|y_{1:t}] - y_t\big)^\top.$

In simplifying $A_t - B_t$, the $y_t$ cross-terms cancel out and the posterior covariance matrix emerges:

$\nabla^2_{y_t} \log p(y_{1:t}) = -\sigma^{-2} I + \sigma^{-4}\, \mathrm{cov}(X|y_{1:t}).$  (D.2)

The lemma is proven since the mean of the posterior covariance $\mathbb{E}_{y_{1:t}}\mathrm{cov}(X|y_{1:t})$ can only go down upon accumulation of measurements (conditioning on more variables).

D.3 THEOREM 1

Proof of Theorem 1. We start with the definition of the (σ, m)-density:

$p(y_{1:m}) = \int p(z)\, p(y_{1:m}|z)\, dz, \qquad p(y_{1:m}|z) \propto \int \exp\Big({-\frac{1}{2\tau^2}\|x - z\|^2 - \frac{1}{2\sigma^2}\sum_{t=1}^m \|x - y_t\|^2}\Big)\, dx,$

which we express by integrating out $x$. We have

$p(y_{1:m}|z) \propto \exp\Big({-\frac{m}{2(m\tau^2 + \sigma^2)}\|z - \bar y_{1:m}\|^2 + \frac{m}{2\sigma^2}\|\bar y_{1:m}\|^2 - \frac{1}{2\sigma^2}\sum_{t=1}^m \|y_t\|^2}\Big).$

We can then express $\nabla_{y_m} \log p(y_{1:m})$ in terms of $\mathbb{E}[Z|y_{1:m}]$:

$\nabla_{y_m} \log p(y_{1:m}) = \frac{m\tau^2}{\sigma^2(m\tau^2 + \sigma^2)}\, \bar y_{1:m} - \frac{y_m}{\sigma^2} + \frac{1}{m\tau^2 + \sigma^2}\, \mathbb{E}[Z|y_{1:m}].$

Next, we take another derivative:

$\nabla^2_{y_m} \log p(y_{1:m}) = \Big(\frac{\tau^2}{\sigma^2(m\tau^2 + \sigma^2)} - \frac{1}{\sigma^2}\Big)\, I + \frac{1}{m\tau^2 + \sigma^2}\, \nabla_{y_m} \mathbb{E}[Z|y_{1:m}].$

Next, we compute $\nabla_{y_m} \mathbb{E}[Z|y_{1:m}]$. Since $y_m$ enters $\mathbb{E}[Z|y_{1:m}]$ only through $\bar y_{1:m}$, differentiating the ratio of integrals defining the posterior mean gives:

$\nabla_{y_m} \mathbb{E}[Z|y_{1:m}] = \frac{1}{m\tau^2 + \sigma^2}\Big(\mathbb{E}[ZZ^\top|y_{1:m}] - \mathbb{E}[Z|y_{1:m}]\,\mathbb{E}[Z|y_{1:m}]^\top\Big) = \frac{1}{m\tau^2 + \sigma^2}\, \mathrm{cov}(Z|y_{1:m}).$
Putting it all together, we arrive at:⁶

$\nabla^2_{y_m} \log p(y_{1:m}) = \Big(\frac{\tau^2}{\sigma^2(m\tau^2 + \sigma^2)} - \frac{1}{\sigma^2}\Big)\, I + \frac{1}{(m\tau^2 + \sigma^2)^2}\, \mathrm{cov}(Z|y_{1:m}).$  (D.4)

Finally, since $\|Z\|^2 \le R^2$ almost surely, we have $\mathrm{cov}(Z|y_{1:m}) \preceq R^2 I$, therefore

$\nabla^2_{y_m} \log p(y_m|y_{1:m-1}) = \nabla^2_{y_m} \log p(y_{1:m}) \preceq \zeta(m)\, I, \qquad \zeta(m) = \frac{\tau^2}{\sigma^2(m\tau^2 + \sigma^2)} - \frac{1}{\sigma^2} + \frac{R^2}{(m\tau^2 + \sigma^2)^2}.$

D.4 DERIVATION FOR (4.6)

In this example, we have $X = Z + N_0$, $N_0 \sim \mathcal{N}(0, \tau^2 I)$, and $Y = X + N_1$, $N_1 \sim \mathcal{N}(0, \sigma^2 I)$, therefore

$Y = Z + N, \qquad N \sim \mathcal{N}(0, (\sigma^2 + \tau^2) I),$  (D.5)

where $Z \sim p(z)$, $p(z) = \frac{1}{2}\delta(z - \mu) + \frac{1}{2}\delta(z + \mu)$, and $\delta$ is the Dirac delta in $d$ dimensions. Alternatively, we have $p(y) = \int p(z)\, p(y|z)\, dz$, where $p(y|z) = \mathcal{N}(y; z, (\sigma^2 + \tau^2) I)$. Using (B.3), adapted for (D.5), we have:

$H(y) = \nabla^2(\log p)(y) = -\frac{1}{\sigma^2 + \tau^2}\Big(I - \frac{1}{\sigma^2 + \tau^2}\, \mathrm{cov}(Z|y)\Big).$  (D.6)

Next, we derive an expression for $\mathrm{cov}(Z|y) = \mathbb{E}[ZZ^\top|y] - \mathbb{E}[Z|y]\,\mathbb{E}[Z|y]^\top$. We have

$\mathbb{E}[ZZ^\top|y] = \mu\mu^\top, \qquad \mathbb{E}[Z|y] = \mu\, \frac{e^A - e^B}{e^A + e^B}, \qquad A = -\frac{\|y - \mu\|^2}{2(\sigma^2 + \tau^2)}, \qquad B = -\frac{\|y + \mu\|^2}{2(\sigma^2 + \tau^2)}.$

It follows that

$\mathrm{cov}(Z|y) = \mu\mu^\top\Big(1 - \Big(\frac{e^A - e^B}{e^A + e^B}\Big)^2\Big) = \frac{2\mu\mu^\top}{1 + \cosh(B - A)} = 2\mu\mu^\top\Big(1 + \cosh\Big(\frac{2\mu^\top y}{\sigma^2 + \tau^2}\Big)\Big)^{-1}.$  (D.7)

By combining (D.6) and (D.7) we arrive at (4.6).

⁶Not required for the proof, but we can also relate $\mathrm{cov}(X|y_{1:m})$ and $\mathrm{cov}(Z|y_{1:m})$ directly since, as we know (see the proof of Lemma 2),

$\nabla^2_{y_m} \log p(y_{1:m}) = -\sigma^{-2} I + \sigma^{-4}\, \mathrm{cov}(X|y_{1:m}).$  (D.3)

By combining (D.4) and (D.3), it follows that

$\mathrm{cov}(X|y_{1:m}) = \frac{\sigma^2\tau^2}{m\tau^2 + \sigma^2}\, I + \Big(\frac{\sigma^2}{m\tau^2 + \sigma^2}\Big)^2\, \mathrm{cov}(Z|y_{1:m}).$

E DETAILED ALGORITHM

Algorithm 2: MCMC_σ in Algorithm 1 via the underdamped Langevin MCMC of Sachs et al. (2017)
1: Input: Y_t^{(i−1)}, Ȳ_{1:t−1}
2: Parameters: current MCMC iteration i, current measurement index t, step size δ, friction γ, steps taken n_t, estimated smoothed score function ĝ(y; σ), Lipschitz parameter L, noise level σ
3: Output: Y_t^{(i)}
4: Initialize Y_t^{(i,0)} ∼ Unif([0, 1]^d) + N(0, σ²I)
5: Initialize V ← 0
6: for k = 0, …, K − 1 do
7:   Y_t^{(i,k+1)} = Y_t^{(i,k)} + (δ/2) V
8:   Ȳ_{1:t} = Ȳ_{1:t−1} + (Y_t^{(i,k+1)} − Ȳ_{1:t−1})/t
9:   G = t^{−1} ĝ(Ȳ_{1:t}; t^{−1/2}σ) + σ^{−2}(Ȳ_{1:t} − Y_t^{(i,k+1)}) according to (4.9)
10:  V ← V + (δ/(2L)) G
11:  B ∼ N(0, I)
12:  V ← exp(−γδ) V + (δ/(2L)) G + √((1/L)(1 − exp(−2γδ))) B
13:  Y_t^{(i,k+1)} ← Y_t^{(i,k+1)} + (δ/2) V
14: end for
15: return Y_t^{(i)} = Y_t^{(i,K)}

F ESTIMATORS FOR ∇ log p(y)

The purpose of this section is to derive estimators for the smoothed score function $g(y; \sigma) = \nabla(\log p)(y) = \nabla\varphi(y; \sigma)$, which can be used to run Algorithm 2 and, in turn, Algorithm 1. We first derive (4.8) by rewriting (B.1) as follows:

$g(y; \sigma) = \frac{1}{\sigma^2}\, \frac{\int (x - y)\, \exp\big({-\check f(x; y, \sigma)}\big)\, dx}{\int \exp\big({-\check f(x; y, \sigma)}\big)\, dx} = \frac{1}{\sigma^2}\big(\mathbb{E}[\check X] - y\big), \quad \check X \sim e^{-\check f(\cdot;\, y, \sigma)}, \quad \check f(x; y, \sigma) := f(x) + \frac{\|x - y\|^2}{2\sigma^2}.$

F.1 THE PLUG-IN ESTIMATOR ĝ_plugin

We can arrive at a simple plug-in estimator for $g(y; \sigma)$ by rewriting (B.1) as follows:

$\nabla\varphi(y; \sigma) = \frac{1}{\sigma^2}\, \frac{\mathbb{E}_{N \sim \mathcal{N}(0, \sigma^2 I)}\big[N\, e^{-f(N + y)}\big]}{\mathbb{E}_{N \sim \mathcal{N}(0, \sigma^2 I)}\big[e^{-f(N + y)}\big]}.$

We then simply estimate the numerator and the denominator above using i.i.d. Gaussian draws:

$\hat g_{\mathrm{plugin}}(y; \sigma) = \frac{1}{\sigma}\, \frac{\sum_{i=1}^n \varepsilon_i \exp(-f(\sigma\varepsilon_i + y))}{\sum_{i=1}^n \exp(-f(\sigma\varepsilon_i + y))}, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I).$  (F.1)

(Above, we used one set of i.i.d. draws for estimating the numerator and the denominator, as in our experiments. Of course, one should take independent draws if computation budget is not an issue.)

Gradient-based plug-in estimator. One can also obtain a different type of plug-in estimator for $\nabla \log p(y)$ using (B.2) that directly takes the gradient information $\nabla f(x)$ into consideration. This estimator should have better properties, but we did not experiment with it in this paper. Intuitively, such plug-in estimators (gradient-aware or not) will suffer from the curse of dimensionality.
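A direct, unstabilized implementation of (F.1) takes only a few lines (the log-sum-exp version used in our experiments is spelled out in Appendix I); a sketch, with the test energy $f$ and the choice of $n$ illustrative:

```python
import numpy as np

def g_plugin(y, sigma, f, n, rng=np.random.default_rng(0)):
    """Unstabilized plug-in estimator (F.1) of the smoothed score g(y; sigma)."""
    eps = rng.normal(size=(n, y.shape[0]))                  # eps_i ~ N(0, I)
    w = np.exp(-np.array([f(sigma * e + y) for e in eps]))  # e^{-f(sigma eps_i + y)}
    return (w @ eps) / (sigma * w.sum())

# Test: f(x) = ||x||^2/2 (standard Gaussian), where g(y; sigma) = -y / (1 + sigma^2).
y = np.array([0.5, -1.0])
print(g_plugin(y, 1.0, lambda x: 0.5 * np.sum(x**2), n=50000), -y / 2.0)
```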
Connections to importance sampling. One can also arrive at (F.1) by using importance sampling to estimate $\mathbb{E}[\check X]$ in (4.8), which is insightful. In particular, in importance sampling, instead of sampling from the probability measure $\nu$ associated with $\exp(-\check f(\cdot; y, \sigma))$ (which is hard to sample from), we sample from an easier probability measure $\mu$, which in our case we took to be the Gaussian measure $\mathcal{N}(y, \sigma^2 I)$. With this setup, the estimator (F.1) follows through. It is known that the number of samples required to have an accurate estimator based on importance sampling is of order $\exp(\mathrm{KL}(\nu\|\mu))$, where $\mathrm{KL}(\nu\|\mu)$ is the Kullback-Leibler divergence of $\mu$ from $\nu$ (Chatterjee & Diaconis, 2018). This further highlights the limitations of the estimator (F.1).

F.2 THE LANGEVIN ESTIMATOR ĝ_langevin

A better way to estimate $\mathbb{E}[\check X]$ in (4.8) is to use MCMC, in particular the gradient-based Langevin MCMC. For any $y$, this is done by running Langevin MCMC using the score function

$\check g(x; y, \sigma) = -\nabla f(x) - \frac{x - y}{\sigma^2}.$

As an example, and to be concrete, for any $y$ one can sample $\check X$ by discretizing the overdamped Langevin diffusion:

$d\check X_s = \Big({-\nabla f(\check X_s)} - \frac{\check X_s - y}{\sigma^2}\Big)\, ds + \sqrt{2}\, dB_s.$

Given $n$ such independent samples $(\check X_i)_{i=1}^n$, drawn by running the Langevin dynamics, we then use (4.8) to arrive at $\hat g_{\mathrm{langevin}}$:

$\hat g_{\mathrm{langevin}}(y; \sigma) = \frac{1}{\sigma^2}\Big(\frac{1}{n}\sum_{i=1}^n \check X_i - y\Big).$  (F.2)

The procedure for drawing a sample $\check X$ from $e^{-\check f(x; y, \sigma)}$ using overdamped Langevin MCMC is given in Algorithm 3.

Algorithm 3: Draw $\check X \sim e^{-\check f(x;\, y, \sigma)}$ with Langevin MCMC for the estimator (F.2)
1: Parameter: noise level σ
2: Input: current y for which we need to estimate g(y; σ)
3: Hyperparameters: step size δ, number of iterations K
4: Output: X̌
5: x₀ = y
6: for k = 1, …, K do
7:   ǧ = −∇f(x_{k−1}) − σ^{−2}(x_{k−1} − y)
8:   ε ∼ N(0, I)
9:   x_k = x_{k−1} + δǧ + √(2δ) ε
10: end for
11: return X̌ ← x_K

We should highlight that in our experiments, instead of using the overdamped Langevin MCMC above, we used the more sophisticated underdamped Langevin MCMC algorithm by Sachs et al. (2017).

Connections to Entropy-SGD. The problem of estimating smoothed score functions has also been of interest in the neural-network optimization literature under the terminology of Entropy-SGD (Chaudhari et al., 2017). There, the smoothed score function (although the kernel-smoothing lexicon is not used there; $1/\sigma^2$ is denoted by $\gamma$ and is referred to as "scope") is utilized for optimization, not for sampling, to arrive at flatter minima of the loss landscape with better generalization properties. Due to the different motivations, Algorithm 3 differs from the one used in Entropy-SGD; e.g., we do not have exponential averaging here.
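A runnable rendering of Algorithm 3 and of the resulting estimator (F.2) might look as follows; a sketch in which the step size, iteration counts, and the Gaussian test case are illustrative (and, as noted above, our experiments used an underdamped variant instead):

```python
import numpy as np

def sample_x_check(y, sigma, grad_f, delta=1e-2, K=2000, rng=np.random.default_rng(0)):
    """Algorithm 3: overdamped Langevin targeting exp(-f(x) - ||x-y||^2/(2 sigma^2))."""
    x = y.copy()                                      # line 5: x_0 = y
    for _ in range(K):
        g_check = -grad_f(x) - (x - y) / sigma**2     # line 7
        x = x + delta * g_check + np.sqrt(2 * delta) * rng.normal(size=x.shape)
    return x

def g_langevin(y, sigma, grad_f, n=64):
    """Estimator (F.2): average n independent draws of X-check and rescale."""
    xs = np.stack([sample_x_check(y, sigma, grad_f, rng=np.random.default_rng(i))
                   for i in range(n)])
    return (xs.mean(axis=0) - y) / sigma**2

# Test: f(x) = ||x||^2/2, for which g(y; sigma) = -y / (1 + sigma^2) exactly.
print(g_langevin(np.array([0.5, -1.0]), 1.0, lambda x: x))
```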
G EXPERIMENTAL DETAIL

The hyperparameters were tuned on a log-spaced grid. We searched the step size δ over {0.03, 0.1, 0.3, 1.0}, the effective friction γδ over {0.0625, 0.125, 0.25, 0.5, 1.0}, the per-t MCMC iterations n_t over {1, 4, 16}, and the Lipschitz parameter over {1/σ², 1.0}. We found that the hyperparameter combinations (δ = 0.03, γδ = 0.0625) and (δ = 1.0, γδ = 0.5) worked well for most configurations of test-density type, σ, d, and MCMC algorithm. For JMS, m ran over {200, 400, 600, 800, 1000}. We experimented with two initialization schemes in SMS. At each t, the walkers sampling from p(y_t|y_{1:t−1}) were initialized (i) warm: at E[X|y_{1:t−1}] + ε, where ε ∼ N(0, σ²I), or (ii) cold: at samples from Unif([−1, 1]^d) + N(0, σ²I). For convergence experiments, we report results from the warm start, as it was more robust to the choice of hyperparameters. The tunneling results were obtained with the cold start.

The hyperparameters were tuned for each algorithm while the total number of iterations was kept fixed at a large value. We define each iteration as an MCMC update step. For SMS, the total number of iterations is $\sum_{t=1}^m n_t$: the number of MCMC iterations for each measurement $t$, $n_t$, summed over the $m$ measurements. We had $n_t \in \{1, 4, 16\}$ and $m = 5{,}000$. For the remaining three sampling schemes (JMS, m = 1, and σ = 0), the total number of iterations is simply the number of MCMC iterations, fixed to 20,000, although we found these algorithms to converge much earlier, at around 5,000.

H MCMC ALGORITHMS

Our algorithm is agnostic to the choice of MCMC sampling algorithm used in the Markovian phases. In this section, we run SMS sampling with four different Langevin MCMC algorithms. The results presented earlier in Sec. 5 use a ULD algorithm with an Euler discretization scheme that extends the BAOAB integration using multiple time steps for the O-part ("Sachs et al."; Sachs et al., 2017). Next, we consider two algorithms that operate on the integral representations of ULD. Recall that continuous-time ULD is represented by the following stochastic differential equation (SDE):

$dv_t = -\gamma v_t\, dt - u\nabla f(x_t)\, dt + \sqrt{2\gamma u}\, dB_t, \qquad dx_t = v_t\, dt,$

where $x_t, v_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. The solution $(x_t, v_t)$ to the continuous-time ULD is

$v_t = v_0 e^{-\gamma t} - u \int_0^t e^{-\gamma(t - s)}\, \nabla f(x_s)\, ds + \sqrt{2\gamma u} \int_0^t e^{-\gamma(t - s)}\, dB_s, \qquad x_t = x_0 + \int_0^t v_s\, ds.$  (H.1)

Similarly, the discrete ULD is defined by the SDE

$d\tilde v_t = -\gamma \tilde v_t\, dt - u\nabla f(\tilde x_0)\, dt + \sqrt{2\gamma u}\, dB_t, \qquad d\tilde x_t = \tilde v_t\, dt,$

which yields the solution

$\tilde v_t = \tilde v_0 e^{-\gamma t} - u \int_0^t e^{-\gamma(t - s)}\, \nabla f(\tilde x_0)\, ds + \sqrt{2\gamma u} \int_0^t e^{-\gamma(t - s)}\, dB_s, \qquad \tilde x_t = \tilde x_0 + \int_0^t \tilde v_s\, ds.$  (H.2)

Shen & Lee (2019) seek a lower discretization error by using a 2-step fixed-point iteration method, the randomized midpoint method: the integral in (H.1) is evaluated at uniformly random points between 0 and t. On the other hand, Cheng et al. (2018) compute the moments of the joint Gaussian over $(\tilde x_t, \tilde v_t)$ in the updates of (H.2). In our comparison we additionally include MALA, an Euler discretization of the overdamped Langevin dynamics represented by the SDE

$dx_t = -u\nabla f(x_t)\, dt + \sqrt{2u}\, dB_t,$  (H.3)

accompanied by a Metropolis adjustment to correct for the discretization errors (Roberts & Tweedie, 1996).

Figure 5: W₂ vs. m for various MCMC_σ algorithms used in the inner loop of Algorithm 1.

The algorithms are compared in Fig. 5 for the Gaussian-mixture test density introduced in Sec. 5.2 with d = 8. The three (unadjusted) ULD algorithms converge faster than MALA does, and to a lower W₂. In ULD algorithms, Brownian motion affects the positions $x_t$ through the velocities $v_t$, rather than directly as in MALA, resulting in a smoother evolution of $x_t$ that lends itself better to discretization. The first two dimensions of the final samples are displayed in Fig. 6. MALA samples fail to separate into the two modes, whereas Cheng et al. (2018), Shen & Lee (2019), and Sachs et al. (2017) have better sample quality, with Sachs et al. (2017) performing the best and almost approaching the sample variance of W₂ obtained when samples are directly drawn from $p_X$ (the gray band).

Figure 6: Final X̂ samples for various MCMC_σ algorithms used in the inner loop of Algorithm 1. Panels: (a) MALA (Roberts & Tweedie, 1996); (b) Cheng et al. (2018); (c) Shen & Lee (2019); (d) Sachs et al. (2017).
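For completeness, one MALA step as used in the comparison above amounts to an Euler proposal for (H.3) followed by a Metropolis-Hastings correction; a sketch, with the target and step size illustrative:

```python
import numpy as np

def mala_step(x, logp, grad_logp, delta, rng):
    """One MALA step (a sketch): Euler proposal plus Metropolis-Hastings correction."""
    prop = x + delta * grad_logp(x) + np.sqrt(2 * delta) * rng.normal(size=x.shape)
    # log q(x' -> x) - log q(x -> x') for the Gaussian proposal
    fwd = -np.sum((prop - x - delta * grad_logp(x)) ** 2) / (4 * delta)
    bwd = -np.sum((x - prop - delta * grad_logp(prop)) ** 2) / (4 * delta)
    log_alpha = logp(prop) - logp(x) + bwd - fwd
    return prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):   # standard Gaussian target, for illustration
    x = mala_step(x, lambda z: -0.5 * np.sum(z**2), lambda z: -z, 0.1, rng)
print(x)
```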
I SCORE ESTIMATION

Langevin MCMC requires the smoothed score function $g(y; \sigma) = \nabla \log p(y)$. The results presented earlier in Sec. 5 assumed access to the analytic score function. Here, we use the plug-in estimator $\hat g_{\mathrm{plugin}}$ presented in (F.1) with varying numbers of MC samples $n$, as well as the Langevin estimator $\hat g_{\mathrm{langevin}}$ presented in (F.2). To prevent numerical underflow, we implemented (F.1) as follows:

$A = \mathrm{logsumexp}_{i=1}^{n}\big({-f(y + \sigma\varepsilon_i)}\big),$  (I.1)

$B_+ = \mathrm{logsumexp}_{j=1}^{n_+}\big(\log \varepsilon_j - f(y + \sigma\varepsilon_j)\big),$  (I.2)

$B_- = \mathrm{logsumexp}_{k=1}^{n_-}\big(\log(-\varepsilon_k) - f(y + \sigma\varepsilon_k)\big),$  (I.3)

$\hat g_{\mathrm{plugin}}(y; \sigma) = \frac{1}{\sigma}\big(e^{B_+ - A} - e^{B_- - A}\big),$  (I.4)

where $\mathrm{logsumexp}_{i=1}^n(a_i) := a_{\max} + \log \sum_{i=1}^n \exp(a_i - a_{\max})$, $a_{\max} := \max_i a_i$, and $\varepsilon \sim \mathcal{N}(0, I)$. In (I.2) and (I.3), $j = 1, \dots, n_+$ and $k = 1, \dots, n_-$ denote the indices for which $\varepsilon$ is positive and negative, respectively, with $n_+ + n_- = n$. Note that the same Gaussian samples $\varepsilon$ were used to evaluate the numerator and the denominator.

For the Gaussian mixture density introduced in Sec. 5.2 with d = 2, both the analytic and the $\hat g_{\mathrm{plugin}}$ score functions converge to a low W₂, as shown in Fig. 7(a, b). The sample quality is on par with the analytic score function for n as small as 500, and there is little benefit to increasing n past 500. The $\hat g_{\mathrm{langevin}}$ score function struggles in this low-dimensional example, but catches up to the highest-n $\hat g_{\mathrm{plugin}}$ score function at d = 8, as Fig. 7(c, d) shows. All estimated score functions significantly underperform the analytic score. This is a preliminary result: for obtaining $\hat g_{\mathrm{langevin}}$ we did not extensively tune the hyperparameters. In addition, for $\hat g_{\mathrm{plugin}}$, variance-reduction techniques, such as importance weighting, may help get more mileage out of finite n.

Figure 7: Comparison of the analytic $g(y; \sigma) = \nabla \log p(y)$, the plug-in estimator $\hat g_{\mathrm{plugin}}$ (F.1) with varying numbers of MC samples n, and the Langevin estimator $\hat g_{\mathrm{langevin}}$ (F.2). The test density is the mixture of Gaussians introduced in Sec. 5.2 with d = 2 (a, b) and d = 8 (c, d). Panels: (a) W₂ vs. m (d = 2); (b) final W₂ vs. n for $\hat g_{\mathrm{plugin}}$ (d = 2); (c) W₂ vs. m (d = 8); (d) final W₂ vs. n for $\hat g_{\mathrm{plugin}}$ (d = 8).

J MIXTURE OF CORRELATED GAUSSIANS

In this section, we study a correlated test density, namely a mixture of two 2-dimensional Gaussians with full covariances:

$p(x) = \alpha\, \mathcal{N}(x; \mu, \Sigma_0) + (1 - \alpha)\, \mathcal{N}(x; -\mu, \Sigma_1).$  (J.1)

We choose $\mu = 3 \cdot \mathbf{1}_2$, $\Sigma_0 = R\, \mathrm{diag}(\frac{1}{4}, 4)\, R^\top$, $\Sigma_1 = R^\top \mathrm{diag}(1, 9)\, R$, $\alpha = \frac{4}{5}$, and the rotation matrix

$R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$

with $\theta = \pi/360$. Fig. 8 compares the performance of the same Langevin MCMC algorithm (Sachs et al., 2017) used within the SMS scheme (a) with that of the algorithm used without Gaussian smoothing (b). In both cases, we initialized the samplers at the origin (5,000 particles in total). The total numbers of iterations are the same for the two algorithms (see Appendix G for details).

Figure 8: Mixture of correlated Gaussians. (a) SMS, σ = 16, m = 1000; (b) Langevin MCMC.