# Diffusion Posterior Sampling is Computationally Intractable

Shivam Gupta*1, Ajil Jalal*2, Aditya Parulekar*1, Eric Price*1, Zhiyang Xun*1

*Equal contribution. 1 Department of Computer Science, University of Texas at Austin. 2 Department of Electrical Engineering and Computer Science, University of California, Berkeley. Correspondence to: Zhiyang Xun.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Diffusion models are a remarkably effective way of learning and sampling from a distribution p(x). In posterior sampling, one is also given a measurement model p(y | x) and a measurement y, and would like to sample from p(x | y). Posterior sampling is useful for tasks such as inpainting, super-resolution, and MRI reconstruction, so a number of recent works have given algorithms to heuristically approximate it; but none are known to converge to the correct distribution in polynomial time. In this paper we show that posterior sampling is computationally intractable: under the most basic assumption in cryptography, that one-way functions exist, there are instances for which every algorithm takes superpolynomial time, even though unconditional sampling is provably fast. We also show that the exponential-time rejection sampling algorithm is essentially optimal under the stronger plausible assumption that there are one-way functions that take exponential time to invert.

1. Introduction

Over the past few years, diffusion models have emerged as a powerful way of representing distributions of images. Models such as Dall-E (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2021) are very effective at learning and sampling from distributions. These models can then be used as priors for a wide variety of downstream tasks, including inpainting, superresolution, and MRI reconstruction.

Diffusion models are based on representing the smoothed scores of the desired distribution. For a distribution p(x), we define the smoothed distribution $p_\sigma(x)$ to be p convolved with $N(0, \sigma^2 I)$. These have corresponding smoothed scores $s_\sigma(x) := \nabla \log p_\sigma(x)$. Given the smoothed scores, the distribution p can be sampled using an SDE (Ho et al., 2020) or an ODE (Song et al., 2021). Moreover, the smoothed score is the minimizer of what is known as the score-matching objective, which can be estimated from samples.

Sampling via diffusion models is fairly well understood from a theoretical perspective. The sampling SDE and ODE are both fast (polynomial time) and robust (tolerating $L^2$ error in the estimation of the smoothed score). Moreover, with polynomially many training samples from the distribution, the empirical risk minimizer (ERM) of the score-matching objective will have bounded $L^2$ error, leading to accurate samples (Block et al., 2020; Gupta et al., 2023). So diffusion models give fast and robust unconditional samples.

But sampling from the original distribution is not the main utility of diffusion models: that comes from using the models to solve downstream tasks. A natural goal is to sample from the posterior: the distribution gives a prior p(x) over images, so given a noisy measurement y of x with known measurement model p(y | x), we can in principle use Bayes' rule to compute and sample from p(x | y).
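As a toy illustration (ours, not from the paper), the snippet below evaluates the smoothed score $s_\sigma(x) = \frac{d}{dx} \log p_\sigma(x)$ in closed form for a one-dimensional two-component Gaussian mixture prior; the mixture parameters and helper names are arbitrary. For such a prior, this closed form is exactly the quantity a diffusion model's score network is trained to approximate.

```python
import numpy as np

# Toy 1-D prior: an equal mixture of N(-1, 0.1^2) and N(+1, 0.1^2).
# Smoothing by N(0, sigma^2) keeps it a Gaussian mixture, so the smoothed
# score s_sigma(x) = d/dx log p_sigma(x) has a closed form.
MEANS = np.array([-1.0, 1.0])
TAU = 0.1  # component standard deviation of the prior

def smoothed_score(x, sigma):
    var = TAU**2 + sigma**2                      # component variance after convolution
    # responsibilities of each mixture component at x
    logw = -(x - MEANS) ** 2 / (2 * var)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # score of a Gaussian mixture = responsibility-weighted sum of component scores
    return np.sum(w * (MEANS - x) / var)

for sigma in [0.1, 1.0]:
    print(sigma, smoothed_score(0.5, sigma))
```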
Often (such as for inpainting, superresolution, and MRI reconstruction) the measurement process is the noisy linear measurement model, with measurement $y = Ax + \eta$ for a known measurement matrix $A \in \mathbb{R}^{m \times d}$ with m < d, and Gaussian noise $\eta \sim \beta N(0, I_m)$; we will focus on such linear measurements in this paper.

Posterior sampling has many appealing properties for image reconstruction tasks. For example, if you want to identify x precisely, posterior sampling is within a factor 2 of the minimum error possible for every measurement model and every error metric (Jalal et al., 2021a). When ambiguities do arise, posterior sampling has appealing fairness guarantees with respect to sensitive attributes (Jalal et al., 2021b).

Given the appeal of posterior sampling, the natural question is: is efficient posterior sampling possible given approximate smoothed scores? A large number of recent papers (Jalal et al., 2021a; Chung et al., 2023; Kawar et al., 2021; Trippe et al., 2023a; Song et al., 2023; Kawar et al., 2022; Dou & Song, 2024) have studied algorithms for posterior sampling, with promising empirical results. But all these fail on some inputs; can we find a better posterior sampling algorithm that is fast and robust in all cases?

There are several reasons for optimism. First, there is the fact that unconditional sampling is possible from approximate smoothed scores; why not posterior sampling? Second, we know that information-theoretically, it is possible: rejection sampling of the unconditional samples (as produced with high fidelity by the diffusion process) is very accurate under fairly minimal assumptions. The only problem is that rejection sampling is slow: you need to sample until you get lucky enough to match on every measurement, which takes time exponential in m. And third, we know that the unsmoothed score of the posterior p(x | y) is computable efficiently from the unsmoothed score of p(x) and the measurement model:
$$\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x).$$
This is sufficient to run Langevin dynamics to sample from p(x | y). Of course, this has the same issues that Langevin dynamics has for unconditional sampling: it can take exponential time to mix, and it is not robust to errors in the score. Diffusion models fix this by using the smoothed score to get robust and fast (unconditional) sampling. It seems quite plausible that a sufficiently clever algorithm could also get robust and fast posterior sampling.
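To make the third point concrete, here is a minimal sketch (ours, not the paper's) of Langevin dynamics on the posterior score for the linear measurement model; `score_fn`, the step size, and the iteration count are placeholder choices, and, as noted above, this sampler comes with no mixing-time or robustness guarantee.

```python
import numpy as np

def langevin_posterior_sample(score_fn, A, y, beta, d, n_steps=10_000, step=1e-4, rng=None):
    """Unadjusted Langevin dynamics targeting p(x | y) for y = Ax + beta * N(0, I_m).

    score_fn(x) should return an estimate of the unsmoothed prior score grad log p(x);
    the posterior score adds the measurement term grad log p(y | x) = A^T (y - Ax) / beta^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(d)
    for _ in range(n_steps):
        posterior_score = score_fn(x) + A.T @ (y - A @ x) / beta**2
        x = x + step * posterior_score + np.sqrt(2 * step) * rng.standard_normal(d)
    return x

# Toy usage: standard Gaussian prior (score_fn(x) = -x), observing the first coordinate.
A, y, beta = np.array([[1.0, 0.0]]), np.array([0.5]), 0.2
x = langevin_posterior_sample(lambda v: -v, A, y, beta, d=2)
```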
Despite these reasons for optimism, in this paper we show that no fast posterior sampling algorithm exists, even given good approximations to the smoothed scores, under the most basic cryptographic assumption that one-way functions exist. In fact, under the further assumption that some one-way function is exponentially hard to invert, there exists a distribution, one for which the smoothed scores are well approximated by a neural network so that unconditional sampling is fast, for which posterior sampling takes time exponential in m. Rejection sampling takes time exponential in m, and so one can no longer hope for much general improvement over rejection sampling.

Precise statements. To more formally state our results, we make a few definitions. We say a distribution is well-modeled if its smoothed scores can be represented by a polynomial-size neural network to polynomial precision:

Definition 1.1 (C-Well-Modeled Distribution). For any constant C > 0, we say a distribution p over $\mathbb{R}^d$ with covariance $\Sigma$ is C-well-modeled by score networks if $\|\Sigma\| \le 1$ and there are approximate scores $\widehat{s}_\sigma$ that satisfy
$$\mathbb{E}_{x \sim p_\sigma}\left[\|\widehat{s}_\sigma(x) - s_\sigma(x)\|^2\right] < \frac{1}{d^C \sigma^2}$$
and can be computed by a poly(d)-parameter neural network with poly(d)-bounded weights, for every $\frac{1}{d^C} \le \sigma \le d^C$.

Throughout our paper we will be comparing similar distributions. We say distributions are $(\tau, \delta)$-close if they are close up to some shift $\tau$ and failure probability $\delta$:

Definition 1.2 ((τ, δ)-Close Distribution). We say the distributions of x and $\widehat{x}$ are $(\tau, \delta)$-close if they can be coupled such that $\Pr[\|x - \widehat{x}\| > \tau] < \delta$.

An unconditional sampler is one that is $(\tau, \delta)$-close to the true distribution.

Definition 1.3 ((τ, δ)-Unconditional Sampler). A $(\tau, \delta)$-unconditional sampler of a distribution D is one whose samples $\widehat{x}$ are $(\tau, \delta)$-close to the true $x \sim D$.

The theory of diffusion models (Chen et al., 2023) says that the diffusion process gives an unconditional sampler for well-modeled distributions that takes polynomial time (with the precise polynomial improved by subsequent work (Benton et al., 2024)).

Theorem 1.4 (Unconditional Sampling for Well-Modeled Distributions). For an O(C)-well-modeled distribution p, the discretized reverse diffusion process with approximate scores gives a $(\frac{1}{d^C}, \frac{1}{d^C})$-unconditional sampler (as defined in Definition 1.3) for any constant C > 0 in poly(d) time.

But what about posterior samplers? We want that, for most measurements y, the conditional distribution is $(\tau, \delta)$-close to the truth:

Definition 1.5 ((τ, δ)-Posterior Sampler). Let D be a distribution over $X \times Y$ with density p(x, y). Let C be an algorithm that takes in $y \in Y$ and outputs samples from some distribution $\widehat{p}_{|y}$ over X. We say C is a $(\tau, \delta)$-posterior sampler for D if, with probability $1 - \delta$ over $y \sim D_Y$, $\widehat{p}_{|y}$ and $p(x \mid y)$ are $(\tau, \delta)$-close.

As described above, we consider the linear measurement model:

Definition 1.6 (Linear Measurement Model). In the linear measurement model with m measurements and noise parameter β, we have for $x \in \mathbb{R}^d$ the measurement $y = Ax + \eta$ for $A \in \mathbb{R}^{m \times d}$ normalized such that $\|A\| \le 1$, and $\eta \sim \beta N(0, I_m)$.

One way to implement posterior sampling is by rejection sampling. As long as the measurement noise β is much bigger than the error $\tau = \frac{1}{\mathrm{poly}(d)}$ from the diffusion process, this is accurate. However, the running time is exponentially large in m:

Theorem 1.7 (Upper Bound). Let C > 1 be a constant. Consider an O(C)-well-modeled distribution and a linear measurement model with $\beta > \frac{1}{d^C}$. When $\delta > \frac{1}{d^C}$, rejection sampling of the diffusion process gives a $(\frac{1}{d^C}, \delta)$-posterior sampler that takes $\mathrm{poly}(d) \cdot (O(1)/\beta)^m$ time.

Our main result is that this is nearly tight:

Theorem 1.8 (Lower Bound). Suppose that one-way functions exist. Then for any $m > d^{0.01}$, there exists a 10-well-modeled distribution over $\mathbb{R}^d$, and a linear measurement model with m measurements and noise parameter $\beta = \Theta(\frac{1}{\log^2 d})$, such that (1/10, 1/10)-posterior sampling requires superpolynomial time in d.

To be a one-way function, inversion must take superpolynomial time on average. It is widely believed, including for problems based on lattices (Aggarwal et al., 2023) and elliptic curves (Zhandry, 2019), that many one-way function candidates need exponential time to invert.
Under the stronger assumption that there exist some one-way functions that require exponential time to invert with non-negligible probability, we can show that posterior sampling takes $2^{\Omega(m)}$ time:

Theorem 1.9 (Lower Bound: Exponential Hardness). Suppose that there exist one-way functions $f : \{\pm 1\}^m \to \{\pm 1\}^m$ that require $2^{\Omega(m)}$ time to invert. Then for any $m \le O(d)$ and C > 1, there exists a C-well-modeled distribution over $\mathbb{R}^d$ and a linear measurement model with m measurements and noise level $\beta = \frac{1}{C^2 \log^2 d}$, such that (1/10, 1/10)-posterior sampling takes at least $2^{\Omega(m)}$ time.

Assuming such strong one-way functions exist, then for the lower bound instance, $2^{\Omega(m)}$ time is necessary and rejection sampling takes $2^{O(m \log \log d)}\,\mathrm{poly}(d)$ time. Up to the $\log \log d$ factor, this shows that rejection sampling is the best one can hope for in general.

Remark 1.10. The lower bound produces a well-modeled distribution, meaning that the scores are representable by a polynomial-size neural network, but there is no requirement that the network be shallow. One could instead consider only shallow networks; the same theorem holds, except that f must also be computable by a shallow network. Many candidate one-way functions can be computed in NC0 (i.e., by a constant-depth circuit) (Applebaum et al., 2004), so the cryptographic assumption is still mild.

2. Related Work

Diffusion models (Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021; Song & Ermon, 2019) have emerged as the most popular approach to deep generative modeling of images, serving as the backbone for the recent impressive results in text-to-image generation (Ramesh et al., 2022; Rombach et al., 2021), along with state-of-the-art results in video (Blattmann et al., 2023; Ho et al., 2022) and audio (Kong et al., 2021; Chen et al., 2021) generation.

Noisy linear inverse problems capture a broad class of applications such as image inpainting, super-resolution, MRI reconstruction, deblurring, and denoising. The empirical success of diffusion models has motivated their use as a data prior for linear inverse problems, without task-specific training. There have been several recent theoretical and empirical works (Jalal et al., 2021a; Chung et al., 2023; Kawar et al., 2021; Trippe et al., 2023a; Song et al., 2023; Kawar et al., 2022; Dou & Song, 2024) proposing algorithms to sample from the posterior of a noisy linear measurement. We highlight some of these approaches below.

Posterior Score Approximation. One class of approaches (Chung et al., 2023; Kawar et al., 2021; Song et al., 2023) approximates the intractable posterior score
$$\nabla \log p_t(x_t \mid y) = \nabla \log p_t(x_t) + \nabla \log p_t(y \mid x_t)$$
at time t of the reverse diffusion process, and uses this approximation to sample. Here, $y = Ax_0 + \eta$ is the noisy measurement of $x_0 \sim p_0$, and $p_t$ is the density at time t. For instance, Chung et al. (2023) propose the approximation $\nabla \log p_t(y \mid x_t) \approx \nabla \log p(y \mid \mathbb{E}[x_0 \mid x_t])$, thereby incurring error quantified by the so-called Jensen gap. Song et al. (2023) propose an approximation based on the pseudoinverse of A, while Kawar et al. (2021) propose to use the score of the posterior with respect to the measurement $y_t$ of $x_t$.

Replacement Method. Another approach, first introduced in the context of inpainting (Lugmayr et al., 2022), replaces the observed coordinates of the sample with a noisy version of the observation during the reverse diffusion process. An extension was proposed for general noisy linear measurements (Kawar et al., 2022). This approach essentially also attempts to sample from an approximation to the posterior.
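As a hedged illustration of the approximation step just described (not the authors' code), the sketch below adds a measurement-consistency gradient at the denoised estimate $\mathbb{E}[x_0 \mid x_t]$ to the unconditional score; the function names are placeholders, and, unlike Chung et al. (2023), it does not differentiate through the denoiser.

```python
import numpy as np

def dps_style_posterior_score(x_t, score_t, denoise, A, y, beta, scale=1.0):
    """Approximate grad log p_t(x_t | y) in the style described above.

    score_t(x_t): estimate of the unconditional smoothed score grad log p_t(x_t).
    denoise(x_t): estimate of E[x_0 | x_t].
    The intractable grad log p_t(y | x_t) is replaced by the measurement-consistency
    gradient at the denoised point; this sketch drops the Jacobian of denoise() for
    brevity, whereas Chung et al. (2023) differentiate through the denoiser.
    """
    x0_hat = denoise(x_t)
    meas_grad = A.T @ (y - A @ x0_hat) / beta**2   # grad_{x0} log p(y | x0) at x0_hat
    return score_t(x_t) + scale * meas_grad

# Toy usage: standard Gaussian prior, x_t = x_0 + sigma_t * noise.
sigma_t, beta = 0.5, 0.1
score_t = lambda x: -x / (1 + sigma_t**2)          # score of N(0, (1 + sigma_t^2) I)
denoise = lambda x: x / (1 + sigma_t**2)           # E[x_0 | x_t] for this prior
A, y, x_t = np.array([[1.0, 0.0]]), np.array([0.7]), np.array([0.2, -0.3])
print(dps_style_posterior_score(x_t, score_t, denoise, A, y, beta))
```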
Particle Filtering. A recent set of works (Trippe et al., 2023a;b; Dou & Song, 2024) makes use of Sequential Monte Carlo (SMC) methods to sample from the posterior. These methods are guaranteed to sample from the correct distribution as the number of particles goes to . Our paper implies a lower bound on the number of particles necessary for good convergence. Assuming one-way functions exist, polynomially many particles are insufficient in general, so that these algorithms takes superpolynomial time; assuming some one-way function requires exponential time to invert, particle filtering requires exponentially many particles for convergence. To summarize, our lower bound implies that these approaches are either approximations that fail to sample from the posterior, and/or suffer from prohibitively large runtimes in general. 3. Proof Overview Lower Bound In this section, we give an overview of the proof of our main Theorem 1.8, which states that there is some well-modeled distribution for which posterior sampling is hard. The full Diffusion Posterior Sampling is Computationally Intractable proof can be found in the Appendix. The core idea of our proof is that any general posterior sampler would imply an algorithm that can invert a one-way function. A one-way function is formally defined as follows: Definition 3.1. A polynomial-time computable function f : { 1, 1} { 1, 1} is one-way if, for any polynomialtime randomized algorithm A, any constant c > 0, and all sufficiently large n, Pr x Un [f(A(f(x))) = f(x)] < n c where Un is the uniform distribution over { 1, 1}n. The function f is defined on inputs of arbitrary length; for inputs of length n it can be assumed to have some fixed polynomial output length m(n). An initial attempt. Suppose we have a one-way function f : { 1, 1}d { 1, 1}d, and consider the distribution that is uniform over (s, f(s)) { 1, 1}2d for all s { 1, 1}d. This distribution is easy to sample from unconditionally: sample s uniformly, then compute f(s). At the same time, posterior sampling is hard: if you observe the last d bits, i.e. f(s), a posterior sample should be from f 1(f(s)); and if f is a one-way function, finding any point in this support is computationally intractable on average. However, it is not at all clear that this distribution is wellmodeled as per Definition 1.1; we would need to be able to accurately represent the smoothed scores by a polynomial size neural network. The problem is that for smoothing levels 1 σ d, the smoothed score can have nontrivial contribution from many different (s, f(s)); so it s not clear one can compute the smoothed scores efficiently. Thus, while posterior sampling is intractable in this instance, it s possible the hardness lies in representing and computing the smoothed scores using a diffusion model, rather than in using the smoothed scores for posterior sampling. However, for smoothing levels σ 1 log d, the smoothed scores are efficiently computable with high accuracy. The smoothed distribution is a mixture of Gaussians with very little overlap, so rounding to a nearby Gaussian and taking its score gives very high accuracy. To design a better lower bound, we modify the distribution to encode f(s) differently: into the phase of the discretization of a Gaussian. At large smoothing levels, a discretized Gaussian looks essentially like an undiscretized Gaussian, and the phase information disappears. 
Thus, at large smoothing levels, the distribution is essentially a product distribution, for which the scores are easy to compute. At the same time, conditioning on the observations still implies inverting f, so conditional sampling is still hard; and it is still the case that the scores at small smoothing levels are efficiently computable.

Based on the above, we define our lower bound instance formally in Section 3.1. Then, in Section 3.2 we sketch a proof of Lemma 3.5, which shows that it is impossible to perform accurate posterior sampling for our instance in polynomial time, under standard cryptographic assumptions. Section 3.3 shows that our lower bound distribution is well-modeled by a small ReLU network, which means that the hardness does not come merely from an inability to represent the scores, and that unconditional sampling is provably efficient. Finally, we put these observations together to show the theorem.

3.1. Lower Bound Instance

We define our lower bound instance here formally. Let $w_\sigma(x)$ denote the density of a Gaussian with mean zero and standard deviation σ, and let $\mathrm{comb}_\varepsilon$ denote the Dirac comb distribution with period ε, given by
$$\mathrm{comb}_\varepsilon(x) = \sum_{k \in \mathbb{Z}} \delta(x - k\varepsilon).$$
For any $b \in \{-1, 1\}$, let $\psi_b$ be the density of a standard Gaussian discretized to multiples of ε, with phase either 0 or ε/2 depending on b:
$$\psi_b(x) \propto w_1(x) \cdot \mathrm{comb}_\varepsilon\!\left(x - \frac{(1-b)\,\varepsilon}{4}\right).$$

Definition 3.2 (Unscaled Lower Bound Distribution). Let $f : \{\pm 1\}^d \to \{\pm 1\}^{d'}$ be a given function. For R > 0 and for any $s \in \{\pm 1\}^d$, define the product distribution $g_s$ over $x \in \mathbb{R}^{d + d'}$ such that
$$x_i \sim w_1(x_i - R\, s_i) \ \text{for } i \le d, \qquad x_i \sim \psi_{f(s)_{i-d}} \ \text{for } i > d.$$
The unconditional distribution g we consider is the uniform mixture of $g_s$ over $s \in \{\pm 1\}^d$. We will have $d' = O(d)$ throughout.

Figure 1 gives a visualization of $g_s$; the final distribution is the mixture of $g_s$ over uniformly random s.

For ease of exposition, we will also define a scaled version of our distribution g such that its covariance Σ has $\|\Sigma\| \le 1$.

Definition 3.3 (Scaled Lower Bound Distribution). Let $\tilde{g}(x) = R^{d+d'} g(Rx)$ be the scaled version of the distribution with density g defined in Definition 3.2. Similarly, let $\tilde{g}_s(x) = R^{d+d'} g_s(Rx)$.

The measurement process then takes a sample $x \sim \tilde{g}$ and computes $Ax + \eta$, where $\eta \sim N(0, \beta^2 I_{d'})$ and $A = [\,0_{d' \times d} \;\; I_{d'}\,]$. That is, we observe the last d' coordinates of x, with variance $\beta^2$ Gaussian noise added to each coordinate.

Figure 1: The distribution $g_s$ has independent coordinates. For any seed $s \in \{\pm 1\}^d$, the first d coordinates are normal distributions whose means are specified by $s_i$, and the last d' coordinates are discretized standard normals whose discretization phase is specified by $f(s)_j$. The full distribution g is the mixture of $g_s$ over all seeds s.

3.2. Posterior Sampling Implies Inversion

Below, we state the main result of this section, and give a sketch of the proof. We show that given any function $f : \{\pm 1\}^d \to \{\pm 1\}^{d'}$, if we can sample from the posterior of the above measurement process, then we can invert f. For the sake of exposition, we assume here that f has unique inverses; a similar argument applies in general. The full proof of this lemma is given in the Appendix.

Lemma 3.4. For any function f, suppose C is a (1/10, 1/10)-posterior sampler in the linear measurement model with noise parameter β for the distribution with density $\tilde{g}$ as defined in Definition 3.3, with $\varepsilon \ge 32\beta \log d$ and $R \ge 32 \log d$. If C takes time T to run, then there exists an algorithm A that runs in time $T + O(d)$ such that
$$\Pr_{s, A}[f(A(f(s))) = f(s)] \ge \frac{3}{10}.$$

Take some $z \in \{\pm 1\}^{d'}$. Our goal is to compute $f^{-1}(z)$, using the posterior sampler for $\tilde{g}$.
To do this, we take a sample zi ψzi N(0, β2) for i {1, . . . , d }, and feed in z into our posterior sampler, to output bx. We then take the first d bits of bx, round each entry to the nearest 1, and output the result. To see why this works, let s analyze what the resulting conditional distribution looks like. First, note that any sample x eg encodes some (s, f(s)) coordinate-wise so that the encoding of f(s) is one of two discretizations of a normal distribution, with width ε, offset by ε/2 from each other (see Figure 1). Furthermore, since β ε, these two encodings are distinguishable with high probability even after adding noise with variance β2. Therefore, with high probability, our sample z, which is a noised and discretized encoding of the input z we want to invert, will be such that each coordinate is within ε/4 of the correct discretization. Consequently, a posterior sample with this observation will correspond to an encoding of (s, f(s)) where s = f 1(z), with high probability. The first d bits of this encoding are just the bits of f 1(z) smoothed by a gaussian with variance 1/R2, and since R 1, rounding these coordinates to the nearest 1 returns f 1(z), with high probability. So, we showed how to invert an arbitrary f using a posterior sampler. The runtime of this procedure was just the runtime of the posterior sampler, along with some small overhead. In particular, if f were a one-way function that takes superpolynomial time to invert, posterior sampling must take superpolynomial time. Formally, we show the following: Lemma 3.5. Suppose m d0.01 and one-way functions exist. Then, for eg as defined in Definition 3.3 with ε = 1 C log d and R = C log d, and linear measurement model with noise parameter β = 1 C2 log2 d and measurement matrix A Rm d, ( 1 10)-posterior sampling takes superpolynomial time. One minor detail is that a one-way function is defined to map {0, 1}n {0, 1}n for an unconstrained n , while we want one that maps {0, 1}d m {0, 1}m. Standard arguments imply that we can get such a function from the assumption; see Section G for details. 3.3. Re LU Approximation of Lower Bound Score We have shown that our (scaled) lower bound distribution eg (as defined in Definition 3.3) is computationally intractable to sample from. Now, we sketch our proof showing that eg is well-modeled: the σ-smoothed scores are well approximated by a polynomially bounded Re LU network. The main result of this section is the following. Diffusion Posterior Sampling is Computationally Intractable (a) l1 is unbounded and has an unbounded number of pieces (b) l is bounded, has a small number of pieces. Figure 2: Piecewise-Linear Approximations of Score sσ Corollary 3.6 (Lower Bound Distribution is Well-Modeled). Let C be a sufficiently large constant. Given a Re LU network f : { 1}d { 1}d with poly(d) parameters bounded by poly(d) in absolute value, the distribution eg defined in Definition 3.3 for R = C log d and 1 poly(d) < ε < 1 C log d, is O(C)-well-modeled. To show this, we will first show that the unscaled distribution g has a score approximation representable by a and polynomially bounded Re LU net. Rescaling by a factor of R = C log d then shows the above. Notation. We will let h be the σ-smoothed version of g, and hr be the σ-smoothed version of gr. Strategy. We will first show how to approximate the score of any σ-smoothed product distribution using a polynomialsize Re LU network with polynomially bounded weights in our dimension d, 1 γ for L2 error γ2. 
Then, we will observe that when σ is large, so that poly(d) σ ε log d, h becomes very close to a mixture of (1 + σ2)Id+d -covariance Gaussians placed at the vertices of a scaled hypercube (in the first d coordinates). Since this is a product distribution, we can represent its score using our Re LU construction. On the other hand, when σ is small, for R log d and 1 poly(d) σ R log d, the score of h at any point x is well approximated by the distribution hr, where r { 1}d represents the orthant containing the first d coordinates of x. Since hr is a product distribution, our Re LU construction applies. Finally, we set R log d so that for any 1 poly(d) σ poly(d), there is a polynomially bounded Re LU net that approximates the score of h. We now describe each of these steps in more detail. 3.3.1. RELU APPROXIMATION FOR SCORE OF PRODUCT DISTRIBUTION We will show first how to construct a Re LU network approximating the score of a one-dimensional distribution the construction generalizes to product distributions in a straightforward way. Consider any one-dimensional distribution p with σsmoothed version pσ, and corresponding score sσ. Suppose pσ has standard deviation m2. We will first construct a piecewise-linear function l that approximates sσ in L2. Since sσ is σ-smoothed, its value does not change much in most σ-sized regions. More precisely, Lemma H.1 shows that sup |c| σ s σ(x + c)2 # This immediately gives a piecewise linear-approximation l1 with O(γσ2)-width pieces: By Taylor expansion, we can write any sσ(x) = sσ(αx) + (x αx)s σ(ξ) for some ξ between αx and x. Then, if αx is the largest discretization point smaller than x (so that |x αx| γσ2), this gives that E (sσ(x) sσ(αx))2 γ2σ4 E[sup c s σ(x + c)2] γ2 So, we can approximate every sσ(x) with sσ(αx), yielding a piecewise-constant approximation. Then, we can similarly obtain another piecewise-constant approximation by replacing sσ(x) with sσ(βx) for βx the smallest discretization point larger than x. By convexity, we can linearly interpolate between sσ(αx) and sσ(βx) to obtain our piecewise-linear approximation l1 (see Fig. 2). Unfortunately, l1 suffers from two issues: 1) It is potentially unbounded, and 2) It has an unbounded number of pieces. For 1), since sσ is σ-smoothed, it is bounded by with high probability, so that we can ensure that our approximation Diffusion Posterior Sampling is Computationally Intractable is also bounded without increasing its error much. For 2), since pσ has standard deviation m2, Chebyshev s inequality gives that the total probability outside a radius m2 γσ2 region is small, so that we can use a constant approximation outside this region. This allows us to bound the number of pieces by poly m2 γσ , yielding our final approximation l. As is well-known, such a piecewise linear function can be represented using a Re LU network with poly m2 ters, and each parameter bounded by poly m2 γσ in absolute value. For product distributions, we simply construct Re LU networks for each coordinate individually, and then append them, for bounds polynomial in d and 1/σ, 1/γ and m2. In the remaining proof, whenever this construction is used, all these parameters are set to polynomial in d, for final bounds poly(d). 3.3.2. RELU APPROXIMATION FOR LARGE σ ψ1 N(0, σ2) ψ 1 N(0, σ2) Figure 3: ψ1 and ψ 1 are discretized Gaussians with discretization width ε and phase 0 and ε/2 respectively. If we convolve with N(0, σ2), we get a distribution close to Gaussian when σ ε for each of ψ1, ψ 1. 
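The closeness of the smoothed $\psi_1$ and $\psi_{-1}$ can be checked numerically; the small experiment below (ours, with arbitrary parameter choices) builds the two ε-discretized Gaussians of Figure 3 on a grid, convolves them with $N(0, \sigma^2)$, and compares them for σ on the order of ε and for σ much larger than ε.

```python
import numpy as np

def smoothed_discretized_gaussian(phase, eps, sigma, grid):
    """Density of a standard Gaussian discretized to multiples of eps with the
    given phase, then convolved with N(0, sigma^2), evaluated on the grid."""
    pts = np.arange(-200, 201) * eps + phase    # comb support points
    weights = np.exp(-pts**2 / 2)               # standard Gaussian mass at each spike
    weights /= weights.sum()
    # mixture of N(pt, sigma^2) evaluated on the grid
    dens = (weights[None, :] * np.exp(-(grid[:, None] - pts[None, :])**2
                                      / (2 * sigma**2))).sum(axis=1)
    return dens / (np.sqrt(2 * np.pi) * sigma)

eps = 0.05
grid = np.linspace(-4, 4, 2001)
for sigma in [eps / 2, 10 * eps]:
    p_plus = smoothed_discretized_gaussian(0.0, eps, sigma, grid)      # psi_{+1}, phase 0
    p_minus = smoothed_discretized_gaussian(eps / 2, eps, sigma, grid)  # psi_{-1}, phase eps/2
    print(sigma, np.max(np.abs(p_plus - p_minus)))   # tiny once sigma >> eps
```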
Note that our lower bound distribution g is such that the first d coordinates are simply a mixture of Gaussians placed on the vertices of a (scaled) hypercube, while the last d coordinates are discretized Gaussians ψ1 or ψ 1, with the choice of discretization depending on the first d coordinates. The only reason g is not already a product distribution is that ψ1 and ψ 1 are different. But for smoothing σ ε log d, a Fourier argument shows that the smoothed versions of ψ1 and ψ 1 are polynomially close to each other. See Figure 3 for an illustration. 3.3.3. RELU APPROXIMATION FOR SMALL σ When σ R log d and R log d, consider the density h(x) for x1,...,d lying in the orthant identified by r { 1}d. Recall that s { 1}d hs(x) where hs is the product distribution that is Gaussian with mean R si in the first d coordinates and is a smoothed discretized Gaussian with mean 0 in the remaining d coordinates. We first show that h(x) is approximated by hr(x) 2d up to small additive error. This is because every hs has radius at most 1 + σ2 R log d and there are d k points s = r with the mean of hs at least Ω( k R) away from x. So, the total contribution of all the terms involving hs(x) to h(x) for s = r is at most 1 2d 1 poly(d). We can show that h(x) is approximated by hr(x) 2d in L2 up to similar additive error in an analogous way. We then show that the score of hr serves as a good approximation to the score of h for all such points x such that x1,...,d lies in the orthant identified by r. For x close to the mean of hr (to within R/10, say), the above gives that h(x) is approximated up to multiplicative error by hr(x) 2d , and h(x) is approximated up to multiplicative error by hr(x) 2d . Together, this gives that the score of h at x, h(x) h(x) is approximated by the score of hr at x up to 1 poly(d) error. On the other hand, for x far from the mean of hr, since the density itself is small, the total contribution of such points to the score error is negligible. Since the score of h is well-approximated by the score of hr, and hr is a product distribution, we can essentially use our Re LU construction for product distributions to represent its score, after using a small gadget to identify the orthant that x1,...,d lies in. 3.4. Putting it all Together Lemma 3.5 shows that it is computationally hard to sample from eg from the posterior of a noisy linear measurement when f is a one-way funciton, while Corollary 3.6 shows that eg has score that is well-modeled by a Re LU network when f can be represented by a polynomial-sized Re LU network. In Section G, we show that any one-way function can be represented using a polynomial-sized Re LU network. Thus, together, these imply our lower bound, Theorem 1.8. Essentially the same argument holds under the stronger guarantee that there exists a one-way function that takes exponential time to invert, for a lower bound exponential in the number of measurements m. 4. Proof Overview - Upper Bound In this section, we sketch the proof of Theorem 1.7 in Section E: the time complexity of posterior sampling by rejection sampling (Algorithm 1). For ease of discussion, we only consider the case when δ = Θ(1). The proof overview below will repeatedly refer to events as occurring with arbitrarily high probability ; this means the statement is true for every constant probability p < 1. 
(Usually there will be a setting of constants in big-O notation nearby that depends on p.)

Algorithm 1 Rejection Sampling Algorithm
1: while True do
2:   Sample $x \sim D_x$
3:   Compute $q := e^{-\|Ax - y\|^2 / (2\beta^2)}$ (proportional to p(y | x))
4:   Generate a random number $r \sim U(0, 1)$
5:   if r < q then
6:     return x
7:   end if
8: end while

Sampling Correctness With Ideal Sampler. To illustrate the idea of the proof, we first focus on the scenario where we can sample from the distribution of x perfectly. We aim to show that rejection sampling perfectly samples x | y. To prove the correctness of Algorithm 1, noting that each round is independent, it suffices to verify that each round outputs x with probability density proportional to p(x | y). We have
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \propto p(y \mid Ax)\, p(x) \propto e^{-\frac{\|Ax - y\|^2}{2\beta^2}}\, p(x).$$
Therefore, with a perfect unconditional sampler for $D_x$ (sampling x according to density p(x)), rejection sampling perfectly samples x | y.

Running time. Now we show that for linear measurements $y = Ax + \beta N(0, I_m)$, with arbitrarily high probability over $x \sim D$, the acceptance probability per round is at least $\Theta(\beta)^m$; this implies the algorithm terminates in $(O(1)/\beta)^m$ rounds with arbitrarily high probability. For a given y, the acceptance probability per round is equal to
$$\mathbb{E}_{x \sim D_x}\!\left[e^{-\frac{\|Ax - y\|^2}{2\beta^2}}\right] \ge \Pr_{x}\!\left[\|Ax - y\| \le O(\beta\sqrt{m})\right] \cdot e^{-O(m)}.$$

We first focus on the case when m = 1. We aim to show that with arbitrarily high probability over y,
$$\Pr_x\left[|Ax - y| \le O(\beta)\right] \gtrsim \beta.$$
For well-modeled distributions, the covariance matrix of x has constant singular values. Then with arbitrarily high probability, x is O(1) in each direction. Since every singular value of A is at most 1, the projection Ax onto $\mathbb{R}$ will lie in [-C, +C] for some constant C with arbitrarily high probability. We divide [-C, +C] into $N = \frac{2C}{\beta}$ segments of length β, forming the set S. Now we only need to prove that with arbitrarily high probability over y, there exists a segment $\theta \in S$ satisfying: for all $x \in \theta$, $|x - y| \le O(\beta)$; and $\Pr_{x \sim D_x}[Ax \in \theta] \gtrsim \beta$.

For any constant c > 0, define
$$S^* := \left\{\theta \in S \;\middle|\; \Pr_{x \sim D_x}[Ax \in \theta] > \frac{c}{N}\right\}.$$
Each segment in $S^*$ has probability at least $\Omega(1/N) \gtrsim \beta$ of being hit. Therefore, we only need to prove that, with arbitrarily high probability, $y = Ax + \eta$ satisfies these two independent events simultaneously: (1) Ax lands in some segment $\theta \in S^*$; (2) $|\eta| \le O(\beta)$. By a union bound, the probability that Ax lies in a segment in $S \setminus S^*$ is at most $N \cdot \frac{c}{N} = c$. For sufficiently small c, combining with the fact that Ax lands in some segment of S with arbitrarily high probability, we have (1) with arbitrarily high probability. Since $\eta \sim N(0, \beta^2)$, by the concentration of the Gaussian distribution, (2) is satisfied with arbitrarily high probability.

For the general case when m > 1, with arbitrarily high probability, Ax will lie in $\{x \in \mathbb{R}^m \mid \|x\| \le C\sqrt{m}\}$ for some C > 0. Instead of segments, we use $N = (O(1)/\beta)^m$ balls of radius β to cover $\{x \in \mathbb{R}^m \mid \|x\| \le C\sqrt{m}\}$. Following a similar argument, we can prove that with arbitrarily high probability over y,
$$\Pr_x\left[\|Ax - y\| \le O(\beta\sqrt{m})\right] \ge \Theta(\beta)^m.$$

Diffusion as unconditional sampler. In practice, we do not have a perfect sampler for $D_x$. Theorem 1.4 states that for O(C)-well-modeled distributions, the diffusion model gives an unconditional sampler from an approximating distribution $\widehat{D}_x$, meaning there exists a coupling between $x \sim D_x$ and $\widehat{x} \sim \widehat{D}_x$ such that, with arbitrarily high probability, $\|x - \widehat{x}\| \le 1/d^{2C}$. For $(x, \widehat{x})$ drawn from this coupling, we know from our previous analysis that rejection sampling based on x is correct.
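For concreteness, here is a minimal runnable sketch of Algorithm 1 assuming a perfect unconditional sampler; the helper name `sample_prior` and the toy Gaussian prior are ours. With a perfect prior sampler each accepted draw is distributed exactly as p(x | y), and the number of rounds grows like $(O(1)/\beta)^m$, matching Theorem 1.7.

```python
import numpy as np

def rejection_posterior_sample(sample_prior, A, y, beta, rng=None, max_rounds=10**7):
    """Algorithm 1: accept x ~ prior with probability exp(-||Ax - y||^2 / (2 beta^2))."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_rounds):
        x = sample_prior(rng)
        q = np.exp(-np.sum((A @ x - y) ** 2) / (2 * beta**2))
        if rng.uniform() < q:
            return x
    raise RuntimeError("no sample accepted; expect roughly (O(1)/beta)^m rounds")

# Toy usage: standard Gaussian prior in d = 2, observing the first coordinate.
d, beta = 2, 0.3
A = np.array([[1.0, 0.0]])
y = np.array([0.8])
x = rejection_posterior_sample(lambda rng: rng.standard_normal(d), A, y, beta)
```

Shrinking β or adding more measurement rows makes the exponential-in-m growth of the round count directly visible in this toy setup.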
But the algorithm only knows bx, which changes its behavior in two ways: (1) it chooses to accept based on p(y | bx) rather than p(y | x), and (2) it returns bx rather than x on acceptance. The perturbation from (2) is easily within our tolerance, since it is 1 d2C close to x with arbitrarily high probability. For (1), we can show when x and bx are close, these two probabilities are nearly the same. When x bx 1 d2C o(β/ m), we have log p(y | bx) 2β2 Abx y 2 This implies that p(y | bx) = (1 o(1))p(y | x) and proves Theorem 1.7. Diffusion Posterior Sampling is Computationally Intractable 5. Conclusion and Future Work We have shown that one cannot hope for a fast general algorithm for posterior sampling from diffusion models, in the way that diffusion gives general guarantees for unconditional sampling. Rejection sampling, slow as it may be, is about the fastest one can hope for on some distributions. However, people run algorithms that attempt to approximate the posterior sampling every day; they might not be perfectly accurate, but they seem to do a decent job. What might explain this? Given our lower bound, a positive theory for posterior sampling of diffusion models must invoke distributional assumptions on the data. Our lower bound distribution is derived from a one-way function, and not very nice . It would be interesting to identify distributional properties under which posterior sampling is possible, as well as new algorithms that work under plausible assumptions. Acknowledgements We thank Xinyu Mao and David Wu for discussions on cryptographic assumptions. SG, AP, EP and ZX are supported by NSF award CCF-1751040 (CAREER) and the NSF AI Institute for Foundations of Machine Learning (IFML). AJ is supported by ARO 051242-002. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Aggarwal, D., Bennett, H., Brakerski, Z., Golovnev, A., Kumar, R., Li, Z., Peters, S., Stephens-Davidowitz, N., and Vaikuntanathan, V. Lattice problems beyond polynomial time. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, pp. 1516 1526, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450399135. doi: 10.1145/3564246.3585227. URL https://doi. org/10.1145/3564246.3585227. Applebaum, B., Ishai, Y., and Kushilevitz, E. Cryptography in NC0. In 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 166 175, 2004. doi: 10.1109/FOCS.2004.20. Benton, J., Bortoli, V. D., Doucet, A., and Deligiannidis, G. Nearly d-linear convergence bounds for diffusion models via stochastic localization, 2024. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., and Rombach, R. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. Block, A., Mroueh, Y., and Rakhlin, A. Generative modeling with denoising auto-encoders and langevin sampling. Ar Xiv, abs/2002.00107, 2020. URL https: //api.semanticscholar.org/Corpus ID: 211010797. Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=Ns MLjc Fa O8O. Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. 
Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/ forum?id=zy LVMgs Z0U_. Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=On D9z GAGT0k. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8780 8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 49ad23d1ec9fa4bd8d77d02681df5cfa-Paper. pdf. Dou, Z. and Song, Y. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=tpl XNc HZs1. Gupta, S., Lee, J., Price, E., and Valiant, P. Finite-sample maximum likelihood estimation of location. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 30139 30149. Curran Associates, Inc., 2022. Gupta, S., Parulekar, A., Price, E., and Xun, Z. Sampleefficient training for diffusion, 2023. Diffusion Posterior Sampling is Computationally Intractable Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 4c5bcfec8584af0d967f1ab10179ca4b-Paper. pdf. Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models, 2022. Jalal, A., Arvinte, M., Daras, G., Price, E., Dimakis, A. G., and Tamir, J. Robust compressed sensing mri with deep generative priors. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 14938 14954. Curran Associates, Inc., 2021a. URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 7d6044e95a16761171b130dcb476a43e-Paper. pdf. Jalal, A., Karmalkar, S., Hoffmann, J., Dimakis, A., and Price, E. Fairness for image generation with uncertain sensitive attributes. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 4721 4732. PMLR, 18 24 Jul 2021b. URL https://proceedings.mlr. press/v139/jalal21b.html. Kawar, B., Vaksman, G., and Elad, M. SNIPS: Solving noisy inverse problems stochastically. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum? id=p BKOx_dx YAN. Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=kx Xvopt9p WK. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. 
In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=a-x FK8Ymz5J. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28, 10 2000. doi: 10.1214/aos/1015957395. Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., and Gool, L. V. Repaint: Inpainting using denoising diffusion probabilistic models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11451 11461, 2022. URL https://api.semanticscholar. org/Corpus ID:246240274. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. 2018. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents, 2022. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. Co RR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256 2265, Lille, France, 07 09 Jul 2015. PMLR. URL https://proceedings. mlr.press/v37/sohl-dickstein15.html. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=St1giar CHLP. Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/ forum?id=9_gs MA8MRKQ. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Neural Information Processing Systems, 2019. URL https: //api.semanticscholar.org/Corpus ID: 196470871. Trippe, B. L., Wu, L., Naesseth, C. A., Blei, D., and Cunningham, J. P. Practical and asymptotically exact conditional sampling in diffusion models. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023a. URL https://openreview. net/forum?id=r9s3Gbxz7g. Trippe, B. L., Yim, J., Tischer, D., Baker, D., Broderick, T., Barzilay, R., and Jaakkola, T. S. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. In The Eleventh International Conference on Learning Representations, Diffusion Posterior Sampling is Computationally Intractable 2023b. URL https://openreview.net/forum? id=6Tx Bxq NME1Y. Zhandry, M. The magic of elfs. Journal of Cryptology, 32: 825 866, 2019. Diffusion Posterior Sampling is Computationally Intractable A. Lower Bound instance We first define our Lower Bound Distribution g (up to scaling). Let wσ(x) denote the density of a Gaussian with mean zero and standard deviation σ, and let combε denote the Dirac Comb distribution with period ε, given by For any b { 1, 1}, let ψb be the density of a standard Gaussian discretized to multiples of ε, with phase either 0 or ε 2 depending on b: ψb(x) w1(x) combε Definition 3.2 (Unscaled Lower Bound Distribution). Let f : { 1}d { 1}d be a given function. For R > 0 and for any s { 1}d, define the product distribution gs over x Rd+d such that xi w1(xi R si) for i d xi ψf(s)i d for i > d. The unconditional distribution g we consider is the uniform mixture of gs over s { 1}d. 
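To make the construction concrete, the sketch below (ours, with a toy stand-in for f and arbitrary parameters R, ε) draws from $g_s$ as in Definition 3.2 and applies the decoders Round_R and Bits_ε defined in Section B.1: the last d' coordinates of a sample determine f(s) exactly, so recovering the first d coordinates from a measurement of the last d' ones amounts to inverting f.

```python
import numpy as np

def sample_gs(s, f, R, eps, rng):
    """One draw from g_s: first d coords are N(R * s_i, 1); last d' coords are a
    standard Gaussian rounded to multiples of eps with phase 0 (bit +1) or eps/2
    (bit -1) chosen by f(s). For small eps this closely approximates psi_b."""
    fs = f(s)                                   # in {-1, +1}^{d'}
    first = R * s + rng.standard_normal(len(s))
    z = rng.standard_normal(len(fs))
    phase = np.where(fs == 1, 0.0, eps / 2)
    last = np.round((z - phase) / eps) * eps + phase
    return np.concatenate([first, last])

# Decoders used in the reduction (Appendix B notation):
round_R = lambda v, R: np.where(v >= 0, R, -R)                               # Round_R
bits_eps = lambda v, eps: np.where(np.round(v / (eps / 2)) % 2 == 0, 1, -1)  # Bits_eps

rng = np.random.default_rng(0)
R, eps = 10.0, 0.05
f = lambda s: s[::-1]            # toy stand-in for the one-way function
s = rng.choice([-1, 1], size=8)
x = sample_gs(s, f, R, eps, rng)
assert np.all(round_R(x[:8], R) == R * s) and np.all(bits_eps(x[8:], eps) == f(s))
```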
We define our final Lower Bound distribution below, which is a scaled version of g. Definition 3.3 (Scaled Lower Bound Distribution). Let eg(x) = Rd+d g(R x) be the scaled version of the distribution with density g defined in Definition 3.2. Similarly, let egs = Rd+d gs(R x). B. Lower Bound Posterior Sampling implies Inversion of One-Way Function B.1. Notation Let l := [d] = {1, 2, 3, . . . , d}, and let r := {d + 1, d + 2, . . . , d + d }, so that for any x Rd+d , x[:d] Rd is a vector containing the first d entries of x, and x[ d :] Rd is a vector containing the last d entries of x. Let Round R : Rk { R}k be such that for all i [k], Round R(x)i = arg min v { R} |xi v| . Let parity : Z { 1, +1} be such that parity(2i) = 1, parity(2i + 1) = 1 for all i Z. Let Bitsε : Rk { 1}k be such that for all i [k], (Bitsε(y))i = parity arg min i Z This function takes a value y and returns a guess for whether y comes from a smoothed distribution discretized to even multiples of ε/2 or odd multiples of ε/2, based on which is closer. Definition B.1 (Conditional Distribution). Let g be the distribution defined in 3.2, parameterized by a function f, and real values R, ε > 0. For some noise pdf h, we define X h f,R,ε to be the distribution over (x, y) where x g and y x[ d :] + h. We also explicitly define the two noise models we will be using for the lower bound: we take X β f,R,ε := X wβ f,R,ε, wβ = N(0, β2). (1) Let (X β f,R,ε)y denote the marginal over y. Further, X β,βmax f,R,ε := X b f,R,ε where b is a clipped normal distribution: b := clip(βmax, N(0, β2)). Diffusion Posterior Sampling is Computationally Intractable B.2. Inverting f via Posterior Sampling Lemma B.2. Let βmax ε/4 and q Pr xb,yb X β,βmax f,R,ε h f(Round R(xb [:d])) = Bitsε(yb) i 1 δ Proof. Let xb, yb X β,βmax f,R,ε . By definition, we know that yb xb [ d :] + clip(βmax, N(0, β2). Further, for all indices i, (xb [ d :])i = jε/2 for some integer j. So, if βmax ε/4, then Bitsε(yb) = Bitsε(xb [ d :]). (2) We know that xb is drawn from a uniform mixture over gs(x), as defined in 3.2. So, fixing an s { 1}d. We have that Bitsε(xb [ d :]) = s. (3) On the other hand, x[:d] is a product of gaussians centered at Rsi in the ith coordinate. Therefore, for all i < d, " (xb [:d])i Rsi δ R/4, we get that h Round R(xb [:d]) = s i 1 δ. (4) Putting together Equation (2), Equation (3), and Equation (4) we get h Round R(xb [:d]) = Bitsε(yb) i 1 δ Lemma B.3. Let C be a (τ, δ)-conditional sampling algorithm for X β f,R,ε. If ε β q δ , τ R/4, and 32 log d then for y (X β f,R,ε)y and bx C(y), Pr[f(Round R(bx[:d])) = Bitsε(y)] 5δ. Proof. Let X β f,R,ε have pdf pβ. Assume we have a (τ, δ)-posterior sampler over X β f,R,ε that outputs sample from distribution b X with distribution bp. This means that with probability 1 δ over y, there exists a coupling P over (x, bx) such that (x, bx) are (τ, δ)-close. Therefore, there exists a distribution P over (x, bx, y) Rd+d Rd+d Rd with density p P such that p P(x, y) = pβ(x, y), p P(bx | y) = bp(bx | y), and Pr x,bx P [ x bx 2 τ] 1 2δ. Now, let X β,βmax f,R,ε have pdf pβ,βmax, with βmax = β q δ . 
We have TV (X β f,R,ε, X β,βmax f,R,ε ) e β2 max/2β2 δ Therefore, building on P, we can construct a new distribution P over (x, bx, xb, y, yb) Rd+d Rd+d Rd+d Rd Rd with density p P such that p P (x, y) = pβ(x, y), p P (bx | y) = bp(bx | y), p P (xb, yb) = pβ,βmax(xb, yb), (x, y) = (xb, yb) with probability 1 δ, and Pr x,bx P [ x bx 2 τ] 1 2δ Diffusion Posterior Sampling is Computationally Intractable Therefore, under this distribution, Pr bx,xb P bx xb 2 τ 1 3δ In particular, we apply the fact that bx[:d] xb [:d] bx xb bx xb 2 to get h bx[:d] xb [:d] τ i 1 3δ. (5) Now, by the definition of X β,βmax f,R,ε , for all i < d, xb i is a mixture of variance 1 normal distributions centered at R. So, for any i < d, " xb i Round R(xb i) Applying a union bound over i [d] and putting this together with Equation (5), bx[:d] Round R(xb [:d]) So, since q 2 , and Round R((xb [:d])i) R, we have Round R(bx[:d]) Round R(xb [:d]) Again, the output of Round R is always R, so this means h Round R(bx[:d]) = Round R(xb [:d]) i 1 3δ Now, by Lemma B.2, since βmax < ε/4 and R q δ , we have h f(Round R(xb [:d])) = Bitsε(yb) i 1 δ Therefore, Pr bx,yb P f(Round R(bx[:d])) = Bitsε(yb) 1 4δ Finally, we know that y = yb with probability 1 δ. Therefore, we get Pr bx,y P f(Round R(bx[:d])) = Bitsε(y) 1 5δ Theorem B.4. For any function f, let C be a (R/4, δ)-posterior sampler (1.5) for X β f,R,ε, as defined in (1), with ε δ , and R q δ , that takes time T to run. Then, there exists an algorithm A that runs in time T + O(d) such that Pr s,A[f(A(f(s))) = f(s)] 6δ Proof. Sample y hf(r), where ( (w1(y) combε(y))) N(0, β2), si = 1 w1(y) combε y ε 2 N(0, β2) si = 1 Diffusion Posterior Sampling is Computationally Intractable Now, since β ε δ , each coordinate of the noise, drawn from N(0, β2), is less than ε/4 with probability 1 δ/d. Pr [Bitsε(y) = f(r)] 1 δ By definition, hs is the same as the density of (X β f,R,ε)y. So, by Lemma B.3, since R τ/4, R q δ , and we take ˆx C(y), we have Pr bx,y f(Round R(bx[:d])) = Bitsε(y) 5δ Pr bx,y f(Round R(bx[:d])) = f(r) 6δ So, our algorithm A can output Round R(bxl). All we had to do to run this algorithm was to sample d normal random variables, and then run our posterior sampler. This takes T + O(d) time. Lemma 3.4. For any function f, suppose C is an (1/10, 1/10)-posterior sampler in the linear measurement model with noise parameter β for distribution with density eg as defined in Definition 3.3, with ε β 32 log d and R 32 log d. If C takes time T to run, then there exists an algorithm A that runs in time T + O(d) such that Pr s,A[f(A(f(s))) = f(s)] 3 Proof. This follows from Theorem B.4, using the fact that after rescaling down by R, X β f,R,ε as a distribution over (x, y) is the same distribution as x eg, with y = Ax + N(0, β2). B.3. Inverting a One-Way function via Posterior Sampling Lemma 3.5. Suppose m d0.01 and one-way functions exist. Then, for eg as defined in Definition 3.3 with ε = 1 C log d and R = C log d, and linear measurement model with noise parameter β = 1 C2 log2 d and measurement matrix A Rm d, ( 1 10)-posterior sampling takes superpolynomial time. Proof. When m > d/2, we can add an arbitrary number of dummy observations which always observes 0. Posterior sampling in this instance is identical to only observing the first d/2 coordinates. Therefore, we only need to consider the case when m d/2. When d0.01 < m < d/2, d and m are only polynomially separated. So, by G.1, we can construct a one-way function f : { 1}d m { 1}m. 
By definition, we can see that eg, with measurement noise β is the same distribution as X βR f,R,ε, scaled down by R. Therefore, by Theorem B.4, since R 32 q δ , if we can run a posterior sampler in time T, we can invert f with probability 1 6δ in time T + O(m). So, if f takes time superpolynomial in m to invert, then T + O(m) is superpolynomial. Since m > d0.01, this means that T itself is superpolynomial in d. Lemma B.5. Suppose that there exist one-way functions f : { 1}m { 1}m that require 2Ω(m) time to invert. Then, for any m = O(d), for eg as defined in Definition 3.3 with ε = 1 C log d and R = C log d, and linear measurement model with noise parameter β = 1 C2 log2 d and measurement matrix A Rm d, ( 1 10)-conditional sampling takes at least 2Ω(m) Proof. Similar to the proof of Lemma 3.5, we only need to consider the case when m d/2. By definition, we can see that eg, with measurement noise β is the same distribution as X βR f,R,ε, scaled down by R. Therefore, by Theorem B.4, since R 32 log d, ε βR log d, if we can run a posterior sampler in time T, we can invert f with probability 0.4 in time T + O(m). So, if f takes at least time 2Ω(m) to run, then we must have T + O(m) 2Ω(m), which means T 2Ω(m). Diffusion Posterior Sampling is Computationally Intractable C. Lower Bound Re LU Approximation of Score C.1. Piecewise Linear Approximation of σ-smoothed score in One Dimension In this section, we analyze the error of a piecewise linear approximation to a smoothed score. We first show that for one dimensional distributions, we can get good approximations, and later extend it to product distributions in higher dimensions. First, we show that a piecewise linear approximation that discretizes the space into intervals of width γ has low error. Lemma C.1. Let p be a distribution over R, and let pσ = p N(0, σ2) have score sσ. Let γ σ, and let Si = [iγ, (i+1)γ) for all i Z. Define a piecewise linear function f : R R so that: for all x, if i is such that Si x, then f(x) = ((i + 1)γ x) s(iγ) + (x iγ) s((i + 1)γ) Then f is continuous and satisfies E (s(x) f(x))2 γ2 Proof. Define the left and right piecewise constant approximations l(x) = s(iγ), r(x) = s((i + 1)γ) for all x Si. We know that for any y Si, there is some y [iγ, y] such that s(y) = s(iγ) + (y iγ)s (y ). So, we get y Si, s(y) s(iγ) + γ sup z Si s (z) s(iγ) + γ sup |c| γ s (y + c). E x p (sσ(x) l(x))2 γ2 E x p[ sup |c| γ s (y + c)2] γ2 By Lemma H.1. The same holds for r(x). Now, recall that f satisfies i Z, x Si, f(x) = (i + 1)γ x γ s(iγ) + x iγ γ s((i + 1)γ). The coefficients (i+1)γ x γ sum to 1 and are within the interval [0, 1]. So, at each point, f is just a convex combination of the two approximations l and r. Therefore, by convexity, for any Si, if x Si, E x Si[(sσ(x) f(x))2] E x Si[(sσ(x) l(x))2] + E x Si[(sσ(x) r(x))2] (6) This immediately gives us that E[(sσ(x) f(x))2] E[(sσ(x) l(x))2] + E[(sσ(x) r(x))2] ε2 Within each interval, the function is linear and so it is continous. We just need to check continuity at the endpoints. However, we can see that for any i Z, limx iγ = limx iγ+ = s(iγ), and so we also have continuity. Unfortunately, the above approximation has an infinite number of pieces. To handle this, we show that in regions far away from the mean, a zero-approximation is good enough, given that the distribution has bounded second moment m2. Lemma C.2. Let p be some distribution over R with mean µ, and let pσ = p N(0, σ2) have score sσ. Let m2 2 := Ex p (x µ)2 be the second moment of pσ. 
Further, let |φ| 1 δ be some constant. Then, E h (sσ(x) φ)2 1|x µ|> m2 Diffusion Posterior Sampling is Computationally Intractable Proof. We have E h (sσ(x) φ)2 1|x µ|> m2 i E h sσ(x)2 1|x µ|> m2 i + E h φ2 1|x µ|> m2 First, by Chebyshev s inequality, we know that Pr h |x µ| m2 Now, we use Cauchy Schwarz to bound the first term: E h sσ(x)2 1|x µ|> m2 E [sσ(x)4] E h 1|x µ|> m2 E [sσ(x)4] Pr |x µ| m2 E [sσ(x)4] δ p where the last line is by Corollary H.8. Finally, for the second term, we know that E h φ2 1|x µ|> m2 δ 1|x µ|> m2 δ Pr |x µ| > m2 The last line here uses the fact that for all x, x log2(1/x) 3 x. Summing the two terms gives the desired result. Then, we show that neighborhoods where the magnitude of the score can be large are rare and can also be approximated by the zero function. This allows us to control the slope of the piecewise linear approximation in each piece. Lemma C.3. Let p be a distribution over R. Let pσ = p N(0, σ2) have score sσ. Let γ σ 2 , and let m(x) = supy [x γ,x+γ] s(x). Then, E s(x)2 1 m(x)> log 1 E m(x)2 1 m(x)> log 1 E [m(x)4] E 1 m(x)> log 1 by Cauchy-Schwarz v u u u t E sup y [x γ,x+γ] s(x) Pr m(x) > log 1 1 σ4 Pr m(x) > log 1 by Lemma H.8 δ σ2 by Lemma H.2 We put these lemmas together to show that a piecewise linear function with a bounded number of pieces and bounded slope in each piece is a good approximation to the smoothed score. Diffusion Posterior Sampling is Computationally Intractable Lemma C.4. Let p be a distribution over R with mean µ, and let pσ = p N(0, σ2) have score sσ and second moment m2 2. Then, for any α 1/4 there exists a function l : R R that satisfies 1. l is piecewise linear with at most Θ( m2 σκ3/2 ) pieces, 2. if x is a transition point between two pieces, then |x µ| m2 3. the slope of each piece is bounded by Θ log 1 E x p[(l(x) s(x))2] κ Proof. First, we partition the real line into Si = [iγ, (i+1)γ) for all i Z, where γ < σ/2. Define the function l1 : R R so that if Si x, then l1(x) = ((i + 1)γ x)s(iγ) + (x iγ)s((i + 1)γ) As in Lemma C.1, this is the linear interpolation between s(iγ) and s((i + 1)γ) on the interval [iγ, (i + 1)γ). By Lemma C.1, when γ < σ/2, we have E (s(x) l1(x))2 γ2 Now, we define l2 : R R. This function uses the piecewise linear l1 to create a linear approximation that has small slopes on all of the pieces. Define first a set of good sets G = Si : sup y Si s(x) 1 These are the intervals on which the score is always bounded. Further, define two helper maps L(x) and U(x): L(x) = the largest i such that iγ < x, Si 1 G R(x) = the smallest i such that iγ x, Si G These represent the nearest endpoint of a good interval to the left and right, respectively. We then interpolate linearly between s(γL(x)) and s(γR(x)) to evaluate l2(x). That is, l2(x) = (γR(x) x)s(γL(x)) + (x γL(x))s(γR(x)) γ(R(x) L(x)) (8) Note that by assumption, we have that |s(γR(x))| , |s(γL(x))| 1 δ, and so |l2(x)| 1 δ. We now analyze the error of l2 against s. First, we note that on the sets outside G, the error is bounded, using Lemma C.3: X Si G E (s(x) l2(x))21x Si 2 X E s(x)21x Si + E l2(x)21x Si by Lemma C.3 sup y [x γ,x+γ] s(x) 1 δ by Lemma H.2 Diffusion Posterior Sampling is Computationally Intractable Further, if x is in a good interval, then L(x), R(x) are simply the left and right endpoints of the interval that x is in. This means that l2(x) = l1(x). 
So, X Si G E (s(x) l2(x))21x Si = X Si G E (s(x) l1(x))21x Si i E (s(x) l1(x))21x Si Putting these two together, we get that E (l2(x) s(x))2 Now, define l3 : R R as follows: l2(x) |x µ| m2 δ l2 µ + m2 This takes our previous approximation l2 and holds it constant on values of x far away from the mean. Let B be the integers i such that x Si = |x µ| m2/ δ. In other words, the set B enumerates the intervals on which l2 = l1, and equivalently, l2 = 0. Note that since |l3(x)| 1 δ, we have in particular that l3 µ m2 Therefore, for some |φ| 1 δ , we have E (s(x) l3(x))2 = X i E (s(x) l3(x))21x Si i B E (s(x) l3(x))21x Si + X i B E (s(x) l3(x))21x Si i B E h (s(x) φ)21|x µ| m2/ i E (s(x) l2(x))21x Si where this last line uses Lemma C.2. Finally, each piece of l3 has slope at most Θ log 1 δ γσ since the endpoints of each interval are bounded in magnitude by δ and each interval is at least γ in width. Also, we can see that l3 has at most as many pieces as l2, which has Θ m2 γ pieces, with each endpoint being within m2/ δ of the mean. So, we take l to be l3 with δ = κ2, and γ = σ κ. Note that when κ < 1/4, we have γ < σ/2. Plugging these in, and using the fact that x log2(1/x) 3 x, we get that the number of pieces is Θ m2 σκ3/2 , the slope of each piece is bounded by Θ log 1 κ2 σ2 κ , the function itself is always bounded by 1 κ2 , and Ex p[ l(x) s(x) 2 2] κ σ2 . Finally, we show that if we have a product distribution over d dimensions, we can simply use the product of the one dimensional linear approximations along each coordinate to give a good approximation for the full score. Lemma C.5. Let p be a product distribution over Rd, such that p(x) = Qd i=1 pi(xi). Let s : Rd Rd be the score of p and let si : R R be the score of pi. If li : R R is an approximation to si such that (li(xi) si(xi))2 ε/d, then the function l : Rd Rd defined as l(x) = (li(xi)) satisfies E x p l(x) s(x) 2 2 ε Diffusion Posterior Sampling is Computationally Intractable Proof. We have s(x)i = ( log p(x))i = xi log i=1 pi(xi) = xi i=1 log pi(xi) = xi log pi(xi) = si(xi). E x p l(x) s(x) 2 2 = E x p i=1 li(xi) si(xi) 2 2 i=1 E xi pi li(xi) si(xi) 2 2 d ε/d = ε C.2. Small noise level Score of vertex distribution close to full score in vertex orthant Lemma C.6 (Density gs(x) is close to g(x) for s { 1}d closest to x). Let d = O(d). Consider gs and g as in Definition 3.2. We have that for x { 1}d such that s is closest to x1,...,d among points in { 1}d, for the σ-smoothed versions hs = gs N(0, σ2Id+d ) of gs and h = g N(0, σ2Id+d ) of g, for R2 1+σ2 > C log d for sufficiently large constant C, 1 2d hs (x) h(x) 1 Proof. We have that there are d k vectors z { 1}d such that R z x1,...,d 2 k R2. For such a z, hz(y) e k R2 1 2d hs (x) h(x) = since R2 1+σ2 > C log d. Lemma C.7 (Gradient of density gx(y) is close to g(y) for x { 1}d closest to y). Let d = O(d) and consider gs and g as in Definition 3.2, and x Rd+d . We have that for s { 1}d such that s is closest to x1,...,d among points in { 1}d, for σ τ, τ = 1 d C and ε > 1 poly(d), for the σ-smoothed versions hs = gs N(0, σ2Id+d ) of gs and h = g N(0, σ2Id+d ) of g, for R2 1+σ2 > C log d for sufficiently large constant C, 1 2d hs(x) h(x) Proof. We will let ehs,i = egs,i N(0, σ2), where egs,i is defined in Definition 3.2. So, hs(x) = Qd+d i=1 ehs,i(xi). We have that there are d k vectors z { 1}d such that R z x1,...,d 2 k R2. 
So, for i [d], for such a z, | ( hz(x))i | e k R2 Diffusion Posterior Sampling is Computationally Intractable On the other hand, for i > d, by Lemma C.16, since σ > ε2 and ε > 1 poly(d), eh z,i(xi) w σ2+1(xi) e σ2 2ε2(1+σ2) + X 2ε2(1+σ2) +log j ε(1+σ2) 2ε2(1+τ2) + X 2ε2(1+τ2) +log j ε(1+τ2) So, we have that for z { 1}d such that R z x1,...,d 2 > k R2, since ε > 1 poly(d), τ = 1 poly(d) and R2 1+σ2 > C log d, |( hz(x))i| ε 2(1+σ2) e k R2 So, finally, for such z, hz(x) 2 e k R2 1 2d hs(x) h(x) k=1 dke k R2 Lemma C.8 (Score of mixture close to score of closest (discretized) Gaussian). Let d = O(d ), and consider gs, g as in Definition 3.2 for any s { 1}d, with R2 1+σ2 > C log d for sufficiently large constant C. Let S Rd be the orthant containing s. Let σ τ for τ = 1 poly(d), and let ε > 1 poly(d). We have that, for the σ-smoothed scores sσ,s of gs and sσ of g, h sσ,s(x) sσ(x) 2 1x1,...,d S i e Ω R2 where h is the σ-smoothed version of g, given by h = g N(0, σ2Id+d ). Proof. Let hs be the σ-smoothed version of gs, given by hs = gs N(0, σ2Id+d ). Let es Rd+d be such that the first d coordinates are given by s, and the remaining d coordinates are 0. We have h sσ,s(x) sσ(x) 2 1x1,...,d S i = E x h h sσ,s(x) sσ(x) 2 1 x es R/10 1x1,...,d S i h sσ,s(x) sσ(x) 2 1 x es >R/10 1x1,...,d S i Note that when x es R/10, by Lemma C.16, hs(x) e R2 64(1+σ2) since σ τ for τ = 1 poly(d), ε > 1 poly(d) and R2 1+σ2 > C log d. So, by Lemmas C.6 and C.7, h(x) = 1 2d hs(x) 1 + O e R2 8(1+σ2) , and h(x) 1 2d hs(x) 2 16(1+σ2) . Also note that by Lemma C.6, h(x|x1,...,d S) hs(x)+O(e R2 8(1+σ2) ) hs(x) 1 + O e R2 Diffusion Posterior Sampling is Computationally Intractable So, for the first term, h sσ,s(x) sσ(x) 2 1 x es R 1x1,...,d S i 1 2d hs(x) h(x) 1 x es R/10 1x1,...,d S 16(1+σ2) + e R2 1 2d hs(x)2 1 x es R/10 1x1,...,d S 32(1+σ2) + e R2 8(1+σ2) E x hs 32(1+σ2) + de R2 since σ 1 poly(d), ε > 1 poly(d) and R2 1+σ2 > C log d. For the second term, by Cauchy-Schwarz, h sσ,s(x) sσ(x) 2 1 x es > R 1x1,...,d S i h sσ,s(x) 4 + sσ(x) 4 1x1,...,d S i E h 1 x es >R/10 1x1,...,d S i σ4 E h x 4 1x1,...,d S i e Ω R2 R4 + E s { 1}d h E h x 4 1x1,...,d S, x gs ii e Ω R2 So, we have the claim. C.3. Re LU Network approximation of σ-smoothed Scores of Product Distributions Once we have this, we also need to go from being close to mixture of Gaussians to being close to mixture of discretized Gaussians. Lemma C.9. Let f : R R be a continuous piecewise linear function with D segments. Then, f can be represented by a Re LU network with O(D) parameters. If each segment s slope, each transition point, and the values of the transition points are at most β in absolute value, each parameter of the network is bounded by O(β) in absolute value. Proof. Since f is piecewise linear, we can define f as follows: there exists = γ0 < γ1 < γ2 < < γD 1 = γD = + such that a1x + b1, x γ1 a2x + b2, γ1 < x γ2 ... a Dx + b D, γD 1 < x, where akγk + bk = ak+1γk + bk+1 for each k [D 1]. Now we will show that f(x) equals g(x) defined below: g(x) := a1x + b1 + i=2 Re LU((ai ai 1)(x γi 1)). Diffusion Posterior Sampling is Computationally Intractable We observe that for γk 1 < x γk, g(x) = a1x + b1 + i=2 (ai ai 1)(x γi 1) = akx i=2 (ai ai 1)γi 1. Then, when k > 1, for γk 1 < x γk, we have i=2 (ai ai 1)γi 1 i=2 (ai ai 1)γi 1 + akx ak 1γk 1 (ak ak 1)γk 1 = g(γk 1) + akx akγk 1. Using these observations, we can inductively show that for each k [D], g(x) = f(x) holds for γk 1 < x γk. For x γ1, g(x) = a1x + b1 = f(x). Assuming for γk 2 < x γk 1, g(x) = f(x). 
Then g(γk 1) = f(γk 1) = akγk 1 + bk. Therefore, for γk 1 < x γk, we have g(x) = g(γk 1) + akx akγk 1 = akx + bk = f(x). This proves that g(x) = f(x) for x R and we only need to design neural network to represent g. By employing one neuron for a1x + b1 and D 1 neurons for Re LU((ai ai 1)(x γi 1)), and aggregating their outputs, we obtain the function g. There are O(D) parameters in total, and each parameter is bounded by O(β) in absolute value. Lemma C.10. Let f1, . . . , fk be functions mapping R to R. Suppose each fi can be represented by a neural network with p parameters bounded by β in absolute value. Then, function g : Rk Rk defined by g(x1, . . . , xk) := (f1(x1), . . . , fk(xk)) can be represented by a neural network with O(pk) parameters bounded by β in absolute value. Proof. We just need to deal with each coordinate separately and use the neural network representation for each fi. We just need to concatenate each result of fi together as the final output. Lemma C.11 (Re LU network implementing the score of a one-dimensional σ-smoothed distribution). Let p be a distribution over R with mean µ, and let pσ = p N(0, σ2) have variance m2 2 and score sσ. There exists a constant-depth Re LU network f : R R with O( m2 γ3σ4 ) parameters with absolute values bounded by O( m2 σ2γ2 + log 1 γ σ3γ + |µ|) such that sσ(x) f(x) 2 γ2 Proof. By Lemma C.4, there exists a continuous piecewise approximation of p with O( m2 σ3γ4 ) pieces with each segment s slope, each transition point, and function value all bounded in O( m2 σ2γ2 + 1 σ3γ log 1 σγ + 1 σ log 1 σγ + |µ|). Taking this into C.9 and we have the bound. Lemma C.12. Let p be a product distribution over Rd such that p(x) = Qd i=1 pi(xi), and let pσ = p N(0, σ2Id) have score sσ. Assume pσ has mean µ and variance m2 2 = Ep[ x µ 2 2]. Then, there exists a constant-depth Re LU network f : Rd Rd with O( dm2 γ3σ4 ) parameters with absolute values bounded by O( dm2 d σ3γ log d σγ + µ 1) such that sσ(x) f(x) 2 γ2. Diffusion Posterior Sampling is Computationally Intractable Proof. Consider distribution pi : R R and its σ-smoothed version piσ = pi N(0, σ2). Let µi and m2i be the mean and the variance of pi respectively. Let sσi be the i-th component of sσ. Then, Lemma C.11 shows that for each i [d], there exists a constant-depth Re LU network fi : R R with O( m2i γ3σ4 ) parameters with absolute values bounded by d σ3γ log d σγ + |µi|) such that sσi(x) fi(x) 2 γ2 Then, we can use the product function f = (f1, . . . , fd) as the approximation for sσ. By Lemma C.5, sσ(x) f(x) 2 γ2. Taking the fact that P i [d] |µi| = µ1 and P i [d] m2i dm2 into Lemma C.10, and we prove the statement. C.4. Re LU network for Score at Small smoothing level Lemma C.13 (Vertex Identifier Network). For any 0 < α < 1, there exists a Re LU network h : Rd Rd with O(d/α) parameters, constant depth, and weights bounded by O(1/α) such that If |xi| > α, for all i [d], then h(x)i = xi |xi| for all i [d]. Proof. Consider the one-dimensional function 1, y α y α, α < y < α 1, y α This is a piecewise linear function, where the derivative of each piece is bounded by 1 α, the value of the transition points are at most α in absolute value, and |h| itself is bounded by 1. Thus, by Lemma C.9, we can represent the function h(x) = (g(x1), . . . , g(xd)) using O(d/α) parameters, with each parameter s absolute value bounded by O(1/α). Moreover, clearly h(x)i = xi |xi| for all i [d] whenever |xi| 1 Lemma C.14 (Switch Network). 
Consider any function switch : Rd+1 Rd such that for x Rd, y R, with |xi| T for all i [d], switch(x, y) = ( x if y = 1 0 if y = 1 switch can be implemented using a constant depth Re LU network with O(d T) parameters, with each parameter s absolute value bounded by O(T). Proof. Consider the Re LU network given by switch(x, y)i = Re LU((xi 2T) + 2T y) Re LU(( xi 2T) + 2T y) It computes our claimed function. Moreover, it is constant-depth, the number of parameters is O(d T), and each parameter is bounded by O(T) in absolute value, as claimed. Diffusion Posterior Sampling is Computationally Intractable Lemma C.15. Let d = O(d). Given a constant-depth Re LU network representing a one-way function f : { 1, 1}d { 1, 1}d with poly(d) parameters, there is a constant-depth Re LU network h : Rd+d Rd+d with poly d σγ parameters with each parameter bounded in absolute value by poly d σγ such that for the unconditional distribution g defined in Definition 3.2 with σ-smoothed version gσ and corresponding score sσ, for τ = 1 d C and τ σ < R C log d for sufficiently large constant C, and R > C log d, ε > 1 poly(d), γ > 1 d C/100 sσ(x) h(x) 2 γ2 Proof. We will let our Re LU network h be as follows. Let r be the Re LU network from Lemma C.13 that identifies the closest hypercube vertex with any constant parameter α < 1. For each i [d], we will let ehi be the Re LU network that implements the approximation to the score of the one-dimensional distribution w1(x) N(0, σ2) from Lemma C.11. By the lemma, it satisfies h (ehi(x) log w σ2+1(x))2i γ2 (10) For i d + [d ], we will let ehi,1 be the Re LU network that implements the approximation to the score esσ,i, 1 of egσ,i, 1 = ( w1 combε R w1(x) combε(x)dx) N(0, σ2), and we will let ehi, 1 implement the approximation to the score of w1(x) combε(x ε/2) R w1(x) combε(x ε/2)dx N(0, σ2), as given by Lemma C.11. By the Lemma, for every i d + [d ] and j { 1}, we have E x egσ,i,j h (ehi,j(x) sσ,i,j(x))2i γ2 (11) Note that each |ehi| C σ log 1 σγ for i d, and |ehi, 1| C σ log 1 σγ for i > d, for sufficiently large constant C. Now let switch be the Re LU network described in Lemma C.14 for T = C σ log 1 σγ . Consider the network h : Rd+d Rd+d given by (ehi(xi r(x)i R) for i d switch(ehi,1(xi), f(r(x))i d) + switch(ehi, 1(xi), f(r(x))i d) for i > d Note that h can be represented with poly d σγ parameters with absolute value of each parameter bounded in poly d σγ . We will show that h approximates sσ well in multiple steps. For r(x) { 1}d, consider the score sσ,r(x) of gσ,r(x), the σ-smoothed version of the distribution gr(x) centered at g r(x) Rd+d , as described in Definition 3.2, where g r(x) has the first d coordinates given by r(x), and the remaining coordinates set to 0. Whenever r(x) = j { 1}d, h approximates sσ,j well over gσ,j. We will show that for fixed j { 1}d sσ,j(x) h(x) 2 1r(x)=j dγ2 First, note that for i d, by (10) and our definition of h, (h(x)i log w σ2+1(x R j))2 1r(x)=j γ2 On the other hand, for i > d, by (11) and our definition of h, E x egσ,i,f(j)i d (h(x)i sσ,i,f(j)i d)2 1r(x)=j γ2 Since by Definition 3.2, for j { 1}d, gσ,j(x) = Qd i=1 w σ2+1(x) Qd+d i=d+1 egσ,i,f(j)i d(x), we have by Lemma C.5, h(x) sσ,j(x) 2 1r(x)=j dγ2 Diffusion Posterior Sampling is Computationally Intractable h approximates sσ,j exponentially accurately over gσ. By Lemma C.6, we have that for x such that r(x) = j, 1 2d gσ,j(x) gσ(x) 1 for our choice of R, σ. So, we have h(x) sσ,j(x) 2 1r(x)=j dγ2 h approximates sσ well over gσ whenever r(x) { 1}d. 
Summing the above over j { 1}d gives h(x) sσ,r(x)(x) 2 1r(x) { 1}d dγ2 Moreover, by Lemma C.8, for r(x) = y where y { 1}d represents the orthant that x { 1}d belongs to, sσ, r(x) sσ(x) 2 e Ω R2 1+σ2 1 d C2/10 So, by the above, we have that h(x) sσ(x) 2 1r(x) { 1}d dγ2 + 1 d C2/10 Contribution of x such that r(x) { 1}d is small. By the definition of r, r(x) { 1}d de (R α)2 So, by Cauchy-Schwarz, sσ(x) 2 1r(x) { 1}d q E x gσ [ sσ(x) 4] Pr x gσ [r(x) { 1}d] Similarly, since |h(x)i| C σ log 1 σγ , we have h(x) 2 1r(x) { 1}d 1 Thus, we have h(x) sσ(x) 2 1r(x) { 1}d e R2 8 1 d C2/40 Putting it together. By the above, we have, h(x) sσ(x) 2 dγ2 + 1 d C2/40 Reparameterizing γ and noting that γ > 1 d C/100 gives the claim. Diffusion Posterior Sampling is Computationally Intractable C.5. Smoothing a discretized Gaussian Lemma C.16. For any ϕ, let g be the univariate discrete Gaussian with pdf g(x) w1(x) combε(x ϕ) Consider the ρ-smoothed version of g, given by gρ = g wρ. We have that ρ2+1(x) e ρ2 2ε2(1+ρ2) w and g ρ(x) w ρ2+1(x) e ρ2 2ε2(1+ρ2) |w ρ2+1(x)| + X 2ε2(1+ρ)2 j ε(1 + ρ2) w Proof. The Fourier transform of the combε distribution is given by 1 εcomb1/ε. So, for the discrete Gaussian g, we have that its Fourier Transform is given by bg(ξ) = bw1 e iξϕ Then, for gρ, the ρ-smoothed version of g, we have that its Fourier Transform is bgρ(ξ) = (bg bw1/ρ)(ξ) So, we have that, by the inverse Fourier transform, ρ2+1(x) + 1 ρ2+1(x) + 1 2ε2(ρ2+1) X 2 ξ j ε(ρ2+1) ρ2+1(x) + X 2ε2(ρ2+1) e ix j ε(ρ2+1) w so that gρ(x) w j =0 e j2ρ2 2ε2(ρ2+1) w giving the first claim. For the second claim, note that the above gives ρ2+1(x) + X 2ε2(ρ2+1) e ixj ε(ρ2+1) ij ε(ρ2 + 1)w ρ2+1(x) + w So, g ρ(x) w ρ2+1(x) e ρ2 2ε2(1+ρ2) w ρ2+1(x) + X 2ε2(1+ρ2) j ε(1 + ρ2)w Diffusion Posterior Sampling is Computationally Intractable C.6. Large Noise Level - Distribution is close to a mixture of Gaussians Lemma C.17. Let C be a sufficiently large constant. Let gj(x) Qd i=1 w1(xi) Qd i=d+1 w1(xi) combε(xi ϕi,j) be the pdf of a distribution on Rd+d with shifts ϕi,j. Consider a mixture of discrete d-dimensional Gaussians, given by the pdf j=1 βjgj(x µj) Let hρ(x) = (h wρ)(x) be the smoothed version of h. Then, for the mixture of standard (d + d )-dimensional Gaussians given by ε2(1+ρ2) > C log d, we have that hρ(x) fρ(x) 2ε2(1+ρ2) 1 + m2 2 + sup j µj 2 + X 2ε2(1+ρ2) j ε(1 + ρ2) where m2 2 = Ex hρ x 2 . Proof. We have that i=1 βjgj ρ(x µj) where gj ρ(x) = gj(x) wρ(x). By Lemma C.16, we have that for every i, j, 2ε2(1+ρ2) w 2ε2(1+ρ2) j ε(1 + ρ2) w ( hρ(x))i ( fρ(x))i j=1 βj gj ρ(x µj) w 2ε2(1+ρ2) w 2ε2(1+ρ2) j ε(1 + ρ2) w Similarly, for the density, by Lemma C.16 ρ2+1(x) de ρ2 2ε2(1+ρ2) w Diffusion Posterior Sampling is Computationally Intractable |hρ(x) fρ(x)| = j=1 βj gj ρ(x µj) w 2ε2(1+ρ2) k X 2ε2(1+ρ2) fρ(x) Thus, we have " ( hρ(x))i hρ(x) ( fρ(x))i fρ(x) 1 + O de ρ2 2ε2(1+ρ2) ( fρ(x))i " ( fρ(x) )i Pk j=1 βj e ρ2 2ε2(1+ρ2) w ρ2+1(x µj) i + P 2ε2(1+ρ2) j ε(1+ρ2) w ε2(1+ρ2) + d X ε2(ρ2+1) j ε(1 + ρ2) + de ρ2 Pk j=1 βj |x µj|i w ε2(1+ρ2) + d X ε2(1+ρ2) j ε(1 + ρ2) + de ρ2 ε2(1+ρ2) E sup j |x µj|2 i ε2(1+ρ2) 1 + E x2 i + sup j |µj|2 i ε2(1+ρ2) j ε(1 + ρ2) Thus, we have hρ(x) fρ(x) ε2(1+ρ2) 1 + E x 2 + sup j µj 2 + d2 X 2ε2(1+ρ2) j ε(1 + ρ2) 2ε2(1+ρ2) 1 + m2 2 + sup j µj 2 + X 2ε2(1+ρ2) j ε(1 + ρ2) ε2(1+ρ2) > C log d Corollary C.18. Let d = O(d), and let g be as defined in Definition 3.2, and let gρ = g N(0, ρ2Id+d ) be the ρ-smoothed version of g. 
Let fρ be the mixture of (d + d )-dimensional standard Gaussians, given by ρ2+1(y R ex) Diffusion Posterior Sampling is Computationally Intractable where ex Rd+d has the first d coordinates given by x, and the last d coordinates 0. Then, for ρ2 ε2(1+ρ2) > C log d for sufficiently large constant C, we have h log gρ(x) log fρ(x) 2i e ρ2 2ε2(1+ρ2) 1 + R2 + ρ2 + X 2ε2(1+ρ2) j ε(1 + ρ2) Proof. Follows from the facts that m2 2 d R2 + ρ2 and µ2 j d R2 for all j. C.7. Re LU network for Score at Large smoothing Level This section shows how to represent the score of the σ-smoothed unconditional distribution defined in Definition 3.2 for large σ using a Re LU network with a polynomial number of parameters bounded by a polynomial in the relevant quantities. We proceed in two stages first, we show how to represent the score of a mixture of Gaussians placed on the vertices of a scaled hypercube. Then, we show that for large σ, this network is close to the score of the σ-smoothed unconditional distribution. Lemma C.19 (Re LU network representing score of mixture of Gaussians on hypercube). For any σ > 0 and R > 1 consider the distribution on Rd with pdf µ { 1}d wσ(x Rµ) where wσ is the pdf of N(0, σ2Id). There is a constant depth Re LU network h : Rd Rd with O d R γ3σ4 parameters, with absolute values bounded by O d R σ3γ2 such that log fσ(x) h(x) 2 γ2 Proof. Note that gσ is a product distribution. So, the claim follows by Lemma C.12. Lemma C.20. Let d = O(d), and let R poly(d). Let g be the pdf of the unconditional distribution on Rd+d , as defined in Definition 3.2, and let gσ be its σ-smoothed version with score sσ. For ε < 1 C log d, and σ > Cε log d + q for sufficiently large constant C, there is a constant depth Re LU network h with O d R γ3σ4 parameters with absolute values bounded by O d R σ3γ2 such that sσ(x) h(x) 2 γ2 + 1 d C2/20 Proof. Let h be the Re LU network from Lemma C.19 for smoothing σ. It satisfies our bounds on the number of parameters and the absolute values. Note that for our setting of ε and σ, we have that ε2(1 + σ2) = 1 ε2 1 + 1 σ2 > 1 ε2 + 1 C2 log d > 1 2 C2 log d > C2 log d ε2(1 + σ2) > C2(log d + log 1 ε) 2 > log 1 ε(1 + σ2) So, by Lemma C.18, for the mixture of Gaussians fσ as described in Lemma C.19, for R poly(d), sσ(x) log fσ(x) 2 e σ2 10ε2(1+σ2) 1 d C2/20 Diffusion Posterior Sampling is Computationally Intractable Also, by Lemma C.16, |gσ(x) fσ(x)| fσ(x) So, by Lemma C.19, log fσ(x) h(x) 2 E x fσ log fσ(x) h(x) 2 γ2 sσ(x) h(x) 2 γ2 + 1 d C2/20 C.8. Re LU Network Approximating score of Unconditional Distribution Theorem C.21 (Re LU Score Approximation for Lower bound Distribution). Let C be a sufficiently large constant, and let d = O(d). Fix any σ τ for τ = 1 d C . Given a constant-depth Re LU network representing a function f : { 1, 1}d { 1, 1}d with poly(d) parameters, there is a constant-depth Re LU network h : Rd+d Rd+d with poly (d) parameters with each parameter bounded in absolute value by poly (d) such that for the unconditional distribution g defined in Definition 3.2 with σ-smoothed version gσ and corresponding score sσ, for R > C log d, 1 poly(d) < ε < 1 C log d, sσ(x) h(x) 2 1 d C/200 Proof. Follows by Lemmas C.15 and C.20. Corollary 3.6 (Lower Bound Distribution is Well-Modeled). Let C be a sufficiently large constant. Given a Re LU network f : { 1}d { 1}d with poly(d) parameters bounded by poly(d) in absolute value, the distribution eg defined in Definition 3.3 for R = C log d and 1 poly(d) < ε < 1 C log d, is O(C)-well-modeled. Proof. 
Follows via reparameterization from the Theorem, and rescaling.

D. Lower Bound – Putting it all Together

Theorem 1.8 (Lower Bound). Suppose that one-way functions exist. Then for any m > d^{0.01}, there exists a 10-well-modeled distribution over R^d and a linear measurement model with m measurements and noise parameter β = Θ(1/log² d), such that (1/10, 1/10)-posterior sampling requires superpolynomial time in d.

Proof. First, by Lemma G.3, there exists a ReLU network that represents a one-way function f : {±1}^m → {±1}^m, with constant weights, polynomial size, and parameters bounded in magnitude by poly(d). Therefore, by Corollary 3.6, the distribution ~g over R^d is a C-well-modeled distribution if we take R = C log d and ε = 1/(C log d). Further, if we take a linear measurement model with β = 1/(C² log² d), then by Lemma 3.5, any (1/10, 1/10)-posterior sampler for this distribution takes superpolynomial time in d to run.

Theorem 1.9 (Lower Bound: Exponential Hardness). Suppose that there exist one-way functions f : {±1}^m → {±1}^m that require 2^{Ω(m)} time to invert. Then for any m = O(d) and C > 1, there exists a C-well-modeled distribution over R^d and a linear measurement model with m measurements and noise level β = 1/(C² log² d), such that (1/10, 1/10)-posterior sampling takes at least 2^{Ω(m)} time.

Proof. First, by Lemma G.3, there exists a ReLU network that represents a one-way function f : {±1}^m → {±1}^m, with constant weights, polynomial size, and parameters bounded in magnitude by poly(d). Therefore, by Corollary 3.6, the distribution ~g over R^d is a C-well-modeled distribution if we take R = C log d and ε = 1/(C log d). Further, if we take a linear measurement model with β = 1/(C² log² d), then by Lemma B.5, any (1/10, 1/10)-posterior sampler for this distribution takes at least 2^{Ω(m)} time to run.

E. Upper Bound

Lemma E.1. Let q be a distribution over R^m such that E_{w∼q}[‖w‖₂²] = O(m). Let w ∼ q and y = w + βN(0, I_m). Then, there exists a constant c > 0 such that

  Pr_y[ Pr[ ‖y − w‖ ≤ 10β√(m + log(1/δ)) | y ] ≥ (cβ)^m δ^{m/2+1} ] ≥ 1 − δ.

Proof. Since E_{w∼q}[‖w‖₂²] ≲ m, there exists a constant C such that Pr_w[‖w‖₂ > √(Cm/δ)] ≤ δ/3. Lemma H.10 shows that there exists a covering of {x ∈ R^m : ‖x‖₂ ≤ √(Cm/δ)} by N = O(1/(β√δ))^m balls of radius β√(m + log(1/δ)). Let S be the set of all the covering balls. This means that

  Pr[∃ θ ∈ S : w ∈ θ] ≥ 1 − δ/3.

Define S' := {θ ∈ S : Pr_w[w ∈ θ] > δ/(3N)}. Then we have that with high probability, w lands in one of the cells of S':

  Pr_w[∄ θ ∈ S' : w ∈ θ] ≤ Pr[∄ θ ∈ S : w ∈ θ] + Pr[⋁_{θ ∈ S∖S'} w ∈ θ] ≤ δ/3 + N · δ/(3N) ≤ 2δ/3.

Moreover, we define

  S⁺ := {y ∈ R^m : ∃ θ ∈ S', w ∈ θ such that ‖w − y‖ ≤ 10β√(m + log(1/δ))}.

By the sampling process of y, we have that

  Pr_y[y ∈ S⁺] ≥ Pr_{w∼q, z∼N(0,I_m)}[ (∃ θ ∈ S' : w ∈ θ) ∧ (‖z‖ ≤ 8√(m + log(1/δ))) ]
       ≥ 1 − Pr_w[∄ θ ∈ S' : w ∈ θ] − Pr_{z∼N(0,I_m)}[ ‖z‖² > 64(m + log(1/δ)) ].

By Lemma H.9, we have Pr_{z∼N(0,I_m)}[‖z‖² > 64(m + log(1/δ))] ≤ δ/3. Therefore, Pr_y[y ∈ S⁺] ≥ 1 − δ. This implies that with probability 1 − δ over y, there exists a cell θ ∈ S' such that every w ∈ θ satisfies ‖y − w‖ ≤ 10β√(m + log(1/δ)), and Pr_w[w ∈ θ] ≥ δ/(3N) ≥ (cβ)^m δ^{m/2+1} for some constant c > 0.

Lemma E.2. Consider a well-modeled distribution and a linear measurement model. Suppose we have a (τ, δ)-unconditional sampler for the distribution, where τ < cδβ/√(m + log(1/δ)) for a sufficiently small constant c > 0. Then rejection sampling (Algorithm 1) gives a (τ, 2δ)-posterior sampler using at most log(1/δ) · (O(1)/(βδ))^m samples.

Proof. Let P be the distribution that couples the true distribution D over (x, y) with the output distribution of the posterior sampler p̂|y.
Rigorously, we define P over (x, bx, y) X X Y with density p P such that p P(x, y) = p D(x, y), p P(bx | y) = bp|y(bx). Similarly, we let e P over (x, bx, y) Rd Rd Rα be the joint distribution between the unconditional sampler over (x, bx) and the measurement process D over (x, y). Then by the definition of unconditional samplers, we have Pr x,ˆx e P [ x bx τ] δ. Therefore, to prove the correctness of the algorithm, we only need to show that there exists a b P over (x, bx, y) such that e P(bx | y) = bp|y(bx) and TV( b P, e P) δ. By Lemma H.9, Ax y 2 4β2(m + log 1 Therefore, we define e P as e P conditioned on x bx < τ and Ax y 2 2β2 2(m + log 1 δ ). Then we have TV( e P, e P ) 3δ Algorithm correctness. We have bp|y(bx) = p e P (bx) e 2β2 R p e P (bx) e R p e P (x, bx) e 2β2 dx R p e P (bx) e Then we define R p e P (x, bx) e 2β2 dx R p e P (x, bx) e Conditioned on x bx τ and Ax y 2 2β2 2(m + log 1 δ ), we have |log r(bx)| sup x | Ax y 2 Abx y 2| τ 2 A 2 2 + 2τ A 2 Ax y m + log(1/δ) By our setting of τ, we have 1 δ/8 < r(bx) < 1 + δ/8. So we have Z p e P (bx) e 2β2 dbx = Z p e P (x, bx) e 2β2 dx dbx = 1 δ Z p e P (x, bx) e 2β2 dx dbx. bp|y(bx) = r(bx) R p e P (x, bx)e 8) R p e P (x, bx)e = r(bx) R p e P (x, bx)p e P (y | x) dx 8)p e P (y) r(bx) Z p e P (x, bx | y) dx p e P (bx | y). Diffusion Posterior Sampling is Computationally Intractable Finally, we have Z p e P (bx | y) bp|y(bx) dbx dp e P (y) = Z p e P (bx | y) p e P (bx | y) dbx dp e P (y) δ This implies that TV( b Pbx, e Pbx) = TV( b Pbx, e P bx) + TV( e P bx, e Pbx) δ Pr x,bx b P [ x bx τ] 3δ Running time. Now we prove that for most y For y Y , for each round, the acceptance probability q(y) each round is that q(y) = Z p e P (bx)e y Abx 2 8) Z p e P (x, bx) e Z p X (x) e h Ax y 10 p m + log(1/δ)β i e 100(m+log(1/δ))β2 h Ax y 10 p m + log(1/δ)β i δe 50m By Lemma H.11, Ex X [ Ax 2 2] = O(m). By Lemma E.1, we have that for 1 δ/8 probability over y, for some c > 0, h Ax y 10 p m + log(1/δ)β i (cβ)m δm/2+1. Therefore, for some c > 0, h q(y) (cβ)m δm/2+2i 1 δ Hence, for some C > 0, Pr Rejection sampling terminates in log(1/δ) m rounds 1 δ Theorem 1.7 (Upper Bound). Let C > 1 be a constant. Consider an O(C)-well-modeled distribution and a linear measurement model with β > 1 d C . When δ > 1 d C , rejection sampling of the diffusion process gives a ( 1 d C , δ)-posterior sampler that takes poly(d)( O(1) Proof. Theorem 1.4 suggests that for an O(C)-well-modeled distribution, a poly(d) time ( 1 d3C , 1 2d C )-unconditional sampler exists. Since 1 d3C < o 1 2d C 1 d C m + log(1/δ) By lemma E.2, a ( 1 d3C , 1 d C )-posterior sampler exists using log(1/δ) δ )m poly(d)( O(1) δ )m samples. Since generating each sample costs poly(d) time. The total time is poly(d)( O(1) Diffusion Posterior Sampling is Computationally Intractable F. Well-Modeled Distributions Have Accurate Unconditional Samplers Notation. For the purposes of this section, we let est = sσ2 denote the score at time t. Definition F.1 (Forward and Reverse SDE). For distribution q0 over Rd, consider the Variance Exploding (VE) Forward SDE, given by dxt = d Bt, x0 q0 where Bt is Brownian motion, so that xt x0 + N(0, t Id). Let qt be the distribution of xt. There is a VE Reverse SDE associated with the above Forward SDE given by dx T t = es T t(x T t) + d Bt (12) for x T q T . Theorem F.2 (Unconditional Sampling Theorem, Implied by (Benton et al., 2024), adapted from (Gupta et al., 2023)). Let q be a distribution over Rd with second moment m2 2 = Ex q x 2 between 1 poly(d) and poly(d). 
Let q_t = q ∗ N(0, t I_d) be the t-smoothed version of q, with corresponding score s̃_t. Suppose T = d^C. For any γ > 0, there exist N = Õ((d/ε²) log²(1/γ)) discretization times 0 = t_0 < ⋯ < t_N ≤ T − γ such that, given score approximations h_{T−t_k} of s̃_{T−t_k} that satisfy

  E[ ‖s̃_{T−t_k} − h_{T−t_k}‖² ] ≤ ε² / (C (T − t_k) log d)

for sufficiently large constant C, the discretization of the VE Reverse SDE defined in (12) using the score approximations can sample from a distribution (ε + 1/d^{C/2})-close in TV to a distribution γm₂-close in 2-Wasserstein to q, in N steps.

Theorem 1.4 (Unconditional Sampling for Well-Modeled Distributions). For an O(C)-well-modeled distribution p, the discretized reverse diffusion process with approximate scores gives a (1/d^C, 1/d^C)-unconditional sampler (as defined in Definition 1.3) for any constant C > 0 in poly(d) time.

Proof. The definition of a well-modeled distribution gives that, for every 1/d^C < σ < d^C, there is an approximate score ŝ_σ such that E[‖ŝ_σ(x) − s_σ(x)‖²] < 1/(d^C σ²), and ŝ_σ can be computed by a poly(d)-parameter neural network with poly(d)-bounded weights. Here p_σ is the σ-smoothed version of p with score s_σ. Then, by Theorem F.2, the discretized reverse diffusion process can use the ŝ_σ to produce a sample x̂ from a distribution p̂ that is 1/d^{C/3}-close in TV to a distribution that is 1/d^{C/3}-close in 2-Wasserstein to p. This means there exists a coupling between x̂ ∼ p̂ and x ∼ p such that

  Pr[ ‖x̂ − x‖ > 1/d^{C/6} ] ≤ 2/d^{C/3}.

The claim follows via reparameterization.

G. Cryptographic Hardness

Recall that a one-way function f is a function such that every polynomial-time algorithm fails to find a pre-image of a random output of f with high probability.

Lemma G.1. If a one-way function f : {±1}^n → {±1}^{m(n)} exists, then for any 1/poly(n) ≤ l(n) ≤ poly(n), there exists a one-way function g : {±1}^n → {±1}^{l(n)}.

Proof. For l(n) > m(n), we just need to pad l(n) − m(n) ones at the end of the output, i.e., g(x) := (f(x), 1^{l(n)−m(n)}). For 1/poly(n) ≤ l(n) < m(n), for each n there exists a constant c < 1 such that l(n) = m(n^c). Then we can satisfy the requirement by defining g(x) := f(first n^c bits of x).

Lemma G.2. Every circuit f : {±1}^n → {±1}^{m(n)} of poly(n) size can be simulated by a ReLU network with poly(n) parameters and constant weights.

Proof. In the realm of {+1, −1}, −1 corresponds to True and +1 corresponds to False. We can use a layer of neurons to translate the input to {0, 1} first, where 1 corresponds to True and 0 corresponds to False, and we translate {0, 1} back to {+1, −1} at the output. Now we only need to show that the logic operations (∧, ∨, ¬) in each gate of the circuit can be simulated by a constant number of neurons with constant weights in a ReLU network when the inputs are in {0, 1}:

For each AND (∧) gate, we use ReLU(Σ_i (y_i − 1) + 1) to calculate ⋀_i y_i.
For each OR (∨) gate, we use ReLU(1 − ReLU(1 − Σ_i y_i)) to calculate ⋁_i y_i.
For each NOT (¬) gate, we use ReLU(1 − y_i) to calculate ¬y_i.

It is easy to verify that for {0, 1} inputs, the output of each neuron-simulated gate remains in {0, 1} and equals the result of the corresponding logical operation.

Then the next corollary directly follows.

Corollary G.3. Every one-way function can be computed by a ReLU network with poly(n) parameters and constant weights.

H. Utility Results

Lemma H.1. Let p_σ be some σ-smoothed distribution with score s_σ. For any ε ≤ σ,

  E_{x∼p_σ}[ sup_{|c| ≤ ε} s'_σ(x + c)² ] ≲ 1/σ⁴.

Proof. Draw x ∼ p_σ, and let z ∼ N(0, σ²) be independent of x.
By Lemma H.3, sσ(x) = E z|x Moreover, by Corollary H.4, sσ(x + c) = Ez|x 2σ2 i = Ez|x h ecz/σ2 z c Ez|x[ecz/σ2] Taking the derivative with respect to c, since (a/b) = (a b ab )/b2, s σ(x + c) = Ez|x h ecz/σ2 z2 zc σ2 σ4 i Ez|x[ecz/σ2] Ez|x h ecz/σ2 z c σ2 i Ez|x h z σ2 ecz/σ2i Ez|x[ecz/σ2]2 = Ez|x h ecz/σ2 z2 σ2 σ4 i Ez|x[ecz/σ2] Ez|x h ecz/σ2 z Ez|x[ecz/σ2]2 Ez|x[eεz/σ2 z2 σ4 ] Ez|x[eεz/σ2] (13) Diffusion Posterior Sampling is Computationally Intractable Now we take the supremum over all |c| ε, and take the expectation of this quantity over x to get the desired moment: sup |c| ε s σ(x + c)2 # Ez|x[ecz/σ2 z2 Ez|x[ecz/σ2]2 sup |c| ε E z|x sup |c| ε E z|x h ecz/σ2i 2 !# v u u t E x sup |c| ε E z|x sup |c| ε E z|x The last inequality here follows from Cauchy-Schwarz. For the first term of equation 14, we have sup |c| ε E z|x (eεz/σ2 + e εz/σ2) z2 We compute the 4th moment of this term directly: E x[g(x)4] = E x (eεz/σ2 + eεz/σ2) z2 (eεz/σ2 + e εz/σ2)4 z8 E z [(eεz/σ2 + e εz/σ2)8] E z E z [28(e8εz/σ2 + e 8εz/σ2)] E z e32ε2/σ2 1 σ16 = e16ε2/σ2 For the second term of equation 14, sup |c| ε E z|x h ecz/σ2i 4 # sup |c| ε E z|x h e 4cz/σ2i# by Jensen s h e4ε|z|/σ2i h e4εz/σ2 + e 4εz/σ2i = 2e 1 2 σ2 (4ε/σ2)2 = 2e8ε2/σ2 (16) So, putting equations 16 and 15 into equation 14, we get sup |c| ε s σ(x + c)2 # 2e8ε2/σ2 e16ε2/σ2 Now, by assumption, ε σ. So, we finally get that sup |c| ε s σ(x + c)2 # Diffusion Posterior Sampling is Computationally Intractable Lemma H.2. Let p be a distribution over R, and let pσ = p N(0, σ2) have score sσ. If γ σ/4, then, sup y [x γ,x+γ] s(y) t Proof. From Corollary H.8, we have sup y [x γ,x+γ] s(y)k # So, we have sup y [x γ,x+γ] s(x)k tk # E h supy [x γ,x+γ] s(x)ki Setting k = log 1 sup y [x γ,x+γ] |s(x)| 15 For the following Lemmas, If p is a distribution over R and has score s, define the Fisher information I as I := E x p[s2(x)] Lemma H.3 (Lemma A.1 from (Gupta et al., 2022)). Let p be a distribution over R, and let pσ = p N(0, σ2) have score sσ. Let (x, y, z) be the joint distribution such that y p, z N(0, σ2) are independent, and x = y + z. For all ε > 0, p(x) = E z|x 2σ2 and sσ(x) = E z|x Corollary H.4. Let p be a distribution over R, and let pσ = p N(0, σ2) have score sσ. sσ(x + ε) = Ez|x h eεz/σ2 z ε Ez|x[eεz/σ2] Proof. This proof is given in Lemma A.2 of (Gupta et al., 2022), and is reproduced here for convenience and completeness, since a statement in the middle of their proof is what we use. By Lemma H.3, we have pσ(x + ε) pσ(x) = E z|x Taking the derivative with respect to ε, we have pσ(x) = Ez|x sσ(x + ε) = p σ(x + ε) pσ(x + ε) = p σ(x + ε) pσ(x) pσ(x) pσ(x + ε) 2σ2 i = Ez|x h eεz/σ2 z ε Ez|x[eεz/σ2] Diffusion Posterior Sampling is Computationally Intractable Lemma H.5 (Lemma 3.1 from (Gupta et al., 2022)). Let p be a distribution over R and let pσ = p N(0, σ2) have Fisher information Iσ. Then, Iσ 1 σ2 . Lemma H.6 (Lemma B.3 from (Gupta et al., 2022)). Let p be a distribution over R and let pσ = p N(0, σ2) have score sσ and Fisher information Iσ. If |γ| σ/2, then E[s2(x + γ)] Iσ + O γ Lemma H.7 (Lemma A.6 from (Gupta et al., 2022)). Let p be a distribution over R and let pσ = p N(0, σ2) have score sσ and Fisher information Iσ. Then, for k 3 and |γ| σ/2, E[|sσ(x + γ)|k] k! 2 (15/σ)k 2 max E[s2 σ(x + γ)], Iσ Corollary H.8. Let p be a distribution over R and let pσ = p N(0, σ2) have score sσ. Then, for k 3 and |γ| σ/2, E[|sσ(x + γ)|k] 15k Proof. Consider the continuous function f(x) = x q log 1 σ2x. This function is only defined on 0 < x 1/σ2. 
We have f (x) = 2 log 1 σ2x 1 log 1 σ2x . Setting this equal to zero gives x = 1 σ2 e. f( 1 σ2 e) = 1 σ2 2e. Since f(1/σ2) = 0 and limx 0+ f(x) = 0, we have this is the maximum value of the function. Further, we know by Lemma H.5 that Iσ 1/σ2. So, along with the fact that |γ| σ/2, we have γ σ Iσ log 1 σ2Iσ 1 Therefore, from Lemma H.6, and using Lemma H.5 again, we get E[s2(x + γ)] Iσ + O γ Finally, we can plug this into Lemma H.7 to get E[|sσ(x + γ)|k] k! 2 (15/σ)k 2 max E[s2 σ(x + γ)], Iσ Lemma H.9 (Laurent-Massart Bounds(Laurent & Massart, 2000)). Let v N(0, In). For any t > 0, Pr[ v 2 n 2 nt + 2t] e t. Lemma H.10 (See Mohri et al. (2018), Lemma 6.27). There exist Θ(R/ε)d d-dimensional balls of radius ε that cover {x Rd | x 2 R}. Lemma H.11. Let p be a distribution over Rd with covariance Σ such that Σ 1, and let A Rm d be a matrix with A 1. Then E x p[ Ax 2] m. Diffusion Posterior Sampling is Computationally Intractable Proof. Note the expectation of the squared norm Ax 2 can be expressed as: Ex p[ Ax 2] = trace(AT AΣ). Given that A 1, the singular values of A are at most 1. Hence, the matrix AT A, which represents the sum of squares of these singular values, will have its trace (sum of eigenvalues) bounded by m: trace(AT A) m. Hence, given that Σ 1, we have : Ex p[ Ax 2] = trace(AT AΣ) Σ trace(AT A) m.
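To make the final bound concrete, here is a small numerical sanity check (an illustrative sketch, not from the paper): it verifies the identity E[‖Ax‖²] = trace(AᵀAΣ) for a zero-mean x with covariance Σ, and the bound trace(AᵀAΣ) ≤ ‖Σ‖ · trace(AᵀA) ≤ m, for a random A with ‖A‖ ≤ 1 and a random covariance with ‖Σ‖ ≤ 1. The dimensions and sample count are arbitrary choices for the experiment.

```python
# Numerical check of Lemma H.11 (illustrative only): E[||Ax||^2] = trace(A^T A Sigma) <= m
# when ||A|| <= 1, ||Sigma|| <= 1, and x is zero-mean with covariance Sigma.
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 50, 10, 200_000

A = rng.standard_normal((m, d))
A /= np.linalg.norm(A, 2)              # rescale so the operator norm of A is exactly 1

B = rng.standard_normal((d, d))
Sigma = B @ B.T
Sigma /= np.linalg.norm(Sigma, 2)      # rescale so ||Sigma|| = 1

# closed form: E[||Ax||^2] = trace(A^T A Sigma) for zero-mean x with covariance Sigma
closed_form = np.trace(A.T @ A @ Sigma)

# Monte Carlo estimate with x ~ N(0, Sigma)
L = np.linalg.cholesky(Sigma + 1e-12 * np.eye(d))
x = rng.standard_normal((n, d)) @ L.T
mc = np.mean(np.sum((x @ A.T) ** 2, axis=1))

print(f"trace form = {closed_form:.4f},  Monte Carlo = {mc:.4f},  bound m = {m}")
assert closed_form <= m + 1e-9         # the bound of Lemma H.11
```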