# the_probability_flow_ode_is_provably_fast__e2c7cdb6.pdf The probability flow ODE is provably fast Sitan Chen Sinho Chewi Holden Lee Yuanzhi Li Jianfeng Lu Adil Salim We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling with an OU forward process. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM (O( d) vs. O(d), assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework. 1 Introduction Score-based generative models (SGMs) [Soh+15; SE19; HJA20; DN21; Son+21a; Son+21b; VKK21] are a class of generative models which includes prominent image generation systems such as DALL E 2 [Ram+22]. Their startling empirical success at data generation across a range of application domains has made them a central focus of study in the literature on deep learning [Aus+21; DN21; Kin+21; Shi+21; CSY22; Gna+22; Rom+22; Son+22; BV23; WHZ23]. In this paper, we aim to provide theoretical grounding for such models and thereby elucidate the mechanisms driving their remarkable performance. Our work follows in the wake of numerous recent works which have provided convergence guarantees for denoising diffusion probabilistic models (DDPMs) [De +21; BMR22; De 22; LLT22; Liu+22; Pid22; WY22; Che+23a; CLL23; LLT23] and denoising diffusion implicit models (DDIMs) [CDD23]. We briefly recall that the generating process for SGMs is the time reversal of a certain diffusion process, and that DDPMs hinge upon implementing the reverse diffusion process as a stochastic differential equation (SDE) whose coefficients are learned via neural network training and the statistical technique of score matching [Hyv05; Vin11] (more detailed background is provided in 2). Among these prior works, the concurrent results of [Che+23a; LLT23] are remarkable because they require minimal assumptions on the data distribution (in particular, they do not assume log-concavity or similarly restrictive conditions) and they hold when the errors incurred during score matching are only bounded in an L2 sense, which is both natural in view of the derivation of score matching (see [Hyv05; Vin11]) and far more realistic.7 Subsequently, the work of [CLL23] significantly sharpened the analysis in the case when no smoothness assumptions are imposed on the data distribution. Harvard University, sitan@seas.harvard.edu Institute for Advanced Study, schewi@ias.edu Johns Hopkins University, hlee283@jhu.edu Microsoft Research, yuanzhili@microsoft.com Duke University, jianfeng@math.duke.edu Microsoft Research, adilsalim@microsoft.com 7It is unreasonable, for instance, to assume that the score errors are bounded in an L sense, since we cannot hope to learn the score function in regions of the state space which are not well-covered by the training data. 37th Conference on Neural Information Processing Systems (Neur IPS 2023). Taken together, these works paint an encouraging picture of our understanding of DDPMs which takes into account both the diversity of data in applications (including data distributions which are highly multimodal or supported on lower-dimensional manifolds), as well as the non-convex training process which is not guaranteed to accurately learn the score function uniformly in space. Besides DDPMs, instead of implementing the time reversed diffusion as an SDE, it is also possible to implement it as an ordinary differential equation (ODE), called the probability flow ODE [Son+21b]; see 2. The ODE implementation is often claimed to be faster than the SDE implementation [Lu+22; ZC23], with the rationale being that ODE discretization is typically more accurate than SDE discretization, so that one could use a larger step size. Indeed, the discretization error usually depends on the regularity of the trajectories, which is C1 for ODEs but only C 1 2 for SDEs (i.e., Hölder continuous with any exponent less than 1 2) due to the roughness of the Brownian motion driving the evolution. Far from being able to capture this intuition, current analyses of SGMs cannot even provide a polynomial-time analysis of the probability flow ODE. The key issue is that under our minimal assumptions (i.e., without log-concavity of the data distribution), the underlying dynamics of either the ODE or SDE implementation are not contractive, and hence small errors quickly accumulate and are magnified. The aforementioned analyses of DDPMs managed to overcome this challenge by leveraging techniques specific to the analysis of SDEs, through which we now understand that stochasticity plays an important role in alleviating error accumulation. It is unknown, however, how to carry out the analysis for the purely deterministic dynamics inherent to the probability flow ODE. Our first main contribution is to give the first convergence guarantees for SGMs with OU forward dynamics in which steps of the discretized probability flow ODE referred to as predictor steps are interleaved with corrector steps which runs the overdamped Langevin diffusion with estimated score, as pioneered in [Son+21b]. Our results are akin to prior works on DDPMs in that they hold under minimal assumptions on the data distribution and under L2 bounds on the score estimation error, and our guarantees scale polynomially in all relevant problem parameters. Here, the corrector steps inject stochasticity which is crucial for our proofs; however, we emphasize that the use of corrector steps does not simply reduce the problem to applying existing DDPM analyses. Instead, we must develop an entirely new framework based on Wasserstein to TV regularization, which is of independent interest; see 4 for a detailed overview of our techniques. Our results naturally raise the question of whether the corrector steps are necessary in practice, and we discuss this further in 5. When the data distribution is log-smooth, then the dimension dependence of prior results on DDPMs, as well as our first result for the probability flow ODE with overdamped corrector, both scale as O(d). Does this contradict the intuition that ODE discretization is more accurate than SDE discretization? The answer is no; upon inspecting our proof, we see that the discretization error of the probability flow ODE is indeed smaller than what is incurred by DDPMs, and in fact allows for a larger step size of order 1/ d. The bottleneck in our result stems from the use of the overdamped Langevin diffusion for the corrector steps. Taking inspiration from the literature on log-concave sampling (see, e.g., [Che22] for an exposition), our second main contribution is to propose corrector steps based on the underdamped Langevin diffusion (see 2) which is known to improve the dimension dependence of sampling. In particular, we show that the probability flow ODE with underdamped Langevin corrector attains O( d) dimension dependence. This dependence is better than what was obtained for DDPMs in [Che+23a; CLL23; LLT23] and therefore highlights the potential benefits of the ODE framework. We note that the benefit to which we refer is at generation time, and not at training time. Previously, [JP22] have proposed a noise denoise" sampler using the underdamped Langevin diffusion, but to our knowledge, our work is the first to use it in conjunction with the probability flow ODE. Although we provide preliminary numerical experiments in the Appendix, we leave it as a question for future work to determine whether the theoretical benefits of the underdamped Langevin corrector are also borne out in practice. 1.1 Our contributions In summary, our contributions are the following. We provide the first convergence guarantees for the probability flow ODE with overdamped Langevin corrector (DPOM; Algorithm 1). We propose an algorithm based on the probability flow ODE with underdamped Langevin corrector (DPUM; Algorithm 2). We provide the first convergence guarantees for DPUM. These convergence guarantees show improvement over (i) the complexity of DPOM (O( d) vs O(d)) and (ii) the complexity of DDPMs, i.e., SDE implementations of score-based generative models (again, O( d) vs O(d)). We provide preliminary numerical experiments in a toy example showing that DPUM can sample from a highly non log-concave distribution (see Appendix). The numerical experiments are not among our main contributions and are provided for illustration only. The Python code can be found in the Supplementary material. Our main theorem can be summarized informally as follows; see 3 for more detailed statements. Theorem 1 (Informal). Assume that the score function along the forward process is L-Lipschitz, and that the data distribution has finite second moment. Assume that we have access to e O(ε/ L) L2-accurate score estimates. Then, the probability flow ODE implementation of the reversed Ornstein Uhlenbeck process, when interspersed with either the overdamped Langevin corrector (DPOM; Algorithm 1) or with the underdamped Langevin corrector (DPUM; Algorithm 2), outputs a sample whose law is ε-close in total variation distance to the data distribution, using e O(L3d/ε2) or e O(L2 d/ε) iterations respectively. Our result provides the first polynomial-time guarantees for the probability flow ODE implementation of SGMs, so long as it is combined with the use of corrector steps. Moreover, when the corrector steps are based on the underdamped Langevin diffusion, then the dimension dependence of our result is significantly smaller (O( d) vs. O(d)) than prior works on the complexity of DDPMs, and thus provides justification for the use of ODE discretization in practice, compared to SDEs. Our main assumption on the data is that the score functions along the forward process are Lipschitz continuous, which allows for highly non-log-concave distributions, yet does not cover non-smooth distributions such as distributions supported on lower-dimensional manifolds. However, as shown in [Che+23a; CLL23; LLT23], we can also obtain polynomial-time guarantees without this smoothness assumption via early stopping (see Remark 1). 1.2 Related works The idea of using a time-reversed diffusion for sampling has been fruitfully exploited in the logconcave sampling literature via the proximal sampler [TP18; LST21; CE22; LC22; FYC23; LC23], as put forth in [Che+22], as well as through algorithmic stochastic localization [EMS22; MW23]. Although we do not aim to be comprehensive in our discussion of the literature, we mention, e.g., [ABV23; Che+23b] for alternative approaches for diffusion models. We also note that the recent work of [CDD23] obtained a discretization analysis for the probability flow ODE (without corrector) in KL divergence, though their bounds have a large dependence on d and are exponential in the Lipschitz constant of the score integrated over time. Since the original ar Xiv submission of this paper, there have been further works studying the probability flow ODE. The work of [BDD23] also studied the probability flow ODE, but without providing discretization guarantees (and with possibly exponential dependencies). The work [Li+23] provides polynomial-time guarantees for the probability flow ODE (without corrector steps), at the cost of larger polynomial dependencies and more stringent score assumptions (namely, bounds on the Jacobian of the score). Also, [PMM23] study another variant of the predictor-corrector framework. 2 Preliminaries 2.1 Score-based generative modeling Let q denote the data distribution, i.e., the distribution from which we wish to sample. In score-based generative modeling, we define a forward process (q t )t 0 with q 0 = q , which transforms our data distribution into noise. In this paper, we focus on the canonical choice of the Ornstein Uhlenbeck (OU) process, dx t = x t dt + 2 d Bt , x 0 q , q t := law(x t ) , (1) where (Bt)t 0 is a standard Brownian motion in Rd. It is well-known that the OU process mixes rapidly (exponentially fast) to its stationary distribution, the standard Gaussian distribution γd. Once we fix a time horizon T > 0, the time reversal of the SDE defined in (1) over [0, T] is given by dx t = (x t + 2 ln q t (x t )) dt + 2 d Bt , (2) where q t := q T t, and the reverse SDE is a generative model: when initialized at x 0 q 0 , then x T q. Since q 0 = q T γd, the reverse SDE transforms samples from γd (i.e., pure noise) into approximate samples from q . In order to implement the reverse SDE, however, one needs to estimate the score functions ln q t for t [0, T] using the technique of score matching [Hyv05; Vin11]. In practice, the score estimates are produced via a deep neural network, and our main assumption is that these score estimates are accurate in an L2 sense (see Assumption 4). This gives rise to the denoising diffusion probabilistic modeling (DDPM) algorithm. Notation. Since the reverse process is the primary object of interest, we drop the arrow from the notation for simplicity; thus, qt := q t . We will always denote the forward process with the arrow . For each t [0, T], let st denote the estimate for the score ln qt = ln q t . 2.2 Probability flow ODE (predictor steps) Instead of running the reverse SDE (2), there is in fact an alternative process (xt)t [0,T ] which evolves according to an ODE (and hence evolves deterministically), and yet has the same marginals as (2). This alternative process, called the probability flow ODE, can also be used for generative modeling. One particularly illuminating way of deriving the probability flow ODE is to invoke the celebrated theorem, due to [JKO98], that the OU process is the Wasserstein gradient flow of the KL divergence functional (i.e. relative entropy) KL( γd). From the general theory of Wasserstein gradient flows (see [AGS08; San15]), the Wasserstein gradient flow (µt)t 0 of a functional F can be implemented via the dynamics zt = [ W2F(µt)](zt) , z0 µ0 , in that zt µt for all t 0. Applying this to F := KL( γd), we arrive at the forward process x t = ln q t γd (x t ) = x t ln q t (x t ) . (3) Setting xt := x T t, it is easily seen that the time reversal of (3) is xt = xt + ln qt(xt) , i.e., xt = xt + ln q T t(xt) , (4) which is called the probability flow ODE. In this paper, the interpretation of the probability flow ODE as a reverse Wasserstein gradient flow is only introduced for interpretability, and the reader who is unfamiliar with Wasserstein calculus can take (4) to be the definition of the probability flow ODE. Crucially, it has the property that if x0 q0, then xt qt for all t [0, T]. We can discretize the ODE (4). Fixing a step size h > 0, replacing the score function ln qt with the estimated score given by st, and applying the exponential integrator to the ODE (i.e., exactly integrating the linear part), we arrive at the discretized process xt+h = xt + Z h 0 xt+u du + h st(xt) = exp(h) xt + (exp(h) 1) st(xt) . (5) 2.3 Corrector steps Let q be a distribution over Rd, and write U as a shorthand for the potential ln q. Overdamped Langevin. The overdamped Langevin diffusion with potential U is a stochastic process (xt)t 0 over Rd given by dxt = U(xt) dt + 2 d Bt . The stationary distribution of this diffusion is q exp( U). We also consider the following discretized process where U is replaced by a score estimate s. Fix a step size h > 0 and let (bxt)t 0 over Rd be given by dbxt = s(bx t/h h) dt + Underdamped Langevin. Given a friction parameter γ > 0, the corresponding underdamped Langevin diffusion is a stochastic process (zt, vt)t 0 over Rd Rd given by dzt = vt dt , dvt = ( U(zt) + γvt) dt + p The stationary distribution of this diffusion is q γd. We also consider the following discretized process, where U is replaced by a score estimate s. Let (bzt, bvt)t 0 over Rd Rd be given by dbzt = bvt dt , dbvt = (s(bz t/h h) γbvt) dt + p 2γ d Bt . (6) Diffusions as corrector steps. At time t, the law of the ideal reverse process (4) initialized at q0 is qt. However, errors are accumulated through the course of the algorithm: the error from initializing at γd rather than at q0; errors arising from discretization of (4); and errors in estimating the score function. That s why the law of the algorithm s iterate will not be exactly qt. We propose to use either the overdamped or the underdamped Langevin diffusion with stationary distribution qt and estimated score as a corrector, in order to bring the law of the algorithm iterate closer to qt. In the case of the underdamped Langevin diffusion, this is done by drawing an independent Gaussian random variable bv0 γd, running the system (6) starting from (bz0, bv0) (where bz0 is the current algorithm iterate) for some time t, and then keeping bzt. In our theoretical analysis, the use of corrector steps boosts the accuracy and efficiency of the SGM. 3.1 Assumptions We make the following mild assumptions on the data distribution q and on the score estimate s. Assumption 1 (second moment bound). We assume that m2 2 := Eq [ 2] < . Assumption 2 (Lipschitz score). For all t [0, T], the score ln qt is L-Lipschitz, for some L 1. Assumption 3 (Lipschitz score estimate). For all t for which we need to estimate the score function in our algorithms, the score estimate st is L-Lipschitz. Assumption 4 (score estimation error). For all t for which we need to estimate the score function in our algorithms, Eqt[ st ln qt 2] ε2 sc . Assumptions 1, 2, and 4 are standard and were shown in [Che+23a; CLL23; LLT23] to suffice for obtaining polynomial-time convergence guarantees for DDPMs. The new condition that we require in our analysis is Assumption 3, which was used in [LLT22] but ultimately shown to be unnecessary for DDPMs. We leave it as an open question whether this can be lifted in the ODE setting. Remark 1. As observed in [Che+23a; CLL23; LLT23], Assumption 2 can be removed via early stopping, at the cost of polynomially larger iteration complexity. The idea is that if q has compact support but does not necessarily satisfy Assumption 2 (e.g., if q is supported on a compact and lower-dimensional manifold), then q δ will satisfy Assumption 2 if δ > 0. By applying our analysis up to time T δ instead of time T, one can show that a suitable projection of the output distribution is close in Wasserstein distance to q (see [CLL23, Corollary 2.4] or [Che+23a, Corollary 5]). For brevity, we do not consider this extension of our results here. 3.2 Algorithms We provide the pseudocode for the two algorithms we consider, Diffusion Predictor + Overdamped Modeling (DPOM) and Diffusion Predictor + Underdamped Modeling (DPUM), in Algorithms 1 and 2 respectively. The only difference between the two algorithms is in the corrector step, which we highlight in Algorithm 2. For simplicity, we take the total amount of time T to be equal to N0/L + hpred for an integer N0 1, and we assume that 1/L is a multiple of hpred and that hpred is a multiple of δ = Θ ε2 L2 (d m2 2) . We consider two stages: in the first stage, which lasts until time N0/L = T hpred, we intersperse predictor epochs (run for time 1/L, discretized with step size hpred) and corrector epochs (run for time Θ(1/L) for the overdamped corrector or for time Θ(1/ L) for the underdamped corrector, and discretized with step size hcorr). The second stage lasts from time T hpred to time T δ, and we incorporate geometrically decreasing step sizes for the predictor. Note that this implies that our algorithm uses early stopping. Algorithm 1: DPOM(T, hpred, hcorr, s) Input: Total time T, predictor step size hpred, corrector step size hcorr, score estimates s Output: Approximate sample from the data distribution q 1 Draw bx0 γd. 2 for n = 0, 1, . . . , N0 1 do 3 Predictor: Starting from bxn/L, run the discretized probability flow ODE (5) from time n L with step size hpred and estimated scores to obtain bx (n+1)/L. 4 Corrector: Starting from bx (n+1)/L, run overdamped Langevin Monte Carlo for total time Θ(1/L) with step size hcorr and score estimate s(n+1)/L to obtain bx(n+1)/L. 5 Predictor: Starting from bx T hpred, run the discretized probability flow ODE (5) with step sizes hpred/2, hpred/4, hpred/8, . . . , δ and estimated scores to obtain bx T δ. 6 Corrector: Starting from bx T δ, run overdamped Langevin Monte Carlo for total time Θ(1/L) with step size hcorr and score estimate s T δ to obtain bx T δ. 7 return bx T δ Algorithm 2: DPUM(T, hpred, hcorr, s) Input: Total time T, predictor step size hpred, corrector step size hcorr, score estimates s Output: Approximate sample from the data distribution q 1 Draw bx0 γd. 2 for n = 0, 1, . . . , N0 1 do 3 Predictor: Starting from bxn/L, run the discretized probability flow ODE (5) from time n L with step size hpred and estimated scores to obtain bx (n+1)/L. 4 Corrector: Starting from bx (n+1)/L, run underdamped Langevin Monte Carlo for total time L) with step size hcorr and score estimate s(n+1)/L to obtain bx(n+1)/L. 5 Predictor: Starting from bx T hpred, run the discretized probability flow ODE (5) with step sizes hpred/2, hpred/4, hpred/8, . . . , δ and estimated scores to obtain bx T δ. 6 Corrector: Starting from bx T δ, run underdamped Langevin Monte Carlo for total time Θ(1/ L) with step size hcorr and score estimate s T δ to obtain bx T δ. 7 return bx T δ 3.3 Convergence guarantees Our main results are the following convergence guarantees for the two predictor-corrector schemes described in 3.2: Theorem 2 (DPOM). Suppose that Assumptions 1 4 hold. If bq denotes the output of DPOM (Algorithm 1) with δ ε2 L2 (d m2 2), then TV(bq, q ) ( d m2) exp( T) + L2Td1/2hpred + L3/2Td1/2h1/2 corr + L1/2Tεsc + ε . (7) In particular, if we set T = Θ ln( d m2 2 ε2 ) , hpred = eΘ( ε L2d1/2 ), hcorr = eΘ( ε2 L3d), and if the score estimation error satisfies εsc e O( ε L), then we can obtain TV error ε with a total iteration complexity ε2 ) steps. The five terms in the bound (7) correspond, respectively, to: the convergence of the forward (OU) process; the discretization error from the predictor steps; the discretization error from the corrector steps; the score estimation error; and the early stopping error. Theorem 2 recovers nearly the same guarantees as the one in [Che+23a; CLL23; LLT23], but for the probability flow ODE with overdamped Langevin corrector instead of the reverse SDE without corrector. Recall also from Remark 1 that our results can easily be extended to compactly supported data distributions without smooth score functions. This covers essentially all distributions encountered in practice. Therefore, our result provides compelling theoretical justification complementing the empirical efficacy of the probability flow ODE, which was hitherto absent from the literature. However, in Theorem 2, the iteration complexity is dominated by the corrector steps. Next, we show that by replacing the overdamped LMC with underdamped LMC, we can achieve a quadratic improvement in the number of steps, considering the dependence on d. As discussed in the Introduction, this highlights the potential benefits of the ODE framework over the SDE. Theorem 3 (DPUM). Suppose that Assumptions 1 4 hold. If bq denotes the output of DPUM (Algorithm 2) with δ ε2 L2 (d m2 2), then TV(bq, q ) ( d m2) exp( T) + L2Td1/2hpred + L3/2Td1/2hcorr + L1/2Tεsc + ε . In particular, if we set T = Θ ln( d m2 2 ε2 ) , hpred = eΘ( ε L2d1/2 ), hcorr = eΘ( ε L3/2d1/2 ), and if the score estimation error satisfies εsc e O( ε L), then we can obtain TV error ε with a total iteration complexity of eΘ( L2d1/2 4 Proof overview Here we give a detailed technical overview for the proof of our main results, Theorems 2 and 3. As in [Che+23a; CLL23; LLT23], the three sources of error that we need to keep track of are (1) estimation of the score function; (2) discretization of time when implementing the probability flow ODE and corrector steps; and (3) initialization of the algorithm at γd instead of the true law of the end of the forward process, q0 = q T . It turns out that (1) is not so difficult to manage as soon as we can control (2) and (3). Furthermore, as in prior work, we can easily control (3) via the data-processing inequality: the total variation distance between the output of the algorithm initialized at q0 versus at γd is at most TV(q T , γd), which is exponentially small in T by rapid mixing of the OU process. So henceforth in this overview, let us assume that both the algorithm and the true process are initialized at q0. It remains to control (2). Failure of existing approaches. In the SDE implementation of diffusion models, prior works handled (2) by directly bounding a strictly larger quantity, namely the KL divergence between the laws of the trajectories of the algorithm and the true process; by Girsanov s theorem, this has a clean formulation as an integrated difference of drifts. Unfortunately, in the ODE implementation, this KL divergence is infinite: in the absence of stochasticity in the reverse process, these laws over trajectories are not even absolutely continuous with respect to each other. In search of an alternative approach, one might try a Wasserstein analysis. As a first attempt, we could couple the initialization of both processes and look at how the distance between them changes over time. If (bxt)0 t T and (xt)0 t T denote the algorithm and true process, then smoothness of the score function allows us to naïvely bound t E[ bxt xt 2] by O(L) E[ bxt xt 2]. While this ensures that the processes are close if run for time 1/L, it does not rule out the possibility that they drift apart exponentially quickly after time 1/L. Restarting the coupling first attempt. What we would like is some way of restarting this coupling before the processes drift too far apart, to avoid this exponential compounding. We now motivate how to achieve this by giving an argument that is incorrect but nevertheless captures the intuition for our approach. Namely, let pt := law(bxt) denote the law of the algorithm, let P t0,h ODE denote the result of running the ideal probability flow ODE for time h starting from time t0, and let b P t0,h ODE denote the same but for the discretized probability flow ODE with estimated score. For h 1/L, consider the law of the two processes at time 2h, i.e., p2h = q0 b P 0,2h ODE and q2h = q0P 0,2h ODE . (8) The discussion above implies that q0P 0,h ODE and q0 b P 0,h ODE are close in 2-Wasserstein distance, so by the data-processing inequality, this implies that q0P 0,h ODE b P h,h ODE and q0 b P 0,h ODE b P h,h ODE are also close. To show that p2h and q2h in Eq. (8) are close, it thus suffices to show that q0P 0,2h ODE and q0P 0,h ODE b P h,h ODE are close. But these two distributions are given by running the algorithm and the true process for time h, both starting from q0P 0,h ODE. So if we restart the coupling by coupling the processes based on their locations at time h, rather than time 0, of the reverse process, we can again apply the naïve Wasserstein analysis. At this juncture, it would seem that we have miraculously sidestepped the exponential blowup and shown that the expected distance between the processes only increases linearly over time! The issue of course is in the application of the data-processing inequality, which simply does not hold for the Wasserstein distance. Restarting the coupling with a corrector step. This is where the corrector comes in. The idea is to use short-time regularization: if we apply a small amount of noise to two distributions which are already close in Wasserstein, then they become close in KL divergence, for which a data-processing inequality holds. The upshot is that if the noise doesn t change the distributions too much, then we can legitimately restart the coupling as above and prove that the distance between the processes, now defined by interleaving the probability flow ODE and its discretization with periodic injections of noise, increases only linearly in time. It turns out that naïve injection of noise, e.g., convolution with a Gaussian of small variance, is somewhat wasteful as it fails to preserve the true process and leads to poor polynomial dependence in the dimension. On the other hand, if we instead run the overdamped Langevin diffusion with potential chosen so that the law of the true process is stationary, then we can recover the linear in d dependence of Theorem 2. Then by replacing overdamped Langevin diffusion with its underdamped counterpart, which has the advantage of much smoother trajectories, we can obtain the desired quadratic speedup in dimension dependence in Theorem 3. Score perturbation lemma. In addition to the switch from SDE to ODE and the use of the underdamped corrector, a third ingredient is essential to our improved dimension dependence. The former two ensure that the trajectory of our algorithm is smoother than that of DDPMs, so that even over time windows that scale with 1/ d, the process does not change too much. By extension, as the score functions are Lipschitz, this means that any fixed score function evaluated over iterates in such a window does not change much. This amounts to controlling discretization error in space. It is also necessary to control discretization error in time, i.e., proving what some prior works referred to as a score perturbation lemma [LLT22]. That is, for any fixed iterate x, we want to show that the score function ln qt(x) does not change too much as t varies over a small window. Unfortunately, prior works were only able to establish this over windows of length 1/d. In this work, we improve this to windows of length 1/ d (see Lemma 3 and Corollary 1). In our proof, we bound the squared L2 norm of the derivative of the score along the trajectory of the ODE. The score function evaluated at y can be expressed as EP0|t( |y)[ U]; here, the posterior distribution P0|t( | y) is essentially the prior q tilted by a Gaussian of variance O(t). Hence we need to bound the change in the expectation when we change the distribution from P0|t to P0|t+ t; because U is L-Lipschitz, we can bound this by the Wasserstein distance between the distributions. For small enough t, P0|t is strongly log-concave, and a transport cost inequality bounds this in terms of KL divergence, which is more easily bounded. Indeed, we can bound it with the KL divergence between the joint distributions P0,t and P0,t+ t, which reduces to bounding the KL divergence between Gaussians of unequal variance. However, since our score perturbation lemma degrades near the beginning of the forward process, we require better control of the discretization error during this part of the algorithm, hence leading to our choice of geometrically decreasing step sizes. Alternatively, we could use a two-stage step size schedule, see Remark 4. 5 Conclusion In this work, we have provided the first polynomial-time guarantees for the probability flow ODE implementation of SGMs with corrector steps and exhibited improved dimension dependence of the ODE framework over prior results for DDPMs (i.e., the SDE framework). Our analysis raises questions relevant for practice, of which we list a few. Although we need the corrector steps for our proof, are they in fact necessary for the algorithm to work efficiently in practice? Is it possible to obtain even better dimension dependence, perhaps using higher-order solvers and stronger smoothness assumptions? Can we obtain improved dimension dependence even in the non-smooth setting, compared to the result of [CLL23]? We also list several limitations of our work, namely: Our analysis only covers the probability flow ODE corresponding to the OU forward process. We leave the study of more general dynamics for future study. Our guarantees require the score function to be learned to L2 accuracy e O(ε/ L), which is more stringent than the prior works [Che+23a; CLL23; LLT23] and may be an technical artefact of our proof. We have not validated our theoretical findings with large-scale experiments. In particular, it is still unclear whether flow-based methods can outperform the standard DDPM algorithm in practical, high dimensional settings. 6 Acknowledgments We want to thank Yin Tat Lee for his valuable comments, shaping the direction of this research project in its early stages. SC was supported by NSF Award 2103300 for part of this work. [ABV23] M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: a unifying framework for flows and diffusions . In: ar Xiv preprint 2303.08797 (2023). [AGS08] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Second. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2008, pp. x+334. [Aus+21] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured denoising diffusion models in discrete state-spaces . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 17981 17993. [BDD23] J. Benton, G. Deligiannidis, and A. Doucet. Error bounds for flow matching methods . In: ar Xiv preprint 2305.16860 (2023). [BGL01] S. G. Bobkov, I. Gentil, and M. Ledoux. Hypercontractivity of Hamilton Jacobi equations . In: Journal de Mathématiques Pures et Appliquées 80.7 (2001), pp. 669 696. [BMR22] A. Block, Y. Mroueh, and A. Rakhlin. Generative modeling with denoising autoencoders and Langevin sampling . In: ar Xiv preprint 2002.00107 (2022). [BV23] N. M. Boffi and E. Vanden-Eijnden. Probability flow solution of the Fokker Planck equation . In: Machine Learning: Science and Technology 4.3 (July 2023), p. 035012. [CDD23] S. Chen, G. Daras, and A. Dimakis. Restoration-degradation beyond linear diffusions: a non-asymptotic analysis for DDIM-type samplers . In: Proceedings of the 40th International Conference on Machine Learning. Ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, July 2023, pp. 4462 4484. [CE22] Y. Chen and R. Eldan. Localization schemes: a framework for proving mixing bounds for Markov chains . In: ar Xiv preprint 2203.04163 (2022). [Che+22] Y. Chen, S. Chewi, A. Salim, and A. Wibisono. Improved analysis for a proximal algorithm for sampling . In: Proceedings of Thirty Fifth Conference on Learning Theory. Ed. by P.-L. Loh and M. Raginsky. Vol. 178. Proceedings of Machine Learning Research. PMLR, July 2022, pp. 2984 3014. [Che+23a] S. Chen, S. Chewi, J. Li, Y. Li, A. Salim, and A. R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions . In: The Eleventh International Conference on Learning Representations. 2023. [Che+23b] Y. Chen, W. Deng, S. Fang, F. Li, N. T. Yang, Y. Zhang, K. Rasul, S. Zhe, A. Schneider, and Y. Nevmyvaka. Provably convergent Schrödinger bridge with applications to probabilistic time series imputation . In: Proceedings of the 40th International Conference on Machine Learning. Ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, July 2023, pp. 4485 4513. [Che22] S. Chewi. Log-concave sampling. Book draft available at https://chewisinho. github.io/. 2022. [CLL23] H. Chen, H. Lee, and J. Lu. Improved analysis of score-based generative modeling: user-friendly bounds under minimal smoothness assumptions . In: Proceedings of the 40th International Conference on Machine Learning. Ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, July 2023, pp. 4735 4763. [CSY22] H. Chung, B. Sim, and J. Ye. Come-closer-diffuse-faster: accelerating conditional diffusion models for inverse problems through stochastic contraction . In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, June 2022, pp. 12403 12412. [De +21] V. De Bortoli, J. Thornton, J. Heng, and A. Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 17695 17709. [De 22] V. De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis . In: Transactions on Machine Learning Research (2022). [DN21] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 8780 8794. [EMS22] A. El Alaoui, A. Montanari, and M. Sellke. Sampling from the Sherrington Kirkpatrick Gibbs measure via algorithmic stochastic localization . In: 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science FOCS 2022. IEEE Computer Soc., Los Alamitos, CA, 2022, pp. 323 334. [FYC23] J. Fan, B. Yuan, and Y. Chen. Improved dimension dependence of a proximal algorithm for sampling . In: Proceedings of Thirty Sixth Conference on Learning Theory. Ed. by G. Neu and L. Rosasco. Vol. 195. Proceedings of Machine Learning Research. PMLR, July 2023, pp. 1473 1521. [Gna+22] D. Gnaneshwar, B. Ramsundar, D. Gandhi, R. Kurchin, and V. Viswanathan. Scorebased generative models for molecule generation . In: ar Xiv preprint 2203.04698 (2022). [GW12] A. Guillin and F.-Y. Wang. Degenerate Fokker Planck equations: Bismut formula, gradient estimate and Harnack inequality . In: Journal of Differential Equations 253.1 (2012), pp. 20 40. [HJA20] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models . In: Advances in Neural Information Processing Systems 33 (2020), pp. 6840 6851. [Hyv05] A. Hyvärinen. Estimation of non-normalized statistical models by score matching . In: J. Mach. Learn. Res. 6 (2005), pp. 695 709. [JKO98] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker Planck equation . In: SIAM J. Math. Anal. 29.1 (1998), pp. 1 17. [JP22] A. Jain and B. Poole. Journey to the BAOAB-limit: finding effective MCMC samplers for score-based models . In: Neur IPS 2022 Workshop on Score-Based Methods. 2022. [Kin+21] D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 21696 21707. [LC22] J. Liang and Y. Chen. A proximal algorithm for sampling from non-smooth potentials . In: ar Xiv preprint 2110.04597 (2022). [LC23] J. Liang and Y. Chen. A proximal algorithm for sampling . In: Transactions on Machine Learning Research (2023). [Li+23] G. Li, Y. Wei, Y. Chen, and Y. Chi. Towards faster non-asymptotic convergence for diffusion-based generative models . In: ar Xiv preprint 2306.09251 (2023). [Liu+22] X. Liu, L. Wu, M. Ye, and Q. Liu. Let us build bridges: understanding and extending diffusion generative models . In: ar Xiv preprint ar Xiv:2208.14699 (2022). [LLT22] H. Lee, J. Lu, and Y. Tan. Convergence for score-based generative modeling with polynomial complexity . In: Advances in Neural Information Processing Systems. Ed. by A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho. 2022. [LLT23] H. Lee, J. Lu, and Y. Tan. Convergence of score-based generative modeling for general data distributions . In: Proceedings of the 34th International Conference on Algorithmic Learning Theory. Ed. by S. Agrawal and F. Orabona. Vol. 201. Proceedings of Machine Learning Research. PMLR, Feb. 2023, pp. 946 985. [LST21] Y. T. Lee, R. Shen, and K. Tian. Structured logconcave sampling with a restricted Gaussian oracle . In: Proceedings of Thirty Fourth Conference on Learning Theory. Ed. by M. Belkin and S. Kpotufe. Vol. 134. Proceedings of Machine Learning Research. PMLR, Aug. 2021, pp. 2993 3050. [Lu+22] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps . In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 5775 5787. [MW23] A. Montanari and Y. Wu. Posterior sampling from the spiked models via diffusion processes . In: ar Xiv preprint 2304.11449 (2023). [Pid22] J. Pidstrigach. Score-based generative models detect manifolds . In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 35852 35865. [PMM23] F. Pedrotti, J. Maas, and M. Mondelli. Improved convergence of score-based diffusion models via prediction-correction . In: ar Xiv preprint 2305.14164 (2023). [Ram+22] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents . In: ar Xiv preprint ar Xiv:2204.06125 (2022). [Rom+22] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models . In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, June 2022, pp. 10674 10685. [San15] F. Santambrogio. Optimal transport for applied mathematicians. Vol. 87. Progress in Nonlinear Differential Equations and their Applications. Calculus of variations, PDEs, and modeling. Birkhäuser/Springer, Cham, 2015, pp. xxvii+353. [SE19] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution . In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc., 2019. [Shi+21] C. Shi, S. Luo, M. Xu, and J. Tang. Learning gradient fields for molecular conformation generation . In: Proceedings of the 38th International Conference on Machine Learning. Ed. by M. Meila and T. Zhang. Vol. 139. Proceedings of Machine Learning Research. PMLR, July 2021, pp. 9558 9568. [Soh+15] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics . In: Proceedings of the 32nd International Conference on Machine Learning. Ed. by F. Bach and D. Blei. Vol. 37. Proceedings of Machine Learning Research. Lille, France: PMLR, July 2015, pp. 2256 2265. [Son+21a] Y. Song, C. Durkan, I. Murray, and S. Ermon. Maximum likelihood training of scorebased diffusion models . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 1415 1428. [Son+21b] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Scorebased generative modeling through stochastic differential equations . In: International Conference on Learning Representations. 2021. [Son+22] Y. Song, L. Shen, L. Xing, and S. Ermon. Solving inverse problems in medical imaging with score-based generative models . In: International Conference on Learning Representations. 2022. [TP18] M. K. Titsias and O. Papaspiliopoulos. Auxiliary gradient-based sampling algorithms . In: J. R. Stat. Soc. Ser. B. Stat. Methodol. 80.4 (2018), pp. 749 767. [Vin11] P. Vincent. A connection between score matching and denoising autoencoders . In: Neural Comput. 23.7 (2011), pp. 1661 1674. [VKK21] A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space . In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 11287 11302. [WHZ23] Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning . In: The Eleventh International Conference on Learning Representations. 2023. [WY22] A. Wibisono and K. Y. Yang. Convergence in KL divergence of the inexact Langevin algorithm with application to score-based generative models . In: ar Xiv preprint 2211.01512 (2022). [ZC23] Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator . In: The Eleventh International Conference on Learning Representations. 2023. A Notation and overview In this section, we collect together the notation used throughout the proofs and provide a road map for the end-to-end analysis in E. Throughout the analysis, Assumptions 1 4 are in full force. We will reserve q for the law of the reverse process (and denote the forward process by q when needed). In E, the law of the algorithm is denoted by p. We use the following Markov kernels: 1. P t,h ODE is the output of running the ODE for time h, starting at (reverse) time t. 2. PLD (resp. PULD) is the output of running the continuous-time overdamped (resp. underdamped) Langevin diffusion for time h. In this notation, we have suppressed mention of the stationary distributions of the diffusion, which will be provided by context. 3. b P t,h ODE and b PLMC (resp. b PULMC) are the corresponding processes once discretized and using the estimated score. For the ODE, we are more precise with the notation because even within a single epoch of predictor steps, the kernel for the probability flow ODE depends on time (as opposed to the kernels for the diffusions, which are constant within any epoch of corrector steps); moreover, for our analysis in E, we also need to take time-varying step sizes for the predictor steps. We will omit the dependencies on t and h when clear from context. When P = PODE or b PODE, we use P t,h1,...,h N to denote P t,h1P t+h1,h2 P t+h1+ +h N 1,h N (we compose kernels on the right). We refer to 4 for a high-level description of the proof strategy. We begin in B with our improved score perturbation lemma (Corollary 1); this is the only section of the analysis which is indexed by forward time (instead of reverse time). In Lemma 5 in C, we establish our main result for the predictor steps, which combines together standard ODE discretization analysis with the score perturbation lemma of B. Since Corollary 1 degrades near the end of the reverse process (or equivalently, near the start of the forward process, when the regularization has not yet kicked in), our analysis requires a geometrically decreasing step size schedule, which leads to the two-stage Algorithms 1 and 2. In D, we prove our main regularization results for the overdamped corrector (Theorem 4) and the underdamped corrector (Theorem 5). Finally, we put together the various constituent results in the end-to-end analysis in E. B Score perturbation In this section, we prove a score perturbation lemma which refines that of [LLT22]. This improved lemma is necessary in order to obtain O( d) dependence for the probability flow ODE. Lemma 1 (Score perturbation). Suppose pt = p0 N(0, t I) and x0 p0, xt = 1 2 ln pt(xt). Suppose that 2 ln p(t 1 2L ) 0(x) op L for all x. Then E[ t ln pt(xt) 2] L2d L + 1 Proof. Without loss of generality, we may assume t 1 2L, as otherwise, noting that pt = pt 1 2L N(0, 1 2LI), we may replace p0 with pt 1 2L and t with 1 2L. Suppose p0(x) = e V (x). Let P0,t denote the joint distribution of (X0, Xt) where Xt = X0 + t Z with Z N(0, I) independent of X0, and let P0|t( | xt) denote the conditional distribution of X0 given Xt = xt. We first note that since ln pt(y) = ln Z exp V (x) 1 2t y x 2 dx , we have the following calculations: ln pt(y) = 1 t EP0|t( |y)(y ) = EP0|t( |y)( V ) , 2 ln pt(y) = Cov P0|t( |y)( V ) EP0|t( |y)( 2V ) . Using xt = 1 2 ln pt(xt), we calculate t ln pt(xt) = [ t ln pt(y)]|y=xt 1 2 2 ln pt(xt) ln pt(xt) . (9) We bound each term above separately. For the first term, a quick calculation shows that t ln pt(y) = Cov P0|t( |y) y 2 2t2 , V is finite a.s.: by Cauchy Schwarz, it suffices to show EP0|t( |y)[ y 4] and EP0|t( |y)[ V 2] are finite for all y, and this follows because for t 1 2L the measure P0|t( | y) is strongly log-concave and V 2 can be bounded by a quadratic. Because V is L-Lipschitz, [ t ln pt(y)]|y=xt 2 = [ t EP0|t( |y)( V )]|y=xt 2 = lim t 0 1 t EP0|t+ t( |y)[ V ] EP0|t( |y)[ V ] y=xt L2 lim inf t 0 1 ( t)2 W 2 1 P0|t+ t( | xt), P0|t( | xt) . (10) Now P0|t( | y) has density p0|t(x | y) p0(x) e x y 2 2t so if 2 ln p0 op L and t 1 2L, then P0|t is 1 2t-strongly log-concave. By Talagrand s transport cost inequality, W 2 1 P0|t+ t( | xt), P0|t( | xt) 4t KL P0|t+ t( | xt) P0|t( | xt) . Plugging this back in (10) and using Fatou s lemma and the chain rule for KL, E[ [ t ln pt(y)]|y=xt 2] L2 E lim inf t 0 1 ( t)2 4t KL P0|t+ t( | xt) P0|t( | xt) L2 lim inf t 0 1 ( t)2 4t E KL P0|t+ t( | xt) P0|t( | xt) L2 lim inf t 0 1 ( t)2 4t KL(P0,t+ t P0,t) . (11) KL(P0,t+ t P0,t) = Ex P0 KL Pt+ t|0( | x), Pt|0( | x) = KL N(0, (t + t)I) N(0, t I) Plugging into (11) gives E[ [ t ln pt(y)]|y=xt 2] L2d For the second term, by assumption we have 2 ln pt op L. Then, since xt pt, E[ 2 ln pt(xt) ln pt(xt) 2] L2 Ept[ ln pt 2] L3d (13) using the fact Eµ[ ln µ 2] Ld for any measure µ such that ln µ is L-smooth, which follows from integration by parts. From (9), (12), and (13), and the elementary inequality a, b a 2 + 4 b 2, we get E[ t ln pt(xt) 2] E[ [ t ln pt(y)]|y=xt 2] + E[ 2 ln pt(xt) ln pt(xt) 2] The above result holds for the dynamics xt = 1 2 ln pt(xt) for which (pt)t 0 follows the heat flow; this corresponds to the variance-exploding SGM. In this paper, since we wish to consider the SGM based on the variance-conserving Ornstein Uhlenbeck (OU) process, we can apply the following reparameterization lemma. Lemma 2 (Reparameterization). Suppose that (xt)t 0 satisfies the probability flow ODE for Brownian motion starting at p0; that is, letting pt = p0 N(0, t I), we have x0 p0, xt = 1 2 ln pt(xt). Then, if we set yt = e t xe2t 1 , then (yt)t 0 satisfies the probability flow ODE for the OU process starting at p0; that is, letting q t be the density of the OU process at time t, we have y0 p0 = q 0 , yt = yt ln q t (yt). Proof. By direct calculation, one can check that for any y Rd, it holds that q t (y) pe2t 1(ety). The claim follows from the chain rule. Lemma 3 (Score perturbation for OU). Suppose q t is the density of the OU process at time t, started at q 0 , and y0 q 0 , yt = yt ln q t (yt). Suppose for all t and all x that 2 ln q t (x) op L, where L 1. Then, E[ t ln q t (yt) 2] L2d L 1 Proof. Using the relationship q t (y) pe2t 1(ety), ln q t (y) = et ln pe2t 1(ety) , t ln q t (yt) = et ln pe2t 1(xe2t 1) | {z } =:A + et s ln ps(xs)|s=e2t 1 2e2t | {z } =:B If 2 ln q t op L, then 2 ln pe2t 1 op e 2t L. By Lemma 1, E[ s ln ps(xs)|s=e2t 1 2] e 4t L2d e 2t L 1 e2t 1 . E[B2] L2d L 1 E[A2] e2t E[ ln pe2t 1(xe2t 1) 2] e2t e 2t Ld Ld . The result follows. Finally, we use Lemma 3 to derive a bound on how much the score changes along the trajectory of the probability flow ODE. Corollary 1. Consider the setting of Lemma 3, and suppose 0 < s < t, h = t s. 1. If s, t 1/L, then E ln q t (xt) ln q s (xs) 2 L3dh2 . E ln q t (xt) ln q s (xs) 2 L2dh2 Proof. By Lemma 3, E ln q t (xt) ln q s (xs) 2 = E h Z t s u ln q u (xu) du 2i s E[ u ln q u (xu) 2] du s L2d max L, 1 In the first case, this is bounded by O(L3dh2). In the second case, this is bounded by O(L2dh R t s 1 u du) = O(L2dh ln(t/s)) = O(L2dh2/t). C Predictor step Next, we need an ODE discretization analysis. Lemma 4. Suppose the score function satisfies Assumption 2. Assume that L 1, h 1/L, and T (t0 + h) T t0 W2(q P t0,h ODE, q b P t0,h ODE) Ld1/2h2 L1/2 1 Proof. We have the ODEs xt = xt + ln qt(xt) , bxt = bxt + st0(bxt0) , for t0 t t0 + h, with xt0 = bxt0 q, xt0+h q PODE, and bxt0+h q b PODE. Then, t xt bxt 2 = 2 xt bxt, xt bxt = 2 xt bxt 2 + xt bxt, ln qt(xt) + st0(bxt0) h xt bxt 2 + h ln qt(xt) st0(bxt0) 2 . By Grönwall s inequality, noting that h = O(1), E[ xt0+h bxt0+h 2] exp 2 + 1 t0 h E[ ln qt(xt) st0(bxt0) 2] dt t0 E[ ln qt(xt) st0(bxt0) 2] dt . (14) We split up the error term as ln qt(xt) st0(bxt0) 2 ln qt(xt) ln qt0(xt0) 2 + ln qt0(xt0) st0(bxt0) 2 . By Corollary 1, the expectation of the first term is bounded by E[ ln qt(xt) ln qt0(xt0) 2] L2dh2 L 1 T t0 The second term is bounded in expectation by ε2 sc. Plugging back into (14) gives E[ xt0+h bxt0+h 2] h2 L2dh2 L 1 T t0 The Wasserstein distance is bounded by the square root of this quantity, and the lemma follows. Lemma 4 suggests that focusing on the dependence on d, we will be able to take h d 1/2 (we need to keep one factor of h in the bound, as we need to sum up the bound over 1/h iterations). Remark 2. Our improved score perturbation lemma is necessary to obtain this d1/2 dependence. The original score perturbation lemma [LLT22, Lemma C.11 12] combined with a space discretization bound gives a bound of E ln q t (xt) ln q s (xs) 2 L2dh in place of Corollary 1. Note this is a 1 2-Hölder continuity bound rather than a Lipschitz bound. The bound in Lemma 4 then becomes W2(q P t0,h ODE, q b P t0,h ODE) Ld1/2h3/2 + hεsc , and we would only be able to take h d 1. We also note that our bound has an extra factor of max{L1/2, (T t0) 1/2}; we do not know if this extra factor is necessary. We now iterate Lemma 4 to obtain the following result. Note that we now need to also assume that the score estimate is L-Lipschitz. Lemma 5. Suppose that both Assumptions 2 and 3 hold. Let h1, . . . , h N > 0 be a sequence such that letting t N = h1 + + h N, we have t N 1/L. Let hmax = max1 n N hn. 1. If T (t0 + t N) 1/L, then W2(q P t0,h1,...,h N ODE , q b P t0,h1,...,h N ODE ) L3/2d1/2hmaxt N + εsct N L1/2d1/2hmax + εsc 2. If T t0 1/L and hn+1 T t0 tn 2 for each n, then W2(q P t0,h1,...,h N ODE , q b P t0,h1,...,h N ODE ) L1/2d1/2hmax + εsct N L1/2d1/2hmax + εsc Proof. We abbreviate P N ODE := P t0,h1,...,h N ODE and b P N ODE := b P t0,h1,...,h N ODE . Using the triangle inequality, W2(q P N ODE, q b P N ODE) W2(q P N ODE, q P N 1 ODE b PODE) + W2(q P N 1 ODE b PODE, q b P N ODE) O(Ld1/2h2 N max{L1/2, (T t0 t N) 1/2} + h Nεsc) + exp(O(Lh N)) W2(q P N 1 ODE , q b P N 1 ODE ) where the bound on the first term is by Lemma 4. By induction, W2(q P N ODE, q b P N ODE) Ld1/2h2 n max{L1/2, (T t0 tn) 1/2} + hnεsc exp(O(L (hn+1 + + h N))) . By assumption, hn+1 + + h N t N 1/L. In the first case, we get W2(q P N ODE, q b P N ODE) L3/2d1/2hmaxt N + εsct N . In the second case we get W2(q P N ODE, q b P N ODE) Ld1/2hmax hn (T t0 tn)1/2 + εsct N L1/2d1/2hmax + εsct N by interpreting the summation as a Riemann sum, and noting that the condition hn+1 T t0 tn 2 implies that this is a constant-factor approximation of the integral R T t0 T t0 t N 1 t1/2 dt T t0. Choice of step sizes. In the first case, we can take all the step sizes to be equal, but in the second case, we may need to take decreasing step sizes. Given a target time T t0 t N = δ, by taking h1 = hmax and then hn = min n δ, hmax, T t0 tn 1 we can reach the target time in N = O 1 Lhmax + ln hmax D Corrector step In D.1 (resp. D.2), we will show that if p, q are close in Wasserstein distance, then running the corrector step based on the overdamped (resp. underdamped) Langevin diffusion starting from p and from q for some amount of time results in distributions which are close in total variation distance. In the end-to-end analysis in E, we combine this total variation to Wasserstein regularization with the Wasserstein discretization analysis of the predictor step in C in order to establish our final results. D.1 Corrector via overdamped Langevin We will take the potential and score estimate defining the Markov kernels PLD and b PLMC from 2.3 to be U and s respectively. Recall that these correspond respectively to running the overdamped Langevin diffusion with stationary distribution q exp( U) and running the discretized diffusion with score estimate s, both for time h. The main result of this section is to show that p b P N LMC and q are close in total variation if p and q are close in Wasserstein. Theorem 4 (Overdamped corrector). For any Tcorr := Nh 1/L, TV(p b P N LMC, q) W2(p, q)/ p Tcorr + εsc p Tcorr + L p In particular, for Tcorr 1/L, TV(p b P N LMC, q) L W2(p, q) + εsc/ We will bound TV(p P N LD, q) and TV(p b P N LMC, p P N LD) separately. For the former, we use the following short-time regularization result: Lemma 6 ([BGL01, Lemma 4.2]). If Tcorr 1/L, then TV(p P N LD, q) q KL(p P N LD q) W2(p, q)/ p Proof. The first inequality is Pinsker s inequality. The second inequality is a consequence of [BGL01, Lemma 4.2], which gives a bound of KL(p P N LD q) L (1 + 1/(e2LTcorr 1)) W 2 2 (p, q). The claim then follows from simplifying by using Tcorr 1/L. For the latter term, we introduce notation for two stochastic processes. dx t = U(x t ) dt + 2 d Bt , x 0 q , dxt = U(xt) dt + 2 d Bt , x0 p . Note that for any integer k 0, x kh q P k LD , xkh p P k LD . Observe that marginally, q P k LD = q for any k 0 because q is the stationary distribution of the Langevin diffusion. The three processes are coupled by using the same Brownian motion and by coupling x0 = bx0 p and x 0 q optimally. Before we proceed to bound TV(p P N LD, p b P N LMC), we need the following simple lemma. Lemma 7. If Tcorr 1/L, then E[ xt x t 2] W 2 2 (p, q) for all 0 t Tcorr. Proof. By Itô s formula, d( xt x t 2) = 2 xt x t , U(xt) U(x t ) 2L xt x t 2 . By Grönwall s inequality, xt x t 2 e2Lt x0 x 0 2 , so that if we couple the two processes by coupling x0 and x 0 optimally, we conclude that E[ xt x t 2] e2Lt E[ x0 x 0 2] = e2Lt W 2 2 (p, q) W 2 2 (p, q) , recalling that t Tcorr 1/L by hypothesis. It remains to bound TV(p b P N LMC, p P N LD). Lemma 8. If Tcorr 1/L, then TV(p b P N LMC, p P N LD) q KL(p P N LD p b P N LMC) L p Tcorr W2(p, q) + εsc p Tcorr + L p Proof. As x and bx are driven by the same Brownian motion, by Girsanov s theorem8 and the data processing inequality we have KL(p P N LD p b P N LMC) kh E[ s(xkh) U(xu) 2] du . We can decompose the integrand as follows: E[ s(xkh) U(xu) 2] E s(xkh) s(x kh) 2 + s(x kh) U(x kh) 2 + U(x kh) U(x u) 2 + U(x u) U(xu) 2 L2 E[ xkh x kh 2] + ε2 sc + L2 E[ x kh x u 2] + L2 E[ x u xu 2] L2 W 2 2 (p, q) + ε2 sc + L2 E[ x kh x u 2] (15) where we used Lemma 7 to bound E[ x u xu 2]. It remains to bound E[ x kh x u 2]. Note that E[ x kh x u 2] = E h Z u kh U(x s) ds + 2 (Bu Bkh) 2i h Z u kh E[ U(x s) 2] ds + dh Ldh2 + dh dh , where in the last step we used that E[ U(x u) 2] Ld. Substituting this into (15), we obtain KL(p P N LD p b P N LMC) L2Tcorr W 2 2 (p, q) + ε2 sc Tcorr + L2dh Tcorr . The claimed bound on TV(p P N LD, p b P N LMC) follows by Pinsker s inequality. Proof of Theorem 4. This is immediate from Lemma 6 and Lemma 8, recalling that Tcorr 1/L so that the bound in Lemma 6 dominates the W2(p, q) term in Lemma 8. D.2 Corrector via underdamped Langevin Throughout, we set the friction parameter to We will take the potential and score estimate defining the Markov kernels PULD and b PULMC from 2.3 to be U and s respectively. Recall that these correspond respectively to running the underdamped Langevin diffusion with stationary distribution q and running the discretized diffusion with score estimate s, both for time h. Given probability measures p and q, we write p := p γd and q := q γd, where γd is the standard Gaussian measure in Rd. The main result of this section is to show that p b P N ULMC and q are close in total variation if p and q are close in Wasserstein. Compared to D.1, the discretization error for the underdamped Langevin diffusion is smaller. Theorem 5 (Underdamped corrector). For Tcorr 1/ TV(p b P N ULMC, q) W2(p, q) L1/4T 3/2 corr + εsc T 1/2 corr L1/4 + L3/4T 1/2 corr d1/2h . In particular, if we take Tcorr 1/ TV(p b P N ULMC, q) L W2(p, q) + εsc/ 8Although the validity of Girsanov s theorem typically requires Novikov s condition to be satisfied, this can be avoided via an approximation argument as in [Che+23a]. We will bound TV(p P N ULD, q) and TV(p b P N ULMC, p P N ULD) separately. For the former, we use the short-time regularization result of [GW12]: Lemma 9. If Tcorr 1/ TV(p P N ULD, q) q KL(p P N ULD q) W2(p, q) L1/4T 3/2 corr . Proof. This is a consequence of [GW12, Corollary 4.7 (1)]. The condition to check therein is their Eq. (3.6), which in our setting is satisfied by the constants K1 = L and K2 = γ. The Corollary then states that for the cost function c Tcorr((z, v), (z , v )) := inf t (0,Tcorr] t 2γ t2 + L + 3γ 27 + γ v v o2 , we have KL(p P N ULD q) Wc Tcorr (p γd, q γd). For v = v and Tcorr 1/ L, note that c Tcorr((z, v), (z , v)) 1 L1/2T 3corr z z 2 , so the claim follows by Pinsker s inequality. Next, we define the following processes: dz t = v t dt, dzt = vt dt, dv t = γv t dt U(z t ) dt + p 2γ d Bt , (z 0, v 0) q , dvt = γvt dt U(zt) dt + p 2γ d Bt , (z0, v0) p . It follows that for any integer k 0, (z kh, v kh) q P k ULD = q , (zkh, vkh) p P k ULD . We couple these processes by using the same Brownian motion and coupling q γd and p γd optimally (in particular, v0 = v 0). Before we proceed to bound TV(p P N ULD, p b P N ULMC), we start with the following lemma. Lemma 10. If Tcorr 1/ L, then for all 0 t Tcorr, E[ zt z t 2] W 2 2 (p, q) . Proof. We have U(zt) U(z t ) = Z 1 0 2U(zt + u (z t zt)) du (zt z t ) := Ht(zt z t ) , and the operator Ht satisfies Ht op L . Let α := 2/γ. For the vectors δt := (zt + αvt) (z t + αv t ) and ηt := zt z t , we have 1 2 d( δt 2 + ηt 2) = (δt, ηt) (γ 1 α) Id 1 2 (αHt γ Id) 1 2 (αHt γ Id) 1 α Id L ( δt 2 + ηt 2) . By Grönwall s inequality, δt 2 + ηt 2 e O( Lt) ( δ0 2 + η0 2) , so if we couple the two processes by coupling z0 and z 0 optimally and taking v0 = v 0, we obtain E[ zt z t 2] E[ δt 2+ ηt 2] e O( Lt) E[ δ0 2+ η0 2] e O( Lt) W 2 2 (p, q) W 2 2 (p, q) , recalling that t Tcorr 1/ L by hypothesis. It remains to bound TV(p b P N ULMC, p P N ULD). Lemma 11. If Tcorr 1/ TV(p b P N ULMC, p P N ULD) q KL(p P N ULD p b P N ULMC) L3/4T 1/2 corr W2(p, q) + L 1/4T 1/2 corr εsc + L3/4T 1/2 corr d1/2h . Proof. As (z, v) and (z , v ) are driven by the same Brownian motion, by Girsanov s theorem9 and the data processing inequality we have KL(p P N ULD p b P N ULMC) 1 kh E[ s(zkh) U(zu) 2] du . We can decompose the integrand as follows: E[ s(zkh) U(zu) 2] E s(zkh) s(z kh) 2 + s(z kh) U(z kh) 2 + U(z kh) U(z u) 2 + U(z u) U(zu) 2 L2 E[ zkh z kh 2] + ε2 sc + L2 E[ z kh z u 2] + L2 E[ z u zu 2] L2 W 2 2 (p, q) + ε2 sc + L2 E[ z kh z u 2] , (16) where we applied Lemma 10. It remains to bound E[ z kh z u 2]. Note that E[ z kh z u 2] = E h Z u kh v s ds 2i h Z u kh E[ v s 2] ds dh2 , where in the last step we used the fact that v s γd. Substituting this into (16), we conclude that KL(p P N ULD p b P N ULMC) L3/2Tcorr W 2 2 (p, q) + L 1/2Tcorrε2 sc + L3/2dh2Tcorr . The claimed bound on TV(p P N ULD, p b P N ULMC) follows by Pinsker s inequality. Proof of Theorem 5. This is immediate from Lemma 9 and Lemma 11, recalling that Tcorr 1/ L so that the bound in Lemma 9 dominates the W2(p, q) term in Lemma 11. Remark 3. In all other sections of this paper, we abuse notation as follows. Given a distribution p on Rd, we write p b PULMC to denote the projection onto the z-coordinate of p b PULMC, i.e., we view b PULMC as a Markov kernel on Rd rather than on Rd Rd (and similarly for PULD). E End-to-end analysis Lemma 12 (TV error after one round of predictor and corrector). Choose predictor step sizes h1, . . . , h Npred as in Lemma 5 with Tpred = h1 + + h Npred 1/L. That is, if T t0 Tpred 1/L, then we ensure that hn+1 T t0 h1 hn 2 for all n, and if T t0 1/L, then we can take h1 = = h N. Let hpred := max1 n Npred hn and abbreviate P (Npred) ODE := P t0,h1,...,h Npred ODE (and similarly for b PODE). 1. Consider running the overdamped Langevin corrector for time Tcorr 1/L, step size hcorr, and stationary distribution qt0P (Npred) ODE = qt0+Tpred; set Ncorr = Tcorr/hcorr. Then, TV(p b P (Npred) ODE b P Ncorr LMC , qt0+Tpred) TV(p, qt0) + O L d hpred + p Ldhcorr + εsc 2. Consider running the underdamped Langevin corrector for time Tcorr 1/ L, step size hcorr, and stationary distribution qt0P (Npred) ODE = qt0+Tpred; set Ncorr = Tcorr/hcorr. Then, TV(p b P (Npred) ODE b P Ncorr ULMC, qt0+Tpred) TV(p, qt0) + O L Ld hcorr + εsc 9Again, we can avoid checking Novikov s condition using the approximation argument of [Che+23a]. Proof. By the triangle inequality and the data-processing inequality, TV(p b P (Npred) ODE b P Ncorr LMC , qt0+Tpred) TV(p b P (Npred) ODE b P Ncorr LMC , qt0 b P (Npred) ODE b P Ncorr LMC ) + TV(qt0 b P (Npred) ODE b P Ncorr LMC , qt0+Tpred) TV(p, qt0) + TV(qt0 b P (Npred) ODE b P Ncorr LMC , qt0+Tpred) . For overdamped Langevin, applying Theorem 4, TV(qt0 b P (Npred) ODE b P Ncorr LMC , qt0+Tpred) L W2(q b P (Npred) ODE , qt0+Tpred) + εsc/ Ldhcorr . (17) For the Wasserstein term, Lemma 5 yields W2(qt0 b P (Npred) ODE , qt0+Tpred) = W2(qt0 b P (Npred) ODE , qt0P (Npred) ODE ) Ld hpred + εsc Combining these bounds yields the result for the overdamped corrector. For the underdamped corrector, we modify (17) by replacing the use of Theorem 4 with Theorem 5. We also need the following lemma on the convergence of the OU process. Lemma 13. Let (q t )t 0 denote the marginal law of the OU process started at q 0 = q . Then, for all T 1, it holds that TV(q T , γd) ( d + m2) exp( T) . Proof. This follows from [CLL23, Lemma C.4]. Alternatively, using the short-time regularization result of [BGL01, Lemma 4.2] for time t0 1 and the Wasserstein contraction of the OU process, TV(q T , γd) q KL(q T γd) W2(q T t0, γd) t0 exp( (T t0)) W2(q , γd) . The result follows from W2(q , γd) W2(q , δ0) + W2(δ0, γd) m2 + We now prove our main theorems. Proof of Theorems 2 and 3. For t [0, T], let pt := law(bxt). From Lemma 13, TV(p0, q0) = TV(q T , γd) ( d + m2) exp( T) . We divide our analysis according to the two stages of the algorithm. In the first stage, after iterating Lemma 12 for N0 LT steps, TV(p T hpred, q T hpred) TV(p0, q0) + O L Ld hp corr + εsc d + m2) exp( T) + L2Td1/2hpred + L3/2Td1/2hp corr + L1/2Tεsc where p = 1 2 if we use the overdamped corrector and p = 1 if we use the underdamped corrector. Applying the second part of Lemma 12 for the second stage of the algorithm, we then conclude that TV(p T δ, q T δ) ( d + m2) exp( T) + L2Td1/2hpred + L3/2Td1/2hp corr + L1/2Tεsc . Finally, we note that if we take δ ε2 L2 (d m2 2), then by [LLT23, Lemma 6.4], TV(q T δ, q T ) ε; a triangle inequality thus finishes the proof. Remark 4. Alternatively, instead of taking geometrically decreasing step sizes and employing early stopping, we could split the algorithm into two stages: for time t < T hpred, we take constant step size hpred, and for time t > T hpred, we use a smaller constant step size h as required if working with the original score perturbation lemma (see Remark 2). F Numerical experiments In this section, we provide preliminary numerical experiments to illustrate our theory. We implement DPUM on a toy model that is not log-concave (mixture of Gaussians). Setup. The target distribution is a mixture of five Gaussians in dimension 5. On Figures 1 and 2, the red stars represent the means of the Gaussians and the red ellipses around the stars represent the variances of the Gaussians. We start by sampling 500 independent points (in blue) from a standard Gaussian. Then, we run DPUM from the blue dots over 300 iterations and plot the two first coordinates of the dots at iterations 0, 100, 200 and 300. This is a low-dimensional toy example so it does not illustrate our theory, rather we include it as a simple sanity check. Parameters. We use a closed form formula for the score along the forward process. In other words, the score estimation error is equal to zero. The step size of the predictor is 0.01 and the step size of the corrector is 0.001. The corrector consists in 3 steps of the underdamped Langevin algorithm. In the latter algorithm, we initialize the velocity as a centered Gaussian random variable with standard deviation 0.001 and set the parameter γ to 0.01. Observations. We observe the expected behavior: although the target distribution is highly non-logconcave, DPUM is able to provide samples from a distribution that is close to the target distribution. In particular, the initial Gaussian distribution splits in clusters that will fit each component of the target mixture of Gaussians. Recall that we experiment without score error but with discretization error: our numerical results illustrate the common wisdom that score knowledge along the forward process can replace convexity assumptions. In particular, we observe that even isolated, low probability components of the Gaussian mixture, are recovered by DPUM. Iteration 0 Iteration 100 Iteration 200 Iteration 300 Figure 1: A realization of DPUM for a mixture of Gaussians. 3 2 1 0 1 2 3 4 5 Iteration 0 3 2 1 0 1 2 3 4 5 Iteration 100 3 2 1 0 1 2 3 4 5 Iteration 200 3 2 1 0 1 2 3 4 5 Iteration 300 Figure 2: A realization of DPUM for another mixture of Gaussians.