# Understanding Diffusion Models by Feynman's Path Integral

Yuji Hirono¹, Akinori Tanaka²³⁴, Kenji Fukushima⁵

¹Department of Physics, Kyoto University, Kyoto 606-8502, Japan. ²RIKEN AIP, RIKEN, Nihonbashi 103-0027, Japan. ³RIKEN iTHEMS, RIKEN, Wako 351-0198, Japan. ⁴Department of Mathematics, Keio University, Hiyoshi 223-8522, Japan. ⁵Department of Physics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan. Correspondence to: Yuji Hirono, Akinori Tanaka, Kenji Fukushima.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODE) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, a formulation originally developed for quantum physics. We find that this formulation provides a comprehensive description of score-based generative models, and we demonstrate the derivation of backward stochastic differential equations and loss functions. The formulation accommodates an interpolating parameter connecting the stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, to evaluate the negative log-likelihood and thereby assess the performance disparity between stochastic and deterministic sampling schemes.

1. Introduction

Diffusion models have demonstrated impressive performance on image generation tasks (Dhariwal & Nichol, 2021) and have earned widespread adoption across various applications (Yang et al., 2023). While the predominant contemporary application of diffusion models lies in conditional sampling driven by natural language (Rombach et al., 2022), the mathematical framework underlying their training, i.e., the score-based scheme (Hyvärinen et al., 2009; Vincent, 2011; Song & Ermon, 2019) or the denoising scheme (Sohl-Dickstein et al., 2015; Ho et al., 2020), is not inherently dependent on prior conditions. Kingma et al. (2021) showed the equivalence of these two schemes from a variational perspective, and Song et al. (2021b) developed a unified description based on stochastic differential equations (SDEs).¹ In both cases, the sampling process is given by a Markovian stochastic process. Another type of probabilistic model employs deterministic sampling schemes using ordinary differential equations (ODEs), such as the probability flow ODE (Song et al., 2021b), which can be understood as a continuous normalizing flow (CNF) (Chen et al., 2018; Lipman et al., 2023). A notable advantage of deterministic sampling schemes is that there is a bijective map between the latent space and the data space. This bijection not only facilitates intricate tasks like the manipulation of latent representations for image editing (Su et al., 2023) but also enables the direct computation of negative log-likelihoods (NLLs). For stochastic sampling processes, in contrast to deterministic ones, a direct evaluation method for the NLL remains elusive, though a theoretical bound exists for the NLL (Kong et al., 2023).
In general, stochastic sampling schemes require a larger number of function evaluations (NFE) than deterministic schemes and are therefore inferior in terms of sample generation speed. However, the beneficial impact of stochastic generation on certain metrics, such as the Fréchet Inception Distance (FID) (Heusel et al., 2017), is a well-known property in practice. For example, Karras et al. (2022) report improvements in these metrics with the incorporation of stochastic processes in both the Variance-Exploding and Variance-Preserving pretrained models of Song et al. (2021b). Intuitively, one can conceptualize noise within stochastic generation as a perturbation that propels particles out of local minima, potentially enhancing the diversity and quality of the generated samples. Beyond this intuitive level, however, a thoroughly quantitative analysis or rigorous theoretical framework to explain this phenomenon is missing.

¹ See (Oksendal, 2013) for the basics of SDEs and (Yang et al., 2023) for diffusion models if readers are unfamiliar with them.

Figure 1. Schematic of the correspondence between the diffusion model and the path integral formulation. The initial 10 points are randomly sampled from π(x_T) and the x_t evolve with Eq. (9), in which the score function is trained with a 2-Gaussian distribution. Blue zigzag lines: "quantum" trajectories with h = 1. Red smooth lines: classical trajectories with h = 0.

We tackle these two issues by making use of the path integral formalism,² a framework originally developed in quantum physics by Feynman (Feynman, 1948; Feynman & Hibbs, 1965). In classical physics, a particle's motion draws a trajectory, i.e., a deterministic path in space and time. To accommodate quantum effects, the path integral formalism generalizes the trajectory to include quantum fluctuations by comprising all possible paths, $\{x_t\}_{t\in[0,T]}$, of a particle between two points, $x_0$ at time $t=0$ and $x_T$ at $t=T$; see the magnified panel in Fig. 1 for a counterpart in diffusion models. The quantum expectation value of an observable $O(x_t)$ is computed as a weighted sum, $\sum_{\text{paths}} O(x_t)\, e^{i\mathcal{A}[x_t]/\hbar}$, where ℏ is Planck's constant. In addition to quantum fluctuations, the path integral can be extended to incorporate stochastic fluctuations as well (Onsager & Machlup, 1953). We demonstrate that diffusion models can be formulated in terms of path integrals, which not only deepens our understanding of diffusion model formulations but also allows for the application of various techniques developed in quantum physics. Importantly, this framework provides an innovative method for calculating the NLL in stochastic generation processes of diffusion models. Our contributions are as follows.

² For a brief introduction to the concept of the path integral and the physical meaning of quantum fluctuations, see Appendix A.

- We reformulate diffusion models using path integral techniques. We exemplify applications including a rederivation of the time-reversed SDE (Anderson, 1982) and an estimate of loss functions (Song et al., 2021a) based on Girsanov's theorem (Oksendal, 2013).

- Following (Zhang & Chen, 2023), we introduce an interpolating parameter h connecting the stochastic generation (h = 1) and the probability flow ODE (h = 0). In the path integral language, the limit h → 0 corresponds to the classical limit, under which quantum fluctuations are dropped.
- The path integral formulation of diffusion models reveals the role of h as that of Planck's constant ℏ in quantum physics. We apply the Wentzel-Kramers-Brillouin (WKB) expansion (Messiah, 1970), which is formulated in terms of ℏ in quantum physics, with respect to h to the likelihood calculation.

- Based on the first-order NLL expression, we quantify the merit of noise in the sampling process by computing the NLL as well as the 2-Wasserstein distance.

Building upon the analogy with quantum physics, these contributions unveil a far more profound connection to physics beyond the classic viewpoint of Brownian motion.

2. Related Works

Diffusion Models. Key contributions in the field of diffusion models include (Sohl-Dickstein et al., 2015) and (Song et al., 2021b), which laid the foundational principles for these models. Our reformulation of diffusion models by path integrals captures the basic mathematical characteristics of score-based diffusion models. The idea of implementing stochastic variables to represent quantum fluctuations traces back to (Nelson, 1966), and its mathematical foundation has been established as stochastic quantization (Damgaard & Hüffel, 1987). Recent works (Wang et al., 2023; Premkumar, 2023) suggested a similarity between diffusion models and quantum physics. However, the path integral derivation of the basic aspects of diffusion models has not been discussed, and there is no preceding work exploring the WKB expansion applied to diffusion models.

Likelihood Calculation in Diffusion Models. In deterministic sampling schemes based on ODEs, one can directly calculate the log-likelihood based on the change-of-variables formula (Chen et al., 2018; Song et al., 2021b; Lipman et al., 2023). Once we turn on noise in the sampling process, only a formal expression is available from the celebrated Feynman-Kac formula (Huang et al., 2021b). To the best of our knowledge, there is no stochastic case in which the values of the log-likelihood have ever been calculated. Our approach of a perturbative expansion with respect to h can access the log-likelihood explicitly even in the presence of noise. Note that an exact formula for the log-likelihood has been derived based on information theory, connecting it to Minimum Mean Square Error (MMSE) regression (Kong et al., 2023). However, this method still necessitates the computation of expectation values and cannot be implemented using a single trajectory.

Stochastic and Deterministic Sampling Procedures. The role of stochasticity in the sampling process has been investigated in (Karras et al., 2022) through empirical studies. The gap in the score-matching objective depending on the sampling scheme has been pointed out in (Huang et al., 2021a). To bridge the gap between the model distribution generated by the probability flow and the actual data distribution, (Lu et al., 2022) introduced a higher-order score-matching objective. Besides these efforts, (Lai et al., 2023) derived an equation to be satisfied by a score function and introduced a regulator in the loss function to enforce this relation. The present approach is complementary to these works; we employ the perturbative h-expansion and directly evaluate the noise-strength dependence of the NLL for pretrained models.

3. Reformulation of Diffusion Models by the Path Integral Formalism

In this section, we describe the reformulation of score-based diffusion models in terms of path integrals.
3.1. Forward and Reverse Processes

For a given datapoint $x_0$ sampled from an underlying data distribution $p_0$, the first step of a score-based diffusion model is to gradually modify the data by adding noise via a forward SDE,

$$ dx_t = f(x_t, t)\, dt + g(t)\, dw_t. \tag{1} $$

One can view this diffusion process as a collection of stochastic trajectories and consider the path-probability $P(\{x_t\}_{t\in[0,T]})$ associated with Eq. (1). Intuitively, it corresponds to the joint probability of the path $\{x_t\}_{t\in[0,T]}$ (see Fig. 1 for a schematic illustration of paths). In the path integral formulation of quantum mechanics, the expectation value of observables is expressed as a summation over all possible paths weighted by an exponential factor involving a quantity called the action. In the following proposition, $[\mathcal{D}x_t]$ is defined later in Eq. (3). We observe that the process (1) of diffusion models can be represented as a path integral:

Proposition 3.1. The path-probability $P(\{x_t\}_{t\in[0,T]})$ can be represented in the following path integral form:

$$ P(\{x_t\}_{t\in[0,T]}) = p_0(x_0)\, e^{-\mathcal{A}}\, [\mathcal{D}x_t], \tag{2} $$

with $\mathcal{A} := \int_0^T L(\dot{x}_t, x_t)\, dt + J$, where $L(\dot{x}_t, x_t)$ is the Onsager-Machlup function (Onsager & Machlup, 1953), defined by

$$ L(\dot{x}_t, x_t) := \frac{\| \dot{x}_t - f(x_t, t) \|^2}{2 g(t)^2}, $$

and J is the Jacobian associated with the chosen discretization scheme of the stochastic process. Here, the overdot indicates the time derivative.

In the physics literature, $L(\dot{x}_t, x_t)$ is called the Lagrangian and $\mathcal{A}$ is the action. For a detailed derivation of Eq. (2) and the explicit expression for J, see Appendix B.2. Using the path probability (2), the expectation value of any observable $O(x_t)$ depending on $x_t$ obeying Eq. (1) is represented as $\mathbb{E}[O(x_t)] = \int O(x_t)\, P(\{x_t\}_{t\in[0,T]})$. This expression is commonly referred to as a path integral, as it involves a summation over infinitely many paths.

Here, we present an intuitive explanation of the expression (2). Let us start from a discretized version of the SDE (1) in the Euler-Maruyama scheme:

$$ x_{t+\Delta t} = x_t + f(x_t, t)\, \Delta t + g(t) \sqrt{\Delta t}\, v_t, \qquad v_t \sim \mathcal{N}(0, I), $$
$$ p(x_{t+\Delta t} \,|\, x_t) = \mathcal{N}\big( x_{t+\Delta t}\, \big|\, x_t + f(x_t, t)\,\Delta t,\; g(t)^2\, \Delta t\, I \big), $$

where Δt is the time interval and I is the identity matrix. Now the time evolution of the SDE is described by a conditional Gaussian distribution. In the following, we write $\dot{x}_t := (x_{t+\Delta t} - x_t)/\Delta t$ and rewrite the conditional Gaussian as

$$ p(x_{t+\Delta t} \,|\, x_t) \propto e^{-\frac{\| \dot{x}_t - f(x_t, t) \|^2}{2 g(t)^2}\, \Delta t} = e^{-L(\dot{x}_t, x_t)\, \Delta t}. $$

Now let us consider the probability of realizing an explicit path $\{x_t\} := [x_0, x_{\Delta t}, x_{2\Delta t}, \dots, x_T]$; regarding the summation as a Riemann sum, we get the following path integral expression for the path probability:

$$ p_0(x_0)\, p(x_{\Delta t} | x_0) \cdots p(x_T | x_{T - \Delta t}) \prod_t dx_t =: p_0(x_0)\, e^{-\sum_{n=0}^{T/\Delta t - 1} L(\dot{x}_{n\Delta t}, x_{n\Delta t})\, \Delta t}\, [\mathcal{D}x_t]_{\Delta t}, \tag{3} $$

where $[\mathcal{D}x_t]_{\Delta t}$ contains a normalization constant depending on the discretization step Δt. We take the Δt → 0 limit at the end of the calculation and omit Δt in the subscript in later discussions. In the expression (3), we need to include an additional contribution, denoted by J in Proposition 3.1, depending on the choice of the discretization scheme (see Appendix B.2 for details).
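To make the discretized picture above concrete, the following is a minimal NumPy sketch that simulates the forward SDE (1) with the Euler-Maruyama scheme and accumulates the discretized Onsager-Machlup action along each trajectory. The drift $f(x,t) = -\beta x/2$ and $g(t) = \sqrt{\beta}$ are illustrative choices (the constant-β DDPM setting used later in Sec. 5.1), not the only possible ones.

```python
import numpy as np

# Euler-Maruyama simulation of the forward SDE (1), dx_t = f dt + g dw_t,
# for an illustrative VP-type choice f(x,t) = -beta*x/2, g(t) = sqrt(beta).
beta = 20.0
f = lambda x, t: -0.5 * beta * x
g = lambda t: np.sqrt(beta)

def forward_paths(x0, T=1.0, n_steps=1000, rng=np.random.default_rng(0)):
    """Simulate forward trajectories and accumulate the discretized
    Onsager-Machlup action  sum_n ||dx/dt - f||^2 / (2 g^2) * dt."""
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    action = np.zeros(x.shape[0])
    path = [x.copy()]
    for n in range(n_steps):
        t = n * dt
        noise = rng.standard_normal(x.shape)
        dx = f(x, t) * dt + g(t) * np.sqrt(dt) * noise
        # discretized Lagrangian L(x_dot, x) evaluated on this increment
        action += np.sum((dx / dt - f(x, t))**2, axis=-1) / (2 * g(t)**2) * dt
        x = x + dx
        path.append(x.copy())
    return np.stack(path), action

paths, actions = forward_paths(x0=np.random.randn(10, 1) * 0.1)
print(paths.shape, actions[:3])  # (1001, 10, 1) and the per-path action values
```

The per-path action values are exactly the exponents appearing in the discretized weight of Eq. (3), up to the Δt-dependent normalization absorbed into $[\mathcal{D}x_t]_{\Delta t}$.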
The sampling process of score-based diffusion models is realized by the time-reversed version of the forward SDE (1). The path integral reformulation is beneficial in furnishing an alternative derivation of the reverse-time SDE:

Proposition 3.2. Let $P(\{x_t\}_{t\in[0,T]})$ be the path-probability, and let $p_t(x)$ be the distribution at time t determined by the Fokker-Planck equation corresponding to Eq. (1). The path-probability is written using the reverse-time action $\tilde{\mathcal{A}}$ as $P(\{x_t\}_{t\in[0,T]}) = e^{-\tilde{\mathcal{A}}}\, p_T(x_T)\, [\mathcal{D}x_t]$, with $\tilde{\mathcal{A}} := \int_0^T \tilde{L}(\dot{x}_t, x_t)\, dt + \tilde{J}$, where

$$ \tilde{L}(\dot{x}_t, x_t) := \frac{\| \dot{x}_t - f(x_t, t) + g(t)^2 \nabla \log p_t(x_t) \|^2}{2 g(t)^2}, $$

and $\tilde{J}$ is the Jacobian of the reverse process, depending on the discretization scheme.

We provide the proof in Appendix B.3. We emphasize that the path integral derivation does not rely on the reverse-time SDE (Anderson, 1982). In fact, $\tilde{L}$ involves the score function $\nabla \log p_t(x_t)$, so that Proposition 3.2 gives us another derivation of the reverse-time SDE,

$$ dx_t = \big[ f(x_t, t) - g(t)^2 \nabla \log p_t(x_t) \big]\, dt + g(t)\, d\bar{w}_t, \tag{4} $$

by inverting the discussion leading from Eq. (1) to Proposition 3.1.

3.2. Models and Objectives

In score-based models (Song & Ermon, 2019; Song et al., 2021a;b), the score function $\nabla \log p_t(x_t)$ is approximated by a neural network $s_\theta(x_t, t)$, and sampling is performed based on

$$ dx_t = \big[ f(x_t, t) - g(t)^2 s_\theta(x_t, t) \big]\, dt + g(t)\, d\bar{w}_t, \tag{5} $$

which is a surrogate for the reverse-time SDE (4). By repeating the same argument as in Proposition 3.2, we can conclude that the path-probability $Q_\theta(\{x_t\})$ corresponding to Eq. (5) takes the following path integral representation:

$$ Q_\theta(\{x_t\}) = e^{-\tilde{\mathcal{A}}_\theta}\, \pi(x_T)\, [\mathcal{D}x_t], \tag{6} $$

where $\pi(\cdot)$ is a prior distribution, typically chosen to be the standard normal distribution. Here, $\tilde{\mathcal{A}}_\theta := \int_0^T \tilde{L}_\theta(\dot{x}_t, x_t)\, dt + \tilde{J}_\theta$ with $\tilde{L}_\theta$ the Onsager-Machlup function for the SDE (5), defined by

$$ \tilde{L}_\theta(\dot{x}_t, x_t) := \frac{\| \dot{x}_t - f(x_t, t) + g(t)^2 s_\theta(x_t, t) \|^2}{2 g(t)^2}, \tag{7} $$

and $\tilde{J}_\theta$ is the Jacobian contribution. Figure 1 depicts a schematic picture of these reverse-time processes. Now, we can follow the training scheme based on a bound of the log-likelihood, or equivalently of the KL divergence via the data-processing inequality (Song et al., 2021a):

$$ D_{\mathrm{KL}}(p_0 \,\|\, q_0^\theta) \le D_{\mathrm{KL}}(P \,\|\, Q_\theta). \tag{8} $$

The r.h.s. of Eq. (8) can be calculated by using Girsanov's theorem (Oksendal, 2013). An equivalent computation can also be performed in the path integral formulation:

Proposition 3.3. The KL divergence of the path-probabilities can be represented in the path integral form

$$ D_{\mathrm{KL}}(P \,\|\, Q_\theta) = \int e^{-\tilde{\mathcal{A}}}\, p_T(x_T) \log \frac{e^{-\tilde{\mathcal{A}}}\, p_T(x_T)}{e^{-\tilde{\mathcal{A}}_\theta}\, \pi(x_T)}\, [\mathcal{D}x_t], $$

and it can be computed as

$$ D_{\mathrm{KL}}(P \,\|\, Q_\theta) = D_{\mathrm{KL}}(p_T \,\|\, \pi) + \int_0^T \frac{g(t)^2}{2}\, \mathbb{E}_{p_t}\big[ \| \nabla \log p_t(x_t) - s_\theta(x_t, t) \|^2 \big]\, dt. $$

Indeed, Proposition 3.3 yields the same result as the calculation based on Girsanov's theorem. We give the derivation of Proposition 3.3 in Appendix B.4. The discussion so far can be straightforwardly extended to the case of a fixed initial condition: one basically makes the replacement $p_t(x_t) \to p_t(x_t | x_0)$; see Appendix B.5 for details.

4. Log-Likelihood by WKB Expansion

We have so far discussed the reformulation of score-based diffusion models using the path integral formalism. This reformulation allows the techniques developed in quantum physics to be used for analyzing the properties of diffusion models. As an illustrative example, we present the calculation of the log-likelihood in the presence of noise in the sampling process for pretrained models.

4.1. Interpolating SDE and Probability Flow ODE

Following (Zhang & Chen, 2023), we consider a family of generating processes defined by the following SDE parametrized by $h \in \mathbb{R}_{\ge 0}$:

$$ dx_t = \Big[ f(x_t, t) - \frac{1+h}{2}\, g(t)^2 s_\theta(x_t, t) \Big]\, dt + \sqrt{h}\, g(t)\, d\bar{w}_t. \tag{9} $$

If we take h = 0, the noise term vanishes and the process reduces to the probability flow ODE (Song et al., 2021b). The case h = 1 corresponds to the original SDE (5). The path-probability corresponding to the process (9) can be expressed as

$$ Q_\theta^h(\{x_t\}_{t\in[0,T]}) = e^{-\tilde{\mathcal{A}}_\theta^h}\, \pi(x_T)\, [\mathcal{D}x_t], \tag{10} $$

with $\tilde{\mathcal{A}}_\theta^h := \int_0^T \tilde{L}_\theta^h(\dot{x}_t, x_t)\, dt + \tilde{J}_\theta^h$, where

$$ \tilde{L}_\theta^h(\dot{x}_t, x_t) := \frac{\big\| \dot{x}_t - f(x_t, t) + \frac{1+h}{2} g(t)^2 s_\theta(x_t, t) \big\|^2}{2 h\, g(t)^2}, \tag{11} $$

and $\tilde{J}_\theta^h$ is the Jacobian contribution.
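As a concrete illustration of the family (9), the sketch below implements a reverse-time Euler-Maruyama sampler with the interpolation parameter h: setting h = 0 yields a probability-flow-ODE integration, while h = 1 recovers the stochastic sampler (5). The score function used here is a stand-in (the exact Gaussian score of a toy forward process with our own choice of β), not a trained network.

```python
import numpy as np

# Reverse-time sampler for the interpolated process (9):
# dx = [f(x,t) - (1+h)/2 * g(t)^2 * s_theta(x,t)] dt + sqrt(h) g(t) dw_bar.
beta = 20.0
f = lambda x, t: -0.5 * beta * x
g = lambda t: np.sqrt(beta)
v = lambda t: 1.0 - np.exp(-beta * t)          # variance of p_t for data at x_0 = 0
score = lambda x, t: -x / v(t)                  # assumed "trained" score s_theta

def sample(h, n=1000, T=1.0, t_min=1e-2, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    dt = (T - t_min) / n_steps
    x = rng.standard_normal((n, 1))             # x_T ~ prior pi = N(0, I)
    for k in range(n_steps):
        t = T - k * dt                           # integrate backward in time
        drift = f(x, t) - 0.5 * (1.0 + h) * g(t)**2 * score(x, t)
        noise = np.sqrt(h) * g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x - drift * dt + noise               # reverse-time Euler-Maruyama step
    return x

x_ode = sample(h=0.0)   # deterministic: probability flow ODE
x_sde = sample(h=1.0)   # stochastic: original reverse SDE (5)
print(x_ode.std(), x_sde.std())
```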
Figure 2. Gray dots: training data sampled from the Swiss-roll distribution. Colored lines: generative trajectories based on Eq. (9) starting from an identical initial vector x_T shown near the center of the figure. The color represents the noise level h; the deepest red corresponds to the ODE path in the h → 0 limit, and the trajectories become more stochastic with stronger blue.

The way h enters Eq. (11) is quite suggestive: the action $\tilde{\mathcal{A}}_\theta^h$ is inversely proportional to h, similarly to quantum mechanics, where the path integral weight takes the form $e^{i\mathcal{A}/\hbar}$. This structural similarity provides a physical interpretation of the limit h → 0: it realizes the classical limit in the path integral representation. Moreover, this analogy between h and ℏ naturally leads us to explore a perturbative expansion in terms of h, i.e., the WKB expansion, a well-established technique for treating asymptotic series expansions in mathematical physics.

In the classical limit h → 0, the dominant contribution comes from the path that minimizes the action, realizing the principle of least action in the physics context. Since the Lagrangian $\tilde{L}_\theta^h$ of this model is nonnegative, $e^{-\tilde{\mathcal{A}}_\theta^h}$ vanishes in the limit h → 0 unless the path satisfies $\tilde{L}_\theta^{h=0} = 0$, or equivalently $\dot{x}_t = f_\theta^{\mathrm{PF}}(x_t, t)$, where $f_\theta^{\mathrm{PF}}(x_t, t)$ is the drift term of the probability flow ODE defined by

$$ f_\theta^{\mathrm{PF}}(x_t, t) := f(x_t, t) - \frac{1}{2} g(t)^2 s_\theta(x_t, t). \tag{12} $$

Consequently, the path-probability reduces to a product of delta functions as h → 0:

$$ e^{-\tilde{\mathcal{A}}_\theta^h}\, \pi(x_T)\, [\mathcal{D}x_t] \;\longrightarrow\; \prod_{t=0}^{T} \delta\big( \dot{x}_t - f_\theta^{\mathrm{PF}}(x_t, t) \big)\, e^{-\tilde{J}_\theta^{h=0}}\, \pi(x_T)\, [\mathcal{D}x_t]. $$

This limiting behavior is visualized in Fig. 2: we plot trajectories generated by the SDE (9) from an identical initial x_T for various values of h, using a model pretrained on the Swiss-roll distribution. The trajectories concentrate near the ODE path when the noise level is low (h → 0), which means that the path-probability for h → 0 reduces to the classical path represented by Dirac's delta function. In this way, the realization of the probability flow ODE at h → 0 can be regarded as reminiscent of the reduction from quantum mechanics to classical mechanics at ℏ → 0. The one-dimensional trajectories for h = 0 (classical paths) and h = 1 ("quantum" paths) are also visualized in Fig. 1.

For h = 0, it is well known that the log-likelihood can be written exactly. Employing the Itô scheme, we can recover the log-likelihood from the path integral with the initial condition fixed at $x_0$:

$$ \log q_\theta^{h=0}(x_0) = \log \Big[ \int \prod_{t=0}^{T} \delta\big( \dot{x}_t - f_\theta^{\mathrm{PF}}(x_t, t) \big)\, e^{-\tilde{J}_\theta^{h=0}}\, \pi(x_T)\, [\mathcal{D}x_t] \Big] = \log \pi(x_T) + \int_0^T \nabla \cdot f_\theta^{\mathrm{PF}}(x_t, t)\, dt, \tag{13} $$

where $x_t$ in the last expression is the solution of $\dot{x}_t = f_\theta^{\mathrm{PF}}(x_t, t)$ with initial condition $x_{t=0} = x_0$, and we have used the explicit expression for $\tilde{J}_\theta^{h=0}$ (see Appendix B). Equation (13) is nothing but the instantaneous change-of-variables formula of an ODE flow (Chen et al., 2018). The computations presented here can be equivalently performed in different discretization schemes, and it should be noted that the final results are free from such scheme dependences.
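The change-of-variables formula (13) can be evaluated numerically by augmenting the probability flow ODE with the divergence of its drift. The sketch below does this for a hypothetical one-dimensional Gaussian toy model in which the "trained" score, and hence the divergence, are available in closed form; for a real model one would replace them with the network $s_\theta$ and a divergence estimator.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# Evaluation of Eq. (13): log q_0^{h=0}(x_0) = log pi(x_T) + int div f_PF dt,
# for a 1-d toy with an assumed Gaussian score s(x,t) = -x / v_t, v_t = 1 - exp(-beta t).
beta = 20.0
v = lambda t: 1.0 - np.exp(-beta * t)

def f_pf(x, t):
    # probability-flow drift f_PF = f - g^2/2 * s_theta, with f = -beta x/2, g^2 = beta
    return -0.5 * beta * x - 0.5 * beta * (-x / v(t))

def div_f_pf(t):
    # divergence of the linear drift above (exact for this 1-d toy)
    return -0.5 * beta + 0.5 * beta / v(t)

def log_q0(x0, t_min=1e-2, T=1.0):
    # augmented state = [x_t, accumulated divergence integral]
    rhs = lambda t, y: [f_pf(y[0], t), div_f_pf(t)]
    sol = solve_ivp(rhs, (t_min, T), [x0, 0.0], rtol=1e-8, atol=1e-8)
    x_T, div_int = sol.y[0, -1], sol.y[1, -1]
    return norm.logpdf(x_T, scale=1.0) + div_int   # log pi(x_T) + int div f_PF dt

print(log_q0(0.3))   # matches log N(0.3 | 0, v(t_min)) since the score is exact here
```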
4.2. h ≠ 0 in the Path Integral

When the score estimation is imperfect, the probability distribution $q_0^h(x_0)$ of a model acquires a nontrivial dependence on the parameter h. This implies that the quality of sampled images varies depending on the level of noisiness parametrized by h.

As we discussed earlier, in quantum physics ℏ → 0 corresponds to the classical limit, and the effect of small but nonzero ℏ can be taken into account as a series expansion with respect to ℏ, commonly referred to as the WKB expansion or the semi-classical approximation. In the path integral formulation of diffusion models, h plays the role of Planck's constant ℏ, which quantifies the degree of "quantumness." As a basis for the WKB expansion of the log-likelihood, we have found the following result:

Theorem 4.1. The log-likelihood for the process (9) satisfies

$$ \log q_0^h(x_0) = \log \pi(x_T) + \int_0^T dt\; \nabla \cdot \Big[ f_\theta^{\mathrm{PF}}(x_t, t) - \frac{h g(t)^2}{2} \big( s_\theta(x_t, t) - \nabla \log q_t^h(x_t) \big) \Big], \tag{14} $$

where $\pi(\cdot)$ is a prior and $x_t$ is the solution of the modified probability flow ODE,

$$ \dot{x}_t = f_\theta^{\mathrm{PF}}(x_t, t) - \frac{h g(t)^2}{2} \big[ s_\theta(x_t, t) - \nabla \log q_t^h(x_t) \big], \tag{15} $$

with initial condition $x_{t=0} = x_0$.

We provide the proof of this theorem in Appendix B.6. Note that Eq. (14) is valid for any value of $h \in \mathbb{R}_{\ge 0}$. The modified probability flow ODE (15) indicates that, when the score deviates from the ground-truth value, the deterministic trajectory is also modified for h ≠ 0. In Eq. (14), the density $q_t^h(x)$ appears on both sides, so this is a self-consistent equation.³ This relation provides the basis for a perturbative evaluation of the NLL as a power series in h.

Theorem 4.2. The perturbative expansion of the log-likelihood of the process (9) to first order in h reads

$$ \log q_0^h(x_0) = \log q_0^{h=0}(x_0) + h \Big[ \delta x_T \cdot \nabla \log \pi(x_T) + \int_0^T dt\; \nabla \cdot \delta f_\theta^{\mathrm{PF}}(x_t, \delta x_t, t) \Big] + O(h^2), \tag{16} $$

where $(x_t, \delta x_t)$ is the solution of the coupled probability flow ODEs,

$$ \dot{x}_t = f_\theta^{\mathrm{PF}}(x_t, t), \qquad \dot{\delta x}_t = \delta f_\theta^{\mathrm{PF}}(x_t, \delta x_t, t), \tag{17} $$

with initial conditions $x_{t=0} = x_0$ and $\delta x_{t=0} = 0$, and where

$$ \delta f_\theta^{\mathrm{PF}}(x_t, \delta x_t, t) := (\delta x_t \cdot \nabla)\, f_\theta^{\mathrm{PF}}(x_t, t) - \frac{g(t)^2}{2} \big[ s_\theta(x_t, t) - \nabla \log q_t^{h=0}(x_t) \big]. $$

We provide the proof in Appendix B.7. Note that we treat $x_t$ and $\delta x_t$ as independent variables, so the gradient does not act on $\delta x_t$. Because Eq. (16) is no longer a self-consistent equation, this theorem allows us to compute the O(h¹) correction to the log-likelihood based on $\log q_t^{h=0}(x_t)$, i.e., the log-likelihood defined by the usual probability flow ODE. A unique feature of the h-expansion in this model is that h appears in the denominator of Eq. (11) as well as in the numerator, in contrast to quantum physics, where ℏ appears only in the denominator as an overall factor. As a result, once we consider finite-h corrections, the classical path deviates from the classical path obtained in the h → 0 limit. This deviation caused by h ≠ 0 is quantified by $\delta x_t$, which represents how noise influences the bijective relationship between the data points and their latent counterparts.

³ Note that $q_t^h(x_t)$ is obtained by replacing the integration variable t with τ and all instances of 0 with t in Theorem 4.1.
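The following sketch integrates the coupled system (17) for a one-dimensional Gaussian toy model in the spirit of Sec. 5.1 (β and ϵ are our own illustrative choices) and returns the bracket in Eq. (16), i.e., the first-order coefficient $\partial_h \log q_0^h(x_0)|_{h=0}$. Because the drift is linear in this toy case, the density $q_t^{h=0}$ stays Gaussian, and its score can be tracked through a scalar variance ODE.

```python
import numpy as np

# First-order WKB correction (Theorem 4.2) for a 1-d linear toy model.
# Assumptions: f = -beta x/2, g^2 = beta, imperfect score s(x,t) = -(1+eps) x / v_t,
# so that f_PF(x,t) = a(t) x is linear and q_t^{h=0} = N(0, u_t).
beta, eps = 20.0, 0.1
v = lambda t: 1.0 - np.exp(-beta * t)                    # forward variance (data at 0)
s_theta = lambda x, t: -(1.0 + eps) * x / v(t)           # imperfect score model
a = lambda t: -0.5 * beta * (1.0 - (1.0 + eps) / v(t))   # f_PF(x,t) = a(t) x

def first_order_correction(x0, t_min=1e-2, T=1.0, n_steps=2000):
    dt = (T - t_min) / n_steps
    ts = t_min + dt * np.arange(n_steps + 1)
    # variance u_t of q_t^{h=0}: du/dt = 2 a(t) u, integrated backward from u_T = 1
    u = np.empty(n_steps + 1); u[-1] = 1.0
    for k in range(n_steps, 0, -1):
        u[k - 1] = u[k] - 2.0 * a(ts[k]) * u[k] * dt
    # forward integration of x_t, delta_x_t, and the divergence integral in Eq. (16)
    x, dx, corr = float(x0), 0.0, 0.0
    for k in range(n_steps):
        t, u_t = ts[k], u[k]
        grad_log_q0 = -x / u_t                           # score of q_t^{h=0} = N(0, u_t)
        delta_f = a(t) * dx - 0.5 * beta * (s_theta(x, t) - grad_log_q0)
        div_delta_f = -0.5 * beta * (-(1.0 + eps) / v(t) + 1.0 / u_t)
        corr += div_delta_f * dt
        x, dx = x + a(t) * x * dt, dx + delta_f * dt
    corr += dx * (-x)                                    # delta_x_T . grad log pi(x_T), pi = N(0,1)
    return corr                                          # = d/dh log q_0^h(x0) at h = 0

print(first_order_correction(0.5))
```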
Figure 3. (Top) Negative log-likelihood (NLL) of the 1-dim. Gaussian toy model. (Bottom) 2-Wasserstein metric (W2) between the data distribution and the distribution obtained by the same model, for ϵ = 0, 0.1, 0.3, 0.5. Both panels are plotted as functions of the parameter h.

5. Experiments

5.1. Analysis of 1-dim. Gaussian Data

Let us first illustrate the results in a simple example of a one-dimensional (1-dim.) Gaussian distribution, in which everything is analytically tractable. We take the data distribution to be a 1-dim. Gaussian distribution with zero mean and variance $v_0$. For the forward process, we employ $f = -\beta x/2$ and $g(t) = \sqrt{\beta}$, which corresponds to the denoising diffusion probabilistic model (DDPM) for general β (Ho et al., 2020). Here, we specifically take β to be constant for simplicity. To study the relation between imperfect score estimation and stochasticity in the sampling process, we parametrize the score model as $s(x, t) = -x(1 + \epsilon)/v_t$, where $\epsilon \in \mathbb{R}$ quantifies the deviation from the perfect score and $v_t$ is the variance of the distribution in the forward process. In this simple model, the distribution remains Gaussian in both the forward and backward processes. When ϵ = 0, the backward time evolution exactly matches the forward one, and $q_t(x) = \mathcal{N}(x | 0, v_t)$. When ϵ ≠ 0, the model distribution $q_t^h(x)$ depends nontrivially on h. We evaluate $q_t^h(x)$ and examine the effect of noisiness in the sampling process. In this model, we can compute $q_t^h(x)$ analytically and verify that Eq. (14) is satisfied, as we detail in Appendix D.

In Fig. 3 (Top), we plot the NLL, $-\mathbb{E}_{p_0}[\log q_0^h(x_0)]$, as a function of the parameter h. Different lines correspond to different values of ϵ. The NLL is computed analytically, and we give the details in Appendix D. For nonzero ϵ, the score estimation is imperfect and the model distribution $q_0^h(x_0)$ differs from the data distribution. In these situations with ϵ ≠ 0, the NLL acquires a nontrivial dependence on h, meaning that the quality of generated samples depends on the noise level in the sampling process. In Fig. 3 (Top), the NLLs are decreasing functions of h, and the presence of noise improves the output quality. Depending on the choice of ϵ, the NLL could also be an increasing function of h. For reference, in Fig. 3 (Bottom), we plot the 2-Wasserstein distance, W2, between the data distribution and $q_0^h(x_0)$ for different values of ϵ. The qualitative behavior is consistent with that of the NLL.

5.2. Experiment on 2-dim. Synthetic Data

Let us show some examples⁴ of log-likelihood computation for the noisy generating process with diffusion models pretrained on two-dimensional (2-dim.) synthetic distributions, Swiss-roll and 25-Gaussian (see Appendix E.1).

⁴ The code to reproduce the experiments can be found at https://github.com/AkinoriTanaka-phys/diffusion_path_integral.

Pretrained Models. We train a simple neural network,

(x, t) → concat([x, t]) → [Dense(128) → swish] × 3 → Dense(2) → sθ(x, t),

by choosing one of the following SDE schedulings with time interval t ∈ [Tmin, Tmax]:

- SIMPLE: $f(x, t) = -\frac{\beta}{2} t\, x$, $\; g(t) = \sqrt{\beta t}$, (β = 20);
- COSINE: $f(x, t) = -\frac{\pi}{2} \tan\!\big(\tfrac{\pi}{2} t\big)\, x$, $\; g(t) = \sqrt{\pi \tan\!\big(\tfrac{\pi}{2} t\big)}$,

where $T_{\min} \in \mathbb{R}$ is small but nonzero to avoid singular behavior around t = 0. We take Tmin = 0.01 and Tmax = 1 here. We train each model by using the following loss function:

$$ \sum_i \Big\| \frac{x_{i,t} - \alpha(t_i)\, x_i}{\sigma(t_i)^2} + s_\theta(x_{i,t}, t_i) \Big\|^2, \tag{18} $$

where the signal and noise functions α(t) and σ(t) are fixed by the chosen SDE scheduling, and $t_i$ represents discretized time. To compute the above, we perform Monte Carlo sampling with Nbatch batches of data according to $x_{i,t} \sim \mathcal{N}(\alpha(t_i) x_i, \sigma(t_i)^2 I)$. For more details about pretraining, including the explicit forms of α(t), σ(t), $t_i$, and Nbatch, see the explanations around Eq. (114) in Appendix E.2.
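For concreteness, here is a sketch of a Monte Carlo evaluation of the loss (18) under the SIMPLE scheduling. The explicit forms $\alpha(t) = e^{-\beta t^2/4}$ and $\sigma(t)^2 = 1 - \alpha(t)^2$ are our assumption for the variance-preserving relations (the paper's exact choices live in Appendix E.2, not reproduced here), and the "score" passed in is an arbitrary placeholder rather than the trained network.

```python
import numpy as np

# Monte Carlo evaluation of the denoising loss (18) for the SIMPLE scheduling.
# alpha(t) = exp(-beta t^2/4) and sigma(t)^2 = 1 - alpha(t)^2 are assumed VP relations.
beta = 20.0
alpha = lambda t: np.exp(-0.25 * beta * t**2)
sigma = lambda t: np.sqrt(1.0 - alpha(t)**2)

def loss_18(score_fn, x_data, t_grid, rng=np.random.default_rng(0)):
    """score_fn(x, t) -> array of the same shape as x; x_data has shape (N, 2)."""
    total = 0.0
    for t in t_grid:
        noise = rng.standard_normal(x_data.shape)
        x_t = alpha(t) * x_data + sigma(t) * noise          # x_{i,t} ~ N(alpha x_i, sigma^2 I)
        target = (x_t - alpha(t) * x_data) / sigma(t)**2    # = -grad log p(x_t | x_i)
        total += np.mean(np.sum((target + score_fn(x_t, t))**2, axis=-1))
    return total / len(t_grid)

# Usage with a hypothetical untrained "score" s(x,t) = -x on toy 2-d data:
x_data = np.random.default_rng(1).standard_normal((256, 2))
print(loss_18(lambda x, t: -x, x_data, t_grid=np.linspace(0.01, 1.0, 20)))
```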
NLL Calculation. Our basic strategy is based on Theorem 4.2, the perturbative expansion of the log-likelihood with respect to the parameter h. To calculate the O(h) correction, we need the values of $\log q_t^{h=0}(x)$ and its derivatives. We can obtain $\log q_t^{h=0}(x)$ by solving the probability flow ODE; however, there is no closed formula for its derivatives. In this experiment, the dimension of x = [x, y] is 2, and this relatively low dimensionality allows us to approximate the derivatives by discretized differential operators with Δx, a change in the values of x and y that needs to be fixed in advance in practice:

$$ \hat{\nabla} \log q_t^{h=0}(x) := \begin{bmatrix} \dfrac{\log q_t^{h=0}(x_t + [\Delta x, 0]^\top) - \log q_t^{h=0}(x_t - [\Delta x, 0]^\top)}{2\Delta x} \\[2mm] \dfrac{\log q_t^{h=0}(x_t + [0, \Delta x]^\top) - \log q_t^{h=0}(x_t - [0, \Delta x]^\top)}{2\Delta x} \end{bmatrix}, \tag{19} $$

$$ \hat{\nabla}^2 \log q_t^{h=0}(x) := \Big[ \log q_t^{h=0}(x_t + [\Delta x, 0]^\top) + \log q_t^{h=0}(x_t - [\Delta x, 0]^\top) + \log q_t^{h=0}(x_t + [0, \Delta x]^\top) + \log q_t^{h=0}(x_t - [0, \Delta x]^\top) - 4 \log q_t^{h=0}(x_t) \Big] / \Delta x^2. \tag{20} $$

There are inherent discretization errors that we need to evaluate carefully; more on this will be addressed later in our analysis. Consequently, to obtain the O(h¹) correction, we need to evaluate a nested integral. The pseudocode detailing our calculation method is presented in Algorithm 1. We use scipy.integrate.solve_ivp (Virtanen et al., 2020) both in 0th-logq Solver and for the discrete-time update in 1st-logq Solver. We take Tmin > 0 as the initial time and Tmax as the terminal time of the probability flow ODE. This amounts to calculating the NLL for generation by the SDE (9) in the time interval [Tmin, Tmax], and it does not affect the sampling quality if we take a sufficiently small Tmin. We show our results in Table 1.

Numerical Errors. In Algorithm 1, we made two approximations: discrete differential operators and ODE solvers. To ensure the reliability of our results, it is essential to assess and estimate the associated errors. Let us denote the numerical value of $\log q_t^{h=0}(x)$ calculated by 0th-logq Solver in Algorithm 1 as $N[\log q_t^{h=0}(x)]$; then we have two errors:

$$ \mathrm{err}\big[ \hat{\nabla} N[\log q_t^{h=0}(x)] \big] = \nabla \log q_t^{h=0}(x) - \hat{\nabla} N[\log q_t^{h=0}(x)], $$
$$ \mathrm{err}\big[ \hat{\nabla}^2 N[\log q_t^{h=0}(x)] \big] = \nabla^2 \log q_t^{h=0}(x) - \hat{\nabla}^2 N[\log q_t^{h=0}(x)]. \tag{21} $$

Exact calculation of these values is unachievable, but we should nevertheless estimate them somehow. We propose two estimation schemes, subtraction (Appendix F.2) and model (Appendix F.3). Here, we show results based on the latter method, which operates at a higher speed. In addition, we should integrate these local errors⁵ to estimate the error accumulated in the final results. This can also be calculated numerically by the ODE solver, and the numerical errors estimated in this way are shown in Table 1. See Appendix F for more details.

⁵ Related to this, we apply certain tips to ensure stable numerical calculations in practice, such as excluding unstable solutions that occur with negligible frequency. See the README file in the repository mentioned in Footnote 4 for more details.
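The stencils (19) and (20) are standard central differences; a small self-contained sketch (with log q as a generic callable standing in for the 0th-logq solver) looks as follows.

```python
import numpy as np

# Central-difference stencils (19)-(20) for the gradient and Laplacian of a
# scalar field in 2-d, applied to any callable log_q(x, t).
def grad_hat(log_q, x, t, dx=0.01):
    ex, ey = np.array([dx, 0.0]), np.array([0.0, dx])
    return np.array([
        (log_q(x + ex, t) - log_q(x - ex, t)) / (2 * dx),
        (log_q(x + ey, t) - log_q(x - ey, t)) / (2 * dx),
    ])

def laplacian_hat(log_q, x, t, dx=0.01):
    ex, ey = np.array([dx, 0.0]), np.array([0.0, dx])
    return (log_q(x + ex, t) + log_q(x - ex, t)
            + log_q(x + ey, t) + log_q(x - ey, t)
            - 4.0 * log_q(x, t)) / dx**2

# Sanity check on an isotropic Gaussian, where the exact values are -x/v and -2/v:
v = 0.5
log_gauss = lambda x, t: -0.5 * np.sum(x**2) / v
x0 = np.array([0.3, -0.2])
print(grad_hat(log_gauss, x0, 0.0), laplacian_hat(log_gauss, x0, 0.0))
```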
Algorithm 1: 1st-logq Solver

    Input: data x; step Δx in x and y for the discrete differentials; SDE info [f(·, t), g(t), Tmin, Tmax]; solver 0th-logq Solver
    Initialize: t = Tmin, x_t = x, δx_t = 0, δlog q_t = 0
    Calculate log q_0 ← 0th-logq Solver(x, t = Tmin)
    repeat
        log q_t     ← 0th-logq Solver(x_t, t)
        log q_t^{x+} ← 0th-logq Solver(x_t + [Δx, 0]^T, t)
        log q_t^{x-} ← 0th-logq Solver(x_t − [Δx, 0]^T, t)
        log q_t^{y+} ← 0th-logq Solver(x_t + [0, Δx]^T, t)
        log q_t^{y-} ← 0th-logq Solver(x_t − [0, Δx]^T, t)
        Calculate f_θ^PF(x_t, t)
        Calculate δf_θ^PF(x_t, δx_t, t) from the log q_t^{x±/y±} values and (19)
        Calculate ∇·δf_θ^PF(x_t, δx_t, t) from the log q_t^{x±/y±} values and (20)
        Update t, x_t, δx_t, δlog q_t by a time-discretized version (e.g., a Runge-Kutta update) of
            ẋ_t = f_θ^PF(x_t, t),
            δẋ_t = δf_θ^PF(x_t, δx_t, t),
            δlog q̇_t = ∇·δf_θ^PF(x_t, δx_t, t)
    until t = Tmax
    1st-correction = δx_Tmax · ∇log π(x_Tmax) + δlog q_Tmax
    Output: log q_0, 1st-correction

Comparison to 2-Wasserstein. We also show the 2-Wasserstein distance, W2, between validation data and data generated with the stochastic SDE (9) in Fig. 4. We see that the W2 values typically decrease, especially in the small-h region, which signifies an improvement in the quality of the generated data. This observation is consistent with the negative-valued NLL corrections in the 1ST-CORR column of Table 1. The overall trend is the same as in the analytical study in Sec. 5.1: the tendency for noise to enhance sample quality is confirmed by our experiment even in intricate cases where an exact calculation is not accessible. In addition, considering that the values in the 1ST-CORR column are the first-order derivative of the NLL (cross-entropy) with respect to h, one can see that the values for the COSINE SDE are a few times larger in magnitude than those for the SIMPLE SDE, which agrees with the behavior of the first-order derivative near h = 0 in Fig. 4.

Figure 4. 2-Wasserstein metrics (W2) computed with the POT library (Flamary et al., 2021) between validation data and data generated via (9), for Swiss-roll and 25-Gaussian data with SIMPLE and COSINE SDE scheduling. The dots and error bars represent the mean values over 10 independent trials and std/√10, respectively.

Table 1. NLL (cross-entropy) and its O(h¹) corrections. We apply Δx = 0.01 to compute (19) and (20). tol represents the absolute and relative tolerances of the updates in Algorithm 1. We set the absolute and relative tolerances of 0th-logq Solver to order 1e-5.

| Dataset | SDE (NLL) | tol | 1ST-CORR | ERRORS |
|---|---|---|---|---|
| Swiss-roll | SIMPLE (1.39 ± 0.05) | 1e-3 | -0.31 ± 0.21 | 0.13 ± 0.00 |
| | | 1e-5 | -0.44 ± 0.38 | 0.13 ± 0.00 |
| | COSINE (1.42 ± 0.02) | 1e-3 | -1.59 ± 0.57 | 0.35 ± 0.00 |
| | | 1e-5 | -3.27 ± 1.11 | 0.37 ± 0.02 |
| 25-Gaussian | SIMPLE (-1.22 ± 0.01) | 1e-3 | -3.64 ± 0.49 | 0.31 ± 0.00 |
| | | 1e-5 | -3.61 ± 0.64 | 0.32 ± 0.01 |
| | COSINE (-1.71 ± 0.02) | 1e-3 | -17.57 ± 5.56 | 0.70 ± 0.01 |
| | | 1e-5 | -19.65 ± 17.46 | 0.67 ± 0.03 |

6. Conclusions

We presented a novel formulation of diffusion models utilizing the path integral framework originally developed in quantum physics. This formulation provides a unified perspective on various aspects of score-based generative models, and we gave re-derivations of reverse-time SDEs and of loss functions for training. In particular, one can introduce a continuous parameter linking different sampling schemes: the probability flow ODE and stochastic generation. We performed an expansion with respect to this parameter and perturbatively evaluated the negative log-likelihood, which is reminiscent of the WKB expansion in quantum physics. In this way, this formulation provides a new method for scrutinizing the role of noise in the sampling process.

An interesting future direction is an extension of the analysis based on the path integral formalism to diffusion Schrödinger bridges (De Bortoli et al., 2021), in which the prior can be more general. Another direction is to understand the cases mentioned in (Karras et al., 2022), where injecting noise can rather degrade the quality of the generated data. This is an open problem, and there is room for deeper study in terms of log-likelihood calculations based on WKB expansions.
Limitations. Our experiments did not involve actual image data. This omission is primarily due to the limitations of the current evaluation method for NLLs, which is not directly applicable to high-dimensional data such as images on account of the use of explicit discrete differentials.⁶ Another limitation is the potential underestimation of the numerical errors in our computed NLLs. Although our estimated local errors look safe (Appendix F.5), they are potentially underestimated because the estimation is based on a score-based model that does not match $\nabla \log q_t(x)$ exactly in general.

⁶ However, we believe this limitation can be resolved with the following strategy. The main bottleneck was the computation of $\nabla^2 \log q_t(x)$ using numerical discretization. We can sidestep this by introducing a parametrized function $r_\theta(x, t)$, e.g., a neural network, which is to be used as a proxy for $\nabla \log q_t(x)$. To optimize this neural network, we use an objective function defined as

$$ \mathcal{L}(\theta) = \int dt\, \lambda(t)\, \mathbb{E}_Q\big[ \| r_\theta(x, t) - \nabla \log q_t(x) \|^2 \big]. \tag{22} $$

The minimization of $\mathcal{L}(\theta)$ is equivalent to that of the following:

$$ \mathcal{L}'(\theta) = \int dt\, \lambda(t)\, \mathbb{E}_Q\big[ \| r_\theta(x, t) \|^2 + 2\, \nabla \cdot r_\theta(x, t) \big]. \tag{23} $$

The loss function $\mathcal{L}'(\theta)$ can be evaluated via a sample average over paths generated by ODEs. The divergence of $r_\theta(x, t)$ can be efficiently computed by Skilling-Hutchinson trace estimation. We believe this approach enables us to apply our algorithm to higher-dimensional settings such as realistic images.
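For reference, a minimal sketch of the Skilling-Hutchinson estimator mentioned above is given below. The Jacobian-vector product is approximated here by a finite difference; an implementation for a neural $r_\theta$ would use automatic differentiation instead.

```python
import numpy as np

# Skilling-Hutchinson trace estimation for the divergence term in (23):
# div r(x) = E_eps[ eps^T J_r(x) eps ], with the Jacobian-vector product
# approximated by a central finite difference.
def hutchinson_divergence(r, x, n_probes=64, delta=1e-4, rng=np.random.default_rng(0)):
    est = 0.0
    for _ in range(n_probes):
        eps = rng.choice([-1.0, 1.0], size=x.shape)          # Rademacher probe vector
        jvp = (r(x + delta * eps) - r(x - delta * eps)) / (2 * delta)
        est += float(eps @ jvp)
    return est / n_probes

# Check against a case with a known divergence: r(x) = A x has div r = trace(A).
A = np.array([[1.0, 2.0], [0.5, -3.0]])
r = lambda x: A @ x
print(hutchinson_divergence(r, np.array([0.3, -0.7])), np.trace(A))
```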
Acknowledgements

The work of Y. H. was supported in part by JSPS KAKENHI Grant No. JP22H05111. The work of A. T. was supported in part by JSPS KAKENHI Grant No. JP22H05116. The work of K. F. was supported by JSPS KAKENHI Grant Nos. JP22H01216 and JP22H05118.

Impact Statement

This paper presents a novel approach for understanding diffusion models through the path-integral formalism of quantum physics. By offering a new perspective on the behaviors of score-based diffusion models grounded in physical principles, our work bridges the fields of Machine Learning and Physics. We believe this interdisciplinary approach contributes significantly to advancing both fields and could inspire innovative developments. Although our study primarily addresses theoretical aspects, we recognize the potential societal implications of AI advancements and underscore the necessity of ethical considerations and responsible application of these technologies.

References

Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Damgaard, P. H. and Hüffel, H. Stochastic quantization. Physics Reports, 152(5):227-398, 1987. doi: 10.1016/0370-1573(87)90144-X.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34, pp. 17695-17709, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/940392f5f32a7ade1cc201767cf83e31-Paper.pdf.

DeepMind, Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J., Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu, A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T., Hessel, M., Hou, S., Kapturowski, S., Keck, T., Kemaev, I., King, M., Kunesch, M., Martens, L., Merzic, H., Mikulik, V., Norman, T., Papamakarios, G., Quan, J., Ring, R., Ruiz, F., Sanchez, A., Sartran, L., Schneider, R., Sezener, E., Spencer, S., Srinivasan, S., Stanojević, M., Stokowiec, W., Wang, L., Zhou, G., and Viola, F. The DeepMind JAX Ecosystem, 2020. URL http://github.com/google-deepmind.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Feynman, R. P. Space-time approach to non-relativistic quantum mechanics. Rev. Mod. Phys., 20:367-387, 1948. doi: 10.1103/RevModPhys.20.367.

Feynman, R. P. and Hibbs, A. R. Quantum Mechanics and Path Integrals. International Series in Pure and Applied Physics. McGraw-Hill, New York, NY, 1965.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1-8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

Heek, J., Levskaya, A., Oliver, A., Ritter, M., Rondepierre, B., Steiner, A., and van Zee, M. Flax: A neural network library and ecosystem for JAX, 2023. URL http://github.com/google/flax.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34, pp. 22863-22876, 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/c11abfd29e4d9b4d4b566b01114d8486-Paper.pdf.

Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34:22863-22876, 2021b.

Hyvärinen, A., Hurri, J., and Hoyer, P. O. Estimation of non-normalized statistical models. In Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, pp. 419-426, 2009.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565-26577, 2022.

Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696-21707, 2021.
Kingma, D. P. and Gao, R. Understanding the diffusion objective as a weighted integral of ELBOs. arXiv preprint arXiv:2303.00848, 2023.

Kong, X., Brekelmans, R., and Steeg, G. V. Information-theoretic diffusion. 11th International Conference on Learning Representations, 2023.

Lai, C.-H., Takida, Y., Murata, N., Uesaka, T., Mitsufuji, Y., and Ermon, S. FP-Diffusion: Improving score-based diffusion models by enforcing the underlying score Fokker-Planck equation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 18365-18398. PMLR, 2023. URL https://proceedings.mlr.press/v202/lai23d.html.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. 11th International Conference on Learning Representations, 2023.

Lu, C., Zheng, K., Bao, F., Chen, J., Li, C., and Zhu, J. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 14429-14460. PMLR, 2022. URL https://proceedings.mlr.press/v162/lu22f.html.

Machlup, S. and Onsager, L. Fluctuations and irreversible processes. II. Systems with kinetic energy. Phys. Rev., 91:1512-1515, 1953. doi: 10.1103/PhysRev.91.1512.

Messiah, A. Quantum Mechanics Volume I. North-Holland Publishing Company, 1970.

Nelson, E. Derivation of the Schrödinger equation from Newtonian mechanics. Phys. Rev., 150:1079-1085, 1966. doi: 10.1103/PhysRev.150.1079.

Oksendal, B. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2013.

Onsager, L. and Machlup, S. Fluctuations and irreversible processes. Phys. Rev., 91:1505-1512, 1953. doi: 10.1103/PhysRev.91.1505.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Petzka, H., Fischer, A., and Lukovnicov, D. On the regularization of Wasserstein GANs. 6th International Conference on Learning Representations, 2018.

Premkumar, A. Generative diffusion from an action principle. arXiv preprint arXiv:2310.04490, 2023.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415-1428, 2021a.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. 9th International Conference on Learning Representations, 2021b.

Su, X., Song, J., Meng, C., and Ermon, S. Dual diffusion implicit bridges for image-to-image translation. 11th International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5HLoTvVGDe.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, I., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261-272, 2020. doi: 10.1038/s41592-019-0686-2.

Wang, L., Aarts, G., and Zhou, K. Diffusion models as stochastic quantization in lattice field theory. arXiv preprint arXiv:2309.17082, 2023.

Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1-39, 2023.

Zhang, Q. and Chen, Y. Fast sampling of diffusion models with exponential integrator. 11th International Conference on Learning Representations, 2023.

A. Overview of quantum mechanics and the WKB approximation

A quantum physical system is defined by the Hamiltonian H, which is, for example, $H(x, p) = p^2/2 + V(x)$ for a one-particle potential problem (with the mass normalized to unity). Then, the wave function ψ satisfies the Schrödinger equation,

$$ -\frac{\hbar^2}{2} \nabla^2 \psi + V(x)\, \psi = i\hbar\, \frac{\partial \psi}{\partial t}. \tag{24} $$

We can introduce a specific form without loss of generality, $\psi(x, t) = e^{i A(x,t)/\hbar}$, up to an irrelevant normalization, and plug this into Eq. (24). Then, the limit ℏ → 0 gives

$$ \frac{1}{2} (\nabla A)^2 + V(x) = -\frac{\partial A}{\partial t}, \tag{25} $$

which is the Hamilton-Jacobi equation of classical mechanics, derived from the stationary condition of the action, i.e., δA = 0. Importantly, the wave functions at different times, t and t₀, are connected by a kernel convolution as

$$ \psi(x, t) = \int dx_0\, \psi(x_0, t_0)\, K(x, t \,|\, x_0, t_0), \tag{26} $$

where $K(x, t | x_0, t_0)$ is given by $\sum_{\text{paths}} e^{i A[x_t]/\hbar}$ over paths with boundaries at (x, t) and (x₀, t₀). This representation provides a useful intuition: ψ(x, t) is furiously fluctuating and cancels on average over trajectories, unless the trajectories in $K(x, t | x_0, t_0)$ are close enough, within δA = O(ℏ). Therefore, ℏ is a parameter that characterizes the magnitude of quantum fluctuations. In the discussions in the main text, what is called the Euclidean formalism is adopted, in which the imaginary time, it, is regarded as the time; then the weight no longer fluctuates and $e^{iA/\hbar}$ is replaced by $e^{-A/\hbar}$. The kernel then acquires an interpretation as the conditional probability denoted by $p(x_t | x_0)$ in the text. The strategy of expanding the solution of Eq. (24) in powers of ℏ can be carried out systematically and is generally referred to as the WKB expansion. As we discuss in the main part, in diffusion models a counterpart of ℏ, denoted by h, appears in a slightly more complicated way than in quantum mechanics, but the expansion works similarly.

B. From SDE to path integral

In this appendix, we describe the details of the path integral formulation of diffusion models.

B.1. Discretization schemes

In later discussions, we will introduce discretized summations in different ways, which do not seem apparent in their continuum counterparts. We here introduce the discretization schemes that will be used in later analyses. Suppose we have $\{x_t\}_{t\in[0,T]}$ obeying the SDE (1). We discretize the time interval [0, T] with width Δt and M := T/Δt, which we take to be an integer. Let us consider the following summation:

$$ \sum_{n=0}^{M-1} f_{t_n} \cdot (x_{t_n + \Delta t} - x_{t_n}), \tag{27} $$

where $t_n := n \Delta t$ and $f_t = f(x_t, t)$. We will consider the continuum limit Δt → 0 and will refer to this as the Itô scheme, denoted by

$$ \left( \int f \cdot dx \right)_{\text{Itô}} := \int f_t \cdot (x_{t+dt} - x_t). \tag{28} $$

A common prescription in the physics literature is the Stratonovich scheme,

$$ \left( \int f \cdot dx \right)_{\text{Stratonovich}} := \int \frac{f_t + f_{t+dt}}{2} \cdot (x_{t+dt} - x_t), \tag{29} $$

which should be understood as a short-hand notation for the following sum,

$$ \sum_{n=0}^{M-1} \frac{f_{t_n} + f_{t_n + \Delta t}}{2} \cdot (x_{t_n + \Delta t} - x_{t_n}). \tag{30} $$

We also encounter the reverse-Itô scheme, which is written as

$$ \left( \int f \cdot dx \right)_{\text{reverse-Itô}} := \int f(x_{t+dt}) \cdot (x_{t+dt} - x_t). \tag{31} $$

This can be seen as the Itô scheme in reverse time.
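The differences between these prescriptions can be seen numerically. The sketch below evaluates the three sums for the classic example f(x) = x along a simulated Brownian path, where the Itô, Stratonovich, and reverse-Itô conventions converge to $(x_T^2 - T)/2$, $x_T^2/2$, and $(x_T^2 + T)/2$, respectively.

```python
import numpy as np

# Numerical illustration of the discretizations (28)-(31) for int f dx with
# f(x) = x and x_t a standard Brownian motion.
rng = np.random.default_rng(0)
T, M = 1.0, 200_000
dt = T / M
dx = np.sqrt(dt) * rng.standard_normal(M)
x = np.concatenate([[0.0], np.cumsum(dx)])           # x_{t_0}, ..., x_{t_M}

ito          = np.sum(x[:-1] * dx)                    # f evaluated at the left endpoint
stratonovich = np.sum(0.5 * (x[:-1] + x[1:]) * dx)    # average of the two endpoints
reverse_ito  = np.sum(x[1:] * dx)                     # f evaluated at the right endpoint

xT = x[-1]
print(ito, (xT**2 - T) / 2)
print(stratonovich, xT**2 / 2)
print(reverse_ito, (xT**2 + T) / 2)
```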
As we discuss in the main part, in the diffusion models, a counterpart of ℏdenoted by h appears in a slightly more complicated way than quantum mechanics, but the expansion works similarly. B. From SDE to path integral In this appendix, we describe the details of the path integral formulation of diffusion models. B.1. Discretization schemes In later discussions, we will introduce discretized summations in different ways, which do not seem apparent in their continuum counterpart. We here introduce discretization schemes that will be used in later analyses. Suppose we have {xt}t [0,T ] obeying SDE (1). We discretize the time interval [0, T] with width t and M := T/ t, which we take to be an integer. Let us take the following summation: n=0 ftn (xtn+ t xtn), (27) where tn := n t and ft = f(xt, t). We will consider the continuum limit t 0 and will refer to this as the Itˆo scheme, which will be denoted by Z f dx := Z ft (xt+dt xt). (28) A common prescription in the physics literature is the Stratonovich scheme, Z f dx := Z ft + ft+dt 2 (xt+dt xt), (29) which should be understood as a short-hand notation for the following equation, ftn + ftn+ t 2 (xtn+ t xtn). (30) Understanding Diffusion Models by Feynman s Path Integral We also encounter the reverse-Itˆo scheme, which is written as Z f dx := Z f(xt+dt) (xt+dt xt). (31) This can be seen as the Itˆo scheme in reverse time. B.2. Derivation of the path integral expression Here, we give a derivation of the path integral representation of a diffusion model (Proposition 3.1). We consider a forward process described by the SDE (1). In the physics literature, Eq. (1) is commonly expressed in the following form, xt = f(xt, t) + ξt, (32) where the noise term satisfies E[ξi tξj t ] = g(t)2δijδ(t t ). (33) The expectation value of a generic observable O({xt}) over noise realizations can be expressed as EP [{xt}][O({xt})] = Z [Dxt][Dξt] O({xt}) t δ(xt xsol t ) e R T 0 1 2g(t)2 ξt 2dtp0(x0), (34) where p0(x0) is the probability distribution of initial states (i.e., the data distribution), δ( ) is Dirac s delta function, and xsol t is the solution of Eq. (32). The symbol [Dxt] t denotes [Dxt] t := QM n=0 C( t)dxtn, where C( t) is a normalization constant, and similarly for [Dξt]. When the weight is sufficiently well-behaved (like e R T 0 1 2g(t)2 ξt 2dt in Eq. (34)), the limit t 0 the path integral is well-defined. We will be considering this limit at the end of the calculation do not explicitly indicate the dependence on t hereafter. The delta function imposes that xt is a solution of Eq. (32) under a given noise realization. Using the change-of-variable formula for the delta function, δ(xt xsol t ) = det δEOMt δ(EOMt), (35) where EOMt := xt f(xt, t) ξt. The Jacobian part gives a nontrivial contribution (see Appendix C for a detailed derivation) in the case of the Stratonovich scheme, det δEOMt = det( tδij ifj) = det t det(δij 1 t ifj) 2 R f(xt,t)dt, while it gives a trivial factor for the Itˆo scheme. Writing this factor as e J, the expectation value can now be written as EP [{xt}][O({xt})] = Z [Dxt][Dξt] O({xt})e J Y t δ( xt ft ξt) e R T 0 1 2g(t)2 ξt 2dtp0(x0). 
Performing the integration over $\xi_t$, we arrive at the expression

$$ \mathbb{E}_{P[\{x_t\}]}[O(\{x_t\})] = \int [\mathcal{D}x_t]\; O(\{x_t\})\, e^{-\mathcal{A}}\, p_0(x_0). \tag{38} $$

Here, we defined the action $\mathcal{A}$ by

$$ \mathcal{A} = \int_0^T L(\dot{x}_t, x_t)\, dt + J, \tag{39} $$

where $L(\dot{x}_t, x_t)$ is the Onsager-Machlup function (Onsager & Machlup, 1953; Machlup & Onsager, 1953), given by

$$ L(\dot{x}_t, x_t) = \frac{1}{2 g(t)^2} \| \dot{x}_t - f(x_t, t) \|^2, \tag{40} $$

and J is the term coming from the Jacobian, which depends on the choice of the discretization scheme:

$$ J = \begin{cases} 0 & \text{Itô}, \\ \frac{1}{2} \int_0^T \nabla \cdot f(x_t, t)\, dt & \text{Stratonovich}, \\ \int_0^T \nabla \cdot f(x_t, t)\, dt & \text{reverse-Itô}. \end{cases} \tag{41} $$

This concludes the derivation of Proposition 3.1. Any observable can be computed using the weight given by $\mathcal{A}$. For example, one can compute the joint probability of the variables at discrete time points $\{t_n\}_{n=1,\dots,N}$ as

$$ P(y_{t_1}, y_{t_2}, \dots, y_{t_N}) = \int [\mathcal{D}x_t]\; e^{-\mathcal{A}}\, p_0(x_0) \prod_{n=1}^N \delta(y_{t_n} - x_{t_n}). \tag{42} $$

Before ending this section, let us comment on the mathematical interpretation of the Onsager-Machlup function (40). A reader might be concerned with the meaning of the time derivative $\dot{x}_t$, since the paths in question are almost surely non-differentiable. Indeed, the notation of the time derivative $\dot{x}_t$ appearing in the action is only symbolic. Precisely speaking, the action,

$$ \int_0^T dt\; \frac{1}{2 g(t)^2} \| \dot{x}_t - f(x_t, t) \|^2, \tag{43} $$

should be understood through its discretized form,

$$ \mathcal{A}_{\mathrm{disc}} = \sum_i \Delta t\; \frac{1}{2 g(t_i)^2} \left\| \frac{x_{t_i + \Delta t} - x_{t_i}}{\Delta t} - f(x_{t_i}, t_i) \right\|^2. \tag{44} $$

The path-integral measure is obtained in the limit Δt → 0, i.e.,

$$ \lim_{\Delta t \to 0} e^{-\mathcal{A}_{\mathrm{disc}}}. \tag{45} $$

This limit is well-defined. In this form, it is apparent that the paths do not need to be differentiable.

B.3. Time-reversed SDE

Here we present the derivation of the time-reversed dynamics in the path integral formalism (Proposition 3.2). We will use the Itô scheme and comment on other discretization schemes later. We start by rewriting Eq. (38) as

$$ \mathbb{E}_{P[\{x_t\}]}[O(\{x_t\})] = \int [\mathcal{D}x_t]\; O(\{x_t\})\, p_T(x_T)\, e^{-\int_0^T L\, dt - \ln p_T(x_T) + \ln p_0(x_0)} = \int [\mathcal{D}x_t]\; O(\{x_t\})\, p_T(x_T)\, e^{-\tilde{\mathcal{A}}}, \tag{46} $$

where $\tilde{\mathcal{A}} := \int_0^T \tilde{L}(\dot{x}_t, x_t)\, dt$ with

$$ \tilde{L}(\dot{x}_t, x_t) := L(\dot{x}_t, x_t) + \frac{d}{dt} \ln p_t(x_t). \tag{47} $$

The total time derivative of $\ln p_t(x_t)$ is written as

$$ d \ln p_t(x_t) = \partial_t \ln p_t(x_t)\, dt + dx_t \cdot \nabla \ln p_t(x_t) + \frac{1}{2} g(t)^2 \nabla^2 \ln p_t(x_t)\, dt, \tag{48} $$

where we have used Itô's formula. For $\partial_t \ln p_t(x_t)$, we use the Fokker-Planck equation,

$$ \partial_t p_t(x) = -\nabla \cdot \big[ f(x, t)\, p_t(x) \big] + \frac{g(t)^2}{2} \nabla^2 p_t(x), \tag{49} $$

which we rewrite as

$$ \partial_t \ln p_t(x) = -\nabla \cdot f(x, t) - f(x, t) \cdot \nabla \ln p_t(x) + \frac{g(t)^2}{2} \Big[ \nabla^2 \ln p_t(x) + \big( \nabla \ln p_t(x) \big)^2 \Big]. \tag{50} $$

The new Lagrangian $\tilde{L}(\dot{x}_t, x_t)$ is now written as

$$ \tilde{L}(\dot{x}_t, x_t) = \frac{1}{2 g(t)^2} \Big[ \|\dot{x}_t\|^2 - 2 \dot{x}_t \cdot \big( f(x_t, t) - g(t)^2 \nabla \ln p_t(x_t) \big) + \| f(x_t, t) \|^2 \Big] + \frac{g(t)^2}{2} \nabla^2 \ln p_t(x_t) - \nabla \cdot f(x_t, t) - f(x_t, t) \cdot \nabla \ln p_t(x_t) + \frac{g(t)^2}{2} \Big[ \nabla^2 \ln p_t(x_t) + \big( \nabla \ln p_t(x_t) \big)^2 \Big] = \frac{1}{2 g(t)^2} \big\| \dot{x}_t - f(x_t, t) + g(t)^2 \nabla \ln p_t(x_t) \big\|^2 - \nabla \cdot \big[ f(x_t, t) - g(t)^2 \nabla \ln p_t(x_t) \big]. \tag{51} $$

Thus, the action $\tilde{\mathcal{A}}$ is given by

$$ \tilde{\mathcal{A}} = \int_0^T \tilde{L}(\dot{x}_t, x_t)\, dt = \int_0^T \Big[ \frac{1}{2 g(t)^2} \| \dot{x}_t - \tilde{f}(x_t, t) \|^2 - \nabla \cdot \tilde{f}(x_t, t) \Big]\, dt, \tag{52} $$

where

$$ \tilde{f}(x_t, t) := f(x_t, t) - g(t)^2 \nabla \ln p_t(x_t). \tag{53} $$

Note that a nontrivial Jacobian term appears in Eq. (52). This term disappears if we rewrite the integral using the inverse time τ in the Itô convention. The action $\tilde{\mathcal{A}}$ contains the following contribution,

$$ \tilde{\mathcal{A}} \supset -\int \frac{1}{g(t)^2}\, \tilde{f}(x_t, t) \cdot dx_t, \tag{54} $$

where this product is currently written in the Itô scheme. We rewrite this term in the reverse-Itô scheme,

$$ \int \tilde{f}_t \cdot dx_t = \int \big[ \tilde{f}_{t+dt} - (\tilde{f}_{t+dt} - \tilde{f}_t) \big] \cdot dx_t = \int \tilde{f}_{t+dt} \cdot dx_t - \int \sum_{i,j} \partial_i (\tilde{f}_t)_j\, dx^i\, dx^j = \int \tilde{f}_{t+dt} \cdot dx_t - \int g(t)^2\, \nabla \cdot \tilde{f}_t\, dt. \tag{55} $$

The second term of this equation cancels the Jacobian term. Thus, the time-reversed action can be naturally interpreted as an Itô integral in reverse time. The same procedure can also be carried out for the Stratonovich scheme.
The difference is the presence of the Jacobian term in the original action, and the total time derivative of $p_t(x_t)$ is written as

$$ d \ln p_t(x_t) = \partial_t \ln p_t(x_t)\, dt + dx_t \cdot \nabla \ln p_t(x_t) \tag{56} $$

instead of Eq. (48). Following similar steps, the time-reversed action is found to be

$$ \tilde{\mathcal{A}} = \int_0^T \tilde{L}(\dot{x}_t, x_t)\, dt = \int_0^T \Big[ \frac{1}{2 g(t)^2} \| \dot{x}_t - \tilde{f}_t \|^2 - \frac{1}{2} \nabla \cdot \tilde{f}_t \Big]\, dt. \tag{57} $$

Note that the sign of the Jacobian term is flipped compared with the original action, which allows us to interpret the process as a time-reversed one.

B.4. Evaluation of the KL divergence

We here give the derivation of Proposition 3.3. We evaluate the upper bound on the KL divergence between the data distribution and a model,

$$ D_{\mathrm{KL}}(p_0(x_0) \,\|\, q_0(x_0)) \le D_{\mathrm{KL}}\big( P(\{x_t\}_{t\in[0,T]}) \,\|\, Q(\{x_t\}_{t\in[0,T]}) \big), \tag{58} $$

where we used the data-processing inequality. Below, we evaluate the r.h.s. of Eq. (58). The time-reversed action of the data distribution (in the reverse-Itô scheme) reads

$$ \tilde{\mathcal{A}} := \int_0^T \tilde{L}(\dot{x}_t, x_t)\, dt = \int_0^T \frac{1}{2 g(t)^2} \| \dot{x}_t - \tilde{f}(x_t, t) \|^2\, dt, \tag{59} $$

and the time-reversed action of a model is given by

$$ \tilde{\mathcal{A}}_\theta := \int_0^T \tilde{L}_\theta(\dot{x}_t, x_t)\, dt = \int_0^T \frac{1}{2 g(t)^2} \| \dot{x}_t - \tilde{f}_\theta(x_t, t) \|^2\, dt, \tag{60} $$

with

$$ \tilde{f}_\theta(x_t, t) := f(x_t, t) - g(t)^2 s_\theta(x_t, t). \tag{61} $$

We note that we employ the reverse-Itô scheme in the following computation, which is why there is no Jacobian term in Eqs. (59) and (60). We rewrite the joint probabilities of paths as

$$ P(\{x_t\}_{t\in[0,T]}) = e^{-\mathcal{A}}\, p_0(x_0) = p_T(x_T)\, e^{-\tilde{\mathcal{A}}}, \tag{62} $$
$$ Q(\{x_t\}_{t\in[0,T]}) = e^{-\mathcal{A}_\theta}\, q_0(x_0) = q_T(x_T)\, e^{-\tilde{\mathcal{A}}_\theta}. \tag{63} $$

The KL divergence of the path-probability $P(\{x_t\}_t)$ from $Q(\{x_t\}_t)$ is written as

$$ D_{\mathrm{KL}}\big( P(\{x_t\}_{t\in[0,T]}) \,\|\, Q(\{x_t\}_{t\in[0,T]}) \big) = \mathbb{E}_{P(\{x_t\})} \Big[ \ln \frac{P(\{x_t\}_{t\in[0,T]})}{Q(\{x_t\}_{t\in[0,T]})} \Big] = \mathbb{E}_{P(\{x_t\})} \Big[ \ln \frac{p_T(x_T)}{q_T(x_T)} - \tilde{\mathcal{A}} + \tilde{\mathcal{A}}_\theta \Big] = D_{\mathrm{KL}}\big( p_T(x_T) \,\|\, q_T(x_T) \big) + \mathbb{E}_{P(\{x_t\})} \big[ \tilde{\mathcal{A}}_\theta - \tilde{\mathcal{A}} \big]. \tag{64} $$

The second term of the r.h.s. can be written as

$$ \mathbb{E}_{P(\{x_t\})} \big[ \tilde{\mathcal{A}}_\theta - \tilde{\mathcal{A}} \big] = \mathbb{E}_{P(\{x_t\})} \int_0^T \frac{1}{2 g(t)^2} \Big[ -2 \dot{x}_t \cdot \big( \tilde{f}_\theta(x_t, t) - \tilde{f}(x_t, t) \big) + \| \tilde{f}_\theta(x_t, t) \|^2 - \| \tilde{f}(x_t, t) \|^2 \Big]\, dt = \mathbb{E}_{P(\{x_t\})} \int_0^T \frac{1}{2 g(t)^2} \big[ \tilde{f}_\theta(x_{t+dt}, t+dt) - \tilde{f}(x_{t+dt}, t+dt) \big] \cdot \big[ -2 \dot{x}_t + \tilde{f}_\theta(x_t, t) + \tilde{f}(x_t, t) \big]\, dt. \tag{65} $$

We discretize this and look at the contribution from the neighboring pair (t, t + Δt),

$$ \mathbb{E}_{P(x_t, x_{t+\Delta t})} \Big[ \Delta t\, \frac{1}{2 g(t)^2} \big( \tilde{f}_{\theta, t+\Delta t} - \tilde{f}_{t+\Delta t} \big) \cdot \big( -2 (x_{t+\Delta t} - x_t)/\Delta t + \tilde{f}_{\theta, t} + \tilde{f}_t \big) \Big] = \mathbb{E} \Big[ \Delta t\, \frac{1}{2 g(t)^2} \big( \tilde{f}_{\theta, t+\Delta t} - \tilde{f}_{t+\Delta t} \big) \cdot \big( -2 \tilde{f}_t + \tilde{f}_{\theta, t} + \tilde{f}_t \big) \Big] = \mathbb{E} \Big[ \Delta t\, \frac{1}{2 g(t)^2} \big\| \tilde{f}_{\theta, t} - \tilde{f}_t \big\|^2 \Big] + O(\Delta t^2), \tag{66} $$

where we performed the average over δ = x_{t+Δt} − x_t. Summing up these contributions over [0, T], we have

$$ \mathbb{E}_{P(\{x_t\})} \big[ \tilde{\mathcal{A}}_\theta - \tilde{\mathcal{A}} \big] = \mathbb{E}_{P(\{x_t\})} \int_0^T \frac{1}{2 g(t)^2} \big\| \tilde{f}_\theta(x_t, t) - \tilde{f}(x_t, t) \big\|^2\, dt = \mathbb{E}_{P(\{x_t\})} \int_0^T \frac{g(t)^2}{2} \big\| \nabla \ln p_t(x_t) - s_\theta(x_t, t) \big\|^2\, dt. \tag{67} $$

Thus, we have obtained the following inequality,

$$ D_{\mathrm{KL}}(p_0(x_0) \,\|\, q_0(x_0)) \le D_{\mathrm{KL}}\big( p_T(x_T) \,\|\, q_T(x_T) \big) + \int_0^T \frac{g(t)^2}{2}\, \mathbb{E}_{p_t} \big[ \| \nabla \ln p_t(x_t) - s_\theta(x_t, t) \|^2 \big]\, dt. \tag{68} $$

The distribution $q_T(x_T)$ is taken to be a prior, π(x_T). This concludes the proof of Proposition 3.3.

B.5. Conditional variants

A similar argument applies to the case when the initial state is fixed. The derivation of the reverse process can be obtained by replacing the probability densities with ones conditioned on a chosen initial state $x^\star_0$. The expectation value of a general observable in this situation can be written as

$$ \mathbb{E}_{P(\{x_t\}_{t\in(0,T]} | x^\star_0)}[O(\{x_t\})] = \int [\mathcal{D}x_t]\; O(\{x_t\})\; e^{-\int L\, dt}\; \delta(x_0 - x^\star_0). \tag{69} $$

Similarly to the previous section, we can rewrite this as

$$ \mathbb{E}_{P(\{x_t\}_{t\in(0,T]} | x^\star_0)}[O(\{x_t\})] = \int [\mathcal{D}x_t]\; O(\{x_t\})\; p_T(x_T | x^\star_0)\; e^{-\tilde{\mathcal{A}}(x^\star_0)}\; \delta(x_0 - x^\star_0), \tag{70} $$

where $\tilde{\mathcal{A}}(x^\star_0) := \int_0^T \tilde{L}(\dot{x}_t, x_t | x^\star_0)\, dt$ with

$$ \tilde{L}(\dot{x}_t, x_t | x^\star_0) := L(\dot{x}_t, x_t) + \frac{d}{dt} \ln p_t(x_t | x^\star_0). \tag{71} $$
B.5. Conditional variants

A similar argument applies to the case when the initial state is fixed. The derivation of the reverse process can be obtained by replacing the probability densities with ones conditioned on a chosen initial state $x'_0$. The expectation value of a general observable in this situation can be written as
$$\mathbb{E}_{P(\{x_t\}_{t\in(0,T]}|x'_0)}\big[O(\{x_t\})\big] = \int[\mathcal{D}x_t]\, O(\{x_t\})\, e^{-\int L\,dt}\,\delta(x_0 - x'_0). \quad (69)$$
Similarly to the previous section, we can rewrite this as
$$\mathbb{E}_{P(\{x_t\}_{t\in(0,T]}|x'_0)}\big[O(\{x_t\})\big] = \int[\mathcal{D}x_t]\, O(\{x_t\})\, p_T(x_T|x'_0)\, e^{-\widetilde{\mathcal{A}}(x'_0)}\,\delta(x_0 - x'_0), \quad (70)$$
where $\widetilde{\mathcal{A}}(x'_0) := \int_0^T \widetilde L(\dot x_t, x_t|x'_0)\,dt$ with
$$\widetilde L(\dot x_t, x_t|x'_0) := L(\dot x_t, x_t) + \frac{d}{dt}\ln p_t(x_t|x'_0). \quad (71)$$
Following calculations similar to the unconditional case, the force of the reverse process turns out to be given by
$$\widetilde f(x_t,t) := f(x_t,t) - g(t)^2\nabla\ln p_t(x_t|x'_0). \quad (72)$$
One can also repeat a similar argument for the evaluation of the KL divergence with fixed $x'_0$, which corresponds to the ELBO-based loss (Kingma et al., 2021; Kingma & Gao, 2023):

Proposition B.1. Let $p_t(x|x_0)$ be the Markov kernel from time $0$ to $t$ determined by the Fokker-Planck equation. Determining the Onsager-Machlup function for the reverse process, $\widetilde L(\dot x_t, x_t|x'_0)$, so as to satisfy
$$p_0(x_0|x'_0)\, e^{-\int_0^T L(\dot x_t,x_t)\,dt + J}\,[\mathcal{D}x_t] = e^{-\int_0^T \widetilde L(\dot x_t,x_t|x'_0)\,dt + \widetilde J(x'_0)}\, p_T(x_T|x'_0)\,[\mathcal{D}x_t], \quad (73)$$
we obtain
$$\widetilde L(\dot x_t, x_t|x'_0) = \frac{\big\|\dot x_t - f(x_t,t) + g(t)^2\nabla\log p_t(x_t|x'_0)\big\|^2}{2g(t)^2}, \quad (74)$$
where $\widetilde J(x'_0)$ is the conditional Jacobian for the reverse process, depending on the discretization scheme.

This representation proves to be practically valuable, especially since in most cases we do not have access to the score function $\nabla\log p_t(x)$.

Proposition B.2. The KL divergence of path probabilities can be represented in the path integral form,
$$D_{\rm KL}\big(P(\cdot|x'_0)\,\|\,Q_\theta\big) = \int e^{-\widetilde{\mathcal{A}}(x'_0)}\, p_T(x_T)\,\log\frac{e^{-\widetilde{\mathcal{A}}(x'_0)}\, p_T(x_T)}{e^{-\widetilde{\mathcal{A}}_\theta}\,\pi(x_T)}\,[\mathcal{D}x_t], \quad (75)$$
where $\pi$ is a prior distribution, and it can be computed as
$$D_{\rm KL}\big(P(\cdot|x'_0)\,\|\,Q_\theta\big) = D_{\rm KL}(p_T\,\|\,\pi) + \int_0^T \frac{g(t)^2}{2}\,\mathbb{E}_{p_t}\big\|\nabla\log p_t(x_t|x'_0) - s_\theta(x_t,t)\big\|^2\,dt. \quad (76)$$

B.6. Likelihood evaluation

Here we give the proof of Theorem 4.1. The probability distribution $q^h_t(x)$ corresponding to the backward process (5) satisfies the following Fokker-Planck equation,
$$\partial_t q^h_t(x) = -\nabla\cdot\big[f^{\rm PF}_\theta(x,t)\, q^h_t(x)\big] - \frac{h g(t)^2}{2}\nabla\cdot\Big[\big(s_{\theta,t}(x) - \nabla\log q^h_t(x)\big)\, q^h_t(x)\Big]. \quad (77)$$
Note that $q^h_t(x)$ depends on $h$ nontrivially when $s_\theta(x,t) \neq \nabla\log q^h_t(x)$. Introducing a new parameter $\gamma \in \mathbb{R}_{\ge 0}$, Eq. (77) can be written as
$$\partial_t q^h_t(x) = -\nabla\cdot\big[f^{\rm PF}_\theta(x,t)\, q^h_t(x)\big] - \frac{hg(t)^2}{2}\nabla\cdot\Big[\big(s_{\theta,t}(x)-\nabla\log q^h_t(x)\big)\, q^h_t(x)\Big] - \frac{\gamma g(t)^2}{2}\nabla\cdot\big[\nabla\log q^h_t(x)\, q^h_t(x)\big] + \frac{\gamma g(t)^2}{2}\nabla^2 q^h_t(x) = -\nabla\cdot\Big[\Big(f^{\rm PF}_\theta(x,t) - \frac{g(t)^2}{2}\big[h\big(s_{\theta,t}(x)-\nabla\log q^h_t(x)\big) + \gamma\nabla\log q^h_t(x)\big]\Big)\, q^h_t(x)\Big] + \frac{\gamma g(t)^2}{2}\nabla^2 q^h_t(x). \quad (78)$$
Let us define
$$F_{h,\gamma}(x,t) := f^{\rm PF}_\theta(x,t) - \frac{hg(t)^2}{2}s_\theta(x,t) + \frac{(h-\gamma)g(t)^2}{2}\nabla\log q^h_t(x). \quad (79)$$
We consider the path integral expression for Eq. (78). The corresponding action is given by
$$\widetilde{\mathcal{A}}_{h,\gamma} = \int_0^T dt\, \widetilde L_{h,\gamma}(\dot x_t, x_t) = \int_0^T \frac{1}{2\gamma g(t)^2}\big\|\dot x_t - F_{h,\gamma}(x_t,t)\big\|^2\,dt. \quad (80)$$
Using the It\^o scheme, one can obtain the likelihood for finite $h$ by taking the limit $\gamma \to 0$,
$$q^h_0(x'_0) \propto \int \prod_{t\in[0,T]} dx_t\, \delta(x_0 - x'_0)\, e^{-\widetilde{\mathcal{A}}_{h,\gamma} + \int_0^T \nabla\cdot F_{h,\gamma}\,dt}\, q^h_T(x_T) \;\xrightarrow{\gamma\to 0}\; e^{\int_0^T \nabla\cdot F_{h,\gamma=0}\,dt}\, q^h_T(x_T). \quad (81)$$
Thus, we obtain the formula for the likelihood for finite $h$,
$$\log q^h_0(x_0) = \log q^h_T(x_T) + \int_0^T \nabla\cdot\Big[f^{\rm PF}_\theta(x_t,t) - \frac{hg(t)^2}{2}\big(s_\theta(x_t,t) - \nabla\log q^h_t(x_t)\big)\Big]\,dt. \quad (82)$$
Since $q^h_T(x_T)$ is taken to be a prior, $q^h_T(x_T) = \pi(x_T)$, this concludes the proof of Theorem 4.1.
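At $h = 0$ the bracket in (82) reduces to $f^{\rm PF}_\theta$ and the formula becomes the familiar continuous-normalizing-flow identity. The following minimal sketch (an illustrative one-dimensional setting with an exactly known score, not from the paper) integrates that special case with scipy.integrate.solve_ivp and checks the result against the known data density.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the likelihood formula (82) at h = 0 (probability flow ODE):
#   log q_0(x_0) = log pi(x_T) + int_0^T div f_PF(x_t, t) dt,  dx_t/dt = f_PF(x_t, t).
# Illustrative assumptions: f(x,t) = -(beta/2) x, g^2 = beta, and an exactly trained
# score s_theta(x,t) = -x / v_t, so the result can be checked against N(0, v_0).
beta, v0, T = 1.0, 4.0, 5.0
v = lambda t: 1.0 + np.exp(-beta * t) * (v0 - 1.0)
s_theta = lambda x, t: -x / v(t)
f_pf = lambda x, t: -0.5 * beta * x - 0.5 * beta * s_theta(x, t)
div_f_pf = lambda x, t: -0.5 * beta + 0.5 * beta / v(t)   # analytic divergence in 1D

def rhs(t, state):
    x, logdet = state
    return [f_pf(x, t), div_f_pf(x, t)]

x0 = 1.3
sol = solve_ivp(rhs, (0.0, T), [x0, 0.0], rtol=1e-8, atol=1e-8)
xT, integral = sol.y[0, -1], sol.y[1, -1]
log_prior = -0.5 * np.log(2 * np.pi * v(T)) - 0.5 * xT ** 2 / v(T)   # pi = N(0, v_T)
log_q0 = log_prior + integral
log_p0 = -0.5 * np.log(2 * np.pi * v0) - 0.5 * x0 ** 2 / v0
print(log_q0, log_p0)   # the two agree up to solver tolerance
```

For $h > 0$ the same formula is self-consistent, because $\nabla\log q^h_t$ itself appears on the right-hand side; that is the motivation for the expansion in $h$ carried out next.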
B.7. Likelihood up to first order in h

Here we give a proof of Theorem 4.2. First, we denote the solution of the modified probability flow ODE (15) by $x^h_t$, i.e.,
$$\dot x^h_t = f^{\rm PF}_\theta(x^h_t,t) - \frac{hg(t)^2}{2}\big[s_\theta(x^h_t,t) - \nabla\log q^h_t(x^h_t)\big], \qquad x^h_{t=0} = x_0, \quad (83)$$
and it can be represented by the formal integral
$$x^h_t = x_0 + \int_0^t \Big[f^{\rm PF}_\theta(x^h_{t'},t') - \frac{hg(t')^2}{2}\big(s_\theta(x^h_{t'},t') - \nabla\log q^h_{t'}(x^h_{t'})\big)\Big]\,dt'. \quad (84)$$
Next, we consider the Taylor expansion of $f^{\rm PF}_\theta(x^h_t,t)$ in $h$:
$$f^{\rm PF}_\theta(x^h_t,t) = f^{\rm PF}_\theta(x_t,t) + h\,\underbrace{\partial_h f^{\rm PF}_\theta(x^h_t,t)\big|_{h=0}}_{(\partial_h x^h_t|_{h=0}\cdot\nabla)\,f^{\rm PF}_\theta(x_t,t)} + O(h^2). \quad (85)$$
Now we define $\partial_h x^h_t|_{h=0} =: \delta x_t$ for simplicity; then, by differentiating (84) with respect to $h$, we get
$$\delta x_t = \int_0^t \Big[(\delta x_{t'}\cdot\nabla)f^{\rm PF}_\theta(x_{t'},t') - \frac{g(t')^2}{2}\big(s_\theta(x_{t'},t') - \nabla\log q^{h=0}_{t'}(x_{t'})\big)\Big]\,dt', \quad (86)$$
which is the integral form of
$$\dot{\delta x}_t = \underbrace{(\delta x_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big[s_\theta(x_t,t) - \nabla\log q^{h=0}_t(x_t)\big]}_{=: \,\delta f^{\rm PF}_\theta(x_t,\delta x_t,t)}. \quad (87)$$
By combining it with the $O(h^0)$ ODE for $x_t$, i.e., the probability flow ODE, we get Eq. (17). Next, we apply the generic formula, valid for an arbitrary function $f(x)$,
$$f(x^h_t) = f(x_t) + h\,\delta x_t\cdot\nabla f(x_t) + O(h^2), \quad (88)$$
to the right-hand side of the self-consistent equation for the log-likelihood in Theorem 4.1:
$$\log q^h_0(x_0) = \log\pi(x^h_T) + \int_0^T \nabla\cdot\Big[f^{\rm PF}_\theta(x^h_t,t) - \frac{hg(t)^2}{2}\big(s_\theta(x^h_t,t) - \nabla\log q^h_t(x^h_t)\big)\Big]dt = \log\pi(x_T) + h\,\delta x_T\cdot\nabla\log\pi(x_T) + \int_0^T \nabla\cdot\Big[f^{\rm PF}_\theta(x_t,t) + h\,(\delta x_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{hg(t)^2}{2}\big(s_\theta(x_t,t) - \nabla\log q^{h=0}_t(x_t)\big)\Big]dt + O(h^2). \quad (89)$$
This expression is equivalent to what we wanted to prove, i.e., Eq. (16).
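The relation $x^h_T = x_T + h\,\delta x_T + O(h^2)$ implied by (83)-(87) can be checked numerically. The sketch below does this in an illustrative one-dimensional Gaussian setting (the same family as in Appendix D): the biased score, the closed-form $h=0$ model variance $v'_t = v_t\,(v_t/v_T)^{\epsilon}\,e^{-\epsilon\beta(T-t)}$ (one can verify it solves (105) at $h=0$ with $v'_T = v_T$), and all helper names are assumptions made for this example only.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Minimal 1D check of the first-order expansion (83)-(87): integrate the paired ODE
# (x_t, delta x_t) of Eq. (87) and compare delta x_T with a finite difference in h.
beta, v0, T, eps = 1.0, 4.0, 2.0, 0.3
v = lambda t: 1.0 + np.exp(-beta * t) * (v0 - 1.0)          # forward variance
vp = lambda t: v(t) * (v(t) / v(T)) ** eps * np.exp(-eps * beta * (T - t))  # h=0 model variance
s_theta = lambda x, t: -(1.0 + eps) * x / v(t)               # biased score model
grad_log_q0 = lambda x, t: -x / vp(t)
f_pf = lambda x, t: -0.5 * beta * x - 0.5 * beta * s_theta(x, t)
dx_f_pf = lambda t: -0.5 * beta + 0.5 * beta * (1.0 + eps) / v(t)   # d f_pf / dx

def paired_rhs(t, y):
    x, dx = y
    ds = s_theta(x, t) - grad_log_q0(x, t)
    return [f_pf(x, t), dx * dx_f_pf(t) - 0.5 * beta * ds]   # Eq. (87)

def perturbed_rhs(t, y, h):
    # h-modified flow; using grad_log_q0 instead of grad_log_q^h only changes O(h^2).
    x = y[0]
    return [f_pf(x, t) - 0.5 * h * beta * (s_theta(x, t) - grad_log_q0(x, t))]

x0, h = 1.3, 1e-4
sol = solve_ivp(paired_rhs, (0.0, T), [x0, 0.0], rtol=1e-9, atol=1e-9)
sol_h = solve_ivp(perturbed_rhs, (0.0, T), [x0], args=(h,), rtol=1e-9, atol=1e-9)
sol_0 = solve_ivp(perturbed_rhs, (0.0, T), [x0], args=(0.0,), rtol=1e-9, atol=1e-9)
print("delta x_T from (87):   ", sol.y[1, -1])
print("finite difference in h:", (sol_h.y[0, -1] - sol_0.y[0, -1]) / h)
```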
C. Computation of the determinant

Let us detail the computation of the determinant of an operator of the form
$$\det\left(\partial_t - M\right). \quad (90)$$
We first factorize this as $\det(\partial_t - M) = \det\partial_t \cdot \det\left(1 - \partial_t^{-1}M\right)$. The latter factor is computed as
$$\det\left(1 - \partial_t^{-1}M\right) = \exp\Big[\log\det\left(1 - \partial_t^{-1}M\right)\Big] = \exp\Big[\mathrm{tr}\,\ln\left(1 - \partial_t^{-1}M\right)\Big] = \exp\Big[-\sum_n \frac{1}{n}\,\mathrm{tr}\left(\partial_t^{-1}M\right)^n\Big]. \quad (91)$$
Noting that $\partial_t^{-1}\delta(t-t') = \theta(t-t')$, the term with $n=1$ gives
$$\exp\left[-\theta(0)\int dt\,\mathrm{tr}\,M\right], \quad (92)$$
where $\theta(x)$ is the step function. Its value at $0$ depends on the discretization scheme:
$$\theta(0) = \begin{cases} 0 & \text{It\^o} \\ \tfrac12 & \text{Stratonovich} \\ 1 & \text{reverse-It\^o} \end{cases} \quad (93)$$
All the higher-order terms vanish. For example, the term with $n=2$ reads
$$\int \theta(t-t')\,\theta(t'-t)\,\mathrm{tr}\,M^2\,dt\,dt' = 0. \quad (94)$$
Thus, we have
$$\det\left(\partial_t - M\right) \propto \exp\left[-\theta(0)\int dt\,\mathrm{tr}\,M\right], \quad (95)$$
with $\theta(0)$ given by Eq. (93).

D. Details of the example in Sec. 5.1

We here describe the details of the simple example discussed in Sec. 5.1. The data distribution is taken to be a one-dimensional Gaussian distribution,
$$P_0(x_0) = \mathcal{N}(x_0\,|\,0, v_0). \quad (96)$$
For the forward process, we use the following SDE,
$$dx_t = -\frac{\beta}{2}x_t\,dt + \sqrt{\beta}\,dw_t. \quad (97)$$
Namely, we have chosen the force and noise strength as
$$f(x,t) = -\frac{\beta}{2}x, \qquad g(t) = \sqrt{\beta}. \quad (98)$$
We here take $\beta$ to be constant. In the current situation, the distribution stays Gaussian with zero mean throughout the time evolution, and the distribution $p_t(x)$ is fully specified by its variance. Using It\^o's formula,
$$d(x_t^2) = 2x_t\,dx_t + (dx_t)^2 = -\beta x_t^2\,dt + \beta\,dt + 2\sqrt{\beta}\,x_t\,dw_t. \quad (99)$$
Thus, the time evolution of the variance $v_t$ is described by
$$dv_t = -\beta(v_t - 1)\,dt, \quad (100)$$
which can be solved with the initial condition $v_{t=0} = v_0$ as
$$v_t = 1 + e^{-\beta t}(v_0 - 1). \quad (101)$$
The distribution at $t$ is given by
$$p_t(x) = \mathcal{N}(x\,|\,0, v_t), \quad (102)$$
with $v_t$ given by Eq. (101).

D.1. Likelihood evaluation

Suppose that the estimated score $s(x,t)$ is given by
$$s(x,t) = \partial_x\log q_t(x) = -\frac{x}{v_t}(1+\epsilon), \quad (103)$$
where $\epsilon\in\mathbb{R}$ is a constant. The parameter $\epsilon$ quantifies the deviation from the ideal estimation; when $\epsilon=0$ we can recover the original data distribution perfectly. If $\epsilon\neq 0$, the likelihood $q_0(x_0)$ depends nontrivially on the parameter $h$. The force of the reverse process is written as
$$\widetilde f_{h,\epsilon}(x) = f(x) - \frac{1+h}{2}g^2 s(x,t) = -\frac{\beta}{2}x + \frac{(1+h)\beta}{2}\,\frac{x}{v_t}(1+\epsilon) = \frac{\beta}{2}\left[\frac{(1+h)(1+\epsilon)}{v_t} - 1\right]x. \quad (104)$$
Let us denote the variance of the model distribution $q^h_t(x)$ by $v'_t$, which differs from $v_t$ when $\epsilon\neq 0$. The variance $v'_t$ obeys
$$dv'_t = \beta\left[\frac{(1+h)(1+\epsilon)}{v_t} - 1\right]v'_t\,dt - h\beta\,dt. \quad (105)$$
If we solve this with the boundary condition $v'_T = v_T$, we have
$$v'_t = \frac{1}{(1+h)(1+\epsilon)-1}\left[h\,v_t + \epsilon(1+h)\,e^{\beta(T-t)}\,v_T\,e^{-\beta(T-t)(1+h)(1+\epsilon)}\left(\frac{v_t}{v_T}\right)^{(1+h)(1+\epsilon)}\right]. \quad (106)$$
Note that, when $\epsilon=0$, we have $v'_t = v_t$, independent of $h$. For nonzero $\epsilon$, $v'_0$ depends on $h$ nontrivially. The model distribution at $t$ is given by
$$q^h_t(x) = \mathcal{N}(x\,|\,0, v'_t). \quad (107)$$
The NLL can be expressed as
$$-\mathbb{E}_{p_0(x_0)}\big[\log q^h_0(x_0)\big] = -\int dx\,\mathcal{N}(x\,|\,0,v_0)\,\log\mathcal{N}(x\,|\,0,v'_{t=0}) = \frac12\left[\log 2\pi v'_{t=0} + \frac{v_0}{v'_{t=0}}\right]. \quad (108)$$
As another measure, we can compute the 2-Wasserstein distance,
$$W_2(p_0, q^h_0)^2 = \big(\sqrt{v_0} - \sqrt{v'_{t=0}}\big)^2. \quad (109)$$
Let us check that the formula for the likelihood is satisfied. We shall check that
$$\log q^h_0(x_0) - \log q^h_T\big(x_T(x_0)\big) = \int_0^T \partial_x\left[\widetilde f^{\rm PF}(x_t) - \frac{h}{2}g^2\big(s(x_t,t) - \partial_x\ln q^h_t(x_t)\big)\right]dt, \quad (110)$$
where $\widetilde f^{\rm PF}(x) = \widetilde f_{h=0,\epsilon}(x)$. Noting that
$$x_T(x_0) = \sqrt{\frac{v'_T}{v'_0}}\,x_0, \quad (111)$$
the LHS can be computed to give
$$\frac12\ln\frac{v'_T}{v'_0}. \quad (112)$$
The integrand on the RHS reads
$$\partial_x\left[\widetilde f^{\rm PF}(x) - \frac{h}{2}g^2\big(s(x,t) - \partial_x\ln q^h_t(x)\big)\right] = \frac{\beta}{2}\left[\frac{(1+h)(1+\epsilon)}{v_t} - 1 - \frac{h}{v'_t}\right]. \quad (113)$$
The integration can be performed analytically, and we can check that the LHS indeed coincides with the RHS.

E. Setting of the pretraining on 2d synthetic data

We use synthetic data, shown in Fig. 5 and similar to the data used in (Petzka et al., 2018), to train our model. Swiss-roll data are generated by sklearn.datasets.make_swiss_roll (Pedregosa et al., 2011) with noise = 0.5, hole = False. These data are 3-dimensional, so we project them to 2 dimensions by using the [0, 2] axes. After getting the data, we normalize them by their std ($\approx 6.865$) to obtain data with std = 1. The 25-Gaussian data are generated by a mixture of Gaussians: we first generate Gaussian distributions with means in $\{-4,-2,0,2,4\}^2$ and std = 0.05, and again divide each sample vector component by its std ($\approx 2.828$) to obtain data with std = 1.

Figure 5. (Left) 25-Gaussian data (3,000 samples), and (Right) Swiss-roll data (3,000 samples).

E.2. Training

We used JAX (Bradbury et al., 2018) and Flax (Heek et al., 2023) to implement our score-based models with neural networks, and Optax (DeepMind et al., 2020) for the training. As written in the main part of this paper, we use a simple neural network,
$$(x,t) \to \mathrm{concat}([x,t]) \to \big[\mathrm{Dense}(128) \to \mathrm{swish}\big]\times 3 \to \mathrm{Dense}(2) \to s_\theta(x,t),$$
with the default initialization, both for training with Swiss-roll data and with 25-Gaussian data. The loss function is calculated by Monte-Carlo sampling in each training step:

1. First, we divide the time interval $[T_{\min}, T_{\max}]$ into 1,000 equal parts in advance to obtain a discretized diffusion-time array.

2. We take $N_{\rm batch} = 512$ data points $\{x_i\}_{i=1,2,\dots,N_{\rm batch}}$, and for each data point we take a discretized diffusion time $t_i$ uniformly from the array discretized in the first step.

3. We compute the signal $\alpha(t_i)$ and noise $\sigma(t_i)$ at time $t_i$ from the definition of the SDE, and take the following quantity as the loss function,
$$\left\| \frac{x_{i,t} - \alpha(t_i)x_i}{\sigma(t_i)^2} + s_\theta(x_{i,t}, t_i)\right\|^2, \quad (114)$$
where $x_{i,t} \sim \mathcal{N}(\alpha(t_i)x_i, \sigma(t_i)^2 I)$ is the Monte-Carlo sample.

The signal and noise functions $\alpha(t)$ and $\sigma(t)$ are
$$\text{SIMPLE:}\ \ \alpha(t) = e^{-\frac{\beta}{4}t^2}, \ \ \sigma(t)^2 = 1 - e^{-\frac{\beta}{2}t^2}; \qquad \text{COSINE:}\ \ \alpha(t) = \cos\Big(\frac{\pi}{2}t\Big), \ \ \sigma(t)^2 = 1 - \cos^2\Big(\frac{\pi}{2}t\Big). \quad (115)$$
These expressions are derived from a general argument about SDEs.
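The SIMPLE schedule in (115) can be checked numerically against the general signal/noise relation recalled next in Eqs. (116)-(118). The sketch below does this; the variance-preserving identification $f(t) = -\tfrac{\beta}{2}t$, $g(t)^2 = \beta t$ is an assumption made here purely for illustration (it is one choice that reproduces $\alpha(t) = e^{-\beta t^2/4}$), and is not a statement about the paper's exact implementation.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the SIMPLE schedule in (115) against the general relation
#   alpha(t) = exp(int_0^t f),  sigma(t)^2 = alpha(t)^2 * int_0^t g^2 / alpha^2   (Eq. (118))
# under the assumed variance-preserving choice f(t) = -(beta/2) t, g(t)^2 = beta t.
beta = 20.0
f = lambda t: -0.5 * beta * t
g2 = lambda t: beta * t

def alpha(t):
    val, _ = quad(f, 0.0, t)
    return np.exp(val)

def sigma2(t):
    integrand = lambda xi: g2(xi) / alpha(xi) ** 2
    val, _ = quad(integrand, 0.0, t)
    return alpha(t) ** 2 * val

for t in [0.1, 0.5, 1.0]:
    a_closed = np.exp(-beta * t ** 2 / 4.0)            # SIMPLE alpha(t)
    s2_closed = 1.0 - np.exp(-beta * t ** 2 / 2.0)     # SIMPLE sigma(t)^2
    print(t, alpha(t), a_closed, sigma2(t), s2_closed)
```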
In general, once the SDE
$$dx_t = f(t)\,x_t\,dt + g(t)\,dw_t \quad (116)$$
is given, the conditional probability from time $0$ to $t$ is given by
$$p_{t|0}(x_t|x_0) = \mathcal{N}\big(x_t\,|\,\alpha(t)x_0, \sigma(t)^2 I\big), \quad (117)$$
$$\alpha(t) = e^{\int_0^t f(\xi)\,d\xi}, \qquad \sigma(t)^2 = \alpha(t)^2\int_0^t \frac{g(\xi)^2}{\alpha(\xi)^2}\,d\xi, \quad (118)$$
which is essentially equivalent to the formula in (Karras et al., 2022).

In training, 3,000 data points are taken in advance, and stochastic gradients are computed from the Monte-Carlo loss function (114) with mini-batches of size 512. The gradients are used to optimize the neural network $s_\theta(x_t,t)$ with the Adam optimizer, with learning rate 1e-3 and the default values of Optax (DeepMind et al., 2020). We train our models for 16,000 epochs.

E.3. Inference

Here we show some instances of data generated by our pretrained models in Figs. 6 to 9. The figures are plotted at identical points in discretized time, $t = 1, 0.18, 0.1, 0.01$.

Figure 6. SIMPLE-SDE pretrained model on Swiss-roll data with (Left) probability flow ODE ($h=0$), (Right) SDE (9) with $h=0.2$.

Figure 7. COSINE-SDE pretrained model on Swiss-roll data with (Left) probability flow ODE ($h=0$), (Right) SDE (9) with $h=0.2$.

Figure 8. SIMPLE-SDE pretrained model on 25-Gaussian data with (Left) probability flow ODE ($h=0$), (Right) SDE (9) with $h=0.2$.

Figure 9. COSINE-SDE pretrained model on 25-Gaussian data with (Left) probability flow ODE ($h=0$), (Right) SDE (9) with $h=0.2$.

F. Derivative numerical error estimations in Sec. 5.2

F.1. Estimation of the local errors (21)

For simplicity, we omit the $h=0$ superscript here. In this notation, the error of $\log q_t(x)$ and the local errors (21) are
$$\mathrm{err}\big(N[\log q_t(x)]\big) = \log q_t(x) - N[\log q_t(x)], \qquad \mathrm{err}\big(\hat\nabla N[\log q_t(x)]\big) = \nabla\log q_t(x) - \hat\nabla N[\log q_t(x)], \qquad \mathrm{err}\big(\hat\nabla^2 N[\log q_t(x)]\big) = \nabla^2\log q_t(x) - \hat\nabla^2 N[\log q_t(x)].$$
We use a third-party solver, scipy.integrate.solve_ivp (Virtanen et al., 2020), which has inputs atol and rtol that control the errors in the subroutine; to make this explicit, let us denote the numerical value by $N_{\rm atol,rtol}[\dots]$. It is plausible to expect that the orders of the errors are the same between numerical calculations based on the same order of tolerances, say
$$O\Big(\mathrm{err}\big(\hat\nabla^{0,1,2} N_{(1.1\,\mathrm{atol}),(1.1\,\mathrm{rtol})}[\log q_t(x)]\big)\Big) = O\Big(\mathrm{err}\big(\hat\nabla^{0,1,2} N_{\mathrm{atol},\mathrm{rtol}}[\log q_t(x)]\big)\Big), \quad (120)$$
where $O$ means an order estimate. Based on this observation, we can estimate each error. For example,
$$N_{(1.1\,\mathrm{atol}),(1.1\,\mathrm{rtol})}[\log q_t(x)] - N_{\mathrm{atol},\mathrm{rtol}}[\log q_t(x)] = \mathrm{err}\big(N_{\mathrm{atol},\mathrm{rtol}}[\log q_t(x)]\big) - \mathrm{err}\big(N_{(1.1\,\mathrm{atol}),(1.1\,\mathrm{rtol})}[\log q_t(x)]\big) = O\Big(\mathrm{err}\big(N_{\mathrm{atol},\mathrm{rtol}}[\log q_t(x)]\big)\Big). \quad (121)$$

F.2. subtraction: Estimation of the local errors (21) by subtraction

For the order estimation of the differential operators, we have two choices. The first choice is to estimate them simply by
$$\mathrm{err}\big(\hat\nabla^{1\,\text{or}\,2} N[\log q_t(x)]\big) \simeq \hat\nabla^{1\,\text{or}\,2} N_{(1.1\,\mathrm{atol}),(1.1\,\mathrm{rtol})}[\log q_t(x)] - \hat\nabla^{1\,\text{or}\,2} N_{\mathrm{atol},\mathrm{rtol}}[\log q_t(x)]. \quad (122)$$
We name the error estimation scheme based on this "subtraction". This method is straightforward; however, it doubles the computation time, so we introduce a more time-efficient estimation in the next subsection.
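The "subtraction" scheme (122) amounts to running the likelihood solver twice with tolerances scaled by 1.1 and comparing the finite-difference derivatives. The sketch below illustrates this on the one-dimensional Gaussian toy of Appendix D with an exact score; the helpers `numerical_log_q` and `grad_hat` are hypothetical names introduced only for this example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the "subtraction" error estimate (122) for the finite-difference gradient
# of a numerically computed log q_t. Toy 1D Gaussian setting with an exact score.
beta, v0, T = 1.0, 4.0, 2.0
v = lambda t: 1.0 + np.exp(-beta * t) * (v0 - 1.0)
f_pf = lambda x, t: -0.5 * beta * x + 0.5 * beta * x / v(t)   # f - (g^2/2) s, exact score
div_f_pf = lambda t: -0.5 * beta + 0.5 * beta / v(t)

def numerical_log_q(x, t, atol, rtol):
    """N_{atol,rtol}[log q_t(x)]: integrate the PF ODE and its divergence from t to T,
    then add the log prior N(0, v_T) at the terminal point."""
    rhs = lambda s, y: [f_pf(y[0], s), div_f_pf(s)]
    sol = solve_ivp(rhs, (t, T), [x, 0.0], atol=atol, rtol=rtol)
    xT, J = sol.y[0, -1], sol.y[1, -1]
    return -0.5 * np.log(2 * np.pi * v(T)) - 0.5 * xT ** 2 / v(T) + J

def grad_hat(x, t, dx, atol, rtol):
    """Symmetric finite difference of N[log q_t(x)]."""
    return (numerical_log_q(x + dx, t, atol, rtol)
            - numerical_log_q(x - dx, t, atol, rtol)) / (2 * dx)

x, t, dx, atol, rtol = 0.7, 0.1, 0.01, 1e-6, 1e-6
g1 = grad_hat(x, t, dx, atol, rtol)
g2 = grad_hat(x, t, dx, 1.1 * atol, 1.1 * rtol)
print("grad_hat            :", g1)
print("subtraction estimate:", abs(g2 - g1))          # Eq. (122)
print("true gradient       :", -x / v(t))             # closed form for the toy
```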
F.3. model: Estimation of the local errors (21) by Taylor expansion and score models

The second choice is based simply on the error decomposition
$$N[\log q_t(x)] = \log q_t(x) + \mathrm{err}\big(N[\log q_t(x)]\big). \quad (123)$$
By applying the discrete differential (19), we get
$$\hat\nabla N[\log q_t(x)] = \begin{bmatrix} \dfrac{N[\log q_t(x + [\Delta x, 0]^\top)] - N[\log q_t(x - [\Delta x, 0]^\top)]}{2\Delta x} \\[2ex] \dfrac{N[\log q_t(x + [0, \Delta x]^\top)] - N[\log q_t(x - [0, \Delta x]^\top)]}{2\Delta x} \end{bmatrix}. \quad (124)$$
It is sufficient to show the $x$ component:
$$\hat\nabla_x N[\log q_t(x)] = \frac{N[\log q_t(x + [\Delta x,0]^\top)] - N[\log q_t(x - [\Delta x,0]^\top)]}{2\Delta x} = \frac{\log q_t(x + [\Delta x,0]^\top) - \log q_t(x - [\Delta x,0]^\top)}{2\Delta x} + \frac{\mathrm{err}\big(N[\log q_t(x + [\Delta x,0]^\top)]\big) - \mathrm{err}\big(N[\log q_t(x - [\Delta x,0]^\top)]\big)}{2\Delta x} = \frac{\partial_x \log q_t(x)\cdot 2\Delta x + O(\Delta x^3)}{2\Delta x} + O\big(\mathrm{err}(N[\log q_t(x)])\big) = \partial_x\log q_t(x) + O(\Delta x^2) + O\big(\mathrm{err}(N[\log q_t(x)])\big), \quad (125)$$
where, in the third line, we use the facts that 1) the $O(\Delta x^2)$ term vanishes from the numerator thanks to the symmetric discrete differential, and 2) the orders of $\mathrm{err}(N[\log q_t(x)])$ and $\mathrm{err}(N[\log q_t(x \pm [\Delta x, 0]^\top)])$ are the same; the latter is guaranteed by the Runge-Kutta integrator. Therefore, by subtracting the true value from both sides, we get
$$\mathrm{err}\big(\hat\nabla N[\log q_t(x)]\big) = \nabla\log q_t(x) - \hat\nabla N[\log q_t(x)] = O(\Delta x^2) + O\big(\mathrm{err}(N[\log q_t(x)])\big). \quad (126)$$
We can derive the error for $\hat\nabla^2$ similarly:
$$\hat\nabla^2 N[\log q_t(x)] = \frac{N[\log q_t(x+[\Delta x,0]^\top)] + N[\log q_t(x-[\Delta x,0]^\top)] + N[\log q_t(x+[0,\Delta x]^\top)] + N[\log q_t(x-[0,\Delta x]^\top)] - 4N[\log q_t(x)]}{\Delta x^2} = \frac{\log q_t(x+[\Delta x,0]^\top) + \log q_t(x-[\Delta x,0]^\top) + \log q_t(x+[0,\Delta x]^\top) + \log q_t(x-[0,\Delta x]^\top) - 4\log q_t(x)}{\Delta x^2} + \frac{\mathrm{err}\big(N[\log q_t(x+[\Delta x,0]^\top)]\big) + \mathrm{err}\big(N[\log q_t(x-[\Delta x,0]^\top)]\big) + \mathrm{err}\big(N[\log q_t(x+[0,\Delta x]^\top)]\big) + \mathrm{err}\big(N[\log q_t(x-[0,\Delta x]^\top)]\big) - 4\,\mathrm{err}\big(N[\log q_t(x)]\big)}{\Delta x^2} = \nabla^2\log q_t(x) + O(\Delta x^2) + O\big(\mathrm{err}(N[\log q_t(x)])\big), \quad (127)$$
where we use that the odd-order terms in the numerator vanish because we use the 5-point approximation of the Laplacian. Now we get
$$\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x)]\big) = \nabla^2\log q_t(x) - \hat\nabla^2 N[\log q_t(x)] = O(\Delta x^2) + O\big(\mathrm{err}(N[\log q_t(x)])\big). \quad (128)$$
In summary, we conclude that each error behaves as
$$\mathrm{err}\big(\hat\nabla N[\log q_t(x)]\big) = C^{(1)}_t(x)\,\Delta x^2 + O\big(\mathrm{err}(N[\log q_t(x)])\big)\,\mathbf{1}, \qquad \mathrm{err}\big(\hat\nabla^2 N[\log q_t(x)]\big) = C^{(2)}_t(x)\,\Delta x^2 + O\big(\mathrm{err}(N[\log q_t(x)])\big), \quad (129)$$
where $\mathbf{1}$ is the vector with all components equal to 1, and $C^{(1)}_t(x)$, $C^{(2)}_t(x)$ are determined by the remainder terms of the Taylor expansion. Typically, these are approximated by
$$C^{(1)}_t(x) \sim \text{3rd derivative of }\log q_t(x), \qquad C^{(2)}_t(x) \sim \text{4th derivative of }\log q_t(x). \quad (130)$$
Of course, we do not have access to the functions in (130); however, it is plausible to regard
$$O\big(\nabla\log q_t(x)\big) = O\big(s_\theta(x,t)\big), \quad (131)$$
because the objective of diffusion-model training is to achieve $s_\theta(x,t) \approx \nabla\log p_t(x)$, and in the ideal case $q_t = p_t$. By using this assumption, we regard
$$C^{(1)}_t(x) \approx \nabla\big(\nabla\cdot s_\theta(x,t)\big), \qquad C^{(2)}_t(x) \approx \nabla^2\big(\nabla\cdot s_\theta(x,t)\big), \quad (132)$$
and estimate the local errors by
$$\mathrm{err}\big(\hat\nabla N[\log q_t(x)]\big) \approx \big|\nabla\big(\nabla\cdot s_\theta(x,t)\big)\,\Delta x^2\big| + O\big(\mathrm{err}(N[\log q_t(x)])\big)\,\mathbf{1}, \qquad \mathrm{err}\big(\hat\nabla^2 N[\log q_t(x)]\big) \approx \big|\nabla^2\big(\nabla\cdot s_\theta(x,t)\big)\,\Delta x^2\big| + O\big(\mathrm{err}(N[\log q_t(x)])\big), \quad (133)$$
where $O(\mathrm{err}(N[\log q_t(x)]))$ can be estimated by using (121). We name the error estimation scheme based on this "model".
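The "model" estimate (132)-(133) only requires evaluating derivatives of $\nabla\cdot s_\theta$, which can themselves be obtained by central differences of the score model. The sketch below illustrates this; the two-Gaussian mixture score is an illustrative stand-in for a trained network, and `tol` stands in for the estimated $\mathrm{err}(N[\log q_t(x)])$.

```python
import numpy as np

# Sketch of the "model" local-error estimate (132)-(133): approximate C^(1), C^(2) by
# derivatives of div s_theta via central differences, then multiply by dx^2 and add tol.
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])
var = 0.25

def s_theta(x, t):
    # Analytic score of 0.5 N(mu_1, var I) + 0.5 N(mu_2, var I); t is unused here.
    w = np.array([np.exp(-np.sum((x - m) ** 2) / (2 * var)) for m in mu])
    w = w / w.sum()
    return sum(wi * (-(x - m) / var) for wi, m in zip(w, mu))

def div_s(x, t, eps=1e-4):
    e = np.eye(2) * eps
    return sum((s_theta(x + e[i], t)[i] - s_theta(x - e[i], t)[i]) / (2 * eps) for i in range(2))

def grad_div_s(x, t, eps=1e-4):
    e = np.eye(2) * eps
    return np.array([(div_s(x + e[i], t) - div_s(x - e[i], t)) / (2 * eps) for i in range(2)])

def lap_div_s(x, t, eps=1e-4):
    e = np.eye(2) * eps
    return sum((div_s(x + e[i], t) + div_s(x - e[i], t) - 2 * div_s(x, t)) / eps ** 2
               for i in range(2))

x, t, dx, tol = np.array([0.3, -0.2]), 0.5, 0.01, 1e-5
err_grad = np.abs(grad_div_s(x, t)) * dx ** 2 + tol          # first estimate in (133)
err_lap = abs(lap_div_s(x, t)) * dx ** 2 + tol               # second estimate in (133)
print("estimated err of grad-hat:", err_grad)
print("estimated err of lap-hat :", err_lap)
```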
F.4. Integral of the local errors

Now we go back to Theorem 4.2 to explain our error estimation for the $O(h)$ correction to the (negative) log-likelihood. To calculate the $O(h)$ terms, we consider the paired ODE
$$\dot x_t = f^{\rm PF}_\theta(x_t,t), \qquad \dot{\delta x}_t = \delta f^{\rm PF}_\theta(x_t,\delta x_t,t)\ \ (\text{by }\hat\nabla N[\log q_t(x_t)]), \qquad \dot{\delta\log q}_t = \nabla\cdot\delta f^{\rm PF}_\theta(x_t,\delta x_t,t)\ \ (\text{by }\hat\nabla^2 N[\log q_t(x_t)]), \quad (134)$$
as we wrote in Algorithm 1. In this expression, there is no numerical error in the RHS of $\dot x_t$, except for the float32 precision that is the default of the deep-learning framework. In the RHS of $\dot{\delta x}_t$, we have the discretization of the differential $\hat\nabla$ and the numerical integration for $\log q^{h=0}_t(x_t)$; as noted before, we omit $h=0$ and write $\log q_t(x_t)$.

By using the error estimation (133), we rewrite the ODE for $\delta x_t$ as
$$\dot{\delta x}_t = (\delta x_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\Big(s_\theta(x_t,t) - \underbrace{\hat\nabla N[\log q_t(x_t)]}_{\nabla\log q_t(x_t) + \mathrm{err}(\hat\nabla N[\log q_t(x_t)])}\Big) = (\delta x_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big(s_\theta(x_t,t) - \nabla\log q_t(x_t)\big) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big). \quad (135)$$
When we apply the solver, it essentially discretizes the system as
$$\delta x_{t+\epsilon} = \delta x_t + \epsilon\left[(\delta x_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big(s_\theta(x_t,t) - \nabla\log q_t(x_t)\big) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big)\right] + \dots, \quad (136)$$
truncates the $\dots$ terms at a certain finite order, and applies this recursion starting from $\delta x_0$. For simplicity, we consider only the linear term here and split $\delta x_t = \delta x^{\rm true}_t + \mathrm{err}^{(1)}_t$; then
$$\delta x^{\rm true}_{t+\epsilon} + \mathrm{err}^{(1)}_{t+\epsilon} \simeq \left[\delta x^{\rm true}_t + \epsilon\Big((\delta x^{\rm true}_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big(s_\theta(x_t,t)-\nabla\log q_t(x_t)\big)\Big)\right] + \left[\mathrm{err}^{(1)}_t + \epsilon\Big((\mathrm{err}^{(1)}_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big)\Big)\right], \quad (137)$$
and we get the ODE for the time-$t$ error by taking the continuous-time limit:
$$\dot{\mathrm{err}}^{(1)}_t = (\mathrm{err}^{(1)}_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big). \quad (138)$$
Note that $\mathrm{err}^{(1)}_t$ is a vector with the same dimension as $x_t$, $\delta x_t$, $s_\theta$, and $f^{\rm PF}_\theta$. In the same way, we can obtain the ODE for the $\delta\log q_t$ term. The discrete ODE is
$$\delta\log q_{t+\epsilon} = \delta\log q_t + \epsilon\left[(\delta x_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big(\nabla\cdot s_\theta(x_t,t) - \nabla^2\log q_t(x_t)\big) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x_t)]\big)\right] + \dots, \quad (139)$$
and, similarly, by splitting the true value and the error value, we get
$$\delta\log q^{\rm true}_{t+\epsilon} + \mathrm{err}^{(2)}_{t+\epsilon} \simeq \left[\delta\log q^{\rm true}_t + \epsilon\Big((\delta x^{\rm true}_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t) - \frac{g(t)^2}{2}\big(\nabla\cdot s_\theta(x_t,t) - \nabla^2\log q_t(x_t)\big)\Big)\right] + \left[\mathrm{err}^{(2)}_t + \epsilon\Big((\mathrm{err}^{(1)}_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x_t)]\big)\Big)\right], \quad (140)$$
which gives the ODE for $\mathrm{err}^{(2)}_t$ as
$$\dot{\mathrm{err}}^{(2)}_t = (\mathrm{err}^{(1)}_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x_t)]\big). \quad (141)$$
In summary, we get the ODE system
$$\dot x_t = f^{\rm PF}_\theta(x_t,t), \quad \dot{\delta x}_t = \delta f^{\rm PF}_\theta(x_t,\delta x_t,t), \quad \dot{\delta\log q}_t = \nabla\cdot\delta f^{\rm PF}_\theta(x_t,\delta x_t,t), \quad \dot{\mathrm{err}}^{(1)}_t = (\mathrm{err}^{(1)}_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big), \quad \dot{\mathrm{err}}^{(2)}_t = (\mathrm{err}^{(1)}_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t) + \frac{g(t)^2}{2}\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x_t)]\big). \quad (142)$$
The ERRORS in Table 1 are calculated by solving the more conservative (over-estimated) ODEs,
$$\dot{\mathrm{err}}^{(1)}_t = \big|(\mathrm{err}^{(1)}_t\cdot\nabla)f^{\rm PF}_\theta(x_t,t)\big| + \frac{g(t)^2}{2}\big|\mathrm{err}\big(\hat\nabla N[\log q_t(x_t)]\big)\big|, \qquad \dot{\mathrm{err}}^{(2)}_t = \big|(\mathrm{err}^{(1)}_t\cdot\nabla)\,\nabla\cdot f^{\rm PF}_\theta(x_t,t)\big| + \frac{g(t)^2}{2}\big|\mathrm{err}\big(\hat\nabla^2 N[\log q_t(x_t)]\big)\big|, \quad (143)$$
by using the approximations (133), and the final error value for $\log q_T$ is
$$\big|\mathrm{err}^{(1)}_T\cdot\nabla\log\pi(x_T)\big| + \big|\mathrm{err}^{(2)}_T\big|. \quad (144)$$

F.5. Visualization of local error estimates

Of course, the size of the final error (144) strongly depends on the order of the local errors $\mathrm{err}(\hat\nabla N[\log q_t(x_t)])$ and $\mathrm{err}(\hat\nabla^2 N[\log q_t(x_t)])$. To see their order, we plot the $\log_{10}$ scale of the local error functions in Fig. 10. If we believe that $\mathrm{err}(N[\log q_t(x)])$ is suppressed within the small tolerance value, which we take as $10^{-5}$ for the Runge-Kutta algorithm, the order of the local errors depends on the coefficients of $\Delta x^2$. To check these values, we plot the mean and std of $\nabla(\nabla\cdot s_\theta(x,t))$ and $\nabla^2(\nabla\cdot s_\theta(x,t))$ over 500 points $x \sim \mathrm{Uniform}\big(\min(\text{validation set}) - 0.1,\ \max(\text{validation set}) + 0.1\big)$ at each time $t$ in Figs. 11 to 14. Simultaneously, we plot colored contours that correspond to the values of the local error estimates based on (132) with $\Delta x = 0.01$, which are exactly the values on the dashed line in Fig. 10. From these figures, one can see that the estimated local errors are mostly located in the safe region, i.e., the errors are negative on the log scale. As one can see in the 25-Gaussian case, Figs. 13 and 14, the maximum values are slightly inside the red regions; we should be careful about this, but we leave further study as future work.
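To make the error propagation (143)-(144) concrete, the following is a minimal sketch of how the conservative error ODEs can be integrated alongside the trajectory. It is not the paper's Algorithm 1: the linear drift, the constant local-error values `loc1` and `loc2` standing in for (133), and the prior choice are all illustrative placeholders.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the conservative error propagation (143)-(144) in 1D: integrate err1, err2
# alongside x_t and evaluate |err1_T * d/dx log pi(x_T)| + |err2_T|.
beta, T = 1.0, 2.0
v = lambda t: 1.0 + np.exp(-beta * t) * 3.0                  # forward variance, v_0 = 4
f_pf = lambda x, t: -0.5 * beta * x + 0.5 * beta * x / v(t)  # PF-ODE drift (exact score)
dx_f_pf = lambda t: -0.5 * beta + 0.5 * beta / v(t)          # d f_pf / dx
dx_div_f_pf = lambda t: 0.0                                  # div f_pf is x-independent here
g2 = lambda t: beta
loc1, loc2 = 1e-4, 1e-4                                      # stand-ins for err(grad-hat), err(lap-hat)

def rhs(t, y):
    x, err1, err2 = y
    derr1 = abs(err1 * dx_f_pf(t)) + 0.5 * g2(t) * loc1      # first line of (143)
    derr2 = abs(err1 * dx_div_f_pf(t)) + 0.5 * g2(t) * loc2  # second line of (143)
    return [f_pf(x, t), derr1, derr2]

x0 = 1.3
sol = solve_ivp(rhs, (0.0, T), [x0, 0.0, 0.0], rtol=1e-8, atol=1e-8)
xT, err1_T, err2_T = sol.y[:, -1]
grad_log_prior = -xT / v(T)                                  # prior pi = N(0, v_T)
print("final error bound (144):", abs(err1_T * grad_log_prior) + abs(err2_T))
```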
Figure 10. The order of the local errors. (Left) Contour plot corresponding to $\mathrm{err}(\hat\nabla N[\log q_t(x_t)])$; panel title $\log_{10}(C^{(1)}\Delta x^2 + 10^{-5})$. (Right) Contour plot corresponding to $\mathrm{err}(\hat\nabla^2 N[\log q_t(x_t)])$, based on (132); panel title $\log_{10}(C^{(2)}\Delta x^2 + 10^{-5})$. The blue region has negative power and is relatively safe; the red region has positive power and is dangerous. The dotted line corresponds to $\Delta x = 0.01$, the value used in Table 1. The value $10^{-5}$ in the second-term numerator is the typical tolerance value of the 0th-logq solver in Algorithm 1, corresponding to $\mathrm{err}(N[\log q_t(x)])$.

Figure 11. Local error mean/std plot of $|\nabla(\nabla\cdot s_\theta(x,t))|$ and $|\nabla^2(\nabla\cdot s_\theta(x,t))|$ for the pretrained model with SIMPLE-SDE trained on Swiss-roll data.

Figure 12. Local error mean/std plot for the pretrained model with COSINE-SDE trained on Swiss-roll data.

Figure 13. Local error mean/std plot for the pretrained model with SIMPLE-SDE trained on 25-Gaussian data.

Figure 14. Local error mean/std plot for the pretrained model with COSINE-SDE trained on 25-Gaussian data.