# Finite-Time Analysis of Discrete-Time Stochastic Interpolants

Yuhao Liu, Yu Chen, Rui Hu, Longbo Huang

**Abstract.** The stochastic interpolant framework offers a powerful approach for constructing generative models based on ordinary differential equations (ODEs) or stochastic differential equations (SDEs) that transform between arbitrary data distributions. However, prior analyses of this framework have primarily focused on the continuous-time setting, assuming a perfect solution of the underlying equations. In this work, we present the first discrete-time analysis of the stochastic interpolant framework: we introduce a discrete-time sampler and derive a finite-time upper bound on its distribution estimation error. Our result provides a novel quantification of how different factors, including the distance between the source and target distributions and the estimation accuracy, affect the convergence rate, and it also offers a principled way to design efficient schedules for accelerating convergence. Finally, numerical experiments on the discrete-time sampler corroborate our theoretical findings.

1. Introduction

Stochastic interpolants (Albergo & Vanden-Eijnden, 2023; Albergo et al., 2023) provide a general framework for constructing continuous mappings between arbitrary distributions. The framework draws inspiration from flow-based and diffusion-based models, which generate samples by continuously transforming data points from a base distribution to a target distribution via learned ordinary differential equations (ODEs) or stochastic differential equations (SDEs). Within the stochastic interpolant framework, one obtains learnable ODEs or SDEs that transport data by defining an interpolation between data points sampled from different distributions.
IIIS, Tsinghua University, Beijing, China. Correspondence to: Longbo Huang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

This framework offers significant design flexibility and has demonstrated promising results in various applications, including probabilistic forecasting (Chen et al., 2024b), image generation (Ma et al., 2024; Albergo et al., 2024), and sequential modeling (Chen et al., 2024a).

Despite its potential in real-world applications, there remains a gap between the theoretical analyses and practical implementations of stochastic interpolants. In practical scenarios, instead of perfectly solving the underlying equations, one can only access a learned estimator at a finite number of time steps, which necessitates the use of discrete-time samplers to simulate the true continuous generation process. However, previous analyses have largely focused on continuous-time generation, assuming perfect solutions to the underlying equations. This leads to a crucial question for bridging the gap: What is the convergence rate of the discrete-time stochastic interpolant, and how can its performance be enhanced algorithmically?

Similar problems have been studied in the theory of diffusion models, and most results were derived using Girsanov-based methods in SDE analysis, which reduce the problem to providing upper bounds on the discretization errors. Specifically, existing analyses of the discretization errors fall mainly into two types. The first type partitions the error into space discretization and time discretization. Among them, Lee et al. (2022) and Chen et al. (2023d) assume a uniform Lipschitz constant on the score function, while Chen et al. (2023a) do not; all of them utilize the Markovian property and the Gaussian form of the diffusion process to obtain their results. The second type uses Itô calculus to obtain upper bounds, such as Benton et al.
(2024), who adapt existing results from stochastic localization to produce tight bounds by establishing an equivalence between the two methods. However, these ideas do not apply to stochastic interpolants due to a key difference: the stochastic interpolant framework considered here has a distinct structure, introduced by a random interpolation between two distributions rather than a linear combination of one distribution with a Gaussian, which destroys the Markovian property. This difference not only necessitates a novel analysis of the discretization errors of the score function, but also requires additional analysis to bound the discretization error of the velocity function, which arises from the general interpolation function introduced in this context.

To tackle the above challenges, in this work we offer the first finite-time convergence bound in Kullback-Leibler (KL) divergence for the SDE-based generative model within the stochastic interpolant framework. Our result presents a novel analysis building on existing Girsanov-based techniques. In the analysis of the discretization errors, one key highlight of our approach is modeling the evolution of the discretized terms via stochastic calculus. This allows us to decompose the discretization error into components linked to derivatives of conditional expectations. Notably, we leverage the Gaussian latent variable embedded within the stochastic interpolant, which enables the explicit representation of these derivatives as conditional expectations and hence provides a key solution to the challenges.

Our contributions. The main contributions of our paper are as follows. (i) This work presents the first finite-time convergence bound for the SDE-based generative model within the stochastic interpolant framework.
Specifically, we formulate the discrete-time sampler using the Euler-Maruyama scheme and derive a general error bound on the generative process, which notably does not require Lipschitz assumptions on the score or velocity functions. This setting aligns with recent analyses of score-based diffusion models that relax Lipschitz assumptions on the score functions, such as those by Chen et al. (2023a) and Benton et al. (2024). (ii) We propose a novel schedule of step sizes and rigorously bound the number of steps required to achieve an $\varepsilon^2$ KL error. In the specific case where the base distribution $\rho_0$ is Gaussian, so that our setting reduces to the standard diffusion model, our bound achieves the same order of dependence as that of Chen et al. (2023a). (iii) We implement the sampler with our proposed schedule and compare it against uniform step sizes. Our results validate the theoretical findings and demonstrate the superior performance of our schedule when no additional regularity conditions are assumed.

2. Related Works

Stochastic Interpolants. The concept of stochastic interpolants was introduced by Albergo & Vanden-Eijnden (2023), establishing a framework for constructing generative models based on continuous-time normalizing flows. Building upon this foundation, Albergo et al. (2023) extend the framework by incorporating Gaussian noise into the interpolant, effectively unifying flow-based and diffusion-based methods. Both Albergo & Vanden-Eijnden (2023) and Albergo et al. (2023) investigate the impact of using estimators instead of the true velocities in the equations. Following the stochastic interpolant framework, several works have focused on specific applications. Albergo et al. (2024) utilize the framework to develop novel data-coupling methods, addressing image generation tasks such as in-painting and super-resolution. Chen et al. (2024b) and Chen et al.
(2024a) adapt the conditional generation framework with stochastic interpolants to tackle future-state prediction and sequential modeling problems, respectively.

Convergence Analysis of Diffusion Models. Numerous results have been established on the convergence rates of diffusion models under various data assumptions (Bortoli, 2022; Lee et al., 2022). Notably, Chen et al. (2023d) employ an approximation argument to apply Girsanov's theorem in scenarios where the Novikov condition does not hold. Based on this analysis, Chen et al. (2023d) and Chen et al. (2023a) provide error bounds assuming Lipschitz score functions. Chen et al. (2023a) also develop a KL error bound without requiring Lipschitzness, leveraging early stopping. This bound is subsequently improved by Benton et al. (2024), who achieve an $\varepsilon^2$ KL error in $O(d \log^2(1/\delta))$ steps, exhibiting near-linear dependence on the dimension $d$. In addition to convergence-rate analysis, several works have focused on problems such as score approximation (Chen et al., 2023b), improved DDPM samplers (Liang et al., 2024; Li et al., 2024b;a), and ODE-based methods (Chen et al., 2023c).

3. Preliminaries: Stochastic Interpolants

In this paper, we consider continuous-time stochastic processes that bridge any two probability distributions in finite time. Formally, given two probability distributions $\rho_0$ and $\rho_1$ on $\mathbb{R}^d$, a stochastic interpolant (Albergo et al., 2023; Albergo & Vanden-Eijnden, 2023; Albergo et al., 2024) is a stochastic process defined as

$$x_t = I(t, x_0, x_1) + \gamma(t) z, \quad t \in [0, 1], \tag{1}$$

where $I$ is a twice continuously differentiable interpolation function satisfying the boundary conditions $I(0, x_0, x_1) = x_0$ and $I(1, x_0, x_1) = x_1$, and there exists a constant $C > 0$ such that for all $t \in [0,1]$ and $x_0, x_1 \in \mathbb{R}^d$,

$$\|\partial_t I(t, x_0, x_1)\| \le C \|x_0 - x_1\|. \tag{2}$$

Here $\gamma(t)$ is a time-dependent scale function with $\gamma(t)^2 \in C^2[0,1]$, $\gamma(0) = \gamma(1) = 0$, and $\gamma(t) > 0$ for $t \in (0, 1)$.
This definition implies that $\frac{d}{dt}\gamma^2(t)$ is bounded by a constant. The pair $(x_0, x_1)$ is drawn from a joint measure $\nu$ with marginals $\rho_0$ and $\rho_1$, i.e., $\rho_0(dx_0) = \nu(dx_0, \mathbb{R}^d)$ and $\rho_1(dx_1) = \nu(\mathbb{R}^d, dx_1)$, and $z \sim N(0, I_d)$ is a standard Gaussian variable independent of $(x_0, x_1)$.

In definition (1), $I(t, x_0, x_1)$ represents the interpolation component, while $\gamma(t) z$ introduces a Gaussian latent term crucial for the subsequent analysis. We denote the density of $x_t$ by $\rho(t, \cdot)$, or simply $\rho(t)$. By construction, the stochastic interpolant satisfies $\rho(0) = \rho_0$ and $\rho(1) = \rho_1$. The framework allows a wide range of interpolation functions $I(t, x_0, x_1)$ and scale functions $\gamma(t)$, offering significant design flexibility.

Stochastic interpolants provide a framework for generative modeling through stochastic differential equations. As shown by Albergo et al. (2023), when $\mathbb{E}_{(x_0,x_1)\sim\nu}\|\partial_t I(t,x_0,x_1)\|^4$ and $\mathbb{E}_{(x_0,x_1)\sim\nu}\|\partial_t^2 I\|^2$ are bounded, the solution to the following forward SDE

$$dX^F_t = b_F(t, X^F_t)\,dt + \sqrt{2\epsilon(t)}\,dW_t, \quad X^F_0 \sim \rho_0, \tag{3}$$

satisfies $X^F_t \sim \rho(t)$ for all $t \in [0,1]$. Here $\epsilon \in C[0,1]$ is a non-negative function, and the drift term $b_F$ is defined as

$$b_F(t,x) = b(t,x) + \epsilon(t)\,s(t,x), \quad b(t,x) = \mathbb{E}[\dot{x}_t \mid x_t = x] = \mathbb{E}[\partial_t I(t,x_0,x_1) + \dot\gamma(t) z \mid x_t = x], \quad s(t,x) = \nabla \log \rho(t,x) = -\gamma^{-1}(t)\,\mathbb{E}[z \mid x_t = x]. \tag{4}$$

In definition (4), $s(t,x)$ is the well-known score function, and $b(t,x)$ represents the mean velocity field of (1) (following the notation of Albergo et al. 2023). This implies that a process starting from $\rho_0$ and evolving according to the forward SDE (3) has density $\rho(t)$ at time $t$. Consequently, at time $t = 1$ the process has the desired target density $\rho_1$. This establishes a stochastic mapping from $\rho_0$ to $\rho_1$, providing a foundation for generative modeling. Following Albergo et al. (2023), we further introduce the velocity function $v(t,x)$ as

$$v(t,x) = \mathbb{E}[\partial_t I(t,x_0,x_1) \mid x_t = x] = b(t,x) + \gamma(t)\dot\gamma(t)\,s(t,x). \tag{5}$$
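To make the construction concrete, the following minimal sketch (our illustration, not from the paper) draws samples of $x_t = I(t,x_0,x_1) + \gamma(t)z$ with the linear interpolant $I = (1-t)x_0 + t x_1$, an independent coupling $\nu$, and $\gamma(t) = \sqrt{a\,t(1-t)}$; the function names and the two Gaussian marginals are our choices.

```python
import numpy as np

def gamma(t, a=2.0):
    # latent scale gamma(t) = sqrt(a * t * (1 - t)); vanishes at t = 0 and t = 1
    return np.sqrt(a * t * (1.0 - t))

def interpolant_sample(t, x0, x1, rng, a=2.0):
    """Draw x_t = I(t, x0, x1) + gamma(t) z for the linear interpolant
    I(t, x0, x1) = (1 - t) x0 + t x1 and independent z ~ N(0, I_d)."""
    z = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * x1 + gamma(t, a) * z

rng = np.random.default_rng(0)
d, n = 2, 100_000
x0 = rng.standard_normal((n, d))            # rho_0 = N(0, I_d)
x1 = rng.standard_normal((n, d)) + 5.0      # rho_1 = N(5, I_d); independent coupling nu
xt = interpolant_sample(0.5, x0, x1, rng)
print(np.round(xt.mean(axis=0), 1))         # ≈ [2.5 2.5]: the midpoint of the two means
```

At $t = 1/2$ each coordinate has mean $2.5$ and variance $(1/2)^2 + (1/2)^2 + \gamma(1/2)^2 = 1$, which the sample moments reproduce.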
Both $b(t,x) = v(t,x) - \gamma(t)\dot\gamma(t)s(t,x)$ and $b_F(t,x) = b(t,x) + \epsilon(t)s(t,x)$ are linear combinations of $v(t,x)$ and $s(t,x)$. The function $\epsilon(t)$ controls the level of randomness in the mapping from $\rho_0$ to $\rho_1$; when $\epsilon(t) \equiv 0$, the SDE reduces to an ODE. We assume $\epsilon(t) = \epsilon$ is a constant without loss of generality, similar to Albergo et al. (2023) and dos Santos Costa et al. (2024).

Connection with Diffusion Models. Consider the special case where $x_0$ and $x_1$ are independent, with $x_0 \sim N(0, I_d)$. Let $I(t, x_0, x_1) = (1-t)x_0 + t x_1$ and $\gamma(t) = \sqrt{2t(1-t)}$. Then the stochastic interpolant can be expressed as

$$x_t = t x_1 + \sqrt{1 - t^2}\, z,$$

where $z \sim N(0, I_d)$ is another standard Gaussian variable independent of $x_1$. Diffusion models (Song et al., 2021) employ the Ornstein-Uhlenbeck (OU) SDE

$$dY_s = -Y_s\,ds + \sqrt{2}\,dW_s, \quad Y_0 \sim p_{\mathrm{data}},$$

which gradually adds noise to the data distribution $p_{\mathrm{data}}$. Given $Y_0$, $Y_s$ can be written as

$$Y_s = e^{-s} Y_0 + \sqrt{1 - e^{-2s}}\, Z, \quad Z \sim N(0, I_d).$$

Therefore, if we choose $x_1 \sim p_{\mathrm{data}}$ in the stochastic interpolant, $Y_s$ and $x_{e^{-s}}$ have the same distribution.

Previous Theoretical Results. For SDE-based generative models, Albergo et al. (2023) provide the following KL error bound when an estimator $\hat b_F(t, X^F_t)$ is used instead of the true drift term $b_F(t, X^F_t)$ in the SDE. The bound, which can also be derived using Girsanov's theorem (Chen et al., 2023d; Oko et al., 2023), is

$$\mathrm{KL}(\rho(1)\,\|\,\hat\rho(1)) \le \frac{1}{4\epsilon}\int_0^1 \int_{\mathbb{R}^d} \|\hat b_F(t,x) - b_F(t,x)\|^2 \rho(t,x)\,dx\,dt.$$

This inequality establishes an upper bound on the distribution estimation error in terms of the error in estimating the drift function $b_F$, evaluated with respect to the true underlying density $\rho(t,x)$. The primary limitation of the existing results is the assumption that the SDEs can be solved exactly (e.g., Chen et al. 2024b; Albergo et al. 2023). However, obtaining exact solutions is often challenging in practice, leading to the use of discrete-time samplers for estimating these solutions.
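The diffusion-model connection above is easy to check empirically: under the time change $t = e^{-s}$, the OU marginal $Y_s$ and the interpolant marginal $x_t$ coincide. The sketch below (our illustration, with a toy one-dimensional $p_{\mathrm{data}}$) compares their first two moments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = 3.0 + 0.5 * rng.standard_normal(n)      # toy 1-d data distribution p_data
s = 0.7
t = np.exp(-s)                               # time change t = e^{-s}

# OU process at time s: Y_s = e^{-s} Y_0 + sqrt(1 - e^{-2s}) Z, Y_0 ~ p_data
ou = np.exp(-s) * x1 + np.sqrt(1 - np.exp(-2 * s)) * rng.standard_normal(n)

# interpolant with I = (1-t)x0 + t x1, gamma = sqrt(2t(1-t)), x0 ~ N(0, 1):
# x_t = t x1 + sqrt(1 - t^2) z
xt = t * x1 + np.sqrt(1 - t ** 2) * rng.standard_normal(n)

# the two samples should share mean and variance up to Monte Carlo noise
print(abs(ou.mean() - xt.mean()) < 0.02, abs(ou.var() - xt.var()) < 0.02)
```

Both empirical means are close to $t\,\mathbb{E}[x_1]$ and both variances to $t^2\,\mathrm{Var}(x_1) + (1 - t^2)$, matching the claimed equality in distribution.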
Yet, when an SDE is discretized, the discretization error induces distribution estimation error, which invalidates the previous results. Moreover, the choice of discretization can have a significant impact on the convergence of the dynamics (see, e.g., Wibisono et al. 2016), leaving the design of optimal discretization methods largely open.

4. Main Results

In this section, we present a novel analysis of the discrete-time stochastic interpolant framework. Specifically, we focus on the following formulation: given a schedule $\{t_k\}_{k=0}^N \subset [0,1]$ with $t_0 < t_1 < \cdots < t_{N-1} < t_N$, we define an estimated process using the SDE

$$d\hat X^F_t = \hat b_F(t_k, \hat X^F_{t_k})\,dt + \sqrt{2\epsilon}\,dW_t, \quad t \in [t_k, t_{k+1}). \tag{6}$$

In practice, we can express this as

$$\hat X^F_{t_{k+1}} = \hat X^F_{t_k} + (t_{k+1} - t_k)\,\hat b_F(t_k, \hat X^F_{t_k}) + \sqrt{2\epsilon(t_{k+1} - t_k)}\, w_k, \quad w_k \sim N(0, I_d), \tag{7}$$

for $k = 0, 1, \ldots, N-1$. Here $\hat b_F(t, X^F_t)$ represents an estimator of the true drift term $b_F$. Equation (6), or equivalently (7), corresponds to the Euler-Maruyama discretization of the continuous-time SDE (Chen et al., 2023a;b). In this paper, we refer to (6) as the estimated SDE, while (3) is referred to as the true SDE.

Below, we present our analysis for the discrete-time stochastic interpolant. We begin by introducing the main assumptions, which are crucial for bounding the error between the estimated distribution and the true target distribution.

Assumption 4.1. The joint measure $\nu$ defined in the stochastic interpolant (1) satisfies $\mathbb{E}_{(x_0,x_1)\sim\nu}\|x_0 - x_1\|^8 < \infty$, and the interpolation function is such that $\mathbb{E}_{(x_0,x_1)\sim\nu}\|\partial_t^2 I(t,x_0,x_1)\|^2 \le M^2 < \infty$ for all $t \in [0,1]$.

The moment bound in the first part assumes that the initial and target distributions, $\rho_0$ and $\rho_1$, are not excessively far apart. In fact, inequality (2) further implies that $\mathbb{E}\|\partial_t I(t,x_0,x_1)\|^8 < \infty$ for any $t \in [0,1]$. The second part of the assumption ensures that the time derivative of the interpolation function does not exhibit significant variations.
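The Euler-Maruyama update (7) can be sketched generically as below. This is our illustration, not the paper's implementation: `em_sampler` accepts any drift estimator, and the sanity check uses the simple linear drift $\hat b(t,x) = -x$, for which the continuous SDE $dX = -X\,dt + \sqrt{2\epsilon}\,dW$ has stationary law $N(0, \epsilon I_d)$.

```python
import numpy as np

def em_sampler(b_hat, ts, x_init, eps, rng):
    """Euler-Maruyama discretization (7) of the estimated SDE (6):
    X_{t_{k+1}} = X_{t_k} + h_k * b_hat(t_k, X_{t_k}) + sqrt(2 * eps * h_k) * w_k."""
    x = np.array(x_init, dtype=float)
    for tk, tk1 in zip(ts[:-1], ts[1:]):
        h = tk1 - tk
        x = x + h * b_hat(tk, x) + np.sqrt(2 * eps * h) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(2)
ts = np.linspace(0.0, 8.0, 801)             # fine uniform grid, h_k = 0.01
x = em_sampler(lambda t, x: -x, ts, 5.0 * np.ones(50_000), eps=1.0, rng=rng)
print(np.round([x.mean(), x.var()], 1))     # approaches the N(0, 1) stationary law
```

With small steps the sample mean decays like $5e^{-8} \approx 0$ and the variance settles near $\epsilon = 1$, confirming that the update tracks the continuous dynamics.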
Assumption 4.1 is similar to previous assumptions on stochastic interpolants, with the exception of the first part, which uses the eighth moment instead of the fourth (see Albergo et al. 2023 or Appendix A).

Assumption 4.2. The estimator $\hat b_F$ satisfies

$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\,\mathbb{E}\,\|\hat b_F(t_k, x_{t_k}) - b_F(t_k, x_{t_k})\|^2 \le \varepsilon_{b_F}^2,$$

where the expectation is taken over $x_{t_k} \sim \rho(t_k)$.

Assumption 4.2 states that we have a sufficiently accurate estimator of the drift term $b_F(t,x)$ at the discretized time points. This assumption is analogous to common assumptions employed in the theoretical analysis of diffusion models (Benton et al., 2024; Chen et al., 2023a). We are now ready to present our main theorem.

Theorem 4.3. Suppose we start with an initial distribution $\hat\rho(t_0)$ and evolve the process according to the estimated SDE (6) with $\epsilon = O(1)$ until time $t = t_N$. Let $\hat\rho(t_N)$ be the distribution obtained at time $t = t_N$. Denote the ground-truth marginal distributions by $\{\rho(t)\}_{t\in[0,1]}$, and define $\bar\gamma_k = \min_{t\in[t_k,t_{k+1}]}\gamma(t)$. Under Assumptions 4.1 and 4.2,

$$\mathrm{KL}(\rho(t_N)\,\|\,\hat\rho(t_N)) \lesssim \mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0)) + \epsilon^{-1}\varepsilon_{b_F}^2 + \epsilon^{-1}\sum_{k=0}^{N-1}(t_{k+1}-t_k)^3\Big[M^2 + \bar\gamma_k^{-6}d^3 + \bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^8}\Big] + \epsilon^{-1}\sum_{k=0}^{N-1}(t_{k+1}-t_k)^2\,\bar\gamma_k^{-2}d\Big[\sqrt{\mathbb{E}\|x_0-x_1\|^4} + \bar\gamma_k^{-2}d\Big].$$

Here the notation $f \lesssim g$ denotes $f = O(g)$.

Theorem 4.3 provides the first finite-time error bound for the discrete-time stochastic interpolant framework, i.e., the discretization (6) of SDE (3). It explicitly quantifies the impact of the initial distribution mismatch (i.e., $\mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0))$) and the $b_F$ estimation error (i.e., $\varepsilon_{b_F}^2$), and demonstrates their dependence on the choice of the latent scale $\gamma(t)$ and the time discretization schedule $\{t_k\}_{k=0}^N$. Notably, the bound offers a novel theoretical explanation of how the convergence behavior depends on the distance between the source and target distributions, as reflected in the terms involving $\mathbb{E}\|x_0 - x_1\|^p$ with $p = 4$ and $p = 8$. We now explain the terms in Theorem 4.3.
The terms $(t_{k+1}-t_k)^3\big[M^2 + \bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^8}\big]$ and $(t_{k+1}-t_k)^2\,\bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^4}$ quantify the discretization error associated with the velocity function $v(t,x)$ (Equation (5)), which is a component of the drift term $b_F(t,x)$. Notably, distances of the form $\mathbb{E}\|x_0-x_1\|^p$ appear here, highlighting the influence of the distance between the source and target distributions on the discretization error. Conversely, the terms $(t_{k+1}-t_k)^3\,\bar\gamma_k^{-6}d^3$ and $(t_{k+1}-t_k)^2\,\bar\gamma_k^{-4}d^2$ quantify the discretization error arising from the score function $s(t,x)$. This component of the discretization error exhibits a stronger dependence on the latent scale $\gamma(t)$ and the data dimension $d$. Specifically, assuming sufficiently small step sizes, the dependence of the score discretization error on $\gamma(t)$ is $\bar\gamma_k^{-4}$, which aligns with the findings for diffusion models (Chen et al., 2023a; Benton et al., 2024).

Since $\gamma(0) = \gamma(1) = 0$, the discretization error would become unbounded if we were to simulate the estimated SDE (6) from $t_0 = 0$ to $t_N = 1$. To address this, we choose $0 < t_0 < t_N < 1$ to ensure that $\bar\gamma_k$ is lower bounded, thereby maintaining a finite discretization error bound in Theorem 4.3. Under this approach, the SDE is simulated within the interval $[t_0, t_N]$, and an estimate of $\rho(t_N)$ is obtained instead of $\rho_1$. This is acceptable when $t_N$ is sufficiently close to 1, as $\rho(t_N)$ will then be sufficiently close to $\rho_1$ (e.g., in terms of Wasserstein distance). This practice of choosing $t_N < 1$ is analogous to the early stopping technique commonly employed in diffusion models (Song et al., 2021; Chen et al., 2023a; Benton et al., 2024).

The term $\mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0))$ quantifies the effect of choosing a slightly different base distribution. As discussed earlier, we typically choose $t_0 > 0$, and in many cases the true base distribution $\rho(t_0)$ may not be readily available.
This bound theoretically supports the use of a similar base distribution $\hat\rho(t_0)$ as an approximation of the true base distribution. Finally, the term $\varepsilon_{b_F}^2$ accounts for the error in estimating the drift term $b_F(t,x)$. Compared to previous continuous-time analyses of the stochastic interpolant framework, which typically measure the estimation error by an average over the entire time interval (Albergo & Vanden-Eijnden, 2023; Albergo et al., 2023), our analysis evaluates the estimation error using a weighted average of the errors at the discretized time points.

The bound provided by Theorem 4.3 explicitly depends on the choice of the latent scale $\gamma(t)$ and the time schedule $\{t_k\}_{k=0}^N$. This dependence can be leveraged to assess the computational complexity of a given time schedule under a specific choice of $\gamma(t)$. In Section 5, we develop a time schedule that achieves a fast convergence rate. We now provide a proof sketch for Theorem 4.3.

Proof Sketch of Theorem 4.3. The proof of Theorem 4.3 contains two key steps. In step one, we establish a bound on the KL divergence due to the discretization error in the drift term $b_F(t,x)$ of the SDE, based on Girsanov's theorem. Then, in step two, we exploit the special structure of $b_F(t,x)$ by expressing its derivatives as conditional covariances, enabling the application of relevant expectation inequalities and eventually bounding the discretization error.

Step One: Bounding the KL Divergence by the Discretization Error. Leveraging the results of Chen et al. (2023d) (see Proposition C.2), which are derived using Girsanov's theorem, we obtain the following bound:

$$\mathrm{KL}(\rho(t_N)\,\|\,\hat\rho(t_N)) \le \mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0)) + \mathrm{KL}(P\,\|\,Q) = \underbrace{\mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0))}_{\text{Initialization error}} + \frac{1}{4\epsilon}\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}} \mathbb{E}\big[\|b_F(t, X^F_t) - \hat b_F(t_k, X^F_{t_k})\|^2\big]\,dt,$$

where $P$ and $Q$ represent the path measures of the solutions to the true SDE (3) and the estimated SDE (6), respectively, both with the same initial distribution $\rho(t_0)$.
Applying the triangle inequality yields

$$\mathbb{E}\big[\|b_F(t, X^F_t) - \hat b_F(t_k, X^F_{t_k})\|^2\big] \le 2\,\underbrace{\mathbb{E}\big[\|b_F(t, X^F_t) - b_F(t_k, X^F_{t_k})\|^2\big]}_{\text{Discretization error}} + 2\,\underbrace{\mathbb{E}\big[\|b_F(t_k, X^F_{t_k}) - \hat b_F(t_k, X^F_{t_k})\|^2\big]}_{\text{Estimation error}}.$$

The second term on the right-hand side corresponds to the estimation error of $\hat b_F(t,x)$, and its summation can be bounded by $\varepsilon_{b_F}^2$ according to Assumption 4.2. The first term, on the other hand, represents the discretization error associated with $b_F(t,x)$ and requires further analysis.

Step Two: Bounding the Discretization Error. A central tool in this part is Itô's formula, which gives

$$\int_{t_k}^t db_F(s, X^F_s) = \int_{t_k}^t \partial_s b_F(s, X^F_s)\,ds + \int_{t_k}^t \nabla b_F(s, X^F_s)\,b_F(s, X^F_s)\,ds + \int_{t_k}^t \epsilon\,\Delta b_F(s, X^F_s)\,ds + \int_{t_k}^t \sqrt{2\epsilon}\,\nabla b_F(s, X^F_s)\,dW_s.$$

While the application of Itô's formula is analogous to that of Benton et al. (2024), we refrain from eliminating the three linear terms due to their more complex forms in the context of stochastic interpolants. Instead, we apply Jensen's inequality to the integrals with respect to time and Itô's isometry (Le Gall 2016, Equation 5.8) to the integral with respect to Brownian motion, so that the term can be bounded by the derivatives of $b_F(t,x)$ (i.e., terms like $\partial_t b_F(t,x)$ and $\nabla_x b_F(t,x)$; see Lemma C.1). Since $b_F(t,x)$ can be expressed as a linear combination of $v(t,x)$ and $s(t,x)$, it remains to bound the derivatives of both $v(t,x)$ and $s(t,x)$, respectively.

Note that $v(t,x)$ and $s(t,x)$ can be written as the conditional expectations of $\partial_t I$ and $-\gamma^{-1}z = (I - x_t)/\gamma^2$ given $x_t = x$. To bound the derivatives of these conditional expectations, we employ the following key equality:

$$\partial_\alpha\,\mathbb{E}[f \mid x_t = x] = \mathbb{E}[\partial_\alpha f \mid x_t = x] + \mathrm{Cov}\big(f,\ \partial_\alpha\big[-\|x - I\|^2/(2\gamma^2)\big] \,\big|\, x_t = x\big),$$

where $\alpha$ can represent either $x$ or $t$. This equality crucially relies on the Gaussian latent term $\gamma(t)z$ introduced in the stochastic interpolant framework.
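The conditional-expectation representation underlying this step, $s(t,x) = -\gamma^{-1}(t)\,\mathbb{E}[z \mid x_t = x]$ from (4), can be checked numerically in a one-dimensional Gaussian example, where everything is jointly Gaussian and $\mathbb{E}[z \mid x_t = x]$ is linear in $x$, hence recoverable by least squares. This is our sketch; the distributions and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
t, mu = 0.6, 2.0
g = np.sqrt(2 * t * (1 - t))                 # gamma(t) for the diffusion-style scale
x0 = rng.standard_normal(n)                  # rho_0 = N(0, 1)
x1 = mu + rng.standard_normal(n)             # rho_1 = N(mu, 1), independent coupling
z = rng.standard_normal(n)
xt = (1 - t) * x0 + t * x1 + g * z           # the interpolant (1)

# E[z | x_t = x] is linear in x here, so a least-squares fit recovers it;
# analytically the slope is Cov(z, x_t) / Var(x_t) = gamma / V.
slope, intercept = np.polyfit(xt, z, 1)
V = (1 - t) ** 2 + t ** 2 + g ** 2           # Var(x_t)
print(round(slope, 2), round(g / V, 2))      # the two values should agree
```

Plugging the fitted line into $s(t,x) = -\gamma^{-1}\mathbb{E}[z \mid x_t = x]$ recovers the exact Gaussian score $-(x - t\mu)/\mathrm{Var}(x_t)$, consistent with the identity in the text.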
This generalizes the identity $s(t,x) = \nabla\log\rho(t,x) = \mathbb{E}[\gamma^{-2}(I - x_t) \mid x_t = x]$ used for diffusion models in previous works (see, e.g., Bortoli 2022; Benton et al. 2024). We apply this equality extensively to derive bounds for both $s(t,x)$ and $v(t,x)$; note that in diffusion model theory the function $v(t,x)$ does not appear, or appears only in a much simpler form (such as $x$). Subsequently, we apply a series of inequalities to ultimately bound the expectation over $x_t$. Detailed derivations and proofs can be found in Appendix B.

5. Schedule Design for Faster Convergence

In Theorem 4.3, we provide an upper bound on the KL divergence from the target distribution to the estimated distribution for a general class of SDE-based generative models. Since the bound depends on the choice of the latent scale $\gamma(t)$ and the schedule $\{t_k\}_{k=0}^N$, we are able to carefully design a time schedule for a given latent scale, thereby achieving a provably bounded error within a minimum number of steps. Specifically, we consider the common choice of latent scale in stochastic interpolants, $\gamma(t) = \sqrt{a t(1-t)}$, first introduced in Albergo et al. (2023). This choice is equivalent to changing the definition $x_t = I(t,x_0,x_1) + \gamma(t)z$ to $x_t = I(t,x_0,x_1) + \sqrt{a}\,B_t$, where $B_t$ is a standard Brownian bridge independent of $(x_0, x_1)$. For this $\gamma(t)$, we present the following time schedule to optimize the sample complexity.

Exponentially Decaying Time Schedule. As suggested by Theorem 4.3, smaller steps need to be taken near the endpoints in order to balance the error terms. Moreover, to exactly cancel the $\gamma$-terms, we need $h_k = O(\bar\gamma_k^2)$, where $\bar\gamma_k$ is defined in Theorem 4.3. Hence, we propose an exponentially decaying time schedule inspired by the approach of Benton et al. (2024). Specifically, we first select a midpoint $t_M = \frac{1}{2}$, and let $h \in (0,1)$ be a parameter that controls the step size.
We then define the step sizes as

$$h_k = \begin{cases} \tfrac{1}{2}h(1-h)^{M-k-1}, & k < M, \\ \tfrac{1}{2}h(1-h)^{k-M}, & k \ge M, \end{cases}$$

which leads to

$$t_k = \begin{cases} \tfrac{1}{2}(1-h)^{M-k}, & k < M, \\ 1 - \tfrac{1}{2}(1-h)^{k-M}, & k \ge M. \end{cases}$$

The parameter $h$ determines the overall scale of the step sizes: a smaller $h$ results in a finer discretization of the time interval. Let $h_k = t_{k+1} - t_k$ denote the step size at the $k$-th step. We observe that $h_k = O(h\,\min\{t_k, 1-t_{k+1}\}) = O(h\,\bar\gamma_k^2)$, which satisfies the condition for canceling the $\gamma$-terms. Moreover, the total number of steps is given by

$$N = O\!\left(\frac{\log(1/t_0) + \log(1/(1-t_N))}{\log(1/(1-h))}\right) = O\!\left(h^{-1}\log\frac{1}{t_0(1-t_N)}\right).$$

We can now provide the following bound.

Proposition 5.1. Consider the same settings as in Theorem 4.3. Suppose $h_k = t_{k+1} - t_k = O(h\,\bar\gamma_k^2)$, $\epsilon = \Theta(1)$, and $h = O(\frac{1}{d})$. Then we have

$$\mathrm{KL}(\rho(t_N)\,\|\,\hat\rho(t_N)) \lesssim \epsilon^{-1}\varepsilon_{b_F}^2 + \mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0)) + h\,d\sqrt{\mathbb{E}\|x_0-x_1\|^4} + Nh^2d^2.$$

Proposition 5.1 provides the KL error bound when the step sizes are chosen so that the $\gamma$-terms are canceled.

Corollary 5.2. Using $\gamma = \sqrt{at(1-t)}$ and the time schedule defined above, suppose that $\mathrm{KL}(\rho(t_0)\,\|\,\hat\rho(t_0)) \lesssim \varepsilon^2$ and $\varepsilon_{b_F}^2 \lesssim \epsilon\,\varepsilon^2$. Furthermore, assume that $\epsilon = \Theta(1)$ and $h = O(\frac{1}{d})$. Then, under the same settings as in Theorem 4.3, the total number of steps required to achieve $\mathrm{KL}(\rho(t_N)\,\|\,\hat\rho(t_N)) = O(\varepsilon^2)$ is

$$N = O\!\left(\frac{1}{\varepsilon^2}\left(\sqrt{\mathbb{E}\|x_0-x_1\|^4}\,d\log\frac{1}{t_0(1-t_N)} + d^2\log^2\frac{1}{t_0(1-t_N)}\right)\right).$$

Corollary 5.2 provides the computational complexity of sampling data using the forward SDE. For a fixed error bound $\varepsilon$, the complexity scales proportionally to $\varepsilon^{-2}$. We can further decompose the complexity into a distance-related complexity and a Gaussian diffusion complexity. Here $O\big(\varepsilon^{-2}\sqrt{\mathbb{E}\|x_0-x_1\|^4}\,d\log\frac{1}{t_0(1-t_N)}\big)$ is the distance-related complexity, representing the number of steps required to achieve a sufficiently small discretization error associated with the velocity function $v(t,x)$, and $O\big(\varepsilon^{-2}d^2\log^2\frac{1}{t_0(1-t_N)}\big)$ is the Gaussian diffusion complexity, representing the number of steps required to achieve a sufficiently small discretization error associated with the score function $s(t,x)$. We briefly explain how to obtain this complexity.
First, given a desired number of steps $N$, we select $h = \Theta\big(N^{-1}\log\frac{1}{t_0(1-t_N)}\big)$ to achieve the specified number of steps. Since $h_k = O(\bar\gamma_k^2 h)$, we have

$$\sum_{k=0}^{N-1} h_k^3\Big[M^2 + \bar\gamma_k^{-6}d^3 + \bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^8}\Big] \lesssim Nh^3d^3 + h^2\Big(M^2 + d\sqrt{\mathbb{E}\|x_0-x_1\|^8}\Big),$$

$$\sum_{k=0}^{N-1} h_k^2\,\bar\gamma_k^{-2}d\Big[\sqrt{\mathbb{E}\|x_0-x_1\|^4} + \bar\gamma_k^{-2}d\Big] \lesssim Nh^2d^2 + h\,d\sqrt{\mathbb{E}\|x_0-x_1\|^4}.$$

By substituting the chosen value of $h$ for the given $N$ into Theorem 4.3, we can derive the stated complexity bound.

Comparison to a Uniform Schedule. To highlight the benefits of our proposed exponentially decaying time schedule, we compare it with a natural uniform schedule satisfying $h_k = \frac{t_N - t_0}{N} \le \frac{1}{N}$. We further assume the ideal case where $\varepsilon_{b_F}^2 = 0$ and $\rho(t_0) = \hat\rho(t_0)$ in our analysis. According to Theorem 4.3, the error bound for the uniform schedule is given by

$$\sum_{k=0}^{N-1} h_k^3\Big(M^2 + \bar\gamma_k^{-6}d^3 + \bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^8}\Big) + \sum_{k=0}^{N-1} h_k^2\Big(\bar\gamma_k^{-4}d^2 + \bar\gamma_k^{-2}d\sqrt{\mathbb{E}\|x_0-x_1\|^4}\Big).$$

Since $\bar\gamma_k^2 = \Theta(\min\{t_k, 1-t_{k+1}\})$, and noting that, for a uniform schedule with $\delta = t_0(1-t_N)$,

$$\sum_{k=0}^{N-1} h_k\,\min\{t_k, 1-t_{k+1}\}^{-p} = \begin{cases}\Theta(\log(1/\delta)), & p = 1,\\ \Theta(\delta^{-(p-1)}), & p > 1,\end{cases}$$

the overall error bound becomes

$$\mathrm{KL}(\rho(t_N)\,\|\,\hat\rho(t_N)) \lesssim \frac{1}{N}\left(\sqrt{\mathbb{E}\|x_0-x_1\|^4}\,d\log\frac{1}{t_0(1-t_N)} + \frac{d^2}{t_0(1-t_N)}\right).$$

Consequently, the complexity of using a uniform schedule is given by

$$N = O\!\left(\varepsilon^{-2}\left(\sqrt{\mathbb{E}\|x_0-x_1\|^4}\,d\log\frac{1}{t_0(1-t_N)} + \frac{d^2}{t_0(1-t_N)}\right)\right),$$

which exhibits a higher computational complexity compared to the proposed exponentially decaying schedule.

Comparison to Diffusion Model Results. By setting $I(t,x_0,x_1) = (1-t)x_0 + t x_1$, $\gamma(t) = \sqrt{2t(1-t)}$, $x_0 \sim \rho_0 = N(0, I_d)$, and assuming that $x_0$ and $x_1$ are independent, the stochastic interpolant reduces to $x_t = \sqrt{1-t^2}\,z + t x_1$ for some $z \sim N(0, I_d)$, which fits the diffusion model setting (Song et al., 2021). Assuming that the fourth moment of $\rho_1$ is bounded by a constant (see Appendix C.5 for details), the complexity of our approach simplifies to

$$N = O\!\left(\varepsilon^{-2} d^2 \log^2\frac{1}{1-t_N}\right).$$

Figure 1. An illustration of the interpolants. We choose $I(t,x_0,x_1) = (1-t)x_0 + t x_1$ and $\gamma(t) = \sqrt{2t(1-t)}$. The first row of graphs shows the densities given by (1).
The second row of graphs shows the estimated densities using SDE (6), where the estimator $\hat b_F$ is learned by a neural network.

For diffusion models with an early stopping time $\delta$, Chen et al. (2023a) established a complexity bound of $O(\varepsilon^{-2}d^2\log^2(1/\delta))$. By setting $\delta = 1 - t_N$ in our analysis, we recover the same complexity bound as that obtained for diffusion models. While Benton et al. (2024) further improve the complexity bound for diffusion models to $O(\varepsilon^{-2}d\log^2(1/\delta))$ by leveraging techniques from stochastic localization, these techniques heavily rely on the Gaussian structure of diffusion models and cannot be directly applied to the more general stochastic interpolant framework.

Other Choices of $\gamma(t)$. In addition to the commonly used $\gamma(t) = \sqrt{at(1-t)}$, our framework can readily be extended to analyze other choices of $\gamma(t)$. In Appendix C.6, we present an analysis for $\gamma^2(s) = (1-s)^2 s$, which is equivalent to the definition in Chen et al. (2024b). We show that the time schedule proposed in Appendix C.6 also outperforms the uniform schedule in terms of computational complexity, demonstrating the effectiveness of our schedule design.

6. Numerical Experiments

In this section, we conduct experiments to validate our theoretical results. We implement the discretized sampler defined in Equation (7) and evaluate its performance on two-dimensional datasets (primarily from Grathwohl et al. 2019) and Gaussian mixtures. The experiments focus on the following three aspects: (i) the convergence rate using the exponentially decaying schedule, (ii) the effect of choosing different base densities $\rho_0$, which is reflected in the distribution distance terms of our bound, and (iii) the dependence of the error bound on $d$. We employ $I(t,x_0,x_1) = t x_1 + (1-t)x_0$, $\gamma(t) = \sqrt{2t(1-t)}$, and $\epsilon = 1$ in our experiments. Figure 1 presents a visualization of the interpolants and the estimated densities generated using the forward SDE.
As we can see in Figure 1, the density defined by the stochastic interpolant progressively changes from the checkerboard density to the spiral density, and the estimated density given by SDE (6) tracks the change of the real density well.

Figure 2. A visualization of experiments. (a) An illustration of generation results using different schedules. (b) A view of different $\rho_0$ and their effects on the convergence rate.

Figure 3. Estimated KL divergences. The x-axis refers to the number of steps, while the y-axis refers to the empirical KL divergence between the target distribution $\rho(t_N)$ and the estimated distribution $\hat\rho(t_N)$. Figure 3a compares the exponentially decaying schedule (red line) with the uniform schedule (green line). Figure 3b compares different settings of $\rho_0$ and $\nu$ (setting A: red line; setting B: green line).

Comparison of Different Time Schedules. We first compare the performance of different schedules, namely the one proposed in Section 5 and uniform step sizes. The task involves transforming a "checkerboard" density into a "spiral" density. As illustrated in Figure 2a, employing exponentially decaying step sizes results in roughly 10x faster convergence to the target density compared to using uniform step sizes. To assess this quantitatively, we estimate the KL divergence between the target density and the generated densities using sampled data points; the details of the KL divergence estimation are given in Appendix D. The floor observed in the estimated divergences is attributed to the estimation error $\varepsilon_{b_F}^2$ and the inherent randomness in the divergence estimation process. Both Figure 2a and Figure 3a corroborate the superior convergence performance of exponentially decaying step sizes.
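For reference, the exponentially decaying schedule of Section 5 can be constructed as follows. This is our sketch: `exp_decay_schedule` and the target value for $t_0$ and $1 - t_N$ are our choices, and the final check confirms $h_k = O(h \cdot \min\{t_k, 1-t_{k+1}\})$, i.e., $h_k = O(h\,\bar\gamma_k^2)$ up to the constant $a$ in $\gamma(t) = \sqrt{at(1-t)}$.

```python
import numpy as np

def exp_decay_schedule(h, t0_target=1e-3):
    """Exponentially decaying schedule around the midpoint t_M = 1/2:
    t_k = (1/2)(1-h)^{M-k} for k < M and t_k = 1 - (1/2)(1-h)^{k-M} for k >= M,
    with M chosen so that t_0 <= t0_target (and by symmetry 1 - t_N <= t0_target)."""
    M = int(np.ceil(np.log(2 * t0_target) / np.log(1 - h)))
    left = 0.5 * (1 - h) ** np.arange(M, 0, -1)        # t_0, ..., t_{M-1}
    right = 1 - 0.5 * (1 - h) ** np.arange(0, M + 1)   # t_M = 1/2, ..., t_N
    return np.concatenate([left, right])

ts = exp_decay_schedule(h=0.05)
hk = np.diff(ts)                                       # step sizes h_k
gbar2 = np.minimum(ts[:-1], 1 - ts[1:])                # min{t_k, 1 - t_{k+1}}
print(ts[0], 1 - ts[-1])                               # both O(t0_target)
print(np.max(hk / gbar2))                              # equals h/(1-h): every step obeys h_k = O(h gbar_k^2)
```

The ratio $h_k / \min\{t_k, 1-t_{k+1}\}$ is exactly $h/(1-h)$ at every step, which is the property that cancels the $\gamma$-terms in Theorem 4.3; a uniform schedule instead lets this ratio blow up near the endpoints.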
Effect of Different Distribution Distances. Second, we investigate the impact of different source densities ($\rho_0$) and couplings ($\nu$) on the complexity of the generation process while keeping the target density ($\rho_1$) fixed. The densities are shown in Figure 2b, where we have two source densities, labeled A and B. Each "block" in $\rho_0$ is coupled with the corresponding "block" in $\rho_1$ in the same vertical position. Figure 3b shows the estimated divergences between the generated data distribution and the target distribution for both scenarios. For both choices of $\rho_0$, we use the exponentially decaying step size schedule. The results demonstrate that the generation process converges faster when the source density $\rho_0$ is closer to the target density $\rho_1$ under the specified coupling $\nu$, and more slowly when the source density is farther away. This observation highlights the influence of the initial distribution and the coupling structure on the overall convergence behavior.

Test on d-dimensional Gaussian Mixtures. Next, we compare the performance of the sampler across different dimensions $d$. The source and target distributions are chosen as Gaussian mixtures, since in this case the drift $b^F(t, x)$ has an analytical form (Albergo et al., 2023). The comparison is shown in Figure 4.

Figure 4. The impact of dimension $d$. (a) Convergence for different $d$. (b) KL error for different $d$ under a fixed number of iterations $N$.

The upper bound on the KL error provided by our theory is $O(d^2)$, but the empirical KL error evaluated in this experiment grows almost linearly with $d$. This gap could be due to several reasons: for example, there might be special properties of Gaussian mixtures that do not hold for general distributions, or there may simply exist a sharper bound that we have not found.
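For source and target pairs without a closed-form drift, the conditional-expectation representation $v(t,x) = \mathbb{E}[\partial_t I \mid x_t = x]$ (made explicit in Appendix B.2, as a ratio of expectations over the coupling $\nu$) can be approximated by self-normalized Monte Carlo over paired samples. A sketch for the linear interpolant used in these experiments; the function name and signature are our own.

```python
import numpy as np

def velocity_mc(t, x, x0s, x1s, gamma):
    """Self-normalized Monte Carlo estimate of v(t, x) = E[dI/dt | x_t = x]
    for the linear interpolant I(t, x0, x1) = (1 - t) x0 + t x1
    (so dI/dt = x1 - x0), weighting paired samples (x0, x1) ~ nu by the
    Gaussian kernel exp(-||x - I||^2 / (2 gamma(t)^2))."""
    I = (1.0 - t) * x0s + t * x1s
    logw = -np.sum((x - I) ** 2, axis=1) / (2.0 * gamma(t) ** 2)
    w = np.exp(logw - logw.max())   # subtract the max log-weight for stability
    w /= w.sum()
    return w @ (x1s - x0s)          # weighted average of dI/dt = x1 - x0
```

In the degenerate case where all pairs coincide, the estimate reduces exactly to $x_1 - x_0$, which is a quick correctness check.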
Filling this gap would be an interesting direction for future research.

7. Conclusion and Future Directions

This paper provides the first discrete-time analysis of SDE-based generative models within the stochastic interpolant framework. We formulate a discrete-time sampler using the Euler-Maruyama scheme to estimate the target distribution by leveraging learned velocity estimators. We then provide an upper bound on the KL divergence from the target distribution to the estimated distribution. Our result provides a novel quantification of how different factors, including the distance between source and target distributions $\sqrt{\mathbb{E}_\nu \|x_0 - x_1\|^p}$ and the desired estimation accuracy $\varepsilon^2$, affect the convergence rate, and also offers a new principled way to design efficient schedules for convergence acceleration. Finally, we conduct numerical experiments with the discrete-time sampler, which validate our theoretical findings.

Future research can fruitfully explore the enhancement of convergence bounds, with a particular focus on the dependency on the dimension $d$. Notably, diffusion models have been shown by previous works to exhibit nearly $d$-linear convergence rates, indicating potential for improvement. Another direction is to investigate strategies for refining the sampling algorithm to attain convergence rates better than the currently observed $O(\varepsilon^{-2})$. Furthermore, the finite-time convergence of ODE-based generative models within the stochastic interpolant framework warrants a more comprehensive investigation.

Grant Acknowledgement

This work is supported by the National Natural Science Foundation of China Grants 52450016 and 52494974.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Albergo, M.
S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions, 2023.

Albergo, M. S., Goldstein, M., Boffi, N. M., Ranganath, R., and Vanden-Eijnden, E. Stochastic interpolants with data-dependent couplings, 2024.

Benton, J., Bortoli, V. D., Doucet, A., and Deligiannidis, G. Nearly d-linear convergence bounds for diffusion models via stochastic localization. In The Twelfth International Conference on Learning Representations, 2024.

Bortoli, V. D. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.

Chen, H., Lee, H., and Lu, J. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, 2023a.

Chen, M., Huang, K., Zhao, T., and Wang, M. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, 2023b.

Chen, S., Chewi, S., Lee, H., Li, Y., Lu, J., and Salim, A. The probability flow ODE is provably fast. Advances in Neural Information Processing Systems, 2023c.

Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions, 2023d.

Chen, Y., Biloš, M., Mittal, S., Deng, W., Rasul, K., and Schneider, A. Recurrent interpolants for probabilistic time series prediction, 2024a.

Chen, Y., Goldstein, M., Hua, M., Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Probabilistic forecasting with stochastic interpolants and Föllmer processes, 2024b.

dos Santos Costa, A., Mitnikov, I., Pellegrini, F., Daigavane, A., Geiger, M., Cao, Z., Kreis, K., Smidt, T., Kucukbenli, E., and Jacobson, J.
EquiJump: Protein dynamics simulation via SO(3)-equivariant stochastic interpolants, 2024.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Le Gall, J.-F. Brownian motion, martingales, and stochastic calculus. Springer, 2016.

Lee, H., Lu, J., and Tan, Y. Convergence for score-based generative modeling with polynomial complexity. In Advances in Neural Information Processing Systems, 2022.

Li, G., Huang, Y., Efimov, T., Wei, Y., Chi, Y., and Chen, Y. Accelerating convergence of score-based diffusion models, provably, 2024a.

Li, G., Wei, Y., Chen, Y., and Chi, Y. Towards faster non-asymptotic convergence for diffusion-based generative models, 2024b.

Liang, Y., Ju, P., Liang, Y., and Shroff, N. Broadening target distributions for accelerated diffusion models via a novel analysis approach, 2024.

Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

Oko, K., Akiyama, S., and Suzuki, T. Diffusion models are minimax optimal distribution estimators. In Proceedings of the 40th International Conference on Machine Learning, 2023.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Wibisono, A., Wilson, A. C., and Jordan, M. I.
A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 2016.

Appendices

Appendix A: Supplementary details for Section 3. This section provides additional details from (Albergo et al., 2023) that were not included in Section 3. It includes a formal statement of key equations and their associated conditions. Furthermore, it outlines the optimization objectives for the score and velocity estimators, which are used for model training in Section 6.

Appendix B: Useful lemmas in bounding derivatives. This section presents lemmas that are essential for bounding the derivatives of the velocity functions and score functions, along with their corresponding proofs. The proofs rely primarily on properties and inequalities related to (conditional) expectations. These lemmas play a crucial role in deriving the overall KL error bounds.

Appendix C: Proofs of results in Sections 4 and 5. This section provides the complete proofs for the results presented in Section 4 and Section 5, including the proof of Theorem 4.3 (Appendix C.2), the proofs of Proposition 5.1 and Corollary 5.2 (Appendices C.3 and C.4), and additional details regarding the discussions in Section 5 (Appendix C.6).

Appendix D: More details of numerical experiments. This section provides details omitted from Section 6, including the parameterization of the estimators, the choice of $(t_0, t_N)$ and optimizers, and how the KL divergence is estimated. We also include additional experiments for $\gamma^2(t) = (1-t)^2 t$.

Notation. We use $\|\cdot\|$ to denote the $\ell_2$ norm for both vectors and matrices. For a matrix $A$, we use $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$ to denote the Frobenius norm of $A$. We use $\frac{d}{du}$ or $\partial_u$ to denote the (partial) derivative with respect to $u$. We use $\nabla$ to denote the gradient or Jacobian, depending on whether the function is scalar-valued or vector-valued.
If not specified, for a function of the form $f(t, x)$ where $t$ is a scalar and $x$ is a vector, $\nabla f(t, x)$ means the gradient vector or Jacobian matrix with respect to $x$ rather than $t$. We use $\Delta f(t, x) = \sum_{i=1}^d \partial^2_{x_i} f$ as the Laplace operator. We use $\mathbb{E}[X]$ to denote the expectation of a random variable $X$, and $\mathrm{Cov}(X, Y)$ to denote the covariance of two random variables $X, Y$. $\mathbb{E}[X \mid c]$ and $\mathrm{Cov}(X, Y \mid c)$ denote the corresponding conditional expectation and conditional covariance given condition $c$. We use the notation $f(x) \lesssim g(x)$ or $f(x) = O(g(x))$ to denote that there exists a constant $C > 0$ such that $f(x) \le C g(x)$.

A. Supplementary Details for Section 3

This part summarizes some of the results from (Albergo et al., 2023) that are not introduced in Section 3.

Proposition A.1. ((Albergo et al., 2023), Theorem 2.6, Corollaries 2.10 and 2.18, and their proofs) Suppose that the joint measure $\nu$ and the function $I$ satisfy
$$\mathbb{E}_{(x_0,x_1)\sim\nu}\|\partial_t I(t,x_0,x_1)\|^4 \le M_1 < \infty, \quad \mathbb{E}_{(x_0,x_1)\sim\nu}\|\partial_t^2 I(t,x_0,x_1)\|^2 \le M_2 < \infty, \quad \forall t \in [0,1]. \qquad (9)$$
Then $\rho \in C^1((0,1), C^p(\mathbb{R}^d))$, $s \in C^1((0,1), (C^p(\mathbb{R}^d))^d)$ and $b \in C^0((0,1), (C^p(\mathbb{R}^d))^d)$, and both the solution of the probability flow ODE
$$\frac{d}{dt} X_t = b(t, X_t), \quad X_0 \sim \rho_0,$$
and the solution of the forward SDE
$$dX^F_t = b^F(t, X^F_t)\,dt + \sqrt{2\epsilon(t)}\, dW_t, \quad X^F_0 \sim \rho_0,$$
have the same marginal densities as $(x_t)_{t\in[0,1]}$. Here $\epsilon \in C[0,1]$ with $\epsilon(t) \ge 0$ for all $t \in [0,1]$, and $b^F$ is defined as
$$b^F(t,x) = b(t,x) + \epsilon(t)\, s(t,x). \qquad (10)$$
Moreover, suppose that the densities $\rho_0, \rho_1$ are strictly positive elements of $C^2(\mathbb{R}^d)$ and are such that
$$\int_{\mathbb{R}^d} \|\nabla \log \rho_0(x)\|^2 \rho_0(x)\,dx < \infty, \quad \int_{\mathbb{R}^d} \|\nabla \log \rho_1(x)\|^2 \rho_1(x)\,dx < \infty.$$
Then $\rho \in C^1([0,1], C^p(\mathbb{R}^d))$, $s \in C^1([0,1], (C^p(\mathbb{R}^d))^d)$ and $b \in C^0([0,1], (C^p(\mathbb{R}^d))^d)$.

The notation is adapted from (Albergo et al., 2023), where $f \in C^1([0,1], C^p(\mathbb{R}^d))$ means that the function $f$ is $C^1$ in $t \in [0,1]$ and $C^p$ in $x \in \mathbb{R}^d$. The above proposition provides generative models in the form of $\frac{d}{dt} X_t = b(t, X_t)$ and $dX^F_t = b^F(t, X^F_t)\,dt + \sqrt{2\epsilon(t)}\,dW_t$. In practice, we need to train an estimator to estimate the velocity functions.
By the following proposition, we can use the optimization objectives below to train the estimators.

Proposition A.2. ((Albergo et al., 2023), Theorems 2.7 and 2.8) $b$ is the unique minimizer of
$$L_b[\hat b] = \int_0^1 \mathbb{E}\Big[\tfrac12 \|\hat b(t, x_t)\|^2 - \big(\partial_t I(t,x_0,x_1) + \dot\gamma(t)\, z\big) \cdot \hat b(t, x_t)\Big]\, dt,$$
and $s$ is the unique minimizer of
$$L_s[\hat s] = \int_0^1 \mathbb{E}\Big[\tfrac12 \|\hat s(t, x_t)\|^2 + \gamma^{-1}(t)\, z \cdot \hat s(t, x_t)\Big]\, dt.$$
Here the notation "$\cdot$" represents the inner product of two vectors.

B. Bounding the Velocities and Scores

B.1. Useful Lemmas

To begin with, we provide moment bounds on the Gaussian variable $z \sim N(0, I_d)$.

Lemma B.1. For any $p \ge 1$, $\mathbb{E}\|z\|^{2p} \le C(p)\, d^p$, where $C(p)$ is a constant that only depends on $p$.

Proof. First, $\|z\|^2 = \sum_{i=1}^d z_i^2$, where we write $z = (z_1, z_2, \dots, z_d)^T$. For any $d$ nonnegative numbers $a_1, \dots, a_d$, Jensen's inequality gives $(\sum_i a_i)^p \le d^{p-1} \sum_i a_i^p$. Hence
$$\mathbb{E}\|z\|^{2p} \le d^{p-1} \sum_{i=1}^d \mathbb{E}[|z_i|^{2p}] = d^p\, \mathbb{E}|z_1|^{2p},$$
since the $\{z_i\}_{i=1}^d$ are i.i.d. Here the constant $\mathbb{E}|z_1|^{2p} = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\, |x|^{2p}\, dx < \infty$ only depends on $p$.

Also, the following simple fact is useful for our analysis.

Lemma B.2. For two vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$, the matrix $uv^T \in \mathbb{R}^{n\times m}$ satisfies $\|uv^T\|_F = \|u\|\,\|v\|$, where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|$ denotes the 2-norm.

Proof. By the definition of the Frobenius norm,
$$\|uv^T\|_F^2 = \sum_{i=1}^n \sum_{j=1}^m (uv^T)_{ij}^2 = \sum_{i=1}^n \sum_{j=1}^m u_i^2 v_j^2 = \|u\|^2 \|v\|^2.$$

Recall that we have defined $v(t,x) = \mathbb{E}[\partial_t I(t,x_0,x_1) \mid x_t = x]$. We now give bounds for the score functions and the velocity functions.

Lemma B.3. For $p \ge 1$, there exists a constant $C(p)$ that depends only on $p$ such that for $t \in (0,1)$,
$$\mathbb{E}\|s(t,x_t)\|^p \le C(p)\, \gamma^{-p} d^{p/2}, \qquad \mathbb{E}\|v(t,x_t)\|^p \le C(p)\, \mathbb{E}\|x_1 - x_0\|^p,$$
$$\mathbb{E}\|b(t,x_t)\|^p \le C(p)\big[\mathbb{E}\|x_1 - x_0\|^p + |\dot\gamma|^p d^{p/2}\big], \qquad \mathbb{E}\|b^F(t,x_t)\|^p \le C(p)\big[\mathbb{E}\|x_1 - x_0\|^p + (|\dot\gamma|^p + \gamma^{-p}\epsilon^p)\, d^{p/2}\big].$$

Proof.
When $p \ge 1$, using the conditional expectation forms of $s$ and $v$ and applying Jensen's inequality, we obtain
$$\mathbb{E}\|s(t,x_t)\|^p = \mathbb{E}\|\gamma^{-1}\mathbb{E}[z \mid x_t = x]\|^p \le \gamma^{-p}\, \mathbb{E}\|z\|^p \le C(p)\, \gamma^{-p} d^{p/2},$$
$$\mathbb{E}\|v(t,x_t)\|^p = \mathbb{E}\|\mathbb{E}[\partial_t I \mid x_t = x]\|^p \le \mathbb{E}\|\partial_t I\|^p \le C(p)\, \mathbb{E}\|x_1 - x_0\|^p.$$
Moreover, since $b(t,x) = v(t,x) + \dot\gamma\gamma\, s(t,x)$ and $b^F(t,x) = b(t,x) + \epsilon\, s(t,x)$,
$$\mathbb{E}\|b(t,x_t)\|^p \le C(p)\big[\mathbb{E}\|x_1 - x_0\|^p + |\dot\gamma|^p d^{p/2}\big], \qquad \mathbb{E}\|b^F(t,x_t)\|^p \le C(p)\big[\mathbb{E}\|x_1 - x_0\|^p + (|\dot\gamma|^p + \gamma^{-p}\epsilon^p)\, d^{p/2}\big].$$

B.2. Bounds on Time and Space Derivatives

Note: in the following sections, we will use the facts $\frac{d}{dt}\gamma^2(t) = O(1)$ and $\frac{d^2}{dt^2}\gamma^2(t) = O(1)$.

Before moving on to the lemmas, we first discuss the conditional expectation itself. By the definition $x_t = I(t,x_0,x_1) + \gamma(t) z$, the density of $x_t$ can be expressed as
$$\rho(t,x) = \int_{\mathbb{R}^d\times\mathbb{R}^d} \frac{1}{(2\pi\gamma(t)^2)^{d/2}} \exp\!\Big(-\frac{\|x - I(t,x_0,x_1)\|^2}{2\gamma(t)^2}\Big)\, d\nu(x_0,x_1).$$
Also, conditional on $x_t = x$, the conditional measure of $(x_0,x_1)$ is
$$\frac{1}{\rho(t,x)} \cdot \frac{1}{(2\pi\gamma(t)^2)^{d/2}} \exp\!\Big(-\frac{\|x - I(t,x_0,x_1)\|^2}{2\gamma(t)^2}\Big)\, d\nu(x_0,x_1).$$
Therefore, for any function $f_t(x_t,x_0,x_1)$, its conditional expectation can be written as
$$\mathbb{E}[f_t(x_t,x_0,x_1) \mid x_t = x] = \frac{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I(t,x_0,x_1)\|^2}{2\gamma(t)^2}\Big)\, f_t(x,x_0,x_1)\Big]}{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I(t,x_0,x_1)\|^2}{2\gamma(t)^2}\Big)\Big]}.$$

We first consider the time derivative of $v$ in the sense of expectation.

Lemma B.4. We have $\mathbb{E}\|\partial_t v(t,x_t)\|^2 \lesssim \mathbb{E}\|\partial_t^2 I\|^2 + \gamma^{-2} d\, \mathbb{E}\|x_0 - x_1\|^4 + \gamma^{-2}\dot\gamma^4 d^3$ for $t \in (0,1)$.

Proof. For $t \in (0,1)$, we can first explicitly write
$$v(t,x) = \frac{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I(t)\|^2}{2\gamma(t)^2}\Big)\, \partial_t I(t)\Big]}{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I(t)\|^2}{2\gamma(t)^2}\Big)\Big]}.$$
Here we write $I(t) = I(t,x_0,x_1)$ for simplicity, and below we omit $t$ when it is clear from the context. We now want to compute $\partial_t v(t,x)$.
First notice that
$$\partial_t\big[\exp(f_t)\, \partial_t I\big] = \exp(f_t)\big[\partial_t^2 I + \partial_t I\, \partial_t f_t\big], \quad \text{where } f_t = -\frac{\|x - I(t)\|^2}{2\gamma(t)^2}, \quad \partial_t f_t = \frac{(x-I)\cdot \partial_t I}{\gamma(t)^2} + \frac{\dot\gamma\, \|x-I\|^2}{\gamma(t)^3}.$$
Note that $\sup_{x\in\mathbb{R}} e^{-x^2/2}|x| = e^{-1/2} = C_1 < \infty$ and $\sup_{x\in\mathbb{R}} e^{-x^2/2} x^2 = 2e^{-1} = C_2 < \infty$, so we know that
$$\Big\|\frac{d}{dt}\big[\exp(f_t)\, \partial_t I\big]\Big\| \le \|\partial_t^2 I\| + C_2\, \gamma^{-1}|\dot\gamma|\, \|\partial_t I\| + C_1\, \gamma^{-1}\|\partial_t I\|^2.$$
Therefore, using the dominated convergence theorem, we know that
$$\frac{d}{dt}\, \mathbb{E}_{(x_0,x_1)\sim\nu}\big[\exp(f_t)\, \partial_t I\big] = \mathbb{E}_{(x_0,x_1)\sim\nu}\big[\partial_t\big(\exp(f_t)\, \partial_t I\big)\big].$$
Similarly we can do this for the denominator, so that we can compute the overall derivative. Writing $f_t(x_0,x_1)$ simply as $f_t$, we have
$$\partial_t v(t,x) = \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t^2 I]}{\mathbb{E}_\nu[\exp(f_t)]} + \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t I\, \partial_t f_t]}{\mathbb{E}_\nu[\exp(f_t)]} - \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t I]\, \mathbb{E}_\nu[\exp(f_t)\, \partial_t f_t]}{\mathbb{E}_\nu[\exp(f_t)]^2} = \mathbb{E}[\partial_t^2 I \mid x_t = x] + \mathrm{Cov}(\partial_t I, \partial_t f_t \mid x_t = x),$$
where the last equality uses the previous expressions for conditional expectations. Hence,
$$\|\partial_t v(t,x)\| \le \mathbb{E}[\|\partial_t^2 I\| \mid x_t = x] + \sqrt{\mathbb{E}[|\partial_t f_t|^2 \mid x_t = x]}\, \sqrt{\mathbb{E}[\|\partial_t I\|^2 \mid x_t = x]}.$$
Therefore, we have
$$\mathbb{E}\|\partial_t v(t,x_t)\|^2 \le 2\,\mathbb{E}\big[\mathbb{E}[\|\partial_t^2 I\| \mid x_t]^2\big] + 2\,\mathbb{E}\big[\mathbb{E}[|\partial_t f_t|^2 \mid x_t]\, \mathbb{E}[\|\partial_t I\|^2 \mid x_t]\big] \qquad ((a+b)^2 \le 2a^2 + 2b^2)$$
$$\le 2\,\mathbb{E}\|\partial_t^2 I\|^2 + 2\sqrt{\mathbb{E}\big[\mathbb{E}[|\partial_t f_t|^2 \mid x_t]^2\big]}\, \sqrt{\mathbb{E}\big[\mathbb{E}[\|\partial_t I\|^2 \mid x_t]^2\big]} \qquad \text{(Cauchy-Schwarz inequality)}$$
$$\le 2\,\mathbb{E}\|\partial_t^2 I\|^2 + 2\sqrt{\mathbb{E}|\partial_t f_t|^4}\, \sqrt{\mathbb{E}\|\partial_t I\|^4}. \qquad \text{(Jensen's inequality)}$$
Using the requirement $\|\partial_t I\| \le C\|x_0 - x_1\|$ in the definition of stochastic interpolants, $\|\partial_t I\|^4 \lesssim \|x_0 - x_1\|^4$. For $\partial_t f_t$, we can directly obtain
$$\partial_t f_t = \frac{\dot\gamma\, \|x - I\|^2}{\gamma^3} + \gamma^{-2}(x - I)\cdot \partial_t I = \gamma^{-1}\dot\gamma\, \|z\|^2 + \gamma^{-1} z \cdot \partial_t I.$$
Recall that we have defined $x_t = I(t,x_0,x_1) + \gamma(t) z$, where $z$ is an independent Gaussian variable $z \sim N(0, I_d)$. By Lemma B.1, $\mathbb{E}\|z\|^8 \lesssim d^4$ and $\mathbb{E}\|z\|^4 \lesssim d^2$, so we have
$$\mathbb{E}|\partial_t f_t|^4 \lesssim (\gamma^{-1}\dot\gamma)^4 d^4 + \gamma^{-4} d^2\, \mathbb{E}\|x_0 - x_1\|^4.$$
Therefore, we can finally deduce that
$$\mathbb{E}\|\partial_t v(t,x_t)\|^2 \lesssim \mathbb{E}\|\partial_t^2 I\|^2 + \gamma^{-2} d\, \mathbb{E}\|x_0 - x_1\|^4 + \gamma^{-2}\dot\gamma^4 d^3.$$

In addition, we want to consider the space derivative of the velocity for a fixed $t \in (0,1)$; that is, we want to bound $\nabla v(t,x)$. Here we use the notation $\nabla v(t,x)$ to denote the Jacobian matrix $\big(\frac{d}{dx_i} v(t,x)_j\big)_{ij}$, where $x_i$ represents the value of the vector $x$ at the $i$-th dimension.

Lemma B.5.
We have $\mathbb{E}\|\nabla v(t,x_t)\|_F^p \le C(p)\, \gamma^{-p} d^{p/2} \sqrt{\mathbb{E}\|x_0 - x_1\|^{2p}}$ for $p \ge 1$, $t \in (0,1)$, where $C(p)$ is a constant that only depends on $p$ and $\|\cdot\|_F$ denotes the Frobenius norm.

Proof. Similar to the proof of Lemma B.4,
$$\nabla\big[\exp(f_t)\, \partial_t I\big] = \exp(f_t)\, \partial_t I \otimes \nabla f_t,$$
where $\otimes$ denotes the tensor product, i.e., $\partial_t I \otimes \nabla f_t = \partial_t I\, (\nabla f_t)^T$ in matrix form. Again, by the dominated convergence theorem we can move the gradient operator into the expectation. Using the same notation (i.e., $f_t$ and so on), we can deduce that
$$\nabla v(t,x) = \frac{\mathbb{E}_\nu[\exp(f_t)\, (\partial_t I \otimes \nabla f_t)]}{\mathbb{E}_\nu[\exp(f_t)]} - \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t I]\, \mathbb{E}_\nu[\exp(f_t)\, \nabla f_t]^T}{\mathbb{E}_\nu[\exp(f_t)]^2} = \mathrm{Cov}(\partial_t I, \nabla f_t \mid x_t = x).$$
Again, the last equality uses the definition of covariance. Thus, by the Cauchy-Schwarz inequality,
$$\|\nabla v(t,x)\|_F \le \sqrt{\mathbb{E}[\|\partial_t I\|^2 \mid x_t = x]}\, \sqrt{\mathbb{E}[\|\nabla f_t\|^2 \mid x_t = x]}.$$
Therefore, we can use the Cauchy-Schwarz inequality again and apply Jensen's inequality to deduce that for any $p \ge 1$,
$$\mathbb{E}\|\nabla v(t,x_t)\|_F^p \le \sqrt{\mathbb{E}\big[\mathbb{E}[\|\partial_t I\|^2 \mid x_t]^p\big]}\, \sqrt{\mathbb{E}\big[\mathbb{E}[\|\nabla f_t\|^2 \mid x_t]^p\big]} \le \sqrt{\mathbb{E}\|\partial_t I\|^{2p}}\, \sqrt{\mathbb{E}\|\nabla f_t\|^{2p}}.$$
It is clear that $\mathbb{E}\|\partial_t I\|^{2p} \lesssim \mathbb{E}\|x_0 - x_1\|^{2p}$. Noting that $\nabla f_t = -(x - I)/\gamma^2 = -z/\gamma$, we deduce that $\mathbb{E}\|\nabla f_t\|^{2p} \le C(p)\, \gamma^{-2p} d^p$ for some constant that depends only on $p$. The lemma follows.

Besides the function $v(t,x)$, we are also interested in the score function $s(t,x)$. The following lemmas provide similar bounds for $s(t,x)$.

Lemma B.6. We have
$$\mathbb{E}\|\partial_t(\gamma s(t,x_t))\|^2 \lesssim \gamma^{-2}\dot\gamma^2 d^3 + \gamma^{-2} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4} \quad \text{and} \quad \mathbb{E}\|\partial_t s(t,x_t)\|^2 \lesssim \gamma^{-4}\dot\gamma^2 d^3 + \gamma^{-4} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4}$$
for any $t \in (0,1)$.

Proof.
First, using the analysis of the conditional expectations, we obtain
$$s(t,x) = \nabla \log \rho(t,x) = \frac{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I\|^2}{2\gamma^2}\Big)\, \nabla f_t\Big]}{\mathbb{E}_{(x_0,x_1)\sim\nu}\Big[\exp\Big(-\frac{\|x-I\|^2}{2\gamma^2}\Big)\Big]} = \mathbb{E}[\nabla f_t \mid x_t = x].$$
In order to compute $\partial_t(\gamma s(t,x))$, we apply an analysis similar to the proof of Lemma B.4, with exactly the same notation, to deduce that
$$\partial_t(\gamma s(t,x)) = \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t(\gamma \nabla f_t)]}{\mathbb{E}_\nu[\exp(f_t)]} + \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t f_t\, \gamma\nabla f_t]}{\mathbb{E}_\nu[\exp(f_t)]} - \frac{\mathbb{E}_\nu[\exp(f_t)\, \gamma\nabla f_t]\, \mathbb{E}_\nu[\exp(f_t)\, \partial_t f_t]}{\mathbb{E}_\nu[\exp(f_t)]^2} = \mathbb{E}[\partial_t(\gamma \nabla f_t) \mid x_t = x] + \mathrm{Cov}(\gamma \nabla f_t, \partial_t f_t \mid x_t = x).$$
The above term has exactly the same form as in the proof of Lemma B.4, so by a similar analysis we can obtain
$$\mathbb{E}\|\partial_t(\gamma s(t,x_t))\|^2 \le 2\,\mathbb{E}\|\partial_t(\gamma \nabla f_t)\|^2 + 2\sqrt{\mathbb{E}\|\gamma \nabla f_t\|^4}\, \sqrt{\mathbb{E}|\partial_t f_t|^4}.$$
We have already deduced that $\mathbb{E}\|\nabla f_t\|^4 \lesssim \gamma^{-4} d^2$ and $\mathbb{E}|\partial_t f_t|^4 \lesssim (\gamma^{-1}\dot\gamma)^4 d^4 + \gamma^{-4} d^2\, \mathbb{E}\|x_0 - x_1\|^4$. Moreover,
$$\partial_t(\gamma \nabla f_t) = \partial_t\Big(-\frac{x - I}{\gamma}\Big) = \gamma^{-1}\partial_t I + \gamma^{-2}\dot\gamma\, (x - I) = \gamma^{-1}\partial_t I + \gamma^{-1}\dot\gamma\, z.$$
Hence,
$$\mathbb{E}\|\partial_t(\gamma s(t,x_t))\|^2 \lesssim \gamma^{-2}\dot\gamma^2 d^3 + \gamma^{-2} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4},$$
which completes the first part. The proof of the second part is exactly the same, replacing $\gamma \nabla f_t$ with $\nabla f_t$.

Lemma B.7. For any $p \ge 1$, there exists a constant $C(p) < \infty$ that depends only on $p$ such that $\mathbb{E}\|\nabla s(t,x_t)\|_F^p \le C(p)\, \gamma^{-2p} d^p$.

Proof. With exactly the same ideas as in the previous lemmas, we can obtain
$$\nabla s(t,x) = \mathbb{E}[\nabla^2 f_t \mid x_t = x] + \mathrm{Cov}(\nabla f_t, \nabla f_t \mid x_t = x) = -\gamma^{-2} I + \gamma^{-2}\, \mathrm{Cov}(z, z \mid x_t = x).$$
Then, for $p \ge 1$, we have
$$\mathbb{E}\|\nabla s(t,x_t)\|_F^p \le 2^{p-1}\|\gamma^{-2} I\|_F^p + 2^{p-1}\gamma^{-2p}\, \mathbb{E}\big[\mathbb{E}[\|z\|^2 \mid x_t = x]^p\big] \le 2^{p-1}\gamma^{-2p} d^{p/2} + 2^{p-1}\gamma^{-2p}\, \mathbb{E}\|z\|^{2p} \qquad \text{(Jensen's inequality)}$$
$$\le C(p)\, \gamma^{-2p} d^p.$$
Here, for the first inequality we have used the fact that $(a+b)^p \le 2^{p-1} a^p + 2^{p-1} b^p$ for $a, b \ge 0$.

We also need some bounds for $\Delta s$ and $\Delta v$, where $\Delta$ represents the Laplace operator.

Lemma B.8. We have $\mathbb{E}\|\Delta v(t,x_t)\|^2 \lesssim \gamma^{-2} d\, \mathbb{E}\|x_0 - x_1\|^4 + \gamma^{-4} d^2$ for all $t \in (0,1)$.

Proof. We continue to use the notation from the proof of Lemma B.4.
First, in the proof of Lemma B.5, we have already shown that
$$\partial_{x_i} v(t,x) = \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t I\, \partial_{x_i} f_t]}{\mathbb{E}_\nu[\exp(f_t)]} - \frac{\mathbb{E}_\nu[\exp(f_t)\, \partial_t I]\, \mathbb{E}_\nu[\exp(f_t)\, \partial_{x_i} f_t]}{\mathbb{E}_\nu[\exp(f_t)]^2} = \frac{\mathbb{E}_{(x_0,x_1)\sim\nu}\, \mathbb{E}_{(\tilde x_0,\tilde x_1)\sim\nu}\big[\exp(f_t)\exp(\tilde f_t)\, (\partial_t I - \partial_t \tilde I)(\partial_{x_i} f_t - \partial_{x_i} \tilde f_t)\big]}{2\, \mathbb{E}_{(x_0,x_1)\sim\nu}\, \mathbb{E}_{(\tilde x_0,\tilde x_1)\sim\nu}\big[\exp(f_t)\exp(\tilde f_t)\big]}.$$
The last equality is an alternative form of the covariance, where we use the notation $\tilde I = I(t, \tilde x_0, \tilde x_1)$ and $\tilde f_t = f_t(\tilde x_0, \tilde x_1)$ for the intermediate variables $(\tilde x_0, \tilde x_1)$. Hence,
$$\partial^2_{x_i} v(t,x) = \mathrm{Cov}(\partial_t I, \partial^2_{x_i} f_t \mid x_t = x) - \frac12\, \mathrm{Cov}\big[(\partial_t I - \partial_t \tilde I)(\partial_{x_i} f_t - \partial_{x_i} \tilde f_t),\, \partial_{x_i} f_t + \partial_{x_i} \tilde f_t \mid x_t = \tilde x_t = x\big].$$
For the first term, note that $\partial^2_{x_i} f_t = -\gamma^{-2}$ is fixed, so the first covariance vanishes and
$$\Delta v(t,x) = -\frac12\, \mathrm{Cov}\big[(\partial_t I - \partial_t \tilde I)(\nabla f_t - \nabla \tilde f_t),\, \nabla f_t + \nabla \tilde f_t \mid x_t = \tilde x_t = x\big].$$
Here the covariance refers to the expectation of the dot product rather than of the tensor product. Then, using the fact $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$, we know that
$$\|\Delta v(t,x)\| \le \sqrt{\mathbb{E}\big[\|(\partial_t I - \partial_t \tilde I)(\nabla f_t - \nabla \tilde f_t)^T\|_F^2 \mid x_t = \tilde x_t = x\big]}\, \sqrt{\mathbb{E}\big[\|\nabla f_t + \nabla \tilde f_t\|^2 \mid x_t = \tilde x_t = x\big]} \qquad \text{(Cauchy-Schwarz inequality)}$$
$$\lesssim \big[\mathbb{E}[\|\partial_t I - \partial_t \tilde I\|^4 \mid x_t = \tilde x_t = x]\big]^{1/4}\, \big[\mathbb{E}[\|\nabla f_t - \nabla \tilde f_t\|^4 \mid x_t = \tilde x_t = x]\big]^{1/4}\, \sqrt{\mathbb{E}[\|\nabla f_t\|^2 \mid x_t = x]} \qquad \text{(Cauchy-Schwarz and symmetry)}$$
$$\lesssim \big[\mathbb{E}[\|\partial_t I\|^4 \mid x_t = x]\big]^{1/4}\, \sqrt{\mathbb{E}[\|\nabla f_t\|^4 \mid x_t = x]}. \qquad \text{(by symmetry)}$$
Therefore,
$$\mathbb{E}\|\Delta v(t,x_t)\|^2 \lesssim \mathbb{E}\Big[\sqrt{\mathbb{E}[\|\partial_t I\|^4 \mid x_t]}\; \mathbb{E}[\|\nabla f_t\|^4 \mid x_t]\Big] \le \sqrt{\mathbb{E}\big[\mathbb{E}[\|\partial_t I\|^4 \mid x_t]\big]}\, \sqrt{\mathbb{E}\big[\mathbb{E}[\|\nabla f_t\|^4 \mid x_t]^2\big]} \qquad \text{(Cauchy-Schwarz inequality)}$$
$$\le \sqrt{\mathbb{E}\|\partial_t I\|^4}\, \sqrt{\mathbb{E}\|\nabla f_t\|^8} \lesssim \sqrt{\mathbb{E}\|x_0 - x_1\|^4}\, \gamma^{-4}\sqrt{\mathbb{E}\|z\|^8} \lesssim \gamma^{-4} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4} \qquad \text{(Jensen's inequality)}$$
$$\lesssim \gamma^{-2} d\, \mathbb{E}\|x_0 - x_1\|^4 + \gamma^{-4} d^2.$$

Lemma B.9. We have $\mathbb{E}\|\Delta s(t,x_t)\|^2 \lesssim \gamma^{-6} d^3$ for $t \in (0,1)$.

Proof. Recall that
$$\nabla s(t,x) = -\gamma^{-2} I + \mathrm{Cov}(\nabla f_t, \nabla f_t \mid x_t = x).$$
Hence, with calculations and notation similar to the proof of Lemma B.8, we can deduce that
$$\Delta s(t,x) = -\frac12\, \mathrm{Cov}\big[(\nabla f_t - \nabla \tilde f_t)(\nabla f_t - \nabla \tilde f_t)^T,\, \nabla f_t - \nabla \tilde f_t \mid x_t = \tilde x_t = x\big].$$
Then, by Hölder's inequality,
$$\|\Delta s(t,x)\| \lesssim \big[\mathbb{E}[\|(\nabla f_t - \nabla \tilde f_t)(\nabla f_t - \nabla \tilde f_t)^T\|_F^{3/2} \mid x_t = \tilde x_t = x]\big]^{2/3}\, \big[\mathbb{E}[\|\nabla f_t - \nabla \tilde f_t\|^3 \mid x_t = \tilde x_t = x]\big]^{1/3}$$
$$\lesssim \mathbb{E}[\|\nabla f_t - \nabla \tilde f_t\|^3 \mid x_t = \tilde x_t = x] \lesssim \mathbb{E}[\|\nabla f_t\|^3 \mid x_t = x]. \qquad \text{(by symmetry)}$$
Hence, by Jensen's inequality,
$$\mathbb{E}\|\Delta s(t,x_t)\|^2 \lesssim \mathbb{E}\big[\mathbb{E}[\|\nabla f_t\|^3 \mid x_t = x]^2\big] \le \gamma^{-6}\, \mathbb{E}\|z\|^6 \lesssim \gamma^{-6} d^3.$$

C. Omitted Proofs in Sections 4 and 5

C.1.
Bounds along the Forward Path

Recall the forward ODE $dX_t = b(t, X_t)\,dt$ and the forward SDE $dX^F_t = b^F(t, X^F_t)\,dt + \sqrt{2\epsilon}\, dW_t$; their solutions are denoted by $X_t$ and $X^F_t$, respectively. Using the chain rule or Itô's formula, for a function $f(t,x)$ that is twice continuously differentiable, we have
$$df(t, X_t) = \big[\partial_t f(t, X_t) + \nabla f(t, X_t) \cdot b(t, X_t)\big]\, dt$$
and
$$df(t, X^F_t) = \big[\partial_t f(t, X^F_t) + \nabla f(t, X^F_t) \cdot b^F(t, X^F_t) + \epsilon \Delta f\big]\, dt + \sqrt{2\epsilon}\, \nabla f(t, X^F_t) \cdot dW_t.$$
With the above formulas, we can now provide the following bound on the discretization error.

Lemma C.1. For $0 < t_0 \le t_1 < 1$, suppose $\epsilon = O(1)$. Then
$$\mathbb{E}\|v(t_0, X^F_{t_0}) - v(t_1, X^F_{t_1})\|^2 \lesssim (t_1 - t_0)^2\Big[M_2 + \gamma_{\min}^{-6} d^3 + \gamma_{\min}^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \epsilon (t_1 - t_0)\, \gamma_{\min}^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}$$
and
$$\mathbb{E}\|s(t_0, X^F_{t_0}) - s(t_1, X^F_{t_1})\|^2 \lesssim (t_1 - t_0)^2\Big[\gamma_{\min}^{-4} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_{\min}^{-6} d^3\Big] + \epsilon (t_1 - t_0)\, \gamma_{\min}^{-4} d^2.$$
Here we denote $\gamma_{\min} = \min_{u \in [t_0, t_1]} \gamma(u)$.

Proof. According to the formula
$$dv(t, X^F_t) = \big[\partial_t v(t, X^F_t) + \nabla v(t, X^F_t) \cdot b^F(t, X^F_t) + \epsilon \Delta v(t, X^F_t)\big]\, dt + \sqrt{2\epsilon}\, \nabla v(t, X^F_t) \cdot dW_t,$$
we know that
$$\mathbb{E}\|v(t_1, X^F_{t_1}) - v(t_0, X^F_{t_0})\|^2 \le 4\,\mathbb{E}\Big\|\int_{t_0}^{t_1} \partial_u v(u, X^F_u)\, du\Big\|^2 + 4\,\mathbb{E}\Big\|\int_{t_0}^{t_1} \nabla v(u, X^F_u) \cdot b^F(u, X^F_u)\, du\Big\|^2 + 4\,\mathbb{E}\Big\|\int_{t_0}^{t_1} \epsilon \Delta v(u, X^F_u)\, du\Big\|^2 + 4\,\mathbb{E}\Big\|\int_{t_0}^{t_1} \sqrt{2\epsilon}\, \nabla v(u, X^F_u) \cdot dW_u\Big\|^2.$$
For the first three terms, by Jensen's inequality, for any function $Y$ we have
$$\Big\|\int_{t_0}^{t_1} Y(u)\, du\Big\|^2 \le (t_1 - t_0) \int_{t_0}^{t_1} \|Y(u)\|^2\, du.$$
For the last term, using Itô's isometry (Le Gall 2016, Equation 5.8), we get
$$\mathbb{E}\Big\|\int_{t_0}^{t_1} \sqrt{2\epsilon}\, \nabla v(u, X^F_u) \cdot dW_u\Big\|^2 = \int_{t_0}^{t_1} 2\epsilon\, \mathbb{E}\|\nabla v(u, X^F_u)\|_F^2\, du.$$
Therefore, we can use Fubini's theorem to change the order of expectation and integral, and combine the results of Lemmas B.4, B.5, B.8 and B.3 and Assumption 4.1 to get
$$\mathbb{E}\|v(t_1, X^F_{t_1}) - v(t_0, X^F_{t_0})\|^2 \lesssim (t_1 - t_0)\int_{t_0}^{t_1} \mathbb{E}\|\partial_u v(u, X^F_u)\|^2\, du + (t_1 - t_0)\int_{t_0}^{t_1} \sqrt{\mathbb{E}\|\nabla v(u, X^F_u)\|_F^4}\, \sqrt{\mathbb{E}\|b^F(u, X^F_u)\|^4}\, du + \epsilon^2 (t_1 - t_0)\int_{t_0}^{t_1} \mathbb{E}\|\Delta v(u, X^F_u)\|^2\, du + \int_{t_0}^{t_1} 2\epsilon\, \mathbb{E}\|\nabla v(u, X^F_u)\|_F^2\, du$$
$$\lesssim (t_1 - t_0)^2\Big[M_2 + \gamma_{\min}^{-6} d^3 + \gamma_{\min}^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \epsilon (t_1 - t_0)\, \gamma_{\min}^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}.$$
Note that we have already used the conditions $\gamma^2 \in C^2[0,1]$ and $\gamma\dot\gamma = O(1)$.
Similarly, using Lemmas B.6, B.7, B.9 and B.3 together with the formula
$$ds(t, X^F_t) = \big[\partial_t s(t, X^F_t) + \nabla s(t, X^F_t) \cdot b^F(t, X^F_t) + \epsilon \Delta s(t, X^F_t)\big]\, dt + \sqrt{2\epsilon}\, \nabla s(t, X^F_t) \cdot dW_t,$$
we obtain
$$\mathbb{E}\|s(t_1, X^F_{t_1}) - s(t_0, X^F_{t_0})\|^2 \lesssim (t_1 - t_0)^2\Big[\gamma_{\min}^{-4} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_{\min}^{-6} d^3\Big] + \epsilon (t_1 - t_0)\, \gamma_{\min}^{-4} d^2,$$
where $\gamma_{\min} = \min_{u \in [t_0, t_1]} \gamma(u)$.

C.2. Proof of Theorem 4.3

We first state the following proposition, which is a result from (Chen et al., 2023d).

Proposition C.2. (Section 5.2 of (Chen et al., 2023d)) Let $P, Q$ be the path measures of the solutions of SDEs (3) and (6), where both start from the same distribution $\rho(t_0)$ at time $t = t_0$ and end at time $t = t_N$. If $\mathbb{E}[\|b^F(t, X^F_t) - \hat b^F(t_k, X^F_{t_k})\|^2] \le C$ for any $t \in [t_0, t_N]$ and some constant $C$, then
$$\mathrm{KL}(P \,\|\, Q) = \frac{1}{4\epsilon} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}\big[\|b^F(t, X^F_t) - \hat b^F(t_k, X^F_{t_k})\|^2\big]\, dt.$$
Here the expectations are taken over the ground-truth forward process $(X^F_t)_{t \in [t_0, t_N]} \sim P$.

Now, using the above proposition, we are ready to prove Theorem 4.3.

Proof. Let $P, Q$ be the path measures of the solutions to SDEs (3) and (6), where the solutions start from the same distribution $\rho(t_0)$ at time $t = t_0$, as in Proposition C.2. We first check the condition of Proposition C.2. Note that
$$\mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t, X^F_t)\|^2 \overset{(a)}{\le} 2\,\mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t_k, X^F_{t_k})\|^2 + 2\,\mathbb{E}\|b^F(t_k, X^F_{t_k}) - b^F(t, X^F_t)\|^2$$
$$\overset{(b)}{\le} 2\,\mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t_k, X^F_{t_k})\|^2 + 4\,\mathbb{E}\|v(t_k, X^F_{t_k}) - v(t, X^F_t)\|^2 + 4\,\mathbb{E}\|(\dot\gamma(t_k)\gamma(t_k) + \epsilon)\, s(t_k, X^F_{t_k}) - (\dot\gamma(t)\gamma(t) + \epsilon)\, s(t, X^F_t)\|^2$$
$$\overset{(c)}{\le} 2\,\mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t_k, X^F_{t_k})\|^2 + 4\,\mathbb{E}\|v(t_k, X^F_{t_k}) - v(t, X^F_t)\|^2 + 8(\dot\gamma(t_k)\gamma(t_k) + \epsilon)^2\, \mathbb{E}\|s(t_k, X^F_{t_k}) - s(t, X^F_t)\|^2 + 8(\dot\gamma(t)\gamma(t) - \dot\gamma(t_k)\gamma(t_k))^2\, \mathbb{E}\|s(t, X^F_t)\|^2.$$
Here (a), (b) and (c) use the triangle inequality and the fact $(a+b)^2 \le 2a^2 + 2b^2$. By Lemmas B.3 and C.1, this quantity is uniformly bounded on the closed interval $[t_0, t_N]$.
In fact, we can apply these lemmas to obtain
$$\mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t, X^F_t)\|^2 \overset{(a)}{\lesssim} \mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t_k, X^F_{t_k})\|^2 + (t - t_k)^2\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \epsilon (t - t_k)\, \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + (t - t_k)^2\Big[\gamma_k^{-4} d^2 \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-6} d^3\Big] + \epsilon (t - t_k)\, \gamma_k^{-4} d^2 + (t - t_k)^2\, \gamma_k^{-2} d$$
$$\overset{(b)}{\lesssim} \mathbb{E}\|\hat b^F(t_k, X^F_{t_k}) - b^F(t_k, X^F_{t_k})\|^2 + (t - t_k)^2\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \epsilon (t - t_k)\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big].$$
Here step (a) directly expands the discretization error using Lemmas B.3 and C.1; step (b) simplifies the terms by applying Young's inequality and the fact $1 + \epsilon^2 = O(1)$. Then, by Proposition C.2,
$$\mathrm{KL}(P \,\|\, Q) = \frac{1}{4\epsilon} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}\big[\|b^F(t, X^F_t) - \hat b^F(t_k, X^F_{t_k})\|^2\big]\, dt$$
$$\overset{(a)}{\lesssim} \epsilon^{-1}\varepsilon_{b^F}^2 + \epsilon^{-1}\sum_{k=0}^{N-1} (t_{k+1} - t_k)^3\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \sum_{k=0}^{N-1} (t_{k+1} - t_k)^2\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big].$$
Here step (a) simply integrates the upper bound on the discretization error. Now consider $\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N))$. Let $\hat Q$ be the path measure of solutions of (6) starting from $\hat\rho(t_0)$ instead of $\rho(t_0)$. Then,
$$\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N)) \le \mathrm{KL}(P \,\|\, \hat Q) = \mathbb{E}_P\Big[\log\frac{dP}{dQ}(X)\Big] + \mathbb{E}_P\Big[\log\frac{d\rho(t_0)}{d\hat\rho(t_0)}(X_{t_0})\Big] = \mathrm{KL}(P \,\|\, Q) + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)).$$
The proof is then complete.

C.3. Proof of Proposition 5.1

Proof. Using the result of Theorem 4.3,
$$\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N)) \overset{(a)}{\lesssim} \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + \epsilon^{-1}\sum_{k=0}^{N-1} (t_{k+1} - t_k)^3\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \sum_{k=0}^{N-1} (t_{k+1} - t_k)^2\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big]$$
$$\overset{(b)}{\lesssim} \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + \epsilon^{-1}\sum_{k=0}^{N-1}\Big[M_2 h_k^3 + h^3 d^3 + h\, h_k^2\, d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \sum_{k=0}^{N-1} h\, h_k\Big[d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + h d^2\Big]$$
$$\overset{(c)}{\lesssim} \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + \epsilon^{-1} h^2\Big[M_2 + d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \epsilon^{-1} N h^3 d^3 + h\, d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + N h^2 d^2$$
$$\overset{(d)}{\lesssim} \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + h\, d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + N h^2 d^2.$$
Here step (a) is the result of Theorem 4.3; step (b) uses the fact $h_k = t_{k+1} - t_k = O(h\gamma_k^2)$; step (c) uses the fact $\sum_{k=0}^{N-1} h_k = t_N - t_0 \le 1$; step (d) omits the higher-order terms.

C.4. Proof of Corollary 5.2

Proof.
When the number of steps is $N$, we have $h = \Theta\big(N^{-1}\log\frac{1}{t_0(1 - t_N)}\big)$. Then, by Proposition 5.1 and the assumptions,
$$\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N)) \lesssim \varepsilon^2 + N^{-1}\Big[d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}\, \log\frac{1}{t_0(1 - t_N)} + d^2 \log^2\frac{1}{t_0(1 - t_N)}\Big].$$
This gives the complexity needed to make $\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N)) \lesssim \varepsilon^2$.

C.5. Reducing to Diffusion Models

By modifying the definition of the stochastic interpolant to $x_t = I(t, x_1) + \gamma(t) z$ and changing the condition on $I$ to $\|\partial_t I(t, x_1)\| \le C\|x_1\|$, we can repeat the previous analysis while replacing $\sqrt{\mathbb{E}\|x_0 - x_1\|^p}$ with $\sqrt{\mathbb{E}\|x_1\|^p}$. For the case of diffusion models, we can choose $I(t, x_1) = t x_1$ and $\gamma(t) = \sqrt{1 - t^2}$ to obtain a process with the same marginal distributions. Moreover, under this definition of interpolants, we can choose $t_0 = 0$ and $h_k = t_{k+1} - t_k \lesssim (1 - t_k)$ as the time schedule to recover the sample complexity of diffusion models.

C.6. Omitted Proofs for $\gamma^2(t) = (1-t)^2 t$

In this section, we design a schedule for $\gamma^2(t) = (1-t)^2 t$ and provide the corresponding complexity deduced using Theorem 4.3. Moreover, we also derive the complexity of a uniform schedule for comparison.

Corollary C.3. For $\gamma^2(t) = (1-t)^2 t$, there exists a schedule such that, under the same assumptions as Corollary 5.2, the complexity is given by
$$O\Big(\varepsilon^{-2}\Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4}\; d\Big(\frac{1}{\sqrt{1 - t_N}} + \log\frac{1}{t_0}\Big) + d^2\Big(\frac{1}{(1 - t_N)^2} + \log^2\frac{1}{t_0}\Big)\Big]\Big).$$
In addition, the complexity for using a uniform schedule is
$$O\Big(\varepsilon^{-2}\Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4}\; d\Big(\frac{1}{1 - t_N} + \log\frac{1}{t_0}\Big) + d^2\Big(\frac{1}{(1 - t_N)^3} + \frac{1}{t_0}\Big)\Big]\Big).$$

Proof. Here we take $t_M = 0.5$ for some $M > 0$. Then, we define
$$t_{k+1} - t_k = \begin{cases} h_A\, t_{k+1}, & k < M, \\ h_B\, (1 - t_k)^{1.5}, & k \ge M, \end{cases}$$
for some $h_A \in [0, 0.5)$, $h_B \in [0, 1)$. For the part $k < M$, where $t_k \in [0, 0.5)$, we have $\gamma^2(t_k) = \Theta(t_k)$, so it is the same as what we have discussed for the previous case, and we need
$$M = N_1 = O\Big(\varepsilon^{-2}\Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4}\; d \log\frac{1}{t_0} + d^2 \log^2\frac{1}{t_0}\Big]\Big)$$
steps to make the discretization error
$$\epsilon^{-1}\sum_{k=0}^{M-1} (t_{k+1} - t_k)^3\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] + \sum_{k=0}^{M-1} (t_{k+1} - t_k)^2\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big] \lesssim \varepsilon^2.$$
For the part $k \ge M$,
$$\sum_{k=M}^{N-1} (t_{k+1} - t_k)^3\Big[M_2 + \gamma_k^{-6} d^3 + \gamma_k^{-2} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^8}\Big] = O(h_B^2),$$
and since $h_k = \Theta(h_B\, \gamma_k^{1.5}) = \Theta(h_B (1 - t_k)^{1.5})$ (used in step (a) below),
$$\sum_{k=M}^{N-1} h_k^2\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big] \overset{(a)}{\lesssim} h_B \sum_{k=M}^{N-1} h_k\Big[d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}\, \gamma_k^{-0.5} + \gamma_k^{-2.5} d^2\Big]$$
$$\overset{(b)}{\lesssim} h_B\Big[d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} \int_{0.5}^{t_N} (1 - s)^{-0.5}\, ds + d^2 \int_{0.5}^{t_N} (1 - s)^{-2.5}\, ds\Big] \lesssim h_B\Big[d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \frac{d^2}{(1 - t_N)^{1.5}}\Big].$$
Here inequality (b) uses the fact that $\gamma_k = \Theta(1 - t)$ for $t \in [t_k, t_{k+1}]$. Now, we want to compute the number of steps $N_2 = N - M$ for the part $k \ge M$. Note that if $t_k = 1 - 2^{-p}$, it takes $O(2^{p/2} h_B^{-1})$ more steps to reach $1 - 2^{-p-1}$. Hence $N_2 = O\big(h_B^{-1}(1 - t_N)^{-0.5}\big)$, so we need to take $h_B = \Theta\big(N^{-1}(1 - t_N)^{-0.5}\big)$. Therefore,
$$\sum_{k=M}^{N-1} h_k^2\, \gamma_k^{-2} d\, \Big[\sqrt{\mathbb{E}\|x_0 - x_1\|^4} + \gamma_k^{-2} d\Big] \lesssim N^{-1}\Big[\frac{d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}}{\sqrt{1 - t_N}} + \frac{d^2}{(1 - t_N)^2}\Big].$$
Thus, for the part $k \ge M$, we need
$$N - M = N_2 = O\Big(\varepsilon^{-2}\Big[\frac{d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}}{\sqrt{1 - t_N}} + \frac{d^2}{(1 - t_N)^2}\Big]\Big)$$
steps to make the discretization error bounded by $O(\varepsilon^2)$. Hence, the overall complexity is given by $N = N_1 + N_2$, which is our result.

If we use a uniform schedule, by Theorem 4.3 and the fact $\gamma^2(t) = \Theta(\min\{t, (1-t)^2\})$, we can bound
$$\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N)) \overset{(a)}{\lesssim} \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + N^{-1} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}\Big[\int_{t_0}^{0.5} s^{-1}\, ds + \int_{0.5}^{t_N} (1 - s)^{-2}\, ds\Big] + N^{-1} d^2\Big[\int_{t_0}^{0.5} s^{-2}\, ds + \int_{0.5}^{t_N} (1 - s)^{-4}\, ds\Big]$$
$$\lesssim \varepsilon_{b^F}^2 + \mathrm{KL}(\rho(t_0) \,\|\, \hat\rho(t_0)) + N^{-1} d\, \sqrt{\mathbb{E}\|x_0 - x_1\|^4}\Big[\frac{1}{1 - t_N} + \log\frac{1}{t_0}\Big] + N^{-1} d^2\Big[\frac{1}{(1 - t_N)^3} + \frac{1}{t_0}\Big],$$
which further gives the complexity bound for the uniform schedule. Here inequality (a) applies Theorem 4.3 and replaces $\gamma_k$ with a term of the same order. This bound is worse than the one obtained with the schedule designed above.

D. More Details of Numerical Experiments

To parameterize the estimator $\hat b^F(t, x)$ for two-dimensional data, we utilize a simple multilayer perceptron (MLP) network. The input of the network is the three-dimensional vector $(x, t)$, and its output is the two-dimensional vector $\hat b^F(t, x)$. The MLP architecture consists of three hidden layers, each with 256 neurons, followed by ReLU activation functions (Nair & Hinton, 2010).
To train the estimator $\hat b^F(t, x)$, we leverage a simple quadratic objective (see Appendix A for details) whose minimizer is the real drift $b^F(t, x)$. Given the estimator, data batches, and sampled time points, we can compute an empirical loss. We employ the Adam optimizer (Kingma & Ba, 2015) to train the network using gradients computed on the empirical loss. We set $t_0 = 0.001$ and $t_N = 0.999$ to ensure that the initial density $\rho(t_0)$ is close to $\rho_0$ and the estimated density $\rho(t_N)$ closely approximates $\rho_1$. During the training of the estimator $\hat b^F(t, x)$, we employ an importance sampling technique so that $t \in [t_0, t_N]$ is sampled with probability proportional to $\gamma^{-2}(t)$.

We implement the discretized sampler as defined in Equation (7). We use more than 10,000 data samples to empirically visualize the densities in Figures 1, 2a and 2b. In addition, for KL divergence estimation, we estimate the density ratio by comparing distances to the $k$-th nearest neighbor. Specifically, for some $x \in \mathbb{R}^d$, let $\{x_0^i\}_{i=1}^n$ and $\{x_1^j\}_{j=1}^m$ be i.i.d. samples from $\rho_0$ and $\rho_1$, and let $d_0, d_1$ be the corresponding distances from $x$ to its $k$-th nearest neighbor in each sample set. We then estimate
$$\frac{\rho_0(x)}{\rho_1(x)} \approx \frac{k/(n\, d_0^d)}{k/(m\, d_1^d)} = \frac{m\, d_1^d}{n\, d_0^d}.$$

Figure 5. Estimated KL divergence for different step size schedules, where we use $\gamma^2(t) = (1-t)^2 t$. The red curve denotes the distance when we use the schedule designed in Appendix C.6, while the green curve denotes the distance when we use the uniform schedule.

D.1. Additional Experiments for $\gamma^2(t) = (1-t)^2 t$

We implement the schedule discussed in Appendix C.6 and compare it to the uniform schedule. We choose $(t_0, t_N) = (0.001, 0.97)$ since $\gamma^2(t)$ is $\Theta((1-t)^2)$ near $t = 1$. We choose $\rho_0$ as the "checkerboard" density and $\rho_1$ as the "spiral" density. We estimate the KL divergence $\mathrm{KL}(\rho(t_N) \,\|\, \hat\rho(t_N))$ to indicate how close the estimated distribution is to the target distribution.
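The nearest-neighbor density-ratio idea above extends to a standard sample-based KL estimator (in the style of Wang, Kulkarni, and Verdú); the following sketch is our own illustration, not the paper's exact implementation.

```python
import numpy as np

def knn_kl(xs_p, xs_q, k=5):
    """k-NN estimate of KL(p || q) from samples, using the density-ratio
    idea above: p(x) is roughly k / (n * r_p(x)^d) up to a constant, where
    r_p(x) is the distance from x to its k-th nearest neighbor in the sample."""
    n, d = xs_p.shape
    m = xs_q.shape[0]
    total = 0.0
    for x in xs_p:
        # Index k skips the zero self-distance in xs_p; index k-1 is the
        # k-th nearest neighbor in the other sample set.
        dp = np.sort(np.linalg.norm(xs_p - x, axis=1))[k]
        dq = np.sort(np.linalg.norm(xs_q - x, axis=1))[k - 1]
        total += d * np.log(dq / dp)
    return total / n + np.log(m / (n - 1))
```

On two samples from the same distribution the estimate hovers near zero, while a genuine distribution shift produces a clearly positive value, which matches how the estimated curves in Figure 3 behave.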
The comparison is shown in Figure 5.
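The schedule used in these additional experiments follows Appendix C.6. A sketch of its construction (the step-size constants `hA` and `hB` and the function name are our own choices):

```python
import numpy as np

def c6_schedule(t0=1e-3, tN=0.97, hA=0.1, hB=0.05):
    """Piecewise schedule from Appendix C.6 for gamma^2(t) = (1 - t)^2 t:
    t_{k+1} - t_k = hA * t_{k+1}        while t_k < 0.5 (geometric growth),
    t_{k+1} - t_k = hB * (1 - t_k)^1.5  afterwards      (decay toward t = 1)."""
    ts = [t0]
    while ts[-1] < 0.5:
        ts.append(ts[-1] / (1.0 - hA))   # solves t' - t = hA * t'
    while ts[-1] < tN:
        ts.append(min(ts[-1] + hB * (1.0 - ts[-1]) ** 1.5, tN))
    return np.array(ts)
```

Early steps grow by a constant factor $1/(1-h_A)$ per iteration, while late steps shrink like $(1-t)^{1.5}$, matching the two regimes analyzed in the proof of Corollary C.3.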