# Variational Schrödinger Diffusion Models

Wei Deng * 1, Weijian Luo * 2, Yixin Tan * 3, Marin Biloš 1, Yu Chen 1, Yuriy Nevmyvaka 1, Ricky T. Q. Chen 4

Schrödinger bridge (SB) has emerged as the go-to method for optimizing transportation plans in diffusion models. However, SB requires estimating the intractable forward score functions, inevitably resulting in the costly implicit training loss based on simulated trajectories. To improve the scalability while preserving efficient transportation plans, we leverage variational inference to linearize the forward score functions (variational scores) of SB and restore simulation-free properties in training backward scores. We propose the variational Schrödinger diffusion model (VSDM), where the forward process is a multivariate diffusion and the variational scores are adaptively optimized for efficient transport. Theoretically, we use stochastic approximation to prove the convergence of the variational scores and show the convergence of the adaptively generated samples based on the optimal variational scores. Empirically, we test the algorithm in simulated examples and observe that VSDM is efficient in generations of anisotropic shapes and yields straighter sample trajectories compared to the single-variate diffusion. We also verify the scalability of the algorithm on real-world data and achieve competitive unconditional generation performance on CIFAR10 and conditional generation in time series modeling. Notably, VSDM no longer depends on warm-up initializations and has become tuning-friendly in training large-scale experiments.

*Equal contribution. 1 Machine Learning Research, Morgan Stanley, NY; 2 Peking University; 3 Duke University; 4 Meta AI (FAIR). Correspondence to: Wei Deng. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Diffusion models have showcased remarkable proficiency across diverse domains, spanning large-scale generations of image, video, and audio, conditional text-to-image tasks, and adversarial defenses (Dhariwal & Nichol, 2022; Ho et al., 2022; Kong et al., 2021; Ramesh et al., 2022; Zhang et al., 2024). The key to their scalability lies in the closed-form updates of the forward process, highlighting both statistical efficiency (Koehler et al., 2023) and diminished dependence on dimensionality (Vono et al., 2022). Nevertheless, diffusion models lack a distinct guarantee of optimal transport (OT) properties (Lavenant & Santambrogio, 2022) and often necessitate costly evaluations to generate higher-fidelity content (Ho et al., 2020; Salimans & Ho, 2022; Lu et al., 2022; Xue et al., 2023; Luo, 2023).

Alternatively, the Schrödinger bridge (SB) problem (Léonard, 2014; Chen & Georgiou, 2016; Pavon et al., 2021; Caluya & Halder, 2022; De Bortoli et al., 2021), initially rooted in quantum mechanics (Léonard, 2014), proposes optimizing a stochastic control objective through the use of forward-backward stochastic differential equations (FBSDEs) (Chen et al., 2022b). The alternating solver gives rise to the iterative proportional fitting (IPF) algorithm (Kullback, 1968; Rüschendorf, 1995) in dynamic optimal transport (Villani, 2003; Peyré & Cuturi, 2019). Notably, the intractable forward score function plays a crucial role in providing theoretical guarantees in optimal transport (Chen et al., 2023c; Deng et al., 2024).
However, it simultaneously sacrifices the simulation-free property and largely relies on warm-up checkpoints for conducting large-scale experiments (De Bortoli et al., 2021; Chen et al., 2022b). A natural follow-up question arises:

Can we train diffusion models with efficient transport?

To this end, we introduce the variational Schrödinger diffusion model (VSDM). Employing variational inference (Blei et al., 2017), we perform a locally linear approximation of the forward score function and denote it by the variational score. The resulting linear forward stochastic differential equations (SDEs) naturally provide a closed-form update, significantly enhancing scalability. Compared to the single-variate score-based generative model (SGM), VSDM is a multivariate diffusion (Singhal et al., 2023). Moreover, hyperparameters are adaptively optimized for more efficient transportation plans within the Schrödinger bridge framework (Chen et al., 2022b).

Theoretically, we leverage stochastic approximation (Robbins & Monro, 1951) to demonstrate the convergence of the variational score to the optimal local estimators. Although the global transport optimality is compromised, the notable simulation-free speed-ups in training the backward score render the algorithm particularly attractive for training various generation tasks from scratch. Additionally, the efficiency of simulation-based training for the linearized variational score significantly improves owing to computational advancements in convex optimization. We validate the strength of VSDM through simulations, achieving compelling performance on standard image generation tasks. Our contributions unfold in four key aspects:

- We introduce the variational Schrödinger diffusion model (VSDM), a multivariate diffusion with optimal variational scores guided by optimal transport. Additionally, the training of backward scores is simulation-free and becomes much more scalable.
- We study the convergence of the variational score using stochastic approximation (SA) theory, which can be further generalized to a class of state-space diffusion models for future developments.
- VSDM is effective in generating data of anisotropic shapes and motivates straighter transportation paths via the optimized transport.
- VSDM achieves competitive unconditional generation on CIFAR10 and conditional generation in time series modeling without reliance on warm-up initializations.

2. Related Works

Flow Matching and Beyond. Lipman et al. (2023) utilized the McCann displacement interpolation (McCann, 1997) to train simulation-free CNFs to encourage straight trajectories. Consequently, Pooladian et al. (2023); Tong et al. (2023) proposed straightening by using minibatch optimal transport solutions. Similar ideas were achieved by Liu (2022); Liu et al. (2023) to iteratively rectify the interpolation path. Albergo & Vanden-Eijnden (2023); Albergo et al. (2023) developed the stochastic interpolant approach to unify both flow and diffusion models. However, straighter transport maps may not imply optimal transportation plans in general and the couplings are still not effectively optimized.

Dynamic Optimal Transport. Finlay et al. (2020); Onken et al. (2021) introduced additional regularization through optimal transport to enforce straighter trajectories in CNFs and reduce the computational cost.
De Bortoli et al. (2021); Chen et al. (2022b); Vargas et al. (2021) studied the dynamic Schrödinger bridge with guarantees in entropic optimal transport (EOT) (Chen et al., 2023c); Shi et al. (2023); Peluchetti (2023); Chen et al. (2023b) generalized bridge matching and flow matching based on EOT and obtained smoother trajectories; however, scalability remains a significant concern for Schrödinger-based diffusions.

3. Preliminaries

3.1. Diffusion Models

Score-based generative models (SGMs) (Ho et al., 2020; Song et al., 2021b) first employ a forward process (1a) to map data to an approximate Gaussian and subsequently reverse the process in Eq.(1b) to recover the data distribution:

$$\mathrm{d}\vec{x}_t = f_t(\vec{x}_t)\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t, \tag{1a}$$
$$\mathrm{d}\overleftarrow{x}_t = \big[f_t(\overleftarrow{x}_t) - \beta_t \nabla\log\rho_t(\overleftarrow{x}_t)\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\overleftarrow{w}_t, \tag{1b}$$

where $\vec{x}_t, \overleftarrow{x}_t \in \mathbb{R}^d$; $\vec{x}_0 \sim \rho_{\text{data}}$ and $\overleftarrow{x}_T \sim \rho_{\text{prior}}$; $f_t$ denotes the vector field and is often set to $0$ (a.k.a. VE-SDE) or linear in $x$ (a.k.a. VP-SDE); $\beta_t > 0$ is the time-varying scalar; $w_t$ is a forward Brownian motion for $t \in [0, T]$ with $\rho_T \approx \rho_{\text{prior}}$; $\overleftarrow{w}_t$ is a backward Brownian motion from time $T$ to $0$. The marginal density $\rho_t$ of the forward process (1a) is essential for generating the data but remains inaccessible in practice due to intractable normalizing constants.

Explicit Score Matching (ESM). Instead, the conditional score function $\nabla\log\rho_{t|0}(\cdot) \equiv \nabla\log\rho_t(\cdot\,|\,\vec{x}_0)$ is estimated by minimizing a user-friendly ESM loss (weighted by $\lambda_t$) between the score estimator $s_t \equiv s_\theta(\cdot, t)$ and the exact score (Song et al., 2021b):

$$\mathbb{E}_t\Big\{\lambda_t\,\mathbb{E}_{\vec{x}_0}\mathbb{E}_{\vec{x}_t|\vec{x}_0}\big[\|s_t(\vec{x}_t) - \nabla\log\rho_{t|0}(\vec{x}_t)\|_2^2\big]\Big\}. \tag{2}$$

Notably, both VP- and VE-SDEs yield closed-form expressions for any $\vec{x}_t$ given $\vec{x}_0$ in the forward process (Song et al., 2021b), which is instrumental for the scalability of diffusion models in real-world large-scale generation tasks.

Implicit Score Matching (ISM). By integration by parts, ESM is equivalent to the ISM loss (Hyvärinen, 2005; Huang et al., 2021; Luo et al., 2024b) and the evidence lower bound (ELBO) follows

$$\log\rho_0(x_0) \geq \mathbb{E}_{\rho_{T|0}(\cdot)}\big[\log\rho_{T|0}(\vec{x}_T)\big] - \int_0^T \mathbb{E}_{\rho_{t|0}(\cdot)}\Big[\tfrac{1}{2}\beta_t\|s_t\|_2^2 + \nabla\cdot(\beta_t s_t - f_t)\Big]\mathrm{d}t.$$

ISM is naturally connected to Song et al. (2020), which supports flexible marginals and nonlinear forward processes but becomes significantly less scalable compared to ESM.

3.2. Schrödinger Bridge

The dynamic Schrödinger bridge aims to solve a full bridge

$$\inf_{\mathbb{P}\in\mathcal{D}(\rho_{\text{data}},\,\rho_{\text{prior}})} \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{Q}), \tag{3}$$

where $\mathcal{D}(\rho_{\text{data}}, \rho_{\text{prior}})$ is the family of path measures with marginals $\rho_{\text{data}}$ and $\rho_{\text{prior}}$ at $t = 0$ and $t = T$, respectively; $\mathbb{Q}$ is the prior process driven by $\mathrm{d}x_t = f_t(x_t)\mathrm{d}t + \sqrt{2\beta_t\varepsilon}\,\mathrm{d}w_t$. It also yields a stochastic control formulation (Chen et al., 2021; Pavon et al., 2021; Caluya & Halder, 2022):

$$\inf_{u\in\mathcal{U}} \mathbb{E}\Big[\int_0^T \tfrac{1}{2}\|u_t(\vec{x}_t)\|_2^2\,\mathrm{d}t\Big] \quad \text{s.t.}\quad \mathrm{d}\vec{x}_t = \big[f_t(\vec{x}_t) + \sqrt{\beta_t}\,u_t(\vec{x}_t)\big]\mathrm{d}t + \sqrt{2\beta_t\varepsilon}\,\mathrm{d}w_t, \quad \vec{x}_0\sim\rho_{\text{data}},\ \vec{x}_T\sim\rho_{\text{prior}}, \tag{4}$$

where $\mathcal{U}$ is the family of controls. The expectation is taken w.r.t. $\rho_t(\cdot)$, which denotes the PDF of the controlled diffusion (4); $\varepsilon$ is the temperature of the diffusion and the regularizer in EOT (Chen et al., 2023c). Solving the underlying Hamilton-Jacobi-Bellman (HJB) equation and invoking the time reversal (Anderson, 1982) with $\varepsilon = \tfrac{1}{2}$, the Schrödinger system yields the desired forward-backward stochastic differential equations (FBSDEs) (Chen et al., 2022b):

$$\mathrm{d}\vec{x}_t = \big[f_t(\vec{x}_t) + \beta_t \nabla\log\psi_t(\vec{x}_t)\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t, \tag{5a}$$
$$\mathrm{d}\overleftarrow{x}_t = \big[f_t(\overleftarrow{x}_t) - \beta_t \nabla\log\widehat{\varphi}_t(\overleftarrow{x}_t)\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\overleftarrow{w}_t, \tag{5b}$$

where $\psi_t(\cdot)\,\widehat{\varphi}_t(\cdot) = \rho_t(\cdot)$, $\rho_0(\cdot) \equiv \rho_{\text{data}}$, and $\rho_T(\cdot) \equiv \rho_{\text{prior}}$.
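Before turning to solvers for the SB system, the following minimal sketch makes the simulation-free ESM training of Section 3.1 concrete: it draws $\vec{x}_t$ from the closed-form VP-SDE perturbation kernel and forms the denoising regression target. This is an illustration rather than the paper's implementation; the linear $\beta$ schedule and the `score_net` placeholder are assumptions.

```python
import numpy as np

def vp_perturb(x0, t, beta_min=0.1, beta_max=20.0):
    """Closed-form VP-SDE perturbation kernel rho_{t|0} = N(mean, var*I) on t in [0, 1].

    For f_t(x) = -0.5*beta_t*x, the conditional marginal given x0 is Gaussian with
    mean = exp(-0.5 * int_0^t beta_s ds) * x0 and var = 1 - exp(-int_0^t beta_s ds).
    """
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2   # linear schedule (assumed)
    mean = np.exp(-0.5 * int_beta) * x0
    std = np.sqrt(1.0 - np.exp(-int_beta))
    eps = np.random.randn(*x0.shape)
    x_t = mean + std * eps
    score_target = -eps / std      # equals grad log rho_{t|0}(x_t) for the Gaussian kernel
    return x_t, score_target

# ESM/denoising loss for one minibatch, with `score_net` a user-supplied estimator s_theta(x, t):
# loss = mean over (t, x0) of lambda_t * || score_net(x_t, t) - score_target ||^2
```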
To solve the optimal controls (scores) $(\nabla\log\psi, \nabla\log\widehat{\varphi})$, a standard tool is to leverage the nonlinear Feynman-Kac formula (Ma & Yong, 2007; Karatzas & Shreve, 1998; Chen et al., 2022b) to learn a stochastic representation.

Proposition 1 (Nonlinear Feynman-Kac representation). Assume Lipschitz smoothness and linear growth conditions on the drift $f$ and diffusion $g$ in the FB-SDE (5). Define $\vec{y}_t = \log\psi_t(x_t)$ and $\overleftarrow{y}_t = \log\widehat{\varphi}_t(x_t)$. Then the stochastic representation follows

$$\overleftarrow{y}_s = \mathbb{E}\Big[\overleftarrow{y}_T - \int_s^T \Gamma_\zeta(\overleftarrow{z}_t; \vec{z}_t)\,\mathrm{d}t \,\Big|\, \vec{x}_s = x_s\Big], \qquad \Gamma_\zeta(\overleftarrow{z}_t; \vec{z}_t) \equiv \tfrac{1}{2}\|\overleftarrow{z}_t\|_2^2 + \nabla\cdot\big(\sqrt{\beta_t}\,\overleftarrow{z}_t - f_t\big) + \zeta\,\langle \overleftarrow{z}_t, \vec{z}_t\rangle, \tag{6}$$

where $\vec{z}_t = \sqrt{\beta_t}\,\nabla\vec{y}_t$, $\overleftarrow{z}_t = \sqrt{\beta_t}\,\nabla\overleftarrow{y}_t$, and $\zeta = 1$. The first argument of $\Gamma_\zeta$ always carries the quadratic and divergence terms.

4. Variational Schrödinger Diffusion Models

SB outperforms SGMs in the theoretical potential of optimal transport, and an intractable score function $\nabla\log\psi_t(x_t)$ is exploited in the forward SDE for more efficient transportation plans. However, there is no free lunch in achieving such efficiency, and it comes with three notable downsides:

- Solving $\nabla\log\psi_t$ in Eq.(5a) for optimal transport is prohibitively costly and may not be necessary (Marzouk et al., 2016; Liu et al., 2023).
- The nonlinear diffusion no longer yields a closed-form expression of $\vec{x}_t$ given $\vec{x}_0$ (Chen et al., 2022b).
- The ISM loss is inevitable and the estimator suffers from a large-variance issue (Hutchinson, 1989).

4.1. Variational Inference via Linear Approximation

FB-SDEs naturally connect to the alternating-projection solver based on the IPF (a.k.a. Sinkhorn) algorithm, boiling down the full bridge (3) to a half-bridge solver (Pavon et al., 2021; De Bortoli et al., 2021; Vargas et al., 2021). With $\mathbb{P}_1$ given and $k = 1, 2, \ldots$, we have:

$$\mathbb{P}_{2k} := \arg\min_{\mathbb{P}\in\mathcal{D}(\rho_{\text{data}},\,\cdot)} \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}_{2k-1}), \tag{7a}$$
$$\mathbb{P}_{2k+1} := \arg\min_{\mathbb{P}\in\mathcal{D}(\cdot,\,\rho_{\text{prior}})} \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}_{2k}). \tag{7b}$$

More specifically, Chen et al. (2022b) proposed a neural network parameterization to model $(\overleftarrow{z}_t, \vec{z}_t)$ using $(\overleftarrow{z}^\theta_t, \vec{z}^\omega_t)$, where $\theta$ and $\omega$ refer to the model parameters, respectively. Each stage of the half-bridge solver proposes to solve the models alternatingly as follows:

$$\mathcal{L}(\theta) = \int_0^T \mathbb{E}_{\vec{x}_t \sim (5a)}\big[\Gamma_1(\overleftarrow{z}^\theta_t; \vec{z}^\omega_t)\,\big|\,\vec{x}_0 = x_0\big]\mathrm{d}t, \tag{8a}$$
$$\mathcal{L}(\omega) = \int_0^T \mathbb{E}_{\overleftarrow{x}_t \approx (5b)}\big[\Gamma_1(\vec{z}^\omega_t; \overleftarrow{z}^\theta_t)\,\big|\,\overleftarrow{x}_T = x_T\big]\mathrm{d}t, \tag{8b}$$

where $\Gamma_1$ is defined in Eq.(6), and $\sim$ (resp. $\approx$) denotes the exact (resp. approximate, neural-network-parametrized) simulation. However, solving the backward score in Eq.(8a) through simulations, akin to the ISM loss, is computationally demanding and affects the scalability in generative models.

To motivate the simulation-free property, we leverage variational inference (Blei et al., 2017) and study a linear approximation of the forward score $\nabla\log\psi(x, t) \approx A_t x$ with $f_t(\vec{x}_t) \equiv -\tfrac{1}{2}\beta_t\vec{x}_t$, which ends up with the variational FB-SDE (VFB-SDE):

$$\mathrm{d}\vec{x}_t = \big[-\tfrac{1}{2}\beta_t\vec{x}_t + \beta_t A_t\vec{x}_t\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t, \tag{9a}$$
$$\mathrm{d}\overleftarrow{x}_t = \big[-\tfrac{1}{2}\beta_t\overleftarrow{x}_t + \beta_t A_t\overleftarrow{x}_t - \beta_t\nabla\log\vec{\rho}_t(\overleftarrow{x}_t)\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\overleftarrow{w}_t, \tag{9b}$$

where $t \in [0, T]$ and $\nabla\log\vec{\rho}_t$ is the score function of (9a); the conditional version will be derived in Eq.(15). The half-bridge solver is restricted to a class of OU processes $\mathcal{OU}(\rho_{\text{data}}, \cdot)$ with the initial marginal $\rho_{\text{data}}$:

$$\arg\min_{\mathbb{P}\in\mathcal{D}(\rho_{\text{data}},\,\cdot)} \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}_{2k-1}) \ \approx\ \arg\min_{\widehat{\mathbb{P}}\in\mathcal{OU}(\rho_{\text{data}},\,\cdot)} \mathrm{KL}(\widehat{\mathbb{P}}\,\|\,\mathbb{P}_{2k-1}).$$

By the mode-seeking property of the exclusive (reverse) KL divergence (Chan et al., 2022), we can expect the optimizer $\widehat{\mathbb{P}}$ to be a local estimator of the nonlinear solution in (7a). Additionally, the loss function (8b) to learn the variational score $A_t$, where $t \in [0, T]$, can be simplified to

$$\mathcal{L}(A) = \int_0^T \mathbb{E}_{\overleftarrow{x}_t}\big[\Gamma_\zeta(A_t\overleftarrow{x}_t; \overleftarrow{z}^\theta_t)\,\big|\,\overleftarrow{x}_T = x_T\big]\mathrm{d}t, \tag{10}$$

where $\Gamma_\zeta$ is defined in Eq.(6). Since the structural property $\psi_t\,\widehat{\varphi}_t = \rho_t$ in Eq.(5) is compromised by the variational inference, we propose to tune $\zeta$ in our experiments. A sketch of evaluating the objective (10) is given below.
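The sketch below is a hedged illustration of evaluating $\Gamma_\zeta(A\overleftarrow{x}_t; \overleftarrow{z}^\theta_t)$ from Eq.(10) on a batch of backward samples. Because the variational control is linear, the divergence term is computed analytically here; the function name and the interface are assumptions, not the released implementation.

```python
import torch

def variational_loss_at_t(A, x, z_hat, beta_t, zeta=1.0):
    """Per-sample estimate of Gamma_zeta(A x; z_hat) in Eq.(10) at one time t.

    Gamma_zeta(a; b) = 0.5*||a||^2 + div(sqrt(beta_t)*a - f_t) + zeta*<a, b>, with
    a = A x (linear variational control), b = z_hat (frozen backward control), and
    f_t(x) = -0.5*beta_t*x. Since a is linear in x, the divergence is exact:
    div = sqrt(beta_t)*tr(A) + 0.5*beta_t*d. (A Hutchinson trace estimator would be
    needed for a nonlinear forward control, as in the full Schrodinger bridge.)

    Shapes: A is (d, d); x and z_hat are (B, d).
    """
    d = x.shape[1]
    a = x @ A.T
    div = beta_t ** 0.5 * torch.trace(A) + 0.5 * beta_t * d
    return 0.5 * (a ** 2).sum(dim=1) + div + zeta * (a * z_hat).sum(dim=1)

# Training step sketch: accumulate this quantity over the time grid of a simulated
# backward trajectory and take a gradient step on A (cf. Eq.(19)).
```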
4.2. Closed-form Expression of Backward Score

Assuming prior knowledge of $A_t$ is given, we can rewrite the forward process (9a) in the VFB-SDE and derive a multivariate forward diffusion (Singhal et al., 2023):

$$\mathrm{d}\vec{x}_t = \big(-\tfrac{1}{2}\beta_t I + \beta_t A_t\big)\vec{x}_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t = -\tfrac{1}{2}D_t\beta_t\,\vec{x}_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t, \tag{11}$$

where $D_t = I - 2A_t \in \mathbb{R}^{d\times d}$ is a positive-definite matrix ($D_t = -2A_t \in \mathbb{R}^{d\times d}$ when the forward SDE is a VE-SDE). Consider the multivariate OU process (11). The mean and covariance follow

$$\frac{\mathrm{d}\mu_{t|0}}{\mathrm{d}t} = -\tfrac{1}{2}\beta_t D_t\,\mu_{t|0}, \tag{12a}$$
$$\frac{\mathrm{d}\Sigma_{t|0}}{\mathrm{d}t} = -\tfrac{1}{2}\beta_t\big(D_t\Sigma_{t|0} + \Sigma_{t|0}D_t^\top\big) + \beta_t I. \tag{12b}$$

Solving the differential equations with the help of integrating factors, the mean process follows

$$\mu_{t|0} = \exp\!\big(-\tfrac{1}{2}[\beta D]_t\big)\,x_0, \tag{13}$$

where $[\beta D]_t = \int_0^t \beta_s D_s\,\mathrm{d}s$. By the matrix decomposition $\Sigma_{t|0} = C_t H_t^{-1}$ (Särkkä & Solin, 2019), the covariance process follows

$$\begin{pmatrix} C_t \\ H_t \end{pmatrix} = \exp\!\begin{pmatrix} -\tfrac{1}{2}[\beta D]_t & [\beta I]_t \\ 0 & \tfrac{1}{2}[\beta D^\top]_t \end{pmatrix}\begin{pmatrix} \Sigma_0 \\ I \end{pmatrix}, \tag{14}$$

where the above matrix exponential can be easily computed through modern computing libraries. Further, to avoid computing the expensive matrix exponential for high-dimensional problems, we can adopt a diagonal and time-invariant $D_t$. Suppose $\Sigma_{t|0}$ has the Cholesky decomposition $\Sigma_{t|0} = L_t L_t^\top$ for some lower-triangular matrix $L_t$. We then have a closed-form update that resembles the SGM:

$$\vec{x}_t = \mu_{t|0} + L_t\epsilon,$$

where $\mu_{t|0}$ is defined in Eq.(13) and $\epsilon$ is a standard $d$-dimensional Gaussian vector. The score function follows

$$\nabla\log\vec{\rho}_{t|0}(\vec{x}_t) = \nabla\Big[-\tfrac{1}{2}(\vec{x}_t - \mu_{t|0})^\top\Sigma_{t|0}^{-1}(\vec{x}_t - \mu_{t|0})\Big] = -\Sigma_{t|0}^{-1}(\vec{x}_t - \mu_{t|0}) = -L_t^{-\top}L_t^{-1}L_t\epsilon := -L_t^{-\top}\epsilon. \tag{15}$$

Invoking the ESM loss function in Eq.(2), we can learn the score function $\nabla\log\vec{\rho}_{t|0}(\vec{x}_t\,|\,\vec{x}_0)$ using a neural network parametrization $s_t(\cdot)$ and optimize the loss function

$$\min_\theta\ \mathbb{E}\big[\big\|L_t^{-\top}\epsilon + s_t(\vec{x}_t)\big\|_2^2\big]. \tag{16}$$

One may further consider preconditioning techniques (Karras et al., 2022) or variance reduction (Singhal et al., 2023) to stabilize training and accelerate training speed.

Speed-ups via Time-invariant and Diagonal $D_t$. If we parametrize $D_t$ as a time-invariant and diagonal positive-definite matrix, the formula (14) has simpler explicit expressions that do not require calling matrix exponential operators. We present such a result in Corollary 1; a code sketch of the resulting transition follows Section 4.2.1. For the image generation experiment in Section 7.3, we use such a diagonal parametrization when implementing VSDM.

Corollary 1. Let $D_t = \Lambda := \mathrm{diag}(\lambda)$, where $\lambda_i \geq 0$ for $1 \leq i \leq d$, and denote $\sigma^2_t := \int_0^t \beta_s\,\mathrm{d}s$. Then the matrices $C_t$ and $H_t$ have the simpler expressions

$$C_t = \Lambda^{-1}\big[\exp(\tfrac{1}{2}\sigma^2_t\Lambda) - \exp(-\tfrac{1}{2}\sigma^2_t\Lambda)\big], \qquad H_t = \exp(\tfrac{1}{2}\sigma^2_t\Lambda),$$

which leads to $C_t H_t^{-1} = \Lambda^{-1}\big[I - \exp(-\sigma^2_t\Lambda)\big]$. As a result, the corresponding forward transition writes

$$\mu_{t|0} = \exp(-\tfrac{1}{2}\sigma^2_t\Lambda)\,x_0, \qquad L_t = \sqrt{\Lambda^{-1}\big[I - \exp(-\sigma^2_t\Lambda)\big]}.$$

In Corollary 1, detailed in Appendix A, since the matrix $\Lambda = \mathrm{diag}(\lambda)$ is diagonal and time-invariant, the matrix exponential and square root can be calculated elementwise on each diagonal element $\lambda_i$ independently.

4.2.1. Backward SDE

Taking the time reversal (Anderson, 1982) of the forward multivariate OU process (11), the backward SDE satisfies

$$\mathrm{d}\overleftarrow{x}_t = \big(-\tfrac{1}{2}D_t\beta_t\,\overleftarrow{x}_t - \beta_t s_t(\overleftarrow{x}_t)\big)\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\overleftarrow{w}_t. \tag{17}$$

Notably, with a general PD matrix $D_t$, the prior distribution follows $\overleftarrow{x}_T \sim \mathcal{N}(0, \Sigma_{T|0})$. We also note that the prior is now limited to Gaussian distributions, which is no longer a general bridge. See the remark on the selection of $\rho_{\text{prior}}$ in Section B.1.
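Below is a minimal sketch of the diagonal closed-form transition from Corollary 1, producing the perturbed sample $\vec{x}_t = \mu_{t|0} + L_t\epsilon$ and the regression target $-L_t^{-\top}\epsilon$ of Eq.(16). It assumes $\lambda_i > 0$ and is an illustration rather than the released implementation.

```python
import numpy as np

def diagonal_forward_transition(x0, lam, sigma2_t, rng=np.random.default_rng(0)):
    """Closed-form forward transition of Eq.(11) with D_t = diag(lam) (Corollary 1).

    x0:       (d,) data point
    lam:      (d,) strictly positive diagonal of Lambda (lam > 0 assumed here)
    sigma2_t: scalar sigma_t^2 = int_0^t beta_s ds
    Returns mu_{t|0}, the diagonal Cholesky factor L_t, a sample x_t = mu + L*eps,
    and the conditional score -L_t^{-T} eps used as the regression target in Eq.(16).
    """
    mu = np.exp(-0.5 * sigma2_t * lam) * x0           # Eq.(13), elementwise
    var = (1.0 - np.exp(-sigma2_t * lam)) / lam       # Sigma_{t|0} = Lambda^{-1}(I - exp(-sigma^2 Lambda))
    L = np.sqrt(var)
    eps = rng.standard_normal(x0.shape)
    x_t = mu + L * eps
    score_target = -eps / L                           # -L^{-T} eps in the diagonal case
    return mu, L, x_t, score_target
```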
4.2.2. Probability Flow ODE

We can follow Song et al. (2021b) and obtain the deterministic process directly:

$$\mathrm{d}\overleftarrow{x}_t = \big(-\tfrac{1}{2}D_t\beta_t\,\overleftarrow{x}_t - \tfrac{1}{2}\beta_t s_t(\overleftarrow{x}_t)\big)\mathrm{d}t, \tag{18}$$

where $\overleftarrow{x}_T \sim \mathcal{N}(0, \Sigma_{T|0})$ and the sample trajectories follow the same marginal densities $\vec{\rho}_t(x_t)$ as in the SDE.

4.3. Adaptive Diffusion via Stochastic Approximation

Our major goal is to generate high-fidelity data with efficient transportation plans based on the optimal $A^\star_t$ in the forward process (11). However, the optimal $A^\star_t$ is not known a priori. To tackle this issue, we leverage stochastic approximation (SA) (Robbins & Monro, 1951; Benveniste et al., 1990) to adaptively optimize the variational score $A^{(k)}_t$ through optimal transport and simulate the backward trajectories:

(1) Simulate backward trajectories $\{\overleftarrow{x}^{(k+1)}_{nh}\}_{n=0}^{N-1}$ via the Euler-Maruyama (EM) scheme of the backward process (17) with a learning rate $h$.

(2) Optimize the variational scores $\{A^{(k)}_{nh}\}_{n=0}^{N-1}$:

$$A^{(k+1)}_{nh} = A^{(k)}_{nh} - \eta_{k+1}\nabla\widetilde{\mathcal{L}}_{nh}\big(A^{(k)}_{nh}; \overleftarrow{x}^{(k+1)}_{nh}\big),$$

where $\widetilde{\mathcal{L}}_{nh}(A^{(k)}_{nh}; \overleftarrow{x}^{(k+1)}_{nh})$ is the loss function (10) at time $nh$ and is known as the random field. We expect that the simulation of backward trajectories $\{\overleftarrow{x}^{(k+1)}_{nh}\}_{n=0}^{N-1}$ given $s^{(k+1)}_{nh}$ helps the optimization of $A^{(k+1)}_{nh}$, and the optimized $A^{(k+1)}_{nh}$ in turn contributes to a more efficient transportation plan for estimating $s^{(k+2)}_{nh}$ and simulating the backward trajectories $\{\overleftarrow{x}^{(k+2)}_{nh}\}_{n=0}^{N-1}$.

Trajectory Averaging. The stochastic approximation algorithm is a standard framework to study adaptive sampling algorithms (Liang et al., 2007). Moreover, the formulation suggests stabilizing the trajectories (Polyak & Juditsky, 1992) with averaged parameters $\bar{A}^{(k)}_{nh}$ as follows:

$$\bar{A}^{(k)}_{nh} = \frac{1}{k}\sum_{i=1}^{k} A^{(i)}_{nh} = \Big(1 - \frac{1}{k}\Big)\bar{A}^{(k-1)}_{nh} + \frac{1}{k}A^{(k)}_{nh},$$

where $\bar{A}^{(k)}_{nh}$ is known to be an asymptotically efficient (optimal) estimator (Polyak & Juditsky, 1992) in the local state space $\mathcal{A}$ by assumption A1.

Exponential Moving Average (EMA). Despite guarantees in convex scenarios, the landscape differs tremendously across different regions of the non-convex state space $\mathcal{A}$. Empirically, if we want to exploit information from multiple modes, a standard extension is to employ the EMA technique (Trivedi & Kondor, 2017):

$$\bar{A}^{(k)}_{nh} = (1 - \eta)\bar{A}^{(k-1)}_{nh} + \eta A^{(k)}_{nh},$$

where $\eta \in (0, 1)$. EMA techniques are widely used empirically in diffusion models and Schrödinger bridges (Song & Ermon, 2020; De Bortoli et al., 2021; Chen et al., 2022b) to avoid oscillating trajectories. Now we are ready to present our methodology in Algorithm 1; a simplified code sketch of the alternating loop is given at the end of this section.

Computational Cost. Regarding the wall-clock computational time: i) training the (linear) variational scores, albeit in a simulation-based manner, becomes significantly faster than estimating nonlinear forward scores in the Schrödinger bridge; ii) the variational parametrization greatly reduces the number of model parameters, which yields a much-reduced variance in Hutchinson's estimator (Hutchinson, 1989); iii) since we don't need to update $A_t$ as often as the backward score model, we can further amortize the training of $A_t$. In the simulation example in Figure 9(b), VSDM is only 10% slower than the SGM with the same training complexity of backward scores while still maintaining efficient convergence of the variational scores.
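The following sketch shows one possible implementation of the alternating loop for a diagonal parametrization: backward sampling via the EM discretization of Eq.(17), then an SA/EMA update of the variational score. The helper names (`score_fn`, `variational_loss_at_t` from the earlier sketch) and hyperparameters are assumptions for illustration; see Algorithm 1 for the full procedure.

```python
import torch

def vsdm_stage(lam, lam_ema, score_fn, betas, h, eta_sa, eta_ema, batch=256, d=2):
    """One SA stage (cf. Algorithm 1): simulate backward trajectories with the current
    diagonal variational score, then update lam by one SGD step on the loss (10) and an EMA.

    lam, lam_ema: (d,) tensors with Lambda = diag(lam) = I - 2A, i.e. A = diag(0.5*(1 - lam));
    score_fn(x, n) is an assumed score estimate at time t = n*h; betas is a (N,) tensor of beta_{nh}.
    """
    lam = lam.clone().requires_grad_(True)
    N = len(betas)
    sigma2_T = h * betas.sum()                              # int_0^T beta_s ds
    var_T = (1 - torch.exp(-sigma2_T * lam.detach())) / lam.detach()
    x = torch.randn(batch, d) * torch.sqrt(var_T)           # prior N(0, Sigma_{T|0})
    loss = 0.0
    for n in reversed(range(1, N)):                         # from t = T down to t = h
        beta = betas[n]
        with torch.no_grad():
            s = score_fn(x, n)                              # frozen backward score s_t(x)
        A = torch.diag(0.5 * (1.0 - lam))                   # variational score A = (I - Lambda)/2
        z_hat = beta ** 0.5 * s                             # frozen backward control sqrt(beta)*s_t
        loss = loss + variational_loss_at_t(A, x, z_hat, beta).mean() * h   # accumulate Eq.(10)
        with torch.no_grad():                               # EM step of Eq.(17): nh -> (n-1)h
            x = x + (0.5 * lam.detach() * beta * x + beta * s) * h \
                  + (beta * h) ** 0.5 * torch.randn_like(x)
    grad = torch.autograd.grad(loss, lam)[0]
    with torch.no_grad():
        lam_new = lam - eta_sa * grad                       # SA update, Eq.(19)
        lam_ema = (1 - eta_ema) * lam_ema + eta_ema * lam_new   # EMA stabilization
    return lam_new.detach(), lam_ema
```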
5. Convergence of Stochastic Approximation

In this section, we study the convergence of $A^{(k)}_t$ to the optimal $A^\star_t$, where $t \in [0, T]$. The primary objective is to show that the iterates (19) asymptotically follow the trajectories of the dynamical system

$$\mathrm{d}A_t = -\bar{\mathcal{L}}_t(A_t)\,\mathrm{d}s, \qquad \frac{\mathrm{d}A_t}{\mathrm{d}s} := \lim_{\eta\to 0}\frac{A^{(k+1)}_t - A^{(k)}_t}{\eta}, \tag{20}$$

where $\bar{\mathcal{L}}_t(\cdot)$ is the mean field at time $t$:

$$\bar{\mathcal{L}}_t(A_t) = \int_{\mathcal{X}} \nabla\widetilde{\mathcal{L}}_t\big(A_t; \overleftarrow{x}^{(\cdot)}_t\big)\,\bar{\rho}_t\big(\mathrm{d}\overleftarrow{x}^{(\cdot)}_t\big), \tag{21}$$

where $\mathcal{X}$ denotes the state space of the data $x$ and $\nabla\widetilde{\mathcal{L}}_t$ denotes the gradient w.r.t. $A_t$; $\bar{\rho}_t$ is the distribution of the continuous-time interpolation of the discretized backward SDE (22) from $t = T$ to $0$. We denote by $A^\star_t$ one of the solutions of $\bar{\mathcal{L}}_t(A^\star_t) = 0$.

The aim is to find the optimal solution $A^\star_t$ to the mean field $\bar{\mathcal{L}}_t(A^\star_t) = 0$. However, we acknowledge that the equilibrium is not unique in general nonlinear dynamical systems. To tackle this issue, we focus our analysis on a neighborhood $\Theta$ of the equilibrium by assumption A1. After running sufficiently many iterations with a small enough step size $\eta_k$, suppose $A^{(k)}_t \in \Theta$ is somewhere near one equilibrium $A^\star_t$ (out of all equilibria); then, by induction, the iterates tend to stay trapped in the same region, as shown in Eq.(32), and converge to that equilibrium $A^\star_t$. We also present the variational gap of the (sub)-optimal transport and show our transport is more efficient than diffusion models with Gaussian marginals. Next, we outline informal assumptions and sketch our main results, reserving formal ones for readers interested in the details in the appendix. We also formulate the optimization of the variational score $A_t$ using stochastic approximation in Algorithm 2 in the supplementary material. (We slightly abuse the notation and generalize $A^{(k)}_{nh}$ to $A^{(k)}_t$.)

Algorithm 1: Variational Schrödinger Diffusion Models (VSDM). $\rho_{\text{prior}}$ is fixed to a Gaussian distribution; $\eta_k$ is the step size for SA and $h$ is the learning rate for the backward sampling of Eq.(17); $\xi_n$ denotes the standard Gaussian vector at sampling iteration $n$. The exponential moving averaging (EMA) technique can be used to further stabilize the algorithm.
repeat
  Simulation-free optimization of the backward score:
    Draw $x_0 \sim \rho_{\text{data}}$, $n \in \{0, 1, \ldots, N-1\}$, $\epsilon \sim \mathcal{N}(0, I)$.
    Sample $x_{nh}\,|\,x_0 \sim \mathcal{N}(\mu_{nh|0}, \Sigma_{nh|0})$ by Eq.(13) and (14) given $A^{(k)}_{nh}$.
    Cache $\{\mu_{nh|0}\}_{n=0}^{N-1}$ and $\{L_{nh}^{-\top}\}_{n=0}^{N-1}$ via the Cholesky decomposition of $\{\Sigma_{nh|0}\}_{n=0}^{N-1}$ to avoid repeated computations.
    Optimize the score functions $s^{(k+1)}_{nh}$ sufficiently through the loss function $\min_\theta \|L_{nh}^{-\top}\epsilon + s^{(k+1)}_{nh}(x_{nh})\|_2^2$.
  Optimization of the variational score via stochastic approximation (SA):
    Simulate the backward trajectory $\overleftarrow{x}^{(k+1)}_{nh}$ given $A^{(k)}_{nh}$ via Eq.(22), where $\overleftarrow{x}^{(k+1)}_{(N-1)h} \sim \mathcal{N}(0, \Sigma^{(k)}_{(N-1)h|0})$.
    Optimize the variational score $A^{(k+1)}_{nh}$ using the loss function (10), where $n \in \{0, 1, \ldots, N-1\}$:
    $$A^{(k+1)}_{nh} = A^{(k)}_{nh} - \eta_{k+1}\nabla\widetilde{\mathcal{L}}_{nh}\big(A^{(k)}_{nh}; \overleftarrow{x}^{(k+1)}_{nh}\big). \tag{19}$$
until stage $k = k_{\max}$
Sample $\overleftarrow{x}_0$ with stochastic (resp. deterministic) trajectories via the discretized Eq.(17) (resp. Eq.(18)).

Assumption A1 (Regularity). (Positive definiteness) For any $t \geq 0$ and $A_t \in \mathcal{A}$, $D_t = I - 2A_t$ is positive definite. (Locally strong convexity) For any stable local minimum $A^\star_t$ with $\bar{\mathcal{L}}_t(A^\star_t) = 0$, there is always a neighborhood $\Theta$ s.t. $A^\star_t \in \Theta \subset \mathcal{A}$ and the expected loss is strongly convex in $\Theta$.

By the mode-seeking property of the exclusive (reverse) KL divergence (Chan et al., 2022), we only make a mild assumption on a small neighborhood of the solution and expect the convergence given proper regularities.

Assumption A2 (Lipschitz Score). For any $t \in [0, T]$, the score $\nabla\log\vec{\rho}_t$ is $L$-Lipschitz.

Assumption A3 (Second Moment Bound). The data distribution has a bounded second moment.

Assumption A4 (Score Estimation Error).
We have bounded score estimation errors in $L^2$, quantified by $\epsilon_{\text{score}}$.

We first use the multivariate diffusion to train our score estimators $\{s^{(k)}_{nh}\}_{n=0}^{N-1}$ via the loss function (16) based on the pre-specified $A^{(k)}_t$ at stage $k$. Similar in spirit to Chen et al. (2023a; 2022a), we can show that the generated samples based on $\{s^{(k)}_{nh}\}_{n=0}^{N-1}$ are close in distribution to the ideal samples in Theorem 1. The novelty lies in the extension of single-variate diffusions to multi-variate diffusions.

Theorem 1 (Generation quality, informal). Assume assumptions A1-A4 hold with a fixed $A^{(k)}_t$. The generated data distribution is close to the data distribution $\rho_{\text{data}}$ such that

$$\mathrm{TV}\big(\bar{\rho}^{(k)}_0, \rho_{\text{data}}\big) \lesssim \exp(-T) + \big(\sqrt{dh} + \epsilon_{\text{score}}\big)\sqrt{T}.$$

To show the convergence of $A^{(k)}_t$ to $A^\star_t$, the proof hinges on a stability condition such that the solution asymptotically tracks the equilibrium $A^\star_t$ of the mean field (20).

Lemma 2 (Local stability, informal). Assume assumptions A1 and A2 hold. For $t \in [0, T]$ and $A \in \Theta$, the solution satisfies a local stability condition such that

$$\big\langle A - A^\star_t,\ \bar{\mathcal{L}}_t(A)\big\rangle \gtrsim \|A - A^\star_t\|_2^2.$$

The preceding result illustrates the convergence of the solution toward the equilibrium on average. The next assumption imposes a standard slow update of the SA process, which is standard for theoretical analysis but may not always be needed in empirical evaluations.

Assumption A5 (Step size). The step size $\{\eta_k\}_{k\in\mathbb{N}}$ is a positive and decreasing sequence with

$$\sum_{k=1}^{\infty}\eta_k = +\infty, \qquad \sum_{k=1}^{\infty}\eta_k^2 < +\infty.$$

Next, we use the stochastic approximation theory to prove the convergence of $A^{(k)}_t$ to an equilibrium $A^\star_t$.

Theorem 2 (Convergence in $L^2$). Assume assumptions A1-A5 hold. The variational score $A^{(k)}_t$ converges to an equilibrium $A^\star_t$ in $L^2$ such that

$$\mathbb{E}\big[\|A^{(k)}_t - A^\star_t\|_2^2\big] \lesssim 2\eta_k,$$

where the expectation is taken w.r.t. samples from $\bar{\rho}^{(k)}_t$.

In the end, we adapt Theorem 1 again to show that the adaptively generated samples are asymptotically close to the samples based on the optimal $A^\star_t$ in Theorem 3, which quantifies the quality of data based on more efficient transportation plans.

Theorem 3 (Generation quality of adaptive samples). Given assumptions A1-A5, the generated sample distribution at stage $k$ is close to the exact sample distribution based on the equilibrium $A^\star_t$ such that

$$\mathrm{TV}\big(\bar{\rho}_0, \rho_{\text{data}}\big) \lesssim \exp(-T) + \big(\sqrt{dh} + \epsilon_{\text{score}} + \sqrt{\eta_k}\big)\sqrt{T}.$$

6. Variational Gap

Recall that the optimal and variational forward SDEs follow

$$\mathrm{d}\vec{x}_t = \big[f_t(\vec{x}_t) + \beta_t\nabla\log\psi_t(\vec{x}_t)\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t,$$
$$\mathrm{d}\vec{x}_t = \big[f_t(\vec{x}_t) + \beta_t A^{(k)}_t\vec{x}_t\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t,$$
$$\mathrm{d}\vec{x}_t = \big[f_t(\vec{x}_t) + \beta_t A^\star_t\vec{x}_t\big]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t,$$

where we abuse the notation $\vec{x}_t$ for the sake of clarity and they represent three different processes. Despite the improved efficiency based on the ideal $A^\star_t$ compared to the vanilla $A_t \equiv 0$, the variational score inevitably yields a sub-optimal transport for general nonlinear transport. We denote the laws of the above processes by $\mathbb{L}$, $\mathbb{L}^{(k)}$, and $\mathbb{L}^\star$, respectively. To assess the disparity, we leverage the Girsanov theorem to study the variational gap.

Theorem 4 (Variational gap). Assume assumption A2 and Novikov's condition hold. Assume $f_t$ and $\nabla\log\psi_t$ are Lipschitz smooth and satisfy the linear growth condition. The variational gap follows

$$\mathrm{KL}(\mathbb{L}\,\|\,\mathbb{L}^\star) = \frac{1}{2}\int_0^T \mathbb{E}\Big[\beta_t\big\|A^\star_t\vec{x}_t - \nabla\log\psi_t(\vec{x}_t)\big\|_2^2\Big]\mathrm{d}t, \qquad \mathrm{KL}(\mathbb{L}\,\|\,\mathbb{L}^{(k)}) \lesssim \eta_k + \mathrm{KL}(\mathbb{L}\,\|\,\mathbb{L}^\star).$$

Connections to the Gaussian Schrödinger Bridge (GSB). When the data follows a Gaussian distribution, VSDM approximates the closed-form OT solution of the Schrödinger bridge (Janati et al., 2020; Bunne et al., 2023). We refer readers to Theorem 3 of Bunne et al. (2023) for the detailed transportation plans; a Monte-Carlo sketch of the gap in Theorem 4 is given below.
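As an illustration of Theorem 4, the gap $\mathrm{KL}(\mathbb{L}\,\|\,\mathbb{L}^\star)$ can be estimated by Monte Carlo once forward trajectories and the score $\nabla\log\psi_t$ are available. The sketch below assumes such oracles (`grad_log_psi`, trajectory samples on a uniform grid) purely for illustration.

```python
import numpy as np

def variational_gap_mc(xs, betas, h, A_star, grad_log_psi):
    """Monte-Carlo estimate of 0.5 * int_0^T E[ beta_t ||A*_t x - grad log psi_t(x)||^2 ] dt.

    xs:     (N, B, d) forward trajectories on the grid t = 0, h, ..., (N-1)h
    betas:  (N,) values of beta_t on the grid
    A_star: list of N matrices (d, d), the variational scores A*_t
    grad_log_psi(x, n): assumed oracle returning grad log psi at time n*h, shape (B, d)
    """
    gap = 0.0
    for n in range(len(betas)):
        diff = xs[n] @ A_star[n].T - grad_log_psi(xs[n], n)
        gap += 0.5 * betas[n] * (diff ** 2).sum(axis=1).mean() * h   # left Riemann sum in time
    return gap
```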
Compared to the vanilla $A_t \equiv 0$, we can significantly reduce the variational gap $\mathrm{KL}(\mathbb{L}\,\|\,\mathbb{L}^\star)$ using a proper parametrization and sufficient training. We briefly compare VSDM to SGM and SB in the following:

Properties                   | SGM  | SB      | VSDM
Entropic optimal transport   | ✗    | Optimal | Sub-optimal
Simulation-free (backward)   | ✓    | ✗       | ✓

7. Empirical Studies

7.1. Comparison to the Gaussian Schrödinger Bridge

VSDM approximates GSB (Bunne et al., 2023) when both marginals are Gaussian distributions. To evaluate the solutions, we run our VSDM with a fixed $\beta_t \equiv 4$ in Eq.(25) of Song et al. (2021b) and use the same marginals to replicate the VP-SDE of the Gaussian SB with $\alpha_t \equiv 0$ and $c_t \equiv 2$ in Eq.(7) of Bunne et al. (2023). We train VSDM for 20 stages and randomly pick 256 samples for presentation. We compare the flow trajectories from both models and observe in Figure 1 that the ground-truth solution forms an almost linear path, while our VSDM sample trajectories exhibit a consistent alignment with the trajectories from the Gaussian SB. We attribute the bias predominantly to score estimations and numerical discretization.

Figure 1. Gaussian SB (GSB) v.s. VSDM on the flow trajectories.

7.2. Synthetic Data

We test our variational Schrödinger diffusion models (VSDMs) on two synthetic datasets: spiral and checkerboard (detailed in Section D.2.1). We include SGMs as the baseline models and aim to show the strength of VSDMs on general shapes with straighter trajectories. As such, we stretch the Y-axis of the spiral data by 8 times and the X-axis of the checkerboard data by 6 times and denote them by spiral-8Y and checkerboard-6X, respectively. We adopt a monotone increasing $\{\beta_{nh}\}_{n=0}^{N-1}$ similar to Song et al. (2021b) and denote by $\beta_{\min}$ and $\beta_{\max}$ the minimum and maximum of $\{\beta_{nh}\}_{n=0}^{N-1}$. We fix $\zeta = 0.75$ and $\beta_{\min} = 0.1$ and focus on the study with different $\beta_{\max}$. We find that SGMs work pretty well with $\beta_{\max} = 10$ (SGM-10) on standard isotropic shapes. However, when it comes to spiral-8Y, SGM-10 struggles to recover the boundary regions of the spiral-8Y data, as shown in Figure 2 (top).

Generations of Anisotropic Shapes. To illustrate the effectiveness of our approach, Figure 2 (bottom) shows that VSDM-10 accurately reconstructs the edges of the spiral and generates high-quality samples.

Figure 2. Variational Schrödinger diffusion models (VSDMs, bottom) v.s. SGMs (top) with the same hyperparameters ($\beta_{\max} = 10$).

Straighter Trajectories. SGM-10 fails to fully generate the anisotropic spiral-8Y, and increasing $\beta_{\max}$ to 20 or 30 (SGM-20 and SGM-30) significantly alleviates this issue. However, we observe that excessive $\beta_{\max}$ values in SGMs compromise the straightness and lead to inefficient transport, especially on the X-axis of spiral-8Y.

Figure 3. Probability flow ODE via VSDMs and SGMs. SGM with $\beta_{\max} = 10$ is denoted by SGM-10 for convenience.

Instead of setting an excessive $\beta_{\max}$ on both axes, our VSDM-10, by contrast, proposes conservative diffusion scales on the X-axis of spiral-8Y and explores more on the Y-axis of spiral-8Y. As such, we obtain around a 40% improvement in straightness in Figure 3 and Table 4. Additional insights into a similar analysis of the checkerboard dataset, convergence analysis, computational time, assessments of straightness, and evaluations with a smaller number of function evaluations (NFEs) can be found in Appendix D.2.
7.3. Image Data Modeling

Experiment Setup. In this experiment, we evaluate the performance of VSDM on image modeling tasks. We choose the CIFAR10 dataset as representative image data to demonstrate the scalability of the proposed VSDM for generative modeling of high-dimensional distributions. We refer to the code base of FB-SDE (Chen et al., 2022b) and use the same forward diffusion process as the EDM model (Karras et al., 2022). Since the training of VSDM alternates between forward and backward training, we build our implementation on the open-source diffusion distillation code base (Luo et al., 2024a; see code at https://github.com/pkulwj1994/diff_instruct), which provides a high-quality empirical implementation of alternating training with the EDM model on CIFAR10 data. To make the VSDM algorithm stable, we simplify the matrix $D_t$ to be diagonal with learnable diagonal elements, which is the case introduced in Corollary 1. We train the VSDM model from scratch on two NVIDIA A100-80G GPUs for two days and generate images from the trained VSDM with the Euler-Maruyama numerical solver with 200 discretized steps.

Figure 4. Unconditional generated samples from VSDM on CIFAR10 (32×32 resolution) trained from scratch.

Performances. We measure the generative performance in terms of the Fréchet Inception Distance (FID (Heusel et al., 2017), the lower the better), which is a widely used metric for evaluating generative modeling performance. Table 2 summarizes the FID values of VSDM along with other optimal-transport-based and score-based generative models on the CIFAR10 dataset (unconditional, without labels). VSDM outperforms the other optimal-transport-based models with an FID of 2.28. This demonstrates that VSDM has applicable scalability to model high-dimensional distributions. Figure 4 shows some non-cherry-picked unconditional generated samples from VSDM trained on the CIFAR10 dataset.

Table 2. CIFAR10 evaluation using sample quality (FID score). Our VSDM outperforms other optimal transport baselines by a large margin.

Class              | Method                          | FID
Optimal transport  | VSDM (ours)                     | 2.28
Optimal transport  | SB-FBSDE (Chen et al., 2022b)   | 3.01
Optimal transport  | DOT (Tanaka, 2019)              | 15.78
Optimal transport  | DGflow (Ansari et al., 2020)    | 9.63
Score-based        | SDE (Song et al., 2021b)        | 2.92
Score-based        | ScoreFlow (Song et al., 2021a)  | 5.7
Score-based        | VDM (Kingma et al., 2021)       | 4.00
Score-based        | LSGM (Vahdat et al., 2021)      | 2.10
Score-based        | EDM (Karras et al., 2022)       | 1.97

Convergence Speed. To demonstrate the convergence speed of VSDM along the training process, we record the FID values in Table 1 for a training trial with no warm-up on the CIFAR10 dataset (unconditional). We use a batch size of 256 and a learning rate of 1e-4. We use the 2nd-order Heun numerical solver to sample. The result shows that VSDM converges smoothly.

Table 1. Convergence speed of FID values for VSDM.

K images      | 0      | 10K   | 20K  | 30K  | 40K  | 50K  | 100K | 150K | 200K | Converged
FID (NFE=35)  | 406.13 | 13.13 | 8.65 | 6.83 | 5.66 | 5.21 | 3.62 | 3.29 | 3.01 | 2.28

7.4. Time Series Forecasting

We use multivariate probabilistic forecasting as a real-world conditional modeling task. Let $\{(t_1, x_1), \ldots, (t_n, x_n)\}$, $x \in \mathbb{R}^d$, denote a single multivariate time series. Given a dataset of such time series, we want to predict the next $P$ values $x_{n+1}, \ldots, x_{n+P}$. In probabilistic modeling, we want to generate forecasts from a learned $p(x_{n+1:n+P}\,|\,x_{1:n})$. The usual approach is to have an encoder that represents a sequence $x_{1:i}$ with a fixed-sized vector $h_i \in \mathbb{R}^h$ for every $i$, and then parameterize the output distribution $p(x_{i+1}\,|\,h_i)$.
At inference time we encode the history into $h_n$ and sample the next value from $p(x_{n+1}\,|\,h_n)$, then use $x_{n+1}$ to obtain the updated $h_{n+1}$ and repeat until we obtain $x_{n+P}$. In previous works, the output distribution has been specified with copulas (Salinas et al., 2019) and denoising diffusion (Rasul et al., 2021). We augment our approach to allow conditional generation, which only requires changing the model to include the conditioning vector $h_i$. For that we adopt the U-Net architecture and use an LSTM neural network as the sequence encoder. We use three real-world datasets, as described in Appendix D.3. We compare to the SGM and the denoising diffusion approach from Rasul et al. (2021), which we refer to as DDPM. Table 3 shows that our method matches or outperforms the competitors. Figure 5 is a demo of conditional time series generation, and more details are presented in Figure 12 to demonstrate the quality of the forecasts.

Table 3. Forecasting results (lower is better).

CRPS-sum     | Electricity     | Exchange rate   | Solar
DDPM         | 0.026 +/- 0.007 | 0.012 +/- 0.001 | 0.506 +/- 0.058
SGM          | 0.045 +/- 0.005 | 0.012 +/- 0.002 | 0.413 +/- 0.045
VSDM (ours)  | 0.038 +/- 0.006 | 0.008 +/- 0.002 | 0.395 +/- 0.011

Figure 5. Example for Electricity for 2 (out of 370) dimensions.

8. Conclusions and Future Works

The Schrödinger bridge diffusion model offers a principled approach to solving optimal transport, but estimating the intractable forward score relies on implicit training through costly simulated trajectories. To address this scalability issue, we present the variational Schrödinger diffusion model (VSDM), utilizing linear variational forward scores for simulation-free training of backward score functions. The theoretical foundations leverage stochastic approximation theory, demonstrating the convergence of variational scores to local equilibria and highlighting the variational gap in optimal transport. Empirically, VSDM showcases the strength of generating data with anisotropic shapes and yielding the desired straighter transport paths for reducing the number of function evaluations. VSDM also exhibits scalability in handling large-scale image datasets without requiring warm-up initializations. In future research, we aim to explore the critically damped (momentum) acceleration (Dockhorn et al., 2022) and Hessian approximations to develop the ADAM alternative of diffusion models.

Acknowledgements

We thank Valentin De Bortoli, Tianyang Hu, and the anonymous reviewers for their valuable insights.

Impact Statement

This paper proposed a principled approach to accelerate the training and sampling of generative models using optimal transport. This work will contribute to developing text-to-image generation, artwork creation, and product design. However, it may also raise challenges in fake-content generation and pose a threat to online privacy and security.

References

Albergo, M. S. and Vanden-Eijnden, E. Building Normalizing Flows with Stochastic Interpolants. In International Conference on Learning Representation (ICLR), 2023.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. arXiv:2303.08797v1, pp. 1-48, 2023.

Anderson, B. D. Reverse-time Diffusion Equation Models. Stochastic Processes and Their Applications, 12(3):313-326, 1982.

Ansari, A. F., Ang, M. L., and Soh, H. Refining Deep Generative Models via Discriminator Gradient Flow. In International Conference on Learning Representations, 2020.

Benveniste, A., Métivier, M., and Priouret, P.
Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990. Blei, D. M., Kucukelbir, A., and Mc Auliffe, J. D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112 (518), 2017. Bunne, C., Hsieh, Y.-P., Cuturi, m., and Krause, A. The Schr odinger Bridge between Gaussian Measures has a Closed Form. In AISTATS, 2023. Caluya, K. F. and Halder, A. Wasserstein Proximal Algorithms for the Schr odinger Bridge Problem: Density Control with Nonlinear Drift. IEEE Transactions on Automatic Control, 67(3):1163 1178, 2022. Chan, A., Silva, H., Lim, S., Kozuno, T., Mahmood, A. R., and White, M. Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences. Journal of Machine Learning Research, 2022. Chen, H., Lee, H., and Lu, J. Improved Analysis of Scorebased Generative Modeling: User-friendly Bounds under Minimal Smoothness Assumptions. In International Conference on Machine Learning, pp. 4735 4763, 2023a. Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. R. Sampling is as Easy as Learning the Score: Theory for Diffusion Models with Minimal Data Assumptions. ar Xiv preprint ar Xiv:2209.11215v2, 2022a. Chen, T., Liu, G.-H., and Theodorou, E. A. Likelihood Training of Schr odinger Bridge using Forward-Backward SDEs Theory. In International Conference on Learning Representation (ICLR), 2022b. Chen, T., Gu, J., Dinh, L., Theodorou, E. A., Susskind, J., and Zhai, S. Generative Modeling with Phase Stochastic Bridges. In ar Xiv:2310.07805v2, 2023b. Chen, Y. and Georgiou, T. Stochastic Bridges of Linear Systems. IEEE Transactions on Automatic Control, 61 (2), 2016. Chen, Y., Georgiou, T. T., and Pavon, M. Stochastic Control Liaisons: Richard Sinkhorn Meets Gaspard Monge on a Schr odinger Bridge. SIAM Review, 63(2):249 313, 2021. Chen, Y., Deng, W., Fang, S., Li, F., Yang, N., Zhang, Y., Rasul, K., Zhe, S., Schneider, A., and Nevmyvaka, Y. Provably Convergent Schr odinger Bridge with Applications to Probabilistic Time Series Imputation. In ICML, 2023c. Chewi, S. Log-Concave Sampling. online draft, 2023. De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schr odinger Bridge with Applications to Score Based Generative Modeling. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Deng, W., Chen, Y., Yang, N., Du, H., Feng, Q., and Chen, R. T. Q. Reflected Schr odinger Bridge for Constrained Generative Modeling. In Conference on Uncertainty in Artificial Intelligence (UAI), 2024. Dhariwal, P. and Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Dockhorn, T., Vahdat, A., and Kreis, K. Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. In International Conference on Learning Representation (ICLR), 2022. Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization. In ICML, 2020. Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representation (ICLR), 2019. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Neural Information Processing Systems, 2017. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models. 
In Advances in Neural Information Processing Systems (Neur IPS), 2020. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., and Salimans, T. Imagen Video: High Definition Video Generation with Diffusion Models. In ar Xiv:2210.02303, 2022. Variational Schr odinger Diffusion Models Huang, C.-W., Lim, J. H., and Courville, A. A Variational Perspective on Diffusion-Based Generative Models and Score Matching. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Hutchinson, M. F. A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989. Hyv arinen, A. Estimation of Non-normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(24):695 709, 2005. Janati, H., Muzellec, B., Peyr e, G., and Cuturi, M. Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Karatzas, I. and Shreve, S. E. Brownian Motion and Stochastic Calculus. Springer, 1998. Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational Diffusion Models. Ar Xiv, abs/2107.00630, 2021. Koehler, F., Heckett, A., and Risteski, A. Statistical Efficiency of Score Matching: The View from Isoperimetry. In ICLR, 2023. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diff Wave: A Versatile Diffusion Model for Audio Synthesis . In Proc. of the International Conference on Learning Representation (ICLR), 2021. Kullback, S. Probability Densities with Given Marginals. Ann. Math. Statist., 1968. Lavenant, H. and Santambrogio, F. The Flow Map of the Fokker Planck Equation Does Not Provide Optimal Transport. Applied Mathematics Letters, 133, 2022. Lee, H., Lu, J., and Tan, Y. Convergence for Scorebased Generative Modeling with Polynomial Complexity. Advances in Neural Information Processing Systems (Neur IPS), 2022. L eonard, C. A Survey of the Schr odinger Problem and Some of its Connections with Optimal Transport. Discrete & Continuous Dynamical Systems-A, 34(4):1533 1574, 2014. Liang, F., Liu, C., and Carroll, R. J. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305 320, 2007. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow Matching for Generative Modeling. In Proc. of the International Conference on Learning Representation (ICLR), 2023. Liptser, R. S. and Shiryaev, A. N. Statistics of Random Processes: I. General Theory. Springer Science & Business Media, 2001. Liu, Q. Rectified Flow: A Marginal Preserving Approach to Optimal Transport. ar Xiv:2209.14577, 2022. Liu, X., Gong, C., and Liu, Q. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In International Conference on Learning Representation (ICLR), 2023. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPMSolver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Luo, W. A Comprehensive Survey on Knowledge Distillation of Diffusion Models. ar Xiv preprint ar Xiv:2304.04262, 2023. Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., and Zhang, Z. 
Diff-instruct: A Universal Approach for Transferring Knowledge from Pre-trained Diffusion Models. Advances in Neural Information Processing Systems, 36, 2024a. Luo, W., Zhang, B., and Zhang, Z. Entropy-based Training Methods for Scalable Neural Implicit Samplers. Neur IPS, 36, 2024b. Ma, J. and Yong, J. Forward-Backward Stochastic Differential Equations and their Applications. Springer, 2007. Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. Sampling via Measure Transport: An Introduction. Handbook of Uncertainty Quantification, pp. 1 41, 2016. Mc Cann, R. J. A Convexity Principle for Interacting Gases. Advances in mathematics, 128(1):153 179, 1997. Øksendal, B. Stochastic Differential Equations: An Introduction with Applications. Springer, 2003. Onken, D., Fung, S. W., Li, X., and Ruthotto, L. OT-Flow: Fast and Accurate Continuous Normalizing Flows via Optimal Transport. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2021. Pavon, M., Tabak, E. G., and Trigila, G. The Data-driven Schr odinger Bridge. Communications on Pure and Applied Mathematics, 74:1545 1573, 2021. Peluchetti, S. Diffusion Bridge Mixture Transports, Schr odinger Bridge Problems and Generative Modeling. Ar Xiv e-prints ar Xiv:2304.00917v1, 2023. Variational Schr odinger Diffusion Models Peyr e, G. and Cuturi, M. Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning, 2019. Polyak, B. T. and Juditsky, A. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30:838 855, 1992. Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. T. Q. Multisample Flow Matching: Straightening Flows with Minibatch Couplings. In ICML, 2023. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. In ar Xiv:2204.06125v1, 2022. Rasul, K., Seward, C., Schuster, I., and Vollgraf, R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. In International Conference on Machine Learning, 2021. Robbins, H. and Monro, S. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400 407, 1951. Ruschendorf, L. Convergence of the Iterative Proportional Fitting Procedure. Ann. of Statistics, 1995. Salimans, T. and Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In ICLR, 2022. Salinas, D., Bohlke-Schneider, M., Callot, L., Medico, R., and Gasthaus, J. High-dimensional Multivariate Forecasting with Low-rank Gaussian Copula Processes. Advances in neural information processing systems, 2019. S arkk a, S. and Solin, A. Applied Stochastic Differential Equations. Cambridge University Press, 2019. Shi, Y., De Bortoli, V., Campbell, A., and Doucet, A. Diffusion Schr odinger Bridge Matching. In Advances in Neural Information Processing Systems (Neur IPS), 2023. Singhal, R., Goldstein, M., and Ranganath, R. Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions. In Proc. of the International Conference on Learning Representation (ICLR), 2023. Song, Y. and Ermon, S. Improved Techniques for Training Score-Based Generative Models. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced Score Matching: A Scalable Approach to Density and Score Estimation. In Uncertainty in Artificial Intelligence, 2020. Song, Y., Durkan, C., Murray, I., and Ermon, S. 
Maximum Likelihood Training of Score-Based Diffusion Models . In Advances in Neural Information Processing Systems (Neur IPS), 2021a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations . In International Conference on Learning Representation (ICLR), 2021b. Tanaka, A. Discriminator Optimal Transport. In Neural Information Processing Systems, 2019. Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. Improving and Generalizing Flow-based Generative Models with Minibatch Optimal Transport. ar Xiv:2302.00482v3, 2023. Trivedi, S. and Kondor, R. Optimization for Deep Neural Networks. Slides - University of Chicago, 2017. Vahdat, A., Kreis, K., and Kautz, J. Score-based Generative Modeling in Latent Space. Advances in Neural Information Processing Systems, 34:11287 11302, 2021. Vanden-Eijnden, E. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu. edu/ eve2/reg_pert.pdf. Vargas, F., Thodoroff, P., Lamacraft, A., and Lawrence, N. Solving Schr odinger Bridges via Maximum Likelihood. Entropy, 23(9):1134, 2021. Vempala, S. S. and Wibisono, A. Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices, 2022. Villani, C. Topics in Optimal Transportation, volume 58. American Mathematical Soc., 2003. Vono, M., Paulin, D., and Doucet, A. Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting. Journal of Machine Learning Research, 2022. Xue, S., Yi, M., Luo, W., Zhang, S., Sun, J., Li, Z., and Ma, Z.-M. SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models. Advances in Neural Information Processing Systems, 2023. Zhang, B., Luo, W., and Zhang, Z. Enhancing Adversarial Robustness via Score-Based Optimization. Advances in Neural Information Processing Systems, 36, 2024. Variational Schr odinger Diffusion Models Supplementary Material for Variational Schr odinger Diffusion Models In section A, we study the closed-form expression of matrix exponential for diagonal and time-invariant Dt; In section B, we study the convergence of the adaptive diffusion process; In section C, we study the variational gap of the optimal transport and discuss its connections to Gaussian Schr odinger bridge; In section D, we present more details on the empirical experiments. Notations: X is the state space for the data x; x (k) nh is the n-th backward sampling step with a learning rate h at the k-th stage. ηk is the step size to optimize A. A is the (latent) state space of A; A(k) t is the forward linear score estimator at stage k and time t, A t is the equilibrium of Eq.(25) at time t. L t is the random field in the stochastic approximation process and also the loss (10) at time t; L t is the mean field with the equilibrium A t . Given a fixed A(k) t at step k, log ρ (k) t (resp. log ρ (k) t|0 ) is the (resp. conditional) forward score function of Eq.(11) at time t and step k; A(k) t yields the approximated score function s(k) t and ρ (k) t is the distribution of the continuous-time interpolation of the discretized backward SDE (22). A. Closed-form Expression with Diagonal and Time-Invariant Dt In this section, we give the proof of Corollary 1. Proof Denote Dt = Λ := diag(λ), where λi 0, 1 i d, and σ2 t := R t 0 βsds, then by Eq. (14), we have 2[βD]t [βI]t 0 1 2[βD ]t = exp(Mt) Σ0 I Here [βD]t = R t 0 βs Dtds = σ2 t Λ. 
The matrix Mt is defined as 2σ2 t Λ σ2 t I 0 1 2σ2 t Λ Therefore, we have 2σ2 t Λ)2 0 0 ( 1 , M3 t = ( 1 2σ2 t Λ)3 σ2 t ( 1 2σ2 t Λ)4 0 0 ( 1 , M5 t = ( 1 2σ2 t Λ)5 σ2 t ( 1 According to the definition of matrix exponential, we have exp(Mt) = [I + 1 3!M3 t + ...] 2σ2 t Λ) h σ2 t I + 1 2σ2 t Λ)2 + 1 2σ2 t Λ)4 + ... i 2σ2 t Λ) σ2 t 1 2 σ2 t Λ 2σ2 t Λ)1 + 1 2σ2 t Λ)3 + 1 2σ2 t Λ)5 + ... i 2σ2 t Λ) Λ 1h exp( 1 2σ2 t Λ) exp( 1 Notice that, when we have Σ0 = 0, the expression can be simplified as follows Ct Ht = exp(Mt) 0 I Λ 1h exp( 1 2σ2 t Λ) exp( 1 Variational Schr odinger Diffusion Models Therefore, Ct H 1 t = Λ 1 I exp( σ2 t Λ) . As a result, the corresponding forward transition writes µt|0 = exp( 1 I exp( σ2 t Λ). B. Stochastic Approximation Stochastic approximation (SA), also known as the Robbins Monro algorithm (Robbins & Monro, 1951; Benveniste et al., 1990) offers a conventional framework for the study of adaptive algorithms. The stochastic approximation algorithm works by repeating the sampling-optimization iterations in the dynamic setting in terms of simulated trajectories. We present our algorithm in Algorithm 2. Algorithm 2 The (dynamic) stochastic approximation (SA) algorithm. The (dynamic) SA is a theoretical formulation of Algorithm 1. We assume optimizing the loss function (16) yields proper score estimations s(k+1) t at each stage k and time t to approximate log ρ (k) t ( x t| x 0) in Eq.(9b). repeat Simulation: Sample the backward process from (17) given a fixed A(k) nh x (k+1) (n 1)h = 1 2 I 2A(k) nh βnh x (k+1) nh βnhs(k+1) nh x (k+1) nh h + p βnhhξn, (22) where x (k+1) (N 1) N(0, Σ(k) (N 1)h|0), n [1, 2, , N 1] and h is the learning rate for the backward sampling (17) via the Euler Maruyama (EM) discretization. ξn denotes the standard Gaussian vector at the sampling iteration n. Optimization: Minimize the implicit forward loss function (10) A(k+1) nh = A(k) nh ηk+1 L nh(A(k) nh ; x (k+1) nh ), (23) where L nh(A(k) nh ; x (k+1) nh ) is the (dynamic) random field and ηk is the step size. n {0, 1, , N 1}. until Stage k = kmax To facilitate the analysis, we assume we only make a one-step sampling in Eq.(23). Note that it is not required in practice and multiple-step extensions can be employed to exploit the cached data more efficiently. The theoretical extension is straightforward and omitted in the proof. We also slightly abuse the notation for convenience and generalize Anh to At. Theoretically, the primary objective is to show the iterates (19) follow the trajectories of the dynamical system asymptotically: d At = L t(At)ds, (24) where L t(At) is the mean field defined as follows: L t(At) = Z X L t(At; x ( ) t ) ρ t(d x ( ) t ). (25) We denote by A t the solution of L t(A t ) = 0. Since the samples simulated from ρ t are slightly biased due to the convergence of forward process, discretization error, and score estimation errors as shown in Theorem 1. We expect the mean field is also biased with a perturbed equilibrium. However, by the perturbation theory (Vanden-Eijnden, 2001), the perturbation is mild and controlled by the errors in Theorem 1. Hence although A t is not the optimal linear solution in terms of optimal transport, it still yields efficient transportation plans. Since the exclusive (reverse) KL divergence is known to approximate a single mode (denoted by A t ) in fitting multi-modal distributions, we proceed to assume the following regularity conditions for the solution A t and the neighborhood of A t . Assumption A1 (Regularity). 
(Positive definiteness) For any t 0 and At A, there exists a constant λmin > 0 s.t. λmin I Dt = I 2At, where A B means B A is semi positive definite. (Locally strong convexity) For any stable Variational Schr odinger Diffusion Models local minimum A t with L t(A t ) = 0, there is always a neighborhood Θ s.t. A t Θ A and L t is strongly convex in Θ, i.e. there exists fixed constants M > m > 0 s.t. for A Θ, m I 2 L t A2 (A) MI. The first part of the above assumption is standard and can be achieved by an appropriate regularization during the training; the second part only assumes the strong convexity for a small neighborhood Θ of the optimum A t . As such, when conditions for Eq.(31) hold, we can apply the induction method to make sure all the subsequent iterates of A(k) t stay in the same region Θ and converge to the local minimum A t . For future works, we aim to explore the connection between m and λmin. Next, we lay out three standard assumptions following Chen et al. (2023a) to conduct our analysis. Similar results are studied by Lee et al. (2022); Chen et al. (2022a) with different score assumptions. Assumption A2 (Lipschitz Score). The score function log ρ t ( log ρ t,A)|| is L-Lipschitz in both x and A for any t [0, T]. For any A, B A and any x, y X, we have log ρ t,A(x) log ρ t,A(y) 2 L x y 2 log ρ t,A(x) log ρ t,B(y) 2 L A B where 2 is the standard L2 norm and is matrix norm. Assumption A3 (Second Moment Bound). The data distribution has a bounded second moment m2 2 := Eρdata[ 2 2] < . Assumption A4 (Score Estimation Error). For all t [0, T], and any At, we have some estimation error . E ρ t[ st log ρ t 2 2] ϵ2 score. We first use the multivariate diffusion to train our score estimators {s(k) t }N 1 n=0 via the loss function (16) based on the pre-specified A(k) t . Following Chen et al. (2023a), we can show the generated samples based on {s(k) t }N 1 n=0 are close in distribution to the ideal samples in Theorem 1. The novelty lies in the extension of single-variate diffusions to multi-variate diffusions. Next, we use the stochastic approximation theory to prove the convergence of A(k) t to a local equilibrium A t in Theorem 2. In the end, we adapt Theorem 1 again to show the adaptively generated samples are asymptotically close to the samples based on the optimal A t in Theorem 3, which further optimizes the transportation plans through a variational formulation. To facilitate the understanding, we summarize the details as follows Sample via A(k) t Random Field Mean Field Convergence of A(k) t Sample via A t s(k) t Backward Sampling ========== Theorem 1 L t(A(k) t ; x (k+1) t ) Eq.(25) ==== L t(A(k) t ) Convergence ======= Theorem 2 A(k) t A t Adaptive Sampling ========== Theorem 3 lim k x (k+1) t . Proof of Sketch Part B.1: The generated samples (backward trajectories) approximate the ideal samples from the fixed A(k) t . Part B.2: We employ the SA theory to show the convergence A(k) t to the optimal estimator A t . Part B.3: The adaptively generated samples approximate the ideal samples from the optimal A t asymptotically. B.1. Convergence of Approximated Samples with a Fixed At The following result is majorly adapted from Theorem 2.1 of Chen et al. (2023a), where the single-variate diffusions are extended to the general multi-variate diffusions. Recall that the forward samples xt are sampled by (11) given a fixed At, we denote the density of xt by ρ t with ρ 0 = ρdata. 
To facilitate the proof, we introduce an auxiliary variable y_t simulated from (11) with y_0 ∼ N(0, I), so that y_t is Gaussian at every time t and KL(ρ_data ‖ N(0, I)) is well defined (this would not be the case for a deterministic initialization of y_0). We denote the distribution of y_t at time t by ρ̄_t^y. For a fixed T > 0 and score estimates s_t, let ρ̂_t be the distribution of the continuous-time interpolation of the discretized backward SDE run from t = T to 0 with ρ̂_T = ρ̄_T^y. The generation quality is then measured by the distance between ρ̂_0 and ρ_data.

Theorem 1 (Generation quality). Assume Assumptions A2, A3, and A4 hold. Given a fixed A_t satisfying Assumption A1, the generated data distribution obtained via the EM discretization of Eq. (17) is close to the data distribution ρ_data in the sense that

  TV(ρ̂_0, ρ_data) ≲ sqrt(KL(ρ_data ‖ γ_d)) exp(−T)  [convergence of forward process]  + ( L sqrt(dh) + m_2 h ) sqrt(T)  [EM discretization]  + ε_score sqrt(T)  [score estimation],

where γ_d is the standard Gaussian distribution.

Proof. Following Chen et al. (2023a), we employ the chain rule for the KL divergence and obtain

  KL(ρ_data ‖ ρ̂_0) ≤ KL(ρ̄_T ‖ ρ̂_T) + E_{ρ̄_T(x)}[ KL( ρ̄_{0|T}(· | x) ‖ ρ̂_{0|T}(· | x) ) ],

where ρ̄_{0|T} is the conditional distribution of x̄_0 given x̄_T, and likewise for ρ̂_{0|T}. The two terms correspond to the convergence of the forward and the reverse process, respectively. We proceed to prove:

  Part I (forward process): KL(ρ̄_T ‖ ρ̂_T) = KL(ρ̄_T ‖ ρ̄_T^y) ≲ KL(ρ_data ‖ γ_d) e^{−T};
  Part II (backward process): E_{ρ̄_T(x)}[ KL( ρ̄_{0|T}(· | x) ‖ ρ̂_{0|T}(· | x) ) ] ≲ (L² d h + m_2² h²) T + ε²_score T.

Part I: By the Fokker–Planck equation, we have

  d/dt KL(ρ̄_t ‖ ρ̄_t^y) = −(1/2) β_t J_{ρ̄_t^y}(ρ̄_t),   where   J_{ρ̄_t^y}(ρ̄_t) = ∫ ρ̄_t(x) ‖∇ ln( ρ̄_t(x) / ρ̄_t^y(x) )‖² dx

is the relative Fisher information of ρ̄_t with respect to ρ̄_t^y. Note that for all t ≥ 0, ρ̄_t^y is a Gaussian distribution and hence satisfies a log-Sobolev inequality (Vempala & Wibisono, 2022). It follows that

  KL(ρ̄_t ‖ ρ̄_t^y) ≤ (1/(2α_t)) J_{ρ̄_t^y}(ρ̄_t),

where α_t is the log-Sobolev constant of ρ̄_t^y. This implies that

  d/dt KL(ρ̄_t ‖ ρ̄_t^y) ≤ −α_t β_t KL(ρ̄_t ‖ ρ̄_t^y).

Applying Grönwall's inequality yields

  KL(ρ̄_t ‖ ρ̄_t^y) ≤ e^{−∫_0^t α_s β_s ds} KL(ρ̄_0 ‖ ρ̄_0^y) ≤ e^{−α ∫_0^t β_s ds} KL(ρ̄_0 ‖ ρ̄_0^y),

where the last inequality follows from Lemma 1 and α is a lower bound on the LSI constant inf_{t∈[0,T]} α_t. Then, by Pinsker's inequality, we have

  TV(ρ̄_t, ρ̄_t^y) ≤ sqrt( 2 KL(ρ̄_t ‖ ρ̄_t^y) ) ≤ sqrt( 2 e^{−α ∫_0^t β_s ds} KL(ρ̄_0 ‖ ρ̄_0^y) ) ≲ sqrt( KL(ρ_data ‖ γ_d) ) exp(−t).

Part II: The proof for the convergence of the reverse process is essentially identical to that of Theorem 2.1 of Chen et al. (2023a), with the only potential replacements being instances of ‖x_t − x_{kh}‖² by ‖D_{T−t}(x_t − x_{kh})‖²; these are equivalent up to constants by Assumption A1, so we omit the proof here.

In conclusion, the convergence follows from

  KL(ρ_data ‖ ρ̂_0) ≲ KL(ρ_data ‖ γ_d) e^{−T} + (L² d h + m_2² h²) T + ε²_score T,

and we obtain the final result using Pinsker's inequality.
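As a small numerical illustration of Part I, the following computes KL(ρ̄_t ‖ ρ̄_t^y) in closed form when the data distribution is itself Gaussian, using a constant diagonal D = I − 2A and a constant β, and compares it against the Grönwall bound e^{−α∫β} KL(ρ_data ‖ γ_d) with α = min{1, λ_min}. The specific numbers are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

def kl_diag_gauss(m_p, v_p, m_q, v_q):
    """KL( N(m_p, diag(v_p)) || N(m_q, diag(v_q)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)

beta = 1.0
D = np.array([1.0, 4.0])                               # eigenvalues of D = I - 2A, so lambda_min = 1
m0, S0 = np.array([2.0, -1.0]), np.array([0.5, 2.0])   # Gaussian "data": N(m0, diag(S0))
alpha = min(1.0, D.min())                              # LSI lower bound from Lemma 1

kl0 = kl_diag_gauss(m0, S0, np.zeros(2), np.ones(2))   # KL(rho_data || gamma_d)
for t in [0.5, 1.0, 2.0, 4.0]:
    E2 = np.exp(-beta * t * D)                         # exp(-t * beta * D)
    Sigma_t0 = (1.0 - E2) / D                          # diagonal of Sigma_{t|0}
    v_data = E2 * S0 + Sigma_t0                        # marginal of the data-initialized process
    v_aux = E2 * 1.0 + Sigma_t0                        # marginal of the N(0, I)-initialized process
    kl_t = kl_diag_gauss(np.sqrt(E2) * m0, v_data, np.zeros(2), v_aux)
    bound = np.exp(-alpha * beta * t) * kl0
    print(f"t={t:>3}:  KL = {kl_t:.4f}  <=  exp(-alpha * beta * t) * KL_0 = {bound:.4f}")
```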
Lemma 1 (Lower bound on the log-Sobolev constant). Under the same assumptions and setup as Theorem 1, we have

  inf_{t∈[0,T]} α_t ≥ min{1, λ_min} =: α = O(1).

Proof. Consider the auxiliary process y_t.

Randomness from the initialization: by the mean diffusion in Eq. (12a), the conditional mean of y_t at time t, denoted by µ_{t,y}, satisfies µ_{t,y} = D̄_t µ_{0,y}, where D̄_t = e^{−(1/2)[βD]_t} denotes the matrix exponential induced by the drift (not to be confused with D_t = I − 2A_t). Since y_0 ∼ N(0, I), we know that µ_{t,y} ∼ N(0, D̄_t D̄_t^⊤).

Randomness from the Brownian motion: the covariance induced by the Brownian motion follows Σ_{t|0} in Eq. (12b). Since y_0 ∼ N(0, I) and y_t is an OU process by Eq. (11), y_t is a Gaussian distribution at every time t ≥ 0 with mean 0. As such, we know that

  ρ̄_t^y = N(0, D̄_t D̄_t^⊤ + Σ_{t|0}).   (26)

It follows that TV(ρ̄_t, ρ̄_t^y) ≤ sqrt( 2 e^{−∫_0^t α_s β_s ds} KL(ρ̄_0 ‖ ρ̄_0^y) ). Now we need to bound the log-Sobolev constant α_t of ρ̄_t^y. Let Σ̄_t = D̄_t D̄_t^⊤ + Σ_{t|0}. Recall that if a distribution p is α-strongly log-concave, then it satisfies the log-Sobolev inequality (LSI) with LSI constant α (Vempala & Wibisono, 2022). So for the Gaussian distribution ρ̄_t^y, it suffices to upper-bound the largest eigenvalue of Σ̄_t, whose inverse lower-bounds α_t. Recall from Eq. (12b) that Σ̄_t satisfies the ODE

  dΣ̄_t/dt = −(1/2) β_t ( D_t Σ̄_t + Σ̄_t D_t^⊤ ) + β_t I,   Σ̄_0 = I.

Fix a normalized vector x ∈ R^d and denote u_t = x^⊤ Σ̄_t x for t ∈ [0, T]. By the cyclical property of the trace, we have

  x^⊤ D_t Σ̄_t x = Tr( x^⊤ D_t Σ̄_t x ) = Tr( D_t Σ̄_t x x^⊤ ) ≥ λ_min Tr( Σ̄_t x x^⊤ ) = λ_min u_t.

It follows that

  du_t/dt ≤ −λ_min β_t u_t + β_t.

Applying Grönwall's inequality tells us that

  u_t ≤ (1/λ_min)( 1 − e^{−λ_min ∫_0^t β_s ds} ) + e^{−λ_min ∫_0^t β_s ds} ≤ max{1, 1/λ_min}.

Since x can be any normalized vector, the largest eigenvalue of Σ̄_t is bounded by max{1, 1/λ_min}, and hence

  inf_{t∈[0,T]} α_t ≥ min{1, λ_min} =: α = O(1),

where α_t is the log-Sobolev constant of ρ̄_t^y.

Remark: In the theoretical analysis, we introduced the auxiliary variable y_0 ∼ γ_d to make sure that KL(ρ_data ‖ γ_d) is well defined; moreover, the distribution of y_T is set to ρ̂_T in Eq. (26). We emphasize, however, that the introduction of y_t is only for the theoretical analysis, and in practice we adopt the simpler prior N(0, Σ_{T|0}) instead of N(0, D̄_T D̄_T^⊤ + Σ_{T|0}) of Eq. (26) for convenience.
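As a quick numerical check of Lemma 1, the following integrates the covariance ODE with a simple Euler scheme from Σ̄_0 = I and verifies that the largest eigenvalue of Σ̄_t stays below max{1, 1/λ_min}; the time-varying D_t and β_t used here are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Euler integration of d Sigma_t/dt = -1/2 beta_t (D_t Sigma_t + Sigma_t D_t^T) + beta_t I,
# starting from Sigma_0 = I, and tracking sup_t lambda_max(Sigma_t).

dt, T, d = 1e-3, 10.0, 2
Sigma = np.eye(d)
lam_min = 0.5                                        # lower bound on the eigenvalues of D_t
max_eig = np.linalg.eigvalsh(Sigma).max()
for step in range(int(T / dt)):
    t = step * dt
    beta = 0.1 + 1.9 * t / T                         # increasing beta schedule (illustrative)
    D = np.diag([0.5 + 0.3 * np.sin(t) ** 2, 2.0])   # eigenvalues stay in [lam_min, 2]
    Sigma = Sigma + dt * (-0.5 * beta * (D @ Sigma + Sigma @ D.T) + beta * np.eye(d))
    max_eig = max(max_eig, np.linalg.eigvalsh(Sigma).max())

print(f"sup_t lambda_max(Sigma_t) = {max_eig:.4f}  <=  max(1, 1/lambda_min) = {max(1.0, 1.0 / lam_min):.1f}")
```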
B.2. Part II: Stochastic Approximation

A^{(k)}_t converges to A*_t by tracking a mean-field ODE with small fluctuations along the trajectory. Before proving the convergence, we need a stability property of the mean-field ODE, ensuring that small fluctuations of earlier iterates do not affect convergence to the equilibrium. To that end, we construct the Lyapunov function V_t(A) = (1/(2m)) ‖A − A*_t‖²_2 to analyze the local stability of the solution. The result below shows that, when the solution is close to the equilibrium A*_t ∈ Θ ⊂ A, A_t asymptotically tracks the trajectory of the mean field (24) within Θ as the step size η_k → 0.

Lemma 2 (Local stability). Assume Assumptions A1 and A2 hold. For any A ∈ Θ, the solution satisfies the local stability condition

  ⟨ A − A*_t , ∇L̄_t(A) ⟩ ≥ m ‖A − A*_t‖²_2,

so that the Lyapunov function V_t decreases along the mean-field ODE (24).

Proof. By the smoothness assumption A2 and a Taylor expansion, for any A ∈ Θ we have

  ∇L̄_t(A) = ∇L̄_t(A*_t) + Hess[L̄_t]|_{Ã} (A − A*_t) = Hess[L̄_t]|_{Ã} (A − A*_t),   (27)

where Hess[L̄_t]|_A denotes the Hessian of L̄_t with respect to A at time t, and Ã is a point between A and A*_t given by the mean-value theorem. It follows that

  ⟨ A − A*_t , ∇L̄_t(A) ⟩ = ⟨ A − A*_t , Hess[L̄_t]|_{Ã} (A − A*_t) ⟩ ≥ m ‖A − A*_t‖²_2,

where the last inequality follows from Assumption A1.

Additionally, we show that the random field satisfies a linear growth condition to avoid blow-up in the tails.

Lemma 3 (Linear growth). Assume Assumptions A2 and A3 hold. There exists a constant C > 0 such that, for A^{(k)}_t ∈ Θ at SA step k and time t, the random field is bounded in L2:

  E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 | F_k ] ≤ C( 1 + ‖A^{(k)}_t − A*_t‖²_2 ) := C( 1 + ‖G^{(k)}_t‖²_2 ),

where the trajectory x̄^{(k+1)}_t is simulated by (22), and F_k is the σ-filtration generated by ( x̄^{(1)}_t, A^{(1)}_t, x̄^{(2)}_t, A^{(2)}_t, ..., x̄^{(k)}_t, A^{(k)}_t ).

Proof. By the unbiasedness of the random field, we have

  E[ ∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t) | F_k ] = 0.   (28)

It follows that

  E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 | F_k ]
    = E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t) + ∇L̄_t(A^{(k)}_t)‖²_2 | F_k ]
    = E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t)‖²_2 | F_k ] + ‖∇L̄_t(A^{(k)}_t)‖²_2
    ≤ sup E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t)‖²_2 | F_k ] + M² ‖A^{(k)}_t − A*_t‖²_2,

where the last inequality follows from Assumption A1 and Eq. (27). By Assumptions A2 and A3 and the process (17), we know that

  sup E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t)‖²_2 | F_k ] < ∞.

Denoting C := max{ sup E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t)‖²_2 | F_k ], M² }, we conclude that

  E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 | F_k ] ≤ C( 1 + ‖A^{(k)}_t − A*_t‖²_2 ).

Next, we make standard assumptions on the step size, following Benveniste et al. (1990) (page 245).

Assumption A5 (Step size). The step size {η_k}_{k∈N} is a positive, decreasing sequence satisfying

  ∑_{k=1}^∞ η_k = +∞,   lim inf_{k→∞} [ 2m η_k / η_{k+1} + (η_{k+1} − η_k) / η²_{k+1} ] > 0.

A standard choice is η_k := A/(k^α + B) for some α ∈ (1/2, 1] and suitable constants A > 0 and B > 0.

Theorem 2 (Convergence in L2). Assume Assumptions A1, A2, A3, A4, and A5 hold. The variational score A^{(k)}_t in Algorithm 2 converges to a local minimizer A*_t. In other words, for any large enough k ≥ k_0 with η_{k_0} ≤ 1/2, we have

  E[ ‖A^{(k)}_t − A*_t‖²_2 ] ≤ 2 η_k,

where the expectation is taken with respect to samples from ρ̄^{(k)}_t, the marginal of the forward process (11) driven by A^{(k)}_t.

Proof. To show that A^{(k)}_t converges to A*_t, we first denote G^{(k)}_t = A^{(k)}_t − A*_t. Subtracting A*_t on both sides of Eq. (23) yields

  G^{(k+1)}_t = G^{(k)}_t − η_{k+1} ∇L_t( A^{(k)}_t ; x̄^{(k+1)}_t ).   (29)

By the unbiasedness of the random field, we have

  E[ ∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) − ∇L̄_t(A^{(k)}_t) | F_k ] = 0.   (30)

Taking the conditional expectation of the squared L2 norm, we have

  E[ ‖G^{(k+1)}_t‖²_2 | F_k ]
    = ‖G^{(k)}_t‖²_2 − 2 η_{k+1} E[ ⟨ G^{(k)}_t , ∇L_t(A^{(k)}_t; x̄^{(k+1)}_t) ⟩ | F_k ] + η²_{k+1} E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 | F_k ]
    = ‖G^{(k)}_t‖²_2 − 2 η_{k+1} ⟨ G^{(k)}_t , ∇L̄_t(A^{(k)}_t) ⟩ + η²_{k+1} E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 | F_k ],

where the second equality follows from the unbiasedness property in Eq. (30). Applying the step-size assumption A5, for k large enough we have

  η_{k+1} − η_k + 2m η_k η_{k+1} ≥ C η²_{k+1}.

Then, for η_k ≤ 1/2, we have

  2( η_{k+1} − η_k + η_k η_{k+1} ( 2m − η_{k+1} C ) ) ≥ C η²_{k+1},

which can be rewritten as

  2 η_{k+1} ≥ ( 1 − 2 η_{k+1} m + C η²_{k+1} )( 2 η_k ) + C η²_{k+1}.

By induction, given some large enough k ≥ k_0 with η_{k_0} ≤ 1/2, A^{(k)}_t lies in some subset Θ** of A that satisfies

  E[ ‖G^{(k)}_t‖²_2 ] ≤ 2 η_k.   (31)

Applying the expansion above together with Lemmas 2 and 3, we have

  E[ ‖G^{(k+1)}_t‖²_2 ] ≤ ( 1 − 2 η_{k+1} m ) E[ ‖G^{(k)}_t‖²_2 ] + η²_{k+1} E[ ‖∇L_t(A^{(k)}_t; x̄^{(k+1)}_t)‖²_2 ]
                        ≤ ( 1 − 2 η_{k+1} m + C η²_{k+1} ) E[ ‖G^{(k)}_t‖²_2 ] + C η²_{k+1}
                        ≤ ( 1 − 2 η_{k+1} m + C η²_{k+1} )( 2 η_k ) + C η²_{k+1} ≤ 2 η_{k+1},   (32)

where the first inequality holds by the stability property in Lemma 2, the second by the growth property in Lemma 3, the third by the induction hypothesis (31), and the last by the rewritten step-size inequality. Since A*_t, A^{(k)}_t ∈ Θ, Eq. (32) implies that A^{(k+1)}_t ∈ Θ, which concludes the proof.

**By Assumption A1, such a Θ ⊂ A exists; otherwise, the mean-field function would be constant, and the conclusion holds as well.
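As a sanity check of the O(η_k) rate in Theorem 2, the following toy simulation runs the SA recursion (23) on a strongly convex quadratic with additive noise (a stand-in for the random field, not the paper's forward loss) and compares E‖A^{(k)} − A*‖²_2 against the 2η_k bound.

```python
import numpy as np

# Toy SA recursion A^{(k+1)} = A^{(k)} - eta_{k+1} * grad, where grad is a noisy gradient
# of m/2 * ||A - A*||^2 (a hypothetical stand-in for the random field in Eq. (23)).

rng = np.random.default_rng(0)
m, A_star = 1.0, np.array([0.7, -0.3])          # curvature and (toy) equilibrium

def eta(k, a=2.0, b=10.0, alpha=1.0):           # eta_k = a / (k^alpha + b), as in Assumption A5
    return a / (k ** alpha + b)

n_runs, k_max = 2000, 500
A = np.zeros((n_runs, 2))                       # all runs start at A^{(0)} = 0
for k in range(1, k_max + 1):
    noise = 0.1 * rng.normal(size=A.shape)      # zero-mean fluctuation of the random field
    grad = m * (A - A_star) + noise             # unbiased estimate of the mean field
    A -= eta(k) * grad
    if k % 100 == 0:
        err = np.mean(np.sum((A - A_star) ** 2, axis=1))
        print(f"k={k:4d}  E||A - A*||^2 = {err:.4f}   2*eta_k = {2 * eta(k):.4f}")
```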
B.3. Part III: Convergence of Adaptive Samples Based on the Optimal A*

We have evaluated the sample quality in Theorem 1 based on a fixed A_t, which, however, may not be efficient in terms of transportation plans. To evaluate the sample quality under the limiting optimal A*, we provide the following result.

Theorem 3. Given Assumptions A1–A5, the generated sample distribution at stage k is close to the exact sample distribution based on the equilibrium A*_t in the sense that

  TV(ρ̂_0, ρ_data) ≲ sqrt( KL(ρ_data ‖ γ_d) ) exp(−T) + ( L sqrt(dh) + m_2 h ) sqrt(T) + ( ε_score + sqrt(η_k) ) sqrt(T).

Proof. By Assumption A4, for any A^{(k)}_t ∈ A, we have

  E_{ρ̄^{(k)}_t}[ ‖s^{(k)}_t − ∇log ρ̄^{(k)}_t‖²_2 ] ≤ ε²_score.

Combining Theorem 2 and the smoothness (Assumption A2) of the score function ∇log ρ̄^{(k)}_t with respect to A^{(k)}_t, we have

  E_{ρ̄^{(k)}_t}[ ‖∇log ρ̄^{(k)}_t − ∇log ρ̄*_t‖²_2 ] ≲ η_k,   (33)

where ρ̄*_t denotes the forward marginal driven by the equilibrium A*_t. It follows that the score estimate s^{(k)}_t is also close to the optimal ∇log ρ̄*_t in the sense that

  E_{ρ̄^{(k)}_t}[ ‖s^{(k)}_t − ∇log ρ̄*_t‖²_2 ] ≲ E_{ρ̄^{(k)}_t}[ ‖s^{(k)}_t − ∇log ρ̄^{(k)}_t‖²_2 ]  (by Assumption A4)  + E_{ρ̄^{(k)}_t}[ ‖∇log ρ̄^{(k)}_t − ∇log ρ̄*_t‖²_2 ]  (by Eq. (33))  ≲ ε²_score + η_k.   (34)

Applying Theorem 1 with the adaptive score error in Eq. (34) in place of ε²_score concludes the proof.

Remark: The convergence of samples based on the adaptive algorithm is slightly weaker than the standard one due to the adaptive update, but this is necessary because A*_t is more transport-efficient than a vanilla A_t.

C. Variational Gap

Recall that the optimal forward SDE in the forward-backward SDEs (5) follows

  d x̄_t = [ f_t(x̄_t) + β_t ∇log ψ_t(x̄_t) ] dt + sqrt(β_t) d w_t.   (35)

The optimal variational forward SDE follows

  d x̄_t = [ f_t(x̄_t) + β_t A*_t x̄_t ] dt + sqrt(β_t) d w_t.   (36)

The variational forward SDE at the k-th stage follows

  d x̄_t = [ f_t(x̄_t) + β_t A^{(k)}_t x̄_t ] dt + sqrt(β_t) d w_t.   (37)

Since we only employ a linear approximation of the forward score function, our transport is only sub-optimal. To assess the extent of this discrepancy, we leverage the Girsanov theorem to study the variational gap. We denote the laws of the processes in Eqs. (35), (36), and (37) by L(·), L*(·), and L^{(k)}(·), respectively.

Theorem 4. Assume Assumptions A2 and A3 hold, assume f_t and ∇log ψ_t are Lipschitz smooth and satisfy a linear growth condition, and assume Novikov's condition holds for any A_t ∈ A and t ∈ [0, T]:

  E exp( (1/2) ∫_0^T β_t ‖ A_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2 dt ) < ∞.

The variational gap (VG) of the linear parametrization satisfies

  KL( L ‖ L* ) = (1/2) ∫_0^T E[ β_t ‖ A*_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2 ] dt,   KL( L ‖ L^{(k)} ) ≲ η_k + KL( L ‖ L* ).

Proof. By Girsanov's formula (Liptser & Shiryaev, 2001), the Radon–Nikodym derivative of L(·) with respect to L*(·) satisfies

  dL/dL*( x̄ ) = exp( ∫_0^T sqrt(β_t) ⟨ A*_t x̄_t − ∇log ψ_t(x̄_t) , dw_t ⟩ − (1/2) ∫_0^T β_t ‖ A*_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2 dt ),

where w_t is the Brownian motion under the Wiener measure. Consider the change of measure (Øksendal, 2003; Chewi, 2023)

  w_t = w̃_t − ⟨ w, M ⟩_t,   dM_t = sqrt(β_t) ⟨ A*_t x̄_t − ∇log ψ_t(x̄_t) , dw_t ⟩,

where w̃_t is an L-standard Brownian motion and satisfies the martingale property under the measure L(·). Now the variational gap follows as

  KL( L(·) ‖ L*(·) ) = E_{L(·)}[ ∫_0^T sqrt(β_t) ⟨ A*_t x̄_t − ∇log ψ_t(x̄_t) , dw̃_t ⟩ + (1/2) ∫_0^T β_t ‖ A*_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2 dt ]
                      = (1/2) ∫_0^T E[ β_t ‖ A*_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2 ] dt,

where the stochastic integral vanishes in expectation by the martingale property. Similarly, applying (a + b)² ≤ 2a² + 2b², we have

  KL( L(·) ‖ L^{(k)}(·) ) ≤ ∫_0^T E[ β_t ( ‖ A^{(k)}_t x̄_t − A*_t x̄_t ‖²_2  [convergence of SA]  + ‖ A*_t x̄_t − ∇log ψ_t(x̄_t) ‖²_2  [variational gap based on A*_t] ) ] dt ≲ η_k + KL( L ‖ L* ),

where the first term is of order η_k by Theorem 2 and the second-moment bound in Assumption A3.

D. Experimental Details

D.1. Parametrization of the Variational Score

For the general transport, there is no closed-form update, and we adopt an SVD parametrization with time embeddings to learn the linear dynamics, as shown in Figure 6. The number of parameters is reduced by thousands of times, which greatly reduces the training variance (Grathwohl et al., 2019).
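As a rough illustration of the parametrization in D.1 (see Figure 6 below), the following sketch builds a time-dependent linear operator from orthogonal factors U, V and time-embedded singular values. The module name, the sinusoidal embedding, and the Householder construction of the orthogonal matrices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LinearVariationalScore(nn.Module):
    """Sketch of a time-dependent linear map A_t x = U diag(lambda(t)) V^T x,
    with orthogonal U, V and singular values predicted from a time embedding.
    Hypothetical architecture, loosely following the description of Figure 6."""

    def __init__(self, dim, emb_dim=32):
        super().__init__()
        # Orthogonal factors via a Householder-style parametrization (one reflection each).
        self.u_vec = nn.Parameter(torch.randn(dim))
        self.v_vec = nn.Parameter(torch.randn(dim))
        self.mlp = nn.Sequential(nn.Linear(emb_dim, 64), nn.SiLU(), nn.Linear(64, dim))
        self.emb_dim = emb_dim

    @staticmethod
    def _householder(v):
        v = v / v.norm()
        return torch.eye(v.numel(), device=v.device) - 2.0 * torch.outer(v, v)

    def _time_embedding(self, t):
        freqs = torch.arange(self.emb_dim // 2, device=t.device).float()
        angles = t[:, None] * torch.exp(-freqs)[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, t, x):
        U = self._householder(self.u_vec)
        V = self._householder(self.v_vec)
        lam = self.mlp(self._time_embedding(t))            # singular values Lambda(t)
        return (x @ V) * lam @ U.T                         # U diag(lam) V^T x, batched

# Usage: a (batch,) time vector and a (batch, dim) input.
model = LinearVariationalScore(dim=2)
out = model(torch.rand(16), torch.randn(16, 2))
print(out.shape)
```

The design point is that only the singular values depend on time, so the linear dynamics add a negligible number of parameters compared to the backward score network.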
D.2. Synthetic Data

D.2.1. Checkerboard Data

The generation of the checkerboard data is presented in Figure 7. The probability path is presented in Figure 8. The conclusion is similar to that for the spiral data.

Figure 6. Architecture of the linear module. Both U and V are orthogonal matrices and Λ denotes the singular values.

Figure 7. Variational Schrödinger diffusion models (VSDMs, right) vs. SGMs (left) with the same hyperparameters (βmax = 10).

Figure 8. Probability flow ODE via VSDMs and SGMs; SGM with βmax = 10 is denoted by SGM-10 for convenience (panel (d): VSDM-10).

D.2.2. Convergence and Computational Time

Convergence study. Under the same setup, VSDM-10 adaptively learns A_t (and D_t) on the fly and adapts to the pathological geometry via optimal transport. For the spiral-8Y data, the Y-axis singular value of D_t (scaled by βmax) converges from 10 to around 19, as shown in Figure 9; the X-axis singular value quickly converges from 10 to a more conservative scale of 7. We also tried VSDM-20 and found that both the Y-axis and X-axis values converge to similar scales, which justifies the stability.

Figure 9. Optimization of Λ of D scaled by βmax (scaled Λ of D) for VSDM-10 and VSDM-20. Panels: (a) Spiral-8Y, (b) convergence vs. time, (c) Checkerboard-6X.

Computational time. We tried different budgets for training the variational scores and observed in Figure 9(b) that 300 iterations yield the fastest convergence among the four choices but also lead to 23% extra time compared to SGM. Reducing the number of iterations impacts convergence only minimally, owing to the linearity of the variational scores, and significantly reduces the training time.

D.2.3. Evaluation of the Straightness

Straighter trajectories lead to a smaller number of function evaluations (NFEs). In Section D.2.4, we compare VSDM-20 with SGM-20 at NFE=6 and NFE=8 using the same computational budget and observe in Figures 10 and 11 the superiority of the VSDM model in generating more details. To evaluate the straightness of the probability flow ODE, similar in spirit to Pooladian et al. (2023), we define our straightness metric by approximating the second time derivative of the probability flow (18) as follows:

  S(i) = ∫_0^T E_{x_t ∼ ρ_t} [ | d² x_t(i) / dt² | ] dt,   (38)

where i ∈ {1, 2}, and x_t(1) and x_t(2) denote the X-axis and Y-axis, respectively. We have S ≥ 0, and S = 0 only when the transport follows a straight path. We report the straightness in Table 4 and find that the improvement of VSDM-10 over SGM-20 and SGM-30 is around 40%. We also tried VSDM-20 on both datasets and found a significant improvement over the baseline SGM methods. However, despite the consistent convergence in Figure 9, VSDM-20 still performs slightly worse than VSDM-10, which suggests the potential of tuning βmax to further enhance the performance.

Table 4. Straightness metric defined in Eq. (38) for SGMs and VSDMs with different βmax values. SGM with βmax = 10 (SGM-10) fails to generate data of anisotropic shapes and is not reported.

  Straightness (X / Y)    Spiral-8Y      Checkerboard-6X
  SGM-20                  8.3 / 49.3     53.5 / 11.0
  SGM-30                  9.4 / 57.3     64.6 / 13.1
  VSDM-20                 6.3 / 45.6     49.4 / 7.4
  VSDM-10                 5.5 / 38.7     43.9 / 6.5
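As a small sketch of how the straightness metric (38) can be approximated in practice, the following estimates the second time derivative of discretized probability-flow trajectories by central finite differences on a uniform grid; the trajectory shapes and the finite-difference scheme are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def straightness(traj, T=1.0):
    """Approximate S(i) = int_0^T E |d^2 x_t(i)/dt^2| dt from discretized trajectories
    `traj` of shape (num_samples, num_steps, dim), assumed on a uniform time grid."""
    n_steps = traj.shape[1]
    dt = T / (n_steps - 1)
    # Central second differences along the time axis: (x_{n+1} - 2 x_n + x_{n-1}) / dt^2.
    second_deriv = (traj[:, 2:] - 2.0 * traj[:, 1:-1] + traj[:, :-2]) / dt ** 2
    # Monte Carlo average over samples, then a Riemann sum over time; one value per axis.
    return np.abs(second_deriv).mean(axis=0).sum(axis=0) * dt

# Example with synthetic trajectories: straight paths give S ~ 0, curved paths do not.
t = np.linspace(0.0, 1.0, 101)[None, :, None]                      # (1, 101, 1) time grid
straight = np.concatenate([t, t], axis=-1) * np.ones((64, 1, 1))   # linear in t on both axes
curved = np.concatenate([t, np.sin(2 * np.pi * t)], axis=-1) * np.ones((64, 1, 1))
print("straight:", straightness(straight))    # approximately [0, 0]
print("curved:  ", straightness(curved))      # the second axis is clearly nonzero
```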
D.2.4. A Smaller Number of Function Evaluations

We also compare our VSDM-20 with SGM-20 under a small number of function evaluations (NFE). We use the probability flow ODE for these experiments and choose a uniform time grid for convenience. We find that neither model can fully generate the desired data with NFE=6 in Figure 10, although VSDM appears to recover more details, especially at the top and bottom of the spiral. For the checkerboard data, both models work nicely under the same setting and we cannot see a visual difference. With NFE=8 in Figure 11, we observe that our VSDM-20 works remarkably well on both datasets and is slightly superior to SGM-20 in generating the corner details.

Figure 10. Variational Schrödinger diffusion models (bottom) vs. SGMs (top) with the same hyperparameters (βmax = 20) and six function evaluations (NFE=6). Both models generate samples via the probability flow ODE.

Figure 11. Variational Schrödinger diffusion models (bottom) vs. SGMs (top) with the same hyperparameters (βmax = 20) and eight function evaluations (NFE=8). Both models generate samples via the probability flow ODE.

D.3. Multivariate Probabilistic Forecasting

Data. We use publicly available datasets. Exchange rate contains 6071 eight-dimensional measurements at a daily frequency; the goal is to predict the values over the next 30 days. Solar is an hourly 137-dimensional dataset with 7009 values. Electricity is also hourly, with 370 dimensions and 5833 measurements. For both, we predict the values over the next day.

Training. We adopt the encoder-decoder architecture described in the main text and change the decoder to either our generative model or one of the competitors. The encoder is an LSTM with 2 layers and a hidden dimension of 64. We train the model for 200 epochs, where each epoch takes 50 model updates. In the case of our model, we also alternate between the two training directions at a predefined rate. The neural network parameterizing the backward direction has the same hyperparameters as in Rasul et al. (2021): 8 layers, 8 channels, and a hidden dimension of 64. The DDPM baseline uses a standard setting for the linear beta scheduler: βmin = 0.0001, βmax = 0.1, and 150 steps.

Figure 12. Example forecasts for the Electricity (top), Exchange (middle), and Solar (bottom) datasets using our VSDM model. We show 3 out of 370, 8, and 137 dimensions, respectively. Each panel shows the observations together with the median prediction and the 50% / 90% prediction intervals.
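For concreteness, here is a minimal sketch of the linear beta schedule quoted above for the DDPM decoder baseline (βmin = 0.0001, βmax = 0.1, 150 steps); the helper name and the cumulative-product usage follow generic DDPM conventions and are not code from this paper.

```python
import numpy as np

def linear_beta_schedule(beta_min=1e-4, beta_max=0.1, num_steps=150):
    """Linear beta schedule with the values quoted in the Training paragraph."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)     # \bar{alpha}_n used for closed-form noising
    return betas, alphas_cumprod

betas, alphas_cumprod = linear_beta_schedule()
# In the standard DDPM forward process, x_n | x_0 ~ N(sqrt(\bar{alpha}_n) x_0, (1 - \bar{alpha}_n) I).
print(betas[:3], alphas_cumprod[-1])
```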