# Light Schrödinger Bridge

Published as a conference paper at ICLR 2024.

Alexander Korotin (1,2)*, Nikita Gushchin (1)*, Evgeny Burnaev (1,2)
(1) Skolkovo Institute of Science and Technology, (2) Artificial Intelligence Research Institute
a.korotin@skoltech.ru, n.gushchin@skoltech.ru
*Equal contribution.

Despite the recent advances in the field of computational Schrödinger Bridges (SB), most existing SB solvers are still heavyweight and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of a simple-yet-effective baseline for SB just like, e.g., the k-means method in clustering, logistic regression in classification or the Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schrödinger potentials with sum-exp quadratic functions and (b) viewing the log-Schrödinger potentials as energy functions. We show that, combined together, these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple, straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs. Furthermore, we conduct an analysis of the generalization error of our light solver. The code for our solver can be found at https://github.com/ngushchin/LightSB.

Figure 1: Unpaired male → female translation by our LightSB solver applied in the latent space of ALAE for 1024×1024 FFHQ images. Our LightSB converges on 4 CPU cores in less than 1 minute.

1 INTRODUCTION

Over the last several years, there has been considerable progress in developing computational approaches for solving the Schrödinger Bridge problem (Schrödinger, 1931; 1932, SB), which is also known as the dynamic version of the Entropic Optimal Transport (Cuturi, 2013, EOT) problem. The SB problem requires finding the diffusion process between two given distributions that is maximally similar to some given prior process. In turn, the EOT problem is a simplification in which a user is interested only in the joint distribution of the process at the first and the last time steps.

Historically, the majority of studies in the EOT/SB sub-field of machine learning are concentrated around studying EOT between discrete probability distributions, see (Peyré et al., 2019) for a survey. Inspired by the recent developments in the field of generative modeling via diffusions (Ho et al., 2020; Rombach et al., 2022) and the diffusion-related nature of SB, various researchers started developing solvers for continuous setups of EOT/SB, see (Gushchin et al., 2023b) for a survey. This implies that a learner has only sample access to the (continuous) distributions and based on it has to recover the entire SB process (e.g., its drift) between the entire distributions.
Such solvers are straightforwardly applicable to image generation (Wang et al., 2021; De Bortoli et al., 2021) and image-to-image translation tasks (Gushchin et al., 2023a) as well as to important smaller-scale data transfer tasks, e.g., with biological data (Vargas et al., 2021; Koshizuka & Sato, 2022). Almost all existing EOT/SB solvers (see §4) have complex neural network parameterizations and many hyperparameters; they expectedly require time-consuming training/inference procedures. Although this is unavoidable in large-scale generative modeling tasks, these techniques look too heavyweight when a user just wants to learn EOT/SB between some moderate-dimensional data distributions, e.g., those appearing in prospective biological applications of OT (Tong et al., 2020; Vargas et al., 2021; Koshizuka & Sato, 2022; Bunne et al., 2022; 2023; Tong et al., 2023). In this paper, we address this issue and present the following main contributions:

1. We propose a novel light solver for continuous SB with the Wiener prior, i.e., EOT with the quadratic transport cost. Our solver has a straightforward non-minimax learning objective and uses the Gaussian mixture parameterization for the EOT/SB (§3.1, 3.2). This development allows us to solve EOT/SB between distributions in moderate dimensions in a matter of minutes thanks to avoiding the time-consuming max-min optimization, simulation of the full process trajectories, iterative learning and MCMC techniques which are in use in existing continuous solvers (§4).
2. We show that our novel light solver provably satisfies the universal approximation property for EOT/SB between distributions supported on compact sets (§3.3).
3. We derive finite-sample learning guarantees for our solver. We show that the estimation error vanishes at the standard parametric rate with the increase of the sample size (Appendix A).
4. We demonstrate the performance of our light solver in a series of synthetic and real-data experiments (§5), including the ones with real biological data (§5.3) considered in related works.

Our light solver exploits the recent advances in the field of EOT/SB, namely, using log-sum-exp quadratic functions to parameterize Schrödinger potentials for constructing EOT/SB benchmark distributions (Gushchin et al., 2023b) and viewing EOT as an energy-based model (Mokrov et al., 2024) which optimizes a certain Kullback-Leibler divergence. We discuss this in §3.1.

2 BACKGROUND: SCHRÖDINGER BRIDGES

In this section, we recall the main properties of the Schrödinger Bridge (SB) problem with the Wiener prior. We begin with discussing the Entropic Optimal Transport (Cuturi, 2013; Genevay, 2019), which is known as the static SB formulation. Next, we recall the dynamic Schrödinger Bridge formulation and its relation to EOT (Léonard, 2013; Chen et al., 2016). Finally, we summarize the aspects of the practical setup for learning EOT/SB which we consider throughout the paper.

We work in the $D$-dimensional Euclidean space $(\mathbb{R}^D, \|\cdot\|)$. We use $\mathcal{P}_{2,ac}(\mathbb{R}^D)$ to denote the set of absolutely continuous Borel probability distributions on $\mathbb{R}^D$ which have a finite second moment. For any $p \in \mathcal{P}_{2,ac}(\mathbb{R}^D)$, we write $p(x)$ to denote its density at a point $x \in \mathbb{R}^D$. In what follows, $\mathrm{KL}$ is a short notation for the well-celebrated Kullback-Leibler divergence, and $\mathcal{N}(x|r, S)$ is the density at a point $x \in \mathbb{R}^D$ of the normal distribution with mean $r \in \mathbb{R}^D$ and covariance $0 \prec S \in \mathbb{R}^{D \times D}$.

Entropic Optimal Transport (EOT) with the quadratic cost. Assume that $p_0, p_1 \in \mathcal{P}_{2,ac}(\mathbb{R}^D)$ have finite entropy.
For $\epsilon > 0$, the EOT problem between them is to find the minimizer of

$$\min_{\pi \in \Pi(p_0, p_1)} \int_{\mathbb{R}^D \times \mathbb{R}^D} \frac{1}{2}\|x_0 - x_1\|^2 \, \pi(x_0, x_1)\, dx_0\, dx_1 + \epsilon\, \mathrm{KL}\left(\pi \,\|\, p_0 \times p_1\right), \qquad (1)$$

where $\Pi(p_0, p_1)$ is the set of probability distributions (transport plans) on $\mathbb{R}^D \times \mathbb{R}^D$ whose marginals are $p_0$ and $p_1$, respectively, and $p_0 \times p_1$ is the product distribution. $\mathrm{KL}$ in (1) is assumed to be equal to $+\infty$ if $\pi$ is not absolutely continuous. Therefore, one can consider only absolutely continuous $\pi$. The minimizer $\pi^*$ of (1) exists; it is unique and is called the EOT plan.

(Dynamic) Schrödinger Bridge with the Wiener prior. We employ $\Omega$ as the space of $\mathbb{R}^D$-valued functions of time $t \in [0, 1]$ describing trajectories in $\mathbb{R}^D$ which start at time $t = 0$ and end at $t = 1$. In turn, we use $\mathcal{P}(\Omega)$ to denote the set of probability distributions on $\Omega$, i.e., stochastic processes. We use $dW_t$ to denote the differential of the standard Wiener process. For a process $T \in \mathcal{P}(\Omega)$, we denote its joint distribution at $t = 0, 1$ by $\pi^T \in \mathcal{P}(\mathbb{R}^D \times \mathbb{R}^D)$. In turn, we use $T_{|x_0, x_1}$ to denote the distribution of $T$ for $t \in (0, 1)$ conditioned on $T$'s values $x_0, x_1$ at $t = 0, 1$.

Let $W^\epsilon \in \mathcal{P}(\Omega)$ denote the Wiener process with volatility $\epsilon > 0$ which starts at $p_0$ at $t = 0$. Its differential satisfies the stochastic differential equation (SDE): $dW^\epsilon_t = \sqrt{\epsilon}\, dW_t$. The SB problem with the Wiener prior $W^\epsilon$ between $p_0, p_1$ is to minimize the following objective:

$$\min_{T \in \mathcal{F}(p_0, p_1)} \mathrm{KL}\left(T \,\|\, W^\epsilon\right), \qquad (2)$$

where $\mathcal{F}(p_0, p_1) \subset \mathcal{P}(\Omega)$ is the subset of stochastic processes which start at distribution $p_0$ (at time $t = 0$) and end at $p_1$ (at $t = 1$). This problem has a unique solution, which is a diffusion process $T^*$ described by the SDE $dZ_t = g^*(Z_t, t)\,dt + \sqrt{\epsilon}\, dW_t$ (Léonard, 2013, Prop. 2.3). The process $T^*$ is called the Schrödinger Bridge and $g^*: \mathbb{R}^D \times [0, 1] \to \mathbb{R}^D$ is called the optimal drift.

Equivalence between EOT and SB. It is known that the solutions of EOT and SB are related to each other: for the EOT plan $\pi^*$ and the SB process $T^*$ it holds that $\pi^* = \pi^{T^*}$. Hence, the solution $\pi^*$ to EOT (1) can be viewed as a part of the solution $T^*$ to SB (2). What remains uncovered by EOT in SB is the conditional process $T^*_{|x_0, x_1}$. Fortunately, it is known that $T^*_{|x_0, x_1} = W^\epsilon_{|x_0, x_1}$, i.e., it simply matches the inner part of the Wiener prior $W^\epsilon$, which is the well-studied Brownian Bridge (Pinsky & Karlin, 2011, Sec. 8.3.3). Hence, one may treat EOT and SB as equivalent problems.

Characterization of solutions (Léonard, 2013). The EOT plan $\pi^* = \pi^{T^*}$ has a specific form

$$\pi^*(x_0, x_1) = \psi^*(x_0)\, \exp\!\left(-\|x_0 - x_1\|^2 / 2\epsilon\right) \phi^*(x_1), \qquad (3)$$

where $\psi^*, \phi^*: \mathbb{R}^D \to \mathbb{R}_+$ are two measurable functions called the Schrödinger potentials. They are defined up to multiplicative constants. The optimal drift $g^*$ of SB can be derived from $\phi^*$:

$$g^*(x, t) = \epsilon \nabla_x \log \int_{\mathbb{R}^D} \mathcal{N}\left(x'\,|\,x, (1-t)\epsilon I_D\right)\, \phi^*(x')\, dx'. \qquad (4)$$

To use (4), it suffices to know $\phi^*$ up to a multiplicative constant.

Computational Schrödinger Bridge setup. Even though SB/EOT have many useful theoretical properties, solving the SB/EOT problems remains challenging in practice. An analytical solution is available only for the Gaussian case (Chen et al., 2015; Janati et al., 2020; Mallasto et al., 2022; Bunne et al., 2023) plus for some manually constructed benchmark pairs of distributions $p_0, p_1$ (Gushchin et al., 2023b). Moreover, in the real-world settings where SBs are applied (Vargas et al., 2021; Bunne et al., 2023; Tong et al., 2023), distributions $p_0$ and $p_1$ are almost never available explicitly but only through their empirical samples.
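For intuition, in the purely discrete setting, i.e., when $p_0$ and $p_1$ are replaced by the empirical distributions of the samples themselves, EOT (1) can be solved directly with the Sinkhorn algorithm (Cuturi, 2013). A minimal sketch using the POT library (an illustration of the static discrete problem only, not part of our solver; the data below is synthetic):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x0 = rng.normal(size=(500, 2))               # samples standing in for p0
x1 = rng.normal(loc=3.0, size=(500, 2))      # samples standing in for p1

a = np.full(len(x0), 1.0 / len(x0))          # uniform weights on the samples
b = np.full(len(x1), 1.0 / len(x1))
M = ot.dist(x0, x1) / 2.0                    # cost matrix: 0.5 * ||x0 - x1||^2
eps = 0.1                                    # entropic regularization strength

plan = ot.sinkhorn(a, b, M, reg=eps)         # discrete EOT plan, a 500 x 500 matrix
```

Such a discrete plan is defined only on the given samples; the continuous setup formalized below instead requires recovering the plan (and the bridge process) between the entire underlying distributions, including out-of-sample points.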
For the rigor of the exposition, below we formalize the typical EOT/SB learning setup which we consider in our paper. We assume that the learner has access to empirical samples $X_0^N = \{x_0^1, x_0^2, \ldots, x_0^N\} \sim p_0$ and $X_1^M = \{x_1^1, x_1^2, \ldots, x_1^M\} \sim p_1$ from the (unknown) data distributions $p_0, p_1 \in \mathcal{P}_{2,ac}(\mathbb{R}^D)$. These samples (train data) are assumed to be independent. The task of the learner is to recover the solution (process $T^*$ or plan $\pi^*$) to the SB problem (2) between the entire underlying distributions $p_0, p_1$.

The setup above is learning from empirical samples and is usually called continuous OT. In the continuous setup, it is essential to be able to perform out-of-sample estimation, e.g., simulate the SB process trajectories starting from new (test) points $x_0^{new} \sim p_0$ and ending at $p_1$. In some practical use cases of SB, e.g., generative modeling (De Bortoli et al., 2021) and data-to-data translation (Gushchin et al., 2023a), a user is primarily interested only in the ends of these trajectories, i.e., new synthetic data samples from $p_1$. In the biological applications, it may be useful to study the entire trajectory as well (Bunne et al., 2023; Pariset et al., 2023). Hence, finding the solution to SB usually implies recovering the drift $g^*$ of SB or the conditional distributions $\pi^*(x_1|x_0)$ of the EOT plan.

3 LIGHT SCHRÖDINGER BRIDGE SOLVER

In §3.1, we derive the learning objective for our light SB solver. In §3.2, we present its training and inference procedures. In §3.3, we prove that our solver is a universal approximator of SBs.

3.1 DERIVING THE LEARNING OBJECTIVE

The main idea of our solver is to recover a parametric approximation $\pi_\theta \approx \pi^*$ of the EOT plan. Then this learned plan $\pi_\theta$ will be used to construct an approximation $T_\theta \approx T^*$ to the entire SB process (§3.2). To learn $\pi_\theta$, we want to directly minimize the KL divergence between $\pi^*$ and $\pi_\theta$:

$$\mathrm{KL}\left(\pi^* \,\|\, \pi_\theta\right) \to \min_{\theta \in \Theta}. \qquad (5)$$

This objective is straightforward but there is an obvious obstacle: we do not know the EOT plan $\pi^*$. The magic is that optimization (5) can still be performed despite not knowing $\pi^*$.

To begin with, recall that we already know that the EOT plan $\pi^*$ has the specific form (3). We define $u^*(x_0) \stackrel{def}{=} \exp\!\left(-\tfrac{\|x_0\|^2}{2\epsilon}\right)\psi^*(x_0)$ and $v^*(x_1) \stackrel{def}{=} \exp\!\left(-\tfrac{\|x_1\|^2}{2\epsilon}\right)\phi^*(x_1)$. These two are measurable functions $\mathbb{R}^D \to \mathbb{R}_+$, and we call them the adjusted Schrödinger potentials. Now (3) reads as

$$\pi^*(x_0, x_1) = u^*(x_0)\exp\!\left(\langle x_0, x_1\rangle/\epsilon\right)v^*(x_1) \;\Rightarrow\; \pi^*(x_1|x_0) \propto \exp\!\left(\langle x_0, x_1\rangle/\epsilon\right)v^*(x_1). \qquad (6)$$

Our idea is to exploit this knowledge to parameterize $\pi_\theta$. We define

$$\pi_\theta(x_0, x_1) = p_0(x_0)\,\pi_\theta(x_1|x_0) = p_0(x_0)\,\frac{\exp\!\left(\langle x_0, x_1\rangle/\epsilon\right)v_\theta(x_1)}{c_\theta(x_0)}, \qquad (7)$$

i.e., we parameterize $v^*$ as $v_\theta$. In turn, $c_\theta(x_0) \stackrel{def}{=} \int_{\mathbb{R}^D} \exp\!\left(\langle x_0, x_1\rangle/\epsilon\right)v_\theta(x_1)\, dx_1$ is the normalization. Our parameterization guarantees that $\int_{\mathbb{R}^D}\pi_\theta(x_0, x_1)\, dx_1 = p_0(x_0)$. This is reasonable, as in practice a learner is interested in the conditional distributions $\pi^*(x_1|x_0)$ rather than the density $\pi^*(x_0, x_1)$ of the plan. To be precise, we parameterize all the conditional distributions $\pi_\theta(x_1|x_0)$ via a common potential $v_\theta$. Below we will see that this is sufficient for both training and inference. In Appendix E, we show that within our framework it is easy to also parameterize the density of $p_0$ in (7). Surprisingly, this approach naturally coincides with just fitting a separate density model for $p_0$.

Now we show that optimization (5) can be performed without the knowledge of $\pi^*$.

Proposition 3.1 (Feasible reformulation of the KL minimization).
For parameterization (7), it holds that the main KL objective (5) admits the representation $\mathrm{KL}\left(\pi^* \,\|\, \pi_\theta\right) = \mathcal{L}(\theta) - \mathcal{L}^*$, where

$$\mathcal{L}(\theta) \stackrel{def}{=} \int_{\mathbb{R}^D} \log c_\theta(x_0)\, p_0(x_0)\, dx_0 - \int_{\mathbb{R}^D} \log v_\theta(x_1)\, p_1(x_1)\, dx_1, \qquad (8)$$

and $\mathcal{L}^* \in \mathbb{R}$ is a constant depending on the distributions $p_0, p_1$ and the value $\epsilon > 0$ but not on $\theta$.

We see that minimizing the KL divergence amounts to minimizing the difference between the expectations of $\log c_\theta(x_0)$ and $\log v_\theta(x_1)$ w.r.t. $p_0$ and $p_1$, respectively. This means that the objective value (8) admits Monte-Carlo estimation from random samples and (8) can be optimized with stochastic gradient descent w.r.t. $\theta$. Yet there is still an obstacle: $c_\theta$ may be hard to compute analytically. We fix this below.

We recall (6) with $x_0 = 0$ and see that $\pi^*(x_1|x_0{=}0) \propto v^*(x_1)$, i.e., $v^*$ can be viewed as an unnormalized density of a distribution. Inspired by this observation, we employ the (unnormalized) Gaussian mixture parameterization for the adjusted potential $v_\theta$:

$$v_\theta(x_1) \stackrel{def}{=} \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(x_1|r_k, \epsilon S_k), \qquad (9)$$

where $\theta \stackrel{def}{=} \{\alpha_k, r_k, S_k\}_{k=1}^{K}$ are the parameters: $\alpha_k \geq 0$, $r_k \in \mathbb{R}^D$ and symmetric $0 \prec S_k \in \mathbb{R}^{D\times D}$. Note that we multiply $S_k$ in (9) by $\epsilon$ just to simplify the further derivations. For parameterization (9), the conditional distributions $\pi_\theta(x_1|x_0)$ and the normalization constants $c_\theta(x_0)$ are tractable.

Proposition 3.2 (Tractable form of the plan's components). For the Gaussian mixture parameterization (9) of the adjusted Schrödinger potential $v_\theta$ in (7), it holds that

$$\pi_\theta(x_1|x_0) = \frac{1}{c_\theta(x_0)}\sum_{k=1}^{K}\widetilde{\alpha}_k(x_0)\,\mathcal{N}\big(x_1|r_k(x_0), \epsilon S_k\big),$$

where $r_k(x_0) \stackrel{def}{=} r_k + S_k x_0$, $\widetilde{\alpha}_k(x_0) \stackrel{def}{=} \alpha_k \exp\!\left(\frac{x_0^T S_k x_0 + 2 r_k^T x_0}{2\epsilon}\right)$, and $c_\theta(x_0) \stackrel{def}{=} \sum_{k=1}^{K}\widetilde{\alpha}_k(x_0)$.

The proposition provides the closed-form expression for $c_\theta(x_0)$ which is needed to optimize (8).

Relation to prior work. Optimizing objective (8) and using the Gaussian mixture parameterization (9) for the Schrödinger potentials are two ideas that appeared separately in (Mokrov et al., 2024) and (Gushchin et al., 2023b), respectively, and in different contexts. Our contribution is to combine these ideas together to get an efficient and light solver for SB, see the details below.

(a) Energy-based view on EOT. In (Mokrov et al., 2024), the authors aim to solve the continuous EOT problem. Using the duality for weak OT, the authors derive an objective which matches our (8) up to some change of variables, see their eq. (17) and Theorem 2. The lack of a closed form for the normalization constant forces the authors to use complex energy-based modeling (LeCun et al., 2006; Song & Kingma, 2021, EBM) techniques to perform the optimization. This is computationally heavy due to using MCMC both during training and inference of the solver. Compared to their work, we employ a special parameterization of the Schrödinger potential, which allows us to obtain the closed-form expression for the normalizing constant. In turn, this allows us to optimize (8) directly with minibatch gradient descent and removes the burden of using MCMC at any point. Besides, our solver yields a closed-form expression for SB (see §3.2) while their approach does not.

(b) Parameterization of Schrödinger potentials with Gaussian mixtures. In (Gushchin et al., 2023b), the authors are interested in manually constructing pairs of continuous probability distributions for which the ground-truth EOT/SB solution is available analytically. They propose a generic method to do this (Theorem 3.2) and construct several pairs to be used as a benchmark for EOT/SB solvers.
The key building block in their methodology is to use a kind of Gaussian mixture parameterization for the Schrödinger potentials, which allows them to obtain closed-form EOT and SB solutions for the constructed pairs (Proposition 3.3, Corollary 3.5). Our parameterization (9) coincides with theirs up to some change of variables. The important difference is that we use it to learn the EOT/SB solution (via learning parameters $\theta$ by optimizing (8)) for a given pair of distributions $p_0, p_1$. At the same time, the authors pick the parameters $\theta$ at random to simply set up some potential and use it to build some pairs of distributions with the EOT plan between them available by their construction.

To summarize, our contribution is to unite these two separate ideas (a) and (b) from the field of EOT/SB. We obtain a straightforward minimization objective (8) which can be easily approached by standard gradient descent (§3.2). This makes the process of solving EOT/SB easier and faster.

3.2 TRAINING AND INFERENCE PROCEDURES

TRAINING. As the distributions $p_0, p_1$ are accessible only via samples $X_0 = \{x_0^1, \ldots, x_0^N\} \sim p_0$ and $X_1 = \{x_1^1, \ldots, x_1^M\} \sim p_1$ (recall the setup in §2), we optimize the empirical counterpart of (8):¹

$$\widehat{\mathcal{L}}(\theta) \stackrel{def}{=} \frac{1}{N}\sum_{n=1}^{N}\log c_\theta(x_0^n) - \frac{1}{M}\sum_{m=1}^{M}\log v_\theta(x_1^m) \approx \mathcal{L}(\theta). \qquad (10)$$

We use (minibatch) gradient descent w.r.t. the parameters $\theta$. To further simplify the optimization, we consider diagonal matrices $S_k$ in our parameterization (9) of $v_\theta$. Not only does this help to drastically reduce the number of learnable parameters in $\theta$, but it also allows us to quickly compute $S_k^{-1}$ in $O(D)$ time. This simplification strategy works reasonably well in practice, see §5 below. Importantly, it is theoretically justified: in fact, it suffices to even use scalar covariance matrices $S_k = \lambda_k I_D \succ 0$ in $v_\theta$, see our Theorem 3.4. The other details are in Appendix D.

INFERENCE. The conditional distributions $\pi_\theta(x_1|x_0)$ are mixtures of Gaussians whose parameters are given explicitly in Proposition 3.2. Hence, sampling $x_1$ given $x_0$ is straightforward and lightspeed.

So far we have discussed the EOT-related training and inference aspects and skipped the question of how to use $\pi_\theta$ to set up some process $T_\theta \approx T^*$ approximating SB. We fix this below. With each distribution $\pi_\theta$ defined by (7), we associate the specific process $T = T_\theta$ whose joint distribution at $t = 0, 1$ matches $\pi_\theta$ and whose conditionals satisfy $T_{\theta|x_0, x_1} = W^\epsilon_{|x_0, x_1}$. Informally, this means that we insert the Brownian Bridge inside the joint distribution $\pi_\theta$ at $t = 0, 1$. Below we show that this process admits a closed-form drift $g_\theta$ and that the quality of approximation of $T^*$ by $T_\theta$ is the same as that of approximation of $\pi^*$ by $\pi_\theta$.

Proposition 3.3 (Properties of $T_\theta$). Let $v_\theta$ be an unnormalized Gaussian mixture given by (9) and $\pi_\theta$ be given by (7). Then $T_\theta$ introduced above is a diffusion process governed by the following SDE:

$$T_\theta: \quad dX_t = g_\theta(X_t, t)\,dt + \sqrt{\epsilon}\,dW_t, \quad X_0 \sim p_0, \qquad (11)$$

$$g_\theta(x, t) \stackrel{def}{=} \epsilon \nabla_x \log\left(\mathcal{N}\big(x|0, \epsilon(1-t)I_D\big)\sum_{k=1}^{K}\alpha_k\, \mathcal{N}(r_k|0, \epsilon S_k)\,\frac{\exp\!\big(\tfrac{1}{2}h_k(x, t)^T (A_k^t)^{-1} h_k(x, t)\big)}{\sqrt{\det A_k^t}}\right)$$

with $A_k^t \stackrel{def}{=} \frac{t}{\epsilon(1-t)}I_D + \frac{1}{\epsilon}S_k^{-1}$ and $h_k(x, t) \stackrel{def}{=} \frac{1}{\epsilon(1-t)}x + \frac{1}{\epsilon}S_k^{-1}r_k$. Moreover, it holds that

$$\mathrm{KL}\left(T^* \,\|\, T_\theta\right) = \mathrm{KL}\left(\pi^* \,\|\, \pi_\theta\right). \qquad (12)$$

¹We discuss the generalization properties of our light SB solver in Appendix A.

The proposition provides a closed form for the drift $g_\theta$ of $T_\theta$ for all $(x, t) \in \mathbb{R}^D \times [0, 1]$.
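To make the training and inference procedures of §3.2 concrete, below is a minimal self-contained PyTorch sketch of the parameterization (9) with diagonal $S_k$, the empirical objective (10) computed via the closed-form $\log c_\theta$ from Proposition 3.2, and conditional sampling from $\pi_\theta(x_1|x_0)$. This is our own illustrative implementation (class and method names are ours), not the authors' released code at https://github.com/ngushchin/LightSB.

```python
import math
import torch

class LightSBSketch(torch.nn.Module):
    """Unnormalized Gaussian-mixture potential v_theta (9) with diagonal S_k."""

    def __init__(self, dim, n_components, eps):
        super().__init__()
        self.eps = eps
        self.log_alpha = torch.nn.Parameter(torch.zeros(n_components))     # log alpha_k
        self.r = torch.nn.Parameter(0.1 * torch.randn(n_components, dim))  # r_k
        self.log_S = torch.nn.Parameter(torch.zeros(n_components, dim))    # log diag(S_k)

    def log_c(self, x0):
        # log c_theta(x0) = logsumexp_k [log alpha_k + (x0^T S_k x0 + 2 r_k^T x0) / (2 eps)]
        S = self.log_S.exp()                                  # (K, D)
        quad = (x0[:, None, :] ** 2 * S[None]).sum(-1)        # x0^T S_k x0 -> (B, K)
        lin = 2.0 * (x0 @ self.r.T)                           # 2 r_k^T x0  -> (B, K)
        return torch.logsumexp(self.log_alpha[None] + (quad + lin) / (2 * self.eps), dim=1)

    def log_v(self, x1):
        # log v_theta(x1) = logsumexp_k [log alpha_k + log N(x1 | r_k, eps * S_k)]
        cov = self.eps * self.log_S.exp()                     # diag of eps * S_k, (K, D)
        diff = x1[:, None, :] - self.r[None]                  # (B, K, D)
        log_gauss = -0.5 * ((diff ** 2 / cov[None]).sum(-1)
                            + cov.log().sum(-1)[None]
                            + x1.shape[1] * math.log(2 * math.pi))
        return torch.logsumexp(self.log_alpha[None] + log_gauss, dim=1)

    def loss(self, x0, x1):
        # Empirical objective (10): mean log c_theta(x0) - mean log v_theta(x1).
        return self.log_c(x0).mean() - self.log_v(x1).mean()

    @torch.no_grad()
    def sample_x1_given_x0(self, x0):
        # Proposition 3.2: pi_theta(x1|x0) is a Gaussian mixture with means r_k + S_k x0,
        # covariances eps * S_k and weights proportional to tilde-alpha_k(x0).
        S = self.log_S.exp()
        quad = (x0[:, None, :] ** 2 * S[None]).sum(-1)
        lin = 2.0 * (x0 @ self.r.T)
        logits = self.log_alpha[None] + (quad + lin) / (2 * self.eps)
        k = torch.distributions.Categorical(logits=logits).sample()        # (B,)
        mean = self.r[k] + S[k] * x0
        return mean + (self.eps * S[k]).sqrt() * torch.randn_like(x0)
```

Training then reduces to ordinary minibatch optimization, e.g., `torch.optim.Adam` over `model.parameters()` with `model.loss(x0_batch, x1_batch)`; no simulation, adversarial step or MCMC is involved.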
Now that we know what the process $T_\theta$ looks like, it is straightforward to sample its random trajectories starting at given input points $x_0$. We describe two ways to do this based on well-known schemes.

Euler–Maruyama simulation. This is the well-celebrated time-discretized scheme to solve SDEs. Let $\Delta t = \frac{1}{S}$ be the time discretization step for an integer $S > 0$. Consider the following iteratively constructed (for $s \in \{0, 1, \ldots, S-1\}$) sequence starting at $x_0$:

$$x^{(s+1)\Delta t} \leftarrow x^{s\Delta t} + g_\theta\big(x^{s\Delta t}, s\Delta t\big)\,\Delta t + \sqrt{\epsilon \Delta t}\,\xi_s \quad \text{with} \quad \xi_s \sim \mathcal{N}(0, I_D), \qquad (13)$$

where $\xi_s$ are i.i.d. random Gaussian variables. Then the sequence $\{x^{s\Delta t}\}_{s=1}^{S}$ is a time-discretized approximation of some true trajectory of $T_\theta$ starting from $x_0$. Since our solver provides the closed-form $g_\theta$ (Proposition 3.3) for all $t \in [0, 1]$, one may employ an arbitrarily small discretization step $\Delta t$.

Brownian Bridge simulation. Given a start point $x_0$, one can sample an endpoint $x_1 \sim \pi_\theta(x_1|x_0)$ from the respective Gaussian mixture (Proposition 3.2). What remains is to sample the trajectory from the conditional process $T_{\theta|x_0, x_1}$ which matches the Brownian Bridge $W^\epsilon_{|x_0, x_1}$. Suppose that we already have some trajectory $x_0, x_{t_1}, \ldots, x_{t_L}, x_1$ with $0 < t_1 < \cdots < t_L < 1$ (initially $L = 0$, i.e., we have only $x_0, x_1$), and we want to refine the trajectory by inserting a point at $t_l < t' < t_{l+1}$. Following the properties of the Brownian bridge, it suffices to sample

$$x_{t'} \sim \mathcal{N}\!\left(x_{t'}\,\Big|\,x_{t_l} + \frac{t' - t_l}{t_{l+1} - t_l}\big(x_{t_{l+1}} - x_{t_l}\big),\ \epsilon\,\frac{(t' - t_l)(t_{l+1} - t')}{t_{l+1} - t_l}\, I_D\right). \qquad (14)$$

Using this approach, one may sample arbitrarily precise trajectories of $T_\theta$ without any discretization errors. Unlike (13), this sampling technique does not use the drift $g_\theta$, and to get a random sample at any time $t$ one does not need to sequentially unroll the entire prior trajectory on $[0, t)$.

3.3 UNIVERSAL APPROXIMATION PROPERTY

Considering plans $\pi_\theta$ with the Gaussian mixture parameterization (9), it is natural to wonder whether this parameterization is universal. Namely, we aim to understand whether $T_\theta$ can approximate any Schrödinger bridge arbitrarily well if given a sufficient number of components in the mixture $v_\theta$. We provide a positive answer assuming that $p_0, p_1$ are supported on compact sets. This assumption is not restrictive, as in practice many real-world distributions are compactly supported anyway.

Theorem 3.4 (Gaussian mixture parameterization for the adjusted potential provides the universal approximation of Schrödinger bridges). Assume that $p_0$ and $p_1$ are compactly supported. Then for all $\delta > 0$ there exists a Gaussian mixture $v_\theta$ (9) with scalar covariances $S_k = \lambda_k I_D \succ 0$ of its components that satisfies the inequality

$$\mathrm{KL}\left(T^* \,\|\, T_\theta\right) = \mathrm{KL}\left(\pi^* \,\|\, \pi_\theta\right) < \delta.$$

Although this result looks concise and simple, its proof is quite challenging. The main cornerstone in proving the result is that the key object to be approximated, i.e., the adjusted potential $v^*$, is just a measurable function without any nice properties such as continuity. To overcome this issue, we employ non-trivial facts from the duality theory for weak OT (Gozlan et al., 2017). We also highlight the fact that our result provides an approximation of $T^*$ on a non-compact set. Indeed, while $p_0$ and $p_1$ are compactly supported, $T_\theta$'s marginals at all the time steps $t \in (0, 1)$ are always supported on the entire $\mathbb{R}^D$, which is not compact, recall, e.g., (14). This aspect adds additional value to our result, as usually approximation is studied on compact sets. To our knowledge, our Theorem 3.4 is the first ever result about the universal approximation of SBs.
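For completeness, both trajectory samplers from §3.2 take only a few lines of code. A minimal sketch, assuming a function `drift_fn(x, t)` that implements the closed-form drift of Proposition 3.3 (the function names below are ours):

```python
import torch

def euler_maruyama(x0, drift_fn, eps, n_steps=100):
    # Time-discretized simulation (13) of dX_t = g_theta(X_t, t) dt + sqrt(eps) dW_t.
    x, dt = x0.clone(), 1.0 / n_steps
    for s in range(n_steps):
        t = s * dt
        x = x + drift_fn(x, t) * dt + (eps * dt) ** 0.5 * torch.randn_like(x)
    return x  # approximate sample from the endpoint distribution at t = 1

def brownian_bridge_point(x_l, x_r, t_l, t_r, t, eps):
    # Refinement rule (14): sample X_t given X_{t_l} = x_l and X_{t_r} = x_r from the
    # Brownian bridge of the prior W^eps; no drift evaluations, no discretization error.
    mean = x_l + (t - t_l) / (t_r - t_l) * (x_r - x_l)
    var = eps * (t - t_l) * (t_r - t) / (t_r - t_l)
    return mean + var ** 0.5 * torch.randn_like(x_l)
```

In the second scheme, the endpoint $x_1$ is first drawn from $\pi_\theta(x_1|x_0)$ (Proposition 3.2), after which the bridge can be refined at any desired set of intermediate times.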
4 RELATED WORK

Over several recent years, there has been notable progress in developing neural SB/EOT solvers. For a review and a benchmark of them, we refer to (Gushchin et al., 2023b). The dominant majority of them have rather non-trivial training or inference procedures, which complicates their usage in practical problems. Below we summarize their main principles.

Table 1: Comparison of features of existing EOT/SB solvers and our proposed light solver. The compared properties are: allows sampling from the conditional plan, non-minimax objective, non-iterative objective, simulation-free training, recovers the drift $g^*(x, t)$, recovers the density of the plan, does not use simulation at inference, satisfies the universal approximation property, and works for reasonably small $\epsilon$. The compared solvers are (Seguy et al., 2018), (Daniels et al., 2021), (Mokrov et al., 2024), (Gushchin et al., 2023a), (Vargas et al., 2021), (De Bortoli et al., 2021), (Chen et al., 2021a), (Shi et al., 2023), (Kim et al., 2024), (Tong et al., 2023) and LightSB (ours).

Dual form solvers for EOT. The works (Genevay et al., 2016; Seguy et al., 2018; Daniels et al., 2021) aim to solve EOT (1), i.e., the static version of SB. The authors approach the classic dual EOT problem (Genevay, 2019, §3.1) with neural networks. They recover the conditional distributions $\pi^*(x_1|x_0)$ or only the barycentric projections $x_0 \mapsto \int_{\mathbb{R}^D} x_1 \pi^*(x_1|x_0)\, dx_1$ without learning the actual SB process $T^*$. Unfortunately, both solvers do not work for small $\epsilon$ due to numerical instabilities (Gushchin et al., 2023b, Table 2). At the same time, large $\epsilon$ is of limited practical interest as the EOT plan is nearly independent, i.e., $\pi^*(x_0, x_1) \approx p_0(x_0)p_1(x_1)$ and $\pi^*(x_1|x_0) \approx p_1(x_1)$. In this case, it becomes reasonable just to learn an unconditional generative model for $p_1$. This issue is addressed in the work (Mokrov et al., 2024) which we discussed in §3.1. There the authors consider the weak OT dual form (Backhoff-Veraguas et al., 2019, Theorem 1.3) for EOT and demonstrate that it can be approached with energy-based modeling techniques (LeCun et al., 2006, EBM). Their solver, as well as (Daniels et al., 2021), still heavily relies on time-consuming MCMC techniques. The above-mentioned solvers can also be adapted to sample trajectories from SB by using the Brownian Bridge just as we do in §3.2. However, unlike our light solver, these solvers do not provide access to the optimal drift $g^*$. Recently, (Gushchin et al., 2023a) demonstrated that one may reformulate the weak EOT dual problem so that one can also get the SB's drift $g^*$ from its solution. However, their solver requires dealing with a challenging max-min optimization problem and requires simulating the full trajectories of the learned process, which complicates training.

Iterative proportional fitting (IPF) solvers for SB. Most SB solvers (Vargas et al., 2021; De Bortoli et al., 2021; Chen et al., 2021a; 2023) directly aim to recover the optimal drift $g^*$, as it can be later used to simulate the SB trajectories as we discussed in §3.2. Such solvers are mostly based on the iterative proportional fitting procedure (Fortet, 1940; Kullback, 1968; Ruschendorf, 1995), which is also known as the Sinkhorn algorithm (Cuturi, 2013) and, in fact, coincides with the well-known expectation-maximization algorithm (Dempster et al., 1977, EM), see (Vargas & Nüsken, 2023, Proposition 4.1) for a discussion. That is, the above-mentioned solvers learn two SDEs (forward and inverse processes) and iteratively update them one after the other (IPF steps). The first two solvers do this via mean-matching regression while the others optimize a divergence-based objective, see (Chen et al., 2021a, §1), (Chen et al., 2023, §5).
All these solvers require performing multiple IPF steps. At each of the steps, they simulate the full trajectories of the learned process, which introduces considerable computational overhead since these processes are represented via large neural networks. In particular, due to the error accumulation as IPF proceeds, it is known that such solvers may forget the Wiener prior, see (Vargas & Nüsken, 2023, §4.1.2) and references therein.

Other solvers for EOT/SB. In (Shi et al., 2023), a new approach to SB based on alternating Markovian and Reciprocal projections is introduced. In (Tong et al., 2023), the authors exploit the property that the SB solution $T^*$ consists of the entropic OT plan $\pi^*$ and the Brownian bridge $W^\epsilon_{|x_0, x_1}$. They propose a new objective to learn $T^*$ in the form of an SDE when the EOT plan $\pi^*$ is known. Since $\pi^*$ is not actually known, the authors use minibatch OT to approximate it. In (Kim et al., 2024), the authors propose exploiting the self-similarity of the SB problem and consider the family of SB problems on intervals $\{[t_i, 1]\}_{i=1}^{N}$ to sequentially learn $T^*$ as a series of conditional distributions $x_{t_{i+1}}|x_{t_i}$. However, they add an empirical regularization which may bias the solution. Besides, there exist SB solvers for specific setups with paired train data available (Somnath et al., 2023; Liu et al., 2023).

Summary. We provide Table 1 with a summary of the features of the discussed EOT/SB solvers. Additionally, in Appendix F, we mention other OT solvers which are related but not closely relevant to our paper because they consider non-EOT/SB formulations or non-continuous settings.

5 EXPERIMENTAL ILLUSTRATIONS

Below we evaluate our light solver on setups with both synthetic (§5.1, §5.2) and real data distributions (§5.3, §5.4). The code for our solver is written in PyTorch and available at https://github.com/ngushchin/LightSB. The experiments are provided in the form of convenient *.ipynb notebooks. Reproducing each experiment requires a few minutes on a CPU with 4 cores. The implementation and experimental details are given in Appendix D.

5.1 TWO-DIMENSIONAL EXAMPLES

To show the effect of $\epsilon$ on the learned process $T_\theta$, we give a toy example of mapping a 2D Gaussian to a Swiss roll with our light solver for $\epsilon = 2\cdot 10^{-3}, 10^{-2}, 10^{-1}$, see Fig. 2. As expected, for small $\epsilon$ the trajectories are almost straight and the process $T_\theta$ is nearly deterministic. The volatility of the trajectories increases with $\epsilon$, and the conditional distributions $\pi_\theta(x_1|x_0)$ become more dispersed.

Figure 2: The process $T_\theta$ learned with LightSB (ours) in the Gaussian → Swiss roll example; panels: (a) input and target samples $x \sim p_0$, $y \sim p_1$, (b) $\epsilon = 2\cdot 10^{-3}$, (c) $\epsilon = 0.01$, (d) $\epsilon = 0.1$.
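For illustration, a toy run of this kind can be set up on top of the LightSBSketch class from §3.2 (the rescaling and hyperparameters below are our own arbitrary choices for the sketch, not necessarily those used to produce Fig. 2):

```python
import torch
from sklearn.datasets import make_swiss_roll

# 2D Swiss roll target (two coordinates of the classic 3D roll) and a Gaussian source.
x1 = make_swiss_roll(n_samples=4096, noise=0.8)[0][:, [0, 2]] / 7.5
x1 = torch.tensor(x1, dtype=torch.float32)
x0 = torch.randn(4096, 2)

model = LightSBSketch(dim=2, n_components=100, eps=0.1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    i0, i1 = torch.randint(0, 4096, (128,)), torch.randint(0, 4096, (128,))
    loss = model.loss(x0[i0], x1[i1])       # empirical objective (10) on a minibatch
    opt.zero_grad(); loss.backward(); opt.step()

x1_pred = model.sample_x1_given_x0(x0[:512])  # endpoints of the learned bridge
```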
5.2 EVALUATION ON THE EOT/SB BENCHMARK

To empirically verify that our light solver correctly recovers the EOT/SB, we evaluate it on the recent EOT/SB benchmark by (Gushchin et al., 2023b, §5). The authors provide high-dimensional continuous distributions $(p_0, p_1)$ for which the ground-truth conditional EOT plan $\pi^*(\cdot|x_0)$ and SB process $T^*$ are known by construction. Moreover, they use these pairs to evaluate many solvers from §4. We use their mixtures benchmark pairs (see their §4) with various dimensions and $\epsilon$, and use the same conditional $\mathrm{BW}_2^2$-UVP metric (see their §5) to compare our recovered plan $\pi_\theta$ with the ground-truth plan $\pi^*$. In Table 2, we report the results of our solver vs. the best solver in each setup according to their evaluation. As clearly seen, our solver outperforms the best solver by a considerable margin. This is reasonable, as the benchmark distributions are constructed using principles similar to those our solver exploits, namely, the sum-exp (Gaussian mixture) parameterization of the Schrödinger potential. Therefore, our light solver has a considerable inductive bias for solving the benchmark.

Table 2: Comparison of $\mathrm{cBW}_2^2$-UVP (%) between the optimal plan $\pi^*$ and the learned plan $\pi_\theta$.

| $\epsilon$ | $D$ | Best solver | LightSB (± std) |
|---|---|---|---|
| 0.1 | 2 | 1.94 | 0.03 ± 0.01 |
| 0.1 | 16 | 13.67 | 0.08 ± 0.04 |
| 0.1 | 64 | 11.74 | 0.28 ± 0.02 |
| 0.1 | 128 | 11.4 | 0.60 ± 0.02 |
| 1 | 2 | 1.04 | 0.05 ± 0.003 |
| 1 | 16 | 9.08 | 0.09 ± 0.006 |
| 1 | 64 | 18.05 | 0.24 ± 0.007 |
| 1 | 128 | 15.23 | 0.62 ± 0.007 |
| 10 | 2 | 1.40 | 0.07 ± 0.02 |
| 10 | 16 | 1.27 | 0.11 ± 0.01 |
| 10 | 64 | 2.36 | 0.21 ± 0.01 |
| 10 | 128 | 1.31 | 0.37 ± 0.01 |

5.3 SINGLE CELL DATA

One of the important applications of SB is the analysis of biological single-cell data (Koshizuka & Sato, 2022; Bunne et al., 2021; 2022). In Appendix C, we evaluate our algorithm on the popular embryonic stem cell differentiation dataset which has been used in many previous works (Tong et al., 2020; Vargas et al., 2021; Bunne et al., 2023; Tong et al., 2023); here we consider the more high-dimensional dataset from the Kaggle competition Open Problems - Multimodal Single-Cell Integration (MSCI), which was first used in (Tong et al., 2023). The MSCI dataset consists of single-cell data from four human donors at 4 time points (days 2, 3, 4, and 7). We solve the SB/EOT problem between the distribution pairs at days 2 and 4 and at days 3 and 7, and evaluate how well the solvers recover the intermediate distributions at days 3 and 4, correspondingly. We work with PCA projections with DIM = 50, 100, 1000 components. We use the energy distance (Rizzo & Székely, 2016, ED) as the metric and present the results for different classes of SB solvers in Table 3 along with the training time. We see that LightSB achieves similar quality to the other EOT/SB solvers but faster and without a GPU. The details of the used preprocessing, hyperparameters and baselines are in Appendix D.4.

Table 3: Energy distance (averaged over the two setups and 5 random seeds) on the MSCI dataset along with the 95%-confidence interval (± intervals) and average training times (s - seconds, m - minutes).

| Setup | Solver type | Solver | DIM=50 | DIM=100 | DIM=1000 |
|---|---|---|---|---|---|
| Discrete EOT | Sinkhorn | (Cuturi, 2013) [1 GPU V100] | 2.34 (90 s) | 2.24 (2.5 m) | 1.864 (9 m) |
| Continuous EOT | Langevin-based | (Mokrov et al., 2024) [1 GPU V100] | 2.39 ± 0.06 (19 m) | 2.32 ± 0.15 (19 m) | 1.46 ± 0.20 (15 m) |
| Continuous EOT | Minimax | (Gushchin et al., 2023a) [1 GPU V100] | 2.44 ± 0.13 (43 m) | 2.24 ± 0.13 (45 m) | 1.32 ± 0.06 (71 m) |
| Continuous EOT | IPF | (Vargas et al., 2021) [1 GPU V100] | 3.14 ± 0.27 (8 m) | 2.86 ± 0.26 (8 m) | 2.05 ± 0.19 (11 m) |
| Continuous EOT | KL minimization | LightSB (ours) [4 CPU cores] | 2.31 ± 0.27 (65 s) | 2.16 ± 0.26 (66 s) | 1.27 ± 0.19 (146 s) |
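For reference, the energy distance reported in Table 3 can be estimated from two samples using pairwise distances only. A minimal sketch (the simple biased V-statistic estimator; the exact estimator and preprocessing used for the table may differ):

```python
import torch

def energy_distance(x, y):
    # Squared energy distance (Rizzo & Szekely, 2016):
    # ED^2(p, q) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||,
    # estimated here by simple means over all pairs (diagonal zero terms included).
    d_xy = torch.cdist(x, y).mean()
    d_xx = torch.cdist(x, x).mean()
    d_yy = torch.cdist(y, y).mean()
    return 2 * d_xy - d_xx - d_yy
```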
5.4 UNPAIRED IMAGE-TO-IMAGE TRANSLATION

One application which is frequently considered in EOT/SB papers (Daniels et al., 2021; Chen et al., 2021b) is unpaired image-to-image translation (Zhu et al., 2017). Our solver may be hard to apply to learning SB directly in the image space. To be precise, it is not designed to be used in image spaces, just like the conventional Gaussian mixture model is not used for image synthesis. Still, we show that our solver might be handy for working in the latent spaces of generative models.

We consider the task of male → female translation. We pick the pre-trained ALAE autoencoder (Pidhorskyi et al., 2020) for the entire 1024×1024 FFHQ dataset (Karras et al., 2019) of 70K human faces. We split the first 60K faces (train) into male and female subsets and use the encoder to extract 512-dimensional latent codes $\{z_0^n\}_{n=1}^{N}$ and $\{z_1^m\}_{m=1}^{M}$ from the images in each subset.

Training. We learn the latent EOT plan $\pi_\theta(z_1|z_0)$ by using the above-mentioned unpaired samples from the latent distributions. The training process takes less than 1 minute on 4 CPU cores.

Inference. To perform male → female translation for a new male face $x_0^{new}$ (from the 10K test faces), we (1) encode it via $z_0^{new} = \mathrm{Enc}(x_0^{new})$, (2) sample $z_1 \sim \pi_\theta(z_1|z_0^{new})$ and then (3) decode $x_1 = \mathrm{Dec}(z_1)$ and return it. Note that here (unlike §5.3) the process $T_\theta$ is not needed; only $\pi_\theta$ is used.

Results. The qualitative results are given in Fig. 1 and Fig. 3a. Furthermore, in Fig. 3, we provide additional examples for other setups: female → male, adult → child and child → adult. For brevity, we show only one translated image per input image. In Appendix H, we give extra examples and study the effect of $\epsilon$. Our experiments qualitatively confirm that our LightSB can solve distribution translation tasks in high dimensions (D = 512), and it can be used to easily convert autoencoders into translation models.

Figure 3: Unpaired translation by our LightSB solver applied in the latent space of ALAE for 1024×1024 FFHQ images; panels: (a) male → female, (b) female → male, (c) adult → child, (d) child → adult. Our LightSB converges on 4 CPU cores in less than 1 minute.

6 DISCUSSION

Potential impact. Compared to the existing EOT/SB solvers, our light solver provides many advantages (Table 1). It is one-step (no IPF steps), does not require max-min optimization, does not require simulation of the process during training, and provides the closed form of the learned drift $g_\theta$ of the process $T_\theta \approx T^*$ and of the conditional distributions $\pi_\theta(x_1|x_0) \approx \pi^*(x_1|x_0)$ of the plan. Moreover, our solver is provably a universal approximator of SBs. Still, the key benefit of our light solver is its simplicity and ease of use: it has a straightforward optimization objective and does not use heavyweight neural parameterizations. These facts help our light solver converge in a matter of minutes without spending a lot of the user's/researcher's time on setting up dozens of hyperparameters, e.g., neural network architectures, and waiting hours for the solver to converge on GPUs. We believe that these properties could help our solver become the standard easy-to-use baseline EOT/SB solver with potential applications to data analysis tasks. Limitations and broader impact are discussed in Appendix G.

7 REPRODUCIBILITY

The code for our solver is available at https://github.com/ngushchin/LightSB.

1. To reproduce the experiments from §5.1, it is enough to train the LightSB model by running the notebook notebooks/LightSB_swiss_roll.ipynb with the hyperparameters described in §D.2 and then run the notebook notebooks/swiss_roll_plot.ipynb to plot Fig. 2.
2. To reproduce the experiments from §5.2, it is needed to install the Entropic OT benchmark from https://github.com/ngushchin/EntropicOTBenchmark and then run the notebook LightSB_EOT_benchmark.ipynb with the hyperparameters described in §D.3 to reproduce the metrics reported in Table 2.
3. To reproduce the experiments from Appendix C, it is needed to install the library from https://github.com/KrishnaswamyLab/TrajectoryNet and then run the notebook notebooks/LightSB_single_cell.ipynb. All required data is already preprocessed and located in the data folder.
4.
To reproduce experiments from M5.3 it is needed to download data from https: //www.kaggle.com/competitions/open-problems-multimodal/ and then to run notebook data/data preprocessing.ipynb to preprocess data. The experiments with Light SB solver can be reproduced by running the notebook notebooks/Light SB MSCI.ipynb. The experiments with Sinkhorn solver can be reproduced by running the notebook notebooks/Sinkhorn MSCI.ipynb 5. The code for ALAE is already included in our code and to reproduce experiments from M5.4 it is first necessary to load the ALAE model by running the script ALAE/training artifacts/download all.py. We have already coded the FFHQ dataset from ALAE and these data can be downloaded directly using notebook notebooks/Light SB alae.ipynb. To train the Light SB model it is necessary to run the notebook notebooks/Light SB alae.ipynb. The same notebook also contains code for plotting the results of trained models. 8 ACKNOWLEDGEMENTS The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). Brandon Amos. On amortizing convex conjugates for optimal transport. In The Eleventh International Conference on Learning Representations, 2022. Arip Asadulaev, Alexander Korotin, Vage Egiazarian, Petr Mokrov, and Evgeny Burnaev. Neural optimal transport with general cost functionals. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=g Iiz7t Bt YZ. Julio Backhoff-Veraguas, Mathias Beiglb ock, and Gudmun Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):1 28, 2019. Charlotte Bunne, Stefan G Stark, Gabriele Gut, Jacobo Sarabia del Castillo, Kjong-Van Lehmann, Lucas Pelkmans, Andreas Krause, and Gunnar R atsch. Learning single-cell perturbation responses using neural optimal transport. bio Rxiv, pp. 2021 12, 2021. Charlotte Bunne, Laetitia Papaxanthos, Andreas Krause, and Marco Cuturi. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511 6528. PMLR, 2022. Published as a conference paper at ICLR 2024 Charlotte Bunne, Ya-Ping Hsieh, Marco Cuturi, and Andreas Krause. The schr odinger bridge between gaussian measures has a closed form. In International Conference on Artificial Intelligence and Statistics, pp. 5802 5833. PMLR, 2023. Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schr odinger bridge using forward-backward sdes theory. In International Conference on Learning Representations, 2021a. Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Optimal steering of a linear stochastic system to a final probability distribution, part i. IEEE Transactions on Automatic Control, 61(5): 1158 1169, 2015. Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. On the relation between optimal transport and schr odinger bridges: A stochastic control viewpoint. Journal of Optimization Theory and Applications, 169:671 691, 2016. Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrodinger bridge. SIAM Review, 63(2):249 313, 2021b. Yu Chen, Wei Deng, Shikai Fang, Fengpei Li, Tianjiao Nicole Yang, Yikai Zhang, Kashif Rasul, Shandian Zhe, Anderson Schneider, and Yuriy Nevmyvaka. Provably convergent schr odinger bridge with applications to probabilistic time series imputation. 
2023. Jaemoo Choi, Jaewoong Choi, and Myungjoo Kang. Generative modeling through the semi-dual formulation of unbalanced optimal transport. ar Xiv preprint ar Xiv:2305.14777, 2023. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Max Daniels, Tyler Maunu, and Paul Hand. Score-based generative neural networks for large-scale optimal transport. Advances in neural information processing systems, 34:12955 12965, 2021. Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schr odinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. Nabarun Deb, Promit Ghosal, and Bodhisattva Sen. Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections. Advances in Neural Information Processing Systems, 34:29736 29753, 2021. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1): 1 22, 1977. Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by sinkhorn s algorithm. In International conference on machine learning, pp. 1367 1376. PMLR, 2018. Jiaojiao Fan, Shu Liu, Shaojun Ma, Hao-Min Zhou, and Yongxin Chen. Neural monge map estimation and its applications. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=2m ZSl Qscj3. Featured Certification. Kilian Fatras, Younes Zine, R emi Flamary, Remi Gribonval, and Nicolas Courty. Learning with minibatch wasserstein: asymptotic and gradient properties. In International Conference on Artificial Intelligence and Statistics, pp. 2131 2141. PMLR, 2020. R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, L eo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78): 1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html. Published as a conference paper at ICLR 2024 Robert Fortet. R esolution d un syst eme d equations de m. schr odinger. Journal de Math ematiques Pures et Appliqu ees, 19(1-4):83 105, 1940. Milena Gazdieva, Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Extremal domain translation with neural optimal transport. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=v ZRi Mjo826. Aude Genevay. Entropy-regularized optimal transport for machine learning. Ph D thesis, Paris Sciences et Lettres (Com UE), 2019. Aude Genevay, Marco Cuturi, Gabriel Peyr e, and Francis Bach. Stochastic optimization for largescale optimal transport. In Advances in neural information processing systems, pp. 3440 3448, 2016. Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327 3405, 2017. Nikita Gushchin, Alexander Kolesov, Alexander Korotin, Dmitry Vetrov, and Evgeny Burnaev. Entropic neural optimal transport via diffusion processes. 
In Advances in Neural Information Processing Systems, 2023a. Nikita Gushchin, Alexander Kolesov, Petr Mokrov, Polina Karpikova, Andrey Spiridonov, Evgeny Burnaev, and Alexander Korotin. Building the bridge of schr\ odinger: A continuous entropic optimal transport benchmark. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b. Pierre Henry-Labordere. (martingale) optimal transport and anomaly detection with neural networks: A primal-dual algorithm. Available at SSRN 3370910, 2019. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Jan-Christian H utter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps. 2021. Hicham Janati, Boris Muzellec, Gabriel Peyr e, and Marco Cuturi. Entropic optimal transport between unbalanced gaussian measures has a closed form. Advances in neural information processing systems, 33:10468 10479, 2020. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401 4410, 2019. Beomsu Kim, Gihyun Kwon, Kwanyoung Kim, and Jong Chul Ye. Unpaired image-to-image translation via neural schr odinger bridge. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=u QBW7ELXf O. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=b Eoxz W_EXsa. Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. Advances in Neural Information Processing Systems, 34:14593 14605, 2021b. Alexander Korotin, Vage Egiazarian, Lingxiao Li, and Evgeny Burnaev. Wasserstein iterative networks for barycenter estimation. In Thirty-Sixth Conference on Neural Information Processing Systems, 2022a. URL https://openreview.net/forum?id=Gi Enzx Tna MN. Published as a conference paper at ICLR 2024 Alexander Korotin, Alexander Kolesov, and Evgeny Burnaev. Kantorovich strikes back! wasserstein GANs are not optimal transport? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/ forum?id=Vt EEpi-d Glt. Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Kernel neural optimal transport. In International Conference on Learning Representations, 2023a. URL https://openreview. net/forum?id=Zuc_MHt Uma4. Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. In International Conference on Learning Representations, 2023b. URL https://openreview. net/forum?id=d8CBRl WNkq H. Takeshi Koshizuka and Issei Sato. Neural lagrangian schr\ {o} dinger bridge: Diffusion modeling for population dynamics. In The Eleventh International Conference on Learning Representations, 2022. Solomon Kullback. Probability densities with given marginals. The Annals of Mathematical Statistics, 39(4):1236 1243, 1968. Fabian Latorre, Leello Tadesse Dadi, Paul Rolland, and Volkan Cevher. 
The effect of the intrinsic dimension on the generalization of quadratic classifiers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=_h Kvtsq Itc. Yann Le Cun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006. Christian L eonard. A survey of the schr\ odinger problem and some of its connections with optimal transport. ar Xiv preprint ar Xiv:1308.0215, 2013. Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A Theodorou, Weili Nie, and Anima Anandkumar. I2sb: Image-to-image schr\ odinger bridge. ar Xiv preprint ar Xiv:2302.05872, 2023. Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022. Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672 6681. PMLR, 2020. Anton Mallasto, Augusto Gerolin, and H a Quang Minh. Entropy-regularized 2-wasserstein distance between gaussian measures. Information Geometry, 5(1):289 323, 2022. Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps. ar Xiv preprint ar Xiv:2107.12364, 2021. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018. Petr Mokrov, Alexander Korotin, Alexander Kolesov, Nikita Gushchin, and Evgeny Burnaev. Energy-guided entropic neural optimal transport. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id= d6t Us Ze Vs7. Quang Minh Nguyen, Hoang H Nguyen, Yi Zhou, and Lam M Nguyen. On unbalanced optimal transport: Gradient methods, sparsity and approximation error. ar Xiv preprint ar Xiv:2202.03618, 2022. T Tin Nguyen, Hien D Nguyen, Faicel Chamroukhi, and Geoffrey J Mc Lachlan. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7(1):1750861, 2020. Published as a conference paper at ICLR 2024 Matteo Pariset, Ya-Ping Hsieh, Charlotte Bunne, Andreas Krause, and Valentin De Bortoli. Unbalanced diffusion schr\ odinger bridge. ar Xiv preprint ar Xiv:2306.09099, 2023. Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008. Gabriel Peyr e, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14104 14113, 2020. Mark A. Pinsky and Samuel Karlin. 8 - brownian motion and related processes. In Mark A. Pinsky and Samuel Karlin (eds.), An Introduction to Stochastic Modeling (Fourth Edition), pp. 391 446. Academic Press, Boston, fourth edition edition, 2011. ISBN 978-0-12-381416-6. doi: https://doi. org/10.1016/B978-0-12-381416-6.00008-3. URL https://www.sciencedirect.com/ science/article/pii/B9780123814166000083. Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps. ar Xiv preprint ar Xiv:2109.12004, 2021. Danilo Rezende and Shakir Mohamed. 
Variational inference with normalizing flows. In International conference on machine learning, pp. 1530 1538. PMLR, 2015. Philippe Rigollet and Austin J Stromme. On the sample complexity of entropic optimal transport. ar Xiv preprint ar Xiv:2206.13472, 2022. Maria L Rizzo and G abor J Sz ekely. Energy distance. wiley interdisciplinary reviews: Computational statistics, 8(1):27 38, 2016. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684 10695, 2022. Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2021. Ludger Ruschendorf. Convergence of the iterative proportional fitting procedure. The Annals of Statistics, pp. 1160 1174, 1995. Erwin Schr odinger. Uber die umkehrung der naturgesetze. Verlag der Akademie der Wissenschaften in Kommission bei Walter De Gruyter u ..., 1931. Erwin Schr odinger. Sur la th eorie relativiste de l electron et l interpr etation de la m ecanique quantique. In Annales de l institut Henri Poincar e, volume 2, pp. 269 310, 1932. Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion schr odinger bridge matching. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=qy07OHs JT5. Vignesh Ram Somnath, Matteo Pariset, Ya-Ping Hsieh, Maria Rodriguez Martinez, Andreas Krause, and Charlotte Bunne. Aligned diffusion schr\ odinger bridges. ar Xiv preprint ar Xiv:2302.11419, 2023. Yang Song and Diederik P Kingma. How to train your energy-based models. ar Xiv preprint ar Xiv:2101.03288, 2021. Published as a conference paper at ICLR 2024 Austin Stromme. Sampling from a schr odinger bridge. In International Conference on Artificial Intelligence and Statistics, pp. 4058 4067. PMLR, 2023. Amirhossein Taghvaei and Amin Jalali. 2-Wasserstein approximation via restricted convex potentials with application to improved training for GANs. ar Xiv preprint ar Xiv:1902.07197, 2019. Alexander Tong, Jessie Huang, Guy Wolf, David Van Dijk, and Smita Krishnaswamy. Trajectorynet: A dynamic optimal transport network for modeling cellular dynamics. In International conference on machine learning, pp. 9526 9536. PMLR, 2020. Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free schr\ odinger bridges via score and flow matching. ar Xiv preprint ar Xiv:2307.03672, 2023. Francisco Vargas and Nikolas N usken. Transport, variational inference and diffusions: with applications to annealed flows and schr\ odinger bridges. ar Xiv preprint ar Xiv:2307.01050, 2023. Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving schr odinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021. Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via schr odinger bridge. In International Conference on Machine Learning, pp. 10794 10804. PMLR, 2021. 
Yiling Xie, Yiling Luo, and Xiaoming Huo. An accelerated stochastic algorithm for solving the optimal transport problem. ar Xiv preprint ar Xiv:2203.00813, 2022. Karren D Yang and Caroline Uhler. Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations, 2018. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223 2232, 2017. Published as a conference paper at ICLR 2024 A GENERALIZATION PROPERTIES OF OUR LIGHT SOLVER In theory, to recover the optimal plan π one can solve L(θ) minθ which is equivalent to the direct minimization of KL (π πθ) w.r.t. θ (Proposition 3.1). According to (8), L(θ) consists of the difference of integrals of log cθ(x0) and log vθ(x1) over the distributions p0 and p1, respectively. In practice, there are several sources of errors which do not allow to perfectly optimize the objective. 1. Statistical (estimation) error. Since distributions p0, p1 are accessible only via empirical samples X0 = {x1 0, . . . , x N 0 } p0 and X1 = {x1 1, . . . , x N 1 } p1, one is forced to optimize the empirical counterpart b L(θ) of L(θ). In this objective, the integrals over p0, p1 are replaced with their estimates using samples X0, X1, recall (10). For given samples X0, X1, we denote bθ = bθ(X0, X1) = arg min θ b L(θ). (15) Usually, b L(θ) is called the empirical risk and bθ is the empirical risk minimizer. 2. Approximation error. The parametric class for vθ over which one optimizes the objective is restricted. For example, we consider (unnormalized) Gaussian mixtures vθ with K components (parametrized with θ = {αk, rk, Sk}K k=1). This may lead to irreducible error in approximation of the OT plan π with πθ due to parametric restrictions. In our setup, the quantity L(θ ) L = min θ L(θ) L (16) is the approximation error. Here θ = arg minθ L(θ) is the best approximator (in a given class). 3. Optimization error. In practice, we solve b L(θ) minθ with the gradient descent. The optimization w.r.t. is non-convex and there are no general guarantees of convergence to the global empirical risk minimizer bθ. This may introduce an additional optimization error. Analysing this quantity is a too general question of the non-convex optimization and goes far beyond the scope of our paper. Therefore, for further analysis we assume this error to be zero. Given the gap between the theoretical objective L(θ) and its empirical counterpart b L(θ), it is natural to wonder how close is the recovered πbθ to π . We aim to obtain the bound for the expected KL between π and πbθ (or, equivalently, T and Tθ), i.e., EKL π πbθ = EKL T Tbθ , where the expectation is taken w.r.t. the random realization of the train data X0, X1. This quantity is the natural definition of the generalization error in our setting. Note that EKL T Tbθ Prop. 3.3 = EKL π πbθ Prop. 3.1 = E L(bθ) L = E L(bθ) L(θ ) + E L(θ ) L = E L(bθ) L(θ ) | {z } Statistical error | {z } Approximation error (16) Thanks to our Theorem 3.4, we already known that the second term (the approximation error) can be made arbitrarily small if we pick a Gaussian mixture with sufficiently large amount of components. Hence, our analysis below focuses on bounding the statistical error and understanding the rate of its convergence to zero as a function of available sample sizes N, M. 
Our following theorem demonstrates that the statistical error decreases at the usual parametric rate.

Theorem A.1 (Bound for the statistical error). Assume that p0, p1 are compactly supported. Assume that the considered parametric class Θ (∋ θ) consists of (unnormalized) Gaussian mixtures with K components with bounded means ‖rk‖ ≤ R (for some R > 0), covariances sI ⪯ Sk ⪯ SI (for some 0 < s ≤ S) and weights a ≤ αk ≤ A (for some 0 < a ≤ A). Then the following holds:
$$\mathbb{E}\big[L(\hat\theta) - L(\theta^*)\big] \le \frac{C_0}{\sqrt{N}} + \frac{C_1}{\sqrt{M}},$$
where the constants C0, C1 depend only on K, R, s, S, a, A, p0, p1, ϵ but not on the sample sizes M, N.

The proof is given in the next Appendix section. In the future, it would be interesting to study the trade-off between the statistical and approximation errors rather than studying them separately as we do in our paper. However, providing such an analysis would probably require stronger assumptions (e.g., smoothness) on the distributions p0, p1 and the true optimal adjusted Schrödinger potential v*. We leave this interesting question for future work.

B.1 PROOFS FOR THE RESULTS IN THE MAIN TEXT

Proof of Proposition 3.1. In the derivations below, we use H(·) to denote the entropy, i.e., minus the KL divergence with the Lebesgue measure. We obtain
$$\text{KL}(\pi^*\|\pi_\theta) = \int\!\!\int \pi^*(x_0,x_1)\log\frac{\pi^*(x_0,x_1)}{\pi_\theta(x_0,x_1)}\,dx_0\,dx_1 = -H(\pi^*) - \int\!\!\int \pi^*(x_0,x_1)\log\big[\underbrace{p_0(x_0)\,\pi_\theta(x_1|x_0)}_{=\pi_\theta(x_0,x_1)}\big]\,dx_0\,dx_1.$$
Since the first marginal of π* is p0, we have $-\int\!\!\int \pi^*(x_0,x_1)\log p_0(x_0)\,dx_0\,dx_1 = -\int p_0(x_0)\log p_0(x_0)\,dx_0 = H(p_0)$. Substituting $\pi_\theta(x_1|x_0) = \exp(\langle x_0,x_1\rangle/\epsilon)\,v_\theta(x_1)/c_\theta(x_0)$ and using that the second marginal of π* is p1, we get
$$\text{KL}(\pi^*\|\pi_\theta) = -H(\pi^*) + H(p_0) - \epsilon^{-1}\int\!\!\int \langle x_0,x_1\rangle\,\pi^*(x_0,x_1)\,dx_0\,dx_1 + \underbrace{\int p_0(x_0)\log c_\theta(x_0)\,dx_0 - \int p_1(x_1)\log v_\theta(x_1)\,dx_1}_{=L(\theta)} = L(\theta) - L^*, \qquad (19)$$
which is exactly what we need.

Proof of Proposition 3.2. We use equation (7) for πθ(x0, x1) and equation (9) for vθ(x1) to derive
$$\pi_\theta(x_1|x_0) = \frac{\exp(\langle x_0,x_1\rangle/\epsilon)\,v_\theta(x_1)}{c_\theta(x_0)} = \frac{1}{c_\theta(x_0)}\exp(\langle x_0,x_1\rangle/\epsilon)\sum_{k=1}^K \alpha_k\,\mathcal{N}(x_1|r_k,\epsilon S_k) = \frac{1}{c_\theta(x_0)}\sum_{k=1}^K \alpha_k (2\pi)^{-D/2}|\epsilon S_k|^{-1/2}\exp\Big(\frac{1}{2\epsilon}\big[2x_0^T x_1 - (x_1-r_k)^T S_k^{-1}(x_1-r_k)\big]\Big).$$
Completing the square in x1 and denoting $r_k(x_0) := S_k x_0 + r_k$, we have
$$2x_0^T x_1 - (x_1-r_k)^T S_k^{-1}(x_1-r_k) = -(x_1 - r_k(x_0))^T S_k^{-1}(x_1-r_k(x_0)) + r_k(x_0)^T S_k^{-1} r_k(x_0) - r_k^T S_k^{-1} r_k,$$
and, since $r_k(x_0)^T S_k^{-1} r_k(x_0) - r_k^T S_k^{-1} r_k = x_0^T S_k x_0 + 2 r_k^T x_0$, it follows that
$$\pi_\theta(x_1|x_0) = \frac{1}{c_\theta(x_0)}\sum_{k=1}^K \underbrace{\alpha_k \exp\Big(\frac{x_0^T S_k x_0 + 2 r_k^T x_0}{2\epsilon}\Big)}_{=:\widetilde{\alpha}_k(x_0)}\,\mathcal{N}\big(x_1\,|\,r_k(x_0),\,\epsilon S_k\big).$$
Since $\int \pi_\theta(x_1|x_0)\,dx_1 = 1$, we see that $c_\theta(x_0) = \sum_{k=1}^K \widetilde{\alpha}_k(x_0)$, which concludes the proof.

Proof of Proposition 3.3. Define $p_\theta(x_1) = \int \pi_\theta(x_0,x_1)\,dx_0$ as the density of the second marginal of πθ. From the OT benchmark constructor (Gushchin et al., 2023b, Theorem 3.2), it follows that the constructed πθ is the unique EOT plan between p0 and pθ: just set $f^*(x_1) := \|x_1\|^2/2 + \epsilon\log v_\theta(x_1)$ in the mentioned theorem. Thus Tθ is the Schrödinger bridge between p0 and pθ by construction.
Then the fact that Tθ is given by the SDE (11) follows from the direct integration of (4) using $\phi_\theta(x_1) := \exp\big(\tfrac{\|x_1\|^2}{2\epsilon}\big)v_\theta(x_1)$ as the Schrödinger potential:
$$g_\theta(x,t) = \epsilon\nabla_x \log \int \mathcal{N}\big(x'|x,(1-t)\epsilon I_D\big)\,\phi_\theta(x')\,dx' = \epsilon\nabla_x \log \int \mathcal{N}\big(x'|x,(1-t)\epsilon I_D\big)\exp\Big(\frac{\|x'\|^2}{2\epsilon}\Big)\sum_{k=1}^K\alpha_k\,\mathcal{N}(x'|r_k,\epsilon S_k)\,dx'.$$
For each component k, the exponent of the integrand is a quadratic form in x'; collecting the terms, it equals
$$-\frac{\|x\|^2}{2\epsilon(1-t)} - \frac{r_k^T S_k^{-1} r_k}{2\epsilon} - \frac{1}{2}(x')^T \underbrace{\Big(\frac{t}{\epsilon(1-t)}I_D + \frac{1}{\epsilon}S_k^{-1}\Big)}_{=:A_k^t}\,x' + \underbrace{\Big(\frac{1}{\epsilon(1-t)}x + \frac{1}{\epsilon}S_k^{-1}r_k\Big)^T}_{=:h_k(x,t)^T}x'. \qquad (20)$$
Applying the Gaussian integral formula to each component and dropping the multiplicative factors which do not depend on x (they vanish under $\nabla_x\log$), we obtain
$$g_\theta(x,t) = \epsilon\nabla_x\log\Big[\exp\Big(-\frac{\|x\|^2}{2\epsilon(1-t)}\Big)\sum_{k=1}^K \alpha_k\,|\epsilon S_k|^{-1/2}|A_k^t|^{-1/2}\exp\Big(-\frac{r_k^T S_k^{-1} r_k}{2\epsilon} + \frac{1}{2}h_k(x,t)^T (A_k^t)^{-1} h_k(x,t)\Big)\Big]. \qquad (21)$$
In the transition from (20) to (21) we use the integral formula from (Petersen et al., 2008, §8.1.1). In (22), the original derivation additionally regroups the factors into normal densities (involving N(x|0, ϵ(1−t)I_D), N(rk|0, ϵSk) and A_k^t) by multiplying the expression under $\nabla_x\log$ by suitable powers of 2π, which does not change the expression. Finally, with the measure disintegration theorem (Vargas et al., 2021, Appendix C), we obtain
$$\text{KL}(T^*\|T_\theta) = \text{KL}(\pi^*\|\pi_\theta) + \int\!\!\int \text{KL}\big(T^*_{|x_0,x_1}\,\|\,T_{\theta|x_0,x_1}\big)\,\pi^*(x_0,x_1)\,dx_0\,dx_1 = \text{KL}(\pi^*\|\pi_\theta),$$
where the inner KL term cancels out, as it coincides with $\text{KL}\big(W^\epsilon_{|x_0,x_1}\,\|\,W^\epsilon_{|x_0,x_1}\big) = 0$.

Proof of Theorem 3.4. It is intuitively clear that if we are able to approximate v* arbitrarily well (in some sense) by vθ, then we also achieve a small KL(π*‖πθ), since πθ explicitly depends on vθ. The challenge is that v* is just a measurable function without any a priori known properties, e.g., continuity. Hence, approximating it with a continuous mixture in some reasonable norm, e.g., the uniform norm ‖·‖∞, may even be impossible. This emphasizes the difficulty of deriving the desired universal approximation result and points to the necessity of using more delicate strategies. Recall that for every δ > 0 we need to find an unnormalized Gaussian mixture vθ = vθ(δ) such that KL(π*‖πθ) < δ. To begin with, pick any such δ > 0 and fix it until the end of the proof.

Stage 1. This stage employs certain known facts from EOT duality. Let us use
$$\text{Cost}(\pi^*) := \int\!\!\int \tfrac{1}{2}\|x_0-x_1\|^2\,d\pi^*(x_0,x_1) + \epsilon\,\text{KL}(\pi^*\|p_0\times p_1) \qquad (23)$$
to denote the optimal value of (1). We start by considering the equivalent reformulation of (1):
$$\min_{\pi\in\Pi(p_0,p_1)} \int\!\!\int \tfrac{1}{2}\|x_0-x_1\|^2\,\pi(x_0,x_1)\,dx_0\,dx_1 + \epsilon\,\text{KL}(\pi\|p_0\times p_1) = \epsilon H(p_1) + \underbrace{\min_{\pi\in\Pi(p_0,p_1)} \int\Big[\int \tfrac{1}{2}\|x_0-x_1\|^2\,\pi(x_1|x_0)\,dx_1 - \epsilon H\big(\pi(\cdot|x_0)\big)\Big]p_0(x_0)\,dx_0}_{=:J}, \qquad (24)$$
where π(·|x0) denotes the conditional distribution of x1 given x0. Term J in (24) is known as the weak representation of EOT, see (Gushchin et al., 2023b, Eq. (3) and (5)) for an extra discussion, and admits a dual form (Backhoff-Veraguas et al., 2019, Eq. (1.3)):
$$J = \sup_{f\in\mathcal{C}_{2,b}(\mathbb{R}^D)} J(f) := \sup_{f\in\mathcal{C}_{2,b}(\mathbb{R}^D)}\Big[\int f^C(x_0)\,p_0(x_0)\,dx_0 + \int f(x_1)\,p_1(x_1)\,dx_1\Big], \qquad (25)$$
where $\mathcal{C}_{2,b}(\mathbb{R}^D) := \{f:\mathbb{R}^D\to\mathbb{R} \text{ continuous s.t. } \exists\,\alpha,\beta,\gamma\in\mathbb{R}:\ \alpha\|\cdot\|^2+\beta \le f(\cdot)\le\gamma\}$ and $f^C$ is the so-called weak (entropic) C-transform of f, defined by
$$f^C(x_0) := \inf_{q\in\mathcal{P}_2(\mathbb{R}^D)}\Big\{\int \tfrac12\|x_0-x_1\|^2\,q(x_1)\,dx_1 - \epsilon H(q) - \int f(x_1)\,q(x_1)\,dx_1\Big\}. \qquad (26)$$
Here q ∈ P2(RD) ranges over all probability distributions with a finite second moment. We slightly abuse the notation by writing q(x1) although q is not necessarily absolutely continuous.
However, if q does not have a density, then $-\epsilon H(q) = +\infty$, which is a bad option for the minimization problem (26). Therefore, one may consider $q\in\mathcal{P}_{2,ac}(\mathbb{R}^D)\subset\mathcal{P}_2(\mathbb{R}^D)$ in (26). The advantage of EOT compared with many other OT formulations is that the minimizer of (26) is available explicitly:
$$q^f_{x_0}(x_1) := \frac{1}{Z_f(x_0)}\exp\Big(\frac{f(x_1)-\tfrac12\|x_0-x_1\|^2}{\epsilon}\Big),$$
see (Mokrov et al., 2024, Proof of Theorem 1). The mentioned paper considers compact subsets of RD, but the derivation is generic and works in our non-compact case as well. Here
$$Z_f(x_0) := \int \exp\Big(\frac{f(x_1)-\tfrac12\|x_0-x_1\|^2}{\epsilon}\Big)\,dx_1$$
is the normalizing constant. It is finite thanks to the upper boundedness of f, which follows from $f\in\mathcal{C}_{2,b}(\mathbb{R}^D)$. For the same reason, it is not hard to check that $q^f_{x_0}$ has a finite second moment. If we further follow the mentioned work and plug $q^f_{x_0}$ into (26), we get $f^C(x_0) = -\epsilon\log Z_f(x_0)$. From (25) and the definition of the supremum, it follows that for every δ' > 0 there exists a function $\hat f\in\mathcal{C}_{2,b}(\mathbb{R}^D)$ for which the following inequality holds:
$$J(\hat f) = -\epsilon\int \log Z_{\hat f}(x_0)\,p_0(x_0)\,dx_0 + \int \hat f(x_1)\,p_1(x_1)\,dx_1 > \underbrace{\text{Cost}(\pi^*) - \epsilon H(p_1)}_{=J} - \delta'.$$
For our needs, we pick δ' := δϵ/2 and a suitable $\hat f$ for it and move on to the next stage.

Stage 2. Let $\hat\gamma\in\mathbb{R}$ be an upper bound for $\hat f$, i.e., $\hat f(x_1)\le\hat\gamma$ for all $x_1\in\mathbb{R}^D$; it exists thanks to $\hat f\in\mathcal{C}_{2,b}(\mathbb{R}^D)$. Recall that p1 is compactly supported by the assumption of the theorem. Let R > 0 be a radius such that the zero-centered ball of radius R contains the support of p1. We define
$$\tilde f(x_1) := \hat f(x_1) - \max\{0, \|x_1\|^2 - R^2\} \le \hat f(x_1) \le \hat\gamma.$$
We see that
$$\tilde f \le \hat f \;\Rightarrow\; Z_{\tilde f} \le Z_{\hat f} \;\Rightarrow\; \tilde f^C \ge \hat f^C \;\Rightarrow\; \int \tilde f^C(x_0)\,p_0(x_0)\,dx_0 \ge \int \hat f^C(x_0)\,p_0(x_0)\,dx_0. \qquad (29)$$
By the construction of $\tilde f$, it holds that $\tilde f(x_1) = \hat f(x_1)$ when x1 is in the support of p1. Thus,
$$\int \tilde f(x_1)\,p_1(x_1)\,dx_1 = \int \hat f(x_1)\,p_1(x_1)\,dx_1. \qquad (30)$$
We combine (29) with (30) and see that $J(\tilde f) \ge J(\hat f) > J - \delta' = \text{Cost}(\pi^*) - \epsilon H(p_1) - \delta'$.

We note that p0 is compactly supported, and $Z_{\tilde f}$ is continuous (w.r.t. x0) and positive. Therefore, there exists a constant $z_{\min} > 0$ such that $Z_{\tilde f}(x_0)\ge z_{\min}$ when x0 belongs to the support of p0. Analogously, since p1 is compactly supported, we may find a constant $e_{\min} > 0$ such that $\tfrac12\exp\big(\tilde f(x_1)/\epsilon\big)\ge e_{\min}$ for all x1 in the support of p1. We fix the constants $z_{\min}, e_{\min}$ for the next steps. Now we derive
$$\exp\big(\tilde f(x_1)/\epsilon\big) \le \exp\Big(\frac{\hat\gamma - \max\{0,\|x_1\|^2 - R^2\}}{\epsilon}\Big) \le \exp\Big(\frac{\hat\gamma + R^2}{\epsilon}\Big)\exp\big(-\|x_1\|^2/\epsilon\big).$$
This means that $x_1\mapsto\exp\big(\tilde f(x_1)/\epsilon\big)$ is a normalizable density ($\int\exp\big(\tilde f(x_1)/\epsilon\big)\,dx_1 < \infty$) because it is bounded by an unnormalized Gaussian density. Additionally, we see that $\exp\big(\tilde f(x_1)/\epsilon\big)$ vanishes at infinity. Thus, for every δ'' > 0 there exists an unnormalized² Gaussian mixture $\widetilde v_\theta = \widetilde v_\theta(\delta'')$ (Nguyen et al., 2020, Theorem 5a) which is δ''-close to $\exp(\tilde f/\epsilon)$ in the uniform norm on RD:
$$\|\widetilde v_\theta - \exp(\tilde f/\epsilon)\|_\infty = \sup_{x_1\in\mathbb{R}^D}\big|\widetilde v_\theta(x_1) - \exp\big(\tilde f(x_1)/\epsilon\big)\big| < \delta''. \qquad (31)$$
²The result of (Nguyen et al., 2020) considers the approximation of normalized mixtures. This detail is not important, and the result straightforwardly extends to unnormalized mixtures: one can first normalize the mixture, approximate it, and then re-scale both the target and the approximator back to the original scale.

From the statement of the mentioned theorem it also follows that one may pick all the covariances in the mixture $\widetilde v_\theta$ to be scalar, i.e., $\widetilde v_\theta(x_1) = \sum_{k=1}^K\beta_k\,\mathcal{N}(x_1|\mu_k,\epsilon\sigma_k^2 I)$ for some K and $\mu_k\in\mathbb{R}^D$, $\sigma_k\in\mathbb{R}_+$ (k ∈ {1, ..., K}). Indeed, just recall the definition of $\mathcal{M}^g_m$ in (Nguyen et al., 2020) and put g to be a standard D-dimensional normal distribution.
For further derivations, we pick
$$\delta'' := \min\Big\{\frac{\delta}{2}\Big(\frac{1}{e_{\min}} + \frac{(2\pi\epsilon)^{D/2}}{z_{\min}}\Big)^{-1},\; e_{\min}\Big\}, \qquad (32)$$
and its respective mixture $\widetilde v_\theta$ with scalar component covariances. We define $v_\theta(x_1) := \widetilde v_\theta(x_1)\exp\big(-\tfrac{\|x_1\|^2}{2\epsilon}\big)$. It is again an unnormalized Gaussian mixture because it is a product of two unnormalized Gaussian mixtures. Besides, its components also have scalar covariances, because the multiplier $\exp\big(-\tfrac{\|x_1\|^2}{2\epsilon}\big)$ is itself proportional to a Gaussian density with scalar covariance. More precisely, using the product formulas from (Petersen et al., 2008, §8.1.8), we have
$$v_\theta(x_1) = \sum_{k=1}^K\beta_k\,\mathcal{N}(x_1|\mu_k,\epsilon\sigma_k^2 I)\exp\Big(-\frac{\|x_1\|^2}{2\epsilon}\Big) = \sum_{k=1}^K\beta_k(2\pi\epsilon)^{D/2}\,\mathcal{N}(x_1|\mu_k,\epsilon\sigma_k^2 I)\,\mathcal{N}(x_1|0,\epsilon I) = \sum_{k=1}^K\underbrace{(2\pi\epsilon)^{D/2}\beta_k\,\mathcal{N}\big(0\,|\,\mu_k,\epsilon(1+\sigma_k^2)I\big)}_{=:\alpha_k}\,\mathcal{N}\Big(x_1\,\Big|\,\underbrace{\frac{\mu_k}{1+\sigma_k^2}}_{=:r_k},\ \epsilon\underbrace{\frac{\sigma_k^2}{\sigma_k^2+1}}_{=:\lambda_k}I\Big) = \sum_{k=1}^K\alpha_k\,\mathcal{N}(x_1|r_k,\epsilon\lambda_k I). \qquad (33)$$
We derive
$$Z_{\tilde f}(x_0) = \int \exp\big(\tilde f(x_1)/\epsilon\big)\exp\Big(-\frac{\|x_0-x_1\|^2}{2\epsilon}\Big)\,dx_1 \ge \int \big(\widetilde v_\theta(x_1)-\delta''\big)\exp\Big(-\frac{\|x_0-x_1\|^2}{2\epsilon}\Big)\,dx_1 = \int \widetilde v_\theta(x_1)\exp\Big(-\frac{\|x_0-x_1\|^2}{2\epsilon}\Big)\,dx_1 - \delta''\underbrace{\int\exp\Big(-\frac{\|x_0-x_1\|^2}{2\epsilon}\Big)\,dx_1}_{=(2\pi\epsilon)^{D/2}} = \exp\Big(-\frac{\|x_0\|^2}{2\epsilon}\Big)\underbrace{\int v_\theta(x_1)\exp\big(\langle x_0,x_1\rangle/\epsilon\big)\,dx_1}_{=c_\theta(x_0)} - \delta''(2\pi\epsilon)^{D/2},$$
or, equivalently,
$$Z_{\tilde f}(x_0) + \delta''(2\pi\epsilon)^{D/2} \ge \exp\Big(-\frac{\|x_0\|^2}{2\epsilon}\Big)c_\theta(x_0). \qquad (34)$$
Since $z\mapsto\log z$ is a $\tfrac{1}{z_{\min}}$-Lipschitz function on $[z_{\min},+\infty)$ and $Z_{\tilde f}(x_0)\ge z_{\min}$ for all x0 in the support of p0, we may write
$$\log\big(Z_{\tilde f}(x_0) + \delta''(2\pi\epsilon)^{D/2}\big) - \log Z_{\tilde f}(x_0) \le \frac{\delta''(2\pi\epsilon)^{D/2}}{z_{\min}}.$$
We use this inequality with (34) to derive
$$\log Z_{\tilde f}(x_0) + \frac{\delta''(2\pi\epsilon)^{D/2}}{z_{\min}} \ge \log\big(Z_{\tilde f}(x_0) + \delta''(2\pi\epsilon)^{D/2}\big) \ge -\frac{\|x_0\|^2}{2\epsilon} + \log c_\theta(x_0)$$
for all x0 in the support of p0. We integrate this expression over $x_0\sim p_0$ and, after regrouping the terms, get
$$\int\frac{\|x_0\|^2}{2\epsilon}p_0(x_0)\,dx_0 - \int\log c_\theta(x_0)\,p_0(x_0)\,dx_0 + \frac{\delta''(2\pi\epsilon)^{D/2}}{z_{\min}} \ge -\int\log Z_{\tilde f}(x_0)\,p_0(x_0)\,dx_0. \qquad (35)$$
Now we study another expression. Recalling that $z\mapsto\log z$ is $\tfrac{1}{e_{\min}}$-Lipschitz for $z\ge e_{\min}$, we get
$$\log\big(\exp(\tilde f(x_1)/\epsilon) - \delta''\big) \ge \log\exp\big(\tilde f(x_1)/\epsilon\big) - \frac{\delta''}{e_{\min}} = \frac{\tilde f(x_1)}{\epsilon} - \frac{\delta''}{e_{\min}} \qquad (36)$$
for all x1 in the support of p1. Here we also use the fact that $\exp(\tilde f(x_1)/\epsilon) - \delta'' \ge \exp(\tilde f(x_1)/\epsilon) - e_{\min} \ge 2e_{\min} - e_{\min} = e_{\min}$ by the choice of δ''; recall the definition of $e_{\min}$ and see (32). We recall (31) to get
$$\log \widetilde v_\theta(x_1) \overset{(31)}{>} \log\big(\exp(\tilde f(x_1)/\epsilon) - \delta''\big) \overset{(36)}{\ge} \frac{\tilde f(x_1)}{\epsilon} - \frac{\delta''}{e_{\min}}.$$
We exploit this observation to derive
$$\int \frac{\|x_1\|^2}{2\epsilon}p_1(x_1)\,dx_1 + \int\log v_\theta(x_1)\,p_1(x_1)\,dx_1 = \int\log\widetilde v_\theta(x_1)\,p_1(x_1)\,dx_1 > \int\frac{\tilde f(x_1)}{\epsilon}p_1(x_1)\,dx_1 - \frac{\delta''}{e_{\min}}. \qquad (37)$$
We sum (37) with (35) and get
$$\underbrace{\int\log v_\theta(x_1)\,p_1(x_1)\,dx_1 - \int\log c_\theta(x_0)\,p_0(x_0)\,dx_0}_{=-L(\theta)} > \underbrace{-\int\log Z_{\tilde f}(x_0)\,p_0(x_0)\,dx_0 + \int\frac{\tilde f(x_1)}{\epsilon}p_1(x_1)\,dx_1}_{=\epsilon^{-1}J(\tilde f)} - \frac{\delta''}{e_{\min}} - \frac{\delta''(2\pi\epsilon)^{D/2}}{z_{\min}} - \int\frac{\|x_0\|^2}{2\epsilon}p_0(x_0)\,dx_0 - \int\frac{\|x_1\|^2}{2\epsilon}p_1(x_1)\,dx_1$$
$$> \epsilon^{-1}\big[\text{Cost}(\pi^*) - \epsilon H(p_1) - \delta'\big] - \delta''\Big(\frac{1}{e_{\min}} + \frac{(2\pi\epsilon)^{D/2}}{z_{\min}}\Big) - \int\frac{\|x_0\|^2}{2\epsilon}p_0(x_0)\,dx_0 - \int\frac{\|x_1\|^2}{2\epsilon}p_1(x_1)\,dx_1 = \underbrace{\epsilon^{-1}\big[\text{Cost}(\pi^*) - \epsilon H(p_1)\big] - \int\frac{\|x_0\|^2}{2\epsilon}p_0(x_0)\,dx_0 - \int\frac{\|x_1\|^2}{2\epsilon}p_1(x_1)\,dx_1}_{=-L^*\text{ in (18)}} - \frac{\delta'}{\epsilon} - \delta''\Big(\frac{1}{e_{\min}} + \frac{(2\pi\epsilon)^{D/2}}{z_{\min}}\Big), \qquad (38)$$
where L* matches the constant defined in (18). Indeed,
$$L^* = H(\pi^*) - H(p_0) + \epsilon^{-1}\int\!\!\int\langle x_0,x_1\rangle\,\pi^*(x_0,x_1)\,dx_0\,dx_1 = H(p_1) - \text{KL}(\pi^*\|p_0\times p_1) + \epsilon^{-1}\int\!\!\int\langle x_0,x_1\rangle\,\pi^*(x_0,x_1)\,dx_0\,dx_1,$$
and, since $\int\!\!\int\langle x_0,x_1\rangle\,d\pi^* = \tfrac12\int\|x_0\|^2p_0(x_0)\,dx_0 + \tfrac12\int\|x_1\|^2p_1(x_1)\,dx_1 - \text{Cost}(\pi^*) + \epsilon\,\text{KL}(\pi^*\|p_0\times p_1)$ by (23), we obtain
$$L^* = -\Big(\epsilon^{-1}\big[\text{Cost}(\pi^*)-\epsilon H(p_1)\big] - \int\frac{\|x_0\|^2}{2\epsilon}p_0(x_0)\,dx_0 - \int\frac{\|x_1\|^2}{2\epsilon}p_1(x_1)\,dx_1\Big).$$
At the same time, from (19) and (38) we derive that
$$\text{KL}(\pi^*\|\pi_\theta) = L(\theta) - L^* < \frac{\delta'}{\epsilon} + \delta''\Big(\frac{1}{e_{\min}} + \frac{(2\pi\epsilon)^{D/2}}{z_{\min}}\Big) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$
Thus, the Gaussian mixture vθ is the one that we seek. This finishes the proof.

B.2 PROOF OF THEOREM A.1 IN APPENDIX A

The proof follows from combining the two auxiliary facts below.

Proposition B.1 (Rademacher bound on the statistical error).
It holds that
$$\mathbb{E}\big[L(\hat\theta) - L(\theta^*)\big] \le 4\mathcal{R}_N(\mathcal{V}_0, p_0) + 4\mathcal{R}_M(\mathcal{V}_1, p_1),$$
where $\mathcal{V}_0 = \{\log c_\theta \mid \theta\in\Theta\}$, $\mathcal{V}_1 = \{\log v_\theta \mid \theta\in\Theta\}$ and $\mathcal{R}_N(\mathcal{V}, p)$ denotes the well-celebrated Rademacher complexity (Shalev-Shwartz & Ben-David, 2014, §26) of the functional class $\mathcal{V}$ w.r.t. the sample size N of the distribution p.

Proof of Proposition B.1. This result can be derived in exactly the same way as (Mokrov et al., 2024, Theorem 4) or (Taghvaei & Jalali, 2019, Theorem 3.4).

Proposition B.2 (Rademacher complexity bound for constrained log-sum-exp quadratic functions). Let 0 < a ≤ A, 0 < u ≤ U, 0 < w ≤ W and V > 0. Consider the class of functions
$$\mathcal{V} = \Big\{x\mapsto \log\sum_{k=1}^K \alpha_k\exp\big(x^T U_k x + v_k^T x + w_k\big)\ \Big|\ uI \preceq U_k = U_k^T \preceq UI;\ \|v_k\|\le V;\ w\le w_k\le W;\ a\le\alpha_k\le A\Big\}. \qquad (39)$$
We say that such a class is a class of constrained log-sum-exp quadratic functions. Assume that p is compactly supported and its support lies in a zero-centered ball of radius P > 0. Then
$$\mathcal{R}_N(\mathcal{V}, p) \le \frac{C}{\sqrt{N}},$$
where the constant C depends only on K, u, U, a, A, V, w, W, P but not on the sample size N.

Proof of Proposition B.2. The Rademacher complexity of constrained linear functions $x\mapsto v_k^T x + w_k$ is well known and is bounded by $O(1/\sqrt{N})$, see (Shalev-Shwartz & Ben-David, 2014, §26.2). The complexity of the constrained quadratic functions $x\mapsto x^T U_k x$ is also $O(1/\sqrt{N})$, which follows from their representation using Reproducing Kernel Hilbert Spaces (RKHS), see (Latorre et al., 2021, Lemma 5 & Eq. 24) and additionally (Mohri et al., 2018, Theorem 6.12). Hence, by the well-known additivity of the Rademacher complexity, the complexity of the constrained forms $x\mapsto x^T U_k x + v_k^T x + w_k$ is also bounded by $O(1/\sqrt{N})$. Since $x, U_k, v_k, w_k$ are bounded, the function $x\mapsto\exp(x^T U_k x + v_k^T x + w_k)$ is Lipschitz in x with a shared Lipschitz constant for all admissible $U_k, v_k, w_k$. Therefore, the Rademacher complexity of such functions is also $O(1/\sqrt{N})$; recall Talagrand's contraction principle (Mohri et al., 2018, Lemma 5.7). The same applies to $x\mapsto\alpha_k\exp(x^T U_k x + v_k^T x + w_k) = \exp(x^T U_k x + v_k^T x + [w_k + \log\alpha_k])$, as these are also constrained exp-quadratic forms but with slightly adjusted constraints on the bias parameter $w_k$. Using the additivity of the Rademacher complexity again, we see that K-sums $\sum_{k=1}^K\alpha_k\exp(x^T U_k x + v_k^T x + w_k)$ of such functions also have complexity bounded by $O(1/\sqrt{N})$. The remaining step is to note that each such function is both lower and upper bounded (by some positive numbers depending on the constraints); hence, taking the logarithm is also a Lipschitz operation with some finite Lipschitz constant. Thus, the complexity of $x\mapsto\log\sum_{k=1}^K\alpha_k\exp(x^T U_k x + v_k^T x + w_k)$ is also $O(1/\sqrt{N})$; the constant hidden in O(·) encapsulates the dependence on K, u, U, a, A, V, w, W, P.

Finally, we can prove the bound on the estimation error in Theorem A.1.

Proof of Theorem A.1. Just note that both $\mathcal{V}_0$ and $\mathcal{V}_1$ are constrained classes of log-sum-exp quadratic functions in the sense of Proposition B.2 and apply Proposition B.1. For $\mathcal{V}_1$ this directly follows from the assumptions of the Theorem. For $\mathcal{V}_0$ it follows from the fact that cθ is also a log-sum-exp quadratic function with constrained parameters (our Proposition 3.2).

C EMBRYONIC STEM CELL DIFFERENTIATION SINGLE-CELL DATA

For the embryonic stem cell differentiation single-cell setup, we use the code and data from https://github.com/KrishnaswamyLab/TrajectoryNet to work with the embryonic stem cell differentiation dataset and to evaluate our light solver.
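For reference, the intermediate-marginal evaluation described in this appendix boils down to a Wasserstein-1 distance between two point clouds. Below is a minimal sketch of such a computation with the POT library; this is our own illustration under the assumption of uniform empirical weights, the actual evaluation code comes from the TrajectoryNet repository.

```python
# Hypothetical sketch of the W1 evaluation between a predicted point cloud and the
# held-out ground-truth cells, using the POT library (not the TrajectoryNet code).
import numpy as np
import ot

def wasserstein1(x_pred: np.ndarray, x_true: np.ndarray) -> float:
    # Uniform weights over the two empirical distributions.
    a = ot.unif(x_pred.shape[0])
    b = ot.unif(x_true.shape[0])
    # Euclidean ground cost yields the Wasserstein-1 distance.
    M = ot.dist(x_pred, x_true, metric="euclidean")
    return ot.emd2(a, b, M)
```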
Solver            W1 metric
OT-CFM            0.79 ± 0.068
[SF]2M-Exact      0.793 ± 0.066
Light SB (ours)   0.823 ± 0.017
Reg. CNF          0.825
T. Net            0.848
DSB               0.862 ± 0.023
I-CFM             0.872 ± 0.087
[SF]2M-Geo        0.879 ± 0.148
[SF]2M-Sink       1.198 ± 0.342
SB-CFM            1.221 ± 0.380
DSBM              1.775 ± 0.429
Table 4: The quality of intermediate distribution restoration of single-cell data by different methods.

The provided data contains cells collected at five different intervals (t0: days 0 to 3, t1: days 6 to 9, t2: days 12 to 15, t3: days 18 to 21, t4: days 24 to 27). The collected cells were analysed by scRNA-seq, subjected to quality-control filtering, and then represented as feature vectors using Principal Component Analysis (PCA). Following the above-mentioned works, we consider the problem of transporting the cell distribution at time t_{i-1} to time t_{i+1} for i ∈ {1, 2, 3}. We then predict the cell distribution at the intermediate time t_i and compute the Wasserstein-1 (W1) distance between the predicted and the ground-truth distribution. We average over all 3 setups i ∈ {1, 2, 3} and present our results in Table 4. To estimate the standard deviation, we run 5 experiments with different seeds for each i ∈ {1, 2, 3}. For Light SB we use K = 100, lr = 10^-2, ϵ = 0.1, batch size 128 and do 2·10^3 gradient steps. We take the results for the other methods from (Tong et al., 2023), whose authors were the first to consider this setup in (Tong et al., 2020). Our solver (ϵ = 0.1) performs at the level of the best other methods, while converging in only 1 minute on 4 CPU cores and learning only 1000 parameters.

D DETAILS OF THE EXPERIMENTS

D.1 GENERAL IMPLEMENTATION DETAILS

To minimize (10), we parameterize αk, rk and Sk of vθ (9) and use the Adam optimizer (Kingma & Ba, 2014). We parameterize αk via its logarithm log αk; we parameterize rk directly as a vector; we parameterize the matrix Sk in diagonal form with the values log(Sk)_{i,i} on its diagonal. Initialization. We initialize log αk with log(1/K), rk with random samples from p1, and log(Sk)_{i,i} with log 0.1 (the latter can be tuned as a hyperparameter, but even without any tuning this initialization works well in every considered experimental setup).

D.2 DETAILS OF TOY EXPERIMENTS

We use K = 500 in all cases. For ϵ = 10^-1 and ϵ = 10^-2, we use lr = 10^-3, and for ϵ = 2·10^-3 we use lr = 10 and batch size 128. We do 10^4 gradient steps.

D.3 DETAILS OF EVALUATION ON THE BENCHMARK

We use K = 50 in all cases. For ϵ = 10^-1, we use lr = 10^-3. For ϵ = 2·10^-3, we use Adam with lr = 10 and batch size 128. We do 10^4 gradient steps. In Table 5, we additionally present the results for the non-conditional $\mathrm{BW}_2^2$-UVP metric. Light SB beats the other solvers or performs at the same level in all setups.

              ϵ = 0.1                          ϵ = 1                            ϵ = 10
              D=2     D=16    D=64    D=128    D=2     D=16    D=64    D=128    D=2     D=16    D=64    D=128
Best solver   0.016   0.05    0.25    0.22     0.005   0.09    0.56    0.12     0.01    0.02    0.15    0.23
Light SB      0.005   0.017   0.037   0.069    0.004   0.01    0.03    0.07     0.03    0.04    0.17    0.30
(std)         0.002   0.007   0.007   0.008    0.002   0.004   0.006   0.007    0.01    0.01    0.01    0.02
Table 5: Comparison of $\mathrm{BW}_2^2$-UVP (%) between the target P1 and the learned right marginal of πθ.

D.4 DETAILS OF MULTIMODAL SINGLE-CELL INTEGRATION EXPERIMENTS

We use data from the Kaggle competition "Open Problems - Multimodal Single-Cell Integration": https://www.kaggle.com/competitions/open-problems-multimodal The data describes the gene expression of cells at days 2, 3, 4 and 7, which contain 6071, 7643, 8485 and 7195 data points, respectively. Analogously to Tong et al.
(2023), we use only CITEseq expression data; to remove the donor-dependent bias we select only one donor with ID 13176. To preprocess the data, we use PCA projections with 50, 100 and 1000 components. Then, for each case, we consider 2 setups: data from day 2 and day 4 as a distribution pair for the SB problem with data from day 3 for evaluation, and data from day 3 and day 7 as a distribution pair for the SB problem with data from day 4 for evaluation. In each setup, we normalize the data by scaling it by the sum of the per-feature variances over the concatenated data from the start, end, and evaluation days (after the PCA projection). For the evaluation, we take the prediction for the evaluation day for all cells from the start day and compute the energy distance to the ground-truth distribution. For all described setups we use ϵ = 0.1 in our solver and in the baseline solvers. For our Light SB solver we use K = 10, lr = 10^-2 and batch size 128. We do 10^4 gradient steps.

Baselines. We compare Light SB with SB/EOT algorithms from three different classes: maximin (Gushchin et al., 2023a), Langevin-based (Mokrov et al., 2024) and IPF-based (Vargas et al., 2021). For completeness, we also add the popular discrete EOT Sinkhorn solver (Cuturi, 2013).

1. Maximin solver. For the (Gushchin et al., 2023a) solver we use the official code from https://github.com/ngushchin/EntropicNeuralOptimalTransport We use the same hyperparameters for this setup as the authors (Gushchin et al., 2023a, Appendix E) use in their high-dimensional Gaussian setup. The only exception is the number of discretization steps N, which we set to 100, as for the SB solver of (Vargas et al., 2021) below.

2. IPF-based solver. For the (Vargas et al., 2021) solver we use the official code from https://github.com/franciscovargas/GP_Sinkhorn Instead of the Gaussian processes which the authors use, we use the same neural network as in (Gushchin et al., 2023b) to get better scalability. We use N = 100 discretization steps, 50 IPF iterations, 10 epochs per IPF iteration and 128 samples from the distributions p0 and p1 in each of them. We use the Adam optimizer with lr = 10^-4 for optimization.

3. Langevin-based solver. For the (Mokrov et al., 2024) solver, we use the official code from https://github.com/PetrMokrov/Energy-guided-Entropic-OT We take advantage of the authors' setup from their 2D Gaussian-to-Swissroll experiment. Following our experimental framework, we adapt the original code by increasing the dimensionality of the learned fully-connected NN potentials. The chosen hidden dimensionalities for the potentials are [256, 256, 256] for D = 50, 100 and [2048, 1024, 512] for D = 1000. We choose all hyperparameters of the method in concordance with their code for the ε = 0.1 EOT regularization coefficient, except lr. We pick lr = 5·10^-4 for training stability reasons. The numbers of training iterations are N = 8K for D = 50, 100 and N = 4K for D = 1000. We obtain predictions for the intermediate distributions by using the Brownian Bridge (analogously to (14)).

4. Discrete solver.³ To run the Sinkhorn algorithm (Cuturi, 2013), we use the ot.sinkhorn procedure with the parameter method="sinkhorn_log" and stopping threshold 1e-8 from the Python OT package (Flamary et al., 2021).

³Discrete OT can neither be straightforwardly used to infer trajectories for new (out-of-train-sample) input cells, nor does it provide trajectories for existing cells. In our setup, the latter issue can be overcome by inserting the Brownian Bridge (analogously to (14)) on top of pairs of samples from the recovered discrete EOT plan (Stromme, 2023). Sampling from this bridge, one may construct an approximation of the distribution at the intermediate time of interest.
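For illustration, a minimal sketch of this discrete baseline call is given below. This is our own illustration: the halved squared-Euclidean cost matrix and the uniform weights are assumptions, while the solver options follow the description above.

```python
# Hypothetical sketch of the discrete Sinkhorn baseline with POT; the cost scaling
# is our assumption, the solver options follow the description in the text.
import numpy as np
import ot

def sinkhorn_plan(x0: np.ndarray, x1: np.ndarray, eps: float = 0.1) -> np.ndarray:
    a = ot.unif(x0.shape[0])
    b = ot.unif(x1.shape[0])
    # Quadratic EOT cost 1/2 * ||x0 - x1||^2.
    M = 0.5 * ot.dist(x0, x1, metric="sqeuclidean")
    # Log-domain Sinkhorn with the relaxed stopping threshold.
    return ot.sinkhorn(a, b, M, reg=eps, method="sinkhorn_log", stopThr=1e-8)
```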
Note that the default threshold parameter is 10^-9, but we found that the algorithm gets stuck at a tolerance around 10^-8; hence, we increased the threshold. Remark. We use the full-dataset (a.k.a. full-batch) discrete OT. We found that for ϵ = 0.1 (which we use for all the solvers) it converges very slowly, even requiring more time to converge than our light solver in dimension 1000. This is explainable: the convergence of the Sinkhorn algorithm notably degrades when ϵ → 0; empirically, this degradation seems to be worse than in our solver. We show the convergence plots of the Sinkhorn algorithm vs. our light solver in Figure 4.

Figure 4: Convergence speed comparison on the MSCI dataset, starting day 3, ending day 7 and evaluation day 4.

D.5 DETAILS OF IMAGE DATA EXPERIMENTS

We use the official ALAE code and model from https://github.com/podgorskiy/ALAE and the neural-network-extracted attributes for the FFHQ dataset from https://github.com/DCGM/ffhq-features-dataset We use K = 10, lr = 10^-3 and batch size 128. We do 10^4 gradient steps.

E COMPLETE PARAMETERIZATION OF THE EOT PLAN

As we pointed out in §3.1, our solver obtains an approximate density πθ(x1|x0) of the conditional distributions π*(x1|x0) but does not recover the density of the entire plan πθ(x0, x1) ≈ π*(x0, x1). However, in some practical applications it may be necessary to recover this density. Fortunately, this requires only a minor modification of our light solver. Indeed, consider the parameterization
$$\pi_{\omega,\theta}(x_0,x_1) = p_\omega(x_0)\,\pi_\theta(x_1|x_0) = p_\omega(x_0)\,\frac{\exp(\langle x_0,x_1\rangle/\epsilon)\,v_\theta(x_1)}{c_\theta(x_0)}, \qquad (40)$$
which is analogous to (7), but where we additionally introduce a density model pω for the left marginal of the plan. By repeating the derivations of the proof of Proposition 3.1, it is not hard to see that
$$\text{KL}(\pi^*\|\pi_{\omega,\theta}) = \text{KL}(p_0\|p_\omega) + L(\theta) - L^*, \qquad (41)$$
which means that learning the parameters ω turns out to be a separate density estimation problem. It can be solved, e.g., with the well-celebrated expectation-maximization algorithm (Dempster et al., 1977) for Gaussian mixture models or with a normalizing flow (Rezende & Mohamed, 2015).

F RELATED WORK: OTHER OT SOLVERS

For completeness of the exposition, we mention OT solvers which are not directly relevant to our EOT/SB setup because they consider non-entropic OT or discrete OT settings.

Discrete OT solvers. There exist many OT solvers (Cuturi, 2013; Dvurechensky et al., 2018; Nguyen et al., 2022; Xie et al., 2022) for the discrete OT setup (Peyré et al., 2019). This setup requires finding a discrete matching between given train samples but does not require generalization to test data. Hence, discrete solvers are not directly relevant to our continuous EOT/SB setup (§2). It is worth mentioning that there are several works studying the statistical properties of OT and developing out-of-sample estimators based on the discrete/batched OT solutions (Hütter & Rigollet, 2021; Pooladian & Niles-Weed, 2021; Manole et al., 2021; Deb et al., 2021; Rigollet & Stromme, 2022; Fatras et al., 2020). Despite having good theoretical properties, they mostly estimate the barycentric projection but not the entire plan π*. Other continuous OT solvers.
Above in this work, we discussed EOT/SB solvers, but there also exist many papers proposing neural solvers for other OT formulations: unregularized OT (Henry-Labordere, 2019; Makkuva et al., 2020; Korotin et al., 2021a;b; 2022b;a; Fan et al., 2023; Liu et al., 2022; Gazdieva et al., 2023; Rout et al., 2021; Amos, 2022), weak OT (Korotin et al., 2023b;a), unbalanced OT (Choi et al., 2023; Yang & Uhler, 2018), and general OT (Asadulaev et al., 2024). These works are of limited relevance as they do not solve EOT/SB.

G LIMITATIONS AND BROADER IMPACT

Limitations. We summarize and list some limitations of our light solver.

1. Analogously to the well-celebrated Sinkhorn algorithm (Cuturi, 2013) for discrete OT, our solver may experience computational instabilities when used with very small regularization coefficients ϵ > 0. This is due to the necessity of computing the exponent of values which are proportional to ϵ^-1 when computing cθ or vθ in (8). However, as our experiments show (§5), our light solver works well for reasonably small ϵ.

2. Our main optimization objective (8) is not necessarily convex w.r.t. the parameters θ, i.e., gradient-based optimization methods may get stuck in local minima. While we mention this issue, we do not consider it serious: many existing machine learning algorithms have non-convex objectives (k-means, Gaussian mixture models, deep neural networks, etc.) and do not necessarily converge to the global optimum, yet still work well in downstream tasks. Note that the existing alternative continuous EOT/SB solvers (§4) also have non-convex objectives.

3. Our solver uses a kind of Gaussian mixture approximation (9) of EOT/SB, which may be too restrictive to apply our solver to large-scale generative modeling problems, unlike the complex neural EOT/SB solvers which we mention in §4. But this is very natural: the default Gaussian mixture model for density estimation is likewise not used for modeling complex real-data distributions, e.g., images. At the same time, such models still play an irreplaceable role in many smaller-scale problems due to their extreme simplicity and ease of use. Therefore, we hope that our solver will play an analogous role in the field of computational continuous EOT/SB.

4. Our work considers SB with the Wiener prior W^ϵ, which is equivalent to EOT with the quadratic cost c(x0, x1) = ½‖x0 − x1‖² and entropic regularization value ϵ. In this case, the Gaussian mixture parameterization (9) is useful as it provides closed-form expressions for the conditional distributions πθ(x1|x0) and the drift gθ (Proposition 3.2). This is due to the fact that the product of the unnormalized Gaussian mixture density vθ with the unnormalized normal density exp(−‖x0 − x1‖²/(2ϵ)) is again a Gaussian mixture. We do not know whether our light solver can be easily generalized to more general costs or priors. We leave this question open for future studies.

Broader Impact. Deep learning models are becoming more and more complex, and this may negatively affect the Earth's ecology, as the required computational clusters need increasing amounts of energy to work and water to cool them. In fact, deep neural networks (DNNs) are sometimes applied where they may be unnecessary, so they pointlessly burn resources. Our work investigates SB and its applications to moderate-dimensional data, where DNNs are widely used. Our proposed light solver demonstrates that SB can be well learned without DNNs, in a few minutes and even on CPU.
This helps to avoid the time-consuming GPU training of previous solvers and to decrease the negative impact on nature. The potential negative broader impact of our research is the same as that of most other ML research: the constant advances in the field of AI and the deployment of ML models in production pipelines may lead to the transformation or replacement of some jobs in industry.

H ADDITIONAL EXPERIMENTAL RESULTS

H.1 DEPENDENCE ON THE PARAMETER ϵ

In Figure 5, we show how the solution learned by Light SB depends on the parameter ϵ in the Adult-to-Child experiment. As expected, the diversity increases with the increase of ϵ.

H.2 ADDITIONAL IMAGE GENERATION RESULTS

In Figure 6, we show more samples from Light SB trained on every considered image setup.

(a) Light SB Adult-to-Child, ϵ = 0.1. Almost no diversity. (b) Light SB Adult-to-Child, ϵ = 0.5. Reasonable diversity. (c) Light SB Adult-to-Child, ϵ = 1.0. Moderate diversity. (d) Light SB Adult-to-Child, ϵ = 10.0. High diversity.
Figure 5: Light SB Adult-to-Child for different ϵ.

(a) Light SB Male-to-Female, ϵ = 0.1, more samples. (b) Light SB Female-to-Male, ϵ = 0.1, more samples. (c) Light SB Adult-to-Child, ϵ = 0.1, more samples. (d) Light SB Child-to-Adult, more samples.
Figure 6: Light SB, more samples for every considered setup.