# Diffusion Schrödinger Bridge Matching

Yuyang Shi (University of Oxford), Valentin De Bortoli (ENS Ulm), Andrew Campbell (University of Oxford), Arnaud Doucet (University of Oxford). Equal contribution.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Solving transport problems, i.e. finding a map transporting one given distribution to another, has numerous applications in machine learning. Novel mass transport methods motivated by generative modeling have recently been proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs), which implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map, which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schrödinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schrödinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems.

## 1 Introduction

Mass transport problems are ubiquitous in machine learning (Peyré and Cuturi, 2019). For discrete measures, the Optimal Transport (OT) map can be computed exactly but is computationally intensive. In a landmark paper, Cuturi (2013) showed that an entropy-regularized version of OT can be computed more efficiently using the Sinkhorn algorithm (Sinkhorn, 1967). This has enabled the use of OT techniques in a variety of applications ranging from biology (Bunne et al., 2022) to shape correspondence (Feydy et al., 2017). However, applications involving high-dimensional continuous distributions and/or large datasets remain challenging for these techniques.

One such data-rich application is generative modeling, a central transport problem in machine learning which requires designing a deterministic or stochastic mapping transporting a reference noise distribution to the data distribution. For example, Generative Adversarial Networks (Goodfellow et al., 2014) define a static, deterministic transport map, while Denoising Diffusion Models (DDMs) (Song et al., 2021b; Ho et al., 2020) build a dynamic, stochastic transport map by simulating a Stochastic Differential Equation (SDE) whose drift is learned using score matching (Hyvärinen, 2005; Vincent, 2011). The excellent performance of DDMs has motivated recent developments of Bridge Matching and Flow Matching models, which build dynamic transport maps using SDEs (Song et al., 2021a; Peluchetti, 2021; Liu, 2022; Albergo et al., 2023) or ODEs (Albergo and Vanden-Eijnden, 2023; Heitz et al., 2023; Lipman et al., 2023; Liu et al., 2023b). Compared to DDMs, Bridge and Flow Matching methods do not rely on a forward noising diffusion converging to the reference distribution in infinite time, and they are also more generally applicable as they can approximate transport maps between two general distributions based on their samples.
Nonetheless, these transport maps are not necessarily close to the OT map minimizing the Wasserstein-2 metric, which is appealing for its many attractive properties (Peyré and Cuturi, 2019; Villani, 2009). In contrast, the Schrödinger Bridge (SB) problem is a dynamic version of entropy-regularized OT (EOT) (Föllmer, 1988; Léonard, 2014b). The SB is the finite-time diffusion which admits as initial and terminal distributions the two distributions of interest and is the closest in Kullback–Leibler divergence to a reference diffusion. Numerous methods to approximate SBs numerically have been proposed, see e.g. (Bernton et al., 2019; Chen et al., 2016; Finlay et al., 2020; Caluya and Halder, 2021; Pavon et al., 2021), but these techniques tend to be restricted to low-dimensional settings. Recently, novel techniques using diffusion-based ideas have been proposed in (De Bortoli et al., 2021; Vargas et al., 2021; Chen et al., 2022), based on Iterative Proportional Fitting (IPF) (Fortet, 1940; Kullback, 1968; Rüschendorf and Thomsen, 1993), a continuous state-space extension of the Sinkhorn algorithm (Essid and Pavon, 2019). These approaches have been shown to scale better empirically, but numerical errors tend to accumulate over iterations (Fernandes et al., 2021).

In this paper, our contributions are three-fold. First, we introduce Iterative Markovian Fitting (IMF), a new procedure to compute SBs which alternates between projecting on the space of Markov processes and on the reciprocal class, i.e. the measures which have the same bridge as the reference measure of the SB (Léonard et al., 2014). We establish various theoretical results for IMF. Contrary to IPF, the IMF iterates always preserve the initial and terminal distributions. The differences between IPF and IMF are presented in Table 1. Second, we propose Diffusion Schrödinger Bridge Matching (DSBM), a novel algorithm approximating numerically the SB solution derived from IMF. DSBM requires at each iteration solving a simple regression problem in the spirit of Bridge and Flow Matching, and does not suffer from the time-discretization and forgetting issues of previous DSB techniques (De Bortoli et al., 2021; Vargas et al., 2021; Chen et al., 2022). Finally, we demonstrate the performance of DSBM on a variety of transport tasks. Code can be found at https://github.com/yuyang-shi/dsbm-pytorch.

Figure 1: Relationship between DSBM and existing methods (Bridge Matching, Denoising Diffusion, Flow Matching).

|     | Sets for alternating projections | Preserved properties |
|-----|----------------------------------|----------------------|
| IPF | $P_0 = \pi_0$; $P_T = \pi_T$ | $\mathcal{M}$, $\mathcal{R}(Q)$ |
| IMF | $\mathcal{M}$; $\mathcal{R}(Q)$ | $P_0 = \pi_0$, $P_T = \pi_T$ |

Table 1: Comparison between Iterative Markovian Fitting (IMF) and Iterative Proportional Fitting (IPF). The Schrödinger Bridge is the unique $P$ such that $P_0 = \pi_0$, $P_T = \pi_T$, $P \in \mathcal{M}$ and $P \in \mathcal{R}(Q)$ simultaneously, by Proposition 5. $\mathcal{M}$ is the space of (regular) Markov measures and $\mathcal{R}(Q)$ the space of measures in the reciprocal class of $Q$.

**Notations.** We denote by $\mathcal{P}(\mathcal{C})$ the space of path measures, i.e. $\mathcal{P}(\mathcal{C}) = \mathcal{P}(C([0,T], \mathbb{R}^d))$ where $T > 0$. The subset of Markov path measures associated with an SDE of the form $dX_t = v_t(X_t)\,dt + \sigma_t\,dB_t$, with $\sigma, v$ locally Lipschitz, is denoted $\mathcal{M}$. For any $Q \in \mathcal{M}$, the reciprocal class of $Q$ is denoted $\mathcal{R}(Q)$, see Definition 3. We also denote $Q_t$ its marginal distribution at time $t$, $Q_{s,t}$ the joint distribution at times $s$ and $t$, $Q_{s|t}$ the conditional distribution at time $s$ given the state at time $t$, and $Q_{|0,T} \in \mathcal{P}(\mathcal{C})$ its diffusion bridge. Unless specified otherwise, all gradient operators are with respect to the variable $x_t$ with time index $t$. Let $(\mathsf{X}, \mathcal{X})$ and $(\mathsf{Y}, \mathcal{Y})$ be probability spaces.
Given a Markov kernel $\mathrm{K} : \mathsf{X} \times \mathcal{Y} \to [0,1]$ and a probability measure $\mu$ defined on $\mathsf{X}$, we write $\mu\mathrm{K}$ for the probability measure on $\mathsf{Y}$ such that for any $\mathsf{A} \in \mathcal{Y}$ we have $\mu\mathrm{K}(\mathsf{A}) = \int_{\mathsf{X}} \mathrm{K}(x, \mathsf{A})\,d\mu(x)$. In particular, for any joint distribution $\Pi_{0,T}$ over $\mathbb{R}^d \times \mathbb{R}^d$, we denote the mixture of bridges measure as $\Pi = \Pi_{0,T} Q_{|0,T} \in \mathcal{P}(\mathcal{C})$, which is short for $\Pi(\cdot) = \int_{\mathbb{R}^d \times \mathbb{R}^d} Q_{|0,T}(\cdot|x_0, x_T)\,\Pi_{0,T}(dx_0, dx_T)$.

## 2 Dynamic Mass Transport Techniques

### 2.1 Denoising Diffusion and Bridge Matching Models

Denoising Diffusion Models (Song et al., 2021b; Ho et al., 2020) are a popular class of generative models. They define a forward noising process $Q \in \mathcal{M}$ using the SDE $dX_t = -\tfrac{1}{2}X_t\,dt + dB_t$ on the time interval $[0,T]$, where $X_0 \in \mathbb{R}^d$ is drawn from the data distribution $\pi_0$ and $(B_t)_{t \in [0,T]}$ is a $d$-dimensional Brownian motion. This diffusion, known as the Ornstein–Uhlenbeck (OU) process or VPSDE (Song et al., 2021b), converges towards the standard Gaussian distribution $\mathrm{N}(0, \mathrm{Id})$ as $T \to \infty$. A generative model is given by its time-reversal $(Y_t)_{t \in [0,T]} = (X_{T-t})_{t \in [0,T]}$, where $Y_0 \sim Q_T$ and $dY_t = \{\tfrac{1}{2}Y_t + \nabla \log Q_{T-t}(Y_t)\}\,dt + dB_t$ (Anderson, 1982; Haussmann and Pardoux, 1986). In practice, $(Y_t)_{t \in [0,T]}$ is initialized with $Y_0 \sim \pi_T = \mathrm{N}(0, \mathrm{Id})$, and the Stein score $\nabla \log Q_t(x_t) = \mathbb{E}_{Q_{0|t}}[\nabla \log Q_{t|0}(X_t|X_0) \mid X_t = x_t]$ is approximated using a neural network $s_\theta(t, x_t)$ minimizing the denoising score matching loss $\mathbb{E}_{Q_{0,t}}[\|\nabla \log Q_{t|0}(X_t|X_0) - s_\theta(t, X_t)\|^2]$.

An alternative to considering the time-reversal of a forward noising process is to build bridges between the two distributions and learn a mimicking diffusion process. This approach generalizes DDMs and allows for more flexible choices of sampling processes. We call this framework Bridge Matching and adopt a presentation similar to Peluchetti (2021); Liu et al. (2022b), where $\pi_T$ is the data distribution. (To keep notations consistent with existing works, $\pi_0$ is the data distribution in the context of DDMs and SBs, whereas $\pi_T$ is the data distribution in Bridge Matching; both SB and Bridge Matching methods allow transfer between arbitrary distributions $\pi_0, \pi_T$, so this distinction is not important.) We denote by $Q \in \mathcal{M}$ the path measure associated with the following process:
$$dX_t = f_t(X_t)\,dt + \sigma_t\,dB_t, \qquad X_0 \sim Q_0. \tag{1}$$
Consider now the distribution of this process pinned down at an initial and terminal point $x_0, x_T$, denoted $Q_{|0,T}(\cdot|x_0, x_T)$. Under mild assumptions, the pinned process $Q_{|0,T}(\cdot|x_0, x_T)$ is a diffusion bridge and is given by
$$dX^{0,T}_t = \{f_t(X^{0,T}_t) + \sigma_t^2 \nabla \log Q_{T|t}(x_T|X^{0,T}_t)\}\,dt + \sigma_t\,dB_t, \qquad X^{0,T}_0 = x_0, \tag{2}$$
which satisfies $X^{0,T}_T = x_T$, using Doob $h$-transform theory (Rogers and Williams, 2000). Next, we define an independent coupling $\Pi_{0,T} = \pi_0 \otimes \pi_T$, and let $\Pi = \Pi_{0,T} Q_{|0,T}$. This path measure $\Pi$ is a mixture of bridges. We aim to find a Markov diffusion $dY_t = \{f_t(Y_t) + v_t(Y_t)\}\,dt + \sigma_t\,dB_t$ on $[0,T]$ which admits the same marginals as $\Pi$, i.e. for any $t \in [0,T]$, $Y_t \sim \Pi_t$, so that $Y_T \sim \pi_T$. For such $v_t$, a generative model for sampling the data distribution $\pi_T$ is obtained by simulating $(Y_t)_{t \in [0,T]}$. It can be verified that indeed $Y_t \sim \Pi_t$ for
$$v^\star_t(x_t) = \sigma_t^2\,\mathbb{E}_{\Pi_{T|t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_t = x_t].$$
We present the theory behind this idea more formally using Markovian projections in Section 3.1. In practice, we do not have access to $v^\star_t$ and it is learned using a neural network with the regression loss
$$\mathbb{E}_{\Pi_{t,T}}[\|\sigma_t^2 \nabla \log Q_{T|t}(X_T|X_t) - v_\theta(t, X_t)\|^2]. \tag{3}$$
For $f_t = 0$ and $\sigma_t = \sigma$, $Q_{|0,T}$ is a Brownian bridge and we have
$$X^{0,T}_t = \tfrac{t}{T} x_T + (1 - \tfrac{t}{T}) x_0 + \sigma (B_t - \tfrac{t}{T} B_T), \qquad dX^{0,T}_t = \{(x_T - X^{0,T}_t)/(T - t)\}\,dt + \sigma\,dB_t, \tag{4}$$
where $B_t - \tfrac{t}{T} B_T \sim \mathrm{N}(0, t(1 - \tfrac{t}{T})\,\mathrm{Id})$. The regression loss (3) associated with (4) is given by
$$\mathbb{E}_{\Pi_{t,T}}[\|(X_T - X_t)/(T - t) - v_\theta(t, X_t)\|^2]. \tag{5}$$
Letting $\sigma \to 0$, we recover Flow Matching models (see Appendix A.1 for further details).
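To make the Brownian bridge case concrete, here is a minimal PyTorch sketch of one Bridge Matching training step based on (4)–(5): sample a pair $(X_0, X_T)$ from a coupling, sample $X_t$ from the Brownian bridge, and regress a network onto the target $(X_T - X_t)/(T - t)$. The two-layer network and the independent Gaussian coupling below are illustrative placeholders rather than the architecture and data used in the paper.

```python
import torch

def sample_brownian_bridge(x0, xT, t, sigma, T=1.0):
    # Brownian bridge (4): X_t = (t/T) x_T + (1 - t/T) x_0 + noise with variance sigma^2 t (1 - t/T).
    mean = (t / T) * xT + (1 - t / T) * x0
    std = sigma * torch.sqrt(t * (1 - t / T))
    return mean + std * torch.randn_like(x0)

def bridge_matching_loss(v_theta, x0, xT, sigma, T=1.0):
    # Regression loss (5): || (X_T - X_t)/(T - t) - v_theta(t, X_t) ||^2.
    t = torch.rand(x0.shape[0], 1) * (T - 1e-3)   # keep t away from T where the target blows up
    xt = sample_brownian_bridge(x0, xT, t, sigma, T)
    target = (xT - xt) / (T - t)
    return ((v_theta(t, xt) - target) ** 2).sum(-1).mean()

# Toy usage with an independent coupling pi_0 x pi_T and a placeholder MLP.
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
v_theta = lambda t, x: net(torch.cat([t, x], dim=-1))
x0, xT = torch.randn(128, 2), torch.randn(128, 2) + 3.0
loss = bridge_matching_loss(v_theta, x0, xT, sigma=1.0)
loss.backward()
```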
### 2.2 Schrödinger Bridges and Optimal Transport

The Schrödinger Bridge (SB) problem (Schrödinger, 1932) consists in finding a path measure $P^{\mathrm{SB}} \in \mathcal{P}(\mathcal{C})$ such that
$$P^{\mathrm{SB}} = \operatorname{argmin}_{P}\{\mathrm{KL}(P|Q) : P_0 = \pi_0,\ P_T = \pi_T\}, \tag{6}$$
where $Q \in \mathcal{P}(\mathcal{C})$ is a reference path measure. In what follows, we consider $Q$ defined by the diffusion process (1), which is Markov, and without loss of generality we assume $Q_0 = \pi_0$. Hence $P^{\mathrm{SB}}$ is the path measure closest to $Q$ in terms of Kullback–Leibler divergence which satisfies the initial and terminal constraints $P^{\mathrm{SB}}_0 = \pi_0$ and $P^{\mathrm{SB}}_T = \pi_T$. Another crucial property of $P^{\mathrm{SB}}$ is that it can also be defined as a mixture of bridges $P^{\mathrm{SB}} = \Pi^{\mathrm{SB}}_{0,T} Q_{|0,T}$, where $\Pi^{\mathrm{SB}}_{0,T} = \operatorname{argmin}_{\Pi_{0,T}}\{\mathrm{KL}(\Pi_{0,T}|Q_{0,T}) : \Pi_0 = \pi_0,\ \Pi_T = \pi_T\}$ is the solution of the static SB problem (Léonard, 2014b). In particular, for $Q$ associated with $(\sigma B_t)_{t \in [0,T]}$ we have
$$\Pi^{\mathrm{SB}}_{0,T} = \operatorname{argmin}_{\Pi_{0,T}}\{\mathbb{E}_{\Pi_{0,T}}[\|X_0 - X_T\|^2] - 2\sigma^2 T\,\mathrm{H}(\Pi_{0,T}) : \Pi_0 = \pi_0,\ \Pi_T = \pi_T\},$$
where $\mathrm{H}(\mu)$ denotes the entropy, i.e. $\Pi^{\mathrm{SB}}_{0,T}$ is the solution of the entropy-regularized OT problem. In this case, the SB can also be obtained theoretically by solving the following problem (Dai Pra, 1991):
$$v^{\mathrm{SB}} = \operatorname{argmin}_{v}\Big\{\textstyle\int_0^T \mathbb{E}_{P_t}[\|v(t, X_t)\|^2]\,dt : dX_t = v(t, X_t)\,dt + \sigma\,dB_t,\ P_0 = \pi_0,\ P_T = \pi_T\Big\}.$$
Then $P^{\mathrm{SB}}$ is given by the SDE with drift $v^{\mathrm{SB}}$ initialized with $X_0 \sim \pi_0$. For $\sigma = 0$, we recover the classical OT problem and the Benamou–Brenier formula (Benamou and Brenier, 2000).

A common approach to solve (6) is the Iterative Proportional Fitting (IPF) method (Fortet, 1940; Kullback, 1968; Rüschendorf, 1995), defining a sequence of path measures $(\bar{P}^n)_{n \in \mathbb{N}}$ where
$$\bar{P}^{2n+1} = \operatorname{argmin}_{P}\{\mathrm{KL}(P|\bar{P}^{2n}) : P_T = \pi_T\}, \qquad \bar{P}^{2n+2} = \operatorname{argmin}_{P}\{\mathrm{KL}(P|\bar{P}^{2n+1}) : P_0 = \pi_0\}, \tag{7}$$
with initialization $\bar{P}^0 = Q$. This procedure alternates between projections on the set of path measures with given initial distribution $\pi_0$ and terminal distribution $\pi_T$. It can be shown (De Bortoli et al., 2021) that $(\bar{P}^n)_{n \in \mathbb{N}}$ are associated with diffusions and that for any $n \in \mathbb{N}$, $\bar{P}^{2n+1}$ is the time-reversal of $\bar{P}^{2n}$ with initialization $\pi_T$, and $\bar{P}^{2n+2}$ is the time-reversal of $\bar{P}^{2n+1}$ with initialization $\pi_0$. Leveraging this property, De Bortoli et al. (2021) proposed Diffusion Schrödinger Bridge (DSB), an algorithm which learns the time-reversals iteratively. In particular, DDMs can be seen as the first iteration of DSB.

## 3 Iterative Markovian Fitting

### 3.1 Markovian Projection and Reciprocal Projection

**Markovian Projection.** Projecting on Markov measures is a key ingredient in our methodology and in the Bridge Matching framework. This concept was introduced multiple times in the literature (Gyöngy, 1986; Peluchetti, 2021; Liu et al., 2022b). In particular, we focus on the Markovian projection of path measures given by a mixture of bridges $\Pi = \Pi_{0,T} Q_{|0,T} \in \mathcal{P}(\mathcal{C})$.

**Definition 1.** Assume that $Q$ is given by (1) and that for any $(x_0, x_T) \in \mathbb{R}^d \times \mathbb{R}^d$, $Q_{|0,T}(\cdot|x_0, x_T)$ is associated with $(X^{0,T}_t)_{t \in [0,T]}$ given by $dX^{0,T}_t = \{f_t(X^{0,T}_t) + \sigma_t^2 \nabla \log Q_{T|t}(x_T|X^{0,T}_t)\}\,dt + \sigma_t\,dB_t$, with $\sigma : [0,T] \to (0, +\infty)$.
Then, when it is well-defined, we introduce the Markovian projection of $\Pi$, $M^\star = \operatorname{proj}_{\mathcal{M}}(\Pi) \in \mathcal{M}$, which is associated with the SDE
$$dX^\star_t = \{f_t(X^\star_t) + v^\star_t(X^\star_t)\}\,dt + \sigma_t\,dB_t, \qquad v^\star_t(x_t) = \sigma_t^2\,\mathbb{E}_{\Pi_{T|t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_t = x_t].$$
Note that in our definition $\sigma_t > 0$, so $\nabla \log Q_{T|t}(x_T|x_t)$ is well-defined, but Flow Matching can be recovered as the deterministic case in the limit $\sigma_t = \sigma \to 0$. In the following proposition, we show that the Markovian projection is indeed a projection for the reverse Kullback–Leibler divergence, and that it preserves the marginals $\Pi_t$.

**Proposition 2.** Assume that $\sigma_t > 0$. Let $M^\star = \operatorname{proj}_{\mathcal{M}}(\Pi)$. Then, under mild assumptions, we have
$$M^\star = \operatorname{argmin}_{M}\{\mathrm{KL}(\Pi|M) : M \in \mathcal{M}\}, \qquad \mathrm{KL}(\Pi|M^\star) = \tfrac{1}{2}\textstyle\int_0^T \mathbb{E}_{\Pi_{0,t}}[\|\sigma_t^2\,\mathbb{E}_{\Pi_{T|0,t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_0, X_t] - v^\star_t(X_t)\|^2]/\sigma_t^2\,dt.$$
In addition, we have that for any $t \in [0,T]$, $M^\star_t = \Pi_t$. In particular, $M^\star_T = \Pi_T$.

**Reciprocal Projection.** While the Markovian projection ensures that the obtained measure is Markov, the associated bridge measure is not preserved in general, i.e. $\operatorname{proj}_{\mathcal{M}}(\Pi)_{|0,T} \neq \Pi_{|0,T} = Q_{|0,T}$. Measures with the same bridge as $Q$ are said to be in its reciprocal class (Léonard et al., 2014).

**Definition 3.** $\Pi \in \mathcal{P}(\mathcal{C})$ is in the reciprocal class $\mathcal{R}(Q)$ of $Q \in \mathcal{M}$ if $\Pi = \Pi_{0,T} Q_{|0,T}$. We define the reciprocal projection of $P \in \mathcal{P}(\mathcal{C})$ as $\Pi^\star = \operatorname{proj}_{\mathcal{R}(Q)}(P) = P_{0,T} Q_{|0,T}$.

Similarly to Proposition 2, we have the following result, which justifies the term reciprocal projection.

**Proposition 4.** Let $P \in \mathcal{P}(\mathcal{C})$ and $\Pi^\star = \operatorname{proj}_{\mathcal{R}(Q)}(P)$. Then $\Pi^\star = \operatorname{argmin}_{\Pi}\{\mathrm{KL}(P|\Pi) : \Pi \in \mathcal{R}(Q)\}$.

The reciprocal projection $\Pi^\star$ of a Markov path measure $M$ does not preserve the Markov property in general. In fact, the Schrödinger Bridge is the unique path measure which satisfies the initial and terminal conditions, is Markov, and is in the reciprocal class of $Q$, see (Léonard, 2014b).

**Proposition 5.** Let $P$ be a Markov measure in the reciprocal class of $Q$ such that $P_0 = \pi_0$, $P_T = \pi_T$. Then, under assumptions on $Q$, $\pi_0$ and $\pi_T$, $P$ is unique and is equal to the Schrödinger Bridge $P^{\mathrm{SB}}$.

### 3.2 Iterative Markovian Fitting

Based on Proposition 5, we propose a novel methodology called Iterative Markovian Fitting (IMF) to solve Schrödinger Bridges. We consider a sequence $(P^n)_{n \in \mathbb{N}}$ such that
$$P^{2n+1} = \operatorname{proj}_{\mathcal{M}}(P^{2n}), \qquad P^{2n+2} = \operatorname{proj}_{\mathcal{R}(Q)}(P^{2n+1}), \tag{8}$$
with $P^0$ such that $P^0_0 = \pi_0$, $P^0_T = \pi_T$ and $P^0 \in \mathcal{R}(Q)$. These updates correspond to alternately performing Markovian projections and reciprocal projections. Combining Proposition 2 and Definition 3, we get that for any $n \in \mathbb{N}$, $P^n_0 = \pi_0$ and $P^n_T = \pi_T$. This property is in contrast to the IPF algorithm (7), for which the marginals at the initial and final times are not preserved. We highlight this duality between IPF (7) and IMF (8) in Table 1.

We conclude this section with a theoretical analysis of IMF. First, we show a Pythagorean theorem for both the Markovian projection and the reciprocal projection.

**Lemma 6.** Under mild assumptions, if $M \in \mathcal{M}$, $\Pi \in \mathcal{R}(Q)$ and $\mathrm{KL}(\Pi|M) < +\infty$, we have
$$\mathrm{KL}(\Pi|M) = \mathrm{KL}(\Pi|\operatorname{proj}_{\mathcal{M}}(\Pi)) + \mathrm{KL}(\operatorname{proj}_{\mathcal{M}}(\Pi)|M).$$
If $\mathrm{KL}(M|\Pi) < +\infty$, we have
$$\mathrm{KL}(M|\Pi) = \mathrm{KL}(M|\operatorname{proj}_{\mathcal{R}(Q)}(M)) + \mathrm{KL}(\operatorname{proj}_{\mathcal{R}(Q)}(M)|\Pi).$$

Using Lemma 6, we have the following proposition.

**Proposition 7.** Under mild assumptions, we have $\mathrm{KL}(P^{n+1}|P^{\mathrm{SB}}) \leq \mathrm{KL}(P^n|P^{\mathrm{SB}}) < \infty$, and $\lim_{n \to +\infty} \mathrm{KL}(P^n|P^{n+1}) = 0$.

Hence, for the IMF sequence $(P^n)_{n \in \mathbb{N}}$, the Markov path measures $(P^{2n+1})_{n \in \mathbb{N}}$ are getting closer to the reciprocal class, while the reciprocal path measures $(P^{2n+2})_{n \in \mathbb{N}}$ are getting closer to the set of Markov measures.
Proposition 7 should be compared with (Rüschendorf, 1995, Proposition 2.1, Equation (2.16)), which shows that for the IPF sequence $(\bar{P}^n)_{n \in \mathbb{N}}$ we have $\lim_{n \to +\infty} \mathrm{KL}(\bar{P}^{n+1}|\bar{P}^n) = 0$. This result is similar to Proposition 7 but for the forward Kullback–Leibler divergence. Using Proposition 7, we finally prove the convergence of the IMF sequence $(P^n)_{n \in \mathbb{N}}$ to the Schrödinger Bridge. This result was first shown in the concurrent work (Peluchetti, 2023, Theorem 2). We present a simpler proof in Appendix C.6.

**Theorem 8.** Under mild assumptions, the IMF sequence $(P^n)_{n \in \mathbb{N}}$ admits a unique fixed point $P^\star = P^{\mathrm{SB}}$, and $\lim_{n \to +\infty} \mathrm{KL}(P^n|P^\star) = 0$.

## 4 Diffusion Schrödinger Bridge Matching

In this section, we present Diffusion Schrödinger Bridge Matching (DSBM), a practical algorithm for solving the SB problem obtained by combining the IMF procedure with Bridge Matching.

**Iterative Markovian Fitting in practice.** IMF alternately projects on the Markov class $\mathcal{M}$ and the reciprocal class $\mathcal{R}(Q)$. We denote $M^{n+1} = P^{2n+1} \in \mathcal{M}$ and $\Pi^n = P^{2n} \in \mathcal{R}(Q)$. Assuming we know how to sample from the bridge $Q_{|0,T}$ given the initial and terminal conditions, sampling from the reciprocal projection $\operatorname{proj}_{\mathcal{R}(Q)}(M)$ is simple: first, sample $(X_0, X_T)$ from the joint distribution $M_{0,T}$; then, sample from the bridge $Q_{|0,T}(\cdot|X_0, X_T)$. (In practice, we sample the SDE associated with $M$ and save a batch of joint samples $(X_0, X_T)$. This is similar to the trajectory caching procedure in De Bortoli et al. (2021), but we only retain initial and final samples.) The bottleneck of IMF is in the computation of Markovian projections. By Definition 1, $M^\star = \operatorname{proj}_{\mathcal{M}}(\Pi)$ is associated with the process
$$dX_t = \{f_t(X_t) + \sigma_t^2\,\mathbb{E}_{\Pi_{T|t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_t]\}\,dt + \sigma_t\,dB_t, \qquad X_0 \sim \pi_0.$$
By Proposition 2, we can learn $M^\star$ using $M^{\theta^\star}$ given by
$$dX_t = \{f_t(X_t) + v_{\theta^\star}(t, X_t)\}\,dt + \sigma_t\,dB_t, \qquad X_0 \sim \pi_0, \tag{9}$$
$$\theta^\star = \operatorname{argmin}_{\theta}\Big\{\textstyle\int_0^T \mathbb{E}_{\Pi_{t,T}}[\|\sigma_t^2 \nabla \log Q_{T|t}(X_T|X_t) - v_\theta(t, X_t)\|^2]/\sigma_t^2\,dt : \theta \in \Theta\Big\}, \tag{10}$$
where $\{v_\theta : \theta \in \Theta\}$ is a parametric family of functions, usually given by a neural network. The optimal $v_{\theta^\star}(t, x_t) = \sigma_t^2\,\mathbb{E}_{\Pi_{T|t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_t = x_t]$ for any $t \in [0,T]$ and $x_t \in \mathbb{R}^d$.

With the above two procedures for computing $\operatorname{proj}_{\mathcal{R}(Q)}(M)$ and $\operatorname{proj}_{\mathcal{M}}(\Pi)$, we can now describe a numerical method implementing IMF (8). Let $\Pi^0 = \Pi^0_{0,T} Q_{|0,T}$ where $\Pi^0_0 = \pi_0$, $\Pi^0_T = \pi_T$. Learn $M^1 \approx \operatorname{proj}_{\mathcal{M}}(\Pi^0)$ given by (9) with $v_\theta$ given by (10). Next, sample from $\Pi^1 = \operatorname{proj}_{\mathcal{R}(Q)}(M^1) = M^1_{0,T} Q_{|0,T}$ by sampling from $M^1_{0,T}$ and reconstructing the bridge $Q_{|0,T}$. We iterate the process to obtain a sequence $(\Pi^n, M^{n+1})_{n \in \mathbb{N}}$. In practice, this algorithm performs poorly (see Figure 3), since the approximate minimization (10) for computing $M^{n+1}$ may not give $M^{n+1}_T = \pi_T$ exactly as in Proposition 2. Instead, we incur a bias between $M^{n+1}_T$ and $\pi_T$ which accumulates for each $n \in \mathbb{N}$. To mitigate this problem, we alternate between a forward Markovian projection and a backward Markovian projection. This procedure is justified by the following proposition.

**Proposition 9.** Assume that $\Pi = \Pi_{0,T} Q_{|0,T}$ with $Q$ associated with $dX_t = f_t(X_t)\,dt + \sigma_t\,dB_t$. Under mild conditions, the Markovian projection $M^\star = \operatorname{proj}_{\mathcal{M}}(\Pi)$ is associated with both
$$dX_t = \{f_t(X_t) + \sigma_t^2\,\mathbb{E}_{\Pi_{T|t}}[\nabla \log Q_{T|t}(X_T|X_t) \mid X_t]\}\,dt + \sigma_t\,dB_t, \qquad X_0 \sim \Pi_0, \tag{11}$$
$$dY_t = \{-f_{T-t}(Y_t) + \sigma_{T-t}^2\,\mathbb{E}_{\Pi_{0|T-t}}[\nabla \log Q_{T-t|0}(Y_t|X_0) \mid Y_t]\}\,dt + \sigma_{T-t}\,dB_t, \qquad Y_0 \sim \Pi_T. \tag{12}$$

In Proposition 9, (11) is the definition of the Markovian projection, see Definition 1, while (12) is an equivalent representation as a time-reversal.
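For concreteness, here is a minimal sketch of the sampling side of DSBM described above, under the simplifying assumptions $f_t = 0$ and constant $\sigma$ (so that $Q_{|0,T}$ is the Brownian bridge (4)): an Euler–Maruyama simulation of the current Markov model that caches a batch of endpoint pairs $(X_0, X_T)$, and the reciprocal projection, which rebridges those pairs with $Q_{|0,T}$. The drift argument `v` stands for $f_t + v_\theta$ (or its backward analogue) and is a placeholder callable taking $(t, x)$.

```python
import torch

def cache_endpoints(v, x0, sigma, T=1.0, n_steps=100):
    # Euler-Maruyama simulation of dX_t = v(t, X_t) dt + sigma dB_t, retaining only (X_0, X_T),
    # i.e. a batch of joint samples from M_{0,T}.
    dt = T / n_steps
    x = x0.clone()
    with torch.no_grad():
        for k in range(n_steps):
            t = torch.full((x.shape[0], 1), k * dt)
            x = x + v(t, x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x0, x

def sample_reciprocal_projection(x0, xT, sigma, T=1.0):
    # Sample (t, X_t) from Pi = M_{0,T} Q_{|0,T}: rebridge the cached endpoints with the
    # Brownian bridge (4) of the reference measure Q.
    t = torch.rand(x0.shape[0], 1) * T
    mean = (t / T) * xT + (1 - t / T) * x0
    std = sigma * torch.sqrt(t * (1 - t / T))
    return t, mean + std * torch.randn_like(x0)
```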
In practice, $(Y_t)_{t \in [0,T]}$ is approximated with
$$dY_t = \{-f_{T-t}(Y_t) + v_{\phi^\star}(T-t, Y_t)\}\,dt + \sigma_{T-t}\,dB_t, \qquad Y_0 \sim \pi_T, \tag{13}$$
$$\phi^\star = \operatorname{argmin}_{\phi}\Big\{\textstyle\int_0^T \mathbb{E}_{\Pi_{0,t}}[\|\sigma_t^2 \nabla \log Q_{t|0}(X_t|X_0) - v_\phi(t, X_t)\|^2]/\sigma_t^2\,dt : \phi \in \Phi\Big\}. \tag{14}$$
The optimal $v_{\phi^\star}(t, x_t) = \sigma_t^2\,\mathbb{E}_{\Pi_{0|t}}[\nabla \log Q_{t|0}(X_t|X_0) \mid X_t = x_t]$ for any $t \in [0,T]$ and $x_t \in \mathbb{R}^d$. Note that $X_0 \sim \pi_0$ in the forward projection, while $Y_0 \sim \pi_T$ in the backward projection. Therefore, using the backward projection removes the bias on $\pi_T$ accumulated from the forward projection. Leveraging the time-symmetry of the Markovian projection and alternating between (13) and (9) yields the DSBM methodology summarized in Algorithm 1. It is also possible to learn both the forward and backward processes at each step, and enforce that the backward and forward processes match. We explore this in Appendix G.

**Algorithm 1** Diffusion Schrödinger Bridge Matching
1: **Input:** joint distribution $\Pi^0_{0,T}$, tractable bridge $Q_{|0,T}$, number of outer iterations $N \in \mathbb{N}$.
2: Let $\Pi^0 = \Pi^0_{0,T} Q_{|0,T}$.
3: **for** $n \in \{0, \dots, N-1\}$ **do**
4: Learn $v_{\phi^\star}$ using (14) with $\Pi = \Pi^{2n}$.
5: Let $M^{2n+1}$ be given by (13).
6: Let $\Pi^{2n+1} = M^{2n+1}_{0,T} Q_{|0,T}$.
7: Learn $v_{\theta^\star}$ using (10) with $\Pi = \Pi^{2n+1}$.
8: Let $M^{2n+2}$ be given by (9).
9: Let $\Pi^{2n+2} = M^{2n+2}_{0,T} Q_{|0,T}$.
10: **end for**
11: **Output:** $v_{\theta^\star}$, $v_{\phi^\star}$

**Initialization coupling.** We now relate Algorithm 1 to the classical IPF and practical algorithms such as DSB (De Bortoli et al., 2021). Instead of initializing DSBM with $\Pi^0_{0,T}$ given by a coupling between $\pi_0, \pi_T$, if we initialize it with $\Pi^0_{0,T} = Q_{0,T}$, where $Q_0 = \pi_0$ and $Q_{T|0}$ is given by the reference process defined in (1), then DSBM also recovers the IPF iterates used in DSB.

**Proposition 10.** Suppose the families of functions $\{v_\theta : \theta \in \Theta\}$ and $\{v_\phi : \phi \in \Phi\}$ are rich enough so that they can model the optimal vector fields. Let $(\Pi^n, M^{n+1})_{n \in \mathbb{N}}$ be the optimal DSBM sequence in Algorithm 1 initialized with $\Pi^0_{0,T} = Q_{0,T}$, and let $(\bar{P}^n)_{n \in \mathbb{N}}$ be the optimal DSB sequence given by the IPF iterates in (7). Then for any $n \in \mathbb{N}$, $n \geq 1$, we have $M^n = \bar{P}^n$.

We will thus call DSBM-IPF the DSBM algorithm initialized with the joint distribution given by the forward reference process, $\Pi^0_{0,T} = Q_{0,T}$, and DSBM-IMF the DSBM algorithm initialized with the independent coupling $\Pi^0_{0,T} = \pi_0 \otimes \pi_T$. However, the training procedure of DSBM-IPF is very different from the one of De Bortoli et al. (2021); Chen et al. (2022). In existing works, $\bar{P}^{n+1}$ is obtained as the time-reversal of $\bar{P}^n$, which requires full trajectories from $\bar{P}^n$, see e.g. (De Bortoli et al., 2021, Proposition 6). In contrast, in Algorithm 1 we only use the coupling $M^n_{0,T}$ to create the bridge measure $\Pi^n = M^n_{0,T} Q_{|0,T}$. By doing so, (i) the losses (10) and (14) can be easily evaluated at any time $t \in [0,T]$; (ii) the trajectory caching procedure in DSBM is more computationally and memory efficient; (iii) while every IPF iterate $\bar{P}^n$ is also supposed to be in $\mathcal{R}(Q)$, in practice one can observe a forgetting of the bridge $Q_{|0,T}$ (Fernandes et al., 2021). In DSBM, this effect is countered by explicit projections on the reciprocal class. See Appendix F for more details.

**Probability flow ODE.** At equilibrium of DSBM, $(Y_t)_{t \in [0,T]}$ given by (13) is the time-reversal of $(X_t)_{t \in [0,T]}$ given by (9), and both are associated with the optimal Schrödinger Bridge path measure $P^\star$. As a result, we have $v_{\phi^\star}(t, x) = -v_{\theta^\star}(t, x) + \sigma_t^2 \nabla \log P^\star_t(x)$. Hence, a probability flow $(Z^\star_t)_{t \in [0,T]}$ such that $\mathrm{Law}(Z^\star_t) = P^\star_t$ for any $t \in [0,T]$ is given by
$$dZ^\star_t = \{f_t(Z^\star_t) + \tfrac{1}{2}[v_{\theta^\star}(t, Z^\star_t) - v_{\phi^\star}(t, Z^\star_t)]\}\,dt, \qquad Z^\star_0 \sim \pi_0.$$
See also De Bortoli et al. (2021); Chen et al. (2022) for a derivation of this result.
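A condensed sketch of Algorithm 1 in the same Brownian bridge setting ($f_t = 0$, constant $\sigma$), reusing the routines above: each outer iteration fits the backward drift via (14), rebridges the coupling induced by the backward model, then fits the forward drift via (10) and rebridges again. Here `v_fwd` and `v_bwd` are assumed to be `torch.nn.Module`s whose forward pass takes $(t, x)$; optimizer settings, step counts and the size of the cached coupling are simplified relative to the paper's implementation.

```python
import torch

def fit_drift(net, sample_pairs, direction, sigma, T=1.0, steps=2000, lr=1e-3):
    # Bridge Matching fit. For the Brownian bridge, the forward target in (10) is
    # (X_T - X_t)/(T - t) and the backward target in (14) is (X_0 - X_t)/t.
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        x0, xT = sample_pairs()
        t = 1e-3 + (T - 2e-3) * torch.rand(x0.shape[0], 1)
        xt = (t / T) * xT + (1 - t / T) * x0 + sigma * torch.sqrt(t * (1 - t / T)) * torch.randn_like(x0)
        target = (xT - xt) / (T - t) if direction == "fwd" else (x0 - xt) / t
        loss = ((net(t, xt) - target) ** 2).sum(-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def simulate_coupling(net, x_init, direction, sigma, T=1.0, n_steps=100):
    # Euler-Maruyama simulation of (9) (forward, started at pi_0) or (13) (backward, started
    # at pi_T), with f_t = 0. Returns endpoint pairs (X_0, X_T) of the induced coupling.
    dt = T / n_steps
    x = x_init.clone()
    with torch.no_grad():
        for k in range(n_steps):
            s = k * dt
            t = torch.full((x.shape[0], 1), s if direction == "fwd" else T - s)
            x = x + net(t, x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return (x_init, x) if direction == "fwd" else (x, x_init)

def dsbm(sample_pi0, sample_piT, v_fwd, v_bwd, sigma=1.0, n_outer=10):
    # DSBM-IMF initialization: Pi^0_{0,T} = pi_0 x pi_T (lines 1-2 of Algorithm 1).
    coupling = lambda: (sample_pi0(), sample_piT())
    for _ in range(n_outer):
        fit_drift(v_bwd, coupling, "bwd", sigma)                                 # lines 4-5
        coupling = lambda: simulate_coupling(v_bwd, sample_piT(), "bwd", sigma)  # line 6
        fit_drift(v_fwd, coupling, "fwd", sigma)                                 # lines 7-8
        coupling = lambda: simulate_coupling(v_fwd, sample_pi0(), "fwd", sigma)  # line 9
    return v_fwd, v_bwd
```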
Note however that the path measure induced by $(Z^\star_t)_{t \in [0,T]}$ does not correspond to $P^\star$; in particular, $(Z^\star_0, Z^\star_T)$ is not an entropic OT plan. However, since for any $t \in [0,T]$, $Z^\star_t$ has marginal distribution $P^\star_t$, we can compute the log-likelihood of the model (Song et al., 2021b; Huang et al., 2021).

## 5 Related Work

**Markovian projection and Bridge Matching.** The concept of Markovian projection has been rediscovered multiple times (Krylov, 1984; Gyöngy, 1986; Dupire, 1994). In the machine learning context, it was first proposed by Peluchetti (2021) to define Bridge Matching models. More recently, Liu et al. (2022b) derived theoretical properties of the Markovian projection, namely Proposition 2 and the first part of Lemma 6, and applied Bridge Matching to learning data on discrete and constrained domains.

**Bridge and Flow Matching.** Flow Matching corresponds to deterministic bridges with deterministic samplers (ODEs) and has been under active study (Liu et al., 2023b; Liu, 2022; Lipman et al., 2023; Albergo and Vanden-Eijnden, 2023; Heitz et al., 2023; Pooladian et al., 2023; Tong et al., 2023). Denoising Diffusion Implicit Models (DDIM) (Song et al., 2021a) can also be formulated as a discrete-time version of Flow Matching, see Liu et al. (2023b). These models have been extended to the Riemannian setting by Chen and Lipman (2023). Recently, Albergo et al. (2023) studied the influence of stochasticity in the bridge through the concept of stochastic interpolants. Liu et al. (2023a); Delbracio and Milanfar (2023) used Bridge Matching to perform image restoration tasks and noted the benefits of stochasticity empirically. Closely related to our work is the Rectified Flow algorithm of Liu et al. (2023b), which corresponds to an iterative Flow Matching procedure designed to improve the straightness of the flow and thus ease its simulation. An iterative rectifying procedure using stochastic interpolants is also proposed in (Albergo et al., 2023, Section 3.5). Our proposed DSBM-IMF algorithm is closest to Rectified Flow, which can be seen as the deterministic limiting case of DSBM-IMF as $\sigma \to 0$. However, there are a few important theoretical and practical differences. Most notably, we adopt the SDE approach, which is crucial for the validity of Proposition 5 as well as for the empirical performance of DSBM. We discuss further distinctions between DSBM and Rectified Flow in Appendix A.3.

**Diffusion Schrödinger Bridge.** Schrödinger Bridges (Schrödinger, 1932) are ubiquitous in probability theory (Léonard, 2014b) and stochastic control (Dai Pra, 1991; Chen et al., 2021). More recently, they have been used for generative modeling: De Bortoli et al. (2021) introduced the DSB algorithm, and Vargas et al. (2021); Chen et al. (2022) introduced similar algorithms. The case of a Dirac delta terminal distribution was investigated by Wang et al. (2021). These methods were later extended to solve conditional simulation and more general control problems (Shi et al., 2022; Thornton et al., 2022; Liu et al., 2022a; Chen et al., 2023; Tamir et al., 2023). In Somnath et al. (2023), SBs are learned using one Bridge Matching iteration, assuming access to the true Schrödinger static coupling. Our proposed method DSBM-IPF is closest to DSB, but with improved continuous-time training and projections on the reciprocal class which mitigate two limitations of DSB. Concurrently with our work, Peluchetti (2023) independently introduced the DSBM-IMF approach (named IDBM therein).

## 6 Experiments

**2D Experiments.**
We first show that our proposed methods can generate correct samples and learn lower kinetic energy transport maps on some 2D examples. We compare our method DSBM with flow-based methods including Flow Matching (FM) (Lipman et al., 2023), Conditional Flow Matching (CFM) and OT-CFM (Tong et al., 2023), and Rectified Flow (RF) (Liu et al., 2023b), and with other SB methods including DSB (De Bortoli et al., 2021) and SB-CFM (Tong et al., 2023). OT-CFM and SB-CFM utilize sample-based mini-batch OT or EOT solvers (Fatras et al., 2021; Flamary et al., 2021) to define an approximate OT or SB static coupling $\Pi^{\mathrm{OT}}_{0,T}$ or $\Pi^{\mathrm{SB}}_{0,T}$, see also Pooladian et al. (2023); Stromme (2023). We can also utilize this idea in the DSBM-IMF framework, which corresponds to using the initialization coupling $\Pi^0_{0,T} = \Pi^{\mathrm{SB}}_{0,T}$ in Algorithm 1. This approximate SB coupling $\Pi^{\mathrm{SB}}_{0,T}$ also satisfies $\Pi^{\mathrm{SB}}_0 = \pi_0$ and $\Pi^{\mathrm{SB}}_T = \pi_T$, but can provide a better initialization than the independent coupling $\Pi^0_{0,T} = \pi_0 \otimes \pi_T$. We name this approach DSBM-IMF+. The rest of the methods do not use OT solvers; DSB and DSBM directly learn the EOT map as the solution of the diffusion process.

In Table 2, we show the 2-Wasserstein distance between the true and generated samples, as well as the integrated path energy defined as $\mathbb{E}[\int_0^T \|v(t, Z_t)\|^2\,dt]$, where $v$ is the learned drift along the ODE trajectory $Z_t$. For direct comparability, we report results for DSBM using its probability flow ODE. Lower path energies represent shorter (and potentially easier to integrate) trajectories. We find that in this low-dimensional setting, OT-CFM performs the best by utilizing OT solvers, but DSBM outperforms FM and CFM when OT solvers are not used. Further, DSBM outperforms DSB on all datasets, suggesting DSBM solves the SB problem with higher accuracy. The results also show that, among SB methods, DSBM-IMF+ can achieve lower sampling error than DSBM-IPF and DSBM-IMF. It also performs better than SB-CFM on 3 of the 4 datasets and achieves lower path energy on all datasets. Finally, we find Rectified Flow achieves lower sampling error than DSBM except for the moons-8gaussians task, for which DSBM is significantly more accurate. Since RF can be informally seen as DSBM in the case $\sigma \to 0$, this suggests the optimal $\sigma$ varies for each task and between generative and general transfer tasks. Figure 2 visualizes how $\sigma$ affects the straightness and sample quality of learned transport maps between two mixture distributions.

**High-Dimensional Gaussian Experiment.** We next perform the Gaussian transport experiment of De Bortoli et al. (2021) with dimension $d = 50$ to verify the scalability of our proposed approach. The true SB can be computed analytically in this case (Bunne et al., 2023). In Figure 3, we plot the convergence of the learned mean $\mathbb{E}[X_0]$, variance $\mathrm{Var}(X_0)$, and covariance $\mathrm{Cov}(X_0, X_T)$ between times $0$ and $T$. We also consider RF and a related baseline IMF-b, which performs IMF numerically but only in the backward direction. All methods converge approximately to the correct mean.
**2-Wasserstein distance (Euler, 20 steps)**

| Method | moons | scurve | 8gaussians | moons-8gaussians |
|---|---|---|---|---|
| DSBM-IPF | 0.140 ± 0.006 | 0.140 ± 0.024 | 0.315 ± 0.079 | 0.812 ± 0.092 |
| DSBM-IMF | 0.144 ± 0.024 | 0.145 ± 0.037 | 0.338 ± 0.091 | 0.838 ± 0.098 |
| DSBM-IMF+ | *0.123 ± 0.014* | 0.130 ± 0.025 | 0.276 ± 0.030 | *0.802 ± 0.172* |
| DSB | 0.190 ± 0.049 | 0.272 ± 0.065 | 0.411 ± 0.084 | 0.987 ± 0.324 |
| SB-CFM | 0.129 ± 0.024 | 0.136 ± 0.030 | **0.238 ± 0.044** | 0.843 ± 0.079 |
| FM | 0.212 ± 0.025 | 0.161 ± 0.033 | 0.351 ± 0.066 | – |
| CFM | 0.215 ± 0.028 | 0.171 ± 0.023 | 0.370 ± 0.049 | 1.285 ± 0.314 |
| RF | 0.129 ± 0.022 | *0.126 ± 0.019* | 0.267 ± 0.041 | 1.522 ± 0.304 |
| OT-CFM | **0.111 ± 0.005** | **0.102 ± 0.013** | *0.253 ± 0.040* | **0.716 ± 0.187** |

**Path energy**

| Method | moons | scurve | 8gaussians | moons-8gaussians |
|---|---|---|---|---|
| DSBM-IPF | 1.598 ± 0.034 | 2.110 ± 0.059 | 14.91 ± 0.310 | 42.16 ± 1.026 |
| DSBM-IMF | 1.580 ± 0.036 | 2.092 ± 0.053 | **14.81 ± 0.255** | 41.00 ± 1.495 |
| DSBM-IMF+ | 1.594 ± 0.043 | 2.116 ± 0.018 | 14.88 ± 0.252 | 41.09 ± 1.206 |
| DSB | – | – | – | – |
| SB-CFM | 1.649 ± 0.035 | 2.144 ± 0.044 | 15.08 ± 0.209 | 45.69 ± 0.661 |
| FM | 2.227 ± 0.056 | 2.950 ± 0.074 | 18.12 ± 0.416 | – |
| CFM | 2.391 ± 0.043 | 3.071 ± 0.026 | 18.00 ± 0.090 | 116.5 ± 2.633 |
| RF | *1.185 ± 0.052* | *1.633 ± 0.074* | *14.84 ± 0.441* | *37.61 ± 3.906* |
| OT-CFM | **1.178 ± 0.020** | **1.577 ± 0.036** | 15.10 ± 0.215 | **30.50 ± 0.626** |

Table 2: Sampling quality as measured by 2-Wasserstein distance and path energy for the 2D experiments; mean ± 1 SD over 5 seeds. Best values are in bold and second best are italicized.

Figure 2: Learned SB probability flow between two mixtures of Gaussians (green to yellow).

Figure 3: Convergence of the Gaussian experiment in $d = 50$ (estimates over 20 iterations for DSB, IMF-b, DSBM-IPF, DSBM-IMF and RF).

However, the variance estimates become inaccurate for RF and IMF-b. Among SB methods, DSB and IMF-b also give inaccurate SB covariance estimates as the number of iterations increases. On the other hand, DSBM does not suffer from this issue. In Table 3, we further quantify the accuracy and compare with SB-CFM (Tong et al., 2023) by computing the KL divergence between the marginal distributions of the learned process $P_t$ and the true SB $P^{\mathrm{SB}}_t$. Our proposed methods achieve similar KL divergence to SB-CFM in dimension $d = 5$, but are much more accurate in higher dimensions.

| KL (×10⁻³) | d = 5 | d = 20 | d = 50 |
|---|---|---|---|
| DSB | 3.26 ± 1.60 | 13.0 ± 3.49 | 32.8 ± 1.28 |
| SB-CFM | 1.45 ± 0.73 | 12.3 ± 1.47 | 49.4 ± 3.91 |
| DSBM-IPF | 1.23 ± 0.23 | 4.42 ± 0.76 | 8.75 ± 0.87 |
| DSBM-IMF | 1.34 ± 0.51 | 5.05 ± 0.95 | 9.76 ± 1.67 |

Table 3: Average $\mathrm{KL}(P_t|P^{\mathrm{SB}}_t)$ (×10⁻³) at 21 uniformly spaced times $t$.

**MNIST, EMNIST transfer.** We test our method for domain transfer between MNIST digits and EMNIST letters as in De Bortoli et al. (2021). We compare DSBM as a direct substitute of DSB, and also with Bridge Matching (BM) (Peluchetti, 2021; Liu et al., 2022b), CFM, OT-CFM and RF. We plot some output samples from different algorithms in Figure 4 and the convergence of the FID score in Figure 5. We find that OT-CFM becomes less applicable in higher dimensions and produces samples of worse quality (Figure 4a). On the other hand, image quality deteriorates during training of DSB and RF. DSBM achieves higher quality samples visually, and does not suffer from deterioration. It is also about 30% more efficient than DSB in terms of runtime.

Figure 4: Samples of MNIST digits transferred from letters; (c) DSBM-IPF.

Figure 5: FID vs. iteration (training step ×1000) for DSB, DSBM-IPF, DSBM-IMF, BM, CFM, OT-CFM and RF.
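For reference, the path energy reported for DSBM in Table 2 is computed along the probability flow ODE $dZ^\star_t = \{f_t(Z^\star_t) + \tfrac{1}{2}[v_{\theta^\star}(t, Z^\star_t) - v_{\phi^\star}(t, Z^\star_t)]\}\,dt$ from Section 4. Below is a minimal sketch of this computation, assuming $f_t = 0$ and plain Euler integration, with `v_fwd` and `v_bwd` denoting the trained forward and backward drifts.

```python
import torch

def path_energy(v_fwd, v_bwd, x0, T=1.0, n_steps=100):
    # Integrate the probability flow ODE dZ_t = 0.5 * (v_theta - v_phi)(t, Z_t) dt (here f_t = 0)
    # and accumulate a Riemann-sum estimate of E[ int_0^T ||dZ_t/dt||^2 dt ].
    dt = T / n_steps
    z = x0.clone()
    energy = torch.zeros(x0.shape[0])
    with torch.no_grad():
        for k in range(n_steps):
            t = torch.full((z.shape[0], 1), k * dt)
            drift = 0.5 * (v_fwd(t, z) - v_bwd(t, z))
            energy += (drift ** 2).sum(-1) * dt
            z = z + drift * dt
    return z, energy.mean()  # final ODE samples and estimated path energy
```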
**CelebA transfer.** Next, we evaluate and perform some ablations of our method on a transfer task on the CelebA 64×64 dataset. We consider the images given by the tokens male/old and female/young. In Figures 6 and 7, we show that as $\sigma$ increases, the quality of the images (as measured by the FID score) increases until $\sigma$ is too high, but the alignment (as measured by LPIPS) between the generated image and the original sample decreases. Additionally, we investigate the dependency between $\sigma$ and image dimension in Figure 8. In particular, for the same $\sigma = 1$, the outputs of DSBM for CelebA 128×128 are better aligned with the original data than for CelebA 64×64. This is in agreement with the observations of Chen (2023); Hoogeboom et al. (2023) that the noise schedule in diffusion models should scale with the resolution.

Figure 6: Left to right: initial and generated samples (64×64) obtained after 20 DSBM-IMF iterations for $\sigma^2 \in \{0.01, 0.1, 1, 10\}$.

Figure 7: FID (blue) and LPIPS (red) scores (lower is better for both) as we vary $\sigma^2$.

Figure 8: Top to bottom: DSBM ($\sigma = 1$) at 64×64; at 128×128; original images.

**AFHQ transfer.** We demonstrate the scalability of our method on an additional transfer experiment on the AFHQ 512×512 dataset between the classes cat and wild. The results are shown in Figure 9. On this higher-dimensional problem, we observe that DSBM can also generate realistic samples which are similar to the input.

Figure 9: DSBM domain transfer results on the AFHQ 512×512 dataset: (a) transfer between classes cat and wild; (b) transfer between classes wild and cat.

**Unpaired Fluid Flows Downscaling.** Finally, we apply DSBM to perform downscaling of geophysical fluid dynamics, i.e. super-resolution of low-resolution spatial data. We use the dataset of Bischoff and Deck (2023), which consists of unpaired low-resolution (64×64) and high-resolution (512×512) fields. As shown in Figure 10, DSBM is able to learn high-resolution reconstructions by only slightly noising the low-resolution input. In contrast, Bischoff and Deck (2023) use two diffusion models in forward and backward directions (Diffusion-fb) based on Meng et al. (2022), which improves over the Random baseline. Figure 11 shows that DSBM-IPF and DSBM-IMF achieve much lower ℓ2 distances for all frequency classes in the dataset than Diffusion-fb (and thus Random), indicating that DSBM is able to reconstruct high-resolution fields consistent with the low-resolution source.

Figure 10: (a) Left to right: source low-resolution sample, intermediate state and final reconstruction of DSBM-IPF; (b) an unpaired high-resolution sample.

Figure 11: ℓ2 distance between low-resolution source and high-resolution reconstructed fields (supersaturation, by data wavenumber) for Random, Diffusion-fb, DSBM-IPF and DSBM-IMF.

## 7 Discussion

In this work, we introduce IMF, a new methodology for learning Schrödinger Bridges. IMF is an alternative to the classical IPF and can be interpreted as its dual. Building on this new framework, we present two practical algorithms, DSBM-IPF and DSBM-IMF, for learning SBs. These algorithms mitigate the time-discretization and bias accumulation issues of existing methods. However, DSBM still has some limitations. First, our results suggest DSBM is most effective for solving general transport problems. For generative modeling, we only find minor improvements compared to Bridge and Flow Matching on CIFAR-10 (see Appendix I.6). Second, while DSBM is more efficient than DSB, it still requires sampling from the learned process during the caching step. Finally, the EOT problem becomes more difficult to solve numerically for small values of $\sigma$.
In future work, we would like to further investigate the differences between DSBM-IMF and DSBM-IPF. IMF also appears useful for developing a better understanding of the Rectified Flow algorithm (Liu et al., 2023b), as IMF minimizes a clear objective (6) and Rectified Flow can be seen as a limiting case of it. Finally, Rectified Flow has also been extended to solve OT problems with general convex costs by Liu (2022), and it would be interesting to derive an SB version of this extension.

**Acknowledgements.** YS acknowledges support from the Huawei UK Fellowship. AC acknowledges support from the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1). AD acknowledges support of the UK Dstl and EPSRC grant EP/R013616/1; this is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative. He also acknowledges support from the EPSRC grants CoSInES (EP/R034710/1) and Bayes4Health (EP/R018561/1).

## References

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

Albergo, M. S. and Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations.

Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326.

Banerjee, A., Guo, X., and Wang, H. (2005). On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669.

Barczy, M. and Kern, P. (2013). Representations of multidimensional linear process bridges. Random Operators and Stochastic Equations, 21(2):159–189.

Benamou, J.-D. and Brenier, Y. (2000). A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393.

Bernton, E., Heng, J., Doucet, A., and Jacob, P. E. (2019). Schrödinger bridge samplers. arXiv preprint arXiv:1912.13170.

Bischoff, T. and Deck, K. (2023). Unpaired downscaling of fluid flows with diffusion bridges. arXiv preprint arXiv:2305.01822.

Bogachev, V. I., Krasovitskii, T. I., and Shaposhnikov, S. V. (2021). On uniqueness of probability solutions of the Fokker–Planck–Kolmogorov equation. Sbornik: Mathematics, 212(6):745.

Bunne, C., Hsieh, Y.-P., Cuturi, M., and Krause, A. (2023). The Schrödinger bridge between Gaussian measures has a closed form. In International Conference on Artificial Intelligence and Statistics.

Bunne, C., Papaxanthos, L., Krause, A., and Cuturi, M. (2022). Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pages 6511–6528. PMLR.

Caluya, K. F. and Halder, A. (2021). Wasserstein proximal algorithms for the Schrödinger bridge problem: Density control with nonlinear drift. IEEE Transactions on Automatic Control, 67(3):1163–1178.

Chen, R. T. and Lipman, Y. (2023). Riemannian flow matching on general geometries. arXiv preprint arXiv:2302.03660.

Chen, T. (2023). On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972.

Chen, T., Liu, G.-H., Tao, M., and Theodorou, E. A. (2023). Deep momentum multi-marginal Schrödinger bridge. arXiv preprint arXiv:2303.01751.

Chen, T., Liu, G.-H., and Theodorou, E. A. (2022). Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations.
Chen, Y., Georgiou, T., and Pavon, M. (2016). Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM Journal on Applied Mathematics, 76(6):2375–2396.

Chen, Y., Georgiou, T. T., and Pavon, M. (2021). Optimal transport in systems and control. Annual Review of Control, Robotics, and Autonomous Systems, 4.

Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. (2020). StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Chung, K. L. and Walsh, J. B. (2006). Markov Processes, Brownian Motion, and Time Symmetry, volume 249. Springer Science & Business Media.

Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems.

Dai Pra, P. (1991). A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. (2021). Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems.

Delbracio, M. and Milanfar, P. (2023). Inversion by direct iteration: An alternative to denoising diffusion for image restoration. arXiv preprint arXiv:2303.11435.

Dupire, B. (1994). Pricing with a smile. Risk, 7(1):18–20.

Essid, M. and Pavon, M. (2019). Traversing the Schrödinger bridge strait: Robert Fortet's marvelous proof redux. Journal of Optimization Theory and Applications, 181(1):23–60.

Fatras, K., Zine, Y., Majewski, S., Flamary, R., Gribonval, R., and Courty, N. (2021). Minibatch optimal transport distances; analysis and applications. arXiv preprint arXiv:2101.01792.

Fernandes, D. L., Vargas, F., Ek, C. H., and Campbell, N. D. (2021). Shooting Schrödinger's cat. In Fourth Symposium on Advances in Approximate Bayesian Inference.

Feydy, J., Charlier, B., Vialard, F.-X., and Peyré, G. (2017). Optimal transport for diffeomorphic registration. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 291–299.

Finlay, C., Gerolin, A., Oberman, A. M., and Pooladian, A.-A. (2020). Learning normalizing flows from entropy-Kantorovich potentials. arXiv preprint arXiv:2006.06033.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. (2021). POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8.

Föllmer, H. (1988). Random fields and diffusion processes. In École d'Été de Probabilités de Saint-Flour XV–XVII, 1985–87, pages 101–203. Springer.

Fortet, R. (1940). Résolution d'un système d'équations de M. Schrödinger. Journal de Mathématiques Pures et Appliquées, 1:83–105.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.

Gyöngy, I. (1986). Mimicking the one-dimensional marginal distributions of processes having an Itô differential. Probability Theory and Related Fields, 71:501–516.

Haussmann, U. G. and Pardoux, E. (1986). Time reversal of diffusions. The Annals of Probability, 14(4):1188–1205.
Heitz, E., Belcour, L., and Chambon, T. (2023). Iterative α-(de)blending: a minimalist deterministic diffusion model. In SIGGRAPH.

Heng, J., De Bortoli, V., Doucet, A., and Thornton, J. (2021). Simulating diffusion bridges with score matching. arXiv preprint arXiv:2111.07243.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems.

Hoogeboom, E., Heek, J., and Salimans, T. (2023). simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093.

Huang, C.-W., Lim, J. H., and Courville, A. C. (2021). A variational perspective on diffusion-based generative models and score matching. In Advances in Neural Information Processing Systems.

Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4).

Krylov, N. (1984). Once more about the connection between elliptic operators and Itô's stochastic equations. In Statistics and Control of Stochastic Processes, Steklov Seminar, pages 214–229.

Kullback, S. (1968). Probability densities with given marginals. The Annals of Mathematical Statistics, 39(4):1236–1243.

Kullback, S. (1997). Information Theory and Statistics. Dover Publications, Inc., Mineola, NY. Reprint of the second (1968) edition.

Léonard, C. (2012). Girsanov theory under a finite entropy condition. In Séminaire de Probabilités XLIV, pages 429–465. Springer.

Léonard, C. (2014a). Some properties of path measures. In Séminaire de Probabilités XLVI, pages 207–230. Springer.

Léonard, C. (2014b). A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems-A, 34(4):1533–1574.

Léonard, C., Rœlly, S., Zambrini, J.-C., et al. (2014). Reciprocal processes. A measure-theoretical point of view. Probability Surveys, 11:237–269.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative modeling. In International Conference on Learning Representations.

Liu, G.-H., Chen, T., So, O., and Theodorou, E. A. (2022a). Deep generalized Schrödinger bridge. In Advances in Neural Information Processing Systems.

Liu, G.-H., Vahdat, A., Huang, D.-A., Theodorou, E. A., Nie, W., and Anandkumar, A. (2023a). I2SB: Image-to-image Schrödinger bridge. arXiv preprint arXiv:2302.05872.

Liu, Q. (2022). Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577.

Liu, X., Gong, C., and Liu, Q. (2023b). Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations.

Liu, X., Wu, L., Ye, M., and Liu, Q. (2022b). Let us build bridges: Understanding and extending diffusion generative models. arXiv preprint arXiv:2208.14699.

Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In International Conference on Computer Vision.

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. (2022). SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations.

Palmowski, Z. and Rolski, T. (2002). A technique for exponential change of measure for Markov processes. Bernoulli, pages 767–785.

Pavon, M., Trigila, G., and Tabak, E. G. (2021). The data-driven Schrödinger bridge. Communications on Pure and Applied Mathematics, 74:1545–1573.

Peluchetti, S. (2021). Non-denoising forward-time diffusions. https://openreview.net/forum?id=oVfIKuhqfC.
Peluchetti, S. (2023). Diffusion bridge mixture transports, Schrödinger bridge problems and generative modeling. arXiv preprint arXiv:2304.00917.

Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.

Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. (2023). Multisample flow matching: Straightening flows with minibatch couplings. In International Conference on Machine Learning.

Rogers, L. C. G. and Williams, D. (2000). Diffusions, Markov Processes, and Martingales. Vol. 2. Cambridge Mathematical Library. Cambridge University Press, Cambridge. Itô calculus, reprint of the second (1994) edition.

Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure. The Annals of Statistics, 23(4):1160–1174.

Rüschendorf, L. and Thomsen, W. (1993). Note on the Schrödinger equation and I-projections. Statistics & Probability Letters, 17(5):369–375.

Schrödinger, E. (1932). Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. Annales de l'Institut Henri Poincaré, 2(4):269–310.

Shi, Y., De Bortoli, V., Deligiannidis, G., and Doucet, A. (2022). Conditional simulation using diffusion Schrödinger bridges. In Uncertainty in Artificial Intelligence.

Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405.

Somnath, V. R., Pariset, M., Hsieh, Y.-P., Martinez, M. R., Krause, A., and Bunne, C. (2023). Aligned diffusion Schrödinger bridges. arXiv preprint arXiv:2302.11419.

Song, J., Meng, C., and Ermon, S. (2021a). Denoising diffusion implicit models. In International Conference on Learning Representations.

Song, K.-U. (2022). Applying regularized Schrödinger-bridge-based stochastic process in generative modeling. arXiv preprint arXiv:2208.07131.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.

Stromme, A. (2023). Sampling from a Schrödinger bridge. In International Conference on Artificial Intelligence and Statistics.

Tamir, E., Trapp, M., and Solin, A. (2023). Transport with support: Data-conditional diffusion bridges. arXiv preprint arXiv:2301.13636.

Thornton, J., Hutchinson, M., Mathieu, E., De Bortoli, V., Teh, Y. W., and Doucet, A. (2022). Riemannian diffusion Schrödinger bridge. arXiv preprint arXiv:2207.03024.

Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. (2023). Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482.

van Erven, T. and Harremoës, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820.

Vargas, F., Thodoroff, P., Lamacraft, A., and Lawrence, N. (2021). Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134.

Villani, C. (2009). Optimal Transport, volume 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin. Old and new.

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674.

Wang, G., Jiao, Y., Xu, Q., Wang, Y., and Yang, C. (2021). Deep generative learning via Schrödinger bridge. In International Conference on Machine Learning.
## Outline of the Appendix

In Appendix A, we first clarify the relationship between different methods in the existing literature and our proposed DSBM framework. In Appendix B, we focus on the family of linear SDEs and draw a link between the parameterization of bridges in this paper and the stochastic interpolants of Albergo et al. (2023). In Appendix C, we give proofs for results in the main text. In Appendix D, we present additional theoretical results for IMF in the Gaussian case. In Appendix E, we derive the discrete-time version of the Markovian projection. In Appendix F, we explain the benefits of DSBM compared to DSB in more detail. In Appendix G, we describe a method for learning the forward and backward processes jointly and propose a consistency loss between the forward and backward processes. In Appendix H, we present additional methodological details for a practical scaling of the loss function to reduce variance, similar to standard Denoising Diffusion Models. In Appendix I, we give further details for all experiments and additional experimental results. Finally, we discuss broader impacts of our work in Appendix J.

## A Discussion of Existing Works

### A.1 Bridge Matching and Flow Matching Models

In this section, we clarify the relationship between variants of Flow Matching and show that they are equivalent under some conditions. We follow the nomenclature of Tong et al. (2023). We refer to the algorithm originally proposed in Lipman et al. (2023) using linear probability paths and described in (Tong et al., 2023, Section 4.1) as Flow Matching (FM), and to the algorithm proposed in (Tong et al., 2023, Section 4.2) as Conditional Flow Matching (CFM). There is a small constant parameter $\sigma_{\min}$ in both algorithms, which controls the smoothing of the modeled distribution. We consider the case $\sigma_{\min} = 0$. Then CFM recovers exactly the 1st iteration of Rectified Flow (Liu et al., 2023b). Furthermore, FM, CFM and the 1st iteration of Rectified Flow are all equivalent when performing generative modeling with a standard Gaussian $\pi_0$. We refer to them collectively as Flow Matching models (FMMs), as they only differ in the smoothing method. We also present them all under the Bridge Matching framework. These models can also be interpreted in the context of stochastic interpolants (Albergo and Vanden-Eijnden, 2023; Albergo et al., 2023). Finally, we present recent applications of Bridge Matching and show that some of the objectives in Somnath et al. (2023); Liu et al. (2023a); Delbracio and Milanfar (2023) are identical.

**Flow Matching and Conditional Flow Matching.** In Flow Matching (FM), the objective (Lipman et al., 2023, Equation (21)) is $\mathbb{E}_{\Pi_{t,T}}[\|(X_T - X_t)/(T - t) - v_\theta(t, X_t)\|^2]$, where $\Pi_{t,T}$ is given by $\pi_T(x_T)\,\mathrm{N}(x_t; \tfrac{t}{T} x_T, (1 - \tfrac{t}{T})^2\,\mathrm{Id})$. In Conditional Flow Matching (CFM), $X^{0,T}_t = \tfrac{t}{T} X_T + (1 - \tfrac{t}{T}) X_0$ with $X_0 \sim \mathrm{N}(0, \mathrm{Id})$, and the objective (Tong et al., 2023, Equation (16)) is given by
$$\mathbb{E}_{\Pi_{0,T}}[\|(X_T - X_0)/T - v_\theta(t, X^{0,T}_t)\|^2]. \tag{15}$$
This is the same as (Liu et al., 2023b, Equation (1)). Furthermore, $(X_T - X_0)/T = (1 - \tfrac{t}{T})(X_T - X_0)/(T - t) = (X_T - X^{0,T}_t)/(T - t)$, so the CFM objective is equivalent to
$$\mathbb{E}_{\Pi_{t,T}}[\|(X_T - X^{0,T}_t)/(T - t) - v_\theta(t, X^{0,T}_t)\|^2]. \tag{16}$$
The optimal $v_\theta(t, x_t) = (\mathbb{E}_{\Pi_{T|t}}[X_T \mid X_t = x_t] - x_t)/(T - t)$. In the case of generative modeling, $\pi_0$ is a standard Gaussian distribution and $\Pi_{0,T}$ is given by $\mathrm{N}(x_0; 0, \mathrm{Id})\,\pi_T(x_T)$. Thus, $\Pi_{t,T}$ is also given by $\pi_T(x_T)\,\mathrm{N}(x^{0,T}_t; \tfrac{t}{T} x_T, (1 - \tfrac{t}{T})^2\,\mathrm{Id})$. Therefore, the FM (Lipman et al., 2023) and CFM (Tong et al., 2023) objectives are exactly the same.
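The key algebraic identity behind this equivalence, $(X_T - X_0)/T = (X_T - X^{0,T}_t)/(T - t)$ for $X^{0,T}_t = \tfrac{t}{T} X_T + (1 - \tfrac{t}{T}) X_0$, can also be checked numerically; a small self-contained sketch (times are kept away from $T$ to avoid numerical cancellation):

```python
import torch

T = 1.0
x0, xT = torch.randn(4, 2), torch.randn(4, 2)
t = torch.rand(4, 1) * 0.9 * T
xt = (t / T) * xT + (1 - t / T) * x0     # linear interpolation used by CFM
lhs = (xT - x0) / T                      # CFM target, as in (15)
rhs = (xT - xt) / (T - t)                # equivalent conditional target, as in (16)
assert torch.allclose(lhs, rhs, atol=1e-5)
```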
However, CFM is also applicable when $\pi_0$ is not Gaussian distributed, so CFM is a generalized version of FM. (In the case $\sigma_{\min} > 0$, FM and CFM do differ in how smoothing is performed, and we refer to Tong et al. (2023) for a more detailed analysis.)

**Stochastic Interpolant.** In (Albergo and Vanden-Eijnden, 2023; Albergo et al., 2023), the concept of stochastic interpolant is introduced. In Albergo and Vanden-Eijnden (2023), the interpolation is deterministic (not necessarily linear), of the form $I_t(x_0, x_T) = \alpha(t) x_0 + \beta(t) x_T$, while in Albergo et al. (2023) the interpolation is stochastic, given by $I_t(x_0, x_T) = \alpha(t) x_0 + \beta(t) x_T + \gamma(t) Z$ for $Z \sim \mathrm{N}(0, \mathrm{Id})$. In Albergo and Vanden-Eijnden (2023), an ODE is learned and the associated velocity field $v_\theta$ is obtained by minimizing the following objective (Albergo and Vanden-Eijnden, 2023, Equation (9)):
$$\mathbb{E}_{\Pi_{0,T}}[\|\partial_t I_t(X_0, X_T) - v_\theta(t, X^{0,T}_t)\|^2],$$
where $X^{0,T}_t = I_t(X_0, X_T)$. Hence, if $I_t(x_0, x_T) = (1 - \tfrac{t}{T}) x_0 + \tfrac{t}{T} x_T$, we recover (15).

**Link with Bridge Matching.** When $Q$ is associated with the Brownian motion $(\sigma B_t)_{t \in [0,T]}$ and $\sigma \to 0$ in Bridge Matching, we recover the same objective (5) as the Flow Matching objective (16), since $\nabla \log Q_{T|t}(x_T|x_t) = (x_T - x_t)/(\sigma^2(T - t))$. Bridge Matching can also be applied to general distributions $\pi_0, \pi_T$; i.e. $\pi_0$ does not have to be restricted to a Gaussian. Therefore, Bridge Matching is a generalized version of Flow Matching, see also (Liu et al., 2022b, Equation (10)).

**Inverse problems and interpolation.** Somnath et al. (2023) and Liu et al. (2023a) present Bridge Matching algorithms between aligned data $(X_0, X_T) \sim \Pi_{0,T}$. The objectives (Somnath et al., 2023, Equation (8)) and (Liu et al., 2023a, Equation (12)) are equivalent to the Bridge Matching objective (3). The main difference between Liu et al. (2023a) and Somnath et al. (2023) resides in the choice of $\Pi_{0,T}$. In the case of Somnath et al. (2023), this choice is motivated by access to aligned data, with applications in biology, assuming they are distributed as the true Schrödinger static coupling, i.e. $\Pi_{0,T} = \Pi^{\mathrm{SB}}_{0,T}$. In the case of Liu et al. (2023a), $\Pi_{0,T}$ corresponds to a pairing between clean and corrupted images, e.g. with $\Pi_0 = \pi_0$ the distribution of clean images and $\Pi_T = \pi_T$ the distribution of corrupted images obtained from the clean images using the degradation kernel $\Pi_{T|0}$. Finally, in (Delbracio and Milanfar, 2023, Equation (5)) the authors consider a reconstruction process of the form
$$dX_t = (\mathbb{E}_{\Pi_{0|t}}[X_0 \mid X_t] - X_t)/t\,dt, \qquad X_T \sim \Pi_T, \tag{17}$$
where we have replaced $\mathrm{F}(x_t, t)$ by $\mathbb{E}_{\Pi_{0|t}}[X_0 \mid X_t = x_t]$. This is justified if the $p$-norm in (Delbracio and Milanfar, 2023, Equation (4)) is replaced by $\|\cdot\|^2_2$ (or any Bregman loss function, see Banerjee et al. (2005)). In Delbracio and Milanfar (2023), $\Pi_{0,T}$ corresponds to the joint distribution of clean and corrupted images as in Liu et al. (2023a). Exchanging the roles of $\Pi_0$ and $\Pi_T$, (17) can be rewritten equivalently as $dX_t = (\mathbb{E}_{\Pi_{T|t}}[X_T \mid X_t] - X_t)/(T - t)\,dt$, $X_0 \sim \Pi_0$. We thus obtain the optimal Flow Matching vector field $v_\theta(t, x_t) = (\mathbb{E}_{\Pi_{T|t}}[X_T \mid X_t = x_t] - x_t)/(T - t)$ of (16). Note that Delbracio and Milanfar (2023) also incorporate a stochastic version of their objective (Delbracio and Milanfar, 2023, Equation (7)). It remains an open question whether this objective can be understood as a special instance of the Bridge Matching framework.

### A.2 On DSBM and Existing Works

In this section, we show that the DSBM framework recovers the above existing algorithms for different choices of bridges $Q_{|0,T}$ and couplings $\Pi^0_{0,T}$ in Algorithm 1.
For the independent coupling Π0 0,T = π0 πT and Brownian bridge Q|0,T (4) with diffusion parameter σt = σ, the loss function (10) recovers the Brownian Bridge Matching loss (5). Letting σ 0, we recover Flow Matching (Lipman et al., 2023). In this case, further iterations repeating lines 7-9 in Algorithm 1 (with only forward projections) recover Rectified Flow (Liu et al., 2023b). If the coupling Π0 0,T is given by an estimation of the OT map between π0 and πT , then the first iteration recovers OT-CFM (Tong et al., 2023; Pooladian et al., 2023). Finally, for general bridges Q|0,T , if we are given the optimal Schrödinger Bridge static coupling Π0 0,T = ΠSB 0,T , then the DSBM procedure converges in one iteration and we recover Somnath et al. (2023). A.3 DSBM and Rectified Flow We discuss the differences in more detail between our proposed DSBM method and Rectified Flow. Both Rectified Flow and DSBM are general frameworks for building transport maps between two general distributions π0, πT . However, there are a few important theoretical and practical differences. Firstly, we adopt the SDE approach as opposed to the ODE approach in Rectified Flow. This distinction is crucial in theory, as Proposition 5, which guarantees the uniqueness of the characterization of SB, is valid only when σt > 0. Consequently, Rectified Flow is not guaranteed to converge to the dynamic optimal transport solution (see e.g. counterexample in Liu (2022)). In a following work, Liu (2022) established formal connections between Rectified Flow and OT when restricting the class of vector fields to gradient fields. In DSBM, the connection to OT is obtained by considering its entropy-regularized version. Furthermore, by adopting the SDE approach, we observe significant improvements of sample quality in our experiments when performing transport between two general distributions. This is in line with the theoretical analysis in Albergo et al. (2023). On the other hand, while Bridge Matching also achieves high sample quality using the SDE approach, the transported samples are much more dissimilar to the input data (see e.g. Figures 14, 15, 19). Lastly, Rectified Flow also performs Markovian projections iteratively, but only in the forward direction. Consequently, the bias in the learned marginals Pn T is accumulated and cannot be corrected in later iterations, i.e. the first iteration of RF will achieve the most accurate marginal P1 T . Subsequent iterations can improve the straightness of the flow, but at the cost of sampling accuracy of Pn T . We observe in practice that this becomes particularly problematic if the first iteration of Rectified Flow (which is equivalent to CFM) fails to provide a good transport and learn an accurate P1 T , e.g. in the case of moons-8gaussians (Table 2), Gaussian transport (Figure 3), and MNIST, EMNIST transfer (Figure 5 and Figure 13). As Rectified Flow cannot recover from this issue, we observe the accuracy of Pn T only deteriorates in further iterations as n increases. In our methodology, we leverage Proposition 9 to perform forward and backward Bridge Matching, and we observe that the marginal accuracy is able to improve with iteration. B The Design Space of Brownian Bridges B.1 Relationship to Stochastic Interpolants From stochastic interpolants to Brownian bridges. In this section, we draw a link between our parameterization of bridges and the one used in Albergo et al. (2023). In Albergo et al. 
(2023), a stochastic interpolant is defined as Xt = αtx0 + βtx T + γt Z, (18) where Z N(0, Id). Since their methodology and analysis mainly relies on the probability flow, they work with (18), which is easier to analyse. In our setting, as we deal mostly with diffusions, it is natural to parameterize Brownian bridges as follows d Xt = { αt Xt + βtx T }dt + γtd Bt. (19) The goal of this section is to derive explicit formulas between the parameters αt, βt and γt of (18) and the parameters αt, βt and γt of (19). Consider (Xt)t [0,T ] given by (19). We have that for any t [0, T] Xt = exp[ At]x0 + R t 0 βs exp[As At]dsx T + R t 0 γs exp[As At]d Bs, where At = R t 0 αsds. Therefore, we have that αt = exp[ R t 0 αsds], βt = R t 0 βs exp[ R t s αudu]ds, γ2 t = R t 0 γ2 s exp[ 2 R t s αudu]ds, (20) αt = α t αt , βt = β t + βtαt, γ2 t = ( γ2 t ) + 2 γ2 t αt = 2 γt γ t + 2 γ2 t αt. (21) Using this relationship, we get that the Markovian projection, see Definition 1, is given by d X t = f t (Xt)dt + γtd Bt, f t (xt) = EΠT |t[ αt Xt + βt XT | Xt = xt]. We have that f t (xt) = EΠT |t[ αt Xt + βt XT | Xt = xt] = EΠ0,T |t[ αt( αt X0 + βt XT + γt Z) + βt XT | Xt = xt]. Using (21), we get that f t (xt) = EΠ0,T |t[ α t X0 + β t XT + α t γt αt Z | Xt = xt]. In Albergo et al. (2023), it is shown that log M t (xt) = EΠ0,T |t[Z | Xt = xt]/ γt, where M is the Markovian projection. The probability flow associated with (X t )t [0,T ] is given by d Z t = {f t (Z t ) γ2 t 2 log M t (Z t )}dt = {EΠ0,T |t[ α t X0 + β t XT + ( αt γt + γ2 t 2 γt )Z | Xt = Z t ]}dt = {EΠ0,T |t[ α t X0 + β t XT + γ t Z | Xt = Z t ]}dt. Hence, we recover (Albergo et al., 2023, Theorem 2.6). Non-Markov path measures. A natural question is whether (19) arises as the bridge measure of some Markov measure. For instance, if Q is associated with (x0 + Bt)t [0,T ], then pinning the process at x T at time T, we get that the associated bridge measure Q|0,T is given by d X0,T t = (x T Xt)/(T t)dt + d Bt. Therefore, we recover (19) with αt = βt = 1 T t and γt = 1. Using (20), we get that αt = 1 t T and γ2 t = (T t)t/T. We recover (4), upon noting that Bt t T BT is Gaussian with zero mean and variance (T t)t/T. More generally, we consider a Markov measure Q associated with (Xt)t [0,T ] such that d Xt = at Xtdt + ctd Bt, X0 = x0. We now derive the associated bridge measure Q|0,T : XT = exp[ ΛT + Λt]Xt + R T t cs exp[Λs ΛT ]d Bs, with Λt = R t 0 asds. We have that c2 t xt log QT |t(x T |xt) = (c2 t exp[Λt ΛT ]/ R T t c2 s exp[2(Λs ΛT )]ds)x T (c2 t exp[2(Λt ΛT )]/ R T t c2 s exp[2(Λs ΛT )]ds)xt. Therefore, combining this result and (2), we get that Q|0,T is associated with αt = at + c2 t exp[ 2 R T t asds]/ R T t c2 s exp[ 2 R T s audu]ds, βt = c2 t exp[ R T t asds]/ R T t c2 s exp[ 2 R T s audu]ds, γt = ct. In that case (at, ct)t [0,T ] entirely parameterize (αt, βt, γt)t [0,T ]. Hence, in the Ornstein-Uhlenbeck setting, if Q|0,T is the bridge of a Markov measure, it is fully parameterized by two functions while in the non-Markov setting it is parameterized by three functions. In this paper, we present our framework in the Markovian setting as the Schrödinger Bridge problem is usually defined with respect to Markov reference measures. However, our methodology could be extended in a straightforward fashion to the non-Markovian setting. This would allow for a further exploration of the design space of DSBM. B.2 Linear SDE and Bridge Matching In this section, we study further the diffusion bridge of linear SDEs. 
Arbitrary Markov measures can be chosen to build bridges; however, we want to be able to compute some representations of the bridge in an explicit way. More precisely, denoting (X0,T t )t [0,T ] the diffusion bridge with x0, x T the initial and final condition, we want to have access to the following: integral sampler: we want to have a formula to sample X0,T t for any t [0, T] without having to run a stochastic process forward or backward. forward sampler: we want to have a forward SDE for X0,T t with explicit coefficients, which might depend on x T , running in a forward fashion terminating at x T . backward sampler: we want to have a backward SDE for Y0,T t = X0,T T t with explicit coefficients, which might depend on x0, running in a backward fashion terminating at x0. We focus on linear SDEs of the form d Xt = αβt Xtdt + σβ1/2 t d Bt, which are particularly amenable, where (βt)t [0,T ] is a schedule with β C([0, T], (0, + )). B.2.1 Brownian motion First, we consider the Brownian motion setting and let (Xt)t [0,T ] be associated with Q with d Xt = β1/2 t d Bt. We consider (X0,T t )t [0,T ] conditioned at both ends X0,T 0 = x0 and X0,T T = x T . First, using (Barczy and Kern, 2013, Theorem 3.3), we have that for any t [0, T] X0,T t = R(t,T ) R(0,T )x0 + R(0,t) R(0,T )(x T XT ) + Xt, with R(s, t) = R t s βudu = σ2(Bt Bs), where for any t [0, T], Bt = R t 0 βsds. Therefore, we get that X0,T t = (1 B(t) B(T ))x0 + B(t) B(T )(x T XT ) + Xt. (22) (22) defines the integral sampler. In addition, using (Barczy and Kern, 2013, Theorem 3.2), we have X0,T 0 = x0, d X0,T t = { σ2βt/γ(t, T)X0,T t + σ2βt/γ(t, T)x T }dt + σβ1/2 t d Bt, where γ(s, t) = σ2(B(t) B(s)). Therefore, we get that X0,T 0 = x0, d X0,T t = { βt B(T ) B(t)X0,T t + βt B(T ) B(t)x T }dt + σβ1/2 t d Bt. (23) (23) defines the forward sampler. Finally, we derive the backward sampler by considering the time-reversal of the forward unconditional process (initialized at x0). Following Haussmann and Pardoux (1986), Y0,T 0 = x T , d Y0,T t = σ2βT t log QT t|0(Y0,T t |x0)dt + σβ1/2 T td Bt. (24) In addition, we have that Xt = x0 + σB(t)1/2εt, εt N(0, Id). Hence, we get that for any t [0, T] and x Rd log Qt|0(x|x0) = (x x0)/(σ2B(t)). Combining this result and (24), we get Y0,T 0 = x T , d Y0,T t = { βT t B(T t)Y0,T t + βT t B(T t)x0}dt + σβ1/2 T td Bt. (25) Combining (22), (23) and (25), we get X0,T t = λtx0 + ϕt(x T XT ) + Xt. X0,T 0 = x0, d X0,T t = {κf t X0,T t + Ψf t x T }dt + σβ1/2 t d Bt, Y0,T 0 = x T , d Y0,T t = κb T t Y0,T t + Ψb T tx0}dt + σβ1/2 T td Bt, λt = 1 B(t) B(T ), ϕt = B(t) κf t = βt B(T ) B(t), Ψf t = βt B(T ) B(t), B(t), Ψb t = βt B(t). B.2.2 Ornstein-Uhlenbeck Second, we consider the Ornstein-Uhlenbeck setting and let (Xt)t [0,T ] with d Xt = αβt Xtdt + σβ1/2 t d Bt, with α = 0. We consider (X0,T t )t [0,T ], the stochastic process (Xt)t [0,T ] conditioned at both ends X0,T 0 = x0 and X0,T T = x T . First, using (Barczy and Kern, 2013, Theorem 3.3), we have that for any t [0, T] X0,T t = R(t,T ) R(0,T )x0 + R(0,t) R(0,T )(x T XT ) + Xt, with R(s, t) = exp[α(B(t) B(s))]γ(s, t), with γ(s, t) = R t s σ2β(u) exp[ 2α(B(t) B(u))]du. In particular, we have γ(s, t) = σ2 2α(1 exp[ 2α(B(t) B(s))], R(s, t) = σ2 α sinh(α(B(t) B(s))). (26) Therefore, we get that X0,T t = sinh(α(B(T ) B(t)) sinh(αB(T )) x0 + sinh(αB(t)) sinh(αB(T ))(x T XT ) + Xt. (27) (27) defines the integral sampler. 
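As a quick numerical sanity check of the Brownian-motion samplers (22) and (23) above (the Ornstein-Uhlenbeck case treated next is analogous), the following sketch simulates the forward sampler (23) with Euler-Maruyama and compares its marginal at t = 1/2 against the closed-form bridge marginal implied by (22). It assumes βt = 1 (so B(t) = t), σ = 1 and T = 1; all names are illustrative.

```python
import numpy as np

# Sanity check (illustrative, not the paper's code): with beta_t = 1, sigma = 1, T = 1,
# the forward sampler (23) reduces to
#   dX_t = (x_T - X_t)/(T - t) dt + sigma dB_t,   X_0 = x_0,
# and its marginal at time t should match the integral sampler (22):
#   X_t^{0,T} ~ N((1 - t/T) x_0 + (t/T) x_T, sigma^2 t (T - t)/T).

rng = np.random.default_rng(0)
sigma, T = 1.0, 1.0
x0, xT = -2.0, 3.0
n_paths, n_steps = 20000, 1000
dt = T / n_steps

x = np.full(n_paths, x0)
t = 0.0
for _ in range(n_steps // 2):          # simulate the forward SDE up to t = 0.5
    drift = (xT - x) / (T - t)
    x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
    t += dt

mean_th = (1 - t / T) * x0 + (t / T) * xT
var_th = sigma**2 * t * (T - t) / T
print("empirical:", x.mean(), x.var(), " closed-form:", mean_th, var_th)
```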
In addition, using (Barczy and Kern, 2013, Theorem 3.2) and (26), we have X0,T 0 = x0 and d X0,T t = { αβt Xt σ2βt exp[ 2α(B(T ) B(t))] γ(t,T ) X0,T t + σ2βt exp[ α(B(T ) B(t))] γ(t,T ) x T }dt + σβ1/2 t d Bt = { αβt Xt 2αβt exp[2α(B(T ) B(t))] 1X0,T t + 2αβt exp[ α(B(T ) B(t))] exp[ α(B(T ) B(t))]x T }dt + σβ1/2 t d Bt = { αβt exp[2α(B(T ) B(t))]+1 exp[2α(B(T ) B(t))] 1X0,T t + 2αβt exp[ α(B(T ) B(t))] exp[ α(B(T ) B(t))]x T }dt + σβ1/2 t d Bt = { αβt coth(α(B(T) B(t)))X0,T t + αβt csch(α(B(T) B(t)))x T }dt + σβ1/2 t d Bt. In the formula, coth is the hyperbolic cotangent function defined as coth(x) = 1 tanh(x) = cosh(x) sinh(x) and csch is the hyperbolic cosecant function defined as csch(x) = 1 sinh(x). Combining this result and (26), we get that X0,T 0 = x0, d X0,T t = { αβt coth(α(B(T) B(t)))X0,T t +αβt csch(α(B(T) B(t)))x T }dt+σβ1/2 t d Bt. (28) (28) defines the forward sampler. Finally, we derive the backward sampler by considering the time-reversal of the forward unconditional process (initialized at x0). Following Haussmann and Pardoux (1986), Y0,T 0 = x T , d Y0,T t = {αβT t Y0,T t + σ2βT t log QT t|0(Y0,T t |x0)}dt + σβ1/2 T td Bt. (29) In addition, we have that Xt = exp[ αB(t)]x0 + σ 2α(1 exp[ 2αB(t)])1/2εt, εt N(0, Id). Hence, we get that for any t [0, T] and x Rd log Qt|0(x|x0) = 2α(x exp[ αB(t)]x0)/(σ2(1 exp[ 2αB(t)])). Combining this result and (29), we get Y0,T 0 = x T and d Y0,T t = {αβT t Y0,T t 2αβT t 1 exp[ αB(T t)]Y0,T t + 2αβT t exp[ αB(T t)] 1 exp[ 2αB(T t)] x0}dt + σβ1/2 T td Bt = { αβT t 1+exp[ 2αB(T t)] 1 exp[ 2αB(T t)]Y0,T t + 2αβT t exp[αB(T t)] exp[ αB(T t)]x0}dt + σβ1/2 T td Bt = { αβT t coth(αB(T t))Y0,T t + αβT t csch(αB(T t))x T }dt + σβ1/2 T td Bt. Y0,T 0 = x T , d Y0,T t = { αβT t coth(αB(T t))Y0,T t +αβT t csch(αB(T t))x T }dt+σβ1/2 T td Bt. (30) Combining (27), (28) and (30), we get X0,T t = λtx0 + ϕt(x T XT ) + Xt. X0,T 0 = x0, d X0,T t = {κf t X0,T t + Ψf t x T }dt + σβ1/2 t d Bt, Y0,T 0 = x T , d Y0,T t = {κb T t Y0,T t + Ψb T tx0}dt + σβ1/2 T td Bt, λt = sinh(α(B(T ) B(t)) sinh(αB(T )) , ϕt = sinh(αB(t)) sinh(αB(T )), κf t = αβt coth(α(B(T) B(t))), Ψf t = αβt csch(α(B(T) B(t))), κb t = αβt coth(αB(t)), Ψb t = αβt csch(αB(t)). Using that tanh(x) x and sinh(x) x for x 0, we recover the Brownian motion setting by letting α 0. Note that Albergo et al. (2023) show that given an integral sampler, which does not necessarily comes from a Markovian process, a forward sampler with the same marginals can be defined, although it does not necessarily satisfies the fact that the paths have the same distribution. C.1 Proof of Proposition 2 We refer the reader to Chung and Walsh (2006); Rogers and Williams (2000) for an introduction to Doob h-transform. Our theoretical treatment of the Doob h-transform closely follows Palmowski and Rolski (2002). First, we introduce the infinitesimal generator A given for any f C c ([0, T] Rd, R), t [0, T] and x Rd by Af(t, x) = ft(x), f(t, x) + σ2 t 2 f(t, x) + tf(t, x). (31) The following assumption ensures that the diffusion associated with Q as well as its Markovian projections are well-defined. A1. f, σ and (t, xt) 7 EΠT |t[ log QT |t(XT |Xt) | Xt = xt] are locally Lipschitz and there exist C > 0, ψ C([0, T], R+) such that for any t [0, T] and x0, xt Rd, we have ft(xt) C(1 + xt ), C σt 1/C, EΠT |t[ log QT |t(XT |Xt) | Xt = xt] Cψ(t)(1 + xt ). We consider the following assumption, which will ensure that we can apply Doob h-transform techniques. A2. For any x0 Rd, ΠT |0 is absolutely continuous w.r.t. QT |0. 
For any x0 Rd, let ϕT |0 be given for any x T Rd by ϕT |0(x T |x0) = dΠT |0(x T |x0)/d QT |0(x T |x0) and assume that for any x0 Rd, x T 7 ϕT |0(x T |x0) is bounded. For any x0 Rd, let ϕt|0 given for any xt Rd and t [0, T] by ϕt|0(xt|x0) = R Rd ϕT |0(x T |x0)d QT |t(x T |xt). (32) Finally, we assume that for any x0 Rd, (t, xt) 7 1/ϕt|0(xt|x0) and (t, xt) 7 Aϕt|0(xt|x0) are bounded. This means that for any x0 Rd, (t, xt) 7 ϕt(xt|x0) is a good function in the sense of (Palmowski and Rolski, 2002, Proposition 3.2). Note here that these assumptions could be relaxed on a case-bycase basis. We leave this study for future work. The following lemma is a direct consequence of A2 and (32). It ensures that the h-function ϕt|0 satisfies the backward Kolmogorov equation. Lemma 11. Assume A2. Then, ϕ C1,2([0, T) Rd, R) and Aϕ|0 = 0. Using (31), we have that for any x0 Rd and f C c ([0, T] Rd, R), t [0, T] and xt Rd (A(fϕ|0) f Aϕ|0)(t, xt)/ϕ|0(t, xt) = Af(t, xt) + σ2 t f(t, xt), log ϕt|0(xt|x0) . Finally, we consider the following assumption, which will ensure that the Doob h-transform is well-defined. A 3. For any x0 Rd, there exists C 0 such that for any t [0, T] and xt Rd, log ϕt|0(xt|x0) C(1 + x0 + xt ). We are now ready to state and prove Proposition 2. Note that the Markovian projection is defined in Definition 1. Finally, we define M the space of path measures such that P M if P is associated with d Xt = {ft(Xt) + vt(Xt)}dt + σtd Bt, with σ, v locally Lipschitz. This restriction of Markov measures allows us to apply the entropic version of the Girsanov theorem (Léonard, 2012). It has no impact on our methodology. Proposition. Assume A1, A2, A3. Let M = proj M(Π). Then, M = argmin M{KL(Π|M) : M M}, KL(Π|M ) = 1 2 R T 0 EΠ0,t[ σ2 t EΠT |0,t[ log QT |t(XT |Xt) | X0, Xt] v t 2]/σ2 t dt. In addition, we have that for any t [0, T], M t = Πt. In particular, M T = ΠT . Proof. First, we recall that Π is given by Π = Qϕ0,T with ϕ0,T = dΠ0,T d Q0,T . In particular, we have Π|0 = Q|0ϕT |0, where ϕT |0 = dΠT |0 d QT |0 . Therefore, using Lemma 11, (Palmowski and Rolski, 2002, Lemma 3.1, Lemma 4.1), the remark following (Palmowski and Rolski, 2002, Lemma 4.1), A1, A2 and A3, we get that Π|0 is Markov and associated with the distribution of (Xt)t [0,T ] given for any t [0, T] by Xt = R t 0{fs(Xs) + σ2 s log ϕs|0(Xs|X0)}ds + R t 0 σsd Bs, (33) where for any t [0, T], x0, xt Rd we recall that ϕt|0(xt|x0) = R Rd ϕT |0(x T |x0)d QT |t(x T |xt). (34) First, we have that for any t [0, T], xt, x0 Rd Qt|0(xt|x0)ϕt|0(xt|x0) = R Rd Qt|0,T (xt|x T , x0)dΠT |0(x T |x0) = Πt|0(xt|x0). Therefore, we get that for any t [0, T] and xt, x0 Rd ϕt|0(xt|x0) = dΠt|0(xt|x0) d Qt|0(xt|x0). (35) In addition, we have the following identity for any t [0, T], x0, xt, x T Rd QT |0(x T |x0)Qt|0,T (xt|x0, x T ) = Qt|0(xt|x0)QT |t(x T |xt). Using (34), this result and (35), we get that for any t [0, T] and x0, xt Rd log ϕt|0(xt|x0) = R Rd ΠT |0(x T |x0)QT |t(x T |xt) QT |0(x T |x0)ϕt|0(xt|x0) log QT |t(x T |xt)dx T Rd ΠT |0(x T |x0)Qt|0,T (xt|x0,x T ) Qt|0(xt|x0)ϕt|0(xt|x0) log QT |t(x T |xt)dx T Rd Πt,T |0(xt,x T |x0) Πt|0(xt|x0) log QT |t(x T |xt)dx T Rd log QT |t(x T |xt)dΠT |t,0(x T |xt, x0). Hence, combining this result and (33), we get Xt = R t 0{fs(Xs) + σ2 s EΠT |t,0[ log QT |t(XT |Xt) | Xt, X0]}ds + R t 0 σsd Bs. Let M be Markov defined by d Xt = {ft(Xt) + vt(Xt)}dt + σtd Bt, such that KL(Π|M) < + with σ, v locally Lipschitz. 
Using (Léonard, 2012, Theorem 2.3), we get that KL(Π|M) = 1 2 R T 0 EΠ0,t[ σ2 t EΠT |t,0[ log QT |t(XT |Xt) | Xt, X0] vt(Xt) 2]/σ2 t dt. In addition, we have that for any t [0, T], EΠ0,t[ σ2 t EΠT |t,0[ log QT |t(XT |Xt) | Xt, X0] vt(Xt) 2] EΠ0,t[ σ2 t EΠT |t,0[ log QT |t(XT |Xt) | Xt, X0] v t (Xt) 2], where v t (xt) = σ2 t EΠT |t[ log QT |t(XT |Xt) | Xt = xt] which concludes the first part of the proof. For the second part of the proof, we show that for any t [0, T], we have M t = Πt. First, we have that M t and Πt satisfy the same Fokker-Planck equation, see (Peluchetti, 2021, Theorem 2) for instance. We conclude by uniqueness of the solutions of the Fokker-Planck equation under A1 and A3, see Bogachev et al. (2021) for instance. C.2 Proof of Proposition 4 Proof. By the additive property of KL divergence (Léonard, 2014a), KL(P|Π) = KL(P0,T |Π0,T ) + EP0,T [KL(P|0,T |Π|0,T )]. Restricting Π|0,T = Q|0,T directly gives that the KL minimizer is Π with Π 0,T = P0,T , and thus Π = P0,T Q|0,T which we recall is short for Π ( ) = R Rd Rd Q|0,T ( |x0, x T )P0,T (dx0, dx T ). C.3 Proof of Proposition 5 This result is a direct consequence of (Léonard, 2014b, Theorem 2.12), which we recall here for completeness. Proposition. Assume that Q M, that Q0 = QT = Q, that for any x0, x T Rd, d Q0,T /d( Q Q)(x0, x T ) exp[ A(x0) A(x T )] with A 0 measurable, R Rd Rd exp[ B(x0) B(x T )]d Q(x0, x T ) < + with B 0 measurable. Assume that there exists t0 (0, T) and X measurable such that Qt0(X) > 0 and for all x X, Q0,T Q0,T |t0( |Xt0 = x). In addition, assume that KL(π0| Q) < + , KL(πT | Q) < + , R Rd(A + B)(x0)dπ0(x0) < + , R Rd(A + B)(x T )dπT (x T ) < + . Then there exists a unique Schrödinger Bridge PSB. In addition let P be a Markov measure in the reciprocal class of Q such that P0 = π0 and PT = πT . Assume that KL(P|Q) < + . Then P is the unique Schrödinger Bridge PSB. Proof. The first part of the proof is a consequence of (Léonard, 2014b, Theorem 2.12(a)). The second part is a consequence of (Léonard, 2014b, Theorem 2.12(b)) and (Léonard et al., 2014, Theorem 2.14). C.4 Proof of Lemma 6 Lemma. Let M M and Π R(Q) and assume A1, A2, A3. If KL(Π|M) < + and KL(proj M(Π)|M) < + we have KL(Π|M) = KL(Π|proj M(Π)) + KL(proj M(Π)|M). (36) For any P P(C), if KL(P|Π) < + , we have KL(P|Π) = KL(P|proj R(Q)(P)) + KL(proj R(Q)(P)|Π). (37) Proof. We start with the proof of (36). Similarly to Proposition 2, where we have M M to ensure that we can apply (Léonard, 2012, Theorem 2.3), we get KL(Π|M) = 1 2 R T 0 EΠ0,t[ vt(Xt) σ2 t EΠT |0,t[ log QT |t(XT |Xt) | X0, Xt] 2]/σ2 t dt. In addition, we have KL(proj M(Π)|M) = 1 2 R T 0 EΠt[ vt(Xt) σ2 t EΠT |t[ log QT |t(XT |Xt) | Xt] 2]/σ2 t dt. Finally, using Proposition 2, we have that KL(Π|proj M(Π)) 2 R T 0 EΠ0,t[ σ2 t EΠT |0,t[ log Q(XT |Xt) | X0, Xt] σ2 t EΠT |t[ log Q(XT |Xt) | Xt] 2]/σ2 t dt 2 R T 0 (EΠ0,t[ EΠT |0,t[ log Q(XT |Xt) | X0, Xt] 2] EΠt[ EΠT |t[ log Q(XT |Xt) | Xt] 2])σ2 t dt. 
Using this result, we have 2KL(Π|proj M(Π)) + 2KL(proj M(Π)|M) = R T 0 (EΠ0,t[ EΠT |0,t[ log Q(XT |Xt) | X0, Xt] 2] EΠt[ EΠT |t[ log Q(XT |Xt) | Xt] 2])σ2 t dt + R T 0 EΠt[ vt(Xt)/σ2 t EΠT |t[ log QT |t(XT |Xt) | Xt] 2]σ2 t dt = R T 0 (EΠ0,t[ EΠT |0,t[ log Q(XT |Xt) | X0, Xt] 2] EΠt[ EΠT |t[ log Q(XT |Xt) | Xt] 2])σ2 t dt + R T 0 (EΠt[ vt(Xt)/σ2 t 2] + EΠt[ EΠT |t[ log QT |t(XT |Xt) | Xt] 2])σ2 t dt 2 R T 0 EΠt[ vt(Xt)/σ2 t , EΠT |t[ log QT |t(XT |Xt) | Xt] ]σ2 t dt = R T 0 EΠ0,t[ EΠT |0,t[ log Q(XT |Xt) | X0, Xt] 2]σ2 t dt + R T 0 EΠt[ vt(Xt)/σ2 t 2]σ2 t dt 2 R T 0 EΠt[ vt(Xt)/σ2 t , EΠT |t[ log QT |t(XT |Xt) | Xt] ]σ2 t dt = R T 0 EΠ0,t[ EΠT |0,t[ log Q(XT |Xt) | X0, Xt] 2]σ2 t dt + R T 0 EΠt[ vt(Xt)/σ2 t 2]σ2 t dt 2 R T 0 EΠ0,t[ vt(Xt)/σ2 t , EΠT |0,t[ log QT |t(XT |Xt) | X0, Xt] ]σ2 t dt = 2KL(Π|M), which concludes the first part of the proof. For (37), define Π = proj R(Q)(P) = P0,T Q|0,T . Using (Csiszár, 1975, Equation 2.6), we have KL(P|Π) = KL(P|Π ) + R dΠ (ω))d P(ω) = KL(P|Π ) + R Rd Rd log( dΠ 0,T dΠ0,T (x0, x1))d P0,T (x0, x1) = KL(P|Π ) + R Rd Rd log( dΠ 0,T dΠ0,T (x0, x1))dΠ 0,T (x0, x1) = KL(P|Π ) + KL(Π |Π), which concludes the proof. C.5 Proof of Proposition 7 Proposition. Assume that the conditions of Proposition 5 and Lemma 6 apply for Pn for every n N and for the Schrödinger Bridge PSB, we have KL(Pn+1|PSB) KL(Pn|PSB) < , and limn + KL(Pn|Pn+1) = 0. Proof. We follow the technique of Rüschendorf (1995) but for the reverse Kullback Leibler divergence. Applying Lemma 6, we get for any N N KL(P0|PSB) = KL(P0|P1) + KL(P1|PSB) = PN i=0 KL(Pi|Pi+1) + KL(PN+1|PSB), which concludes the proof. C.6 Proof of Theorem 8 Theorem. Assume that the conditions of Proposition 5 and Lemma 6 apply for Pn for every n N and for the Schrödinger Bridge PSB, the IMF sequence (Pn)n N admits a unique fixed point P = PSB, and limn + KL(Pn|P ) = 0. Proof. By Proposition 7, KL(Pn|PSB) KL(P0|PSB) < for all n N. Now, using the coercivity of KL( |PSB) (this is where our analysis differs from the one of the IPF), we have that the IMF sequence (Pn)n N and its subsequences (Mn+1)n N and (Πn)n N are subsets of {P P(C) : KL(P|PSB) KL(P0|PSB)} which is (relatively) compact. Thus, (Mn+1)n N contains a convergent subsequence Mnj M as j , and (Πnj)j N contains a further convergent subsequence Πnjk Π weakly as k . As the Markov and the reciprocal classes are closed under weak convergence, M M and Π R(Q). Now, by the lower semicontinuity of KL divergence in the weak topology (van Erven and Harremoes, 2014, Theorem 19), 0 KL(M |Π ) lim infk KL(Mnjk |Πnjk ) = 0. Hence, M = Π which we denote as P M R(Q). Also, P satisfies P 0 = π0 and P T = πT as is satisfied by all Pn. By Proposition 5, P is the unique Schrödinger Bridge PSB. Finally, limn + KL(Pn|P ) = 0 follows using limk KL(Mnjk |P ) = 0 and the monotonicity of KL(Pn|PSB) by Proposition 7. C.7 Proof of Proposition 9 Proof. The proof is similar to the one of Proposition 2. In particular, the time-reversal of Q|0,T ( |x0, x T ) is associated with d Y0,T t = { f T t(Y0,T t ) + σ2 T t log QT t|0(Y0,T t |x0)}dt + σT td Bt, Y0,T 0 = x T . (38) One can view both (11) (12) as SDEs with drift defined as the conditional expectation of the drift of (2) (38) under Π0,T |t in the forward and backward directions respectively. C.8 Proof of Proposition 10 Proof. We proceed by induction. Firstly, for Π0 0,T = Q0,T , Π0 = P0 = Q at initialization. We can also define M0 = Q, such that M0 = P0 and Πn = proj R(Q)(Mn) for all n N. 
By (De Bortoli et al., 2021, Section 3.5), the optimal DSB sequence Pn is Markov and Pn = Pn 0,T Q|0,T , where Pn 0,T is the IPF sequence of the static SB problem. In other words, Pn M R(Q). Suppose M2n+1 = P2n+1. By definition, M2n+2 0 = P2n+2 0 = π0, i.e. both forward processes are initialized at π0. In DSB, by De Bortoli et al. (2021), P2n+2 is defined as the time-reversal of P2n+1, such that P2n+2 |0 = P2n+1 |0 . Hence, P2n+2 = π0 P2n+1 |0 . In DSBM, we first perform reciprocal projection Π2n+1 = proj R(Q)(M2n+1) = M2n+1 0,T Q|0,T . Since M2n+1 = P2n+1 R(Q), however, we have that Π2n+1 = M2n+1. Furthermore, since M2n+1 = P2n+1 M, proj M(Π2n+1) = proj M(M2n+1) = M2n+1. Thus, M2n+2 given by (9) is such that M2n+2 0 = π0 and M2n+2 |0 = proj M(Π2n+1)|0 = M2n+1 |0 . We conclude that M2n+2 = π0M2n+1 |0 = P2n+2. Similar arguments holds for the the reverse projection (13). Therefore, Mn = Pn for all n N. C.9 The set of Markov measures is not convex The result of Lemma 6 should be compared with the information geometry result of (Csiszár, 1975, Theorem 2.2), which states that if C is a convex set and P C, then under mild conditions, KL(P|Q) = KL(P|proj C(Q)) + KL(proj C(Q)|Q), where proj C(Q) = argmin P{KL(P|Q) : P C} is the projection of Q on C. Note that, contrary to Lemma 6, (Csiszár, 1975, Theorem 2.2) is given for the forward Kullback Leibler divergence whereas Lemma 6 is given for the reverse KL divergence. In addition, (Csiszár, 1975, Theorem 2.2) requires the projection set C to be convex which is not satisfied for the space of Markov measures M. We give a simple counter-example proving that the set of Markov measures is not convex. Let p1(x0, x1, x2) = p1(x0)p1(x1|x0)p1(x2|x1) and p2(x0, x1, x2) = p2(x0)p2(x1|x0)p2(x2|x1) on {0, 1}3 such that p1(x0 = 1) = α0, p1(x1 = 1|x0) = α1, p1(x2 = 1|x1) = α2. Additionally, we set p2(x0 = 1) = β0, p2(x1 = 1|x0) = β1, p2(x2 = 1|x1) = β2. Finally, we set q = (1/2)p1 + (1/2)p2. Consider q(x2 = 1|x1 = 1, x0 = 1) = q(x2 = 1, x1 = 1, x0 = 1)/q(x1 = 1, x0 = 1) and q(x2 = 1|x1 = 1) = q(x2 = 1, x1 = 1)/q(x1 = 1). Let = 4[q(x2 = 1, x1 = 1, x0 = 1)q(x1 = 1) q(x2 = 1, x1 = 1)q(x1 = 1, x0 = 1)] = (α0α1α2 + β0β1β2)(α1 + β1) (α1α2 + β1β2)(α0α1 + β0β1) = α0α1β1α2 + β0α1β1β2 β0α1β1α2 α0α1β1β2 = α1β1β2(β0 α0) + α1β1α2(α0 β0) = α1β1(β0 α0)(β2 α2). q is Markov if and only if = 0. Therefore q is not Markov as soon as α0 = β0 and α2 = β2. C.10 Impact of the resolution on the entropic regularization In this section, we show how the resolution of the input affects the entropic regularization. First, we consider the Schrödinger bridge problem between π0 and π1 (T = 1) with Q associated with (σBt)t [0,1]. In that case the static Schrödinger Bridge is given by P 0,1 such that P 0,1 = argmin{ R Rd Rd x0 x1 2d P(x0, x1) εKL(P|π0 π1) : P0 = π0, P1 = π1} where ε = 2σ2. We now describe the solution of a higher dimensional problem given by the upsampling of the marginals. Let f N with f 1 and down : Rfd Rd such that for any k {1, . . . , d}, and x Rfd, down(x) = x with x Rd and xk = xkf. We also denote up : Rd Rfd such that for any k {1, . . . , d}, and x Rd, up( x) = x with x Rfd and xk = x(k 1)f+j, for j {1, . . . , d}. Note that down up = Id. We denote πup 0 = up#π0 and πup 1 = up#π1. We extend up and down to Rd Rd and Rfd Rfd by letting for any x, y Rd and x, y Rfd up( x, y) = (up( x), up( y)), down(x, y) = (down(x), down(y)). 
First, note that since up down = Id on up(Rd) we get that for any P supported on up(Rd) up(Rd), using (Kullback, 1997, Theorem 4.1) KL(P|πup 0 πup 1 ) = KL(down#P|π0 π1). Second let P such that P0 = πup 0 and P1 = πup 1 . Then, P is supported on up(Rd) up(Rd). Therefore, for any P such that P0 = πup 0 and P1 = πup 1 KL(P|πup 0 πup 1 ) = KL(down#P|π0 π1). Finally, for any P such that P0 = πup 0 and P1 = πup 1 we have R Rfd Rfd x0 x1 2d P(x0, x1) = f 2 R Rd Rd x0 x1 2d(down#P)( x0, x1). Therefore, for any P such that P0 = πup 0 and P1 = πup 1 we have R Rfd Rfd x0 x1 2d P(x0, x1) εKL(P|πup 0 πup 1 ) Rfd Rfd x0 x1 2d(down#P)( x0, x1) (ε/f 2)KL(down#P|π0 π1)). Therefore, we have the following result. Proposition 12. Let ε > 0. P is the solution of the static Schrödinger bridge with marginals πup 0 , πup 1 and regularization ε if and only if down#P is the solution of the static Schrödinger bridge with marginals π0, π1 and regularization ε/f 2. This means in particular that the Schrödinger Bridge is not invariant via upsampling. In Appendix I.4, we confirm these results visually upon noting that for the same σ, running DSBM at resolution 64 64 and 128 128 gives different results, even after downsampling of the 128 128 results. D Convergence of IMF in the Gaussian setting In this section, we study the IMF in an one-dimensional Gaussian case. We consider T = 1, Π0 = Π1 = N(0, (1/2β2)) and Q associated with (σBt)t [0,1] where σ > 0. In what follows, we let Σ0 = (1/2β2) 1 c2 where c [0, 1]. We also denote σ2 = 2σ2β2. We start with the following result which gives an explicit expression of some marginals of the reciprocal projection. Lemma 13. Let Π0 0,1 = N(0, Σ0) with Σ0 given by (39). Let Π0 = Π0 0,1Q|0,1. For any t [0, 1], we have that Π0 0,1,t = N(0, Σ) with Σ = (1/2β2) 1 c2 at k c2 1 a1 t k at k a1 t k bt k where we have at k = 1 t + tc2, bt k = 1 + t(1 t)(2(c2 1) + σ2). Proof. Let t [0, 1] and X0,1 t Π0 t|0,1 = Qt|0,1. Using (Barczy and Kern, 2013, Theorem 3.3), we get that X0,1 t = (1 t)X0 + t X1 + σ(t(1 t))1/2Z, with Z N(0, Id) independent from (X0, X1). Hence, we get that E[X0,1 t ] = 0, and Cov(X0, X0,1 t ) = E[X0,1 t X0] = (1 t)/(2β2) + tc2/(2β2) = at k/(2β2). Similarly, we get that Cov(X1, X0,1 t ) = a1 t k /(2β2). Finally, we get that Var(X0,1 t ) = E[((1 t)X0 + t X1)2] + σ2t(1 t) = (1 t)2/(2β2) + 2t(1 t)c2/(2β2) + t2/(2β2) + σ2t(1 t)/(2β2) = (1 2t + 2t2 + 2t(1 t)c2 + σ2t(1 t))/(2β2) = (1 + t(1 t)(2(c2 1) + σ2))/(2β2), which concludes the proof. Leveraging Lemma 13, we can give an explicit expression of the drift term in the Markovian projection. Lemma 14. Let Π0 0,1 = N(0, Σ0) with Σ0 given by (39). Let Π0 = Π0 0,1Q|0,1. For any t [0, 1] and xt Rd, we have that σ2EΠ0 1|t[ log Q1|t|Xt = xt] = (1 2t)(c2 1) σ2t 1+t(1 t)(2(c2 1)+ σ2)xt. Hence, the Markovian projection of Π0, denoted M1 is associated with (Xt)t [0,1] with X0 Π0, d Xt = (1 2t)(c2 1) σ2t 1+t(1 t)(2(c2 1)+ σ2)Xt + σd Bt. (40) Proof. Using (Barczy and Kern, 2013, Theorem 3.2), we get that Q|0,1 is associated with d X0,1 t = σ2 log Q1|t(x1|Xt)dt + σd Bt, X0,1 0 = x0, where for any t [0, 1), we have and xt Rd, σ2 log Q1|t(x1|xt) = (x1 xt)/(1 t). Therefore, we get that M1 is associated with (Xt)t [0,1] such that d Xt = EΠ0 1|t[σ2 log Q1|t(X1|Xt)|Xt]dt + σd Bt, X0 = x0. Therefore, we get that d Xt = (EΠ0 1|t[X1|Xt] xt)/(1 t)dt + σd Bt, X0 = x0. Using Lemma 13, we have that for any t [0, 1] and xt Rd, EΠ0 1|t[X1|Xt = xt] = a1 t k /bt kxt. 
In addition, we have for any t [0, 1] a1 t k /bt k 1 = (t + (1 t)c2 1 t(1 t)(2(c2 1) + σ2))/(1 + t(1 t)(2(c2 1) + σ2)) = ((1 t)(c2 1) t(1 t)(2(c2 1) + σ2))/(1 + t(1 t)(2(c2 1) + σ2)) = (1 t)((c2 1) t(2(c2 1) + σ2))/(1 + t(1 t)(2(c2 1) + σ2)) = (1 t)((1 2t)(c2 1) t σ2))/(1 + t(1 t)(2(c2 1) + σ2)), which concludes the proof. Note that since σ > 0 and c2 [0, 1], we get that for any t [0, 1], 1 + t(1 t)(2(c2 1) + σ2) > 0 and therefore the drift is well-defined, smooth and sublinear. In particular, (40) admits a unique strong solution. In what follows, we denote G : [0, 1] [0, 1] R given for any t [0, 1] by G(t, c2) = R t 0 (1 2s)(c2 1) σ2s 1+s(1 s)(2(c2 1)+ σ2)ds. We have the following useful lemma. Lemma 15. Let c [0, 1], σ > 0 and p = 2c2 + σ2. We distinguish three cases: (a) If p < 2, then G(1, c2) = σ2(4 p2) 1/2 tan 1((4 p2)1/2/p). (b) If p = 2, then G(1, c2) = σ2/2. (c) If p > 2, then G(1, c2) = σ2(p2 4) 1/2 tanh 1((p2 4)1/2/p). This lemma is useful combined with the following proposition, which gives the analytical update formula of Gaussian IMF covariances as a function of G(1, c2). Proposition 16. Let Π0 0,1 = N(0, Σ0) with Σ0 given by (39). Then Π1 0,1 = N(0, Σ1) with Σ1 = (1/2β2) 1 c2 1 c2 1 1 , c2 1 = f(c2 0), with f : [0, 1] [0, 1] given for any c [0, 1] by f(c2) = exp[G(1, c2)]. Proof. We have that X1 = exp[G(1, c2)]X0 + M1, where M1 is a Gaussian random variable with zero mean independent from X0. Therefore, we get that Cov(X0, X1) = exp[G(1, c2)]/(2β2). In addition, we have that E[X2 1] = E[X2 0] = 1/2β2, which concludes the proof. Iterating the procedure in Proposition 16, we obtain a sequence of IMF covariances (c2 n)n N satisfying c2 n+1 = f(c2 n). Finally, we show that this iterative procedure recovers the true SB coupling ΠSB 0,1 = N(0, ΣSB) as a fixed point. The formula of ΣSB is given e.g. in (Bunne et al., 2023, Equation (2)) which we use below. Proposition 17. Let ΠSB 0,1 = N(0, ΣSB) be the true static SB solution, with ΣSB = (1/2β2) 1 c2 SB c2 SB 1 , c2 SB = 1 4 + σ4 σ2). Then ΠSB 0,1 is a fixed point of the iterative procedure in Proposition 16, i.e. f(c2 SB) = c2 SB. Proof. By straightforward calculations, p SB = 2c2 SB + σ2 = 4 + σ4. If σ = 0, p SB = 2 and thus G(1, c2 SB) = σ2/2 = 0. Hence f(c2 SB) = exp(G(1, c2 SB)) = 1 = c2 SB. If σ > 0, we have p SB > 2 and G(1, c2 SB) = tanh 1( σ2/ 4 + σ4). Hence, f(c2 SB) = exp(G(1, c2 SB)) = 1 4 + σ4 σ2) = c2 SB. We thus correctly recover the true static SB solution ΠSB 0,1 as fixed point of IMF. We visualize the convergence of this fixed point procedure for a variety of parameter settings in Figure 12. The convergence appears to be very fast in only two or three iterations. 0.0 0.2 0.4 0.6 0.8 1.0 c2 n Convergence of IMF Update function Identity True SB solution Fixed point iterations (a) σ = 0.1, c2 0 = 0 0.0 0.2 0.4 0.6 0.8 1.0 c2 n Convergence of IMF (b) σ = 1, c2 0 = 0 0.0 0.2 0.4 0.6 0.8 1.0 c2 n Convergence of IMF (c) σ = 1, c2 0 = 1 0.0 0.2 0.4 0.6 0.8 1.0 c2 n Convergence of IMF (d) σ = 5, c2 0 = 1 Figure 12: Convergence of IMF in the analytic case given by Proposition 16. E Discrete-Time Markovian Projection We derive in this section a discrete-time version of the Markovian projection and show that, in some limiting case, we recover the continuous-time projection. In the discrete case, we let π(x0:N) = π(x0, x N) QN 1 k=0 qk+1|0,k,N(xk+1|x0, xk, x N) = π(x0, x N) QN 2 k=0 qk+1|k,N(xk+1|xk, x N). We consider a Markovian measure p given by p(x0:N) = p(x0) QN 1 k=0 pk+1|k(xk+1|xk). Now let us compute KL(π|p). 
We have KL(π(x0:N)|p(x0:N)) = PN 2 k=0 R Rd Rd KL(qk+1|k,N|pk+1|k)πk,N(xk, x N)dxkdx N + KL(π0|p0) + R Rd KL(π(x N|x0)|p(x N|x N 1))π(x0, x N 1)dx0dx N 1. In what follows, we denote L0 = KL(π0|p0), LN = R Rd KL(π(x N|x0)|p(x N|x N 1))π(x0, x N 1)dx0dx N 1, Rd Rd KL(qk+1|k,N|pk+1|k)πk,N(xk, x N)dxkdx N, We have the following proposition. Proposition 18. The minimizer pk+1|k of Lk+1 is given by pk+1|k(xk+1|xk) = R Rd qk+1|k,N(xk+1|xk, x N)πN|k(x N|xk)dx N. (41) If p0 = q0, then for any k {0, . . . , N 1}, pk = πk. In addition, assume that pk+1|k(xk+1|xk) = exp[ xk+1 xk γf(xk) 2/(2γ)]/(2πγ) d/2 and qk+1|k,N(xk+1|xk, x N) = exp[ xk+1 xk γf(xk, x N) 2/(2γ)]/(2πγ)d/2. Finally, assume that xk+1 xk γ1/2. Then, we have that Rd f(xk, x N)π(x N|xk)dx N + o(γ1/2). (42) Proof. The proofs of (41) and (42) are straightforward and left to the reader. We now prove that if p0 = π0, then for any k {1, . . . , N}, pk = πk. First, we have that for any k {0, . . . , N 1}, π(xk, xk+1, x N) = π(xk, x N)q(xk+1|xk, x N). Assume now that pk = πk, then we have pk+1(xk+1) = R Rd pk(xk)pk+1|k(xk+1|xk)dxk = R Rd Rd pk(xk)qk+1|k,N(xk+1|xk, x N)πN|k(x N|xk)dxkdx N = R Rd Rd πk(xk)qk+1|k,N(xk+1|xk, x N)πN|k(x N|xk)dxkdx N = R Rd Rd πk,k+1,N(xk, xk+1, x N)dxkdx N = πk+1(xk+1), which concludes the proof. In particular, in the previous proposition, if f(xk, x N) = log q(x N|xk), i.e. we have a discretization of the bridge then f(xk) = R Rd log q(x N|x K)π(x N|xk)dx N, which recovers the Markovian projection in continuous-time. F Comparing DSBM-IPF and DSB We analyze further the differences between DSBM-IPF proposed here and DSB proposed in De Bortoli et al. (2021) and related algorithms in Vargas et al. (2021); Chen et al. (2022). All algorithms solve the SB problem using the IPF iterates. However, DSB-type algorithms solve for the IPF iterates using time-reversals, whereas DSBM solves for the iterates using Markovian and reciprocal projections. A comparison between these two methodologies is made in Section 4. We investigate here further benefit (iii) of DSBM in Section 4, i.e. the benefit of explicitly projecting onto the reciprocal class of Q. Intuitively speaking, we directly incorporate the reference measure Q in the training procedure as our inductive bias. More formally, suppose we have at the current IPF iteration M2n (Markov diffusion in the forward direction) and want to learn M2n+1 (Markov diffusion in the backward direction). Due to training error and the forgetting issue (Fernandes et al., 2021), however, M2n no longer has the correct bridge Q|0,T . Now suppose we first perform IMF for M2n and learn M2n, in the forward direction. That is to say, we repeat alternative reciprocal and Markovian projections and obtain a sequence (M2n,m)m N in the forward direction converging to M2n, . Then M2n, now has the correct bridge Q|0,T by Proposition 7, since M2n, is the SB between M2n 0 and M2n T . Theoretically, M2n t = M2n, t for t = 0, T, but due to training error accumulating it may be that M2n T = M2n, T . However, M2n 0 = M2n, 0 , since M2n, is in the forward direction and starting from samples from π0. As a result, we can obtain a Markov forward diffusion M2n, , which has M2n, 0 = π0 and the correct bridge Q|0,T . These are the same set of properties that the reference measure Q has. As a result, replacing Q with M2n, in (6) results in the same SB solution. 
Consequently, now continuing the IPF iterations from M2n, 0 , it is as if we restart IPF afresh using a modified SB problem PSB = argmin{KL(P|M2n, ) : P0 = π0, PT = πT }. If M2n, is closer than Q to M in the sense of KL divergence, then we obtain a better initialization of the IPF procedure. As proposed in Algorithm 1, DSBM performs the Markovian and reciprocal projection only once before switching between the forward and backward directions. However, it is still beneficial compared to DSB with less bias accumulation in the bridge. Algorithmically, one main difference between DSB-type algorithms and DSBM due to the above distinction occurs in the trajectory caching step. In DSB, a fixed discretization of SDE needs to be chosen, and all intermediate samples from the discretized Euler-Maruyama simulation of the SDE need to be saved. Furthermore, a second set of drift evaluation needs to be performed for all datapoints in the trajectory (De Bortoli et al., 2021, Equations (12), (13)). The IPML algorithm in Vargas et al. (2021) is also similar to DSB, but Gaussian processes are used to fit the drifts of forward and backward SDEs instead of neural networks. In Chen et al. (2022), the implicit score matching loss is used instead, but all intermediate points in the SDE trajectory also need to be saved. On the contrary, DSBM does not require intermediate samples during trajectory caching and only retains the joint samples at times 0, T. Then, the intermediate trajectories are reconstructed using the reference bridge Q|0,T . G Joint Training of Forward and Backward Processes We recall below the DSBM algorithm given in Algorithm 1. Algorithm 2 Diffusion Schrödinger Bridge Matching 1: Input: Joint distribution Π0 0,T , tractable bridge Q|0,T , number of outer iterations N N. 2: Let Π0 = Π0 0,T Q|0,T . 3: for n {0, . . . , N 1} do 4: Learn vφ using (14) with Π = Π2n. 5: Let M2n+1 be given by (13). 6: Let Π2n+1 = M2n+1 0,T Q|0,T . 7: Learn vθ using (10) with Π = Π2n+1. 8: Let M2n+2 be given by (9). 9: Let Π2n+2 = M2n+2 0,T Q|0,T . 10: end for 11: Output: vθ , vφ Our main observation comes from Proposition 9. In particular, under mild assumptions, we have that the Markovian projection M = proj M(Π) is associated with d Xt = {ft(Xt) + σ2 t EΠT |t[ log QT |t(XT |Xt) | Xt]}dt + σtd Bt, X0 π0, d Yt = { f T t(Yt) + σ2 T t EΠ0|T t[ log QT t|0(Yt|YT ) | Yt]}dt + σT td Bt, Y0 πT . Considering the following losses, θ = argminθ{ R T 0 EΠt,T [ σ2 t log QT |t(XT |Xt) vθ(t, Xt) 2]/σ2 t dt : θ Θ}, (43) φ = argminφ{ R T 0 EΠt,0[ σ2 t log Qt|0(Xt|X0) vφ(t, Xt) 2]/σ2 t dt : φ Φ}. (44) If the families of functions {vθ : θ Θ} and {vφ : θ Φ} are rich enough, we have for any t [0, T] and xt Rd, vθ (t, xt) = σ2 t EΠT |t[ log QT |t(XT |Xt) | Xt = xt] and vφ (t, xt) = σ2 t EΠ0|t[ log Qt|0(Xt|X0) | Xt = xt]. In practice, this means that the Markovian projection can be computed in a forward or backward fashion equivalently. Therefore, given a coupling Π = Π2n, we can update both vθ and vφ. This means that we train the forward and backward model jointly. We then consider M2n+1 b associated with (13) and M2n+1 f associated with (9). Note that if the families of functions {vθ : θ Θ} and {vφ : θ Φ} are rich enough then M2n+1 f = M2n+1 b . Mixture from forward and backward. Once we have obtained both the forward update and the backward update, our next task is to define the new mixture of bridge Π2n+1. In Algorithm 1, since we train only the backward model M2n+1 = M2n+1 b , we define Π2n+1 = M2n+1 0,T Q|0,T . 
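To make the alternating scheme of Algorithm 2 concrete, below is a minimal, self-contained PyTorch sketch of the DSBM(-IPF) loop on a toy two-dimensional problem, assuming f = 0, T = 1 and a constant σ. Network sizes, training budgets and all names are illustrative, and for simplicity the coupled pairs (X0, XT) are re-simulated on the fly rather than cached and refreshed periodically as in the actual implementation.

```python
import torch
import torch.nn as nn

# Minimal DSBM(-IPF) sketch in the spirit of Algorithm 2, assuming f = 0, T = 1 and a
# Brownian bridge Q_{|0,T} with constant sigma. Illustrative only, not the released code.

sigma, n_sde_steps = 1.0, 100

def make_net(dim=2, hidden=128):
    return nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                         nn.Linear(hidden, hidden), nn.SiLU(),
                         nn.Linear(hidden, dim))

def drift(net, t, x):                                  # t: (batch,), x: (batch, dim)
    return net(torch.cat([t[:, None], x], dim=-1))

def sample_pi0(n): return 0.5 * torch.randn(n, 2) + torch.tensor([-3.0, 0.0])
def sample_piT(n): return 0.5 * torch.randn(n, 2) + torch.tensor([3.0, 0.0])

def bridge_sample(x0, xT, t):                          # Brownian bridge (4) between paired endpoints
    mean = (1 - t[:, None]) * x0 + t[:, None] * xT
    std = sigma * torch.sqrt(t * (1 - t))[:, None]
    return mean + std * torch.randn_like(x0)

def bridge_matching(net, coupling_fn, forward_dir, n_steps=2000, batch=256):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(n_steps):
        x0, xT = coupling_fn(batch)
        t = 0.01 + 0.98 * torch.rand(batch)            # stay away from the endpoints
        xt = bridge_sample(x0, xT, t)
        if forward_dir:                                # loss (10): target sigma^2 grad log Q_{T|t}
            target = (xT - xt) / (1 - t)[:, None]
        else:                                          # loss (14): target sigma^2 grad log Q_{t|0}
            target = (x0 - xt) / t[:, None]
        loss = ((drift(net, t, xt) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def simulate(net, x_init, forward_dir):                # Euler-Maruyama for (9) / (13)
    x, dt = x_init.clone(), 1.0 / n_sde_steps
    for k in range(n_sde_steps):
        s = k * dt                                     # the backward model is queried at time 1 - s
        t = torch.full((x.shape[0],), s if forward_dir else 1.0 - s)
        x = x + drift(net, t, x) * dt + sigma * dt ** 0.5 * torch.randn_like(x)
    return x

v_fwd, v_bwd = make_net(), make_net()
coupling = lambda n: (sample_pi0(n), sample_piT(n))    # Pi^0_{0,T} = pi_0 (x) pi_T

for outer in range(4):
    # backward half-step: fit v_bwd on the current coupling, then resample X_0 given X_T ~ pi_T
    bridge_matching(v_bwd, coupling, forward_dir=False)
    def coupling(n):
        xT = sample_piT(n)
        return simulate(v_bwd, xT, forward_dir=False), xT
    # forward half-step: fit v_fwd on the refreshed coupling, then resample X_T given X_0 ~ pi_0
    bridge_matching(v_fwd, coupling, forward_dir=True)
    def coupling(n):
        x0 = sample_pi0(n)
        return x0, simulate(v_fwd, x0, forward_dir=True)
```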
In the case of joint training, we have access to M2n+1 b and M2n+1 f . One way to define a new mixture of bridge is to compute Π2n+1 = 1 2(M2n+1 b,0,T Q|0,T + M2n+1 f,0,T )Q|0,T . This choice ensures that in the case where M2n+1 f = M2n+1 b we have Π2n+1 = M2n+1 f,0,T Q|0,T = M2n+1 b,0,T Q|0,T . It also ensures that all the steps in the joint DSBM training algorithms are symmetric. We leave the study of an optimal combination of M2n+1 f and M2n+1 b for future work. Consistency loss. In addition to the losses (43) and (44), we also consider an additional consistency loss. A similar idea was explored in Song (2022). In DSB (De Bortoli et al., 2021; Chen et al., 2022) and DSBM, see Algorithm 1, the processes parameterized by vθ (forward) and vφ backward are identical only at equilibrium. Thus imposing the forward and the backward processes match at each step of DSB or DSBM would lead to some bias. However, this is not the case in the joint training setting. Indeed, in that case, we have M2n+1 f = M2n+1 b if the families are rich enough. Therefore, we get that d Yt = { f T t(Yt) + vφ(T t, Yt)}dt + σT td Bt, Y0 πT , (45) is the time reversal of d Xt = {ft(Xt) + vθ(t, Xt)}dt + σtd Bt, X0 π0. (46) Computing the time-reversal of (45), we have d Xt = {ft(Xt) vφ(t, Xt) + σ2 t log Π2n t (Xt)}dt + σtd Bt, X0 π0. (47) Identifying (47) and (46), we get that for any t [0, T] and xt Rd vθ(t, xt) = vφ(t, xt) + σ2 t log Π2n t (xt). (48) We highlight that letting σt 0 for any t [0, T], we get that vθ = vφ, which confirms that the time-reversal of an ODE is simply given by flipping the sign of the velocity. Therefore, we propose the following loss which links the parameters θ and φ Lcons(θ, φ) = R T 0 EΠ2n t [ vθ(t, Xt) + vφ(t, Xt) σ2 t log Π2n t (Xt) 2]/σ2 t dt. Leveraging tools from implicit score matching (Hyvärinen, 2005) and the divergence theorem, we get that Lcons(θ, φ) = R T 0 EΠ2n t [ vθ(t, Xt) + vφ(t, Xt) 2/σ2 t + 2div(vθ(t, Xt) + vφ(t, Xt))]dt + C, where C 0 is a constant which does not depend on θ and φ. Alternatively, (48) shows that log Π2n t (xt) = EΠ2n T |t[ log QT |t|Xt = xt] + EΠ2n 0|t[ log Qt|0|Xt = xt]. We thus also have the following denoising score matching consistency loss Lcons(θ, φ) = R T 0 EΠ2n 0,t,T [ vθ(t, Xt)+vφ(t, Xt) σ2 t ( log QT |t(XT |Xt)+ log Qt|0(Xt|X0)) 2]/σ2 t dt, The advantage of this DSM loss is that it does not rely on any divergence computation. Below, we recall the two losses used to estimate the Markovian projection (43) and (44) L(θ) = R T 0 EΠt,T [ σ2 t log QT |t(XT |Xt) vθ(t, Xt) 2]/σ2 t dt, L(φ) = R T 0 EΠt,0[ σ2 t log Qt|0(Xt|X0) vφ(t, Xt) 2]/σ2 t dt. The complete loss we consider in the joint training of the algorithm is of the form Lλ(θ, φ) = L(θ) + L(φ) + λLcons(θ, φ), (49) where λ > 0 is an additional regularization parameter. We now state a version of DSBM which performs joint training in Algorithm 3. Algorithm 3 Diffusion Schrödinger Bridge Matching (Joint Training) 1: Input: Coupling Π0 0,T , tractable bridge Q|0,T , N N 2: Let Π0 = Π0 0,T Q|0,T . 3: for n {0, . . . , N 1} do 4: Learn vφ , vθ using (49) with Π = Πn. 5: Let Mn+1 f be given by (9). 6: Let Mn+1 b be given by (13). 7: Let Mn+1 = 1 2(Mn+1 f + Mn+1 b ). 8: Let Πn+1 = Mn+1 0,T Q|0,T . 9: end for 10: Output: vθ , vφ H Loss Scaling Similar to the loss weighting in standard diffusion models (Song et al., 2021b; Ho et al., 2020), we derive a similar weighting to reduce the variance of our objective. 
We focus on the forward direction of Markovian projection in this case, and the backward case can be derived similarly. Our forward loss in the DSBM framework is given by (10), where the inner expectation is given by EΠt,T [ σ2 t log QT |t(XT |Xt) vθ(t, Xt) 2]. Letting Q|0,T be a Brownian bridge with diffusion parameter σ and assuming T = 1, this becomes E(X0,X1) Π0,1,Z N(0,Id)[ X1 X0 σ p t/(1 t)Z vθ(t, X0,T t ) 2] with X0,T t = t X1 + (1 t)X0 + σ p t(1 t)Z. When t 1, we see that the regression target is dominated by the noise term σ p t/(1 t)Z which needs to be predicted based on information contained within X0,T t . The loss will have an approximate scale of σ2t/(1 t) when t 1 which will be very large. To avoid these large values affecting gradient descent, we can downweight the loss by 1 + σ2t/(1 t) (we can add 1 to effectively cause no loss scaling when t is close to 0) (1 + σ2t/(1 t)) 1E(X0,X1) Π0,1,Z N(0,Id)[ X1 X0 σ p t/(1 t)Z vθ(t, X0,T t ) 2]. Similar arguments can be applied to the backward loss (14) EΠt,0[ σ2 t log Qt|0(Xt|X0) vφ(t, Xt) 2] = E(X0,X1) Π0,1,Z N(0,Id)[ X0 X1 σ p (1 t)/t Z vφ(t, X0,T t ) 2], which we then downweight by 1 + σ2(1 t)/t (1 + σ2(1 t)/t) 1E(X0,X1) Π0,1,Z N(0,Id)[ X0 X1 σ p (1 t)/t Z vφ(t, X0,T t ) 2]. I Experiments In this section, we present further details of the experiment setups as well as further experiment results. In all experiments, we use Brownian motion for the reference measure Q with corresponding Brownian bridge (4) and T = 1. We use the Adam optimizer with learning rate 10 4 and Si LU activations unless specified otherwise. The experiments are run on computing clusters with a mixture of both CPU and GPU resources. I.1 2D Experiments For the 2D experiments, we closely follow Tong et al. (2023) and the released code7, and use the same synthetic datasets and the 2-Wasserstein distance between the test set and samples simulated using probability flow ODE as the evaluation metric. However, we use 10000 samples in the test set since we find the 2-Wasserstein distance can vary greatly with only 1000 samples (which can be as high as 0.3 even between two set of samples both drawn from the ground truth distribution). We use a simple MLP with 3 hidden layers and 256 hidden units to parameterize the forward and backward drift networks. We use batch size 128 and 20 diffusion steps with uniform schedule at sampling time. Each outer iteration is trained for 10000 steps and we train for 20 outer iterations. As the initial coupling is more optimal in DSBM-IMF+, we reduce the number of outer iterations to 4 and train for 50000 steps in each outer iteration. For flow methods, we train for 200000 steps in total. For Table 2 we use σt = 1 in all cases, except the moons-8gaussians dataset where we use σt = 5. Note that FM cannot be used for the moons-8gaussians task since it requires a Gaussian source, but CFM is applicable. The experiments are run using 1 CPU and take approximately 200 minutes (for both training and testing). I.1.1 Variance of the reference measure Q We comment further on the effect of σt in the reference path measure Q. We assume a timehomogeneous σt = σ for simplicity. In Figure 2, we vary σ and visualize the learned transport for a 3gaussians problem of transporting between two Gaussian mixtures. In Table 4 we show the 2-Wasserstein distance between the test set and generated samples for this 3gaussians problem as well as the moons-8gaussians problem. 
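The 2-Wasserstein evaluation between the test set and generated samples can be computed with an exact OT solver; the sketch below uses the POT library, which is an assumption made for illustration only, as the released evaluation code may compute it differently.

```python
import numpy as np
import ot  # POT (Python Optimal Transport); assumed here for illustration

def wasserstein2(x, y):
    """Exact 2-Wasserstein distance between two empirical sample sets of shape (n, d)."""
    a = np.full(len(x), 1.0 / len(x))          # uniform weights
    b = np.full(len(y), 1.0 / len(y))
    M = ot.dist(x, y)                          # squared Euclidean cost matrix
    return np.sqrt(ot.emd2(a, b, M))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test_set = rng.normal(size=(2000, 2))
    samples = rng.normal(size=(2000, 2)) + 0.1
    print(wasserstein2(test_set, samples))
```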
We find that large values of σ result in increasingly curved transport paths, and correspondingly reduced performance when σ is excessively large. Conversely, we also find reduced performance when σ is excessively small. We conjecture this is due to increased optimization difficulty and bias accumulation. Firstly, the EOT problem becomes more difficult to solve as σ is taken to 0, which would require a higher number of outer iterations. Further, the introduction of noise also decreases optimization difficulty by smoothing the intermediate marginals between the two terminal distributions. The benefit of setting σ > 0 and using a stochastic sampler was also observed in Albergo et al. (2023); Delbracio and Milanfar (2023). Finally, we conjecture that setting σ > 0 could also increase the diversity of sampled couplings and may alleviate some bias accumulation issues in the outer iterations. When σ = 0, these issues result in the artifacts observed in the transferred samples (marked using yellow points) in Figure 2. The appropriate value for σ depends on the spatial scaling of the problem as shown in Table 4, where the optimum σ is larger for the larger-scale moons-8gaussians problem.

            3gaussians                       moons-8gaussians
  σ           2-Wasserstein         σ           2-Wasserstein
  σ = 0       0.646 ± 0.028         σ = 0       1.459 ± 0.008
  σ = 0.1     0.724 ± 0.039         σ = 1.0     1.285 ± 0.346
  σ = 0.3     0.546 ± 0.169         σ = 2.0     0.916 ± 0.292
  σ = 1.0     0.439 ± 0.072         σ = 4.0     0.818 ± 0.249
  σ = 3.0     0.543 ± 0.078         σ = 8.0     0.989 ± 0.179

Table 4: 2-Wasserstein distance for varying values of σ used in the DSBM-IMF method.

7 https://github.com/atong01/conditional-flow-matching (code released under MIT license)

I.2 Gaussian Experiment

Similar to the 2D experiments, we use a simple MLP with 2 hidden layers and 256 hidden units to parameterize the forward and backward drift networks. This is a smaller network compared to the large network in De Bortoli et al. (2021). We use batch size 128 and 20 diffusion sampling steps with a uniform time schedule at inference time. Each outer iteration is trained for 10000 steps and we train for 20 outer iterations. The experiments are also run using 1 CPU, and for the case d = 50 they finish in approximately 200 minutes. We note that the DSB, IMF-b, and DSBM methods all solve for the Schrödinger Bridge transport, whereas Rectified Flow does not and so is not plotted on the covariance convergence plot in Figure 3 since it is not comparable. For Table 3, we assume the marginals of the learned process Pt are also independently Gaussian distributed in each dimension. We thus estimate the KL divergence from the sample mean and variance of each dimension of Pt using the analytic KL formula between Gaussian distributions.

I.3 MNIST Transfer Experiment

We follow De Bortoli et al. (2021) closely for the setup of this experiment. We use the first 5 letters of EMNIST (A-E, a-e) such that both domains have 10 classes in total. We use the same U-Net architecture, batch size 128 and 30 diffusion sampling steps. The network size is approximately 6.6 million parameters. Each outer iteration is trained for 5000 steps. We refresh the cache dataloader every 1000 steps with 10000 new samples. Contrary to De Bortoli et al. (2021), in our experiments we find that we obtain better sampling quality for both DSB and DSBM using a uniform noising schedule. We simply choose σt = 1 for all t and T = 1 in our experiments. For DSBM-IPF, we train for at most 50 outer iterations (i.e. 250000 steps in total).
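As an aside on the Gaussian experiment of Appendix I.2 above, the dimension-wise analytic KL estimate used for Table 3 can be sketched as follows (illustrative code and names, not the released implementation).

```python
import numpy as np

# Sketch of the dimension-wise Gaussian KL estimate: fit a Gaussian to each dimension of
# the two sample sets and apply the analytic formula
# KL(N(m1, v1) || N(m2, v2)) = 0.5 * (log(v2 / v1) + (v1 + (m1 - m2)^2) / v2 - 1).

def gaussian_kl_1d(m1, v1, m2, v2):
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_estimate(samples_p, samples_q):
    """Sum of per-dimension KL divergences between Gaussian fits of two sample sets (n, d)."""
    m1, v1 = samples_p.mean(0), samples_p.var(0)
    m2, v2 = samples_q.mean(0), samples_q.var(0)
    return gaussian_kl_1d(m1, v1, m2, v2).sum()

rng = np.random.default_rng(0)
print(kl_estimate(rng.normal(0.0, 1.0, (10000, 5)), rng.normal(0.1, 1.1, (10000, 5))))
```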
For DSBM-IMF and Rectified Flow, since their first iteration corresponds to Bridge Matching and Flow Matching respectively, we first pretrain the forward and backward networks for 100000 steps using the Bridge Matching or Flow Matching losses. DSBM-IMF then switches to iterative training with 5000 steps per outer iteration, the same as DSBM-IPF. For Rectified Flow, we train for 50000 steps per outer iteration. The experiments are performed using 2 GPUs and take approximately one day. We provide further experimental results in Figures 13, 14 and 15. In Figure 13, we show samples generated using different algorithms and at different points of convergence. Samples generated using CFM and Rectified Flow with 2 rectified steps in (a) and (b) appear to be less clear and identifiable. OT-CFM improves upon CFM slightly in (c), but many samples still appear to be unclear. For DSB, the algorithm has not converged after 10 iterations, and many samples in (d) still appear to be letters. After 20 iterations, there are still letter-like samples in Figure 4b, and the digit classes also appear to be unbalanced, with many instances of the digit '0'. After 30 iterations, however, the sample quality of DSB becomes very poor in (e). On the other hand, as shown in (f)-(j), we observe that DSBM-IPF converges faster than DSB, with more accurate samples even at iterations 10 and 20, and the sample quality continues to improve until the end of training after 50 iterations.

Figure 13: Samples of MNIST digits transferred from the EMNIST letters using different methods; panels include (b) Rectified Flow @ 3, (d) DSB @ 10, (e) DSB @ 30, and (f)-(j) DSBM-IPF @ 10, 20, 30, 40, 50. @ indicates the progressed number of outer iterations n + 1.

We present some additional trajectory samples at the end of training in Figures 14 and 15 in both the forward and backward directions. We observe that DSBM is able to transfer samples across the two domains faithfully, and the output samples preserve interesting similarities compared to the input, whereas Bridge Matching preserves much less similarity. The Mean Squared Distance (MSD) between the initial and final samples also confirms that the DSBM methods transfer more closely to the original inputs, and DSBM-IMF further achieves the best FID score out of the three methods.

I.4 CelebA Transfer Experiment

For CelebA (Liu et al., 2015), we test DSBM on resized 64 × 64 and 128 × 128 resolution images. We evaluate the dependency of the results with respect to σ > 0 on CelebA 64 × 64. We also showcase the scalability of our method by training DSBM on CelebA 128 × 128. In both cases, the dataset is split between male/old and female/young. We gather 20000 samples of each class, on which we perform classical data augmentation such as horizontal flipping. For our ablation study w.r.t. the value of σ > 0, we run DSBM for 20 iterations (note that the loss and the quality were still improving after 20 DSBM iterations, but stopping the runs early allowed us to draw comparisons with more values of σ). We use a U-Net architecture with 4 resolution levels and 2 residual blocks per resolution level. The batch size is fixed to 64 and the EMA rate is fixed to 0.999. We refresh the cache dataloader every 500 steps with 10000 new samples. The SDE sampler is chosen to be the modified Euler-Maruyama sampler, see Heng et al. (2021) for instance, with a constant schedule for the stepsizes. We use 100 sampling steps at inference time and to refresh the cache.
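The exponential moving average (EMA) of the network weights mentioned above (rate 0.999) can be implemented in a few lines; a minimal PyTorch sketch with illustrative names is given below. The EMA copy is the one typically used for sampling and evaluation.

```python
import torch

# Minimal sketch of an exponential moving average (EMA) of network weights; illustrative only.

class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1.0 - self.decay)
            else:
                self.shadow[k].copy_(v)   # e.g. integer buffers are copied directly

    def copy_to(self, model: torch.nn.Module):
        model.load_state_dict(self.shadow)

# usage: ema = EMA(net); after each optimizer step: ema.update(net); before sampling: ema.copy_to(net)
```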
For each outer DSBM iteration we train the model for 20000 iterations. We provide additional transfer results at resolution 128 × 128 in Figure 16. We do not change the training setting for this experiment.

Figure 14: Left: EMNIST to MNIST sample trajectories with 30 diffusion steps at t = 0, 1/3, 2/3, 1. Right: FID score of final samples, and Mean Squared Distance (MSD) between initial and final samples: (a) Bridge Matching, FID 17.14, MSD 0.579; (b) DSBM-IPF, FID 15.27, MSD 0.354; (c) DSBM-IMF, FID 10.59, MSD 0.375.

Figure 15: Left: MNIST to EMNIST sample trajectories with 30 diffusion steps at t = 0, 1/3, 2/3, 1. Right: FID score of final samples, and Mean Squared Distance (MSD) between initial and final samples: (a) Bridge Matching, FID 9.402, MSD 0.561; (b) DSBM-IPF, FID 13.59, MSD 0.359; (c) DSBM-IMF, FID 8.990, MSD 0.375.

Figure 16: Transfer results between images given by the tokens female/young and male/old. Top and bottom rows: original images (left) and generated images (right).

I.5 AFHQ Transfer Experiment

For AFHQ (Choi et al., 2020), we test DSBM between the classes cat and wild with 512 × 512 resolution images. Each class contains approximately 5000 samples. We first pretrain the networks using Bridge Matching for 100000 steps, then run DSBM for 20 iterations with 25000 steps per outer iteration. We follow Liu et al. (2023b) and use the same U-Net architecture8. The batch size is 4 and the EMA rate is 0.999. We choose σ² = 5 and again we use 100 sampling steps with constant stepsizes.

8 https://github.com/gnobitab/RectifiedFlow/blob/main/ImageGeneration/configs/rectified_flow/afhq_cat_pytorch_rf_gaussian.py (code released under Apache-2.0 license)

I.6 CIFAR-10 Generative Modeling Experiment

We also test our method in the standard generative modeling framework on the CIFAR-10 dataset. We again use a U-Net architecture with 4 resolution levels and 2 residual blocks per resolution level. The network size is approximately 39.6 million parameters. The batch size is fixed to 128 and the EMA rate is fixed to 0.9999. The AdamW optimizer is used for this task. For DSBM-IMF and Rectified Flow, again we first pretrain the networks using the Bridge Matching or Flow Matching losses, then switch to DSBM-IMF or RF training with 100000 steps per outer iteration. We find that 1 or 2 additional outer iterations are sufficiently effective on this task, and further outer iterations can cause sample quality to drop. For our main experiment, the pretraining stage ran for approximately 6 days, and DSBM-IMF ran for approximately 4 additional days using 4 V100 GPUs. The best results during training for each method are reported in Table 5 and Figure 17, where we compute the FID score between 50000 samples from the CIFAR-10 training set and 50000 generated samples, following standard practice for this task. We observe that DSBM-IMF can clearly improve upon Bridge Matching at the same value of σ, which suggests that further outer iterations in DSBM are beneficial for improving sample quality. This is contrary to Rectified Flow, which causes the FID score to worsen compared to Flow Matching after only 1 rectified iteration. However, as σ increases, we observe the FID score worsens for both Bridge Matching and DSBM as more stochasticity is introduced in the sampler. The best result of DSBM-IMF is obtained using σ² = 0.2 and is slightly better than FM (i.e. with σ² = 0) using 100 Euler steps.
On the other hand, using the dopri5 ODE solver, FM achieves an FID of 4.055 with 148 integration steps on average. In Figure 17, we observe that both RF and DSBM-IMF are very effective in improving sampling quality at a low number of diffusion steps, i.e. a low number of function evaluations (NFEs), compared to Bridge and Flow Matching as well as OT-CFM, which improves upon CFM slightly. DSBM-IMF also achieves a lower FID score than RF as the number of function evaluations increases. Additional strategies such as distillation and fast SDE solvers can also be useful for improving few-step sampling quality further.

Table 5: FID results on the CIFAR-10 train set using 100 Euler(-Maruyama) steps.
σ²     BM       DSBM-IMF
0      4.931    6.010
0.2    5.427    4.511
0.5    8.274    6.896
1      12.749   9.881

Figure 17: FID vs number of diffusion steps (NFE) with NFE between 1 and 1000, for BM, CFM, OT-CFM, RF and DSBM-IMF.

Figure 18: Zoomed-in version of Figure 17 with NFE between 100 and 1000.

I.7 Fluid Flows Experiment

We use the fluid flows dataset9 from Bischoff and Deck (2023). The dataset consists of unpaired low (64×64) and high (512×512) resolution fields, as well as a context field with local information for the high resolution field. The data fields consist of two channels representing supersaturation and vorticity. The context field depends on the wavenumber kx = ky ∈ {1, 2, 4, 8, 16}, which specifies the frequency of the saturation specific humidity modulation. We follow Bischoff and Deck (2023) for data processing and network architecture, which is given by a U-Net with an additional spatial mean bypass network given by an MLP. The network size is approximately 11.3 million parameters in total. We train using batch size 4 and learning rate 2 × 10^-4, for 5000 steps per outer iteration and N = 12 outer iterations. We refresh the cache dataloader every 2500 steps with 1250 new samples. The training takes approximately 20 hours on a single RTX GPU. We use σ² = 0.3 and sample using 30 diffusion steps, without finetuning these parameters. For the Diffusion-fb method of Bischoff and Deck (2023), we use the released code10 without modifying any parameters.

9 https://github.com/CliMA/CliMADatasets.jl (code released under MIT license)
10 https://github.com/CliMA/diffusion-bridge-downscaling (code released under Apache-2.0 license)

We visualize intermediate and final reconstruction samples for the different algorithms in Figure 19. We see that DSBM-IPF and DSBM-IMF provide samples consistent with the low resolution source, whereas Diffusion-fb and Bridge Matching produce dissimilar samples.

Figure 19: (a) Source low resolution sample; (b)-(e) intermediate state and final reconstruction of each algorithm for wavenumber kx = ky = 2: (b) Diffusion-fb, (c) Bridge Matching, (d) DSBM-IPF, (e) DSBM-IMF.

We also follow Bischoff and Deck (2023) for a more refined statistical analysis in Figures 20, 21 and 22. DSBM-IMF achieves performance comparable to Diffusion-fb in terms of these statistical profiles, and can be comparatively more accurate, e.g. in the tails of the distributions in Figure 20, and for the case kx = ky = 4 in Figure 21, for which the power spectrum of supersaturation is correctly captured by DSBM-IMF but not by the other methods. Comparing this analysis with Figure 11, DSBM-IMF is also significantly more accurate in terms of conditional consistency than Diffusion-fb. On the other hand, DSBM-IPF appears less accurate in terms of these unconditional statistics than Diffusion-fb and DSBM-IMF, but achieves lower ℓ2 distances from the input sources in Figure 11.
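For reference, the spectral profiles in Figure 21 can be thought of as radially averaged power spectra of the 2D fields. Below is a sketch of one standard estimator of this kind for a square field; it is not necessarily the exact estimator used by Bischoff and Deck (2023), and the function name is illustrative.

    import numpy as np

    def radial_power_spectrum(field):
        # Radially averaged power spectrum of a square 2D field (n x n).
        n = field.shape[0]
        fhat = np.fft.fftshift(np.fft.fft2(field))
        power = np.abs(fhat) ** 2 / n ** 2
        # Integer radial wavenumber |k| of every Fourier mode.
        ky, kx = np.indices(field.shape) - n // 2
        k = np.sqrt(kx ** 2 + ky ** 2).astype(int)
        # Average the power over annuli of constant |k|.
        sums = np.bincount(k.ravel(), weights=power.ravel())
        counts = np.maximum(np.bincount(k.ravel()), 1)
        return sums / counts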
These observations suggest that DSBM-IPF and DSBM-IMF exhibit different empirical biases before convergence, and that DSBM-IMF is preferable when the accuracy of the samples is important. This is in line with the IMF theory, as the marginals π0, πT are preserved by IMF but not by IPF.

Figure 20: KDE estimates of values in the supersaturation and vorticity fields for kx = ky ∈ {2, 4, 8, 16}, comparing the true high-resolution and low-resolution fields with Diffusion-fb, DSBM-IPF and DSBM-IMF.

Figure 21: Spectral density estimates of the supersaturation and vorticity fields for kx = ky ∈ {2, 4, 8, 16}, comparing the true high-resolution and low-resolution fields with Diffusion-fb, DSBM-IPF and DSBM-IMF.

Figure 22: KDE estimates of spatial means of the supersaturation field for kx = ky ∈ {2, 4, 8, 16}. The shaded areas denote 99% confidence intervals obtained using 10000 bootstrap samples.

J Broader Impact

Our work focuses on theoretical and methodological research and is intended to bring closer the fields of generative modeling and optimal transport. It can be useful for learning transport maps between general distributions with high accuracy and high scalability, which can have useful applications in machine learning, but also in natural science areas such as physics, biology and geosciences, in which optimal transport maps with theoretical guarantees are appealing. Our fluid flows experiment demonstrates this potential. However, as is the case for generative models as a whole, intentional malicious use could cause detrimental societal impacts.