# Stochastic Interpolants with Data-Dependent Couplings

Michael S. Albergo*1, Mark Goldstein*2, Nicholas M. Boffi2, Rajesh Ranganath2,3, Eric Vanden-Eijnden2

*Equal contribution. 1Center for Cosmology and Particle Physics, New York University. 2Courant Institute of Mathematical Sciences, New York University. 3Center for Data Science, New York University. Correspondence to: Michael S. Albergo, Mark Goldstein. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Generative models inspired by dynamical transport of measure, such as flows and diffusions, construct a continuous-time map between two probability densities. Conventionally, one of these is the target density, only accessible through samples, while the other is taken as a simple base density that is data-agnostic. In this work, using the framework of stochastic interpolants, we formalize how to couple the base and the target densities, whereby samples from the base are computed conditionally given samples from the target in a way that is different from (but does not preclude) incorporating information about class labels or continuous embeddings. This enables us to construct dynamical transport maps that serve as conditional generative models. We show that these transport maps can be learned by solving a simple square loss regression problem analogous to the standard independent setting. We demonstrate the usefulness of constructing dependent couplings in practice through experiments in super-resolution and in-painting. The code is available at https://github.com/interpolants/couplings.

Figure 1: Examples. Super-resolution and in-painting results computed with our formalism.

1. Introduction

Generative models such as normalizing flows and diffusions sample from a target density $\rho_1$ by continuously transforming samples from a base density $\rho_0$ into the target. This transport is accomplished by means of an ordinary differential equation (ODE) or stochastic differential equation (SDE), which takes as initial condition a sample from $\rho_0$ and produces at time $t = 1$ an approximate sample from $\rho_1$. Typically, the base density is taken to be something simple, analytically tractable, and easy to sample, such as a standard Gaussian. In some formulations, such as score-based diffusion (Sohl-Dickstein et al., 2015; Song & Ermon, 2020; Ho et al., 2020b; Song et al., 2020; Singhal et al., 2023), a Gaussian base density is intrinsically tied to the process achieving the transport. In others, including flow matching (Lipman et al., 2022a; Chen & Lipman, 2023), rectified flow (Liu et al., 2022b; 2023b), and stochastic interpolants (Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023), a Gaussian base is not required, but is often chosen for convenience. In these cases, the choice of Gaussian base represents an absence of prior knowledge about the problem structure, and existing works have yet to fully explore the strength of base densities adapted to the target.

In this work, we introduce a general formulation of stochastic interpolants in which a base density is produced via a coupling, whereby samples of this base are computed conditionally given samples from the target. We construct a continuous-time stochastic process that interpolates between the coupled base and target, and we characterize the resulting transport by identification of a continuity equation obeyed by the time-dependent density.
We show that the velocity field defining this transport can be estimated by solution of an efficient, simulation-free square loss regression problem analogous to standard, data-agnostic interpolant and flow matching algorithms. In our formulation, we also allow for dependence on an external, conditional source of information independent of $\rho_1$, which we call $\xi$. This extra source of conditioning is standard, and can be used in the velocity field $b_t(x, \xi)$ to accomplish class-conditional generation, or generation conditioned on a continuous embedding such as a textual representation or problem-specific geometric information. As illustrated in Fig. 2, it is however different from the data-dependent coupling that we propose. Below, we suggest some generic ways to construct coupled, conditional base and target densities, and we consider practical applications to image super-resolution and in-painting, where we find improved performance by incorporating both a data-dependent coupling and the conditioning variable.

Together, our main contributions can be summarized as:

1. We define a broader way of constructing base and target pairs in generative models based on dynamical transport that adapts the base to the target. In addition, we formalize the use of conditional information, both discrete and continuous, in concert with this new form of data coupling in the stochastic interpolant framework. As special cases of our general formulation, we obtain several recent variants of conditional generative models that have appeared in the literature.
2. We provide a characterization of the transport that results from conditional, data-dependent generation, and analyze theoretically how these factors influence the resulting time-dependent density.
3. We provide an empirical study of the effect of coupling for stochastic interpolants, which have recently been shown to be a promising, flexible class of generative models. We demonstrate the utility of data-dependent base densities and the use of conditional information in two canonical applications, image in-painting and super-resolution, which highlight the performance gains that can be obtained through the application of the tools developed here.

The rest of the paper is organized as follows. In Section 2, we describe some related work in conditional generative modeling. In Section 3, we introduce our theoretical framework. We characterize the transport that results from the use of data-dependent couplings, and discuss the difference between this approach and conditional generative modeling. In Section 4, we apply the framework to numerical experiments on ImageNet, focusing on image in-painting and image super-resolution. We conclude with some remarks and discussion in Section 5.

2. Related Work

Couplings. Several works have studied the question of how to build couplings, primarily from the viewpoint of optimal transport theory. An initial perspective in this regard comes from (Pooladian et al., 2023; Tong et al., 2023; Klein et al., 2023), who state an unbiased means for building entropically-regularized optimal couplings from minibatches of training samples. This perspective is appealing in that it may give probability flows that are straighter and hence more easily computed using simple ODE solvers. However, it relies on estimating an optimal coupling over minibatches of the entire dataset, which, for large datasets, may become uninformative as to the true coupling.
In an orthogonal perspective, (Lee et al., 2023) presented an algorithm to learn a coupling between the base and the target by building dependence on the target into the base. They argue that this can reduce the curvature of the underlying transport. While this perspective empirically reduces the curvature of the flow lines, it introduces a potential bias, in that they still sample from an independent base, possibly not equal to the marginal of the learned conditional base. Learning a coupling can also be achieved by solving the Schrödinger bridge problem, as investigated e.g. in (De Bortoli et al., 2021; Shi et al., 2023). This leads to iterative algorithms that require solving pairs of SDEs until convergence, which is costly in practice. More closely connected to our work are the approaches proposed in (Liu et al., 2023a; Somnath et al., 2023): by considering generative modeling through the lens of diffusion bridges with a known coupling, they arrive at a formulation that is operationally similar to, but less general than, ours. Our approach is simpler and more flexible, as it differentiates between the bridging of the densities and the construction of the generative models. Table 1 summarizes these couplings along with the standard independent pairing.

Generative Modeling and Dynamical Transport. Generative models built upon dynamical transport of measure go back at least to (Tabak & Vanden-Eijnden, 2010; Tabak & Turner, 2013), and were further developed in (Rezende & Mohamed, 2015; Dinh et al., 2017; Huang et al., 2016; Durkan et al., 2019) using compositions of discrete maps, while modern models are typically formulated via a continuous-time transformation. In this context, a major advance was the introduction of score-based diffusion (Song et al., 2021b;a), which relates to denoising diffusion probabilistic models (Ho et al., 2020a), and allows one to generate samples by learning to reverse a stochastic differential equation that maps the data into samples from a Gaussian base density. Methods such as flow matching (Lipman et al., 2022b), rectified flow (Liu, 2022; Liu et al., 2022a), and stochastic interpolants (Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023) expand on the idea of building stochastic processes that connect a base density to the target, but allow for bases that are more general than a Gaussian density. Typically, these constructions assume that the samples from the base and the target are uncorrelated.

Figure 2: Data-dependent couplings are different than conditioning. Delineating between constructing couplings versus conditioning the velocity field, and their implications for the corresponding probability flow $X_t$ (the horizontal axis of each panel is $t \in [0, 1]$). The transport problem is flowing from a Gaussian mixture model (GMM) with 3 modes to another GMM with 3 modes. Left: the probability flow $X_t$ arising from the data-dependent coupling $\rho(x_0, x_1) = \rho_1(x_1)\rho_0(x_0|x_1)$. All samples follow simple trajectories, and no auxiliary modes form in the intermediate density $\rho_t$, in juxtaposition to the independent case. Center: when the velocity field $b_t(x, \xi)$ is conditioned on each class (mode), it factorizes, resulting in three separate probability flows $X_t^\xi$ with $\xi = 1, 2, 3$. Right: the probability flow $X_t$ when taking an unconditional velocity field $b_t(x)$ and an independent coupling $\rho(x_0, x_1) = \rho_0(x_0)\rho_1(x_1)$. Note the complexity of the underlying transport, which motivates us to consider finding correlated base variables directly in the data.

Table 1: Couplings.
Standard formulations of flows and diffusions construct generative models built upon an independent coupling (Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023; Lipman et al., 2022a; Liu et al., 2022b). (Lee et al., 2023) learn $q_\phi(x_0|x_1)$ jointly with the velocity to define the coupling during training, but instead sample from $\rho_0 = N(0, \mathrm{Id})$ for generation. (Tong et al., 2023) and (Pooladian et al., 2023) build couplings by running minibatch optimal transport algorithms (Cuturi, 2013). Here we focus on couplings enabled by our generic formalism, which bears similarities with (Liu et al., 2023a; Somnath et al., 2023), and can be individualized to each generative task.

| Coupling PDF $\rho(x_0, x_1)$ | Base PDF | Description |
|---|---|---|
| $\rho_1(x_1)\rho_0(x_0)$ | $x_0 \sim N(0, \mathrm{Id})$ | Independent |
| $\rho(x_0 \mid x_1)\rho_1(x_1)$ | $x_0 \sim q_\phi(x_0 \mid x_1)$ | Learned conditional |
| $\text{mb-OT}(x_1, x_0)$ | $x_0 \sim N(0, \mathrm{Id})$ | Minibatch OT |
| $\rho_1(x_1)\rho_0(x_0 \mid x_1)$ | $x_0 \sim \rho_0(x_0 \mid x_1)$ | Dependent coupling (this work) |

Conditional Diffusions and Flows for Images. (Saharia et al., 2022; Ho et al., 2022a) build diffusions for super-resolution, where low-resolution images are given as inputs to a score model, which formally learns a conditional score (Ho & Salimans, 2022). In-painting can be seen as a form of conditioning where the conditioning set determines some coordinates in the target space. In-painting diffusions have been applied to video generation (Ho et al., 2022b) and protein backbone generation (Trippe et al., 2022). In the replacement method, one directly inputs the clean values of the known coordinates at each step of integration (Ho et al., 2022b); (Schneuing et al., 2022) instead replace with draws of the diffused state of the known coordinates. (Trippe et al., 2022; Wu et al., 2023) discuss the approximation error in this approach and correct for it with sequential Monte Carlo. We revisit this problem framing from the velocity modeling perspective in Section 4.1. Recent work has applied flows to high-dimensional conditional modeling (Dao et al., 2023; Hu et al., 2023). A Schrödinger bridge perspective on the conditional generation problem was presented in (Shi et al., 2022).

3. Stochastic interpolants with couplings

Suppose that we are given a dataset $\{x_1^i\}_{i=1}^n$. The aim of a generative model is to draw new samples assuming that the data set comes from a probability density function (PDF) $\rho_1(x_1)$. Following the stochastic interpolant framework (Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023), we introduce a time-dependent stochastic process that interpolates between samples from a simple base density $\rho_0(x_0)$ at time $t = 0$ and samples from the target $\rho_1(x_1)$ at time $t = 1$:

Definition 3.1 (Stochastic interpolant with coupling). The stochastic interpolant $I_t$ is the process defined as¹
$$I_t = \alpha_t x_0 + \beta_t x_1 + \gamma_t z, \qquad t \in [0, 1], \tag{1}$$
where $\alpha_t$, $\beta_t$, and $\gamma_t^2$ are differentiable functions of time such that $\alpha_0 = \beta_1 = 1$, $\alpha_1 = \beta_0 = \gamma_0 = \gamma_1 = 0$, and $\alpha_t^2 + \beta_t^2 + \gamma_t^2 > 0$ for all $t \in [0, 1]$; the pair $(x_0, x_1)$ is jointly drawn from a probability density $\rho(x_0, x_1)$ with finite second moments and such that
$$\int_{\mathbb{R}^d} \rho(x_0, x_1)\, dx_1 = \rho_0(x_0), \tag{2}$$
$$\int_{\mathbb{R}^d} \rho(x_0, x_1)\, dx_0 = \rho_1(x_1); \tag{3}$$
and $z \sim N(0, \mathrm{Id})$, independent of $(x_0, x_1)$.
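To make Definition 3.1 concrete, the sketch below (ours, not the authors' released code) draws a coupled pair $(x_0, x_1)$ and forms the interpolant $I_t$; the noisy-target conditional $\rho_0(x_0|x_1)$, the value of `sigma`, and the coefficient choices are illustrative assumptions consistent with the boundary conditions.

```python
import torch

def sample_coupled_interpolant(x1, alpha, beta, gamma, sigma=0.1):
    """x1: a batch of target samples with shape (n, d).
    alpha, beta, gamma: callables t -> coefficient obeying the boundary
    conditions alpha(0) = beta(1) = 1, alpha(1) = beta(0) = gamma(0) = gamma(1) = 0."""
    t = torch.rand(x1.shape[0], 1)                 # t ~ U(0, 1), one per sample
    x0 = x1 + sigma * torch.randn_like(x1)         # x0 ~ rho_0(x0 | x1): a noisy-target coupling
    z = torch.randn_like(x1)                       # z ~ N(0, Id), independent of (x0, x1)
    It = alpha(t) * x0 + beta(t) * x1 + gamma(t) * z   # Eq. (1)
    return t, x0, z, It

# One set of coefficients satisfying the boundary conditions of Definition 3.1:
alpha = lambda t: 1.0 - t
beta = lambda t: t
gamma = lambda t: torch.sqrt(2.0 * t * (1.0 - t))
```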
A simple instance of (1) uses $\alpha_t = 1 - t$, $\beta_t = t$, and $\gamma_t = \sqrt{2t(1-t)}$. The stochastic interpolant framework uses information about the process $I_t$ to derive either an ODE or an SDE whose solutions $X_t$ push the law of $x_0$ onto the law of $I_t$ for all times $t \in [0, 1]$. As shown in Section 3.1, the drift coefficients in these ODEs/SDEs can be estimated by quadratic regression. They can then be used as generative models, owing to the property that the process $I_t$ specified in Definition 3.1 satisfies $I_{t=0} = x_0 \sim \rho_0(x_0)$ and $I_{t=1} = x_1 \sim \rho_1(x_1)$, and hence samples the desired target density. By drawing samples $x_0 \sim \rho_0(x_0)$ and using them as initial data $X_{t=0} = x_0$ in the ODEs/SDEs, we can then generate samples $X_{t=1} \sim \rho_1(x_1)$ via numerical integration. In the original stochastic interpolant papers, this construction was made using the choice $\rho(x_0, x_1) = \rho_0(x_0)\rho_1(x_1)$, so that $x_0$ and $x_1$ were drawn independently from the base and the target. Our aim here is to build generative models that are more powerful and versatile by exploring and exploiting dependent couplings between $x_0$ and $x_1$ via suitable definition of $\rho(x_0, x_1)$.

¹ More generally, we may set $I_t = I(t, x_0, x_1)$ in (1), where $I$ satisfies some regularity properties in addition to the boundary conditions $I(t=0, x_0, x_1) = x_0$ and $I(t=1, x_0, x_1) = x_1$ (Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023). For simplicity, we will stick to the linear choice $I(t, x_0, x_1) = \alpha_t x_0 + \beta_t x_1$.

Remark 3.1 (Incorporating conditioning). Our formalism allows (but does not require) that each data point $x_1^i \in \mathbb{R}^d$ comes with a label $\xi^i \in D$, such as a discrete class or a continuous embedding like that of a text caption. In this setup, our results can be straightforwardly generalized by making all the quantities (PDF, velocities, etc.) conditional on $\xi$. This is discussed in Appendix A and used in various forms in our numerical examples.

3.1. Transport equations and conditional generative models

In this section, we show that the probability distribution of the process $I_t$ defined in (1) has a time-dependent density $\rho_t(x)$ that interpolates between $\rho_0(x)$ and $\rho_1(x)$. We characterize this density as the solution of a transport equation, and we show that both the corresponding velocity field and the score $\nabla \log \rho_t(x)$ are minimizers of simple quadratic objective functions. This result enables us to construct conditional generative models by approximating the velocity (and possibly the score) via minimization over a rich parametric class such as neural networks. We first define the functions
$$b_t(x) = \mathbb{E}(\dot I_t \mid I_t = x), \qquad g_t(x) = \mathbb{E}(z \mid I_t = x), \tag{4}$$
where the dot denotes a time derivative and $\mathbb{E}(\,\cdot \mid I_t = x)$ denotes the expectation over $\rho(x_0, x_1)$ conditional on $I_t = x$. We then have:

Theorem 3.1 (Transport equation with coupling). The probability distribution of the stochastic interpolant $I_t$ defined in (1) has a density $\rho_t(x)$ that satisfies $\rho_{t=0}(x) = \rho_0(x)$ and $\rho_{t=1}(x) = \rho_1(x)$, and solves the transport equation
$$\partial_t \rho_t(x) + \nabla \cdot \big(b_t(x)\rho_t(x)\big) = 0, \tag{5}$$
where the velocity field $b_t(x)$ is defined in (4). Moreover, for every $t$ such that $\gamma_t \neq 0$, the following identity for the score holds:
$$\nabla \log \rho_t(x) = -\gamma_t^{-1} g_t(x). \tag{6}$$
Finally, the functions $b$ and $g$ are the unique minimizers of the objectives
$$L_b(\hat b) = \int_0^1 \mathbb{E}\big[|\hat b_t(I_t)|^2 - 2\, \dot I_t \cdot \hat b_t(I_t)\big]\, dt, \qquad L_g(\hat g) = \int_0^1 \mathbb{E}\big[|\hat g_t(I_t)|^2 - 2\, z \cdot \hat g_t(I_t)\big]\, dt, \tag{7}$$
where $\mathbb{E}$ denotes an expectation over $(x_0, x_1) \sim \rho(x_0, x_1)$ and $z \sim N(0, \mathrm{Id})$ with $(x_0, x_1) \perp z$.

A more general version of this result with a conditioning variable is proven in Appendix A.
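The objectives (7) are simulation-free: a single draw of $(x_0, x_1, z, t)$ yields an unbiased estimate of their integrands. A minimal sketch (ours; the models `b_hat` and `g_hat` and the coefficient dictionary `coeffs` are placeholders, not the authors' implementation) of the resulting per-batch losses:

```python
import torch

def interpolant_losses(b_hat, g_hat, x0, x1, z, t, coeffs):
    """x0, x1, z: tensors of shape (n, d); t: tensor of shape (n, 1).
    coeffs supplies alpha, beta, gamma and their time derivatives as callables."""
    a, b, g = coeffs["alpha"](t), coeffs["beta"](t), coeffs["gamma"](t)
    da, db, dg = coeffs["dalpha"](t), coeffs["dbeta"](t), coeffs["dgamma"](t)
    It = a * x0 + b * x1 + g * z            # interpolant, Eq. (1)
    dIt = da * x0 + db * x1 + dg * z        # its time derivative \dot{I}_t
    bt, gt = b_hat(t, It), g_hat(t, It)
    loss_b = (bt.pow(2).sum(-1) - 2.0 * (dIt * bt).sum(-1)).mean()   # estimates L_b in (7)
    loss_g = (gt.pow(2).sum(-1) - 2.0 * (z * gt).sum(-1)).mean()     # estimates L_g in (7)
    return loss_b, loss_g
```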
The objectives (7) can readily be estimated in practice from samples $(x_0, x_1) \sim \rho(x_0, x_1)$ and $z \sim N(0, \mathrm{Id})$, which will enable us to learn approximations for use in a generative model. The transport equation (5) can be used to derive generative models, as we now show.

Corollary 3.1 (Probability flow and diffusions with coupling). The solutions to the probability flow equation
$$\dot X_t = b_t(X_t) \tag{8}$$
enjoy the properties that
$$X_{t=1} \sim \rho_1(x_1) \quad \text{if} \quad X_{t=0} \sim \rho_0(x_0), \tag{9}$$
$$X_{t=0} \sim \rho_0(x_0) \quad \text{if} \quad X_{t=1} \sim \rho_1(x_1). \tag{10}$$
In addition, for any $\epsilon_t \geq 0$, solutions to the forward SDE
$$dX^F_t = b_t(X^F_t)\, dt - \epsilon_t \gamma_t^{-1} g_t(X^F_t)\, dt + \sqrt{2\epsilon_t}\, dW_t \tag{11}$$
enjoy the property that
$$X^F_{t=1} \sim \rho_1(x_1) \quad \text{if} \quad X^F_{t=0} \sim \rho_0(x_0), \tag{12}$$
and solutions to the backward SDE
$$dX^R_t = b_t(X^R_t)\, dt + \epsilon_t \gamma_t^{-1} g_t(X^R_t)\, dt + \sqrt{2\epsilon_t}\, dW_t \tag{13}$$
enjoy the property that
$$X^R_{t=0} \sim \rho_0(x_0) \quad \text{if} \quad X^R_{t=1} \sim \rho_1(x_1). \tag{14}$$

A more general version of this result with conditioning is also proven in Appendix A. Corollary 3.1 shows that the coupling can be incorporated both in deterministic and stochastic generative models derived within the stochastic interpolant framework. In what follows, for simplicity we will focus on the deterministic probability flow ODE (8). An important observation is that the transport cost of the generative model based on the probability flow ODE (8), which impacts the numerical stability of solving this ODE, is controlled by the time dynamics of the interpolant, as shown by our next result:

Proposition 3.1 (Control of transport cost). Let $X_t(x_0)$ be the solution to the probability flow ODE (8) for the initial condition $X_{t=0}(x_0) = x_0 \sim \rho_0$. Then
$$\mathbb{E}_{x_0 \sim \rho_0}\big[|X_{t=1}(x_0) - x_0|^2\big] \leq \int_0^1 \mathbb{E}\big[|\dot I_t|^2\big]\, dt < \infty. \tag{15}$$

The proof of this proposition is given in Appendix A. Minimizing the left-hand side of (15) would achieve optimal transport in the sense of Benamou-Brenier (Benamou & Brenier, 2000), and the minimum would give the Wasserstein-2 distance between $\rho_0$ and $\rho_1$. Various works seek to minimize this distance procedurally, either by adapting the coupling (Pooladian et al., 2023; Tong et al., 2023) or by optimizing $\rho_t(x)$ (Albergo & Vanden-Eijnden, 2022), at additional cost. Here we introduce designed couplings at no extra cost that can lower the upper bound in (15). This will allow us to show how different couplings enable stricter control of the transport cost in various applications. Let us now discuss a generic instantiation of our formalism involving a specific choice of $\rho(x_0, x_1)$.

3.2. Designing data-dependent couplings

One natural way to allow for a data-dependent coupling between the base and the target is to set
$$\rho(x_0, x_1) = \rho_1(x_1)\rho_0(x_0|x_1) \quad \text{with} \tag{16}$$
$$\int_{\mathbb{R}^d} \rho_0(x_0|x_1)\rho_1(x_1)\, dx_1 = \rho_0(x_0). \tag{17}$$
There are many ways to construct the conditional $\rho_0(x_0|x_1)$. In the numerical experiments in Section 4.1 and Section 4.2, we consider base densities of a variable $x_0$ of the generic form
$$x_0 = m(x_1) + \sigma \zeta, \tag{18}$$
where $m(x_1) \in \mathbb{R}^d$ is some function of $x_1$, possibly random even when conditioned on $x_1$, $\sigma \in \mathbb{R}^{d \times d}$, and $\zeta \sim N(0, \mathrm{Id})$ with $\zeta \perp m(x_1)$. In this set-up, the corrupted observation $m(x_1)$ (a noisy, partial, or low-resolution image) is determined by the task at hand and available to us, but we are free to choose the design of the term $\sigma\zeta$ in (18) in ways that can be exploited differently in various applications (and it is allowed to depend on any conditional information $\xi$). Note in particular that, given $m(x_1)$, (18) is easy to generate at sampling time.
Note also that, if the corrupted observation $m(x_1)$ is deterministic given $x_1$, the conditional probability density of (18) is the Gaussian density with mean $m(x_1)$ and covariance $C = \sigma\sigma^\top$:
$$\rho_0(x_0|x_1) = N(x_0; m(x_1), C). \tag{19}$$
We stress that, even in this case, $\rho(x_0, x_1) = \rho_1(x_1)\rho_0(x_0|x_1)$ and $\rho_0(x_0) = \int_{\mathbb{R}^d} \rho_0(x_0|x_1)\rho_1(x_1)\, dx_1$ are non-Gaussian densities in general. In this context, we can use the interpolant from (1) with $\gamma_t = 0$, which reduces to
$$I_t = \alpha_t\big(m(x_1) + \sigma\zeta\big) + \beta_t x_1. \tag{20}$$
Note that the score associated with (20) is still available because of the factor of $\sigma\zeta$, so long as $\sigma$ is invertible.

3.3. Reducing transport costs via coupling

In the numerical experiments, we will highlight how the construction of a data-dependent coupling enables us to perform various downstream tasks. An additional appeal is that data-dependent couplings facilitate the design of more efficient transport than standard generation from a Gaussian, as we now show. The bound on the transportation cost in (15) may be more tightly controlled by the construction of data-dependent couplings and their associated interpolants. In this case, we seek couplings such that $\mathbb{E}[|\dot I_t|^2]$ is smaller with coupling than without, i.e. such that
$$\int_{\mathbb{R}^{3d}} |\dot I_t|^2\, \rho(x_0, x_1)\rho_z(z)\, dx_0\, dx_1\, dz \;\leq\; \int_{\mathbb{R}^{3d}} |\dot I_t|^2\, \rho_0(x_0)\rho_1(x_1)\rho_z(z)\, dx_0\, dx_1\, dz, \tag{21}$$
where $\dot I_t = \dot\alpha_t x_0 + \dot\beta_t x_1 + \dot\gamma_t z$ is a function of $x_0$, $x_1$, and $z$. A simple way to design such a coupling is to consider (19) with $m(x_1) = x_1$ and $C = \sigma^2 \mathrm{Id}$ for some $\sigma > 0$, which sets the base distribution to be a noisy version of the target. In the case of data-decorruption (which we explore in the numerical experiments), this interpolant directly connects the corrupted conditional density and the uncorrupted density. If we choose $\alpha_t = 1 - t$ and $\beta_t = t$, and set $\gamma_t = 0$, then $\dot I_t = x_1 - x_0$, and the left-hand side of (21) reduces to $\mathbb{E}[|\sigma\zeta|^2] = d\sigma^2$, which is less than the right-hand side given by $2\mathbb{E}[|x_1|^2] + d\sigma^2$.

3.4. Learning and Sampling

To learn in this setup, we can evaluate the objective functions (7) over a minibatch of $n_b < n$ data points $(x_0^i, x_1^i)$ by using an additional $n_b$ samples $z^i \sim N(0, \mathrm{Id})$ and $t_i \sim U([0, 1])$. This leads to the empirical approximation $\hat L_b$ of $L_b$ given by
$$\hat L_b(\hat b) = \frac{1}{n_b} \sum_{i=1}^{n_b} \Big[ |\hat b_{t_i}(I_{t_i})|^2 - 2\, \dot I_{t_i} \cdot \hat b_{t_i}(I_{t_i}) \Big], \tag{22}$$
with a similar empirical variant for $L_g$. We approximate the functions $b_t(x)$ and $g_t(x)$ with neural networks and minimize these empirical objectives with stochastic gradient descent. This leads to an approximation of the velocity $b_t(x)$ via (4) and of the score via (6). Generating data requires sampling $X_{t=0} \sim \rho_0(x_0)$ as an initial condition to be evolved via the probability flow ODE (8) or the forward SDE (11), to respectively produce a sample $X_{t=1} \sim \rho_1(x_1)$ or $X^F_{t=1} \sim \rho_1(x_1)$.

Algorithm 1: Training
Input: interpolant coefficients $\alpha_t, \beta_t$; velocity model $\hat b$; batch size $n_b$.
repeat
  for $i = 1, \dots, n_b$ do
    draw $x_1^i \sim \rho_1(x_1)$, $\zeta^i \sim N(0, \mathrm{Id})$, $t_i \sim U(0, 1)$
    compute $x_0^i = m(x_1^i) + \sigma \zeta^i$
    compute $I_{t_i} = \alpha_{t_i} x_0^i + \beta_{t_i} x_1^i$
  end for
  compute the empirical loss $\hat L_b(\hat b) = n_b^{-1} \sum_{i=1}^{n_b} \big[ |\hat b_{t_i}(I_{t_i})|^2 - 2\, \dot I_{t_i} \cdot \hat b_{t_i}(I_{t_i}) \big]$
  take a gradient step on $\hat L_b(\hat b)$ to update $\hat b$
until converged
Return: velocity $\hat b$.

Algorithm 2: Sampling (via the forward Euler method)
Input: model $\hat b$; corrupted sample $m(x_1)$; number of steps $N \in \mathbb{N}$.
draw noise $\zeta \sim N(0, \mathrm{Id})$
initialize $\hat X_0 = m(x_1) + \sigma\zeta$
for $n = 0, \dots, N - 1$ do
  $\hat X_{n+1} = \hat X_n + N^{-1}\, \hat b_{n/N}(\hat X_n)$
end for
Return: clean sample $\hat X_N$.
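For reference, a minimal PyTorch rendering (ours, not the released code; the corruption `m`, the velocity network `b_hat`, and the step count are placeholders) of Algorithms 1 and 2 for the de-corruption coupling $x_0 = m(x_1) + \sigma\zeta$ with $\alpha_t = 1-t$, $\beta_t = t$, $\gamma_t = 0$:

```python
import torch

def train_step(b_hat, optimizer, x1, m, sigma):
    """One step of Algorithm 1. x1: a minibatch of flattened samples, shape (n, d)."""
    t = torch.rand(x1.shape[0], 1)
    x0 = m(x1) + sigma * torch.randn_like(x1)       # x0 ~ rho_0(x0 | x1), Eq. (18)
    It = (1.0 - t) * x0 + t * x1                    # interpolant with gamma_t = 0
    dIt = x1 - x0                                   # \dot{I}_t for alpha_t = 1 - t, beta_t = t
    bt = b_hat(t, It)
    loss = (bt.pow(2).sum(-1) - 2.0 * (dIt * bt).sum(-1)).mean()   # empirical loss (22)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def euler_sample(b_hat, corrupted, sigma, n_steps=100):
    """Algorithm 2: forward Euler integration of the probability flow ODE (8),
    initialized at x0 = m(x1) + sigma * zeta; `corrupted` plays the role of m(x1)."""
    X = corrupted + sigma * torch.randn_like(corrupted)
    dt = 1.0 / n_steps
    for n in range(n_steps):
        t = torch.full((X.shape[0], 1), n * dt)
        X = X + dt * b_hat(t, X)
    return X
```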
Sampling an $x_0$ can be performed by picking a data point $x_1$, either from the data set or from some online data acquisition procedure, and using it in (18), or by using the assumption that one directly observes $x_0 \sim \rho_0(x_0)$ at inference time (e.g., one receives a partial image). The generated samples from either the probability flow ODE or the forward SDE will be different from $x_1$, even with the choices $m(x_1) = x_1$ and $C = \sigma^2\mathrm{Id}$. The probability flow ODE necessarily produces a single sample of $x_1$ for each $x_0$, while the SDE produces a collection of samples whose spread can be controlled by the diffusion coefficient $\epsilon_t$. Algorithms 1 and 2 depict these training and sampling procedures, respectively.

4. Numerical experiments

We now explore the interpolants with data-dependent couplings on conditional image generation tasks; we find that the framework is straightforward to scale to high-resolution images directly in pixel space.

4.1. In-painting

We consider an in-painting task, whereby $x_1 \in \mathbb{R}^{C \times W \times H}$ denotes an image with $C$ channels, width $W$, and height $H$. Given a pre-specified mask, the goal is to fill the pixels in the masked region with new values that are consistent with the entirety of the image. We set the conditioning variable $\xi \in \{0, 1\}^{C \times W \times H}$ and additionally provide the model with any potential class labels. For simplicity, the mask takes the same value for all channels at a given spatial location in the image. We define the base density by the relation $x_0 = \xi \odot x_1 + (1 - \xi) \odot \zeta$, where $\odot$ denotes the Hadamard (elementwise) product and $\zeta \in \mathbb{R}^{C \times W \times H}$, $\zeta \sim N(0, \mathrm{Id})$, denotes random noise used to initialize the pixels within the masked region (with separate noise for each channel). During training, the mask is drawn randomly by tiling the image into 64 tiles; each tile is selected to enter the mask with probability $p = 0.3$. In our experiments, we set $\rho_1(x_1)$ to correspond to ImageNet (either 256 or 512). This corresponds to using $\rho(x_0, x_1|\xi) = \rho_1(x_1)\rho_0(x_0|x_1, \xi)$. The model sees the mask; we note that we do not need to additionally input the partial image as extra conditioning, because it is present, uncorrupted, in $x_t$ for each $t$: the values are present in both $x_0$ and $x_1$. In the interpolant (20), we set $\alpha_t = 1 - t$ and $\beta_t = t$. In this setup, the velocity field $b_t(x, \xi)$ is such that $b_t(x, \xi) = 0$ except in the masked regions. This follows because $\xi \odot I_t = \xi \odot x_1$ for every $t$, i.e., the unmasked pixels in $I_t$ are always those of $x_1$, for which $\dot I_t = 0$. To take this structural information into account, we can build this property into our neural network model, and mask the output of the approximate velocity field to enforce that the unmasked pixels remain fixed. We note that this method does not necessitate any inference-time corrections, such as the replacement method or MCMC.

Results. For implementation, we parameterize $b_t(x, \xi)$ using the basic U-Net architecture from (Ho et al., 2020b), where $\xi$ is given to the model as channels appended to the image $x$. Additional specific experimental details may be found in Appendix B. Samples are shown in Figure 3, as well as in Section 1. FIDs are reported in Table 2. As discussed, the missing areas of the image are defined at time zero as independent normal random variables, depicted as colorful static in the images. In each image triple, the left panel is the base distribution sample $x_0$, the middle is the model sample $X_{t=1}$ obtained by integrating the probability flow ODE (8), and the right panel is the ground truth.
The generated textures, though different from the full sample, correspond to realistic samples from the conditional densities given the observed content. This is an advantage of probabilistic generative models such as ours over models optimized to fit a mean-square error to a ground-truth image.

4.2. Super-resolution on ImageNet

We now consider image super-resolution, in which we would like to produce an image with the same content as a given image but at higher resolution. To this end, we let $x_1 \in \mathbb{R}^{C \times W \times H}$ correspond to a high-resolution image, as in Section 4.1.

Table 2: FID for the in-painting task. FID comparison between two paradigms: a baseline, where $\rho_0$ is a Gaussian with an independent coupling to $\rho_1$, and our data-dependent coupling detailed in Section 4.1.

| Model | FID-50k |
|---|---|
| Uncoupled Interpolant (Baseline) | 1.35 |
| Dependent Coupling (Ours) | 1.13 |

Table 3: FID-50k for super-resolution, 64×64 to 256×256. FIDs for baselines taken from (Saharia et al., 2022; Ho et al., 2022a; Liu et al., 2023a).

| Model | Train | Valid |
|---|---|---|
| Improved DDPM (Nichol & Dhariwal, 2021) | 12.26 | - |
| SR3 (Saharia et al., 2022) | 11.30 | 5.20 |
| ADM (Dhariwal & Nichol, 2021) | 7.49 | 3.10 |
| Cascaded Diffusion (Ho et al., 2022a) | 4.88 | 4.63 |
| I2SB (Liu et al., 2023a) | - | 2.70 |
| Dependent Coupling (Ours) | 2.13 | 2.05 |

We denote by $D : \mathbb{R}^{C \times W \times H} \to \mathbb{R}^{C \times W_{\text{low}} \times H_{\text{low}}}$ and $U : \mathbb{R}^{C \times W_{\text{low}} \times H_{\text{low}}} \to \mathbb{R}^{C \times W \times H}$ image downsampling and upsampling operations, where $W_{\text{low}}$ and $H_{\text{low}}$ denote the width and height of a low-resolution image. To define the base density, we then set $x_0 = U(D(x_1)) + \sigma\zeta$ with $\zeta \in \mathbb{R}^{C \times W \times H}$, $\zeta \sim N(0, \mathrm{Id})$, and $\sigma > 0$. Defining $x_0$ in this way frames the transport problem such that each starting pixel is proximal to its intended target. Notice in particular that, with $\sigma = 0$, each $x_0$ would correspond to a lower-dimensional sample embedded in a higher-dimensional space, and the corresponding distribution would be concentrated on a lower-dimensional manifold. Working with $\sigma > 0$ alleviates the associated singularities by adding a small amount of Gaussian noise to smooth the base density so that it is well-defined over the entire higher-dimensional ambient space. In addition, we give the model access to the low-resolution image at all times; this problem setting then corresponds to using $\rho(x_0, x_1|\xi) = \rho_1(x_1)\rho_0(x_0|x_1, \xi)$ with $\xi = U(D(x_1))$. In the experiments, we set $\rho_1$ to correspond to ImageNet (256 or 512), following prior work (Saharia et al., 2022; Ho et al., 2022a).

Results. Similarly to the previous experiment, we append the upsampled low-resolution images $\xi$ to the channel dimension of the input $x$ of the velocity model, and likewise include the ImageNet class labels. Samples are displayed in Fig. 4, as well as in Section 1. Similar in layout to the previous experiment, the left panel of each triplet is the low-resolution image, the middle panel is the model sample $X_{t=1}$, and the right panel is the high-resolution image. The differences are easiest to see when zoomed in. While the increased resolution of the model sample is very noticeable for 64 to 256, the differences even between ground-truth images at 256 and 512 are more subtle. We also report FIDs for the 64×64 to 256×256 task, which has been studied in other works, in Table 3.

5. Discussion, challenges, and future work

In this work, we introduced a general framework for constructing data-dependent couplings between base and target densities within the stochastic interpolant formalism.

Figure 3: Image in-painting on ImageNet-256×256 and ImageNet-512×512.
Top panels: six examples of image in-filling at resolution 256×256, where the left columns display masked images, the center corresponds to in-filled model samples, and the right shows full reference images. The aim is not to recover the precise content of the reference image, but instead to provide a conditionally valid in-filling. Bottom panels: four examples at resolution 512×512.

Figure 4: Super-resolution. Top four rows: super-resolved images from resolution 64×64 → 256×256, where the left-most image is the lower-resolution version, the middle is the model output, and the right is the ground truth. Examples for 256×256 → 512×512 are given in Fig. 6.

We provide some suggestions for specific forms of data-dependent coupling, such as choosing for $\rho_0$ a Gaussian distribution with mean and covariance adapted to samples from the target, and showed how they can be used in practical problem settings such as image in-painting and super-resolution. There are many interesting generative modeling problems that stand to benefit from the incorporation of data-dependent structure. In the sciences, one potential application is in molecule generation, where we can imagine using data-dependent base distributions to fix a chemical backbone and vary functional groups. The dependency and conditioning structure needed to accomplish a task like this is similar to image in-painting. In machine learning, one potential application is in correcting autoencoding errors produced by an architecture such as a variational autoencoder (Kingma & Welling, 2013), where we could take the target density to be inputs to the autoencoder and the base density to be the output of the autoencoder.

Acknowledgements

We thank Raghav Singhal for insightful discussions. MG and RR are partly supported by the NIH/NHLBI Award R01HL148248, NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science, NSF CAREER Award 2145542, ONR N00014-23-12634, and Apple. MSA and NMB are funded by the ONR project entitled Mathematical Foundation and Scientific Applications of Machine Learning. EVE is supported by the National Science Foundation under Awards DMR-1420073, DMS-2012510, and DMS-2134216, by the Simons Collaboration on Wave Turbulence, Grant No. 617006, and by a Vannevar Bush Faculty Fellowship.

Impact Statement

While this paper presents work whose goal is to advance the field of machine learning, and there are many potential societal consequences of our work, we wish to highlight that generative models, as they are currently used, pose the risk of perpetuating harmful biases and stereotypes.

References

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375-393, 2000. doi: 10.1007/s002110050002. URL https://doi.org/10.1007/s002110050002.

Chen, R. T. and Lipman, Y. Riemannian flow matching on general geometries. arXiv preprint arXiv:2302.03660, 2023.

Chen, R. T. Q. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
Dao, Q., Phung, H., Nguyen, B., and Tran, A. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 17695-17709. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/940392f5f32a7ade1cc201767cf83e31-Paper.pdf.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/7ac71d433f282034e088473244df8c02-Paper.pdf.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840-6851. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020b.

Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249-2281, 2022a.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b.

Hu, V. T., Zhang, D. W., Tang, M., Mettes, P., Zhao, D., and Snoek, C. G. Latent space editing in transformer-based flow matching. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Klein, L., Krämer, A., and Noé, F. Equivariant flow matching, 2023.

Lee, S., Kim, B., and Ye, J. C. Minimizing trajectory curvature of ODE-based generative models. arXiv preprint arXiv:2301.12003, 2023.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022a.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling, 2022b. URL https://arxiv.org/abs/2210.02747.

Liu, G.-H., Vahdat, A., Huang, D.-A., Theodorou, E. A., Nie, W., and Anandkumar, A. I2SB: Image-to-image Schrödinger bridge. arXiv preprint arXiv:2302.05872, 2023a.

Liu, Q. Rectified flow: A marginal preserving approach to optimal transport, 2022. URL https://arxiv.org/abs/2209.14577.
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022a. URL https://arxiv.org/abs/2209.03003.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022b.

Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023b.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021.

Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. Multisample flow matching: Straightening flows with minibatch couplings. arXiv preprint arXiv:2304.14772, 2023.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530-1538. PMLR, June 2015.

Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713-4726, 2022.

Schneuing, A., Du, Y., Harris, C., Jamasb, A., Igashov, I., Du, W., Blundell, T., Lió, P., Gomes, C., Welling, M., et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695, 2022.

Shi, Y., Bortoli, V. D., Deligiannidis, G., and Doucet, A. Conditional simulation using diffusion Schrödinger bridges. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. URL https://openreview.net/forum?id=H9Lu6P8sqec.

Shi, Y., Bortoli, V. D., Campbell, A., and Doucet, A. Diffusion Schrödinger bridge matching, 2023.

Singhal, R., Goldstein, M., and Ranganath, R. Where to diffuse, how to diffuse, and how to get back: Automated learning for multivariate diffusions. In The Eleventh International Conference on Learning Representations, 2023.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Somnath, V. R., Pariset, M., Hsieh, Y.-P., Martinez, M. R., Krause, A., and Bunne, C. Aligned diffusion Schrödinger bridges. In The 39th Conference on Uncertainty in Artificial Intelligence, 2023. URL https://openreview.net/forum?id=BkWFJN7_bQ.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438-12448, 2020.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 1415-1428. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper/2021/file/0a9fdbb17feb6ccb7ec405cfb85222c4-Paper.pdf.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145-164, 2013. doi: 10.1002/cpa.21423. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21423.

Tabak, E. G. and Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010. ISSN 1539-6746, 1945-0796. doi: 10.4310/CMS.2010.v8.n1.a11.

Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.

Trippe, B. L., Yim, J., Tischer, D., Baker, D., Broderick, T., Barzilay, R., and Jaakkola, T. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119, 2022.

Wu, L., Trippe, B. L., Naesseth, C. A., Blei, D. M., and Cunningham, J. P. Practical and asymptotically exact conditional sampling in diffusion models. arXiv preprint arXiv:2306.17775, 2023.

A. Omitted proofs with conditioning variables incorporated

In this appendix we give the proofs of Theorem 3.1 and Corollary 3.1 in a more general setup in which we incorporate conditioning variables in the definition of the stochastic interpolant. To this end, suppose that each data point $x_1^i \in \mathbb{R}^d$ in the data set comes with a label $\xi^i \in D$, such as a discrete class or a continuous embedding like a text caption, and let us assume that this data set comes from a PDF decomposed as $\rho_1(x_1|\xi)\eta(\xi)$, where $\rho_1(x_1|\xi)$ is the density of the data $x_1$ conditioned on their label $\xi$, and $\eta(\xi)$ is the density of the label. In the following, we will somewhat abuse notation and use $\eta(\xi)$ even when $\xi$ is discrete (in which case $\eta(\xi)$ is a sum of Dirac measures); we will however assume that $\rho_1(x_1|\xi)$ is a proper density. In this setup we can generalize Definition 3.1 as:

Definition A.1 (Stochastic interpolant with coupling and conditioning). The stochastic interpolant $I_t$ is the stochastic process defined as
$$I_t = \alpha_t x_0 + \beta_t x_1 + \gamma_t z, \qquad t \in [0, 1], \tag{23}$$
where $\alpha_t$, $\beta_t$, and $\gamma_t^2$ are differentiable functions of time such that $\alpha_0 = \beta_1 = 1$, $\alpha_1 = \beta_0 = \gamma_0 = \gamma_1 = 0$, and $\alpha_t^2 + \beta_t^2 + \gamma_t^2 > 0$ for all $t \in [0, 1]$; the pair $(x_0, x_1)$ is jointly drawn from a conditional probability density $\rho(x_0, x_1|\xi)$ such that
$$\int_{\mathbb{R}^d} \rho(x_0, x_1|\xi)\, dx_1 = \rho_0(x_0|\xi), \tag{24}$$
$$\int_{\mathbb{R}^d} \rho(x_0, x_1|\xi)\, dx_0 = \rho_1(x_1|\xi); \tag{25}$$
and $z \sim N(0, \mathrm{Id})$, independent of $(x_0, x_1, \xi)$.

Similarly, the functions (4) become
$$b_t(x, \xi) = \mathbb{E}(\dot I_t \mid I_t = x, \xi), \qquad g_t(x, \xi) = \mathbb{E}(z \mid I_t = x, \xi), \tag{26}$$
where $\mathbb{E}(\,\cdot \mid I_t = x, \xi)$ denotes the expectation over $\rho(x_0, x_1|\xi)$ conditional on $I_t = x$, and Theorem 3.1 becomes:

Theorem A.1 (Transport equation with coupling and conditioning). The probability distribution of the stochastic interpolant $I_t$ specified by Definition A.1 has a density $\rho_t(x|\xi)$ that satisfies $\rho_{t=0}(x|\xi) = \rho_0(x|\xi)$ and $\rho_{t=1}(x|\xi) = \rho_1(x|\xi)$, and solves the transport equation
$$\partial_t \rho_t(x|\xi) + \nabla \cdot \big(b_t(x, \xi)\rho_t(x|\xi)\big) = 0, \tag{27}$$
where the velocity field is given in (26). Moreover, for every $t$ such that $\gamma_t \neq 0$, the following identity for the score holds:
$$\nabla \log \rho_t(x|\xi) = -\gamma_t^{-1} g_t(x, \xi). \tag{28}$$
The functions $b$ and $g$ are the unique minimizers of the objectives
$$L_b(\hat b) = \int_0^1 \mathbb{E}\big[|\hat b_t(I_t, \xi)|^2 - 2\, \dot I_t \cdot \hat b_t(I_t, \xi)\big]\, dt, \qquad L_g(\hat g) = \int_0^1 \mathbb{E}\big[|\hat g_t(I_t, \xi)|^2 - 2\, z \cdot \hat g_t(I_t, \xi)\big]\, dt, \tag{29}$$
where $\mathbb{E}$ denotes an expectation over $(x_0, x_1) \sim \rho(x_0, x_1|\xi)$, $\xi \sim \eta(\xi)$, and $z \sim N(0, \mathrm{Id})$.
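For concreteness, the in-painting coupling of Section 4.1 is one instance of Definition A.1, with $\xi$ a binary mask; the sketch below (ours, not the paper's code; the per-pixel mask distribution is a simplification of the tiled masks used in the experiments) draws $(x_0, \xi)$ and forms the conditional interpolant with $\alpha_t = 1-t$, $\beta_t = t$, $\gamma_t = 0$.

```python
import torch

def inpainting_pair(x1, p_keep=0.7):
    """x1: images of shape (n, C, H, W). Returns (x0, xi) with
    x0 = xi * x1 + (1 - xi) * zeta and zeta ~ N(0, Id)."""
    n, C, H, W = x1.shape
    xi = (torch.rand(n, 1, H, W) < p_keep).float().expand(n, C, H, W)  # same mask for all channels
    zeta = torch.randn_like(x1)
    x0 = xi * x1 + (1.0 - xi) * zeta
    return x0, xi

def conditional_interpolant(x0, x1, t):
    """t: tensor of shape (n, 1, 1, 1). On the unmasked pixels I_t equals x1 for every t,
    so the conditional velocity b_t(x, xi) vanishes there."""
    return (1.0 - t) * x0 + t * x1
```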
Note that the objectives (29) can readily be estimated in practice from samples $(x_0, x_1) \sim \rho(x_0, x_1|\xi)$, $z \sim N(0, \mathrm{Id})$, and $\xi \sim \eta(\xi)$, which will enable us to learn approximations for use in a generative model.

Proof. By definition of the stochastic interpolant given in (23), its characteristic function is given by
$$\mathbb{E}\big[e^{ik \cdot I_t}\big] = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{ik \cdot (\alpha_t x_0 + \beta_t x_1)}\, \rho(x_0, x_1|\xi)\, dx_0\, dx_1 \; e^{-\frac{1}{2}\gamma_t^2 |k|^2}, \tag{30}$$
where we used $z \perp (x_0, x_1)$ and $z \sim N(0, \mathrm{Id})$. The smoothness in $k$ of (30) guarantees that the distribution of $I_t$ has a density $\rho_t(x|\xi) > 0$ globally. By definition of $I_t$, this density $\rho_t(x|\xi)$ satisfies, for any suitable test function $\phi : \mathbb{R}^d \to \mathbb{R}$,
$$\int_{\mathbb{R}^d} \phi(x)\, \rho_t(x|\xi)\, dx = \int_{\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d} \phi(I_t)\, \rho(x_0, x_1|\xi)\, (2\pi)^{-d/2} e^{-\frac{1}{2}|z|^2}\, dx_0\, dx_1\, dz, \tag{31}$$
where $I_t = \alpha_t x_0 + \beta_t x_1 + \gamma_t z$. Taking the time derivative of both sides,
$$\begin{aligned}
\int_{\mathbb{R}^d} \phi(x)\, \partial_t\rho_t(x|\xi)\, dx
&= \int_{\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d} \big(\dot\alpha_t x_0 + \dot\beta_t x_1 + \dot\gamma_t z\big) \cdot \nabla\phi(I_t)\, \rho(x_0, x_1|\xi)\, (2\pi)^{-d/2} e^{-\frac{1}{2}|z|^2}\, dx_0\, dx_1\, dz \\
&= \int_{\mathbb{R}^d} \mathbb{E}\big[\big(\dot\alpha_t x_0 + \dot\beta_t x_1 + \dot\gamma_t z\big) \cdot \nabla\phi(I_t) \,\big|\, I_t = x\big]\, \rho_t(x|\xi)\, dx \\
&= \int_{\mathbb{R}^d} \mathbb{E}\big[\dot\alpha_t x_0 + \dot\beta_t x_1 + \dot\gamma_t z \,\big|\, I_t = x\big] \cdot \nabla\phi(x)\, \rho_t(x|\xi)\, dx,
\end{aligned} \tag{32}$$
where we used the chain rule to get the first equality, the definition of the conditional expectation to get the second, and the fact that $\nabla\phi(I_t) = \nabla\phi(x)$ conditioned on $I_t = x$ to get the third. Since
$$\mathbb{E}\big[\dot\alpha_t x_0 + \dot\beta_t x_1 + \dot\gamma_t z \,\big|\, I_t = x\big] = b_t(x, \xi) \tag{33}$$
by the definition of $b$ in (26), we can therefore write (32) as
$$\int_{\mathbb{R}^d} \phi(x)\, \partial_t\rho_t(x|\xi)\, dx = \int_{\mathbb{R}^d} b_t(x, \xi) \cdot \nabla\phi(x)\, \rho_t(x|\xi)\, dx. \tag{34}$$
This equation is (27) written in weak form. To establish (28), note that if $\gamma_t > 0$, we have
$$\mathbb{E}\big[z\, e^{i\gamma_t k \cdot z}\big] = \gamma_t^{-1}(-i\nabla_k)\, \mathbb{E}\big[e^{i\gamma_t k \cdot z}\big] = \gamma_t^{-1}(-i\nabla_k)\, e^{-\frac{1}{2}\gamma_t^2|k|^2} = i\gamma_t k\, e^{-\frac{1}{2}\gamma_t^2|k|^2}. \tag{35}$$
As a result, using $z \perp (x_0, x_1)$, we have
$$\mathbb{E}\big[z\, e^{ik \cdot I_t}\big] = i\gamma_t k\, \mathbb{E}\big[e^{ik \cdot I_t}\big]. \tag{36}$$
Using the properties of the conditional expectation, the left-hand side of this equation can be written as
$$\mathbb{E}\big[z\, e^{ik \cdot I_t}\big] = \int_{\mathbb{R}^d} \mathbb{E}\big[z\, e^{ik \cdot I_t} \,\big|\, I_t = x\big]\, \rho_t(x|\xi)\, dx = \int_{\mathbb{R}^d} \mathbb{E}[z \mid I_t = x]\, e^{ik \cdot x}\, \rho_t(x|\xi)\, dx = \int_{\mathbb{R}^d} g_t(x, \xi)\, e^{ik \cdot x}\, \rho_t(x|\xi)\, dx, \tag{37}$$
where we used the definition of $g$ in (26) to get the last equality. Since the right-hand side of (36) is the Fourier transform of $-\gamma_t \nabla\rho_t(x|\xi)$, we deduce that
$$g_t(x, \xi)\, \rho_t(x|\xi) = -\gamma_t \nabla\rho_t(x|\xi) = -\gamma_t \nabla\log\rho_t(x|\xi)\, \rho_t(x|\xi). \tag{38}$$
Since $\rho_t(x|\xi) > 0$, this implies (28) when $\gamma_t > 0$. Finally, to derive (29), notice that we can write
$$\begin{aligned}
L_b(\hat b) &= \int_0^1 \mathbb{E}\big[|\hat b_t(I_t, \xi)|^2 - 2\, \dot I_t \cdot \hat b_t(I_t, \xi)\big]\, dt \\
&= \int_0^1 \int_{\mathbb{R}^d} \mathbb{E}\big[|\hat b_t(I_t, \xi)|^2 - 2\, \dot I_t \cdot \hat b_t(I_t, \xi) \,\big|\, I_t = x\big]\, \rho_t(x|\xi)\, dx\, dt \\
&= \int_0^1 \int_{\mathbb{R}^d} \big[|\hat b_t(x, \xi)|^2 - 2\, \mathbb{E}[\dot I_t \mid I_t = x] \cdot \hat b_t(x, \xi)\big]\, \rho_t(x|\xi)\, dx\, dt \\
&= \int_0^1 \int_{\mathbb{R}^d} \big[|\hat b_t(x, \xi)|^2 - 2\, b_t(x, \xi) \cdot \hat b_t(x, \xi)\big]\, \rho_t(x|\xi)\, dx\, dt,
\end{aligned} \tag{39}$$
where we used the definition of $b$ in (26). The unique minimizer of this objective function is $\hat b_t(x, \xi) = b_t(x, \xi)$, and we can proceed similarly to show that the unique minimizer of $L_g(\hat g)$ is $\hat g_t(x, \xi) = g_t(x, \xi)$.

Theorem A.1 implies the following generalization of Corollary 3.1:

Corollary A.1 (Probability flow and diffusions with coupling and conditioning). The solutions to the probability flow equation
$$\dot X_t = b_t(X_t, \xi) \tag{40}$$
enjoy the properties that
$$X_{t=1} \sim \rho_1(x_1|\xi) \quad \text{if} \quad X_{t=0} \sim \rho_0(x_0|\xi), \tag{41}$$
$$X_{t=0} \sim \rho_0(x_0|\xi) \quad \text{if} \quad X_{t=1} \sim \rho_1(x_1|\xi). \tag{42}$$
In addition, for any $\epsilon_t \geq 0$, solutions to the forward SDE
$$dX^F_t = b_t(X^F_t, \xi)\, dt - \epsilon_t \gamma_t^{-1} g_t(X^F_t, \xi)\, dt + \sqrt{2\epsilon_t}\, dW_t \tag{43}$$
enjoy the property that
$$X^F_{t=1} \sim \rho_1(x_1|\xi) \quad \text{if} \quad X^F_{t=0} \sim \rho_0(x_0|\xi), \tag{44}$$
and solutions to the backward SDE
$$dX^R_t = b_t(X^R_t, \xi)\, dt + \epsilon_t \gamma_t^{-1} g_t(X^R_t, \xi)\, dt + \sqrt{2\epsilon_t}\, dW_t \tag{45}$$
enjoy the property that
$$X^R_{t=0} \sim \rho_0(x_0|\xi) \quad \text{if} \quad X^R_{t=1} \sim \rho_1(x_1|\xi). \tag{46}$$
Note that if we additionally draw $\xi$ marginally from $\eta(\xi)$ when we generate the solutions to these equations, we can also generate samples from the unconditional densities $\rho_0(x_0) = \int_D \rho_0(x_0|\xi)\eta(\xi)\, d\xi$ and $\rho_1(x_1) = \int_D \rho_1(x_1|\xi)\eta(\xi)\, d\xi$.

Proof.
The probability flow ODE is the characteristic equation of the transport equation (27), which proves the statement about its solutions $X_t$. To establish the statement about the solutions of the forward SDE (43), use expression (28) for $\nabla\log\rho_t(x|\xi)$ together with the identity $\Delta\rho_t(x|\xi) = \nabla \cdot (\nabla\log\rho_t(x|\xi)\, \rho_t(x|\xi))$ to write (27) as the forward Fokker-Planck equation
$$\partial_t\rho_t(x|\xi) + \nabla \cdot \big((b_t(x, \xi) - \epsilon_t\gamma_t^{-1} g_t(x, \xi))\rho_t(x|\xi)\big) = \epsilon_t \Delta\rho_t(x|\xi), \tag{47}$$
to be solved forward in time since $\epsilon_t > 0$. To establish the statement about the solutions of the backward SDE (45), proceed similarly to write (27) as the backward Fokker-Planck equation
$$\partial_t\rho_t(x|\xi) + \nabla \cdot \big((b_t(x, \xi) + \epsilon_t\gamma_t^{-1} g_t(x, \xi))\rho_t(x|\xi)\big) = -\epsilon_t \Delta\rho_t(x|\xi), \tag{48}$$
to be solved backward in time since $\epsilon_t > 0$.

The generative model arising from Corollary 3.1 has an associated transport cost, which is the subject of Proposition 3.1:

Proposition 3.1 (Control of transport cost). Let $X_t(x_0)$ be the solution to the probability flow ODE (8) for the initial condition $X_{t=0}(x_0) = x_0 \sim \rho_0$. Then
$$\mathbb{E}_{x_0 \sim \rho_0}\big[|X_{t=1}(x_0) - x_0|^2\big] \leq \int_0^1 \mathbb{E}\big[|\dot I_t|^2\big]\, dt < \infty. \tag{15}$$

Proof. We have
$$\mathbb{E}_{x_0 \sim \rho_0}\big[|X_{t=1}(x_0) - x_0|^2\big] = \mathbb{E}_{x_0 \sim \rho_0}\Big[\Big|\int_0^1 b_t(X_t(x_0))\, dt\Big|^2\Big] \leq \int_0^1 \mathbb{E}_{x_0 \sim \rho_0}\big[|b_t(X_t(x_0))|^2\big]\, dt = \int_0^1 \mathbb{E}\big[|b_t(I_t)|^2\big]\, dt, \tag{49}$$
where we used the probability flow equation (8) for $X_t$ and the property that the law of $X_t(x_0)$ with $x_0 \sim \rho_0$ coincides with that of $I_t$. Using the definition of $b_t(x)$ in (4) and Jensen's inequality, we have that
$$\mathbb{E}\big[|b_t(I_t)|^2\big] = \mathbb{E}\big[\big|\mathbb{E}[\dot I_t \mid I_t]\big|^2\big] \leq \mathbb{E}\big[\mathbb{E}[|\dot I_t|^2 \mid I_t]\big] = \mathbb{E}\big[|\dot I_t|^2\big], \tag{50}$$
where the last equality is true by the tower property of the conditional expectation. Combining (49) and (50) establishes the bound in (15).

B. Further experimental details

Architecture. For the velocity model we use the U-Net from (Ho et al., 2020b) as implemented in lucidrains' denoising-diffusion-pytorch repository; this variant of the architecture includes embeddings to condition on class labels. We use the following hyperparameters:

- Dim mults: (1, 1, 2, 3, 4)
- Dim (channels): 256
- ResNet block groups: 8
- Learned sinusoidal cond: True
- Learned sinusoidal dim: 32
- Attention dim head: 64
- Attention heads: 4
- Random Fourier features: False

Image-shaped conditioning in the U-Net. For image-shaped conditioning, we follow (Ho et al., 2022a) and append upsampled low-resolution images to the input $x_t$ of the velocity model at each time step. We also condition on the missingness masks for in-painting by appending them to $x_t$.

Optimization. We use the Adam optimizer (Kingma & Ba, 2014), starting at learning rate 2e-4 with the StepLR scheduler, which scales the learning rate by $\gamma = 0.99$ every $N = 1000$ steps. We use no weight decay. We clip gradient norms at 10,000 (this is the norm of the entire set of parameters taken as a vector, the default type of norm clipping in the PyTorch library).

Integration for sampling. We use the Dopri solver from the torchdiffeq library (Chen, 2018).

Miscellaneous. We use the PyTorch library along with Lightning Fabric to handle parallelism. Below we include additional experimental illustrations in the flavor of the figures in the main text.

Figure 5: Additional examples of in-filling on the 256×256 resolution images, with temporal slices of the probability flow.

Figure 6: Super-resolution. Top four rows: super-resolved images from resolution 256×256 → 512×512, where the left-most image is the lower-resolution version, the middle is the model output, and the right is the ground truth.
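To complement the details above, a minimal sketch (ours; the downsampling factor, $\sigma$, and the interpolation modes are illustrative assumptions, not the paper's exact operators) of the super-resolution base $x_0 = U(D(x_1)) + \sigma\zeta$ from Section 4.2 and of the channel-wise image conditioning described in this appendix:

```python
import torch
import torch.nn.functional as F

def superres_base(x1, factor=4, sigma=0.1):
    """x1: high-resolution images of shape (n, C, H, W)."""
    low = F.interpolate(x1, scale_factor=1.0 / factor, mode="area")            # D(x1)
    xi = F.interpolate(low, size=x1.shape[-2:], mode="bilinear",
                       align_corners=False)                                    # U(D(x1))
    x0 = xi + sigma * torch.randn_like(x1)   # Gaussian smoothing keeps the base full-dimensional
    return x0, xi

def with_image_conditioning(x, xi):
    # Append the upsampled low-resolution image to the model input along the
    # channel dimension, as is done for the velocity model at every time step.
    return torch.cat([x, xi], dim=1)
```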