Published as a conference paper at ICLR 2024

MULTIMARGINAL GENERATIVE MODELING WITH STOCHASTIC INTERPOLANTS

Michael S. Albergo
Center for Cosmology and Particle Physics, New York University, New York, NY 10003, USA
albergo@nyu.edu

Nicholas M. Boffi
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
boffi@cims.nyu.edu

Michael Lindsey
Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720, USA
lindsey@math.berkeley.edu

Eric Vanden-Eijnden
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
eve2@cims.nyu.edu

ABSTRACT

Given a set of K probability densities, we consider the multimarginal generative modeling problem of learning a joint distribution that recovers these densities as marginals. The structure of this joint distribution should identify multi-way correspondences among the prescribed marginals. We formalize an approach to this task within a generalization of the stochastic interpolant framework, leading to efficient learning algorithms built upon dynamical transport of measure. Our generative models are defined by velocity and score fields that can be characterized as the minimizers of simple quadratic objectives, and they are defined on a simplex that generalizes the time variable in the usual dynamical transport framework. The resulting transport on the simplex is influenced by all marginals, and we show that multi-way correspondences can be extracted. The identification of such correspondences has applications to style transfer, algorithmic fairness, and data decorruption. In addition, the multimarginal perspective enables an efficient algorithm for reducing the dynamical transport cost in the ordinary two-marginal setting. We demonstrate these capacities with several numerical examples.

1 INTRODUCTION

Generative models built upon dynamical transport of measure, in which two probability densities are connected by a learnable transformation, underlie many recent advances in unsupervised learning (Rombach et al., 2022; Dhariwal & Nichol, 2021). Contemporary methods such as normalizing flows (Rezende & Mohamed, 2015) and diffusions (Song et al., 2021b) transform samples from one density $\rho_0$ into samples from another density $\rho_1$ through an ordinary or stochastic differential equation (ODE/SDE). In such frameworks one must learn the velocity field defining the ODE/SDE. One effective algorithm for learning the velocity is based on the construction of a stochastic interpolant, a stochastic process that interpolates between the two probability densities at the level of the individual samples. The velocity can then be characterized conveniently as the solution of a tractable square-loss regression problem.

In conventional generative modeling, one chooses $\rho_0$ to be an analytically tractable reference density, such as the standard normal density, while $\rho_1$ is some target density of interest, accessible only through a dataset of samples. In this setting, a general stochastic interpolant $x_t$ can be written as

$x_t = \alpha_0(t)\, x_0 + \alpha_1(t)\, x_1,$   (1)

where $x_0 \sim \rho_0$ and $x_1 \sim \rho_1$, and we allow for the possibility of dependence between $x_0$ and $x_1$. Meanwhile $\alpha_0(t)$ and $\alpha_1(t)$ are differentiable functions of $t \in [0, 1]$ such that $\alpha_0(0) = \alpha_1(1) = 1$ and $\alpha_0(1) = \alpha_1(0) = 0$. These constraints guarantee that $x_{t=0} = x_0$ and $x_{t=1} = x_1$ by construction.
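For concreteness, the construction in (1) reduces to a few lines of code. The sketch below is our own illustration, not released code from the paper; it uses the simplest admissible choice $\alpha_0(t) = 1 - t$, $\alpha_1(t) = t$ and also returns the time derivative $\dot x_t$, which serves as the regression target for the velocity field introduced next.

```python
import torch

def two_marginal_interpolant(x0, x1, t):
    """Stochastic interpolant (1) with alpha_0(t) = 1 - t, alpha_1(t) = t.

    x0, x1: paired samples from rho_0 and rho_1, shape (B, d).
    t:      times in [0, 1], shape (B,).
    Returns x_t and its time derivative dx_t/dt = x1 - x0, whose
    conditional expectation given x_t defines the velocity field b.
    """
    t = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over feature dims
    x_t = (1.0 - t) * x0 + t * x1
    dx_t = x1 - x0
    return x_t, dx_t
```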
It was shown in Albergo & Vanden-Eijnden (2023) and Albergo et al. (2023) that if $X_t$ is the solution of the ordinary differential equation (ODE)

$\dot X_t = b(t, X_t),$   (2)

with velocity $b$ defined by

$b(t, x) = \mathbb{E}[\dot x_t \mid x_t = x] = \dot\alpha_0(t)\,\mathbb{E}[x_0 \mid x_t = x] + \dot\alpha_1(t)\,\mathbb{E}[x_1 \mid x_t = x]$   (3)

and initial condition $X_0$ drawn from $\rho_0$, then $X_t$ matches $x_t$ in law for all times $t \in [0, 1]$. Hence we can sample from $\rho_1$ by generating samples from $\rho_0$ and propagating them via the ODE (2) to time $t = 1$. Equivalent diffusion processes depending on the score can also be obtained by introducing noise into the interpolant (Albergo et al., 2023). A significant body of research on both flows and diffusions has studied how to choose $\alpha_0(t)$ and $\alpha_1(t)$ so as to reduce the computational difficulty of integrating the resulting velocity field $b$ (Karras et al., 2022).

In this work, we first observe that the decomposition (3) of the velocity field into conditional expectations of samples from the marginals $\rho_0$ and $\rho_1$ suggests a more general definition of a process $x(\alpha)$, defined not with respect to the scalar $t \in [0, 1]$ but with respect to an interpolation coordinate $\alpha = (\alpha_0, \alpha_1)$,

$x(\alpha) = \alpha_0 x_0 + \alpha_1 x_1.$   (4)

By specifying a curve $\alpha(t)$, we can recover the interpolant in (1) with the identification $x_t = x(\alpha(t))$. We use the generalized perspective in (4) to identify the minimal conditions on $\alpha$ so that the density $\rho(\alpha, x)$ of $x(\alpha)$ is well-defined for all $\alpha$, reduces to $\rho_0(x)$ for $\alpha = (1, 0)$, and to $\rho_1(x)$ for $\alpha = (0, 1)$. These considerations lead to several new advances, which we summarize as our main contributions:

1. We show that the introduction of an interpolation coordinate decouples the learning problem of estimating a given $b(t, x)$ from the design problem for a path $\alpha(t)$ governing the transport. We use this to devise an optimization problem over curves $\alpha(t)$ with the Benamou-Brenier transport cost, which gives a geometric algorithm for selecting a performant $\alpha$.

2. By lifting $\alpha$ to a higher-dimensional space, we use the stochastic interpolant framework to devise a generic paradigm for multimarginal generative modeling. To this end, we derive a generalized probability flow valid among $K + 1$ marginal densities $\rho_0, \rho_1, \dots, \rho_K$, whose corresponding velocity field $b(t, x)$ is defined via the conditional expectations $\mathbb{E}[x_k \mid x(\alpha) = x]$ for $k = 0, 1, \dots, K$ and a curve $\alpha(t)$. We characterize these conditional expectations as the minimizers of square-loss regression problems, and show that one of them gives access to the score of the multimarginal density.

3. We show that the multimarginal framework allows us to solve $K^2$ marginal transport problems using only $K$ marginal vector fields. Moreover, this framework naturally learns multi-way correspondences between the individual marginals, as detailed in Section 3. The method makes possible concepts like all-to-all image-to-image translation, where we observe an emergent style transfer amongst the images.

4. Moreover, in contrast to existing work, we consider multimarginal transport from a novel generative perspective, demonstrating how to generate joint samples $(x_0, \dots, x_K)$ matching prescribed marginal densities and generalizing beyond the training data.

The structure of the paper is organized as follows. In Section 1.1, we review related work in multimarginal optimal transport and two-marginal generative modeling. In Section 2, we introduce the interpolation coordinate framework and detail its ramifications for multimarginal generative modeling.
In Section 2.1, we formulate the path-length minimization problem and illustrate its application. In Appendix B, we consider a limiting case in which the framework directly gives one-step maps between any two marginals. In Section 3, we detail experimental results in all-to-all image translation and style transfer. We conclude and discuss future directions in Section 4.

1.1 RELATED WORKS

Generative models and dynamical transport. Recent years have seen an explosion of progress in generative models built upon dynamical transport of measure. These models have roots as early as Tabak & Vanden-Eijnden (2010) and Tabak & Turner (2013), and originally took the form of a discrete series of steps (Rezende & Mohamed, 2015; Dinh et al., 2017; Huang et al., 2016; Durkan et al., 2019), while modern models are typically formulated via a continuous-time transformation. A particularly notable example of this type is score-based diffusion (Song et al., 2021c;a; Song & Ermon, 2019), along with related methods such as denoising diffusion probabilistic models (Ho et al., 2020; Kingma et al., 2021), which generate samples by reversing a stochastic process that maps the data into samples from a Gaussian base density. Methods such as flow matching (Lipman et al., 2022; Tong et al., 2023), rectified flow (Liu, 2022; Liu et al., 2022), and stochastic interpolants (Albergo & Vanden-Eijnden, 2023; Albergo et al., 2023) refine this idea by connecting the target distribution to an arbitrary base density through the construction of the path, rather than requiring a Gaussian base, and allow for paths that are more general than the one used in score-based diffusion. Importantly, methods built upon continuous-time transformations that posit a connection between densities often lead to more efficient learning problems than the alternatives. Here, we continue with this formulation, but focus on generalizing the standard framework to a multimarginal setting.

Multimarginal modeling and optimal transport. Multimarginal problems are typically studied from an optimal transport perspective (Pass, 2014b), where practitioners are often interested in computation of the Wasserstein barycenter (Cuturi & Doucet, 2014; Agueh & Carlier, 2011; Altschuler & Boix-Adsera). The barycenter is thought to lead to useful generation of combined features from a set of datasets (Rabin et al., 2012), but its computation is expensive, creating the need for approximate algorithms. Some recent work has begun to fuse the barycenter problem with dynamical transport, and has developed algorithms for its computation based on the diffusion Schrödinger bridge (Noble et al., 2023). A significant body of work in multimarginal optimal transport has concerned the existence of an optimal transport plan of Monge type (Gangbo & Swiech, 1998; Pass, 2014a), as well as generalized notions (Friesecke & Vögler, 2018). A Monge transport plan is a joint distribution of the restricted form (31) considered below and in particular yields a set of compatible deterministic correspondences between marginal spaces. We refer to such a set of compatible correspondences as a multi-way correspondence. Although the existence of Monge solutions in multimarginal optimal transport is open in general (Pass, 2014a), in the setting of the quadratic cost that arises in the study of Wasserstein barycenters a Monge solution does in fact exist under mild regularity conditions (Gangbo & Swiech, 1998).
We do not compute multimarginal optimal transport solutions in this work, but nonetheless we will show how to extract an intuitive multi-way correspondence from our learned transport. We will also show that if the multimarginal stochastic interpolant is defined using a coupling of Monge type, then this multi-way correspondence is uniquely determined.

2 MULTIMARGINAL STOCHASTIC INTERPOLANTS

In this section, we study stochastic interpolants built upon an interpolation coordinate, as illustrated in (4). We consider the multimarginal generative modeling problem, whereby we have access to a dataset of $n$ samples $\{x_k^i\}_{i=1,\dots,n;\,k=0,\dots,K}$ from each of $K + 1$ densities, with $x_k^i \sim \rho_k$. We wish to construct a generative model that enables us to push samples from any $\rho_j$ onto samples from any other $\rho_k$. By setting $\rho_0$ to be a tractable base density such as a Gaussian, we may use this model to draw samples from any marginal; for this reason, we hereafter assume $\rho_0$ to be a standard normal. We denote by $\Delta_K$ the $K$-simplex in $\mathbb{R}^{K+1}$, i.e., $\Delta_K = \{\alpha = (\alpha_0, \dots, \alpha_K) \in \mathbb{R}^{K+1} : \sum_{k=0}^K \alpha_k = 1 \text{ and } \alpha_k \ge 0 \text{ for } k = 0, \dots, K\}$. We begin with a useful definition of a stochastic interpolant that places $\alpha \in \Delta_K$:

Definition 1 (Barycentric stochastic interpolant). Given $K + 1$ probability density functions $\{\rho_k\}_{k=0}^K$ with full support on $\mathbb{R}^d$, the barycentric stochastic interpolant $x(\alpha)$ with $\alpha = (\alpha_0, \dots, \alpha_K) \in \Delta_K$ is the stochastic process defined as

$x(\alpha) = \sum_{k=0}^K \alpha_k x_k,$   (5)

where $(x_1, \dots, x_K)$ are jointly drawn from a probability density $\rho(x_1, \dots, x_K)$ such that, for all $k = 1, \dots, K$,

$\int_{\mathbb{R}^{(K-1)d}} \rho(x_1, \dots, x_K)\, dx_1 \cdots dx_{k-1}\, dx_{k+1} \cdots dx_K = \rho_k(x_k),$   (6)

and we set $x_0 \sim N(0, \mathrm{Id}_d)$ independent of $(x_1, \dots, x_K)$.

The barycentric stochastic interpolant emerges as a natural extension of the two-marginal interpolant (1) with the choice $\alpha_0(t) = 1 - t$ and $\alpha_1(t) = t$. In this work, we primarily study the barycentric interpolant for convenience, but we note that our only requirement in the following discussion is that $\sum_{k=0}^K \alpha_k^2 > 0$. This condition ensures that $x(\alpha)$ always contains a contribution from some $x_k$, and hence its density $\rho(\alpha, \cdot)$ does not collapse to a Dirac measure at zero for any $\alpha$. In the following, we characterize $\rho(\alpha, \cdot)$ as the solution to a set of continuity equations.

Theorem 1 (Continuity equations). For all $\alpha \in \Delta_K$, the probability distribution of the barycentric stochastic interpolant $x(\alpha)$ has a density $\rho(\alpha, x)$ which satisfies the $K + 1$ equations

$\partial_{\alpha_k}\rho(\alpha, x) + \nabla_x \cdot \big(g_k(\alpha, x)\,\rho(\alpha, x)\big) = 0, \qquad k = 0, \dots, K.$   (7)

Above, each $g_k(\alpha, x)$ is defined as the conditional expectation

$g_k(\alpha, x) = \mathbb{E}[x_k \mid x(\alpha) = x], \qquad k = 0, \dots, K,$   (8)

where $\mathbb{E}[x_k \mid x(\alpha) = x]$ denotes an expectation over $\rho_0(x_0)\rho(x_1, \dots, x_K)$ conditioned on the event $x(\alpha) = x$. The score along each two-marginal path connected to $\rho_0$ is given by

$\alpha_0 \neq 0: \qquad \nabla_x \log \rho(\alpha, x) = -\alpha_0^{-1} g_0(\alpha, x).$   (9)

Moreover, each $g_k$ is the unique minimizer of the objective

$L_k(\hat g_k) = \int_{\Delta_K} \mathbb{E}\big[\,|\hat g_k(\alpha, x(\alpha))|^2 - 2\, x_k \cdot \hat g_k(\alpha, x(\alpha))\,\big]\, d\alpha, \qquad k = 0, \dots, K,$   (10)

where the expectation $\mathbb{E}$ is taken over $(x_0, \dots, x_K) \sim \rho_0(x_0)\rho(x_1, \dots, x_K)$.
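The objectives (10) reduce to simple minibatch regressions once samples of the coupling are available. The following sketch is our own illustration of one evaluation of the summed objective, assuming a weight-shared network `g_model(alpha, x)` that returns all $K+1$ fields at once (a placeholder, not released code); uniform sampling of $\alpha$ on the simplex is one convenient choice for the integral over $\Delta_K$.

```python
import torch

def multimarginal_loss(g_model, x_batch):
    """Empirical estimate of sum_k L_k(g_hat_k) from (10).

    x_batch: coupled samples (x_0, ..., x_K), shape (B, K+1, d),
             with x_0 standard Gaussian noise.
    g_model: network mapping (alpha, x) -> all K+1 vector fields,
             returned with shape (B, K+1, d).
    """
    B, Kp1, d = x_batch.shape
    # alpha drawn uniformly on the simplex (a flat Dirichlet draw).
    alpha = torch.distributions.Dirichlet(torch.ones(Kp1)).sample((B,))
    # Barycentric interpolant x(alpha) = sum_k alpha_k x_k, eq. (5).
    x_alpha = torch.einsum('bk,bkd->bd', alpha, x_batch)
    g_hat = g_model(alpha, x_alpha)
    # |g_hat_k|^2 - 2 x_k . g_hat_k, summed over k, averaged over the batch.
    per_k = g_hat.pow(2).sum(-1) - 2.0 * (x_batch * g_hat).sum(-1)
    return per_k.sum(-1).mean()
```

Minimizing this quantity over minibatches is exactly the loop written out in Algorithm 1 below.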
The proof of Theorem 1 is given in Appendix A; it relies on the definition of the density $\rho(\alpha, x)$, which says that, for every suitable test function $\phi(x)$, we have

$\int_{\mathbb{R}^d} \phi(x)\rho(\alpha, x)\,dx = \int_{\mathbb{R}^{(K+1)d}} \phi(x(\alpha))\,\rho_0(x_0)\rho(x_1, \dots, x_K)\,dx_0 \cdots dx_K,$   (11)

where $x(\alpha) = \sum_{k=0}^K \alpha_k x_k$. Taking the derivative of both sides with respect to $\alpha_k$, using the chain rule and the fact that $\partial_{\alpha_k} x(\alpha) = x_k$, gives

$\int_{\mathbb{R}^d} \phi(x)\,\partial_{\alpha_k}\rho(\alpha, x)\,dx = \int_{\mathbb{R}^{(K+1)d}} x_k \cdot \nabla\phi(x(\alpha))\,\rho_0(x_0)\rho(x_1, \dots, x_K)\,dx_0 \cdots dx_K.$   (12)

Integrating the right-hand side by parts and using the definition of the conditional expectation implies (7).

As we will see, the loss functions in (10) are amenable to empirical estimation over a dataset of samples, which enables efficient learning of the $g_k$. The resulting approximations can be combined according to (14) to construct a multimarginal generative model. In practice, we have the option to parameterize a single, weight-shared $g(\alpha, x) = (g_0(\alpha, x), \dots, g_K(\alpha, x))$ as a function from $\Delta_K \times \mathbb{R}^d$ to $\mathbb{R}^{(K+1)d}$, or to parameterize the $g_k : \Delta_K \times \mathbb{R}^d \to \mathbb{R}^d$ individually. In our numerical experiments, we proceed with the former for efficiency.

Theorem 1 provides relations between derivatives of the density $\rho(\alpha, x)$ with respect to $\alpha$ and with respect to $x$ that involve the conditional expectations $g_k(\alpha, x)$. By specifying a curve $\alpha(t)$ that traverses the simplex, we can use this result to design generative models that transport directly from any one marginal to another, or that are influenced by multiple marginals throughout the generation process, as we now show.

Corollary 2 (Transport equations). Let $\{e_k\}$ denote the standard basis vectors of $\mathbb{R}^{K+1}$, and let $\alpha : [0, 1] \to \Delta_K$ denote a differentiable curve satisfying $\alpha(0) = e_i$ and $\alpha(1) = e_j$ for any $i, j = 0, \dots, K$. Then the barycentric stochastic interpolant $x(\alpha(t))$ has probability density $\rho(t, x) = \rho(\alpha(t), x)$ that satisfies the transport equation

$\partial_t \rho(t, x) + \nabla \cdot \big(b(t, x)\,\rho(t, x)\big) = 0, \qquad \rho(t = 0, x) = \rho_i(x), \qquad \rho(t = 1, x) = \rho_j(x),$   (13)

where we have defined the velocity field

$b(t, x) = \sum_{k=0}^K \dot\alpha_k(t)\, g_k(\alpha(t), x).$   (14)

In addition, the probability flow associated with (13), given by

$\dot X_t = b(t, X_t),$   (15)

satisfies $X_{t=1} \sim \rho_j$ for any $X_{t=0} \sim \rho_i$, and vice-versa.

Corollary 2 is proved in Appendix A, where we also show how to use the information about the score in (9) to derive generative models based on stochastic dynamics. The transport equation (13) is a simple consequence of the identity $\partial_t \rho(t, x) = \sum_{k=0}^K \dot\alpha_k(t)\,\partial_{\alpha_k}\rho(\alpha(t), x)$, in tandem with the equations in (7). It states that the barycentric interpolant framework leads to a generative model defined for any path on the simplex, which can be used to transport between any pair of marginal densities. Note that the two-marginal transports along fixed edges in (13) reduce to a set of K independent processes, because each $g_k$ takes values independent of the rest of the simplex when restricted to an edge. In fact, for any edge between $\rho_i$ and $\rho_j$ with $i, j \neq k$, the marginal vector field $g_k(\alpha, x) = \mathbb{E}[x_k \mid x(\alpha) = x] = \mathbb{E}[x_k]$, because the conditioning event provides no additional information about $x_k$. In practice, however, we expect imperfect learning and the implicit bias of neural networks to lead to models with two-marginal transports that are influenced by all data used during training. We summarize a few possibilities enabled by the multimarginal generative modeling setup in Section 2.1.
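At sampling time, Corollary 2 only requires integrating the ODE (15) with the velocity (14) along a chosen simplex path. The sketch below is our own illustration (with `g_model` the same placeholder network as above); it transports samples from vertex i to vertex j along the straight edge with a simple Euler scheme, and any differentiable path or higher-order ODE integrator could be substituted.

```python
import torch

def edge_path(t, i, j, K):
    """Straight-line path alpha(t) on the simplex from e_i to e_j,
    together with its time derivative."""
    alpha = torch.zeros(K + 1)
    alpha[i], alpha[j] = 1.0 - t, t
    alpha_dot = torch.zeros(K + 1)
    alpha_dot[i], alpha_dot[j] = -1.0, 1.0
    return alpha, alpha_dot

@torch.no_grad()
def transport(g_model, x, i, j, K, n_steps=100):
    """Push samples x ~ rho_i to rho_j by integrating dX/dt = b(t, X),
    with b(t, x) = sum_k alpha_dot_k(t) g_k(alpha(t), x) as in (14)."""
    dt = 1.0 / n_steps
    for n in range(n_steps):
        alpha, alpha_dot = edge_path(n * dt, i, j, K)
        g = g_model(alpha.unsqueeze(0).expand(x.shape[0], -1), x)  # (B, K+1, d)
        b = torch.einsum('k,bkd->bd', alpha_dot, g)
        x = x + dt * b  # forward Euler step
    return x
```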
2.1 OPTIMIZING THE PATHS TO LOWER THE TRANSPORT COST

Importantly, the results above highlight that the learning problem for the velocity $b$ is decoupled from the choice of path $\alpha(t)$ on the simplex. We can therefore ask to what extent this flexibility can lower the cost of transporting a given $\rho_i$ to another $\rho_j$ when this transport is accomplished by solving (15) subject to (13). In particular, we are free to parameterize the $\alpha_k$ in the expression for the velocity $b(t, x)$ given in (14), so long as the simplex constraints on $\alpha$ are satisfied. We use this freedom to state the following corollary.

Corollary 3. The minimizer $\hat\alpha$ of

$C(\hat\alpha) = \int_0^1 \mathbb{E}\Big[\Big|\sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k\big(\hat\alpha(t), x(\hat\alpha(t))\big)\Big|^2\Big]\,dt$   (16)

gives the transport with least path length in the Wasserstein-2 metric over the class of velocities $\hat b(t, x) = \sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k(\hat\alpha(t), x)$. Here $g_k$ is given by (8), $x(\alpha) = \sum_{k=0}^K \alpha_k x_k$, the expectation is taken over $(x_0, x_1, \dots, x_K) \sim \rho_0 \otimes \rho$, and the minimization is over all paths $\hat\alpha \in C^1([0, 1])$ such that $\hat\alpha(t) \in \Delta_K$ for all $t \in [0, 1]$.

Note that the objective in (16) can be estimated efficiently by sampling the interpolant, without having to simulate the probability flow given by (15). Note also that this gives a way to reduce the transport cost directly without having to solve the max-min problem presented in Albergo & Vanden-Eijnden (2023). We stress, however, that this class of velocities is not sufficient to achieve optimal transport, as such a guarantee requires the use of nonlinear interpolants (cf. Appendix D of Albergo & Vanden-Eijnden (2023)). An example learnable parameterization is given in Appendix C.1.

Algorithm 1: Learning each $\hat g_k$
  Input: model $\hat g_k$, coupling $\rho(x_1, \dots, x_K)$, number of gradient steps $N_g$, loss function $L_k$ as defined in (10)
  for $j = 1, \dots, N_g$ do
    draw $(x_0, \dots, x_K) \sim \rho_0(x_0)\rho(x_1, \dots, x_K)$
    draw $\alpha \in \Delta_K$
    set $x(\alpha) = \sum_k \alpha_k x_k$
    take a gradient step with respect to $L_k$
  end
  Return: $\hat g_k$.

In addition to optimization over the path, the transport equation (13) enables generative procedures that are unique to the multimarginal setting, such as transporting any given density to the barycenter of the $K + 1$ densities, or transport through the interior of the simplex.
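Since (16) only requires evaluating the frozen marginal fields on interpolant samples, its Monte Carlo estimate is cheap. The following sketch is our own illustration, not the authors' released code; `alpha_fn` is a hypothetical parameterized path returning $\hat\alpha(t)$ and $\dot{\hat\alpha}(t)$, such as the Fourier parameterization of Appendix C.1, and the estimate can be dropped into a standard optimizer loop to realize Algorithm 2 of Section 3.

```python
import torch

def path_cost(g_model, x_batch, alpha_fn, n_times=8):
    """Monte Carlo estimate of the path-length objective C(alpha_hat) in (16).

    x_batch:  coupled samples (x_0, ..., x_K), shape (B, K+1, d).
    alpha_fn: callable t -> (alpha(t), alpha_dot(t)), both of shape (K+1,),
              with alpha(t) constrained to the simplex.
    """
    B = x_batch.shape[0]
    cost = 0.0
    for t in torch.rand(n_times):  # t ~ Unif[0, 1]
        alpha, alpha_dot = alpha_fn(t)
        x_alpha = torch.einsum('k,bkd->bd', alpha, x_batch)      # x(alpha(t))
        g = g_model(alpha.unsqueeze(0).expand(B, -1), x_alpha)   # (B, K+1, d)
        b = torch.einsum('k,bkd->bd', alpha_dot, g)
        cost = cost + b.pow(2).sum(-1).mean() / n_times
    return cost
```

In this scheme the parameters of `g_model` are held fixed (they were obtained with Algorithm 1), and only the parameters of `alpha_fn` are updated by gradient descent on this estimate.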
3 NUMERICAL EXPERIMENTS

In this section, we numerically explore the characteristics, behavior, and applications of the multimarginal generative model introduced above.

Example      Application                                   Simplex representation
2-marginal   standard generative modeling                  ρ1 to ρ2 along an edge
3-marginal   two-marginal with smoothing                   ρ1 to ρ2 through the interior
K-marginal   all-to-all image translation/style transfer   ρi to ρj for any pair of vertices

Table 1: Characterizing various generative modeling aims with respect to how their transport appears on the simplex. We highlight a sampling of various other tasks that can emerge from generic multimarginal generative modeling. Here, two-marginal with smoothing corresponds to the addition of a Gaussian marginal, as done in (Albergo et al., 2023), with a path through the simplex.

Figure 1: Direct optimization of α(t) over a parametric class to reduce the transport cost in the 2-marginal learning problem of a Gaussian to the checkerboard density. Left: the initial and final $\hat\alpha_0, \hat\alpha_1$ learned over 300 optimization steps on (16). Center: the reduction in the path length over training. Right: time slices of the probability density ρ(t, x) corresponding to the interpolant with learned $\hat\alpha$, compared to the linear interpolant α = [1 − t, t].

We start with a numerical demonstration of the path minimization in the restricted class defined in Corollary 3. Following that, we explore various pathways of transport on the simplex to illustrate characteristics and applications of the method. We examine what happens empirically with respect to transport on the edges of the simplex and from its barycenter. We also consider empirically whether probability flows following (15) with different paths specified by α(t), but the same endpoint α(1), give meaningfully similar yet varying samples. Algorithms for learning the vector fields $g_k$ and for learning a least-cost parametric path (if desired) on the simplex are given in Algorithm 1 and Algorithm 2, respectively.

3.1 OPTIMIZING THE SIMPLEX PATH α(t)

Algorithm 2: Learning a path $\hat\alpha(t)$
  Input: models $\hat g_k$, model path $\hat\alpha(t)$, coupling $\rho(x_1, \dots, x_K)$, number of gradient steps $N_g$, loss function $C(\hat\alpha)$ as defined in (16)
  for $j = 1, \dots, N_g$ do
    draw $(x_0, \dots, x_K) \sim \rho_0(x_0)\rho(x_1, \dots, x_K)$
    draw $t \sim \mathrm{Unif}[0, 1]$
    compute $\hat\alpha(t)$
    set $x(\hat\alpha(t)) = \sum_k \hat\alpha_k(t)\, x_k$
    take a gradient step with respect to $C(\hat\alpha)$
  end
  Return: $\hat\alpha(t)$.

As shown in Section 2, the multimarginal formulation reveals that transport between distributions can be defined in a path-independent manner. Moreover, Corollary 3 highlights that the path can be determined a posteriori via an optimization procedure. Here we study to what extent this optimization can lower the transport cost and, in addition, what effect it has on the resulting path at the qualitative level of the samples. We note that while we focus here on the stochastic interpolant framework considered in (5), score-based diffusion (Song et al., 2021c) fits within the original interpolant formalism as presented in (1). This means that the framework we consider can also be used to optimize over the diffusion coefficient g(t) for diffusion models, providing an algorithmic procedure for the experiments proposed by Karras et al. (2022).

A numerical realization of the effect of path optimization is provided in Fig. 1. We train an interpolant on the two-marginal problem of mapping a standard Gaussian to the two-dimensional checkerboard density seen in the right half of the figure. We parameterize α0 and α1 in this case with a Fourier expansion with m = 20 components normalized to the simplex; the complete formula is given in Appendix C.1. A visualization of the improvement from the learned α for a fixed number of function evaluations is also given in the appendix. We note that this procedure yields α0 and α1 that are qualitatively very similar to those observed by Shaul et al. (2023), who study how to minimize a kinetic cost of Gaussian paths to the checkerboard.

3.2 ALL-TO-ALL IMAGE TRANSLATION AND STYLE TRANSFER

A natural application of the multimarginal framework is image-to-image translation. Given K image distributions, every image dataset is mapped to every other in a way that can be exploited, for example, for style transfer. If the distributions have some distinct characteristics, the aim is to find a map that connects samples xi and xj from ρi and ρj in a way that visually transforms aspects of one into the other. We explore this in two cases. The first is a class-to-class example on the MNIST dataset, where every digit class is mapped to every other digit class. The second is a case using multiple common datasets as marginals: the AFHQ-2 animal faces dataset (Choi et al., 2020), the Oxford flowers dataset (Nilsback & Zisserman, 2006), and the CelebA dataset (Zhang et al., 2019). All three are set to resolution 64 × 64. For each case, we provide pictorial representations that indicate which marginals map where, using either simplex diagrams or Petrie polygons representing the edges of the higher-order simplices. For all image experiments, we use the U-Net architecture made popular in (Ho et al., 2020).
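Appendix C.2 describes the only architectural change we rely on: the U-Net is conditioned on the vector α rather than on a scalar time, and its output is split channel-wise into one slice per marginal vector field. A minimal sketch of one way to embed α, assuming a standard PyTorch U-Net whose residual blocks accept a conditioning vector (the class name and dimensions below are our own placeholders, not the code used for the experiments):

```python
import torch.nn as nn

class AlphaEmbedding(nn.Module):
    """Replace the usual scalar-t embedding with an embedding of the
    simplex coordinate alpha = (alpha_0, ..., alpha_K)."""

    def __init__(self, K, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(K + 1, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, alpha):       # alpha: (B, K+1), a point on the simplex
        return self.net(alpha)      # (B, embed_dim), fed to every U-Net block
```

The U-Net's output head is widened so that each slice of image channels is read off as one of the marginal vector fields $\hat g_k$, as detailed in Appendix C.2.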
Figure 2: Left: Generated MNIST digits from the same Gaussian sample $x_0 \sim \rho_0$, with K = 7 marginals (ρ0 and 6 digit classes). $x_0$ is visualized in the center of the image collection at time t = 0, and the perimeter corresponds to transport to the edge of the simplex at time t = 1, with vertices color-coded. A Petrie polygon representing the 6-simplex, with arrows denoting transport from the Gaussian along edges to the color-coded marginals, clarifies the marginal endpoints. Right: Demonstrating the impact of learning over the larger simplex. Top row: learning just on the simplex edge from 0 to 3. Middle: learning on all the simplex edges from 0 through 5. Bottom: learning on the entire simplex constructed from 0 through 5, not just the edges.

Qualitative ramifications of multimarginal transport. We use mappings between MNIST digits as a simple testbed to begin a study of multimarginal mappings. For visual simplicity, we consider the case of a set of marginals containing the digits 0 through 5 and a Gaussian. The inclusion of the Gaussian marginal facilitates resampling to generate from any given marginal, in addition to the ability to map from any ρi to any other ρj. In the left of Fig. 2, we show the generation space across the simplex for samples pushed forward from the same Gaussian sample towards all other vertices of the simplex. Directions from the center to the colored circles indicate transport towards one of the vertices corresponding to the different digits. We note that this mapping is smooth amongst all the generated endpoint digits, allowing for a natural translation toward every marginal.

Figure 3: Left: An illustration of how different transport paths on the simplex can reach final samples that have similar content. Top row: a cat sample is transformed into a celebrity. Middle row: the same cat is pushed to the flower marginal. Bottom row: the new flower sample is then pushed to a celebrity that maintains meaningful semantic structure relative to the celebrity generated along the other path on the simplex.

In the right of Fig. 2, we illustrate how training on the simplex influences the learned generative model. The top row corresponds to learning just in the two-marginal case. The middle row corresponds to learning only on the edges of the simplex, and is equivalent to learning multiple two-marginal models in one shared representation. The third corresponds to learning on the entire simplex. We observe here (and in additional examples provided in Appendix C.2) that learning on the whole simplex empirically results in image translation that better preserves the characteristics of the original image. This is most visible in the bottom row, where the three naturally emerges from the shape of the zero.
In addition to translating along the marginals, one can try to forge a correspondence between the marginals by pushing samples through the barycenter of the simplex, defined by $\alpha = (\tfrac{1}{K+1}, \dots, \tfrac{1}{K+1})$. This is a unique feature of the multimarginal generative model, and we demonstrate its use in the right of Fig. 3 to construct other digits with a correspondence to a given sample of a 3.

We next turn to the higher-dimensional image generation tasks depicted in Fig. 4, where we train a 4-marginal model (K = 3) to generate and translate among the celebrity, flowers, and animal face datasets mentioned above. In the left of this figure, we present results generated marginally from the Gaussian vertex for a model trained on the entire simplex. Depictions of the transport along edges of the simplex are shown to the right. We show in the right half of the figure that image-to-image translation between the marginals has the capability to pick up coherent form, texture, and color from a sample from one marginal and bring it to another (e.g., the wolf maps to a flower with features of the wolf). Additional examples are available in Appendix C.2.

Figure 4: Left: Marginally sampling a K = 4 multimarginal model comprised of the AFHQ, flowers, and CelebA datasets at resolution 64 × 64. Shown to the right of the images is the corresponding path taken on the simplex, with α(0) = e0 starting at the Gaussian ρ0 and ending at one of ρ1, ρ2 or ρ3. Right: Demonstration of style transfer that emerges naturally when learning a multimarginal interpolant. With a single shared interpolant, we flow from the AFHQ vertex ρ2 to the flowers vertex ρ1 or to the CelebA vertex ρ3. The learned flow connects images with stylistic similarities.

As a final exploration, we consider the effect of taking different paths on the simplex to the same vertex. As mentioned in the related works and in Theorem 4, a coupling of the form (31) would lead to sample generation independent of the interior of the path α(t) and dependent only on its endpoints. Here, we do not enforce such a constraint, and instead examine the path-dependence of an unconstrained multimarginal model. In the left of Fig. 3, we take a sample of a cat from the animals dataset and push it to a generated celebrity. In the middle row, we take the same cat image and push it toward the flower marginal. We then take the generated flower and push it towards the celebrity marginal. We note that there are similarities in the semantic structure of the two generated images, despite the different paths taken along the simplex. Nevertheless, the loop does not close, and there are differences in the generated samples, such as in their hair. This suggests that different paths through the space of measures defined on the simplex have the potential to provide meaningful variations in output samples using the same learned marginal vector fields $g_k$.

4 OUTLOOK AND CONCLUSION

In this work, we introduced a generic framework for multimarginal generative modeling. To do so, we used the formalism of stochastic interpolants, which enabled us to define a stochastic process on the simplex that interpolates between any pair of K densities at the level of the samples.
We showed that this formulation decouples the problem of learning a velocity field that accomplishes a given dynamical transport from the problem of designing a path between two densities, which leads to a simple minimization problem for the path with lowest transport cost over a specific class of velocity fields. We considered this minimization problem in the two-marginal case numerically, and showed that it can lower the transport cost in practice. It would be interesting to apply the method to score-based diffusion, where significant effort has gone into heuristic tuning of the signal-to-noise ratio (Karras et al., 2022), and to see whether the optimization can recover or improve upon recommended schedules. In addition, we explored how training multimarginal generative models impacts the generated samples in comparison to two-marginal couplings, and found that the incorporation of data from multiple densities leads to novel flows that demonstrate emergent properties such as style transfer. Future work will consider the application of this method to problems in measurement decorruption from multiple sources.

REFERENCES

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011. URL https://epubs.siam.org/doi/10.1137/100805741.

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2023.

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2023.

Jason M. Altschuler and Enric Boix-Adsera. Wasserstein barycenters can be computed in polynomial time in fixed dimension.

Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains, 2020.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters, 2014. URL http://arxiv.org/abs/1310.4375.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis, 2021.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/7ac71d433f282034e088473244df8c02-Paper.pdf.

Gero Friesecke and Daniela Vögler. Breaking the curse of dimension in multi-marginal Kantorovich optimal transport on finite state spaces. SIAM Journal on Mathematical Analysis, 50(4):3996-4019, 2018.

Wilfrid Gangbo and Andrzej Swiech. Optimal maps for the multidimensional Monge-Kantorovich problem. Communications on Pure and Applied Mathematics, 51(1):23-45, 1998.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840-6851, 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth, 2016. arXiv:1603.09382.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. arXiv:2206.00364.

Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. On density estimation with diffusion models. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=2LdBqxc1Yv.

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2022. URL https://arxiv.org/abs/2210.02747.

Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport, 2022. URL https://arxiv.org/abs/2209.14577.

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URL https://arxiv.org/abs/2209.03003.

Maria-Elena Nilsback and Andrew Zisserman. A visual vocabulary for flower classification. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pp. 1447-1454, 2006.

Maxence Noble, Valentin De Bortoli, Arnaud Doucet, and Alain Durmus. Tree-based diffusion Schrödinger bridge with applications to Wasserstein barycenters, 2023. URL http://arxiv.org/abs/2305.16557.

Brendan Pass. Multi-marginal optimal transport and multi-agent matching problems: Uniqueness and structure of solutions. Discrete and Continuous Dynamical Systems, 34(4):1623-1639, 2014a.

Brendan Pass. Multi-marginal optimal transport: Theory and applications, 2014b. URL http://arxiv.org/abs/1406.0026.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pp. 435-446. Springer, 2012.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530-1538. PMLR, 2015.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.

Neta Shaul, Ricky T. Q. Chen, Maximilian Nickel, Matt Le, and Yaron Lipman. On kinetic optimal probability paths for generative models, 2023.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pp. 1415-1428, 2021a. URL https://proceedings.neurips.cc/paper/2021/file/0a9fdbb17feb6ccb7ec405cfb85222c4-Paper.pdf.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.

E. G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145-164, 2013. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21423.

Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010.

Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.

Honglun Zhang, Wenqing Chen, Jidong Tian, Yongkun Wang, and Yaohui Jin. Show, attend and translate: Unpaired multi-domain image-to-image translation with visual attention, 2019.

A OMITTED PROOFS

Theorem 1 (Continuity equations). For all $\alpha \in \Delta_K$, the probability distribution of the barycentric stochastic interpolant $x(\alpha)$ has a density $\rho(\alpha, x)$ which satisfies the $K + 1$ equations

$\partial_{\alpha_k}\rho(\alpha, x) + \nabla_x \cdot \big(g_k(\alpha, x)\,\rho(\alpha, x)\big) = 0, \qquad k = 0, \dots, K.$   (7)

Above, each $g_k(\alpha, x)$ is defined as the conditional expectation

$g_k(\alpha, x) = \mathbb{E}[x_k \mid x(\alpha) = x], \qquad k = 0, \dots, K,$   (8)

where $\mathbb{E}[x_k \mid x(\alpha) = x]$ denotes an expectation over $\rho_0(x_0)\rho(x_1, \dots, x_K)$ conditioned on the event $x(\alpha) = x$. The score along each two-marginal path connected to $\rho_0$ is given by

$\alpha_0 \neq 0: \qquad \nabla_x \log \rho(\alpha, x) = -\alpha_0^{-1} g_0(\alpha, x).$   (9)

Moreover, each $g_k$ is the unique minimizer of the objective

$L_k(\hat g_k) = \int_{\Delta_K} \mathbb{E}\big[\,|\hat g_k(\alpha, x(\alpha))|^2 - 2\, x_k \cdot \hat g_k(\alpha, x(\alpha))\,\big]\, d\alpha, \qquad k = 0, \dots, K,$   (10)

where the expectation $\mathbb{E}$ is taken over $(x_0, \dots, x_K) \sim \rho_0(x_0)\rho(x_1, \dots, x_K)$.

Proof. By definition of the barycentric interpolant $x(\alpha) = \sum_{k=0}^K \alpha_k x_k$, its characteristic function is given by

$\mathbb{E}\big[e^{i k \cdot x(\alpha)}\big] = \int_{\mathbb{R}^d \times \cdots \times \mathbb{R}^d} e^{i k \cdot \sum_{j=1}^K \alpha_j x_j}\,\rho(x_1, \dots, x_K)\,dx_1 \cdots dx_K \; e^{-\frac12 \alpha_0^2 |k|^2},$   (17)

where we used $x_0 \perp (x_1, \dots, x_K)$ and $x_0 \sim N(0, \mathrm{Id}_d)$. The smoothness in $k$ of this expression guarantees that the distribution of $x(\alpha)$ has a density $\rho(\alpha, x) > 0$. Using the definition of $x(\alpha)$ again, $\rho(\alpha, x)$ satisfies, for any suitable test function $\phi : \mathbb{R}^d \to \mathbb{R}$,

$\int_{\mathbb{R}^d} \phi(x)\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d \times \cdots \times \mathbb{R}^d} \phi(x(\alpha))\,\rho(x_1, \dots, x_K)\,(2\pi)^{-d/2} e^{-\frac12 |x_0|^2}\,dx_0 \cdots dx_K.$   (18)

Taking the derivative with respect to $\alpha_k$ of both sides, we get

$\int_{\mathbb{R}^d} \phi(x)\,\partial_{\alpha_k}\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d \times \cdots \times \mathbb{R}^d} x_k \cdot \nabla\phi(x(\alpha))\,\rho(x_1, \dots, x_K)\,(2\pi)^{-d/2} e^{-\frac12 |x_0|^2}\,dx_0 \cdots dx_K = \int_{\mathbb{R}^d} \mathbb{E}\big[x_k \cdot \nabla\phi(x(\alpha)) \mid x(\alpha) = x\big]\,\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d} \mathbb{E}[x_k \mid x(\alpha) = x] \cdot \nabla\phi(x)\,\rho(\alpha, x)\,dx,$   (19)

where we used the chain rule to get the first equality, the definition of the conditional expectation to get the second, and the fact that $\nabla\phi(x(\alpha)) = \nabla\phi(x)$ since we condition on $x(\alpha) = x$ to get the third. Since

$\mathbb{E}[x_k \mid x(\alpha) = x] = g_k(\alpha, x)$   (20)

by the definition of $g_k$ in (8), we can write (19) as

$\int_{\mathbb{R}^d} \phi(x)\,\partial_{\alpha_k}\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d} g_k(\alpha, x) \cdot \nabla\phi(x)\,\rho(\alpha, x)\,dx.$   (21)

This equation is (7) written in weak form.
To establish (9), note that, if $\alpha_0 > 0$, we have

$\mathbb{E}\big[x_0 e^{i \alpha_0 k \cdot x_0}\big] = \alpha_0^{-1}(-i\nabla_k)\,\mathbb{E}\big[e^{i \alpha_0 k \cdot x_0}\big] = \alpha_0^{-1}(-i\nabla_k)\,e^{-\frac12 \alpha_0^2 |k|^2} = i \alpha_0 k\, e^{-\frac12 \alpha_0^2 |k|^2}.$   (22)

As a result, using $x_0 \perp (x_1, \dots, x_K)$, we have

$\mathbb{E}\big[x_0 e^{i k \cdot x(\alpha)}\big] = i \alpha_0 k\, \mathbb{E}\big[e^{i k \cdot x(\alpha)}\big].$   (23)

Using the properties of the conditional expectation, the left-hand side of this equation can be written as

$\mathbb{E}\big[x_0 e^{i k \cdot x(\alpha)}\big] = \int_{\mathbb{R}^d} \mathbb{E}\big[x_0 e^{i k \cdot x(\alpha)} \mid x(\alpha) = x\big]\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d} \mathbb{E}[x_0 \mid x(\alpha) = x]\,e^{i k \cdot x}\rho(\alpha, x)\,dx = \int_{\mathbb{R}^d} g_0(\alpha, x)\,e^{i k \cdot x}\rho(\alpha, x)\,dx,$   (24)

where we used the definition of $g_0$ in (8) to get the last equality. Since the right-hand side of (23) is the Fourier transform of $-\alpha_0 \nabla\rho(\alpha, x)$, we deduce that

$g_0(\alpha, x)\rho(\alpha, x) = -\alpha_0 \nabla\rho(\alpha, x) = -\alpha_0 \nabla\log\rho(\alpha, x)\,\rho(\alpha, x).$   (25)

Since $\rho(\alpha, x) > 0$, this implies (9) when $\alpha_0 > 0$.

Finally, to derive (10), notice that we can write

$L_k(\hat g_k) = \int_{\Delta_K} \mathbb{E}\big[\,|\hat g_k(\alpha, x(\alpha))|^2 - 2 x_k \cdot \hat g_k(\alpha, x(\alpha))\,\big]\,d\alpha = \int_{\Delta_K}\int_{\mathbb{R}^d} \mathbb{E}\big[\,|\hat g_k(\alpha, x(\alpha))|^2 - 2 x_k \cdot \hat g_k(\alpha, x(\alpha)) \,\big|\, x(\alpha) = x\big]\rho(\alpha, x)\,dx\,d\alpha = \int_{\Delta_K}\int_{\mathbb{R}^d} \big(\,|\hat g_k(\alpha, x)|^2 - 2 g_k(\alpha, x) \cdot \hat g_k(\alpha, x)\,\big)\rho(\alpha, x)\,dx\,d\alpha,$   (26)

where we used the definition of $g_k$ in (8). The unique minimizer of this objective function is $\hat g_k(\alpha, x) = g_k(\alpha, x)$.

Corollary 2 (Transport equations). Let $\{e_k\}$ denote the standard basis vectors of $\mathbb{R}^{K+1}$, and let $\alpha : [0, 1] \to \Delta_K$ denote a differentiable curve satisfying $\alpha(0) = e_i$ and $\alpha(1) = e_j$ for any $i, j = 0, \dots, K$. Then the barycentric stochastic interpolant $x(\alpha(t))$ has probability density $\rho(t, x) = \rho(\alpha(t), x)$ that satisfies the transport equation

$\partial_t \rho(t, x) + \nabla \cdot \big(b(t, x)\,\rho(t, x)\big) = 0, \qquad \rho(t = 0, x) = \rho_i(x), \qquad \rho(t = 1, x) = \rho_j(x),$   (13)

where we have defined the velocity field

$b(t, x) = \sum_{k=0}^K \dot\alpha_k(t)\, g_k(\alpha(t), x).$   (14)

In addition, the probability flow associated with (13), given by

$\dot X_t = b(t, X_t),$   (15)

satisfies $X_{t=1} \sim \rho_j$ for any $X_{t=0} \sim \rho_i$, and vice-versa.

Proof. By definition, $\rho(t = 0, x) = \rho(\alpha(0), x) = \rho(e_i, x) = \rho_i(x)$ and $\rho(t = 1, x) = \rho(\alpha(1), x) = \rho(e_j, x) = \rho_j(x)$, so the boundary conditions in (13) are satisfied. To derive the transport equation in (13), use the chain rule as well as (7) to deduce

$\partial_t \rho(t, x) = \sum_{k=0}^K \dot\alpha_k(t)\,\partial_{\alpha_k}\rho(\alpha(t), x) = -\sum_{k=0}^K \dot\alpha_k(t)\,\nabla \cdot \big(g_k(\alpha(t), x)\,\rho(\alpha(t), x)\big).$   (27)

This gives (13) by definition of $b(t, x)$ in (14). Equation (15) is the characteristic equation associated with (13), which implies the statement about the solution of this ODE.

Note that, using the expression (9) for $\nabla\log\rho(\alpha(t), x) = \nabla\log\rho(t, x)$ as well as the identity $\Delta\rho(t, x) = \nabla \cdot \big(\rho(t, x)\nabla\log\rho(t, x)\big)$, we can, for any $\epsilon(t) \ge 0$, write (13) as the Fokker-Planck equation (FPE)

$\partial_t \rho(t, x) + \nabla \cdot \Big(\big[b(t, x) - \epsilon(t)\,\alpha_0^{-1}(t)\,g_0(\alpha(t), x)\big]\rho(t, x)\Big) = \epsilon(t)\,\Delta\rho(t, x).$   (28)

The SDE associated with this FPE reads

$dX^F_t = \big[b(t, X^F_t) - \epsilon(t)\,\alpha_0^{-1}(t)\,g_0(\alpha(t), X^F_t)\big]\,dt + \sqrt{2\epsilon(t)}\,dW_t.$   (29)

As a result of the property of the solution $\rho(t, x)$, we deduce that the solutions to the SDE (29) are such that $X^F_{t=1} \sim \rho_j$ if $X^F_{t=0} \sim \rho_i$, which results in a generative model using a diffusion.

Corollary 3. The minimizer $\hat\alpha$ of

$C(\hat\alpha) = \int_0^1 \mathbb{E}\Big[\Big|\sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k\big(\hat\alpha(t), x(\hat\alpha(t))\big)\Big|^2\Big]\,dt$   (16)

gives the transport with least path length in the Wasserstein-2 metric over the class of velocities $\hat b(t, x) = \sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k(\hat\alpha(t), x)$. Here $g_k$ is given by (8), $x(\alpha) = \sum_{k=0}^K \alpha_k x_k$, the expectation is taken over $(x_0, x_1, \dots, x_K) \sim \rho_0 \otimes \rho$, and the minimization is over all paths $\hat\alpha \in C^1([0, 1])$ such that $\hat\alpha(t) \in \Delta_K$ for all $t \in [0, 1]$.

Proof.
The statement follows from the fact that the integral on the right-hand side of (16) can be written as

$\int_0^1 \mathbb{E}\Big[\Big|\sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k\big(\hat\alpha(t), x(\hat\alpha(t))\big)\Big|^2\Big]\,dt = \int_0^1 \mathbb{E}\big[\,|\hat b(t, x(\hat\alpha(t)))|^2\,\big]\,dt = \int_0^1\int_{\mathbb{R}^d} |\hat b(t, x)|^2\,\hat\rho(t, x)\,dx\,dt,$   (30)

where $\hat b(t, x) = \sum_{k=0}^K \dot{\hat\alpha}_k(t)\, g_k(\hat\alpha(t), x)$ and $\hat\rho(t, x)$ solves the transport equation (13) with $b$ replaced by $\hat b$. This expression gives the length in the Wasserstein-2 metric of the path followed by this density, implying the statement of the corollary.

B DETERMINISTIC COUPLINGS AND MAP DISTILLATION

We now describe an illustrative coupling $\rho_0(x_0)\rho(x_1, \dots, x_K)$ for the multimarginal setting in which the probability flow between marginals is given in one step. Let $T_k : \mathbb{R}^d \to \mathbb{R}^d$ for $k = 1, \dots, K$ be invertible maps such that $x_k = T_k(x_0) \sim \rho_k$ if $x_0 \sim \rho_0$ (i.e., each $\rho_k$ is the pushforward of $\rho_0$ by $T_k$). Assume that the coupling $\rho$ is of Monge type, i.e.,

$\rho(x_1, \dots, x_K) = \prod_{k=1}^K \delta\big(x_k - T_k(x_0)\big).$   (31)

Clearly, this distribution satisfies (6). For simplicity of notation, let us denote by $T_0(x) = x$ the identity map. Let us also assume that, for all $\alpha \in \Delta_K$, the map $x \mapsto \sum_{k=0}^K \alpha_k T_k(x)$ is invertible, with inverse $R(\alpha, \cdot)$:

$\forall \alpha \in \Delta_K,\; x \in \mathbb{R}^d: \qquad R\Big(\alpha, \sum_{k=0}^K \alpha_k T_k(x)\Big) = x \quad\text{and}\quad \sum_{k=0}^K \alpha_k T_k\big(R(\alpha, x)\big) = x.$   (32)

In this case, the factors $g_k(\alpha, x)$ evaluated at $\alpha = e_0$ simply recover the maps $T_k$. Evaluated at $e_i$, they transport samples from $\rho_i$ to samples from $\rho_k$, as shown in the following result:

Theorem 4. Assume that $\rho(x_1, \dots, x_K)$ is given by (31) and that (32) holds. Then, under the same conditions as in Corollary 2, the factors $g_k(\alpha, x)$ defined in (8) are given by

$g_k(\alpha, x) = T_k\big(R(\alpha, x)\big).$   (33)

In particular, if $\alpha = e_i$, we have $g_k(e_i, x) = T_k(T_i^{-1}(x))$, so that if $x_i \sim \rho_i$ then $g_k(e_i, x_i) \sim \rho_k$.

Proof. If $\rho(x_1, \dots, x_K)$ is given by (31), we have

$g_k(\alpha, x) = \mathbb{E}_0\Big[T_k(x_0)\,\Big|\,\sum_{k'=0}^K \alpha_{k'} T_{k'}(x_0) = x\Big] = \mathbb{E}_0\big[T_k(x_0)\,\big|\,x_0 = R(\alpha, x)\big] = T_k\big(R(\alpha, x)\big),$   (34)

where $\mathbb{E}_0$ denotes expectation over $x_0 \sim \rho_0$. The relation $g_k(e_i, x) = T_k(T_i^{-1}(x))$ follows from (33) since $R(e_i, x) = T_i^{-1}(x)$, and the final statement follows from (31). In addition, the solution to the probability flow ODE (15) with initial data $X_{t=0} = T_i(x_0)$ is given by

$X_t = \sum_{k=0}^K \alpha_k(t)\, T_k(x_0),$   (35)

so that $X_{t=1} = T_j(x_0)$ since $\alpha(1) = e_j$ by assumption. To check (35), notice that $b(t, x) = \sum_{k=0}^K \dot\alpha_k(t)\, T_k(R(\alpha(t), x))$. As a result, using (32),

$b(t, X_t) = \sum_{k=0}^K \dot\alpha_k(t)\, T_k\Big(R\Big(\alpha(t), \sum_{k'=0}^K \alpha_{k'}(t) T_{k'}(x_0)\Big)\Big) = \sum_{k=0}^K \dot\alpha_k(t)\, T_k(x_0) = \dot X_t,$   (36)

so that (35) satisfies (15).

C EXPERIMENTAL DETAILS

C.1 TRANSPORT REDUCTION EXPERIMENTS

Corollary 3 tells us that, because the problem of learning to transport between any two densities $\rho_i, \rho_j$ on the simplex can be separated from the choice of a specific path on the simplex, some of the transport cost associated to the probability flow (15) can be reduced by choosing an optimal $\alpha(t)$. The path $\alpha(t)$ can be parameterized as a learnable function, conditional on this parameterization maintaining the simplex constraint $\sum_k \alpha_k(t) = 1$ with $\alpha_k(t) \ge 0$. In the 2D Gaussian-to-checkerboard example provided in the main text, we fulfilled this by taking $\alpha_i, \alpha_j$ of the form

$\alpha_i(t) = \Big(1 - t + \sum_{n=1}^N a_{i,n}\,\sin\big(n\tfrac{\pi}{2} t\big)\Big)^2,$   (37)

$\alpha_j(t) = \Big(t + \sum_{n=1}^N a_{j,n}\,\sin\big(n\tfrac{\pi}{2} t\big)\Big)^2,$   (38)

where $a_{i,n}, a_{j,n}$ are learnable parameters. We also found that the square, though necessary for enforcing $\alpha_i, \alpha_j \ge 0$, could be removed for faster optimization if $N$ is not too large.
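As a concrete illustration, a minimal sketch of this parameterization for the two-marginal case, including the normalization described next, might look as follows. This is our own rendering with placeholder names, not released code, and the time derivative $\dot{\hat\alpha}$ needed in (14) and (16) can be obtained by automatic differentiation through t.

```python
import torch
import torch.nn as nn

class FourierSimplexPath(nn.Module):
    """Learnable two-marginal path (alpha_i(t), alpha_j(t)) of the
    form (37)-(38), normalized so that the components sum to one."""

    def __init__(self, n_modes=20):
        super().__init__()
        self.a_i = nn.Parameter(torch.zeros(n_modes))
        self.a_j = nn.Parameter(torch.zeros(n_modes))
        self.register_buffer('n', torch.arange(1, n_modes + 1).float())

    def forward(self, t):
        """t: scalar tensor in [0, 1]; returns alpha(t) on the 1-simplex."""
        basis = torch.sin(self.n * torch.pi / 2.0 * t)       # (n_modes,)
        a_i = (1.0 - t + (self.a_i * basis).sum()) ** 2       # eq. (37)
        a_j = (t + (self.a_j * basis).sum()) ** 2             # eq. (38)
        total = a_i + a_j
        return torch.stack([a_i / total, a_j / total])
```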
For the case when the simplex contains more than two marginals, the additional $\alpha_k$ with $k \neq i, j$, which do not specify an endpoint of the transport, are parameterized by dropping the first terms that enforce the boundary conditions, so that

$\alpha_{k \neq i,j}(t) = \Big(\sum_{n=1}^N a_{k,n}\,\sin\big(n\tfrac{\pi}{2} t\big)\Big)^2.$   (39)

The $\alpha_k$ are then normalized such that

$\alpha = \Big[\frac{\alpha_0}{\sum_k \alpha_k}, \frac{\alpha_1}{\sum_k \alpha_k}, \dots, \frac{\alpha_K}{\sum_k \alpha_k}\Big].$   (40)

As such, their time derivatives, which are necessary for using the ODE (15) as a generative model, are given, for any normalized component, by

$\frac{d}{dt}\Big(\frac{\alpha_k}{\sum_m \alpha_m}\Big) = \frac{\dot\alpha_k \sum_{m \neq k} \alpha_m - \alpha_k \sum_{m \neq k} \dot\alpha_m}{\big(\sum_k \alpha_k\big)^2}.$   (42)

Figure 5: The output of the probability flow (15) realized by the learned interpolant for the checkerboard problem discussed in Section 2.1 and Appendix C.1. For a fixed number of function evaluations (5 steps of the midpoint integrator), the learned α is quicker to give an accurate solution.

C.2 ARCHITECTURE FOR IMAGE EXPERIMENTS

We adapt the standard U-Net architecture (Ho et al., 2020) to work with a vector α describing the time coordinate rather than a scalar t, which is the conventional coordinate label. The number of output channels from the network is given as (# image channels) × K, where the kth slice of # image channels corresponds to the kth marginal vector field necessary for computing the probability flow.

                                      MNIST         AFHQ-Flowers-CelebA
Dimension                             32 × 32       64 × 64 × 3
# Training points                     50,000        15000, 8196, 200000
Batch size                            500           180
Training steps                        3 × 10^5      1 × 10^5
Hidden dim                            128           256
Attention resolution                  64            64
Learning rate (LR)                    0.0002        0.0002
LR decay (per 1k epochs)              0.995         0.995
U-Net dim mult                        [1,2,2]       [1,2,3,4]
Learned t sinusoidal embedding        Yes           Yes
t0, tf when sampling with ODE         [0.0, 1.0]    [0.0, 1.0]
EMA decay rate                        0.9999        0.9999
EMA start iteration                   10000         10000
# GPUs                                2             8

Table 2: Hyperparameters and architecture for the image datasets.

C.3 ADDITIONAL IMAGE RESULTS

Here we provide some additional image results for image-to-image translation.

Figure 6: Random assortment of additional image translation examples built from the simplex of CelebA, the Oxford flowers dataset, and the animal faces dataset, all at 64 × 64 resolution.

Figure 7: Random assortment of examples comparing the class-to-class translation of MNIST digits 0 and 3, where each row in a triplet corresponds to: (top) model trained only on the edge between 0 and 3; (middle) model trained on all edges between 0 and 5; (bottom) model trained on the entire simplex.