# neural_optimal_transport__87b8bbd2.pdf Published as a conference paper at ICLR 2023 NEURAL OPTIMAL TRANSPORT Alexander Korotin Skolkovo Institute of Science and Technology Artificial Intelligence Research Institute Moscow, Russia a.korotin@skoltech.ru Daniil Selikhanovych Skolkovo Institute of Science and Technology Moscow, Russia selikhanovychdaniil@gmail.com Evgeny Burnaev Skolkovo Institute of Science and Technology Artificial Intelligence Research Institute Moscow, Russia e.burnaev@skoltech.ru We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs. To justify the usage of neural networks, we prove that they are universal approximators of transport plans between probability distributions. We evaluate the performance of our optimal transport algorithm on toy examples and on the unpaired image-to-image translation. (a) Celeba (female) anime, outdoor church, deterministic (one-to-one, W2). (b) Handbags shoes, stochastic (one-to-many, W2,1). Figure 1: Unpaired translation with our Neural Optimal Transport (NOT) Algorithm 1. 1 INTRODUCTION Solving optimal transport (OT) problems with neural networks has become widespread in machine learning tentatively starting with the introduction of the large-scale OT (Seguy et al., 2017) and Wasserstein GANs (Arjovsky et al., 2017). The majority of existing methods compute the OT cost and use it as the loss function to update the generator in generative models (Gulrajani et al., 2017; Liu et al., 2019; Sanjabi et al., 2018; Petzka et al., 2017). Recently, (Rout et al., 2022; Daniels et al., 2021) have demonstrated that the OT plan itself can be used as a generative model providing comparable performance in practical tasks. In this paper, we focus on the methods which compute the OT plan. Most recent methods (Korotin et al., 2021b; Rout et al., 2022) consider OT for the quadratic transport cost (the Wasserstein-2 distance, W2) and recover a nonstochastic OT plan, i.e., a deterministic OT map. In general, it may Published as a conference paper at ICLR 2023 not exist. (Daniels et al., 2021) recover the entropy-regularized stochastic plan, but the procedures for learning the plan and sampling from it are extremely time-consuming due to using score-based models and the Langevin dynamics (Daniels et al., 2021, M6). Contributions. We propose a novel algorithm to compute deterministic and stochastic OT plans with deep neural networks (M4.1, M4.2). Our algorithm is designed for weak and strong optimal transport costs (M2) and generalizes previously known scalable approaches (M3, M4.3). To reinforce the usage of neural nets, we prove that they are universal approximators of transport plans (M4.4). We show that our algorithm can be applied to large-scale computer vision tasks (M5). Notations. We use X, Y, Z to denote Polish spaces and P(X), P(Y), P(Z) to denote the respective sets of probability distributions on them. We denote the set of probability distributions on X Y with marginals P and Q by Π(P, Q). For a measurable map T : X Z Y (or T : X Y), we denote the associated push-forward operator by T#. 2 PRELIMINARIES In this section, we provide key concepts of the OT theory (Villani, 2008; Santambrogio, 2015; Gozlan et al., 2017; Backhoff-Veraguas et al., 2019) that we use in our paper. Figure 2: Monge s OT formulation. Strong OT formulation. 
For P P(X), Q P(Y) and a cost function c : X Y R, Monge s primal formulation of the OT cost is Cost(P, Q) def = inf T#P=Q X c x, T(x) d P(x), (1) where the minimum is taken over measurable functions (transport maps) T : X Y that map P to Q (Figure 2). The optimal T is called the OT map. Note that (1) is not symmetric and does not allow mass splitting, i.e., for some P, Q P(X), P(Y), there may be no T satisfying T#P = Q. Thus, (Kantorovitch, 1958) proposed the following relaxation: Cost(P, Q) def = inf π Π(P,Q) X Y c(x, y)dπ(x, y), (2) where the minimum is taken over all transport plans π (Figure 3a), i.e., distributions on X Y whose marginals are P and Q. The optimal π Π(P, Q) is called the optimal transport plan. If π is of the form [id, T ]#P Π(P, Q) for some T , then T minimizes (1). In this case, the plan is called deterministic. Otherwise, it is called stochastic (nondeterministic). (a) Strong OT formulation (2). (b) Weak OT formulation (3). Figure 3: Strong (Kantorovich s) and weak (Gozlan et al., 2017) optimal transport fomulations. An example of OT cost for X = Y = RD is the (p-th power of) Wasserstein-p distance Wp, i.e., formulation (2) with c(x, y) = x y p. Two its most popular cases are p = 1, 2 (W1, W2 2). Weak OT formulation (Gozlan et al., 2017). Let C : X P(Y) R be a weak cost, i.e., a function which takes a point x X and a distribution of y Y as input. The weak OT cost between P, Q is Cost(P, Q) def = inf π Π(P,Q) X C x, π( |x) dπ(x), (3) where π( |x) denotes the conditional distribution (Figure 3b). Note that (3) is a generalization of (2). Indeed, for cost C x, µ = R Y c(x, y)dµ(y), the weak formulation (3) becomes strong (2). An Published as a conference paper at ICLR 2023 example of a weak OT cost for X = Y = RD is the γ-weak (γ 0) Wasserstein-2 (W2,γ): 1 2 x y 2dµ(y) γ 2 Var(µ) (4) Existence and duality. Throughout the paper, we consider weak costs C(x, µ) which are lower bounded, convex in µ and jointly lower semicontinuous in an appropriate sense. Under these assumptions, (Backhoff-Veraguas et al., 2019) prove that the minimizer π of (3) always exists.1 With mild assumptions on c, strong costs satisfy these assumptions. In particular, they are linear w.r.t. µ, and, consequently, convex. The γ-weak quadratic cost (4) is lower-bounded (for γ 1) and is also convex since the functional Var(µ) is concave in µ. For the costs in view, the dual form of (3) is Cost(P, Q) = sup f X f C(x)d P(x) + Z Y f(y)d Q(y), (5) where f are the upper-bounded continuous functions with not very rapid growth (Backhoff-Veraguas et al., 2019, Equation 1.2) and f C is the weak C-transform of f, i.e. f C(x) def = inf µ P(Y) Y f(y)dµ(y) . (6) Note that for strong costs C, the infimum is attained at any µ P(Y) supported on the arg infy Y{c(x, y) f(y)} set. Therefore, it suffices to use the strong c-transform: f C(x) = f c(x) def = inf y Y {c(x, y) f(y)} . (7) For strong costs (2), formula (5) with (7) is the well known Kantorovich duality (Villani, 2008, M5). Nonuniqueness. In general, an OT plan π is not unique, e.g., see (Peyré et al., 2019, Remark 2.3). 3 RELATED WORK In large-scale machine learning, OT costs are primarily used as the loss to learn generative models. Wasserstein GANs introduced by (Arjovsky et al., 2017; Gulrajani et al., 2017) are the most popular examples of this approach. We refer to (Korotin et al., 2022b; 2021b) for recent surveys of principles of WGANs. However, these models are out of scope of our paper since they only compute the OT cost but not OT plans or maps (M4.3). 
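For intuition, the Kantorovich problem (2) can be solved exactly when P and Q are small discrete samples. Below is a minimal sketch using the POT library for the quadratic cost; the point clouds, sample sizes, and uniform weights are arbitrary illustrative choices, and we assume POT's standard `ot.dist`/`ot.emd` interface.

```python
# A minimal sketch: discrete Kantorovich OT plan (2) for the quadratic cost
# on small synthetic 2D point clouds, using the POT library.
import numpy as np
import ot  # pip install pot

rng = np.random.default_rng(0)
n, m = 64, 64
xs = rng.normal(size=(n, 2))                        # samples from P
ys = rng.normal(size=(m, 2)) + np.array([4.0, 0.0]) # samples from Q

a = np.full(n, 1.0 / n)   # uniform weights (marginal P)
b = np.full(m, 1.0 / m)   # uniform weights (marginal Q)

M = ot.dist(xs, ys, metric="sqeuclidean") / 2.0     # cost matrix c(x, y) = ||x - y||^2 / 2
pi = ot.emd(a, b, M)                                # discrete OT plan, an (n x m) matrix
cost = np.sum(pi * M)                               # estimate of Cost(P, Q) in (2)
print(f"discrete OT cost: {cost:.3f}")
```

Such exact solvers operate on fixed discrete samples and become costly as the sample size grows, which motivates the continuous neural approaches reviewed next.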
To compute OT plans (or maps) is a more challenging problem, and only a limited number of scalable methods to solve it have been developed. We overview methods to compute OT plans (or maps) below. We emphasize that existing methods are designed only for strong OT formulation (2). Most of them search for a deterministic solution (1), i.e., for a map T rather than a stochastic plan π , although T might not always exist. To compute the OT plan (map), (Lu et al., 2020; Xie et al., 2019) approach the primal formulation (1) or (2). Their methods imply using generative models and yield complex optimization objectives with several adversarial regularizers, e.g., they are used to enforce the boundary condition (T#P = Q). As a result, the methods are hard to setup since they require careful selection of hyperparameters. In contrast, methods based on the dual formulation (5) have simpler optimization procedures. Most of such methods are designed for OT with the quadratic cost, i.e., the Wasserstein-2 distance (W2 2). An evaluation of these methods is provided in (Korotin et al., 2021b). Below we mention their issues. Methods by (Taghvaei & Jalali, 2019; Makkuva et al., 2020; Korotin et al., 2021a;c) based on inputconvex neural networks (ICNNs, see (Amos et al., 2017)) have solid theoretical justification, but do not provide sufficient performance in practical large-scale problems. Methods based on entropy regularized OT (Genevay et al., 2016; Seguy et al., 2017; Daniels et al., 2021) recover regularized OT plan that is biased from the true one, it is hard to sample from it or compute its density. According to (Korotin et al., 2021b), the best performing approach is MM:R , which is based on the maximin reformulation of (5). It recovers OT maps fairly well and has a good generative performance. The follow-up papers (Rout et al., 2022; Fan et al., 2022) test extensions of this approach for more general strong transport costs c( , ) and apply it to compute W2 barycenters (Korotin et al., 2022a). Their key limitation is that it aims to recover a deterministic OT map T which might not exist. 1Backhoff-Veraguas et al. (2019) work with the subset Pp(Y) P(Y) whose p-th moment is finite. Henceforth, we also work in Pp(Y) equipped with the Wasserstein-p topology. Since this detail is not principal for our subsequent analysis, to keep the exposition simple, we still write P(Y) but actually mean Pp(Y). Published as a conference paper at ICLR 2023 4 ALGORITHM FOR LEARNING OT PLANS In this section, we develop a novel neural algorithm to recover a solution π of OT problem (3). The following lemma will play an important role in our derivations. Lemma 1 (Existence of transport maps.). Let µ and ν be probability distributions on RM and RN. Assume that µ is atomless. Then there exists a measurable t:RM RN satisfying t#µ = ν. Proof. (Santambrogio, 2015, Cor. 1.29) proves the fact for M =N. The proof works for M =N. Throughout the paper we assume that P, Q are supported on subsets X RP , Y RQ, respectively. 4.1 REFORMULATION OF THE DUAL PROBLEM First, we reformulate the optimization in C-transform (6). For this, we introduce a subset Z RS with an atomless distribution S on it, e.g., S = Uniform [0, 1] or N(0, 1). Lemma 2 (Reformulation of the C-transform). The following equality holds: f C(x)=inf t C(x, t#S) Z Z f t(z) d S(z) , (8) where the infimum is taken over all measurable t : Z Y. Proof. For all x X and t : Z Y we have f C(x) C(x, t#S) R Z f t(z) d S(z). 
The inequality is straightforward: we substitute µ = t#S to (6) to upper bound f C(x) and use the change of variables. Taking the infimum over t, we obtain f C(x) inf t C(x, t#S) Z Z f t(z) d S(z) . (9) Now let us turn (9) to an equality. We need to show that ϵ > 0 there exists tϵ : Z Y satisfying f C(x)+ϵ C(x, tϵ #S) Z Z f tϵ(z) d S(z). (10) By (6) and the definition of inf, µϵ P(Y) such that f C(x) + ϵ C(x, µϵ) R Y f(y)dµϵ(y). Thanks to Lemma 1, there exists tϵ : Z Y such that µϵ = tϵ #S, i.e., (10) holds true. Now we use Lemma 2 to get an analogous reformulation of the integral of f C in the dual form (5). Lemma 3 (Reformulation of the integrated C-transform). The following equality holds: Z X f C(x)d P(x) = inf T C x, T(x, )#S Z Z f T(x, z) d S(z) d P(x), (11) where the inner minimization is performed over all measurable functions T : X Z Y. Proof. The lemma follows from the interchange between the infimum and integral provided by the Rockafellar s interchange theorem (Rockafellar, 1976, Theorem 3A). The theorem states that for a function F : A B R and a distribution ν on A, Z A inf b B F(a, b)dν(a)= inf H:A B A F(a, H(a))dν(a) (12) We apply (12), use A = X, ν = P, and put B to be the space of measurable functions Z Y, and F(a, b) = C(a, b#S) R Y f y d b#S (y). Consequently, we obtain that R X f C(x)d P(x) equals C x, H(x)#S Z Y f(y)d H(x)#S](y) d P(x) (13) Finally, we note that the optimization over functions H : X {t : Z Y} equals the optimization over functions T : X Z Y. We put T(x, z) = [H(x)](z), use the change of variables for y = T(x, z) and derive (11) from (13). Lemma 3 provides the way to represent the dual form (5) as a saddle point optimization problem. Published as a conference paper at ICLR 2023 Corollary 1 (Maximin reformulation of the dual problem). The following holds: Cost(P, Q) = sup f inf T L(f, T), (14) where the functional L is defined by L(f, T) def = Z Y f(y)d Q(y) + Z C x, T(x, )#S Z Z f T(x, z) d S(z) d P(x). (15) Proof. It suffices to substitute (11) into (5). We say that functions T : X Z Y are stochastic maps. If a map T is independent of z, i.e., for all (x, z) X Z we have T(x, z) T(x), we say the map is deterministic. Figure 4: Stochastic function T(x, z) representing a transport plan. The function s input is x X and z S. The idea behind the introduced notation is the following. An optimal transport plan π might be nondeterministic, i.e., there might not exist a deterministic function T : X Y which satisfies π = [id X , T]#P. However, each transport plan π Π(P, Q) can be represented implicitly through a stochastic function T : X Z Y. This fact is known as noise outsourcing (Kallenberg, 1997, Theorem 5.10) for Z = [0, 1] R1 and S = Uniform([0, 1]). Combined with Lemma 1, the noise outsourcing also holds for a general Z RS and atomless S P(Z). We visualize the idea in Figure 4. For a plan π, there might exist multiple maps T which represent it. For a pair of probability distributions P, Q, we say that T is a stochastic optimal transport map if it realizes some optimal transport plan π . Such maps solve the inner problem in (14) for optimal f . Lemma 4 (Optimal maps solve the maximin problem). For any maximizer f of (5) and for any stochastic map T which realizes some optimal transport plan π , it holds that T arg inf T L(f , T). (16) Proof. Let π be the OT plan realized by T . 
We derive Z Z f T (x, z) d S(z)d P(x) = Z Y f (y)dπ (y|x)dπ (x) = Z X Y f (y)dπ (x, y) = Z Y f (y)d Q(y), (17) where we change the variables for y = T (x, z) and use the property dπ (x) = d P(x). Now assume that T / arg inf T L(f , T). In this case, from the definition (15) we conclude that L(f , T ) > Cost(P, Q). However, we derive substituting (17) into (15), we see that L(f , T ) = R X C x, T (x, )#S | {z } π (y|x) d P(x) | {z } dπ (x) =Cost(P, Q), which is a contradiction. Thus, (16) holds true. For the γ-weak quadratic cost (4) which we use in the experiments (M5), a maximizer f of (5) indeed exists, see (Alibert et al., 2019, M5.22) or (Gozlan & Juillet, 2020). Thanks to our Lemma 4, one may solve the saddle point problem (14) and extract an optimal stochastic transport map T from its solution (f , T ). In general, the arg inf set for f may contain not only the optimal stochastic transport maps but other stochastic functions as well. In Appendix F, we show that for strictly convex (in µ) costs C(x, µ), all the solutions of (14) provide stochastic OT maps. 4.2 PRACTICAL OPTIMIZATION PROCEDURE To approach the problem (14) in practice, we use neural networks Tθ : RP RS RQ and fω : RQ R to parameterize T and f, respectively. We train their parameters with the stochastic gradient ascent-descent (SGAD) by using random batches from P, Q, S, see Algorithm 1. Published as a conference paper at ICLR 2023 Algorithm 1: Neural optimal transport (NOT) Input :distributions P, Q, S accessible by samples; mapping network Tθ : RP RS RQ; potential network fω : RQ R; number of inner iterations KT ; (weak) cost C : X P(Y) R; empirical estimator b C x, T(x, Z) for the cost; Output :learned stochastic OT map Tθ representing an OT plan between distributions P, Q; repeat Sample batches Y Q, X P; for each x X sample batch Zx S; Lf 1 |X| P x X 1 |Zx| P z Zx fω Tθ(x, z) 1 Update ω by using Lf ω ; for k T = 1, 2, . . . , KT do Sample batch X P; for each x X sample batch Zx S; LT 1 |X| P x X b C x, Tθ(x, Zx) 1 |Zx| P z Zx fω Tθ(x, z) ; Update θ by using LT until not converged; Our Algorithm 1 requires an empirical estimator b C for C x, T(x, )#S . If the cost is strong, it is straightforward to use the following unbiased Monte-Carlo estimator from a random batch Z S: C x, T(x, )#S = Z Z c(x, T(x, z))d S(z) 1 |Z| z Z c x, T(x, z) def = b C x, T(x, Z) . (18) For general costs C, providing an estimator might be nontrivial. For the γ-weak quadratic cost (4), such an unbiased Monte-Carlo estimator is straightforward to derive: b C x, T(x, Z) def = 1 2|Z| z Z x T(x, z) 2 γ 2 ˆσ2, (19) where ˆσ2 is the (corrected) batch variance ˆσ2 = 1 |Z| 1 P z Z T(x, z) 1 |Z| P z Z T(x, z) 2. To estimate strong costs (18), it is enough to sample a single noise vector (|Z| = 1). To estimate the γ-weak quadratic cost (19), one needs |Z| 2 since the estimation of the variance ˆσ2 is needed. 4.3 RELATION TO PRIOR WORKS Generative adversarial learning. Our algorithm 1 is a novel approach to learn stochastic OT plans; it is not a GAN or WGAN-based solution endowed with additional losses such as the OT cost. WGANs (Arjovsky et al., 2017) do not learn an OT plan but use the (strong) OT cost as the loss to learn the generator network. Their problem is inf T supf V(T, f). The generator T solves the outer inf T problem and is the first coordinate of an optimal saddle point (T , f ). 
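Returning to the estimators (18)–(19) of §4.2, the sketch below shows one way they could be written in PyTorch. The batched interface (one input x together with the mapped noise batch T(x, Z)) and the function names are our assumptions for illustration, not the authors' released code.

```python
# Sketch of the empirical cost estimators (18)-(19); T_xZ holds the mapped
# samples T(x, z) for a single input x and a batch Z of noise vectors.
import torch

def strong_quadratic_cost(x, T_xZ):
    # (18) with c(x, y) = ||x - y||^2 / 2; T_xZ has shape [|Z|, D], x has shape [D].
    return 0.5 * ((x.unsqueeze(0) - T_xZ) ** 2).sum(dim=1).mean()

def weak_quadratic_cost(x, T_xZ, gamma):
    # (19): strong part minus gamma/2 times the corrected batch variance of T(x, Z).
    # Requires |Z| >= 2 for the variance estimate.
    strong = 0.5 * ((x.unsqueeze(0) - T_xZ) ** 2).sum(dim=1).mean()
    var = ((T_xZ - T_xZ.mean(dim=0, keepdim=True)) ** 2).sum() / (T_xZ.shape[0] - 1)
    return strong - 0.5 * gamma * var

# Example with arbitrary dimensions: one input x in R^4, |Z| = 8 mapped samples.
x = torch.randn(4)
T_xZ = torch.randn(8, 4)
print(weak_quadratic_cost(x, T_xZ, gamma=1.0))
```

In Algorithm 1, these per-x estimates are averaged over the batch X ∼ P inside L_T.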
In our algorithm 1, problem (15) is supf inf T L(f, T), the generator (transport map) T solves of the inner inf T problem and is the second coordinate of an optimal saddle point (f , T ). Intuitively, in our case the generator T is adversarial to potential f (discriminator), not vise-versa as in GANs. Theoretically, the problem is also significantly different swapping inf T and supf, in general, yields a different problem with different solutions, e.g., 1 = infx supy sin(x+y) = supy infx sin(x+y) = 1. Practically, we do KT > 1 updates of T per one step of f, which again differs from common GAN practices, where multiple updates of f are done per a step of T. Finally, in contrast to WGANs, we do not need to enforce any constraints on f, e.g., the 1-Lipschitz continuity. Stochastic generator parameterization. We add an additional noise input z to transport map T(x, z) to make it stochastic. This approach is a common technical instrument to parameterize one-to-many mappings in generative modeling, see (Almahairi et al., 2018, M3.1) or (Zhu et al., 2017b, M3). In the context of OT, (Yang & Uhler, 2019) employ a stochastic generator to learn a transport plan π in the unbalanced OT problem (Chizat, 2017). Due to this, their optimization objective slightly resembles ours (15). However, this similarity is deceptive, see Appendix G. Dual OT solvers. Our algorithm 1 recovers stochastic plans for weak costs (3). It subsumes previously known approaches which learn deterministic OT maps for strong costs (2). When the cost is strong (3) and transport map T is restricted to be deterministic T(x, z) T(x), our Algorithm 1 yields maximin method MM:R , which was discussed in (Korotin et al., 2021b, M2) for the quadratic cost 1 2 x y 2 and further developed by (Rout et al., 2022) for the Q-embedded cost Q(x), y and by (Fan et al., 2022) for other strong costs c(x, y). These works are the most related to our study. Published as a conference paper at ICLR 2023 4.4 UNIVERSAL APPROXIMATION WITH NEURAL NETWORKS In this section, we show that it is possible to approximate transport maps with neural nets. Theorem 1 (Neural networks are universal approximators of stochastic transport maps). Assume that X, Z are compact and Q has finite second moment. Let T be a stochastic map from P to Q (not necessarily optimal). Then for any nonaffine continuous activation function which is continuously differentiable at at least one point (with nonzero derivative at that point) and for any ϵ > 0, there exists a neural network Tθ : RP RS RQ satisfying Tθ T 2 L2 ϵ and W2 2 (Tθ)#(P S), Q ϵ, (20) where L2 = L2(P S, X Z RQ) is the space of quadratically integrable w.r.t. P S functions X Z RQ. That is, the network Tθ generates a distribution which is ϵ-close to Q in W2 2. Proof. The squared norm T 2 L2 is equal to the second moment of Q since T pushes P S to Q. The distribution Q has finite second moment, and, consequently, T L2. Thanks to (Folland, 1999, Proposition 7.9), the continuous functions C0(X Z RQ) are dense2 in L2. According to (Kidger & Lyons, 2020, Theorem 3.2), the neural networks RP RS RQ with the abovementioned activations are dense in C0(X Z RQ) w.r.t. L norm and, consequently, w.r.t. L2 norm. Combining these results yields that neural nets are dense in L2, and for every ϵ > 0 there necessarily exists network Tθ satisfying the left inequality in (20). For Tθ, the right inequality follows from (Korotin et al., 2021a, Lemma A.2). Our Theorem 1 states that neural nets can approximate stochastic maps in L2 norm. 
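To make the procedure of §4.2 concrete, below is a simplified sketch of one outer iteration of Algorithm 1 in PyTorch. The sampler and network interfaces (T(x, z) → y, f(y) → per-sample scalar) and the per-x loop are our own illustrative assumptions; the released implementation may differ.

```python
# Sketch of one outer iteration of Algorithm 1 (NOT).
# Assumed interfaces: sample_P(), sample_Q() return batches [B, D]; sample_S(n)
# returns n noise vectors [n, Dz]; T(x, z) and f(y) are torch.nn.Modules.
import torch

def not_outer_step(T, f, opt_T, opt_f, sample_P, sample_Q, sample_S,
                   cost_fn, k_T=10, z_per_x=4):
    # --- update the potential f: one step ascending L(f, T) ---
    x, y = sample_P(), sample_Q()
    z = sample_S(x.shape[0] * z_per_x)
    x_rep = x.repeat_interleave(z_per_x, dim=0)
    loss_f = f(T(x_rep, z)).mean() - f(y).mean()     # minimizing this ascends L w.r.t. f
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

    # --- update the map T: k_T steps descending L(f, T) ---
    for _ in range(k_T):
        x = sample_P()
        z = sample_S(x.shape[0] * z_per_x).view(x.shape[0], z_per_x, -1)
        loss_T = 0.0
        for i in range(x.shape[0]):                  # per-x cost and potential terms
            T_xZ = T(x[i].expand(z_per_x, -1), z[i])
            loss_T = loss_T + cost_fn(x[i], T_xZ) - f(T_xZ).mean()
        loss_T = loss_T / x.shape[0]
        opt_T.zero_grad(); loss_T.backward(); opt_T.step()
```

With `cost_fn` set to the γ-weak estimator (19), e.g. `lambda x, T_xZ: weak_quadratic_cost(x, T_xZ, gamma=1.0)`, this closely follows the pseudocode of Algorithm 1 up to batching details.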
It should be taken into account that such continuous nets Tθ may be highly irregular and hard to learn in practice. 5 EVALUATION We perform comparison with the weak discrete OT (considered as the ground truth) on toy 2D, 1D distributions in Appendices B, C, respectively. In this section, we test our algorithm on an unpaired image-to-image translation task. We perform comparison with popular existing translation methods in Appendix D. The code is written in Py Torch framework and is publicly available at https://github.com/iamalexkorotin/Neural Optimal Transport Image datasets. We use the following publicly available datasets as P, Q: aligned anime faces3, celebrity faces (Liu et al., 2015), shoes (Yu & Grauman, 2014), Amazon handbags, churches from LSUN dataset (Yu et al., 2015), outdoor images from the MIT places database (Zhou et al., 2014). The size of datasets varies from 50K to 500K images. Train-test split. We pick 90% of each dataset for unpaired training. The rest 10% are considered as the test set. All the results presented here are exclusively for test images, i.e., unseen data. Transport costs. We experiment with the strong (γ = 0) and γ-weak (γ > 0) quadratic costs. Testing other costs, e.g., perceptual (Johnson et al., 2016) or semantic (Cherian & Sullivan, 2019), might be interesting practically, but these two quadratic costs already provide promising performance. The other training details are given in Appendix E. 5.1 PRELIMINARY EVALUATION In the preliminary experiments with strong cost (γ = 0), we noted that T(x, z) becomes independent of z. For a fixed potential f and a point x, the map T(x, ) learns to be the map pushing distribution S to some arg inf distribution µ of (6). For strong costs, there are suitable degenerate distributions µ, see the discussion around (7). Thus, for T it becomes unnecessary to keep any dependence on z; it simply learns a deterministic map T(x, z) = T(x). We call this behavior a conditional collapse. Importantly, for the γ-weak cost (γ > 0), we noted a different behavior: the stochastic map T(x, z) did not collapse conditionally. To explain this, we substitute (4) into (3) to obtain W2 2,γ(P, Q) = inf π Π(P,Q) 1 2 x y 2dπ(x, y) γ Z 1 2Var π(y|x) dπ(x) | {z } d P(x) The first term is analogous to the strong cost (W2 = W2,0), while the additional second term stimulates the OT plan to be stochastic, i.e., to have high conditional variance. 2The proposition considers scalar-valued functions (Q = 1), but is analogous for vector-valued functions. 3kaggle.com/reitanaka/alignedanimefaces Published as a conference paper at ICLR 2023 (a) Handbags shoes, 128 128. (b) Shoes handbags, 128 128. (c) Celeba (female) anime, 64 64. (d) Anime celeba (female), 64 64. (e) Celeba (male) celeba (female), 64 64. (f) Anime shoes, 64 64. Figure 5: Unpaired translation with deterministic OT maps (W2). Taking into account our preliminary findings, we perform two types of experiments. In 5.2, we learn deterministic (one-to-one) translation maps T(x) for the strong cost (γ = 0), i.e., do not add z-channel. In 5.3, we learn stochastic (one-to-many) maps T(x, z) for the γ-weak cost (γ > 0). For completeness, in Appendix A, we study how varying γ affects the diversity of samples. 5.2 ONE-TO-ONE TRANSLATION WITH OPTIMAL MAPS We learn deterministic OT maps between various pairs of datasets. We provide the results in Figures 1a and 5. Extra results for all the dataset pairs that we consider are given in Appendix H. 
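Before turning to the qualitative results, a tiny numerical illustration of the variance term in the decomposition of §5.1 may help (NumPy only; the distributions and numbers are arbitrary). For a fixed x, spreading the conditional mass adds roughly Var(µ)/2 to the strong part of the cost but is credited γ·Var(µ)/2 back by the weak term, so the penalty for stochasticity shrinks as γ grows and vanishes at γ = 1 (cf. (21) in Appendix B); under the marginal constraint π ∈ Π(P, Q), this is what makes stochastic plans competitive.

```python
# Illustrative only: gamma-weak quadratic cost (4) of a spread conditional
# distribution versus a point mass with the same mean, for one fixed input x.
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(2)
spread = rng.normal(size=(4096, 2))              # stochastic conditional mu
point = np.tile(spread.mean(axis=0), (4096, 1))  # all mass at the mean of mu

def weak_cost(x, samples, gamma):
    strong = 0.5 * ((x - samples) ** 2).sum(axis=1).mean()
    var = ((samples - samples.mean(axis=0)) ** 2).sum(axis=1).mean()
    return strong - 0.5 * gamma * var

for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: point={weak_cost(x, point, gamma):.3f}, "
          f"spread={weak_cost(x, spread, gamma):.3f}")
# gamma=0.0: the spread conditional pays its full variance (~1.0 extra);
# gamma=1.0: the two candidates have equal cost, cf. (21).
```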
Being optimal, our translation map b T(x) tries to minimally change the image content x in the L2 pixel space. This results in preserving certain features during translation. In shoes handbags (Figures 5b, 5a), the image color and texture of the pushforward samples reflects those of input samples. In celeba (female) anime (Figures 1a, 5c, 5d), head forms, hairstyles are mostly similar for input and output images. The hair in anime is usually bigger than that in celeba. Thus, when translating celeba (female) anime, the anime hair inherits the color from the celebrity image background. In outdoor churches (Figure 1a), the ground and the sky are preserved, in celeba (male) celeba (female) (Figure 5e) the face does not change. We also provide results for translation in the case when the input and output domains are significantly different, see anime shoes (Figure 5f). Related work. Existing unpaired translation models, e.g., Cycle GAN (Zhu et al., 2017a) or UNIT (Liu et al., 2017), typically have complex adversarial optimization objectives endowed with additional losses. These models require simultaneous optimization of several neural networks. Importantly, vanilla Cycle GAN searches for a random translation map and is not capable of preserving certain attributes, e.g., the color, see (Lu et al., 2019, Figure 5b). To handle this issue, imposing extra losses is required (Benaim & Wolf, 2017; Kim et al., 2017), which further complicates the hyperparameter selection. In contrast, our approach has a straightforward objective (14); we use only 2 networks (potential f, map T), see Table 2 for the comparison of hyperparameters. While the majority of existing unpaired translation models are based on GANs, recent work (Su et al., 2023) proposes a diffusion model (DDIBs) and relates it to Schrödinger Bridge (Léonard, 2014), i.e., entropic OT. 5.3 ONE-TO-MANY TRANSLATION WITH OPTIMAL PLANS We learn stochastic OT maps between various pairs of datasets for the γ-weak quadratic cost. The parameter γ equals 2 3 or 1 in the experiments. We provide the results in Figures 1b and 6. In all the Published as a conference paper at ICLR 2023 (a) Celeba (female) anime, 128 128 (W2, 2 (b) Outdoor church, 128 128 (W2, 2 (c) Anime celeba (f), 64 64 (W2, 2 (d) Shoes handbags, 64 64 (W2,1). (e) Anime shoes, 64 64 (W2,1). Figure 6: Unpaired translation with stochastic OT maps (W2,γ). cases, the random noise inputs z S are not synchronized for different inputs x. The examples with the synchronized noise inputs z are given in Appendix I. Extended results and examples of interpolation in the conditional latent space are given in Appendix H. The stochastic map b T(x, z) preserves the attributes of the input image and produces multiple outputs. Related work. Transforming a one-to-one learning pipeline to one-to-many is nontrivial. Simply adding additional noise input leads to conditional collapse (Zhang, 2018). This is resolved by Aug Cycle GAN (Almahairi et al., 2018) and M-UNIT (Huang et al., 2018), but their optimization objectives are much more complicated then vanilla versions. Our method optimizes only 2 nets f, T in straightforward objective (14). It offers a single parameter γ to control the amount of variability in the learned maps. We refer to Table 2 for the comparison of hyperparameters of the methods. 6 DISCUSSION Potential impact. Our method is a novel generic tool to align probability distributions with deterministic and stochastic transport maps. 
Beside unpaired translation, we expect our approach to be applied to other one-to-one and one-to-many unpaired learning tasks as well (image restoration, domain adaptation, etc.) and improve existing models in those fields. Compared to the popular models based on GANs (Goodfellow et al., 2014) or diffusion models (Ho et al., 2020), our method provides better interpretability of the learned map and allows to control the amount of diversity in generated samples (Appendix A). It should be taken into account that OT maps we learn might be suitable not for all unpaired tasks. We mark designing task-specific transport costs as a promising research direction. Limitations. Our method searches for a solution (f , T ) of a saddle point problem (14) and extracts the stochastic OT map T from it. We highlight after Lemma 4 and in M5.1 that not all T are optimal stochastic OT maps. For strong costs, the issue leads to the conditional collapse. Studying saddle points of (14) and arg inf sets (16) is an important challenge to address in the further research. Published as a conference paper at ICLR 2023 Potential societal impact. Our developed method is at the junction of optimal transport and generative learning. In practice, generative models and optimal transport are widely used in entertainment (image-manipulation applications like adding masks to images, hair coloring, etc.), design, computer graphics, rendering, etc. Our method is potentially applicable to many problems appearing in mentioned industries. While the mentioned applications allow making image processing methods publicly available, a potential negative is that they might transform some jobs in the graphics industry. Reproducibility. We provide the source code for all experiments and release the checkpoints for all models of M5. The details are given in README.MD in the official repository. ACKNOWLEDGEMENTS. The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). J-J Alibert, Guy Bouchitté, and Thierry Champion. A new class of costs for optimal transport planning. European Journal of Applied Mathematics, 30(6):1229 1263, 2019. Amjad Almahairi, Sai Rajeshwar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, pp. 195 204. PMLR, 2018. Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 146 155. JMLR. org, 2017. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. Julio Backhoff-Veraguas, Mathias Beiglböck, and Gudmun Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):1 28, 2019. Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. Advances in Neural Information Processing Systems, 30, 2017. Anoop Cherian and Alan Sullivan. Sem-gan: semantically-consistent image-to-image translation. In 2019 ieee winter conference on applications of computer vision (wacv), pp. 1797 1806. IEEE, 2019. Lenaic Chizat. Unbalanced optimal transport: Models, numerical methods, applications. Ph D thesis, Université Paris sciences et lettres, 2017. Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. 
Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188 8197, 2020. Grady Daniels, Tyler Maunu, and Paul Hand. Score-based generative neural networks for large-scale optimal transport. Advances in Neural Information Processing Systems, 34, 2021. Jiaojiao Fan, Shu Liu, Shaojun Ma, Yongxin Chen, and Hao-Min Zhou. Scalable computation of monge maps with general costs. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. URL https://openreview.net/forum?id=r En GR3Vd DW5. Gerald B Folland. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999. Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for largescale optimal transport. In Advances in neural information processing systems, pp. 3440 3448, 2016. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Published as a conference paper at ICLR 2023 Nathael Gozlan and Nicolas Juillet. On a mixture of brenier and strassen theorems. Proceedings of the London Mathematical Society, 120(3):434 463, 2020. Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327 3405, 2017. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626 6637, 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172 189, 2018. Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694 711. Springer, 2016. Olav Kallenberg. Foundations of modern probability, volume 2. Springer, 1997. Leonid Kantorovitch. On the translocation of masses. Management Science, 5(1):1 4, 1958. Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pp. 2306 2327. PMLR, 2020. Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, pp. 1857 1865. PMLR, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=b Eoxz W_EXsa. Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? 
a continuous wasserstein-2 benchmark. Advances in Neural Information Processing Systems, 34, 2021b. Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021c. URL https://openreview.net/forum?id=3t FAs5E-Pe. Alexander Korotin, Vage Egiazarian, Lingxiao Li, and Evgeny Burnaev. Wasserstein iterative networks for barycenter estimation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022a. URL https: //openreview.net/forum?id=Gi Enzx Tna MN. Alexander Korotin, Alexander Kolesov, and Evgeny Burnaev. Kantorovich strikes back! wasserstein GANs are not optimal transport? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/forum? id=Vt EEpi-d Glt. Christian Léonard. A survey of the schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems, 34(4):1533, 2014. Published as a conference paper at ICLR 2023 Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein GAN with quadratic transport cost. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4832 4841, 2019. Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700 708, 2017. Yahui Liu, Marco De Nadai, Jian Yao, Nicu Sebe, Bruno Lepri, and Xavier Alameda-Pineda. Gmm-unit: Unsupervised multi-domain and multi-modal image-to-image translation via attribute gaussian mixture modeling. ar Xiv preprint ar Xiv:2003.06788, 2020. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Guansong Lu, Zhiming Zhou, Yuxuan Song, Kan Ren, and Yong Yu. Guiding the one-to-one mapping in cyclegan via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4432 4439, 2019. Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, and Yong Yu. Large-scale optimal transport via adversarial training with cycle-consistency. ar Xiv preprint ar Xiv:2003.06635, 2020. Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672 6681. PMLR, 2020. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271 279, 2016. Henning Petzka, Asja Fischer, and Denis Lukovnicov. On the regularization of wasserstein gans. ar Xiv preprint ar Xiv:1709.08894, 2017. Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. R Tyrrell Rockafellar. Integral functionals, normal integrands and measurable selections. In Nonlinear operators and the calculus of variations, pp. 157 207. Springer, 1976. Ralph Rockafellar. Characterization of the subdifferentials of convex functions. Pacific Journal of Mathematics, 17(3):497 510, 1966. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. 
In International Conference on Medical image computing and computerassisted intervention, pp. 234 241. Springer, 2015. Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=5Jd LZg346Lw. Kuniaki Saito, Kate Saenko, and Ming-Yu Liu. Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder. In European Conference on Computer Vision, pp. 382 398. Springer, 2020. Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training gans with regularized optimal transport. Advances in Neural Information Processing Systems, 31, 2018. Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2017. Published as a conference paper at ICLR 2023 Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5HLo Tv VGDe. Amirhossein Taghvaei and Amin Jalali. 2-Wasserstein approximation via restricted convex potentials with application to improved training for GANs. ar Xiv preprint ar Xiv:1902.07197, 2019. Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. volume 97 of Proceedings of Machine Learning Research, pp. 6882 6892, Long Beach, California, USA, 09 15 Jun 2019. PMLR. URL http: //proceedings.mlr.press/v97/xie19a.html. Karren D. Yang and Caroline Uhler. Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hyex Ai A5Fm. Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192 199, 2014. Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Yongqi Zhang. Xogan: One-to-many unsupervised image-to-image translation. ar Xiv preprint ar Xiv:1805.07277, 2018. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. 2014. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223 2232, 2017a. Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. Advances in neural information processing systems, 30, 2017b. Published as a conference paper at ICLR 2023 A VARIANCE-SIMILARITY TRADE-OFF In this section, we study the effect of the parameter γ on the structure of the learned stochastic map for the γ-weak quadratic cost. 
We consider handbags → shoes translation (64 × 64) and test γ ∈ {0, 1/3, 1}. The results are shown in Figure 7.

Figure 7: Stochastic handbags → shoes translation with the γ-weak quadratic cost for various γ.

Discussion. For γ = 0 there is no variety in the produced samples (Figure 7a), i.e., conditional collapse happens. As γ increases (Figures 7b, 7c), the variety of the samples increases while the style of the input images is mostly preserved. For γ = 1 (Figure 7d), the variety of the samples is very high, but many of them do not preserve the style of the input image. The parameter γ can thus be viewed as a trade-off parameter balancing the variance of the samples against their similarity to the input.

B TOY 2D EXPERIMENTS

In this section, we test our Algorithm 1 on toy 2D distributions P, Q, i.e., P = Q = 2.

Strong quadratic cost (γ = 0). As we noted in §5.1 and Appendix A, for the strong quadratic cost our method tends to learn deterministic maps T(x, z) = T(x) which are independent of the noise input z. For deterministic maps T(x), our method reduces to the MM:R method, which has been evaluated in the recent Wasserstein-2 benchmark by (Korotin et al., 2021b). The authors show that this method recovers OT maps well on synthetic high-dimensional pairs P, Q with known ground-truth OT maps. Thus, for brevity, we do not include toy experiments with our method for the strong quadratic cost.

Weak quadratic cost (γ > 0). To our knowledge, our method is the first to solve weak OT, i.e., there are no approaches to compare with. The analysis of computed transport plans for weak costs is challenging due to the lack of nontrivial pairs P, Q with a known ground-truth OT plan π*. The situation is further worsened by the nonuniqueness of π*. To cope with this issue, we consider the weak quadratic cost with γ = 1. For this cost, one may derive

∫_Y ½‖x − y‖² dµ(y) − ½ Var(µ) = ½ ‖x − ∫_Y y dµ(y)‖².   (21)

For cost (21) and a pair P, Q, (Gozlan & Juillet, 2020, Theorem 1.2) states that there exists a P-unique (up to a constant) convex ψ : R^P → R such that every OT plan π* satisfies ∇ψ(x) = ∫_Y y dπ*(y|x). Besides, ∇ψ : R^P → R^P is 1-Lipschitz. Let T̂(x, z) be the stochastic map recovered by our Algorithm 1, and let π̂ be the corresponding plan. Let

T(x) := ∫_Y y dπ̂(y|x) = ∫_Z T̂(x, z) dS(z).   (22)

Due to the above-mentioned characterization of OT plans, T(x) should look like a gradient ∇ψ(x) of some convex function ψ(x) and should nearly be a contraction. Since we work here in 2D, we can draw sufficiently many samples from P and Q and obtain a fine approximation of an OT plan π* and of ∇ψ with a discrete weak OT solver. We sample random batches X ∼ P and Y ∼ Q of size 2^10 and use ot.weak from the POT library to get some optimal π* and ∇ψ = ∫_Y y dπ*(y|x). We then compare our recovered average map T with ∇ψ.

Figure 8: Gaussian → Mixture of 8 Gaussians, learned stochastic map for the 1-weak quadratic cost. Panels: (a) input distribution P; (b) target distribution Q; (c) fitted T̂#(P × S) ≈ Q; (d) average map T(x) = ∫_Z T̂(x, z) dS(z); (e) learned stochastic map T̂(x, z); (f) map x ↦ ∫_Y y dπ*(y|x).

Datasets. We test 2 pairs P, Q: Gaussian → Mixture of 8 Gaussians; Gaussian → Swiss roll.

Neural networks. We use multi-layer perceptrons as fω, Tθ with 3 hidden layers of 100 neurons and ReLU nonlinearity. The input of the stochastic map Tθ(x, z) is 2 + 2 = 4 dimensional: the first two dimensions represent the input x ∈ R², and the other two represent the noise z ∼ S. We employ Gaussian noise with σ = 0.1.

Discussion.
We provide qualitative results in Figures 8 and 9. In both cases, the pushforward distribution b T#(P S) matches the desired target distribution Q (Figures 8c and 9c). Figures 8e 4https://pythonot.github.io/ Published as a conference paper at ICLR 2023 (a) Input distribution P. (b) Target distribution Q. (c) Fitted b T#(P S) Q. (d) Map T(x) = R Z b T(x, z)d S(z). (e) Learned stochastic map b T(x, z). (f) Map x7 R Y ydπ (y|x). Figure 9: Gaussian Swiss Roll, learned stochastic OT map for the 1-weak quadratic cost. and 9e show how the mass of points x P is split by the stochastic map. The average maps T(x) (Figures 8d, 9d) indeed nearly match the ground truth ψ (Figures 8f, 9f) obtained by POT. To quantify them, we compute L2-UVP (T) = 100% T ψ 2 P/Var( ψ#P) metric (Korotin et al., 2021a, M5.1). Here we obtain small values < 1% and 3% for the Swiss Roll and 8 Gaussians examples which further indicates the similarity of the learned T and the ground truth ψ(x). Note that T indeed roughly equals a gradient of a convex function. The gradients of convex functions are cycle monotone (Rockafellar, 1966). Cycle monotonicity yields that for x1 = x2 the segments [x1, ψ(x1)] and [x2, ψ(x2)] do not intersect in the inner points (Villani, 2008, M8).5 Visually, we see that in Figures 8d and 9d the segments [x, T(x)] do not intersect for different x, which is good. C TOY 1D EXPERIMENTS In this section, we additionally test our Algorithm 1 on toy 1D distributions P, Q, i.e., P = Q = 1. In this case, transport plans are 2D distributions and can be conveniently visualized. We experiment with the 1-weak quadratic cost (21). Following the discussion in the previous section, we recall that an OT plan π may be not unique. However, all OT plans satisfy ψ(x) = R Y y dπ (y|x) for some 1-smooth convex function ψ : R R. This simply means that x 7 ψ(x) = R Y y dπ (y|x) is a monotone increasing 1-Lipschitz function ψ : R R. Below we check that this necessary condition holds for T (22), where ˆT is our learned stochastic map. Datasets. We test 2 pairs P, Q: Gaussian Mix of 2 Gaussians; Gaussian Mix of 3 Gaussians. Neural Networks. We use the same networks as in Appendix B. This time, the input of the stochastic map Tθ(x, z) is 1 + 1 = 2 dimensional, the input to fω 1-dimensional. Discussion. We provide qualitative results in Figures 10 and 11. For each case, we plot the results of 3 random restarts of our method (ˆπ denotes our learned OT plan). Similarly to Appendix B, we plot 5For the sake of clarity, we slightly reformulated the property of the cycle monotone maps (Villani, 2008). Published as a conference paper at ICLR 2023 (a) Input and output distributions. (b) Learned plan ˆπ and marginal ˆT#(P S), test 1. (c) Learned plan ˆπ and marginal ˆT#(P S), test 2. (d) Learned plan ˆπ and marginal ˆT#(P S), test 3. (e) Various optimal plans π learned by discrete OT (considered here as the ground truth). Figure 10: Stochastic plans between toy 1D distributions (Figure 10a) learned by NOT (Figures 10b, 10c, 10d) and discrete OT (Figure 10e) with the 1-weak quadratic cost. The figures with the 2D transport plans also demonstrate the average map x 7 R Y ydˆπ(y|x) (conditional expectation). the results obtained by a discrete weak OT solver (ot.weak from POT library). Namely, in Figures 10e, 11e we show its results obtained for 4 restarts with differing seeds. Note that the average maps T computed by our algorithm in both cases nearly match those computed by the discrete weak OT. 
This indicates that the transport cost of our computed plan ˆπ is since [Cost of ˆπ] = Z 1 2 x T(x) |{z} ψ(x) 1 2 x ψ(x) 2d P(x) = Cost(P, Q), i.e., it nearly equals the optimal cost. Here we use T(x) ψ(x) observed from the experiments. To conclude, wee see that the recovered plans are close to the DOT considered as the ground truth. D COMPARISON WITH PRINCIPAL UNPAIRED TRANSLATION METHODS We compare our Algorithm 1 with popular models for unpaired translation. We consider handbags shoes (64 64), celeba male female (64 64), outdoor church (128 128) translation. For quantitative comparison, we compute Frechet Inception Distance6 (Heusel et al., 2017, FID) of the 6github.com/mseitzer/pytorch-fid Published as a conference paper at ICLR 2023 (a) Input and output distributions. (b) Learned plan ˆπ and marginal ˆT#(P S), test 1. (c) Learned plan ˆπ and marginal ˆT#(P S), test 2. (d) Learned plan ˆπ and marginal ˆT#(P S), test 3. (e) Various optimal plans π learned by discrete OT (considered here as the ground truth). Figure 11: Stochastic plans between toy 1D distributions (Figure 11a) learned by NOT (Figures 11b, 11c, 11d) and discrete OT (Figure 11e) with the 1-weak quadratic cost. The figures with the 2D transport plans also demonstrate the average map x 7 R Y ydˆπ(y|x) (conditional expectation). mapped test handbags subset w.r.t. the test shoes subset. The scores of our method and alternatives are given in Table 1. The translated images are shown in Figures 12, 13, 14. Methods. We compare our method with one-to-one Cycle GAN 7(Zhu et al., 2017a), Disco GAN8 (Kim et al., 2017) and with one-to-many Aug Cycle GAN9 (Almahairi et al., 2018) and MUNIT10 (Huang et al., 2018). We use the official or community implementations with the hyperparameters from the respective papers. We choose the above-mentioned methods for comparison because they are principal methods for one-to-one and one-to-many translation. Recent methods (GMM-UNIT (Liu et al., 2020), COCO-FUNIT (Saito et al., 2020), Star GAN (Choi et al., 2020)) are based on them and focus on specific details/setups such as style/content separation, few-shot learning, disentanglement, multi-domain transfer, which are out of scope of our paper. Discussion. Existing one-to-one methods visually preserve the style during translation comparably to our method. Alternative one-to-many methods do not preserve the style at all. When the input 7github.com/eriklindernoren/Py Torch-GAN/tree/master/implementations/ cyclegan 8github.com/eriklindernoren/Py Torch-GAN/tree/master/implementations/ discogan 9github.com/aalmah/augmented_cyclegan 10github.com/NVlabs/MUNIT Published as a conference paper at ICLR 2023 and output domains are similar (handbags shoes, celeba male female), the FID scores of all the models are comparable. However, most models are outperformed by NOT when the domains are distant (outdoor church), see Figure 14 and the last row in Table 1. For completeness, in Table 2 we compare the number of hyperparameters of the translation methods in view. Note that in contrast to the other methods, we optimize only 2 neural networks transport map and potential. Type One-to-one One-to-many Method Disco GAN Aug Cycle GAN MUNIT NOT (ours) Handbags shoes (64 64) 22.42 16.00 13.77 18.84 0.11 Celeba male female (64 64) 35.64 17.74 13.23 12.94 0.08 Outdoor church (128 128) 75.36 46.39 25.5 51.42 0.12 Table 1: Test FID of the considered unpaired translation methods. 
Type One-to-one One-to-many Method Disco GAN Aug Cycle GAN MUNIT NOT (ours) Hyperparameters of optimization objectives Weights of cycle and identity losses λcyc, λid Weights of cycle losses γ1, γ2 Weights of reconstruction losses λx, λc, λs Diversity control parameter γ Total number of hyperparameters 0 2 0 2 3 1 2 generators, 2 29.2M 2 discriminators 2 0.7M 2 generators 2 11.4M 2 discriminators 2 2.8M 1 transport 9.7M, 1 potential 22.9M [32.4M ] 2 generators 2 1.1M, 2 discriminators 2 2.8M, 2 encoders 2 1.4M 2 generators 2 15.0M, 2 discriminators 2 8.3M 1 transport map 9.7M, 1 potential 22.9M [32.4M ] Total number of networks and parameters 4 networks 59.8M 4 networks 28.2M 2 networks 32.6M [42.1M ] 6 networks 7.0M 4 networks 46.6M 2 networks 32.6M [42.1M ] Table 2: Comparison of the number of hyperparameters of the optimization objectives, the number of networks and their parameters for the considered unpaired translation methods for 64 64 images. For 128 128 images. Published as a conference paper at ICLR 2023 (a) NOT (ours, W2), one-to-one. (b) Disco GAN, one-to-one. (c) Cycle GAN, one-to-one. (d) NOT (ours, W2, 2 3 ), one-to-many. (e) MUNIT, one-to-many. (f) Aug Cycle GAN Figure 12: Handbags shoes translation (64 64) by the methods in view. Published as a conference paper at ICLR 2023 (a) NOT (ours, W2), one-to-one. (b) Disco GAN, one-to-one. (c) Cycle GAN, one-to-one. (d) NOT (ours, W2, 2 3 ), one-to-many. (e) MUNIT, one-to-many. (f) Aug Cycle GAN Figure 13: Celeba (male) Celeba (female) translation (64 64) by the methods in view. Published as a conference paper at ICLR 2023 (a) Disco GAN, one-to-one. (b) Cycle GAN, one-to-one. (c) NOT (ours, W2), one-to-one. (d) Aug Cycle GAN, one-to-many. (e) MUNIT, one-to-many. (f) NOT (ours, W2, 2 3 ), one-to-many. Figure 14: Outdoor church (128 128) translation with various methods. Published as a conference paper at ICLR 2023 E EXPERIMENTAL DETAILS Pre-processing. We beforehand rescale anime face images to 512 512, and do 256 256 crop with the center located 14 pixels above the image center to get the face. Next, for all these datasets, we rescale RGB channels to [ 1, 1] and resize images to the required size (64 64 or 128 128). We do not apply any augmentations to data. Neural networks. We use WGAN-QC discriminator s Res Net architecture (Liu et al., 2019) for potential f. We use UNet (Ronneberger et al., 2015) as the stochastic transport map T(x, z). The noise z is simply an additional 4th input channel (RGBZ), i.e., the dimension of the noise equals the image size (64 64 or 128 128). We use high-dimensional Gaussian noise with axis-wise σ = 0.1. Optimization. We use the Adam optimizer (Kingma & Ba, 2014) with the default betas for both Tθ and fω. The learning rate is lr = 1 10 4. The batch size is |X| = 64. The number of inner iterations is k T = 10. When training with the weak cost (4), we sample |Zx| = 4 noise samples per each image x in batch. In toy experiments, we do 10K total iterations of fω update. In the experiments with unpaired translation, our Algorithm 1 converges in 40K iterations for most datasets. Dynamic weak cost. In M5.3, we train the algorithm with the gradually changing γ. Starting from γ = 0, we linearly increase it to the desired value ( 2 3 or 1) during 25K first iterations of fω. Stability of training. In several cases, we noted that the optimization fluctuates around the saddle points or diverges. An analogous behavior of saddle point methods for OT has been observed in (Korotin et al., 2021b). 
For the γ-weak quadratic cost (γ > 0), we sometimes experienced instabilities when the input P is notably less disperse than Q or when the parameter γ is high. Studying this behaviour and improving stability/convergence of the optimization is a promising research direction. Computational complexity. The time and memory complexity of training deterministic OT maps T(x) is comparable to that of training usual generative models for unpaired translation. Our networks converge in 1-3 days on a Tesla V100 GPU (16 GB); wall-clock times depend on the datasets and the image sizes. Training stochastic T(x, z) is harder since we sample multiple random z per x (we use |Z| = 4). Thus, we learn stochastic maps on 4 Tesla V100 GPUs. F OPTIMALITY OF SOLUTIONS FOR STRICTLY CONVEX COSTS Our Lemma 4 proves that optimal maps T are contained in the arg inf T sets of optimal potentials f but leaves the question what else may be contained in these arg inf T sets open. Our following result shows that for strictly convex costs, nothing else beside OT maps is contained there. Lemma 5 (Solutions of the maximin problem are OT maps). Let C(x, µ) be a weak cost which is strictly convex in µ. Assume that there exists at least one potential f which maximizes dual form (5). Consider any such optimal potential f arg supf inf T L(f, T). It holds that ˆT arg inf T L(f , T) ˆT is a stochastic OT map. Proof of Lemma (5). By the definition of f , we have L(f , ˆT) = sup f inf T L(f, T) = Cost(P, Q), i.e., ˆT attains the optimal cost. It remains to check that it satisfies ˆT (P S) = Q, i.e., ˆT generates Q from P. Let T be any true stochastic OT map. We denote µ x = T (x, ) S and ˆµx = ˆT(x, ) S for all x X and define µ1 x = 1 2(µ x + ˆµx). Let T 1 : X Z Y be any stochastic map which satisfies T 1(x, ) S = µ1 x for all x X (M4.1). By using the change of variables, we derive Cost(P, Q) L(f , T 1) = Z X C(x, µ1 x)d P(x) Z Y f (y)dµ1 x(y) d P(x) + Z Y f (y)Q(y). github.com/milesial/Pytorch-UNet Published as a conference paper at ICLR 2023 Since C is convex in the second argument, we have C(x, µx 1) = C x, 1 2(µ x + ˆµx) 1 2C(x, µ x) + 1 2C(x, ˆµx). (24) Since C is strictly convex, the equality in (24) is possible only when µ x = ˆµx. We also note that Z Y f (y)dµ1 x(y) = Z Y f (y)d(µ x + ˆµx) Y f (y)dµ x(y) + 1 Y f (y)dˆµx(y). We substitute these findings to L(f , T 1) and get Cost(P, Q) L(f , T 1) 1 2L(f , T )+1 2L(f , ˆT) = 1 2Cost(P, Q)+1 2Cost(P, Q) = Cost(P, Q). Thus, (23) is an equality P-almost surely for all x X and µ x = ˆµx holds P-almost surely. This means that T and ˆT generate the same distribution from P S, i.e., ˆT is a stochastic OT map. Our generic framework allows learning stochastic transport maps (Lemma 4). For strictly convex costs, all the solutions of our objective (14) are guaranteed to be stochastic OT maps (Lemma 5). In the experiments, we focus on strong and weak quadratic costs, which are not strictly convex but still provide promising performance in the downstream task of unpaired image-to-image translation (M5). Developing strictly convex costs is a promising research avenue for the future work. G RELATION TO PRIOR WORKS IN UNBALANCED OPTIMAL TRANSPORT In the context of OT, (Yang & Uhler, 2019) employ a stochastic generator to learn a transport plan π in the unbalanced OT problem (Chizat, 2017). Due to this, their optimization objective slightly resembles our objective (15). However, this similarity is deceptive. 
Unlike strong (2) or weak (3) OT, the unbalanced OT is an unconstrained problem, i.e., there is no need to satisfy π Π(P, Q). This makes unbalanced OT easier to handle: to optimize it one just has to parametrize the plan π and backprop through the loss. The challenging part with which the authors deal is the estimation of the ϕ-divergence terms in the unbalanced OT objective. These terms can be interpreted as a soft relaxation of the constraints π Π(P, Q), i.e., penalization for disobeying the constraints. The authors compute these terms by employing the variational (dual) formula from f-GAN (Nowozin et al., 2016). This yields a GAN-style optimization problem min T maxf which is similar to other problems in the generative adversarial framework. The problem we tackle is strong (2) and weak (3) OT which requires enforcing of the constraint π Π(P, Q). We reformulate the dual (weak) OT problem (5) into maximin problem (15) which can be used to recover the OT plan (via the stochastic map T). Our approach can be viewed as a hard enforcement of the constraints. Our maxf min T saddle point problem (15) is atypical for the traditional generative adversarial framework. Published as a conference paper at ICLR 2023 H ADDITIONAL EXPERIMENTAL RESULTS (a) Celeba (female) anime translation, 128 128. (b) Outdoor church, 128 128. (c) Handbags shoes, 128 128. (d) Shoes handbags, 128 128. Figure 15: Unpaired translation with OT maps (W2). Additional examples (part 1). Published as a conference paper at ICLR 2023 (a) Celeba (female) anime translation, 64 64. (b) Anime celeba (female) translation, 64 64. (c) Celeba (male) celeba (female) translation, 64 64. (d) Anime shoes translation, 64 64. Figure 16: Unpaired translation with OT maps (W2). Additional examples (part 2). Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 17: Celeba (female) anime, 128 128, stochastic. Additional examples. Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 18: Outdoor church, 128 128, stochastic. Additional examples. Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 19: Handbags shoes translation, 128 128, stochastic. Additional examples. Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 20: Anime celeba (female) translation, 64 64, stochastic. Additional examples. Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 21: Anime shoes translation, 64 64, stochastic. Additional examples. Published as a conference paper at ICLR 2023 (a) Input images x and random translated examples T(x, z). (b) Interpolation in the conditional latent space, z = (1 α)z1 + αz2. Figure 22: Shoes handbags, 64 64, stochastic. Additional examples. Published as a conference paper at ICLR 2023 I EXAMPLES WITH THE SYNCHRONIZED NOISE In this section, for handbags shoes (64 64) and outdoor church (128 128) datasets, we pick a batch of input data x1, . . . , x N P and noise z1, . . . 
, zK ∼ S to plot the N × K matrix of generated images Tθ(xn, zk). Our goal is to assess whether using the same zk for different xn leads to shared effects such as the same shape of the generated shoe or church. The resulting images are given in Figures 23 and 24. Qualitatively, we do not find any close relation between images produced with the same noise vectors for different input images xn. Figure 23: Input images x and random translated examples T(x, z). In this example, noise inputs z are synchronized between different input images x. Figure 24: Input images x and random translated examples T(x, z). In this example, noise inputs z are synchronized between different input images x.
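For reference, such an N × K grid of outputs Tθ(xn, zk) with synchronized noise can be assembled with simple broadcasting. The sketch below assumes a trained map T taking image batches and per-image noise tensors; the exact interface is an assumption for illustration.

```python
# Build an N x K grid of translations T(x_n, z_k) with the noise z_k shared
# across all inputs x_n (synchronized noise), as in Figures 23-24.
import torch

@torch.no_grad()
def synchronized_grid(T, x, z):
    """x: [N, C, H, W] inputs; z: [K, Cz, H, W] noise; returns [N, K, C, H, W]."""
    N, K = x.shape[0], z.shape[0]
    x_rep = x.repeat_interleave(K, dim=0)   # x_1,...,x_1, x_2,...,x_2, ...
    z_rep = z.repeat(N, 1, 1, 1)            # z_1,...,z_K repeated N times
    out = T(x_rep, z_rep)                   # [N * K, C, H, W]
    return out.view(N, K, *out.shape[1:])
```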