# neural_monge_map_estimation_and_its_applications__754c1390.pdf Published in Transactions on Machine Learning Research (07/2023) Neural Monge map estimation and its applications Jiaojiao Fan jiaojiaofan@gatech.edu Georgia Institute of Technology Shu Liu shuliu@math.ucla.edu University of California, Los Angeles Shaojun Ma shaojunma@gatech.edu Georgia Institute of Technology Hao-Min Zhou hmzhou@math.gatech.edu Georgia Institute of Technology Yongxin Chen yongchen@gatech.edu Georgia Institute of Technology Reviewed on Open Review: https: // openreview. net/ forum? id= 2m ZSl Qscj3 Monge map refers to the optimal transport map between two probability distributions and provides a principled approach to transform one distribution to another. Neural networkbased optimal transport map solver has gained great attention in recent years. Along this line, we present a scalable algorithm for computing the neural Monge map between two probability distributions. Our algorithm is based on a weak form of the optimal transport problem, thus it only requires samples from the marginals instead of their analytic expressions, and can be applied in large-scale settings. Furthermore, using the duality gap we prove rigorously a posteriori error analysis for the method. Our algorithm is suitable for general cost functions, compared with other existing methods for estimating Monge maps using samples, which are usually for quadratic costs. The performance of our algorithms is demonstrated through a series of experiments with both synthetic and realistic data, including text-to-image generation, class-preserving map, and image inpainting tasks. 1 Introduction The past decade has witnessed great success of optimal transport (OT) (Villani, 2008) based applications in machine learning community (Arjovsky et al., 2017; Krishnan & Martínez, 2018; Li et al., 2019; Inoue et al., 2020; Ma et al., 2020; Fan et al., 2020; Haasler et al., 2020; Alvarez-Melis & Fusi, 2020; Alvarez-Melis et al., 2021; Bunne et al., 2021; Mokrov et al., 2021; Bunne et al., 2022; Yang & Uhler, 2018; Fan et al., 2022a; Fan & Alvarez-Melis, 2022; Nguyen & Ho, 2022; Zhu et al., 2021). The Wasserstein distance induced by OT is widely used to evaluate the discrepancy between distributions thanks to its weak continuity and robustness. In this work, given any two probability distributions ρa and ρb defined on Rn and Rm, we consider the Monge problem CMonge(ρa, ρb) min T :Rn Rm, T ρa=ρb Rn c(x, T(x))ρa(x) dx. (1) Here c(x, y) denotes the cost of transporting from x to y and T is the transport map. We define the pushforward of distribution ρa by T as T ρa(E) = ρa(T 1(E)) for any measurable set E Rm. The Monge Equal contribution. Published in Transactions on Machine Learning Research (07/2023) problem seeks the cost-minimizing transport plan T from ρa to ρb. The optimal T is also known as the Monge map of (1). In this work, we propose a scalable algorithm for estimating the Wasserstein distance as well as the optimal map in continuous spaces without introducing any regularization terms. Particularly, we apply the Lagrangian multiplier directly to the Monge problem and obtain a sup-inf problem. Our contribution is summarized as follows: 1) We develop a neural network based algorithm to compute the optimal transport map associated with general transport costs between two distributions; 2) Our method is capable of computing OT problems between distributions with different dimensions. 3) We provide a rigorous a posteriori error analysis of the algorithm based on duality gaps; 4) We demonstrate its performance and its scaling properties in truly high dimensional settings through various experiments. Just like other computational OT methods, our method does not require paired data for learning the transport map. In real-life applications, the acquirement and maintenance of large-scale paired data are laborious. We will show that our proposed algorithm can be applied to multiple cutting-edge tasks, where the dominant methods still require paired data. In particular, our examples include text-to-image generation and image inpainting, which confirm that our algorithm can achieve competitive results even without paired data. Figure 1: Random image samples with size 224 224 on Laion art testing prompts. Except for the last row of real images, we show the outputs of the same decoder with different conditions. The condition of decoder includes a text encoding, which we keep unchanged across rows, and an image embedding. In the first row, we feed the decoder the text embedding, and the generated images are unrealistic because the image and the text embeddings are not interchangeable. In the second row, we pass the embedding mapped by our transport map, which is trained on unpaired data. As a baseline method, we pass an image embedding generated by the diffusion prior in DALL E2-Laion, which is trained on paired data. To explicitly show the decoder s recovery ability, we pass the real image embedding in the fourth row. Published in Transactions on Machine Learning Research (07/2023) Figure 2: Random image samples on Conceptual Captions 3M prompts. The meaning of each row is the same as in Figure 1. 2 A brief background of OT problems Consider X as a closed subset of Rn and Y as an arbitrary Polish space1. Suppose the cost function c is defined on Rn Y. We denote ρa, ρb as Borel probability distributions defined on X, Y respectively. Then the general OT problem, which can be treated as a relaxation of Monge problem (1), can be formulated as C(ρa, ρb) := inf π Π(ρa,ρb) X Y c(x, y)dπ(x, y), (2) where we define Π(ρa, ρb) as the set of joint Borel probability distributions on X Y with marginals equal to ρa and ρb. Once c is continuous and satisfies (15) and ρa is atomless, then optimal values of Monge problem and Kantorovich problem are equal to each other (Pratelli, 2007), i.e., CMonge(ρa, ρb) = C(ρa, ρb). The Kantorovich dual formulation (Villani, 2008, Ch. 5) of the primal OT problem (2) is Y ϕ(y)ρb(y) dy Z X ψ(x)ρa(x) dx , (3) where the maximization is over all ψ Cb(ρa), ϕ Cb(ρb) that satisfy ϕ(y) ψ(x) c(x, y) for any x X, y Y. For further discussions on the dual problem (3), as well as its equivalence to the primal OT problem (2) , we refer the reader to Appendix A. In general, the optimal solution π of (2) can be treated as a stochastic transport plan, i.e., we are allowed to break a single particle into pieces and then transport each 1A topological space is Polish space if it is a separable, complete, metric space. Euclidean space Rm and its closed subsets are Polish spaces. Published in Transactions on Machine Learning Research (07/2023) piece to certain positions according to the plan π . However, in this study, we mainly focus on computing the deterministic optimal map T for the Monge problem. The following theorem states the existence and uniqueness of the optimal solution to the Monge problem. It also reveals the relation between the optimal map T and the optimal transport plan π . This theorem is taken from Chapter 10 of (Villani, 2008). A more complete statement can be found in Appendix A. Theorem 1. (Informal, existence and uniqueness of Monge map) Assume the boundary X of X has dimension no larger than n 1. Suppose the cost c satisfies (15), (16), (17), (18), and assume that ρa and ρb are compactly supported and ρa is absolute continuous with respect to the Lebesgue measure on Rn. Then there exists a unique-in-law2 transport map T solving Monge problem (1). Furthermore, (Id, T ) ρa3 is an optimal solution to the OT problem (2). 3 Proposed method In order to formulate a tractable algorithm for the general Monge problem (1), we first notice that (1) is a constrained optimization problem. Thus, we introduce a Lagrange multiplier f for the constraint T ρa = ρb and reformulate (1) as a saddle point problem sup f Cb(Y) inf T M(X,Y) L(T, f). (4) Here we denote Cb(Y) as the space of bounded continuous functions on Y; and M(X, Y) as the space of measurable map T : X Y. In the following discussion, if not specified, we always treat the optimization space of f and T as Cb(Y), M(X, Y). Furthermore, we define L(T, f) as Z X c(x, T(x))ρadx + Z Y f(y)(ρb T ρa)dy = Z X [c(x, T(x)) f(T(x))] ρadx + Z Y f(y)ρbdy. (5) It is worth mentioning that if we optimize the inner problem of (4), our proposed problem (4) is also closely tied to the Kantorovich dual problem (20) of primal OT problem. Such observation will be useful on establishing several theoretical results for our method. The following theorem reveals the equivalence between the existence of solutions to primal problem (1) as well as dual problem (3) and the existence of saddle point (see Definition 3 of Appendix B). Theorem 2 (Equivalence & Consistency between optima of (1) and saddle point). Suppose X = Rn, Y = Rm; ρa, ρb are Borel probability distributions on X, Y; ρa is atomless4. Assume the OT distance C(ρa, ρb) < + and the cost c is continuous and bounded from below by a finite number c, i.e. c c. Then, (Equivalence) The following two statements are equivalent: 1. There exists at least one saddle point of L(T, f). 2. Both the primal problem (1) and the dual problem (20) admit at least one solution. (Consistency) For any optimal solution T to (1), optimal solution ϕ to (20), and saddle point ( ˆT, ˆf) of L(T, f), we have (T , ϕ ) is a saddle point of L(T, f), ˆT is an optimal solution to (1), and ˆf is an optimal solution to (20). Furthermore, L( ˆT, ˆf) = CMonge(ρa, ρb). A direct corollary of Theorem 2 guarantees the existence of the saddle point. Corollary 1 (Existence of saddle point). Suppose X = Rn, Y = Rm. Assume that ρa, ρb are compactly supported Borel probability distributions on X, Y, and ρa is absolute continuous with respect to the Lebesgue measure on Rn. Assume c satisfies (16), (17), (18), (23), and there exists a finite constant c such that c c. Then the Monge map T and the optimal solution ϕ to (20) both exist and (T , ϕ ) is a saddle point of L(T, f). 2This means if T is another Monge map solving the problem (1), then T = T on Spt(ρa) \ E0, where Spt(ρa) is the support of ρa and E0 is a zero measure set. 3This is the joint distribution on X Y obtained by pushing forward ρa via the map (Id, T ) : X X Y. 4This indicates that no single point in X carries a positive mass of ρa. Published in Transactions on Machine Learning Research (07/2023) The existence of the saddle point of L(T, f) is a rather strong assumption and may not always hold (Zhang et al., 2022). We next provide some further assertions only focusing on the existence of sup-inf solution to (4) instead of the existence of saddle point. Theorem 3 (Consistency of sup-inf solution). Assume the cost function is continuous and satisfies the condition (15). Suppose the sup-inf problem (4) admits at least one solution ( T, f). Under the condition T ρa = ρb, f is an optimal solution to Kantorovich dual (20), then T is optimal solution to Monge problem (1); and furthermore, L( T, f) = CMonge(ρa, ρb). It is hard to theoretically guarantee the condition T ρa = ρb for general cost. Empirically, we verify our method can achieve T ρa ρb qualitatively (Figure 4, 5(b), 6, 7) and quantitatively (Figure 4(b), Table 1, 2). More results on consistency without T ρa = ρb are presented in Appendix B.3. In the implementation, we parametrize both the map T and the dual variable f by the neural networks Tθ, fη, with θ, η being the parameters. Consequently, our goal is to solve the following saddle point problem max η min θ L(Tθ, fη) := 1 k=1 c(Xk, Tθ(Xk)) fη(Tθ(Xk)) + fη(Yk) (6) where N is the size of the datasets and {Xk}, {Yk} are samples drawn by ρa and ρb separately. The algorithm is summarized in Algorithm 1. The computational complexity of our algorithm is similar with GAN-type methods. In particular, Algorithm 1 requires O(K(K1 + K2)B) operations in total. Algorithm 1 Computing optimal Monge map from ρa to ρb 1: Input: Marginal distributions ρa and ρb, Batch size B, Cost function c(x, y). 2: Initialize Tθ, fη. 3: for K steps do 4: Sample {Xk}B k=1 ρa. Sample {Yk}B k=1 ρb. 5: Update θ to decrease (6) for K1 steps. 6: Update η to increase (6) for K2 steps. 7: end for 8: Output: The transport map Tθ. Mapping between unequal dimensions Our method is still applicable when c(x, y) is defined on Rn Rm and n = m. A typical example is c(x, y) = c(Q(x), y) (Pass, 2010; Mc Cann & Pass, 2020), where c is defined on Rm Rm and Q(x) is an embedding function. To name a few, Q(x) can be the upsampling function (Rout et al., 2022), or padding zero function (see Figure 8) as n < m, and a linear projection (Section 6.1) as n > m. Comparison with GAN It is worth pointing out that our method and the WGAN (Arjovsky et al., 2017) are similar in the sense that they are both carrying out minimization over the generator/transport map and maximization over the discriminator/dual potential. However, there are two main distinctions, which are summarized in the following. More detailed discussions are postponed to Appendix C. (Purpose: arbitrary map vs optimal map) The purpose of WGAN is to find an arbitrary map T such that T ρa is close to the target distribution ρb; On the other hand, the purpose of our method is two-fold: we not only compute for the map T that pushforwards ρa to ρb, but also guarantee the optimality of T in the sense of minimizing the total transportation cost EX ρac(X, T(X)). (Designing logic: minimizing distance vs computing distance itself) In WGAN, one aims at minimizing the OT distance as the discrepancy between T ρa and the target distribution ρb, the optimal value of the corresponding loss function thus equals to 0; In contrast, our proposed method computes OT distance (as well as the transport map) between source distribution ρa and the target ρb. In this case, the optimal value of the associated loss function equals CMonge(ρa, ρb). 4 Error Estimation via Duality Gaps Published in Transactions on Machine Learning Research (07/2023) In this section, we assume X = Y = Rd, i.e. we consider Monge problem on Rd. Suppose we solve (4) to a certain stage and obtain the pair (T, f), inspired by Hütter & Rigollet (2020) and Makkuva et al. (2020), we provide an a posteriori estimate to a weighted L2 error between our computed map T and the optimal Monge map T . Before we present our result, we need the following assumptions. Assumption 1 (on cost c( , )). We assume c C2(Rd Rd) is bounded from below. Furthermore, for any x, y Rd, we assume xc(x, y) is injective w.r.t. y; 2 xyc(x, y), as a d d matrix, is invertible; and 2 yyc(x, y) is independent of x. Assumption 2 (on marginals ρa, ρb). ρa, ρb are compactly supported on Rd, and ρa has density. Assumption 3 (on dual variable f). Assume the dual variable f C2(Rd) is always taken from c-concave functions, i.e., there exists certain φ C2(Rd) such that f( ) = infx{φ(x) + c(x, )} (see Definition 5.7 of Villani (2008)). Furthermore, we assume that there exists at least one minimizer xy argminx{φ(x) + c(x, y)} for any y Rd. And the Hessian of φ( ) + c( , y) at xy is positive definite. Assumption 4 (Infimum over T achieves its minimizer). Assume that there exists measurable Tf M(Rd, Rd) such that Tf argmin T {L( T, f)}. i.e., Tf(x) argminy{f(y) c(x, y)} for almost every y Rd. For the sake of conciseness, let us introduce two notations. We denote σ(x, y) = σmin( 2 xyc(x, y)) > 0 as the minimum singular value of 2 xyc(x, y); and denote λ(y) = λmax( 2 xx(φ(x) + c(x, y))|x=xy) > 0 (7) as the maximum eigenvalue of the Hessian of φ( ) + c( , y) at xy. We denote the duality gaps as E1(T, f) = L(T, f) inf e T L( e T, f), E2(f) = sup e f inf e T L( e T, ef) inf e T L( e T, f). We can verify that the conditions mentioned in Theorem 1 are satisfied under Assumption 1 and Assumption 2. Thus the Monge map T to the Monge problem (1) exists and is unique. We now state the main theorem on error estimation: Theorem 4 (A posteriori Error Estimation via Duality Gaps). Suppose Assumption 1, 2, 3 and 4 hold. Let us further assume the sup-inf problem (4) admits a solution ( f, T) that is consistent with the Monge problem, i.e. T equals T , ρa almost surely. Then there exists a strict positive weight function β(x) > min y Rm such that the weighted L2 error between computed map T and the Monge map T is upper bounded by T T L2(βρa) p 2(E1(T, f) + E2(f)). (8) The exact formulation of β is provided in (43) in the appendix B.4. Remark 1. The cost c(x, y) = 1 2 x y 2 or c(x, y) = x y satisfy the conditions mentioned above. More specifically, when c(x, y) = x y, we have σ(x, y) = 1, and λ(y) = 1 λmin( 2f(y)), which recovers Theorem 3.6 in Makkuva et al. (2020). Indeed, in this case, our f, λ correspond to f, 1 α in Makkuva et al. (2020). Theorem 4 provides an a posteriori error bound, i.e., for any computed (T, f), we provide an upper bound (8) on the accuracy of T. The weight function β( ) relies on f, and (T, f) could be arbitrary computed pair, thus it is almost impossible to provide a uniform lower bound for β( ), which is also dicussed in Makkuva et al. (2020, Remark 3.7). Further analysis on λ( ) and β( ) can be found in Appendix B.5. We clarify some limitations of our Theorem 4 below. (Hardness on evaluating E1, E2) It could be intractable to calculate E2 exactly due to the term supe f inf e T L( e T, ef), but one can approximate them by empirical values using samples after convergence. (Restricted form for analytic cost) Suppose that the cost c satisfies Assumption 1 and is also analytic, then c is restricted to the specific form A(x)+F(x) y +B(y), where A, B are analytical functions on Rd, and F : Rd Rd is analytic with invertible Jacobian DF(x) at any x Rd. Some further discussion on enforcing the c-concavity of the dual variable f is presented in Appendix B.4. Published in Transactions on Machine Learning Research (07/2023) 5 Related work Discrete OT methods (Courty et al., 2016; Feydy et al., 2020; Pooladian & Niles-Weed, 2021; Meng et al., 2019) solve the Kantorovich formulation with EMD (Nash, 2000) or Sinkhorn algorithm (Cuturi, 2013). It normally computes an optimal coupling of two empirical distributions. Such type of treatment has been widely accepted since it is friendly to high dimensional cases (Altschuler et al., 2017; Genevay et al., 2018; Li et al., 2019; Xie et al., 2020), but the algorithm does not scale well to a large number of samples and is not suitable to handle continuous probability measures. In addition, they cannot map the unseen samples. With an additional parameterization of a mapping function, such as linear or kernel function space (Perrot et al., 2016), they can map the unseen data. However, this map is not the solution to Monge problem but only an approximation of the optimal coupling (Courty et al., 2017, Sec. 2.1). Perrot et al. (2016) requires solving the matrix inverse when updating the transformation map, thus introducing a risk of instability. Their function space of the map is also less expressive than neural networks, which makes it difficult to solve the OT problem with complex transport costs and high dimension data. And they still can not handle large scale datasets. With the rise of neural networks, neural OT has become popular with an advantage of dealing with continuous measure. Seguy et al. (2017); Genevay et al. (2016) solve the regularized OT problem, and as such introduce the bias. Another branch of work (Makkuva et al., 2020; Korotin et al., 2021b; Fan et al., 2020; Korotin et al., 2021c) comes with parameterizing Brenier potential by Input Convex Neural Network (ICNN) (Amos et al., 2017). The notable work (Makkuva et al., 2020) is special case of our formula. In particular, their dual formula reads suph infg R [ x, g(x) h( g(x)) ]ρadx R h(y)ρbdy, where g, h are ICNN. After a change of variable g = T, h = f + 2/2, it becomes the same as our (5) equipped with the cost c(x, y) = x y 2/2. Later, the ICNN parameterization was shown to be less suitable (Korotin et al., 2021a; Fan et al., 2022b; Korotin et al., 2022a) for large scale problems because ICNN is not expressive enough. Recently, several works have illustrated the scalability of our dual formula with different realizations of the transportation costs. When c(x, y) = x y 2 2/2, our method directly boils down to reversed maximum solver (Nhan Dam et al., 2019), which appears to be the best neural OT solvers among multiple baselines (Korotin et al., 2021a). Another variant (Rout et al., 2022) of our work obtains comparable performance in image generative models, which asserts the efficacy in unequal dimension tasks. Gazdieva et al. (2022) utilizes the same dual formula with more diverse costs in image super-resolution task. Based on the similar dual formula, several works propose to add random noise as an additional input to make the map stochastic, i.e. one-to-many mapping, so it can approximate the Kantorovich OT plan. To make the learning of stochastic map valid, Korotin et al. (2022b) extends the transport cost c(x, y) to be weak cost that depends on the mapped distribution. Typically, they involve the variance in the weak cost to enforce diversity. The stochastic map is, however, not suitable for our problem (1) because of the conditional collapse behavior (Korotin et al., 2022b, Sec 5.1). Later, Asadulaev et al. (2022) extends the dual formula to more general cost functional. We note that the term general cost in Asadulaev et al. (2022) is different from general cost in our paper. They propose to solve the General OT problem infπ Π(ρa,ρb) F(π) through the dual formula sup f inf π Π(ρa){F(π) Z Y f(y)dπ(y) + Z Y f(y)dρb(y)}. If F(π) = R c(x, y)dπ(x, y) and assuming optimal π is of the form π = (Id, T ) ρa, then their formula boils down to our (5). However, the assumption π = (Id, T ) ρa may not always hold. 6 Experiments In this section, we specialize (5) with different general costs c(x, y) to fit in various applications. We will not focus on the quadratic cost c(x, y) = x y 2 2 since our formula in this special case has been extensively studied in multiple scenarios, e.g. generative model (Rout et al., 2022), super-resolution (Gazdieva et al., 2022), and style transfer (Korotin et al., 2022b). Instead, we focus on data with specific structure, which will be revealed in the cost function. Due to a wide span of tasks, it is unrealistic to find universal baselines in this section. We will select the proper baselines according to the scale of the examples. Published in Transactions on Machine Learning Research (07/2023) 6.1 Unpaired text to image generation We consider the task of generating images given text prompts. Existing successful text-to-image generation algorithms (Ramesh et al., 2021; 2022; Saharia et al., 2022) are supervised learning methods, and as such rely on paired data. In real world, it can be exhausting to maintain large-scale paired datasets given that they are mostly web crawling data, and the validity period of web data is limited. We intend to learn a map between the text and image embedding space of the CLIP model without paired data. Our framework is shown in Figure 3. We use the same unpaired data generation scheme as Rout et al. (2022). (a) Pipeline motivated by DALL E2. (b) Unpaired data generation process Figure 3: (a) During the training process (above the dotted line), our map learns to generate image embeddings that maximize the expected similarity with input texts. During the evaluation (below the dotted line), given a text encoding from the CLIP model, our map outputs an image embedding, which conditions a pre-trained diffusion decoder to generate an image. The CLIP encoder and diffusion decoder are both pre-trained and frozen. (b) We split a paired text-image dataset training into two parts: S and T , and take the texts from S and images from T as our training data. Finally, we use the paired test data to compare with the real images. In our algorithm, the sample from source distribution ρa is the text encoding x R77 768 (x = 0) from CLIP (Radford et al., 2021) model, and the sample from target distribution ρb is the image embedding y R768, which is normalized to have y 2 = 1. The CLIP model is pre-trained on 400M (text, image) pairs of web data with contrastive loss, such that the paired texts and images share a large cosine similarity, and the non-paired ones present a small similarity. As a result, we choose the transport cost as the negative cosine similarity between Rx and y: c(x, y) = Rx, y Rx 2 y 2 = Rx, y The frozen matrix R : R77 768 R768 is extracted from a linear layer of CLIP model and it projects the text encoding x to the same dimension as image embedding y. The normalized projected vector Rx/ Rx is called text embedding. Similar to DALL E2 (Ramesh et al., 2021; 2022), we use text encoding instead of text embedding as the input data because the encoding contains more information. The cost c(x, y) would enforce our map to generate an image embedding relevant to the input text. We evaluate our algorithm on two datasets: Laion art and Conceptual captions 3M (CC-3M). Laion art dataset is filtered from a 5 billion dataset to have the high aesthetic level, which is suitable for the learning of image generation, while CC-3M is not curated. We use the Open AI s CLIP (Vi T-L/14) model and a DALL E2 diffusion decoder. The training of their diffusion decoder requires paired data and its train dataset includes Laion art but not CC-3M. Therefore, our results on CC-3M are heavily based on the unpaired data because no part in Figure 3 has seen paired data of CC-3M, while on Laion art, the diffusion decoder has some paired data information gained from pre-training. We show qualitative samples in Figure 1 and 2, where we compare with DALL E2-Laion, a DALL E2 model pre-trained on the paired Laion aesthetic dataset, which includes Laion art dataset. Figure 1 confirms that our transport map can generate image embeddings with comparable quality even without paired data. Since CC-3M is a zero-shot dataset for DALL E2-Laion, the performance of DALL E2-Laion clearly drops in Published in Transactions on Machine Learning Research (07/2023) Figure 2, while our model still generates reasonable images. We visually verify our map achieves T ρa ρb by comparing the generated embeddings with the real image embeddings (on a test dataset). In Figure 4 (a), we present the results of our method, along with the results of a non-linear kernel map given by Perrot et al. (2016). While the latter method struggles to generalize to unseen data, our method is able to generate a distribution that closely matches the ground truth. Figure 4: (a) Comparison between the Laion art target image embeddings and fitted measure of generated embeddings by our method and Perrot et al. (2016). Samples are projected onto 2D plane by PCA. (b) Averaged cosine similarity between the generated image embeddings and (Left) the ground truth image embeddings or (Right) the unrelated text embeddings. The similarity on the left depicts how well the model recovers the real image embedding, and the similarity on the right quantifies the overfitting behavior (the lower the better). We also quantitatively compare with the baseline in terms of cosine similarity in Figure 4 (b). Note that we also normalize T(x), so T(x) = y cos-similarity(T(x), y) = 1. Our method achieves higher cosine similarity w.r.t. the real images on Laion art dataset. In practice, we observe that the relevance of each (text, image) pair in CC-3M is more noisy than Laion art, which makes the model more difficult to learn the real image embedding. This explains why our cosine similarity w.r.t real image embedding on CC-3M can not converge to the same level as Laion art. In the meantime, our overfitting level is low on both datasets. 6.2 Class-preserving map We consider the class-preserving map between two labelled datasets. The input distribution ρa = PJ j=1 ajρj a is a mixture of J distinct distributions {ρj a}J j=1. Similarly, the target distribution is ρb = PJ j=1 bjρj b, where {aj} and {bj} satisfy P j aj = 1 and P j bj = 1. Each distribution ρj a (or ρj b) is associated with a known label/class j. We further assume that the support of {ρj a}j (or {ρj b}j) are disjoint with respect to j. We seek a map that solves the problem min T ρj a=ρj b Rn x T(x) 2ρa(x) dx, j (10) where the constraint asks the map to preserve the original class. To approximate this map, we replace the constraint by a contrastive penalty. This trick transfers (10) to the original Monge problem with a cost c({x, y}, {x , y }) = x x 2 + 1(y = y ), (11) where y and y are the labels corresponding to x and x , 1 is the indicator function. To better guide the mapping, we involve the label as an input to the mapping T and the potential f. Denote ℓ( ) : Rm RJ as a pre-trained classifier on the target domain ρb. Given a feature from the target domain, ℓ( ) will output a probability vector. Then, our full formula reads sup f inf T Z x x 2 y y f( x; y)) dρa + Z f(x ; y )dρb, where x = T(x; y), y = ℓ(T(x; y)). In practice, y RJ is a one-hot label and it is known because both the source and target datasets are labeled. Published in Transactions on Machine Learning Research (07/2023) Figure 5: (a) Our cost is composed of the ordinary cost x x 2 and the contrastive label cost y y. (b) Class-preserving mapping by our conditional map. The top block is the source images, and the bottom block is the pushforward images. Each column represents a class. Left: F M, Right: M K. Table 1: Class-preserving map result: accuracy of the maps and FID of generated samples. Our map is deterministic. The map by Asadulaev et al. (2022) is non-conditional on the label. M, K, F represent MNIST, KMNIST, FMNIST correspondingly. Our accuracy result is significantly better than Asadulaev et al. (2022) thanks to the guidance of contrastive cost (11). Asadulaev et al. (2022) deterministic map Asadulaev et al. (2022) stochastic map Ours with conditional map Ours with non-conditional map FID M K 17.26 9.69 18.07 17.45 F M 7.15 5.26 5.78 6.86 Accuracy M K 79.20 61.91 99.59 93.02 F M 82.97 83.22 99.73 98.93 Recent two works (Asadulaev et al., 2022; Bunne et al., 2022) are along this line of class-guided OT maps. The former one learns an unconditional map T(x), i.e. no need of label information during the inference, and the latter one learns a conditional map T(x; y). Our method is different from both of them because their costs do not explicitly involve the label, while we include the label in the contrastive cost (11). Our method is designed to learn a conditional map, but we can also fit an unconditional map to the learned conditional map for the convenience of inference. We compare our algorithm with Asadulaev et al. (2022) on NIST (Le Cun & Cortes, 2005; Xiao et al., 2017; Clanuwat et al., 2018) datasets, where we use default labels in Py Torch. The visualization of our mapping is presented in Figure 5 (b). We also calculate FID (Heusel et al., 2017) between the mapped source test dataset and the target test dataset. To quantify how well the map preserves the original class, we use a pre-trained Spinal Net classifier (Kabir et al., 2022) on the target domain and evaluate the accuracy of the predicted label. If the predicted label of T(x; y) equals to the original label y, then the mapping is correct, otherwise wrong. We show the quantitative result in Table 1, where the results of Asadulaev et al. (2022) are from their paper. Since Asadulaev et al. (2022) makes inference using an unconditional map, to make a fair comparison, we present the results with a fitted unconditional map as well in Table 1. The stochastic map in Asadulaev et al. (2022) leads to a lower FID because the loss function for the stochastic map incorporates the "variance" of the pushforward distribution, which encourages the generation of more diverse data. However, because the stochastic map requires random noise as input, it requires a more sophisticated network architecture and is more computationally expensive to train. 6.3 Unpaired image inpainting In this section, we show the effectiveness of our method on the inpainting task with random rectangle masks. We take the distribution of occluded images to be ρa and the distribution of the full images to be ρb. In many inpainting works, it s assumed that an unlimited amount of paired training data is accessible (Zeng Published in Transactions on Machine Learning Research (07/2023) et al., 2021). However, some real-world applications do not involve the paired datasets. Accordingly, we consider the unpaired inpainting task. The training and test data are generated in the same way as Figure 3 (b). We choose cost function to be mean squared error (MSE) in the unmasked area c(x, y) = α x M(x) y M(x) 2 2 n , (12) where M(x), depending on x, is a binary mask with the same size as the image. M takes the value 1 in the unoccluded region, and 0 in the unknown/missing region. represents the point-wise multiplication, α is a tunable coefficient, and n is dimension of x. Different from Rout et al. (2022), where c(x, y) = x y 2/n, we include the mask in the cost to avoid the penalty in the masked region. Otherwise, it penalizes the generation of non-zero pixels because the pixel values are zeros in the masked region of source image. The map satisfying constraints T ρa = ρb and T(x) M(T(x)) = x M(x) is the optimal map. (a) Masked images (b) Real images (c) Perrot et al. (2016) (d) Our T(x) Figure 6: Unpaired image inpainting on test dataset of Celeb A 64 64. We conduct the experiments on Celeb A 64 64 and 128 128 datasets (Liu et al., 2015). The input images (ρa) are occluded by randomly positioned square masks. Each of the source ρa and target ρb distributions contains 80k images. We present the empirical results of inpainting in Figure 6 and 15. In Figure 6, we compare with the discrete OT method Perrot et al. (2016) since it also accepts general costs. It fits a map that approximates the barycentric projection so it can also provide out-of-sample mapping, although their map is not the solution to our Monge problem. We use 1000 unpaired samples for Perrot et al. (2016) to learn the optimal coupling. The comparison in Figure 6 confirms that our recovered images have much better quality than the discrete OT method. We also evaluate FID of the generated composite images w.r.t. the original images on the test dataset. We use 40k images and compute the score with the implementation provided by Obukhov et al. (2020). We choose transport map with L2 cost and WGAN-GP (Gulrajani et al., 2017) as the baselines and show the comparison results in Table 2. Our essential difference with transport map with L2 cost is that the latter cost does not involve masks, which jeopardizes the performance when the position of masks are changeable. Table 2: Quantitative evaluation results on Celeb A 64 64 test dataset. Cost (12) Cost α x y 2/n WGAN-GP α = 10 α = 104 FID 9.2857 3.7109 8.9398 6.7479 Theoretically, the change of α should not affect the OT map because f should be able to scale arbitrarily. However, in practice, due to the self-regularization of neural networks (Martin & Mahoney, 2021), f can not scale arbitrarily with α. As a result, modifying α will change the loss landscape. During the training, we observe that the magnitude ratio r = |E[c(x, T(x))]|/E[f(T(x))]| decreases from 18 to 0.2 when α drops from 104 to 10. We also observe the map learned with a larger α can generate more realistic images with natural transition in the mask border and exhibit more details on the face (see Figure 14 in the appendix). In our experiments, we observe that r > 10 yields reasonable results. Published in Transactions on Machine Learning Research (07/2023) 6.4 Population transportation on the sphere Motivated by the recent spherical transportation works (Cohen et al., 2021; Amos et al., 2022), we consider the following population transport problem on the sphere for testing our proposed method. It is wellknown that the population on earth is not distributed uniformly over the land due to various factors such as landscape, climate, temperature, economy, etc. Suppose we ignore all these factors, and we would like to design an optimal transport plan under which the current population travels along the earth s surface to form a rather uniform (in spherical coordinate) distribution over the earth s landmass. We treat the earth as an ideal sphere with radius 1, then we are able to formulate such problem as a Monge problem defined on D = [0, 2π) [0, π] with certain cost function c. To be more specific, we consider the spherical coordinate (θ, ϕ) on D, which corresponds to the point (cos θ sin ϕ, sin θ sin ϕ, cos ϕ) on the sphere. For any (θ1, ϕ1), (θ2, ϕ2) D, we set the distance c( , ) function as the geodesic distance on sphere, which is formulated as c((θ1, ϕ1), (θ2, ϕ2)) = arccos(sin ϕ1 sin ϕ2 cos(θ1 θ2) + cos ϕ1 cos ϕ2). (13) Then assume the current population distribution over the sphere is denoted as ρa (in Cartesian coordinate), which introduces the corresponding distribution ρSph a (in sperical coordinate) on D. Denote Dland D as the region of the land in the spherical coordinate system. We set the target distribution ρSph b as the uniform distribution supported on Dland. We aim at solving the following Monge problem min T :T ρSph a =ρSph b D c((θ, ϕ), T(θ, ϕ))ρSph a dθdϕ . The samples used in our experiment are generated from the licensed dataset from Doxsey-Whitfield et al. (2015). In our exact implementation, in order to avoid the explosion of the gradient of arccos( ) near 1, we replace the cost c with its linearization (In our computation, the constant π 2 can be omitted.) bc((θ1, ϕ1), (θ2, ϕ2)) = π 2 (sin ϕ1 sin ϕ2 cos(θ1 θ2) + cos ϕ1 cos ϕ2). (14) Furthermore, to guarantee that each sample point is transported onto landmass, we composite the trained map T with a map τ : D D that remains samples on land unchanged but maps samples on sea back to the closest location on land among randomly selected sites. In Figure 7, we compare our result with the linear transformation method introduced in Perrot et al. (2016). For more general kernel map, such as Gaussian kernel, their algorithm is not very stable and it is very difficult to obtain valid results. Although the ground truth for this example is unknown to us, we refer the reader to Appendix D for more synthetic examples with ground truth such as distributions with unequal dimensions, cost of decreasing functions, and Monge map on sphere. Figure 7: We show our result on the left and Perrot et al. (2016) on the right. Each figure plots samples from the source distribution ρSph a (blue) and samples from the pushforward distribution Tθ ρSph a (green). We also demonstrate the computed transport map (orange) for the first 500 randomly generated points from source ρSph a to target ρSph b . Published in Transactions on Machine Learning Research (07/2023) 7 Conclusion and Discussion In this paper, we present a novel method to compute the Monge map between two given distributions with flexible transport cost functions. In particular, we consider applying Lagrange multipliers on the Monge problem, which leads to a sup-inf saddle point problem. By further introducing neural networks into our optimization, we obtain a scalable algorithm that can handle most general costs. Broader impact Our method will become a useful tool for machine learning applications such as domain adaption, and image restoration that requires transforming data distributions. Compared to other deep learning methods, our scheme reduces the need for paired data; compared to other OT methods, our scheme can be applied to high dimensional large-scale datasets, and we do not involve regularization which could hurt the quality of transformed data. In the future, it is promising to specialize our method to structured data, such as time series, graphs, point clouds, and SPD manifolds (Ju & Guan, 2022). Limitation Firstly, We observe that when the target distribution is a discrete distribution with fixed support, our method tends to be more unstable. We conjecture that this is because the network parameterization of the discriminator is too complex for this type of uncomplicated target distribution. Secondly, there could be a gap between the theory and some of our experiments. We make a summary table in Section F, which demonstrates that for some experiments, the existence and uniqueness of the Monge map are not guaranteed. It is either because the space is not regular or the assumptions in Theorem 1 are not fully satisfied. We leave the theoretical study for those examples as the future direction. SN Afriat. Theory of maxima and the method of lagrange. SIAM Journal on Applied Mathematics, 20(3): 343 357, 1971. (Cited on page 25.) J. Altschuler, J. Niles-Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. Advances in neural information processing systems, pp. 1964 1974, 2017. (Cited on page 7.) David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428 21439, 2020. (Cited on page 1.) David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. ar Xiv preprint ar Xiv:2106.00774, 2021. (Cited on page 1.) B. Amos, L. Xu, and J.Z. Kolter. Input convex neural networks. In International Conference on Machine Learning, pp. 146 155, 2017. (Cited on page 7.) Brandon Amos, Samuel Cohen, Giulia Luise, and Ievgen Redko. Meta optimal transport, 2022. URL https://arxiv.org/abs/2206.05262. (Cited on page 12.) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. In ar Xiv preprint ar Xiv:1701.07875, 2017. (Cited on pages 1, 5, and 29.) Arip Asadulaev, Alexander Korotin, Vage Egiazarian, and Evgeny Burnaev. Neural optimal transport with general cost functionals. ar Xiv preprint ar Xiv:2205.15403, 2022. (Cited on pages 7 and 10.) Charlotte Bunne, Laetitia Meng-Papaxanthos, Andreas Krause, and Marco Cuturi. Jkonet: Proximal optimal transport modeling of population dynamics. ar Xiv preprint ar Xiv:2106.06345, 2021. (Cited on page 1.) Charlotte Bunne, Andreas Krause, and Marco Cuturi. Supervised training of conditional monge maps. ar Xiv preprint ar Xiv:2206.14262, 2022. (Cited on pages 1 and 10.) Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. ar Xiv preprint ar Xiv:1812.01718, 2018. (Cited on page 10.) Published in Transactions on Machine Learning Research (07/2023) Samuel Cohen, Brandon Amos, and Yaron Lipman. Riemannian convex potential maps. In International Conference on Machine Learning, pp. 2028 2038. PMLR, 2021. (Cited on page 12.) N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853 1865, 2016. (Cited on page 7.) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017. (Cited on page 7.) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Neural Information Processing Systems, 2013. (Cited on page 7.) Erin Doxsey-Whitfield, Kytt Mac Manus, Susana B. Adamo, Linda Pistolesi, John Squires, Olena Borkovska, and Sandra R. Baptista. Taking advantage of the improved availability of census data: A first look at the gridded population of the world, version 4. Papers in Applied Geography, 1(3):226 234, 2015. doi: 10.1080/23754931.2015.1014272. URL https://doi.org/10.1080/23754931.2015.1014272. (Cited on pages 12 and 32.) J. Fan, A. Taghvaei, and Y. Chen. Scalable computations of wasserstein barycenter via input convex neural networks. ar Xiv preprint ar Xiv:2007.04462, 2020. (Cited on pages 1 and 7.) Jiaojiao Fan and David Alvarez-Melis. Generating synthetic datasets by interpolating along generalized geodesics. In Neur IPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022. (Cited on page 1.) Jiaojiao Fan, Isabel Haasler, Johan Karlsson, and Yongxin Chen. On the complexity of the optimal transport problem with graph-structured cost. In International Conference on Artificial Intelligence and Statistics, pp. 9147 9165. PMLR, 2022a. (Cited on page 1.) Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 6185 6215. PMLR, 17 23 Jul 2022b. URL https://proceedings. mlr.press/v162/fan22d.html. (Cited on pages 7 and 27.) Jean Feydy, Alexis Glaunès, Benjamin Charlier, and Michael Bronstein. Fast geometric learning with symbolic matrices. Advances in Neural Information Processing Systems, 33:14448 14462, 2020. (Cited on page 7.) Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/ v22/20-451.html. (Cited on pages 33 and 36.) Milena Gazdieva, Litu Rout, Alexander Korotin, Alexander Filippov, and Evgeny Burnaev. Unpaired image super-resolution with optimal transport maps. ar Xiv preprint ar Xiv:2202.01116, 2022. (Cited on pages 7, 21, 22, and 29.) A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in neural information processing systems, pp. 3440 3448, 2016. (Cited on page 7.) A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with sinkhorn divergences. International Conference on Artificial Intelligence and Statistics, pp. 1608 1617, 2018. (Cited on page 7.) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017. (Cited on page 11.) Published in Transactions on Machine Learning Research (07/2023) I. Haasler, R. Singh, Q. Zhang, J. Karlsson, and Y. Chen. Multi-marginal optimal transport and probabilistic graphical models. In ar Xiv preprint ar Xiv:2006.14113, 2020. (Cited on page 1.) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. (Cited on page 10.) Jan-Christian Hütter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps, 2020. (Cited on pages 6 and 20.) D. Inoue, Y. Ito, and H. Yoshida. Optimal transport-based coverage control for swarm robot systems: Generalization of the voronoi tessellation-based method. IEEE Control Systems Letters, 5(4):1483 1488, 2020. (Cited on page 1.) Ce Ju and Cuntai Guan. Deep optimal transport on spd manifolds for domain adaptation. ar Xiv preprint ar Xiv:2201.05745, 2022. (Cited on page 13.) HM Dipu Kabir, Moloud Abdar, Abbas Khosravi, Seyed Mohammad Jafar Jalali, Amir F Atiya, Saeid Nahavandi, and Dipti Srinivasan. Spinalnet: Deep neural network with gradual input. IEEE Transactions on Artificial Intelligence, 2022. (Cited on page 10.) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. (Cited on pages 33 and 36.) A. Korotin, L. Li, A. Genevay, J. Solomon, A. Filippov, and E. Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. In ar Xiv preprint ar Xiv:2106.01954, 2021a. (Cited on page 7.) Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein2 generative networks. In International Conference on Learning Representations, 2021b. URL https: //openreview.net/forum?id=b Eoxz W_EXsa. (Cited on page 7.) Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous wasserstein-2 barycenter estimation without minimax optimization. ar Xiv preprint ar Xiv:2102.01752, 2021c. (Cited on page 7.) Alexander Korotin, Vage Egiazarian, Lingxiao Li, and Evgeny Burnaev. Wasserstein iterative networks for barycenter estimation. ar Xiv preprint ar Xiv:2201.12245, 2022a. (Cited on pages 7 and 27.) Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. ar Xiv preprint ar Xiv:2201.12220, 2022b. (Cited on pages 7 and 21.) V. Krishnan and S. Martínez. Distributed optimal transport for the deployment of swarms. IEEE Conference on Decision and Control (CDC), pp. 4583 4588, 2018. (Cited on page 1.) Yann Le Cun and Corinna Cortes. The mnist database of handwritten digits. 2005. (Cited on page 10.) R. Li, X. Ye, H. Zhou, and H. Zha. Learning to match via inverse optimal transport. In J. Mach. Learn. Res., pp. 20, pp.80 1, 2019. (Cited on pages 1 and 7.) Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein gan with quadratic transport cost. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4832 4841, 2019. (Cited on page 36.) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730 3738, 2015. (Cited on page 11.) S. Ma, S. Liu, H. Zha, and H. Zhou. Learning stochastic behaviour of aggregate data. In ar Xiv preprint ar Xiv:2002.03513, 2020. (Cited on page 1.) Published in Transactions on Machine Learning Research (07/2023) A. Makkuva, A.and Taghvaei, S. Oh, and J. Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672 6681, 2020. (Cited on pages 6 and 7.) Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps, 2022. (Cited on page 20.) Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. J. Mach. Learn. Res., 22(165):1 73, 2021. (Cited on page 11.) Robert J Mc Cann and Brendan Pass. Optimal transportation between unequal dimensions. Archive for Rational Mechanics and Analysis, 238(3):1475 1520, 2020. (Cited on page 5.) Cheng Meng, Yuan Ke, Jingyi Zhang, Mengrui Zhang, Wenxuan Zhong, and Ping Ma. Large-scale optimal transport map estimation using projection pursuit. Advances in Neural Information Processing Systems, 32, 2019. (Cited on page 7.) Takeru Miyato and Masanori Koyama. cgans with projection discriminator. ar Xiv preprint ar Xiv:1802.05637, 2018. (Cited on page 36.) Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, and Evgeny Burnaev. Largescale wasserstein gradient flows. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=nl Lj Iu Hs MHp. (Cited on page 1.) John C Nash. The (dantzig) simplex method for linear programming. Computing in Science & Engineering, 2(1):29 31, 2000. (Cited on page 7.) Khai Nguyen and Nhat Ho. Amortized projection optimization for sliced wasserstein generative models. ar Xiv preprint ar Xiv:2203.13417, 2022. (Cited on page 1.) Quan Hoang Nhan Dam, Trung Le, Tu Dinh Nguyen, Hung Bui, and Dinh Phung. Threeplayer wasserstein gan via amortised duality. In Proc. of the 28th Int. Joint Conf. on Artificial Intelligence (IJCAI), pp. 1 11, 2019. (Cited on page 7.) Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. URL https://github.com/ toshas/torch-fidelity. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. (Cited on page 11.) Brendan Pass. Regularity of optimal transportation between spaces with different dimensions. ar Xiv preprint ar Xiv:1008.1544, 2010. (Cited on page 5.) Michaël Perrot, Nicolas Courty, Rémi Flamary, and Amaury Habrard. Mapping estimation for discrete optimal transport. Advances in Neural Information Processing Systems, 29, 2016. (Cited on pages 7, 9, 11, 12, and 36.) Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps. ar Xiv preprint ar Xiv:2109.12004, 2021. (Cited on page 7.) Aldo Pratelli. On the equality between monge s infimum and kantorovich s minimum in optimal mass transportation. Annales de l Institut Henri Poincare (B) Probability and Statistics, 43(1):1 13, 2007. ISSN 0246-0203. doi: https://doi.org/10.1016/j.anihpb.2005.12.001. URL https://www.sciencedirect. com/science/article/pii/S0246020306000148. (Cited on pages 3 and 19.) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748 8763. PMLR, 2021. (Cited on page 8.) Published in Transactions on Machine Learning Research (07/2023) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821 8831. PMLR, 2021. (Cited on page 8.) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. (Cited on page 8.) R. Tyrrell Rockafellar. Integral functionals, normal integrands and measurable selections. In Jean Pierre Gossez, Enrique José Lami Dozo, Jean Mawhin, and Lucien Waelbroeck (eds.), Nonlinear Operators and the Calculus of Variations, pp. 157 207, Berlin, Heidelberg, 1976. Springer Berlin Heidelberg. ISBN 978-3-540-38075-7. (Cited on page 21.) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009. (Cited on page 21.) Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id= 5Jd LZg346Lw. (Cited on pages 5, 7, 8, 11, and 36.) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-toimage diffusion models with deep language understanding. ar Xiv preprint ar Xiv:2205.11487, 2022. (Cited on page 8.) Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. (Cited on page 37.) V. Seguy, B.B. Damodaran, R. Flamary, Rolet Courty, N., A., and M. Blondel. Large-scale optimal transport and mapping estimation. ar Xiv preprint ar Xiv:1711.02283, 2017. (Cited on page 7.) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56): 1929 1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html. (Cited on page 33.) C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. (Cited on pages 1, 3, 4, 6, 19, and 20.) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. (Cited on page 10.) Y. Xie, X. Wang, R. Wang, and H. Zha. A fast proximal point method for computing exact wasserstein distance. In Uncertainty in Artificial Intelligence, pp. 433 453, 2020. (Cited on page 7.) Karren D Yang and Caroline Uhler. Scalable unbalanced optimal transport using generative adversarial networks. ar Xiv preprint ar Xiv:1810.11447, 2018. (Cited on page 1.) Yu Zeng, Zhe Lin, Huchuan Lu, and Vishal M Patel. Cr-fill: Generative image inpainting with auxiliary contextual reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14164 14173, 2021. (Cited on page 10.) Peiyuan Zhang, Jingzhao Zhang, and Suvrit Sra. Minimax in geodesic metric spaces: Sion s theorem and algorithms. ar Xiv preprint ar Xiv:2202.06950, 2022. (Cited on page 5.) Jiacheng Zhu, Aritra Guha, Dat Do, Mengdi Xu, Xuan Long Nguyen, and Ding Zhao. Functional optimal transport: map estimation and domain adaptation for functional data. ar Xiv preprint ar Xiv:2102.03895, 2021. (Cited on page 1.) Published in Transactions on Machine Learning Research (07/2023) 1 Introduction 1 2 A brief background of OT problems 3 3 Proposed method 4 4 Error Estimation via Duality Gaps 5 5 Related work 7 6 Experiments 7 6.1 Unpaired text to image generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6.2 Class-preserving map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 6.3 Unpaired image inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 6.4 Population transportation on the sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 7 Conclusion and Discussion 13 A Backgrounds of optimal transport problem with general cost function 19 B Proof of theoretical results 21 B.1 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.2 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 B.3 Consistency in optimal values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.4 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 B.5 Further analysis on λ( ) and β( ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 C Relation between our method and generative adversarial networks 29 D Additional results 29 E Implementation details and hyper-parameters 33 E.1 Synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.2 Text to image generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.3 Unpaired inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 E.4 Class preserving mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 F Theoretical justification of all the examples considered in this article 37 G Notations used in the paper 38 Published in Transactions on Machine Learning Research (07/2023) A Backgrounds of optimal transport problem with general cost function We present some fundamental results of OT problem with general cost in this section. Most of the results are collected and summarized from Villani (2008). Suppose M is a smooth, complete connected Riemmannian manifold. In most of our discussions, we treat M as the Euclidean space Rn for the sake of convenience. Let us denote X as a closed subset of M. Thus X is a Polish space, meaning X is a separable5, complete, metric space. We assume Y is a general Polish space. Suppose the cost function c is defined on X Y. Suppose X, Y are equipped with their Borel σ-algebra. Denote P(X), P(Y) as the space of Borel probability measures on X, Y. We mainly focus on the Monge problem (1) from ρa P(X) to ρb P(Y) with cost function c. Before the detailed discussion, we list some useful conditions that may be used later. A) c is bounded below, i.e., there exists a L1(X), b L1(Y), s.t. c(x, y) a(x) + b(y) on X Y; (15) B) c is locally Lipschitz as a function of x on int(X), locally in y; (16) C) c is everywhere superdifferentiable as a function of x int(X), for all y; (17) D) For arbitrary x int(X), xc(x, ) : Y Rn, is injective, (18) i.e., suppose for x, y, y , xc(x, y) = xc(x, y ), then y = y . Here we denote int(X) as the interior of the set X, i.e., int(X) is the union of all open sets in M contained in X. And we define the local Lipschitz property and superdifferentiablity as follows. Definition 1 (Locally Lipschitz). Let U Rn be open and let f : Rn R be given. Then (1) f is Lipschitz if there exists L < such that x, z Rn, |f(z) f(x)| L|x z|. (2) f is said to be locally Lipschitz if for any x0 Rn, there is a neighbourhood of x0 in which f is Lipschitz. Definition 2 (Superdifferentiablity). f : Rn R is superdifferentiable at x, with supergradient p, if f(z) f(x) + p, z x + o(|z x|). Here the little-o notation indicates that limz x o(|z x|) The relation between optimal values of Monge problem (1) and its relaxation, namely, the general OT problem (2) is studied in Pratelli (2007) (c.f. Bibliographical notes of Chap 4 in Villani (2008)). We summarize the result below. Theorem 5. Suppose X, Y are two Polish spaces; ρa, ρb are Borel probability distributions on X, Y. Suppose ρa is atomless (i.e. no single point carries a positive mass). Assume further c is continuous and satisfies (15). Then CMonge(ρa, ρb) = C(ρa, ρb). Recall that the Kantorovich dual problem of the primal OT problem (2) is formulated as K(ρa, ρb) := sup (ψ, ϕ) Cb(X) Cb(Y) ϕ(y) ψ(x) c(x, y) x X, y Y Y ϕ(y)ρb(y) dy Z X ψ(x)ρa(x) dx . We denote the optimal value of this dual problem as K(ρa, ρb). It is not hard to tell that the above dual problem is equivalent to any of the following two problems sup ψ Cb(X) Y ψc,+(y)ρb(y) dy Z X ψ(x)ρa(x) dx , (19) 5A topological space is separable if it contains a countable dense set. Since Rn is separable, its subset X equipped with the subspace topology is also separable. Published in Transactions on Machine Learning Research (07/2023) sup ϕ Cb(Y) Y ϕ(y)ρb(y) dy Z X ϕc, (x)ρa(x) dx . (20) Here the c-transform ψc,+, ϕc, are defined via infimum/supremum convolution as: ψc,+(y) = inf x (ψ(x) + c(x, y)); (21) ϕc, (x) = sup y (ϕ(y) c(x, y)). (22) If we denote ψ , ϕ as the optimal solutions to (19), (20) respectively. Then both (ψ , ψc,+ ) and (ϕc, , ϕ ) are the optimal solutions to (3); ψc,+ , ϕc, are also solutions to (20), (19) respectively. Remark 2 (c.f. Theorem 5.10 of Villani (2008)). It is worth mentioning that one can relax the Cb(X), Cb(Y) spaces in (3), (19) and (20) to L1(X; ρa) and L1(Y; ρb). The following theorem states the equivalent relationship between the primal OT problem (2) and its Kantorovich dual (3). The proof of a more general version can be found in Theorem 5.10 of Villani (2008). Theorem 6 (Kantorovich Duality). Suppose X, Y are two Polish spaces. Suppose c is a cost function defined on X Y, c is continuous and satisfies (15). Recall that C(ρa, ρb) denotes the infimum value of (2) and K(ρa, ρb) denotes the maximum value of (3) (or equivalently, (19), (20)). Then C(ρa, ρb) = K(ρa, ρb). Furthermore, if we assume that c is bounded from above by c(x, y) u(x) + v(y), with u L1(X; ρa), v L1(Y; ρb). (23) Then the optimal transport plan π for (2) and optimal potential ϕ for (20) both exist. The following result states the existence and uniqueness of the optimal solution to the Monge problem. It also reveals the relation between the optimal map T of (1) and the optimal transport plan π of (2). It is a simplified version of Theorem 10.28 combined with Theorem 10.26 taken from Villani (2008). Theorem 1 (Existence, uniqueness and characterization of the optimal Monge map). Assume X, Y, ρa, ρb are defined as in the beginning of section 2. Assume the dimension of boundary X is no larger than n 1. Suppose the cost c is continuous and satisfies (15), (16), (17), (18), we further assume that ρa and ρb are compactly supported and ρa is absolutely continuous with respect to the Lebesgue measure on Rn. Then there exists a unique-in-law transport map T solving the Monge problem (1). And the joint distribution (Id, T ) ρa on Rn m is the optimal transport plan of the general OT problem (2). Furthermore, there exists a c-convex6 ψ , which can be proved to be differentiable ρa almost surely7, and ψ (x) + xc(x, T (x)) = 0, ρa almost surely. Remark 3. The differentiable property of c-convex ψ is a direct result of Theorem 10.26 from Villani (2008). Remark 4. The condition C(ρa, ρb) < + in Theorem 10.28 of Villani (2008) is automatically satisfied since we assume that ρa, ρb are compactly supported and the cost c is continuous. It is also worth mentioning that the conditions in Theorem 1 can be modified and we can still obtain the existence and uniqueness of the Monge map. We refer the readers to Remark 10.33 of Villani (2008) for more details. Remark 5 (Discussion on regularity of T , ψ ). From the above theorem, one can tell that the regularity of the Monge map T is closely related to the regularity of ψ , which is a very important topic in OT and the Monge-Ampère equation associated to the OT problem. We refer the reader to Theorem 12.50, 12.51 and 12.52 of Villani (2008) for details. Furthermore, in recent research such as Hütter & Rigollet (2020), Manole et al. (2022), stronger Ck,α regularity on ψ were used to establish bounds on minimax rates for certain types of estimators under the statistical setting. In our paper, we did not consider the aforementioned statistical problem, so we omit the detailed discussion on the regularities of both T and ψ . 6C.f. the definition in Assumption 3. 7ψ is differentiable at x for all x Spt(ρa) \ E0, where E0 is a zero measure set. Published in Transactions on Machine Learning Research (07/2023) B Proof of theoretical results We first define the saddle point of L(T, f) to the sup-inf problem (4) as following. Definition 3 (Saddle point). Suppose X, Y are two Polish spaces. We say ( ˆT, ˆf) is a saddle point of the functional L(T, f) to the sup-inf problem (4) defined in (5) if ( ˆT, ˆf) satisfy ˆT argmin T M(X,Y) {L(T, ˆf)}; ˆf argmax f Cb(Y) {L( ˆT, f)}. Before we introduce the main results, we present a useful lemma that enables one to switch the optimization over function parameters and the integral. Such result is a corollary of Theorem 14.60 in Rockafellar & Wets (2009) (also c.f. Theorem 3A in Rockafellar (1976)) and it is also used in (Korotin et al., 2022b; Gazdieva et al., 2022). Lemma 1. Let f : Rn Rm R is continuous function. Let µ be a Borel measure defined on Rn. Denote M(Rn, Rm) as the space of all measurable functions T : Rn Rm. Then we have inf T M(Rn,Rm) Rn f(x, T(x))dµ(x) = Z Rn inf ξ Rm f(x, ξ) dµ(x). B.1 Proof of Theorem 2 Proof. (1 2) First denote T as a solution to (1) and ϕ as a solution to (20). We prove (T , ϕ ) is the saddle point of L(T, f). We first prove T argmin T M(X,Y) {L(T, ϕ )}. By using lemma 1, one gets inf T {L(T, ϕ )} = Z X sup ξ {ϕ (ξ) c(x, ξ)}ρa(x)dx+ Z Y ϕ (y)ρb(y)dy = Z Y ϕ (y)ρb(y)dy Z X ϕc, (x)ρa(x)dx. Since ϕ solves (20), the right-hand side of the above equals K(ρa, ρb). Now by Theorem 5, 6, one can obtain CMonge(ρa, ρb) = C(ρa, ρb) = K(ρa, ρb), which leads to the strong duality of such Monge problem. Thus we have inf T {L(T, ϕ )} = CMonge(ρa, ρb). Since T solves (1), we have inf T {L(T, ϕ )} = Z X c(x, T (x))ρa(x)dx = L(T , ϕ ). The second equality above is due to the constraint T ρa = ρb. This proves T argmin T M(X,Y) {L(T, ϕ )}. The proof of ϕ argmax ϕ Cb(Y) {L(T , ϕ)} is straightforward, since L(T , ϕ) = R X c(x, T (x))ρa(x)dx is indepen- Thus we have proved that (T , ϕ ) is a saddle of L(T, f). (2 1) Let us first prove that ˆT is an optimal solution to Monge problem (1). We first claim that ˆT ρa = ρb. Suppose this is not true, then one can find a function φ Cb(Y) such that R Y φ( ˆT(x))ρa(x)dx = R Y φ(y)ρb(y)dy8. Without loss of generality, assume R Y φ(y)ρb(y)dy R X φ( ˆT(x))ρa(x)dx > 0. We pick the test function νφ Cb(Y) with ν R as an arbitrary real number, then we consider L( ˆT, νφ) = Z X c(x, ˆT(x))ρa(x)dx + ν( Z Y φ(y)ρb(y)dy Z X φ( ˆT(x))ρa(x)dx | {z } denote as δ>0 ) c + µδ. (24) Recall ˆT argmin T {L(T, ˆf)}, use Lemma 1, then we have L( ˆT, ˆf) = inf T {L(T, ˆf)} = Z Y ˆf(y)ρb(y)dy Z X ˆf c, (x)ρa(x)dx K(ρa, ρb) = C(ρa, ρb) < + . (25) 8This is due to the fact that when Y is a metric space, and µa, µb P(Y) (Borel measure), if R Y f dµa = R Y f dµb for all f Cb(Y); then µa = µb. Published in Transactions on Machine Learning Research (07/2023) Combine (24) and (25), one can obtain L( ˆT, ˆf) L( ˆT, νφ) C(ρa, ρb) c νδ. We can increase ν large enough so that L( ˆT, ˆf) < L( ˆT, νφ). This contradiction to the fact that ˆf argmax f Cb(Y) {L( ˆT, f)}. Thus we have proved that ˆT ρa = ρb. As a result , Z X c(x, ˆT(x))ρa(x)dx = L( ˆT, ˆf) = inf T {L(T, ˆf)} inf T ρa=ρb{L(T, ˆf)} = inf T ρa=ρb X c(x, T(x))ρa(x)dx. This indicates that ˆT solves the Monge problem (1). The above calculation also leads to L( ˆT, ˆf) = CMonge(ρa, ρb). At last, we prove that ˆf is the optimal solution to Kantorovich dual problem (20). By applying Lemma 1, just notice that for any f Cb(Y), one can verify L( ˆT, ˆf) L( ˆT, f) inf T {L(T, f)} = Z Y f(y)ρb(y)dy Z X f c, (x)ρb(y)dy. On the other hand, we have L( ˆT, ˆf) = inf T {L(T, ˆf)} = Z Y ˆf(y)ρb(y)dy Z X ˆf c, (x)ρb(y)dy. This proves that ˆf is the optimal solution to (20). A simialr proof of T argmin T M(X,Y) {L(T, ϕ )} as part of our proof to Theorem 2 is also established in section 6.1 of Gazdieva et al. (2022). To prove the Corollary 1, it suffices to check the conditions in this corollary imply the conditions in Theorem 1, 2, 6. B.2 Proof of Theorem 3 Proof . As mentioned before in the proof of Theorem 2, by Lemma 1, one can verify inf T L(T, f) = Z Y f(y)ρb(y)dy Z X f c, (x)ρa(x)dx, (26) Then the sup-inf problem supf inf T L(T, f) can be formulated as Y f(y)ρb(y)dy Z X f c, (x)ρa(x)dx . This is exactly the Kantorovich dual problem (20). Since ( T, f) is the solution to sup-inf problem, f is an optimal solution to (20). This verifies the first assertion of the theorem. On the other hand, at ( T, f), we have T(x) argmaxξ Y{ f(ξ) c(x, ξ)}, ρa almost surely. This leads to f c, (x) = f( T(x)) c(x, T(x)), ρa almost surely. Then we have Z X c(x, T(x))ρa(x) dx = Z X f( T(x))ρa(x) dx Z X f c, (x)ρa(x) dx Y f(y)ρb(y) dy Z X f c, (x)ρa(x) dx X Y [ f(y) f c, (x)]dπ(x, y) ZZ X Y c(x, y)dπ(x, y) Published in Transactions on Machine Learning Research (07/2023) for any π Π(ρa, ρb). Here the second equality is due to the assumption T ρa = ρb. The last inequality is due to the definition of f c, (x) = supy{ f(y) c(x, y)}. We now take the infimum value of RR X Y cdπ and we obtain X c(x, T(x))ρa(x) dx C(ρa, ρb), (27) Now, for any transport map T satisfying T ρa = ρb, by denoting π = (Id, T) ρa, we have X c(x, T(x))ρa(x) dx = ZZ X Y c(x, y)dπ(x, y) C(ρa, ρb). (28) Combining (27) and (28), we obtain X c(x, T(x))ρa(x)dx Z X c(x, T(x))ρa(x)dx, for any T, T ρa = ρb. This indicates the existence of the Monge map, and T is the Monge map. At last, we have L( T, f) = Z X c(x, T(x))ρa(x)dx + Z Y f(y)(T ρa ρb)dy = Z X c(x, T(x))ρa(x) dx = CMonge(ρa, ρb). B.3 Consistency in optimal values As discussed in Theorem 3, we require the condition T ρa = ρb to establish the consistency between the sup-inf solution and the optimal transport solution. However, even without the assumption T ρa = ρb, one can directly prove the consistency in optimal value, i.e., supf inf T L(T, f) = C(ρa, ρb). This is demonstrated in the following theorem. Theorem 7 (Consistency in optimal value). Assume the cost function is continuous and satisfies the condition (15). Suppose the sup-inf problem (4) admits at least one solution ( T, f), then L( T, f) = C(ρa, ρb). Proof. The proof of the first part of this theorem is the same as Theorem 3. Thus it is not hard to verify sup f inf T L(T, f) = sup f Rm f(y)ρb(y)dy Z Rn f c, (x)ρa(x)dx = K(ρa, ρb). Since c(x, y) satisfies (15), recall Theorem 6, the optimal value of Kantorovich dual problem equals general OT distance C(ρa, ρb). This proves our assertion. Theorem 7 indicates that although for some general cases in which our proposed method (4) fails to compute for a valid transport map T, (4) is still able to recover the exact optimal transport distance C(ρa, ρb). In the following example, although the Monge map does not exist, the proposed method can still capture the exact OT distance. Example 1. Consider the OT problem on R with c(x, y) = |x y|2 and ρa = δ0 (point distribution at 0), ρb = N(0, 1) (normal distribution). Our method yields sup f inf T |0 T(0)|2 + Z R f(y)ρb(y)dy f(T(0)) . Published in Transactions on Machine Learning Research (07/2023) By setting Ψ0(y) = f(y) |y|2, our sup-inf problem becomes sup f inf T R [(f(y) |y|2) | {z } Ψ0(y) +|y|2]ρb(y)dy (f(T(0)) |0 T(0)|2) | {z } Ψ0(T (0)) = sup Ψ0 inf T { Z R (Ψ0(y) + |y|2)ρb(y)dy Ψ0(T(0))} = Z R |y|2ρb(y)dy | {z } =1 R [Ψ0(y) sup Ψ0]ρb(y)dy The supreme over Ψ0 is obtained when Ψ0 = Const, i.e., when f(y) = |y|2 + Const. Thus we have supf inf T L(T, f) = 1 = C(ρa, ρb). B.4 Proof of Theorem 4 We can firstly verify the following lemma. Lemma 2 (Existence and uniqueness of Monge map under Assumption 1, 2). Suppose Assumption 1 and 2 hold, then the Monge map T exists and is unique in law. Lemma 2 is a corollary of Theorem 1. Once assumption 1 and 2 hold, the conditions on c( , ) and ρa, ρb mentioned in Theorem 1 are all satisfied. To prove Theorem 4, we further need the following two lemmas. Lemma 3. Suppose n n matrix A is invertible with minimum singular value σmin(A) > 0. Also assume n n matrix H is self-adjoint and satisfies λIn H On9. Then A H 1A σmin(A)2 Proof of Lemma 3 . One can first verify that H 1 1 λIn by digonalizing H 1. To prove this lemma, we only need to verify that for arbitrary v Rn, v A H 1Av = (Av) H 1Av |Av|2 Thus A H 1A σmin(A)2 λ In is positive-semidefinite. The following lemma is crucial since it analyzes the concavity of the target function f( ) c( , y) when f is c-concave. Lemma 4 (Concavity of f( ) c(x, ) if f c-concave). Suppose the cost function c(x, y) and f satisfy the conditions mentioned in Theorem 4. Denote the function Ψx(y) = f(y) c(x, y), then we have 2Ψx(y) σ(x, y)2 Proof of Lemma 4. First, we notice that f is c-convex, thus, there exists φ such that f(y) = infx{φ(x) + c(x, y)}. Let us also denote Φ(x, y) = φ(x) + c(x, y). Now for a fixed y Rn, We pick one xy argminx {φ(x) + c(x, y)} Since we assumed that φ C2(Rn) and c C2(Rn Rn), we have xΦ(xy, y) = φ(xy) + xc(xy, y) = 0 (29) Now recall Assumption 3, . 2 xxΦ(xy, y) is positive definite, thus is also invertible. We can now apply the implicit function theorem to show that the equation xΦ(x, y) = 0 determines an implicit function x( ), 9Here matrix M1 M2 iff M2 M1 is a positive-definite matrix, and M1 M2 iff M1 M2 is a positive-semidefinite matrix. Published in Transactions on Machine Learning Research (07/2023) which satisfies x(y) = xy in a small neighborhood U Rn containing y. Furthermore, one can show that x( ) is continuously differentiable at y. We will denote xy as x(y) in our following discussion. Now differentiating (29) with respect to y yields 2 xxΦ(x(y), y) x(y) + 2 xyc(x(y), y) = 0. (30) On one hand, (30) tells us x(y) = 2 xxΦ(x(y), y) 1 2 xyc(x(y), y). (31) Now we directly compute 2Ψx(y) = 2f(y) 2 yyc(x, y). (32) In order to compute 2f(y), we first compute f(y) f(y) = (φ(x(y)) + c(x(y), y)) = yc(x(y), y). (33) the second equality is due to the envelope theorem Afriat (1971). Then 2f(y) can be computed as 2f(y) = 2 yxc(x(y), y) x(y) + 2 yyc(x(y), y). (34) Plugging (31) into (34), recall (32), this yields 2Ψx(y) = 2 yxc(x(y), y) 2 xxΦ(x(y), y) 1 2 xyc(x(y), y) + 2 yyc(x(y), y) 2 yyc(x, y). Recall the Assumption 1, notice that c C2(Rn Rn), by symmetry of second derivatives, one can verify 2 xyc = 2 yxc; Since 2 yyc(x, y) is independent of x, one has 2 yyc(x(y), y) 2 yyc(x, y) = 0. Thus we obtain 2Ψx(y) = 2 xyc(x(y), y) 2 xxΦ(x(y), y) 1 2 xyc(x(y), y). (35) By the positive definite assumption made in Assumption 3 as well as (7), we have λ(y)In 2 xxΦ(x(y), y) On. Recall that σmin( 2 xyc(x, y)) = σ(x, y). Now we apply lemma 3 to (35), this yields 2Ψx(y) σ(x, y)2 Now we prove the main result of Theorem 4. Proof of Theorem 4. In this proof, we denote R as R Rd for simplicity. We first recall L(T, f) = Z c(x, T(x))ρa(x) dx + Z f(y)ρb(y) dy Z f(T(x))ρa(x) dx = Z f(y)ρb(y) dy Z (f(T(x)) c(x, T(x)))ρa(x)dx, then we write E1(T, f) = L(T, f) inf e T L( e T, f) = Z [f(T(x)) c(x, T(x))]ρa dx + sup e T Z [f( e T(x)) c(x, e T(x))]ρa dx Recall Tf defined in Assumption 4, we have, for almost every x Rd, Tf(x) = argmaxy{f(y) c(x, y)} = argmaxy{Ψx(y)}, (36) Published in Transactions on Machine Learning Research (07/2023) recall that we denote Ψx(y) = f(y) c(x, y), then, for almost every x Rd, we have Ψx(Tf(x)) = 0. (37) One can also write: E1(T, f) = Z [(f(Tf(x)) c(x, Tf(x))) (f(T(x)) c(x, T(x)))] = Z [Ψx(Tf(x)) Ψx(T(x))]ρa(x) dx For any x Rd, since Ψx( ) C2(Rn), and according to the previous Lemma 4, we have Ψx(T(x)) Ψx(Tf(x)) = Ψx(Tf(x))(T(x) Tf(x)) + 1 2(T(x) Tf(x)) 2Ψx(ω(x))(T(x) Tf(x)) ω(x) = (1 θx)T(x) + θx Tf(x) (38) for certain θx [0, 1]. By (37) and Lemma 4, we have Ψx(T(x)) Ψx(Tf(x)) σ(x, ω(x))2 2λ(ω(x)) |T(x) Tf(x)|2. Thus we have: E1(T, f) = Z [Ψx(Tf(x)) Ψx(T(x))]ρa(x) dx Z σ(x, ω(x))2 2λ(ω(x)) |T(x) Tf(x)|2ρa(x) dx (39) On the other hand, recall a solution of our sup-inf problem (4) is denoted as ( T, f). Then we have sup f inf T L(T, f) = Z c(x, T(x))ρa dx Z f(y)( T ρa ρb)dy = Z c(x, T(x))ρa dx, the second equality is due to the assumption on the consistency between T and Monge map T , thus T ρa = ρb. Thus we have E2(f) = Z c(x, T(x))ρa dx inf e T Z c(x, e T(x))ρa(x)dx + Z f(y)ρb(y) dy Z f( e T(x))ρa(x)dx = Z f( T(x)) c(x, T(x))ρa dx + sup e T Z (f( e T(x)) c(x, e T(x)))ρa(x)dx. The second equality is due to T ρa = ρb. Similar to the previous treatment, we have E2(f) = Z [Ψx(Tf(x)) Ψx( T(x))]ρa(x) dx Apply similar analysis as before, we obtain E2(f) Z σ(x, ξ(x))2 2λ(ξ(x)) | T(x) Tf(x)|2ρa(x) dx (40) ξ(x) = (1 τx) T(x) + τx Tf(x) (41) Published in Transactions on Machine Learning Research (07/2023) for certain τx [0, 1]. Since T = T , ρa almost surely, (40) leads to E2(f) Z σ(x, ξ(x))2 2λ(ξ(x)) |T (x) Tf(x)|2ρa(x) dx (42) β(x) = min σ(x, ω(x))2 2λ(ω(x)) , σ(x, ξ(x))2 where ω and ξ are defined in (38) and (41). Combining (39) and (42), we obtain E1(T, f) + E2(f) Z β(x)(|T(x) Tf(x)|2 + |T (x) Tf(x)|2)ρa dx 2 |T(x) T (x)|2ρa dx This leads to T T L2(βρa) p 2(E1(T, f) + E2(f)). Discussion on enforcing the c-concave property of the dual variable Theorem 4 is established on the assumption of c-concavity of the dual variable f. In practice, the c-concave property can be approximated via soft-min convolution, i.e., one can set fθ(y) = 1 K log(EX U exp( K(φθ(X) + c(X, y)))) with K > 0 large enough, φθ as an arbitrary neural network and U as a uniform distribution over a sufficiently large ball. However, according to the existing work (Fan et al., 2022b; Korotin et al., 2022a) on applying ICNN to solve OT, restricting f to the c-concave functions may undermine the performance of our computation. The related implementation and comparison may serve as good directions for future research. B.5 Further analysis on λ( ) and β( ) We can carry out some further calculations so that we may gain a better understanding of the value λ(y) evaluated at y = Tf(x). We can then tell that the quantity β defined in (43) will be impossible to be bounded from below as the maps T, Tf are very close to the sup-inf solution T. Before our calculation, we need the following lemma. Lemma 5. We assume the assumptions mentioned in Theorem 4 all hold. We also assume that for the cost function c, the map yc( , y) : x 7 yc(x, y) is injective. We further strengthen Assumption 4 such that Tf(x) minimizes f(y) c(x, y) for all x Rd. Similarly, we denote map Tφ as Tφ(y) = x(y) arg minx{φ(x) + c(x, y)} for all y Rd. Then we have Tφ(Tf(y)) = y for all y. Proof. To prove this, we notice that y(f(y) c(x, y))|y=Tf (x) = 0. This yields f(Tf(x)) yc(x, Tf(x)) = 0, for arbitrary x Rd. (44) By a similar argument carried out in the proof of Lemma 4, we are able to show that f(y) = yc(Tφ(y), y), (45) for all y Rd. Then, for any x, we can set y = Tf(x) in (45) and plug this into (44) to obtain yc(Tφ(Tf(x)), Tf(x)) yc(x, Tf(x)) = 0 (46) Now due to the fact that xyc(x, y) is invertible in Assumption 1, one can tell that the map yc( , y) : x 7 xc(x, y) is injective. Thus (46) yields Tφ(Tf(x)) = x for all x Rd. Remark 6 (Tf, Tφ as inverse map to each other). If we further assume that φ is c-convex, i.e., φ(x) = supy{f(y) c(x, y)}. Then by similar argument, we can also prove that Tf(Tφ(y)) = y for all y Rd. Thus, Tf and Tφ are inverse map to each other. Published in Transactions on Machine Learning Research (07/2023) Theorem 8 (Estimation on λ( ) at Tf(x)). Suppose we assume all the assumptions as in Lemma 5. Let Tf as defined in Assumption 4, and λ( ) defined in (7). Denote Γ = 2 xyc(x, Tf(x)). Then we have λ(Tf(x)) = ΓDTf(x)Γ 1Γ 2. Furthermore, λ(Tf(x)) σ(Γ) DTf(x) 2. Proof. Let us recall that we denote Tf(x) = arg maxy{f(y) c(x, y)} = arg maxy{Ψx(y)}. Using the optimality condition, we obtain Ψx(Tf(x)) = 0. yc(x, Tf(x)) f(Tf(x)) = 0. (47) Let us now differentiate (47) w.r.t. x. 2 xyc(x, Tf(x)) + 2 yyc(x, Tf(x))DTf(x) 2f(Tf(x))DTf(x) = 0, which can also be formulated as 2 xyc(x, Tf(x)) + 2 yyΨx(Tf(x))DTf(x) = 0. (48) Let us now recall equation (35) and the definition of Tφ, we substitute (35) into (48) to obtain 2 xyc(x, Tf(x)) 2 xyc(Tφ(Tf(x)), Tf(x)) 2 xxΦ(Tφ(Tf(x)), Tf(x)) 1 2 xyc(Tφ(Tf(x)), Tf(x))DTf(x) = 0. (49) Recall that the previous Lemma 5 yields Tφ(Tf(x)) = x for all x Rd. Let us denote Γ = 2 xyc(x, Tf(x)) for simplicity. Then equation (49) then leads to 2 xxΦ(x(Tf(x)), Tf(x)) = ΓDTf(x)Γ 1Γ (50) Since 2 xxΦ is self-adjoint, then λmax( 2 xxΦ(x(Tf(x)), Tf(x))) = 2 xxΦ(x(Tf(x)), Tf(x)) 2 = ΓDTf(x)Γ 1Γ 2. We can bound this matrix 2-norm as ΓDTf(x)Γ 1Γ 2 ΓDTf(x)Γ 1 2 Γ 2 = σmin(Γ) DTf(x) 2. Thus, when y equals to the minimizer Tf(x) arg miny{f(y) c(x, y)}, the quantity λ(y) = λ(Tf(x)) σmin(Γ) DTf(x) 2 Now suppose that method is about to converge to the sup-inf solution, i.e., the current minimizer Tf of L(f, T) is very close to the sup-inf solution T and our computed T is very close to Tf. Then the ω(x), ξ(x) mentioned in the proof of Theorem 4 will both be very close to Tf(x), thus, β(x) approximately equals to σ(x, Tf(x)) Recall the inequality λ(Tf(x)) σ(Γ) DTf(x) 2 by Theorem 8 and σmin(Γ) = σmin( 2 xyc(x, Tf(x))) = σ(x, Tf(x)) by definition, we obtain that β(x) can approximately bounded from above by 1 DTf (x) 2 . Notice that we can easily construct ρa with very small support and ρb with large support so that the norm of the gradient of the approximated OT map Tf(x) 2 is as large as possible for all x supp(ρa). Then β(x) can be an arbitrarily small value. Example 2. Consider quadratic cost c(x, y) = |x y|2 and two uniform distributions on R, ρa = U([ ϵ, ϵ]) and ρb = U([ L, L]) with ϵ, L > 0. Then the optimal transport map T (x) = L ϵ x. Since we assume the consistency, T(x) = L ϵ x, thus D T = L In Example 2, suppose T, Tf are very close to T, it is likely that DTf(x) also possesses a large norm for most of the x [ ϵ, ϵ], which leads to a very small value of β(x) ϵ Published in Transactions on Machine Learning Research (07/2023) C Relation between our method and generative adversarial networks It is worth pointing out that our scheme and Wasserstein Generative Adversarial Networks (WGAN) (Arjovsky et al., 2017) are similar in the sense that they are both doing minimization over the generator/map and maximization over the discriminator/dual potential. However, there are two main distinctions between them. Such differences are not reflected from the superficial aspects such as the choice of reference distributions ρa, but come from the fundamental logic hidden behind the algorithms. We want to first emphasize that the mechanisms of the two algorithms are different: Typical Wasserstein GANs (WGAN) are usually formulated as min G max D Lip 1 Z D(y)ρb(y)dy Z D(G(x))ρa(x)dx | {z } 1 Wasserstein distance W1(G ρa,ρb) and ours reads max f min T Z f(y)ρb(y)dy Z f(T(x))ρa(x)dx + Z c(X, T(x))ρa(x)dx | {z } general Wasserstein distance CMonge(ρa,ρb) The inner maximization of (51) computes W1 distance via Kantorovich duality and the outer loop minimize the W1 gap between desired ρb and G ρa; However, the logic behind our scheme (52) is different: the inner optimization computes for the c transform of f, i.e. f c, (x) = supξ(f(ξ) c(x, ξ)); And the outer maximization computes for the Kantorovich dual problem C(ρa, ρb) = supf R f(y)ρb(y)dy R f c, (x)ρa(x)dx . Even under W1 circumstance, one can verify the intrinsic difference between two proposed methods: when setting the cost c(x, y) = x y , and ρa = G ρa in (52), the entire "sup-inf" optimization of (52) (underbraced part) is equivalent to the inner maximization problem of (51) (underbraced part), but not for the entire saddle point scheme. It is also important to note that WGAN aims to minimize the distance between the generated distribution and the target distribution and the ideal value for (51) is 0. On the other hand, one of our goals is to estimate the optimal transport distance between the initial distribution ρa and the target distribution ρb. Thus the ideal value for (52) should be C(ρa, ρb), which is not 0 in most of the cases. We then argue about the optimality of the computed map G and T: In (51), one is trying to obtain a map G by minimizing W1(ρb, G ρa) w.r.t. G, and hopefully, G ρa can approximate ρb well. However, there isn t any restriction exerted on G, thus one can not expect the computed G to be the optimal transport map between ρa and ρb; On the other hand, in (52), we not only compute T such that T ρa approximates ρb , but also compute for the optimal T that minimizes the transport cost Eρa[c(X, T(X))]. In (52), the computation of T is naturally incorporated in the sup-inf scheme, and there exists the theoretical result (recall Theorem 3 in the paper) that guarantees T to be the optimal transport map. In summary, even though the formulation of both algorithms are similar, the designing logic (minimizing distance vs computing distance itself) and the purposes (computing arbitrary pushforward map vs computing the optimal map) of the two methods are distinct. Thus the theoretical and empirical study of GANs cannot be trivially translated to the proposed method. In addition to the above discussions, we should also refer the readers to Gazdieva et al. (2022), in which a comparison between a similar saddle point method and the regularized GANs are made in section 6.2 and summarized in Table 1. D Additional results OT between 1D manifolds We consider the source distribution to be an 1D uniform distribution, and the target as an incomplete ellipse. In this case, the source dimension n = 1 and target dimension m = 2. Published in Transactions on Machine Learning Research (07/2023) As discussed in Section 3, we choose the embedding function to be padding-zero function Q(x) = [x, 0], so the cost reads c(x, y) = c(Q(x), y) = [x, 0] y 2 2. As shown in Figure 8, our algorithm is able to learn a symmetric map from an 1D uniform distribution towards the incomplete ellipse. This example also shows that our method can deal with the case that both ρa and ρb are not absolute continuous w.r.t. Lebesgue measure. (d) T( ) map Figure 8: Qualitative results for learning unequal dimension maps. Decreasing function as the cost We consider the cost function c(x, y) = ϕ( x y ) with ϕ as a monotonic decreasing function. We test our algorithm for a specific example ϕ(s) = 1 s2 . In this example, we compute for the optimal Monge map from ρa to ρb with ρa = h U([4, 6] [0, 2π)) and ρb = h U([1, 2] [0, 2π)). Here we define h : R+ [0, 2π) R2 as the transform from polar coordinates to Cartesian coordinates, i.e. h(r, θ) = (r cos θ, r sin θ). We denote U(E) as the uniform distribution over the measurable set E R2. Let us denote the outer annulus as Ωa, which is the image of [4, 6] [0, 2π) under h; similarly, the inner annulus is denoted as Ωb. One can verify that the density function ρa(x) = 1 4π x χΩa and ρb(x) = 1 2π x χΩb. Here χE is the indicator function of measurable set E, i.e. χE(x) = 1 if x E and χE(x) = 0 if x / E. Under polar coordinate, we define the map τ that maps each point (r, θ) on [4, 6] [0, 2π) to (4 r 2, (θ + π) mod (2π)) [1, 2] [0, 2π). Then the ground truth Monge map is T = h τ h 1 : Ωa Ωb. Now we denote OT distance using the above reciprocal cost as W 2. Then it is not hard to compute W 2(ρa, ρb) = RR Ωa 1 T (x) x 2 ρa(x) dx = R 2π 0 R 6 4 1 (4+ r 2) 2 1 4πrrdrdθ = 1 2 R 6 4 1 (4+ r 2) 2 dr = 1 42. We also compute the same problem for L2 cost, i.e. c(x, y) = x y 2. Under polar coordinate, the corresponding Monge map is T = h τ h 1 where τ of the problem should map each point (r, θ) on [4, 6] [0, 2π) to ( r 2 1, θ). Similar to the above discussion, W2(ρa, ρb) = R 2π 0 R 6 4 1 + r 2 2 1 4πrrdrdθ 1 Figure 9 shows the transported samples as well as the differences between the two cost functions. Figure 9: Panel (a) shows the samples from computed T ρa. When c(x, y) = 1/ x y 2 (panels (b) (c)), we show T(x) maps the circle with radius 6 to the circle with radius 1, and maps the circle with radius 4 to the circle with radius 2. If c(x, y) = x y 2, which is the case of the panels (d) (e), it is exactly opposite. Published in Transactions on Machine Learning Research (07/2023) Uniform distribution on sphere We use the geodesic distance on sphere (13) as the cost. We set ρa = U([0, 2π) [0, π 4 ]) and ρb = U([0, 2π) [ 3π 4 , π]), and consider solving min T :T ρa=ρb Z c((θ, ϕ), T(θ, ϕ))ρa dθdϕ. (53) To mitigate the gradient blow-up of arccos near 1, we slightly modify the original cost function as cλ((θ1, ϕ1), (θ2, ϕ2)) = arccos(λ(sin ϕ1 sin ϕ2 cos(θ1 θ2) + cos ϕ1 cos ϕ2)) (54) with λ = 0.999. We apply our algorithm to solve (53) with cost cλ. We also translate our computed map back to the 3D sphere S via spherical-Cartesian coordinate transform (θ, ϕ, r) 7 (r sin ϕ cos θ, r sin ϕ sin θ, r cos ϕ) where we fix the radius r = 3. We demonstrate the numerical results in Figure 10. In this example, the ground truth Monge map T of (53) maps each (θ, ϕ) on [0, 2π) [0, π 4 ] to (θ, ϕ + 3 4π). Furthermore, it is not hard to verify that the OT distance between ρa, ρb is Wgeodesic(ρa, ρb) = 9π Figure 10: Monge map from ρa to ρb on the sphere: (a) blue samples from ρa and orange samples from ρb; (b) blue samples from ρa, orange samples are obtained from T ρb, grey curves are geodesics connecting each transporting pairs; (c) our computed Monge map maps blue ring (ϕ = π 8 ) to the orange curve (ground truth is ϕ = 7 8π); (d) our computed Monge map maps blue ring (ϕ = π 4 ) to the orange curve (ground truth is the south pole) Our computational method also yields to an estimation on the OT distance. To be more specific, for any computed Tθ, we can use Monte Carlo method to estimate the OT distance as c W(ρa, ρb) = 1 N PN k=1 c(xk, Tθ(xk)), where the N samples {xk} are i.i.d. samples drawn from ρa. In exact computation, we choose N = 50000. We summarize the comparison between our estimated OT distances and exact OT distances of previous examples in the following Table 3. W 2 (annulus example) W2 (annulus example) Wgeodesic (sphere example) Figure 9 (b,c) Figure 9 (d,e) Figure 10 real distance W 1/42 0.0238 37/3 3.5119 9π/4 7.0686 our estimation c W 0.0241 3.4019 7.1041 relative error |W b W | |W | 1.3149% 3.1325% 0.5018% Table 3: Comparison between the estimated OT distances and the exact OT distances. Published in Transactions on Machine Learning Research (07/2023) Figure 11: In this figure, we plot N = 40000 samples {θk, ϕk}N k=1 (blue) randomly drawn from ρSph a , and their mapped samples {Tθ(θk, ϕk)}N k=1 (green). Population transportation We have described this example in Section 6.4. Here we make further explanation on the map-to-land transform τ, it is defined as follows. (θ, ϕ), if (θ, ϕ) Land; argmin ( θ, ϕ) P ( θ, ϕ) (θ, ϕ) 2 , if (θ, ϕ) Sea. Here we choose P as a finite set consists of 2000 samples randomly selected from ρSph b . We further plot in Figure 11 with sufficiently large amount of source samples and mapped samples. Among the 40000 mapped samples, 7718 are located on the sea, and we apply τ to map these samples back to a rather close location on land. It is worth mentioning that the map used as the background in our figures has several small regions removed from the actual land. This is due to the lack of data points in the dataset provided in Doxsey-Whitfield et al. (2015). We also tested our algorithm on this population transport example with different costs. One of the costs we consider is the cλ cost defined before in (54), with λ = 0.99. We can treat cλ as a reasonable approximation of the exact geodesic distance on sphere. The other cost we consider is the L2 cost c((θ1, ϕ1), (θ2, ϕ2)) = (θ2 θ1)2 + (ϕ2 ϕ1)2. We present the numerical results for both costs in Figure 12. One can tell that there are more transport trajectories passing through the Northern Arctic Circle with cλ cost than the trajectories with L2 cost. This makes sense since the cλ distance of trajectory passing through the North Arctic Circle won t be necessarily large when comparing with L2 distance. As a result, the transport plan with cλ cost is available to accommodate more transport plans going across the Arctic Circle. Text to image generation We show additional generation results in Figure 13. We also compared the cos-sim cost (9) and the ordinary inner product cost c(x, y) = Rx, y . The latter can converge to the same level as cos-sim cost, measured both qualitatively via generated images and quantitatively via cosine similarity w.r.t. real images (see Figure 4 (b).) The only difference is the method with inner product cost will converge slightly slower. After investigating the statistics of Rx , we find this result is reasonable because Rx concentrates around 12 with a large probability. Meanwhile, we emphasize that this text-to-image generation task has not been considered in previous neural OT methods, which is itself a contribution, even though using cos-sim cost does not distinguish from inner product cost significantly. Published in Transactions on Machine Learning Research (07/2023) (a) Population transport with cost cλ (b) Population transport with L2 cost Figure 12: Comparison of transport plans using different cost functions. E Implementation details and hyper-parameters E.1 Synthetic datasets We follow Algorithm 1 and apply Adam method Kingma & Ba (2014) to train both Tθ and fη for all the following examples in this section. Recall the notations B, K, K1, K2 mentioned in Algorithm 1. OT between 1D manifolds The networks Tθ and fη each has 5 layers with 10 hidden neurons. The batch size B = 100. K1 = 6, K2 = 1. The learning rate is 10 3. The number of iterations K = 12000. Decreasing cost function In this example, we set Tθ(x) = x + Fθ(x) and optimize over θ. For either 1 |x y|2 or |x y|2 case we set both Fθ and the Lagrange multiplier fη as six layers multilayer perceptron (MLP), with PRe LU and Tanh activation functions respectively, each layer has 36 nodes. The training batch size B = 2000. We set K = 8000, K1 = 8, K2 = 6. The learning rate for both θ and η equals 10 4. Uniform distribution on sphere In this example, we set Tθ(x) = x + Fθ(x) and optimize over θ. We set both Fθ and fη as six layers MLP, with PRe LU activation functions, each layer has 8 nodes The training batch size is B = 4000. We set K = 10000, K1 = 8, K2 = 4. The learning rate for both θ and η equals 0.9 10 4. Population transportation In this example, we set Tθ(x) = x + Fθ(x) and optimize over θ, we choose both Fθ, fη as Res Nets with depth equals 4 and hidden dimensions equals 32; we choose the activation function as PRe LU. We apply dropout technique Srivastava et al. (2014) with p = 0.24 to each layer of our networks. The learning rate for both θ and η was set as 5 10 5. Tθ is computed after K = 200000 steps of optimization with K1 = 8, K2 = 1 and training batchsize B = 400. For the comparison experiment, we tested our example by slightly modifying the codes presented in OT mapping estimation for domain adaptation of POT library Flamary et al. (2021). We train the linear transformation on 2000 samples from both ρSph a and ρSph b , we choose the hyper parameter µ = 1, ϵ = 10 4. We plot the second Figure 7 by applying the map-to-land transform τ with 2000 newly selected samples. We also tried the same example with Gaussian kernel provided in Flamary et al. (2021), but the results are always unstable for various hyper parameters. Generally speaking, it is very hard to obtain valid results with Gaussian kernel. E.2 Text to image generation Dataset details Laion aesthetic dataset is filtered from a Laion 5B dataset to have the high aesthetic level. Laion art is the subset of Laion aesthetic and contains 8M most aethetic samples. We download the metadata of Laion art according to the instructions of Laion. We then filter only English prompts, and download the images with img2dataset. To speed up the training, we run the CLIP retrieval to convert Published in Transactions on Machine Learning Research (07/2023) (a) Random image samples on Laion art prompts (b) Random image samples on Conceptual Captions 3M prompts Figure 13: Additional text to image generation results images to embeddings and use embedding-dataset-reordering to reorder the embeddings into the expected format. By doing these two steps, we save the time of calculating embeddings on the fly. After filtering English prompts, we get the Laion art dataset with 2.2M data. Published in Transactions on Machine Learning Research (07/2023) (a) Degraded and original images (b) Composite images G(x) (c) Pushforward images T(x) Figure 14: Unpaired image inpainting on test dataset of Celeb A 64 64. We take the composite image G(x) = T(x) M C + x M as the output image. Additionally, we provide the pushforward images T(x) to illustrate the regularization effect of transportation costs. In panels (b) and (c), we show the results with α = 10 in the first row and α = 10000 in the second row. A small transportation cost would result that the pushforward map neglects the connection to the unmasked area, which is illustrated by a more clear mask border in pushforward images. Figure 15: Unpaired image inpainting on test dataset of Celeb A 128 128. We take the composite image G(x) = T(x) M C + x M as the output image. We download CC-3M following this instruction. We then remove all the images with watermark, which include images downloaded from shutterstock, alamy, gettyimages, and dailymail.co.uk websites. After removing watermark images, the CC-3M only has 0.8M (text, image) pairs left. We then use CLIP retrieval and embedding-dataset-reordering packages to convert raw images/texts to embeddings. For each dataset, we let the source and the target distribution contain 0.3M data respectively, and take the rest of dataset as the test data. Network structure and hyper-parameters We use the DALLE 2 prior network to represent the map T. The potential f is using the same network with an additional pooling layer. We use EMA to stabilize the training of the map. The batch size is 225. The numbers of loop iterations are K1 = 10, K2 = 1. We Published in Transactions on Machine Learning Research (07/2023) use the learning rates 10 4, Adam (Kingma & Ba, 2014) optimizer with weight decay coefficient 0.0602. We train the networks for 110 epochs. On NVIDIA RTX A6000 (48GB), the training time of each experiment is 21 hours. E.3 Unpaired inpainting Denote M C as the complement of M, i.e. M C = 1 in the occluded area and 0 otherwise. The composite image is G(x) = T(x) M C + x M. The loss function we actually use is slightly different with the (5). We modify the f(T(x)) to be f(G(x)) to strengthen the training of f sup f inf T Rn [c(x, T(x)) f(G(x))] ρa(x) dx + Z Rm f(y)ρb(y) dy. In the unpaired inpainting experiments, the images are first cropped at the center with size 140 and then resized to 64 64 or 128 128. We choose learning rate to be 1 10 3, Adam optimizer with default beta parameters, K2 = 1. The batch size is 64 for Celeb A64 and 16 for Celeb A128. The number of inner loop iteration K1 = 5 for Celeb A64 and K1 = 10 for Celeb A128. We use exactly the same UNet for the map T and convolutional neural network for f as Rout et al. (2022, Table 9) for Celeb A64 and add one additional convolutional block in f network for Celeb A128. On NVIDIA RTX A6000 (48GB), the training time of Celeb A64 experiment is 10 hours and the time of Celeb A128 is 45 hours. We use the POT implementation (Flamary et al., 2021) of Perrot et al. (2016) as the template and implement the masked MSE loss by ourselves. Perrot et al. (2016) provides two options as the transformation map, one is linear and another is the kernel function. We find the kernel function on this example would generate mode-collapse results, so we adopt the linear transformation map. We choose L2 regularization coefficient to be 10 3, and other parameters are the same as default. E.4 Class preserving mapping The NIST images are rescaled to size 32 32 and repeated to 3 channels. We use the conditional UNet to represent map T. We add a projection module (Miyato & Koyama, 2018) on WGAN-QC s Res Net (Liu et al., 2019) to represent potential f. We choose learning rate to be 1 10 4, Adam optimizer with betas (0.5, 0.999). K1 = 10, K2 = 1. We use EMA to stablize the training of the map. The batch size is 64. In practice, we introduce a coefficient λ in the cost c({x, y}, {x , y }) = x x 2+λ1(y = y ) and set λ = 0.5. Published in Transactions on Machine Learning Research (07/2023) F Theoretical justification of all the examples considered in this article Table 4: We make a summary table on all the problems considered in this work with their cost function c and distributions ρa, ρb and whether the existence and uniqueness of Monge map for each problem are guaranteed. We conclude the last column by checking the conditions in Theorem 1. Note that the theory of the existence of Monge map relies on the connection to a deterministic OT plan (Santambrogio, 2015, Section 1.1-1.4), and it is quite common that the OT plan is non-deterministic in deep learning examples. Sd is the unit ball in Rd, and d 1 is the d 1 dimension probability simplex in Rd. Problem X Y Cost Guarantee of existence and uniqueness Text to image R77 768 \ {0} S768 c(x, y) = Rx,y [No] xc(x, ) = R Rx (I uu ) may not be injective. Here we denote u = Rx Rx . Class-preserving map [ 1, 1]32 32 3 J 1 Y = X c({x, y}, { x, y}) = [No] Consider M = R32 32 3+J, ρa is not absolutely continuous as a distribution on R32 32 3+J Image inpainting [ 1, 1]32 32 3 or [ 1, 1]64 64 3 Y = X c(x, y) x M y M 2 2 Existence guaranteed uniqueness not guaranteed Population transport [0, 2π) [0, π] Y = X c({θ1, ϕ1}, {θ2, ϕ2}) = sin ϕ1 sin ϕ2 cos(θ1 θ2) cos ϕ1 cos ϕ2 [No] xc(x, ) may not be injective, c.f. Remark 7 Line to ellipse [ L, L] (0, π) (x a cos y)2 + b2 sin2 y The parametrization of the line and the ellipse are (x, 0); (a cos y, b sin y). [Yes] The conditions in Theorem 1 are satisfied. (reciprocal cost) c(x, y) = 1 x y 2 [Yes] One can restrict c on X Y, i.e., set the manifold M = Ω1 as stated in Appendix A, and choose X = M. The conditions in Theorem 1 are all satisfied. c(x, y) = x y 2 [Yes] One can verify that the conditions in Thm 1 are satisfied. OT on sphere [0, 2π) [0, π] Y = X cλ({θ1, ϕ1}, {θ2, ϕ2}) = arccos(λ(cos ϕ1 cos ϕ2 + sin ϕ1 sin ϕ2 cos(θ1 θ2))) [No] xc(x, ) may not be injective. c.f. Remark 7 Remark 7. We discuss the condition (18) for geodesic cost. Consider the cost c is the linearized geodesic distance (14), one can compute θ1,ϕ1c({θ1, ϕ1}, {θ2, ϕ2}) = sin ϕ1 sin ϕ2 sin(θ1 θ2) cos ϕ1 sin ϕ2 cos(θ1 θ2) + sin ϕ1 cos ϕ2 Fix θ2 = θ1. When ϕ1 = π 2 , we can always choose two different ϕ , ϕ [0, π] such that sin(ϕ1 ϕ ) = sin(ϕ1 ϕ ), so θ1,ϕ1c({θ1, ϕ1}, {θ2, ϕ }) = 0 sin(ϕ1 ϕ ) = 0 sin(ϕ1 ϕ ) = ϕ1c({θ1, ϕ1}, {θ2, ϕ }), Published in Transactions on Machine Learning Research (07/2023) thus xc(x, ) is not injective on [0, 2π) [0, π]. One can also use a similar way to show that for the cost cλ (cf. (54)) used in OT on sphere, xcλ(x, ) is not injective. This is also true for the original arc length distance function (13). The cost function c(x, y) may satisfy (18) if it is the square of the geodesic distance, i.e., c(x, y) = d2(x, y) where d(x, y) denotes the length of the geodesic joining x, y. The computation w.r.t. the square cost may serve as one of our future research directions. G Notations used in the paper Notation Meaning c(x, y) cost function X, Y Polish spaces ρa, ρb Borel probability measures defined on X, Y T Measurable map from X to Y T µ Pushforward of distribution µ by the map T M(X, Y) Space of measurable maps from X to Y Cb(X) Space of bounded continuous functions defined on X P(X) Space of Borel probability distributions defined on X C(ρa, ρb) OT distance from ρa to ρb, i.e., optimal value to the general OT problem 2 CMonge(ρa, ρb) Optimal value to the Monge problem 1 K(ρa, ρb) Optimal value to the Kantorovich dual problem (3), (19), or (20) L(T, f) Lagrangian function of the Monge problem, defined in (5) ψc,+ c transform defined in (21) ϕc, c transform defined in (22) T , ϕ Optimal solution (Monge map) to (1) and optimal solution to (20) ( ˆT, ˆf) Saddle point of L(T, f) to (4) ( T, f) sup-inf solution to (4) B, K, K1, K2 Training batch size, total iterations, iterations for Tθ, iterations for fη R Frozen matrix Point-wise multiplication M, M C Binary mask with the same size as the image, complement of M U(E) Uniform probability distribution on measurable set E