Published as a conference paper at ICLR 2022

GENERATIVE MODELING WITH OPTIMAL TRANSPORT MAPS

Litu Rout
Space Applications Centre, Indian Space Research Organisation
lr@sac.isro.gov.in

Alexander Korotin
Skolkovo Institute of Science and Technology, Artificial Intelligence Research Institute (AIRI)
a.korotin@skoltech.ru

Evgeny Burnaev
Skolkovo Institute of Science and Technology, Artificial Intelligence Research Institute (AIRI)
e.burnaev@skoltech.ru

With the discovery of Wasserstein GANs, Optimal Transport (OT) has become a powerful tool for large-scale generative modeling tasks. In these tasks, the OT cost is typically used as the loss for training GANs. In contrast to this approach, we show that the OT map itself can be used as a generative model, providing comparable performance. Previous analogous approaches consider OT maps as generative models only in latent spaces due to their poor performance in the original high-dimensional ambient space. In contrast, we apply OT maps directly in the ambient space, e.g., a space of high-dimensional images. First, we derive a min-max optimization algorithm to efficiently compute OT maps for the quadratic cost (Wasserstein-2 distance). Next, we extend the approach to the case when the input and output distributions are located in spaces of different dimensions and derive error bounds for the computed OT map. We evaluate the algorithm on image generation and unpaired image restoration tasks. In particular, we consider denoising, colorization, and inpainting, where the optimality of the restoration map is a desired attribute, since the output (restored) image is expected to be close to the input (degraded) one.

1 INTRODUCTION

Since the discovery of Generative Adversarial Networks (GANs, Goodfellow et al. (2014)), there has been a surge in generative modeling (Radford et al., 2016; Arjovsky et al., 2017; Brock et al., 2019; Karras et al., 2019). In the past few years, Optimal Transport (OT, Villani (2008)) theory has been pivotal in addressing important issues of generative models. In particular, the usage of the Wasserstein distance has improved diversity (Arjovsky et al., 2017; Gulrajani et al., 2017), convergence (Sanjabi et al., 2018), and stability (Miyato et al., 2018; Kim et al., 2021) of GANs.

Generative models based on OT can be split into two classes depending on what OT is used for. First, the optimal transport cost serves as the loss for generative models, see Figure 1a. This is the most prevalent class of methods, which includes WGAN (Arjovsky et al., 2017) and its modifications: WGAN-GP (Gulrajani et al., 2017), WGAN-LP (Petzka et al., 2018), and WGAN-QC (Liu et al., 2019). Second, the optimal transport map is used as the generative model itself, see Figure 1b. Such approaches include LSOT (Seguy et al., 2018), AE-OT (An et al., 2020a), ICNN-OT (Makkuva et al., 2020), and W2GN (Korotin et al., 2021a). Models of the first class have been well studied, but limited attention has been paid to the second class.

Existing approaches of the second class primarily consider OT maps in latent spaces of pre-trained autoencoders (AE), see Figure 3. The performance of such generative models depends on the underlying AEs, whose decoding transformations are often not accurate; this deficiency limits practical applications in high-dimensional ambient spaces. For this reason, using OT in the latent space does not necessarily guarantee superior performance in generative modeling.
(a) OT cost as the loss for the generative model. (b) OT map as the generative model.
Figure 1: Two existing approaches to use optimal transport in generative models.

The focus of our paper is the second class of OT-based models, which use the OT map as the generative map. Finding an optimal mapping is motivated by its ability to preserve specific attributes of the input samples, a desired property in unpaired learning. For example, in unpaired image-to-image translation, the learner has to fit a map between two data distributions which preserves the image content. CycleGAN-based models (Zhu et al., 2017) are widely used for this purpose. However, they typically have complex optimization objectives consisting of several losses (Amodio & Krishnaswamy, 2019; Lu et al., 2019) in order to make the fitted map preserve the required attributes.

The main contributions of this paper are as follows:
1. We propose an end-to-end algorithm (§4.3) to fit OT maps for the quadratic cost (Wasserstein-2 distance) between distributions located in spaces of equal dimensions (§4.1) and extend the method to unequal dimensions as well (§4.2). We prove error bounds for the method (§4.4).
2. We demonstrate large-scale applications of OT maps in popular computer vision tasks. We consider image generation (§5.1) and unpaired image restoration (§5.2) tasks.

Our strict OT-based framework allows the theoretical analysis of the recovered transport map. The OT map obtained by our method can be directly used in large-scale computer vision problems, which is in high contrast to previous related methods relying on autoencoders and OT maps in the latent space. Importantly, the performance and computational complexity of our method are comparable to those of OT-based generative models using the OT cost as the loss.

Notations. In what follows, X and Y are two complete metric spaces, and µ(x) and ν(y) are probability distributions on X and Y, respectively. For a measurable map T : X → Y, T#µ denotes the pushforward distribution of µ, i.e., the distribution for which any measurable set E ⊂ Y satisfies T#µ(E) = µ(T⁻¹(E)). For a vector x, ‖x‖ denotes its Euclidean norm. We use ⟨x, y⟩ to denote the inner product of vectors x and y. We use Π(µ, ν) to denote the set of joint probability distributions on X × Y whose marginals are µ and ν, respectively (couplings). For a function f : R^D → R ∪ {+∞}, its Legendre-Fenchel transform (the convex conjugate) is f̄(y) = sup_{x∈R^D} {⟨x, y⟩ − f(x)}. It is convex, even if f is not.

2 BACKGROUND ON OPTIMAL TRANSPORT

Consider a cost of transportation c : X × Y → R defined over the product space of X and Y.

Figure 2: Monge's OT.

Monge's Formulation. The optimal transport cost between µ and ν for the ground cost c(·,·) is
$$\text{Cost}(\mu, \nu) \stackrel{\text{def}}{=} \inf_{T\#\mu = \nu} \int_{\mathcal{X}} c\big(x, T(x)\big)\, d\mu(x), \tag{1}$$
where the infimum is taken over all measurable maps T : X → Y pushing µ to ν, see Figure 2. The map T* on which the infimum in (1) is attained is called the optimal transport map. Monge's formulation does not allow mass splitting. For example, when µ is a Dirac distribution and ν is a non-Dirac distribution, the feasible set of equation (1) is empty.

Kantorovich's Relaxation. Instead of asking to which particular point y ∈ Y all the probability mass of x should be moved, Kantorovich (1948) asks how the mass of x should be distributed among all y ∈ Y.
Formally, a transport coupling replaces a transport map, and the OT cost is given by
$$\text{Cost}(\mu, \nu) \stackrel{\text{def}}{=} \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y), \tag{2}$$
where the infimum is taken over all couplings π ∈ Π(µ, ν) of µ and ν. The coupling π* attaining the infimum of (2) is called the optimal transport plan. Unlike formulation (1), formulation (2) is well-posed, and under mild assumptions on the spaces X, Y and the ground cost c(·,·), the minimizer π* of (2) always exists (Villani, 2008, Theorem 4.1). In particular, if π* is deterministic, i.e., π* = [id_X, T*]#µ for some T* : X → Y, then T* minimizes (1).

Duality. The dual form of (2) is given by (Kantorovich, 1948):
$$\text{Cost}(\mu, \nu) = \sup_{(u, v)} \left\{\int_{\mathcal{X}} u(x)\, d\mu(x) + \int_{\mathcal{Y}} v(y)\, d\nu(y) \;:\; u(x) + v(y) \le c(x, y)\right\}, \tag{3}$$
with u ∈ L¹(µ), v ∈ L¹(ν) called Kantorovich potentials. For u : X → R and v : Y → R, define their c-transforms by u^c(y) = inf_{x∈X} {c(x, y) − u(x)} and v^c(x) = inf_{y∈Y} {c(x, y) − v(y)}, respectively. Using the c-transform, (3) is reformulated as (Villani, 2008, §5)
$$\text{Cost}(\mu, \nu) = \sup_{v}\left[\int_{\mathcal{X}} v^c(x)\, d\mu(x) + \int_{\mathcal{Y}} v(y)\, d\nu(y)\right] = \sup_{u}\left[\int_{\mathcal{X}} u(x)\, d\mu(x) + \int_{\mathcal{Y}} u^c(y)\, d\nu(y)\right]. \tag{4}$$

Primal-dual relationship. For certain ground costs c(·,·), the primal solution T* of (1) can be recovered from the dual solution u* of (3). For example, if X = Y = R^D, c(x, y) = h(x − y) with strictly convex h : R^D → R, and µ is absolutely continuous and supported on a compact set, then
$$T^*(x) = x - (\nabla h)^{-1}\big(\nabla u^*(x)\big), \tag{5}$$
see (Santambrogio, 2015, Theorem 1.17). For general costs, see (Villani, 2008, Theorem 10.28).
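For intuition, the following is a minimal numerical sketch (our own illustration, not part of the paper) that solves the discrete Kantorovich problem (2) as a linear program with SciPy. It uses the Dirac example above: the source mass sits at a single point, so no Monge map (1) exists, yet the relaxed problem splits the mass.

```python
import numpy as np
from scipy.optimize import linprog

# Source mu: a single Dirac at x = 0; target nu: two equally weighted points.
x = np.array([0.0])            # support of mu
y = np.array([-1.0, 1.0])      # support of nu
a = np.array([1.0])            # weights of mu
b = np.array([0.5, 0.5])       # weights of nu

# Ground cost c(x, y) = 0.5 * |x - y|^2, flattened row-wise for the LP objective.
C = 0.5 * (x[:, None] - y[None, :]) ** 2
c = C.reshape(-1)

# Marginal constraints of (2): rows of pi sum to a, columns of pi sum to b.
n, m = len(a), len(b)
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j pi[i, j] = a[i]
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # sum_i pi[i, j] = b[j]
b_eq = np.concatenate([a, b])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(n, m)
print(pi)       # [[0.5 0.5]] -- the Dirac mass is split, so no Monge map exists
print(res.fun)  # 0.5, the optimal Kantorovich cost (2)
```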
3 OPTIMAL TRANSPORT IN GENERATIVE MODELS

OPTIMAL TRANSPORT COST AS THE LOSS (Figure 1a). Starting with the works of Arjovsky & Bottou (2017); Arjovsky et al. (2017), the usage of the OT cost as the loss has become a major way to apply OT to generative modeling. In this setting, given a data distribution ν and a fake distribution µ_θ, the goal is to minimize Cost(µ_θ, ν) w.r.t. the parameters θ. Typically, µ_θ is the pushforward of some given distribution, e.g., N(0, I), via a generator network G_θ.

The Wasserstein-1 distance (W₁), i.e., the transport cost for the ground cost c(x, y) = ‖x − y‖, is the most practically prevalent example of such a loss. Models based on this loss are known as Wasserstein GANs (WGANs). They estimate W₁(µ_θ, ν) based on the dual form given by (4). For W₁, the optimal potentials u*, v* of (4) satisfy u* = −v*, where u* is a 1-Lipschitz function (Villani, 2008, Case 5.16). As a result, to compute W₁, one needs to optimize the following simplified form:
$$W_1(\mu_\theta, \nu) = \sup_{\|u\|_L \le 1} \left[\int_{\mathcal{X}} u(x)\, d\mu_\theta(x) - \int_{\mathcal{Y}} u(y)\, d\nu(y)\right]. \tag{6}$$
In WGANs, the potential u is called the discriminator. Optimization of (6) reduces the constrained optimization of (4) with two potentials u, v to the optimization of only one discriminator u. In practice, enforcing the Lipschitz constraint on u is challenging. Most methods to do this are regularization-based, e.g., they use the gradient penalty (Gulrajani et al., 2017, WGAN-GP) or the Lipschitz penalty (Petzka et al., 2018, WGAN-LP). Other methods enforce the Lipschitz property by incorporating certain hard restrictions on the discriminator's architecture (Anil et al., 2019; Tanielian & Biau, 2021).

General transport costs (other than W₁) can also be used as the loss for generative models. They are less popular since they do not have a dual form that reduces to a single potential function similar to (6) for W₁. Consequently, the challenging estimation of the c-transform u^c is needed. To avoid this, Sanjabi et al. (2018) consider the dual form (3) with two potentials u, v instead of the form (4) with one potential u and softly enforce the condition u(x) + v(y) ≤ c(x, y) via entropy or quadratic regularization. Nhan Dam et al. (2019) use the dual form (4) and amortized optimization to compute u^c via an additional neural network. Both methods work for general c(·,·), though the authors test them for c(x, y) = ‖x − y‖ only, i.e., the W₁ distance. Mallasto et al. (2019) propose a fast way to approximate the c-transform and test the approach (WGAN-(q, p)) with several costs, in particular the Wasserstein-2 distance (W₂), i.e., the transport cost for the quadratic ground cost c(x, y) = ½‖x − y‖². Specifically for W₂, Liu et al. (2019) approximate the c-transform via a linear program (WGAN-QC).

A fruitful branch of OT-based losses for generative models comes from modified versions of the OT cost, such as Sinkhorn (Genevay et al., 2018), sliced (Deshpande et al., 2018), and minibatch (Fatras et al., 2019) OT distances. They typically have lower sample complexity than the usual OT cost and can be accurately estimated from random mini-batches without using dual forms such as (3). In practice, these approaches usually learn the ground OT cost c(·,·).

The aforementioned methods use the OT cost in the ambient space to train GANs. There also exist approaches using the OT cost in the latent space. For example, Tolstikhin et al. (2017); Patrini et al. (2020) use the OT cost between the encoded data and a given distribution as an additional term to the reconstruction loss when training an AE. As a result, the AE's latent distribution becomes close to the given one.

OPTIMAL TRANSPORT MAP AS THE GENERATIVE MAP (Figure 1b). Methods to compute the OT map (plan) are less common than those computing the cost. Recovering the map from the primal form (1) or (2) usually yields complex optimization objectives containing several adversarial terms (Xie et al., 2019; Liu et al., 2021; Lu et al., 2020). Such procedures require careful hyperparameter choice, which needs to be addressed before using these methods in practice.

The primal-dual relationship (§2) makes it possible to recover the OT map by solving the dual form (3). Dual-form based methods primarily consider the W₂ cost due to its nice theoretical properties and relation to convex functions (Brenier, 1991). In the semi-discrete case (µ is continuous, ν is discrete), An et al. (2020a) and Lei et al. (2019) compute the dual potential and the OT map by using Alexandrov theory and convex geometry. For the continuous case, Seguy et al. (2018) use entropy (quadratic) regularization to recover the dual potentials and extract the OT map from them via the barycentric projection. Taghvaei & Jalali (2019), Makkuva et al. (2020), and Korotin et al. (2021a) employ input-convex neural networks (ICNNs, see Amos et al. (2017)) to parametrize the potentials in the dual problem and recover OT maps by using their gradients.

Figure 3: The existing most prevalent approach to use OT maps in generative models.

The aforementioned dual-form methods compute OT maps in LATENT SPACES for problems such as domain adaptation and latent-space mass transport, see Figure 3. OT maps in high-dimensional ambient spaces, e.g., of natural images, are usually not considered. A recent evaluation of continuous OT methods for W₂ (Korotin et al., 2021b) reveals crucial limitations which negatively affect their scalability, such as the poor expressiveness of ICNN architectures or bias due to regularization.
4 END-TO-END SOLUTION TO LEARN OPTIMAL MAPS

4.1 EQUAL DIMENSIONS OF INPUT AND OUTPUT DISTRIBUTIONS

In this section, we use X = Y = R^D and consider the Wasserstein-2 distance (W₂), i.e., optimal transport for the quadratic ground cost c(x, y) = ½‖x − y‖². We use the dual form (4) to derive a saddle-point problem whose solution yields the OT map T*. We consider distributions µ, ν with finite second moments. We assume that for the distributions µ, ν in view there exists a unique OT plan π* minimizing (2) and that it is deterministic, i.e., π* = [id_{R^D}, T*]#µ. Here T* is an OT map which minimizes (1). Previous related works (Makkuva et al., 2020; Korotin et al., 2021a) assumed the absolute continuity of µ, which implied the existence and uniqueness of T* (Brenier, 1991).

Let ψ(y) := ½‖y‖² − v(y), where v is the potential of (4). Note that
$$v^c(x) = \inf_{y\in\mathbb{R}^D}\Big\{\tfrac{1}{2}\|x-y\|^2 - v(y)\Big\} = \tfrac{1}{2}\|x\|^2 - \sup_{y\in\mathbb{R}^D}\{\langle x, y\rangle - \psi(y)\} = \tfrac{1}{2}\|x\|^2 - \overline{\psi}(x). \tag{7}$$
Therefore, (4) is equivalent to
$$\begin{aligned}
\mathbb{W}_2^2(\mu,\nu) &= \int_{\mathcal{X}} \tfrac{1}{2}\|x\|^2 d\mu(x) + \int_{\mathcal{Y}} \tfrac{1}{2}\|y\|^2 d\nu(y) + \sup_\psi \Big[-\int_{\mathcal{X}} \overline{\psi}(x)\, d\mu(x) - \int_{\mathcal{Y}} \psi(y)\, d\nu(y)\Big] && (8)\\
&= \text{Constant}(\mu,\nu) - \inf_\psi \Big[\int_{\mathcal{X}} \overline{\psi}(x)\, d\mu(x) + \int_{\mathcal{Y}} \psi(y)\, d\nu(y)\Big] && (9)\\
&= \text{Constant}(\mu,\nu) - \inf_\psi \Big[\int_{\mathcal{X}} \sup_{y\in\mathbb{R}^D}\{\langle x, y\rangle - \psi(y)\}\, d\mu(x) + \int_{\mathcal{Y}} \psi(y)\, d\nu(y)\Big] && (10)\\
&= \text{Constant}(\mu,\nu) - \inf_\psi \sup_T \Big[\int_{\mathcal{X}} \big[\langle x, T(x)\rangle - \psi\big(T(x)\big)\big]\, d\mu(x) + \int_{\mathcal{Y}} \psi(y)\, d\nu(y)\Big], && (11)
\end{aligned}$$
where between lines (10) and (11) we replace the optimization over points y ∈ R^D with an equivalent optimization over functions T : R^D → R^D. The equivalence follows from the interchange between the integral and the supremum (Rockafellar, 1976, Theorem 3A). We also provide an independent proof of the equivalence, specializing Rockafellar's interchange theorem, in Appendix A.1.

Thanks to the following lemma, we may solve the saddle-point problem (11) and obtain the OT map T* from its solution (ψ*, T*).

Lemma 4.1. Let T* be the OT map from µ to ν. Then, for every optimal potential ψ*,
$$T^* \in \operatorname*{arg\,sup}_T \int_{\mathcal{X}} \big[\langle x, T(x)\rangle - \psi^*\big(T(x)\big)\big]\, d\mu(x). \tag{12}$$
We prove Lemma 4.1 in Appendix A.2. For general µ, ν the arg sup_T set for an optimal ψ* might contain not only the OT map T* but other functions as well. Working with real-world data in our experiments (§5.2), we observe that despite this issue, optimization (11) still recovers T*.

Relation to previous works. The use of a function T to approximate the c-transform was proposed by Nhan Dam et al. (2019) to estimate the Wasserstein loss in WGANs. For W₂, the fact that T* is an OT map was used by Makkuva et al. (2020); Korotin et al. (2021a), who primarily assumed continuous µ, ν and reduced (11) to convex ψ and T = ∇ϕ for convex ϕ. Issues with the non-uniqueness of the solution of (12) were softened, but using ICNNs to parametrize ψ became necessary. Korotin et al. (2021b) demonstrated that ICNNs negatively affect the practical performance of OT and tested an unconstrained formulation similar to (11). As per the evaluation, it provided the best empirical performance (Korotin et al., 2021b, §4.5). The method MM:R they consider parametrizes ½‖·‖² − ψ(·) by a neural network, while we directly parametrize ψ(·) by a neural network (§4.3). Recent work by Fan et al. (2021) exploits a formulation similar to (11) for general costs c(·,·). While their formulation leads to a max-min scheme for general costs (Fan et al., 2021, Theorem 3), our approach gives rise to a min-max method for the quadratic cost. In particular, we extend the formulation to learn OT maps between distributions in spaces of unequal dimensions, see the next subsection.
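To make the objective concrete, below is a minimal PyTorch sketch (ours, not the authors' released code) that evaluates the bracketed expression of (11) on mini-batches, with ψ and T parametrized by small MLPs; network sizes and the sample batches are placeholders.

```python
import torch
import torch.nn as nn

D = 2  # ambient dimension (toy value)

def mlp(d_in, d_out, width=128):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, d_out))

psi = mlp(D, 1)  # potential psi: R^D -> R
T = mlp(D, D)    # candidate map T: R^D -> R^D

def objective(x, y):
    """Mini-batch estimate of the bracketed term of (11):
    E_mu[<x, T(x)> - psi(T(x))] + E_nu[psi(y)]."""
    Tx = T(x)
    inner = (x * Tx).sum(dim=1)          # <x, T(x)> per sample
    return inner.mean() - psi(Tx).mean() + psi(y).mean()

# The saddle point of (11) is inf over psi of sup over T of this value:
# in practice one ascends it in T's parameters and descends it in psi's.
x_batch = torch.randn(64, D)   # placeholder batch from mu
y_batch = torch.randn(64, D)   # placeholder batch from nu
print(objective(x_batch, y_batch))
```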
4.2 UNEQUAL DIMENSIONS OF INPUT AND OUTPUT DISTRIBUTIONS

Consider the case when X = R^H and Y = R^D have different dimensions, i.e., H ≠ D. In order to map the probability distribution µ to ν, a straightforward solution is to embed X into Y via some Q : X → Y and then to fit the OT map between Q#µ and ν for the quadratic cost on Y = R^D. In this case, the optimization objective becomes
$$\inf_\psi \sup_T \Big[\int_{\mathcal{X}} \big[\langle Q(x), T(Q(x))\rangle - \psi\big(T(Q(x))\big)\big]\, d\mu(x) + \int_{\mathcal{Y}} \psi(y)\, d\nu(y)\Big], \tag{13}$$
with the optimal T* recovering the OT map from Q#µ to ν. For equal dimensions H = D and the identity embedding Q(x) ≡ x, expression (13) reduces to optimization (11) up to a constant.

Instead of optimizing (13) over functions T : Q(X) → Y, we propose to consider optimization directly over generative mappings G : X → Y:
$$\inf_\psi \sup_G \mathcal{L}(\psi, G), \qquad \mathcal{L}(\psi, G) \stackrel{\text{def}}{=} \int_{\mathcal{X}} \big[\langle Q(x), G(x)\rangle - \psi\big(G(x)\big)\big]\, d\mu(x) + \int_{\mathcal{Y}} \psi(y)\, d\nu(y). \tag{14}$$
Our following lemma establishes the connection between (14) and OT with unequal dimensions:

Lemma 4.2. Assume that there exists a unique OT plan between Q#µ and ν and that it is deterministic, i.e., [id_{R^D}, T*]#(Q#µ). Then G*(x) = T*(Q(x)) is the OT map between µ and ν for the Q-embedded quadratic cost c(x, y) = ½‖Q(x) − y‖². Moreover, for every optimal potential ψ* of problem (14),
$$G^* \in \operatorname*{arg\,sup}_G \int_{\mathcal{X}} \big[\langle Q(x), G(x)\rangle - \psi^*\big(G(x)\big)\big]\, d\mu(x). \tag{15}$$

Figure 4: The scheme of our approach for learning OT maps between unequal dimensions. In the figure, the setup of §5.1 is shown: µ is a noise, Q is the bicubic upscaling, and ν is a distribution of images.

We prove Lemma 4.2 in Appendix A.3 and schematically present its idea in Figure 4. Analogously to Lemma 4.1, it provides a way to compute the OT map G* for the Q-embedded quadratic cost between distributions µ and ν by solving the saddle-point problem (14). Note that the situation with the non-uniqueness of arg sup_G is similar to §4.1.

Relation to previous works. In practice, learning OT maps directly between spaces of unequal dimensions was considered by (Fan et al., 2021, §5.2), but only on toy examples. We demonstrate that our method works well in large-scale generative modeling tasks (§5.1). Theoretical properties of OT maps for embedded costs are studied, e.g., in (Pass, 2010; McCann & Pass, 2020).

4.3 PRACTICAL ASPECTS AND OPTIMIZATION PROCEDURE

To optimize functional (14), we approximate G : R^H → R^D and ψ : R^D → R with neural networks G_θ, ψ_ω and optimize their parameters via stochastic gradient descent-ascent (SGDA) using mini-batches from µ, ν. The practical optimization procedure is given in Algorithm 1 below. Following the usual practice in GANs, we add a small penalty (§B.3) on the potential ψ_ω for better stability. The penalty is not included in Algorithm 1 to keep it simple.

Relation to previous works. WGAN by Arjovsky & Bottou (2017) uses W₁ as the loss to update the generator, while we solve a different task: we fit the generator G to be the OT map for the Q-embedded quadratic cost. Despite this, our Algorithm 1 has similarities with WGAN's training. The update of ψ (line 4) coincides with the discriminator's update in WGAN. The update of the generator G (line 8) differs from WGAN's update by the term ⟨Q(·), G_θ(·)⟩. Besides, in WGAN the optimization is inf_G sup_D, while we have inf_ψ sup_G, i.e., the generator in our case is the solution of the inner problem.
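A minimal PyTorch sketch (ours) of the alternating SGDA scheme described above (cf. Algorithm 1 in the next subsection). The networks G, ψ, the embedding Q and the batch samplers are assumed to be given; the hyperparameter values are placeholders rather than the paper's exact settings (those are listed in Appendix B.1).

```python
import torch

def train_otm(G, psi, Q, sample_mu, sample_nu,
              n_outer=10000, K_psi=1, K_G=5, lr=1e-4, device="cpu"):
    """Alternating updates of the potential psi (outer inf) and the
    generator G (inner sup) for the saddle-point problem (14)."""
    opt_psi = torch.optim.Adam(psi.parameters(), lr=lr, betas=(0.0, 0.9))
    opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.0, 0.9))
    for _ in range(n_outer):
        # Potential update: minimize E_nu[psi(y)] - E_mu[psi(G(x))].
        for _ in range(K_psi):
            x, y = sample_mu().to(device), sample_nu().to(device)
            with torch.no_grad():
                fake = G(x)
            loss_psi = psi(y).mean() - psi(fake).mean()
            opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # Generator update: minimize E_mu[psi(G(x)) - <Q(x), G(x)>].
        for _ in range(K_G):
            x = sample_mu().to(device)
            fake = G(x)
            inner = (Q(x) * fake).flatten(1).sum(dim=1)   # <Q(x), G(x)>
            loss_G = psi(fake).mean() - inner.mean()
            opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return G
```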
4.4 ERROR ANALYSIS

Given a pair (ψ̂, Ĝ) approximately solving (14), a natural question to ask is how good the recovered OT map Ĝ is. In this subsection, we provide a bound on the difference between G* and Ĝ based on the duality gaps for solving the outer and inner optimization problems.

In (8) and, as a result, in (10), (11), (13), (14), it is enough to consider optimization over convex functions ψ, see (Villani, 2008, Case 5.17). Our theorem below assumes the convexity of ψ̂, although this might not hold in practice, since ψ̂ is a neural network.

Algorithm 1: Learning the optimal transport map between unequal dimensions.
Input: input distribution µ on X = R^H; output distribution ν on Y = R^D; generator network G_θ : R^H → R^D; potential network ψ_ω : R^D → R; number of iterations per network K_G, K_ψ; embedding Q : X → Y.
Output: trained generator G_θ representing the OT map from µ to ν.
1: repeat
2:   for k_ψ = 1 to K_ψ do
3:     Draw batches X ∼ µ and Y ∼ ν;
4:     L_ψ ← (1/|Y|) Σ_{y∈Y} ψ_ω(y) − (1/|X|) Σ_{x∈X} ψ_ω(G_θ(x));
5:     Update ω by using ∂L_ψ/∂ω to minimize L_ψ;
6:   for k_G = 1 to K_G do
7:     Draw batch X ∼ µ;
8:     L_G ← (1/|X|) Σ_{x∈X} [ψ_ω(G_θ(x)) − ⟨Q(x), G_θ(x)⟩];
9:     Update θ by using ∂L_G/∂θ to minimize L_G;
10: until not converged;
11: return G_θ

Theorem 4.3. Assume that there exists a unique deterministic OT plan for the Q-embedded quadratic cost between µ and ν, i.e., π* = [id_{R^H}, G*]#µ for G* : R^H → R^D. Assume that ψ̂ is β-strongly convex (β > 0) and let Ĝ : R^H → R^D. Define
$$\epsilon_1 = \sup_G \mathcal{L}(\hat\psi, G) - \mathcal{L}(\hat\psi, \hat G) \qquad\text{and}\qquad \epsilon_2 = \sup_G \mathcal{L}(\hat\psi, G) - \inf_\psi\sup_G \mathcal{L}(\psi, G).$$
Then the following bound holds true for the OT map G* from µ to ν:
$$\frac{\mathrm{FID}(\hat G_\#\mu, \nu)}{L^2} \;\le\; 2\,\mathbb{W}_2^2(\hat G_\#\mu, \nu) \;\le\; \int_{\mathcal{X}} \|\hat G(x) - G^*(x)\|^2\, d\mu(x) \;\le\; \frac{2}{\beta}\big(\sqrt{\epsilon_1} + \sqrt{\epsilon_2}\big)^2, \tag{16}$$
where FID is the Fréchet inception distance (Heusel et al., 2017) and L is the Lipschitz constant of the feature extractor of the pre-trained InceptionV3 neural network (Szegedy et al., 2016).

We prove Theorem 4.3 in Appendix A.4. The duality gaps upper bound the L²(µ) norm between the computed map Ĝ and the true map G*, and the W₂² distance between the true distribution ν and the generated (fake) distribution Ĝ#µ. Consequently, they upper bound the FID between the data ν and the fake (generated) distribution Ĝ#µ.

Relation to previous works. Makkuva et al. (2020); Korotin et al. (2021a) prove related bounds for W₂ with µ, ν located in spaces of the same dimension. Our result holds for different dimensions.

5 EXPERIMENTS

We evaluate our algorithm on generative modeling of a data distribution from noise (§5.1) and on unpaired image restoration tasks (§5.2). Technical details are given in Appendix B. Additionally, we test our method on toy 2D datasets in Appendix B.4 and evaluate it on the Wasserstein-2 benchmark (Korotin et al., 2021b) in Appendix B.2. The code is in the supplementary material.

5.1 MODELING DATA DISTRIBUTION FROM NOISE DISTRIBUTION

In this subsection, µ is a 192-dimensional normal noise and ν is the high-dimensional data distribution. Let the images from ν be of size w×h with c channels. As the embedding Q : X → Y we use a naive upscaling of the noise: for x ∈ R^192 we represent it as a 3-channel 8×8 image and bicubically upscale it to the size w×h of the data images from ν. For grayscale images drawn from ν, we stack c copies over the channel dimension.

We test our method on MNIST 32×32 (LeCun et al., 1998), CIFAR10 32×32 (Krizhevsky et al., 2009), and CelebA 64×64 (Liu et al., 2015) image datasets. In Figure 5, we show random samples generated by our approach, namely Optimal Transport Modeling (OTM).

(a) MNIST, 32×32, grayscale (b) CIFAR10, 32×32, RGB (c) CelebA, 64×64, RGB
Figure 5: Randomly generated MNIST, CIFAR10, and CelebA samples by our method (OTM).
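A sketch (ours) of the naive upscaling embedding Q described above: the 192-dimensional noise is reshaped into a 3×8×8 image and bicubically upscaled to the data resolution; the exact handling of grayscale targets is our assumption.

```python
import torch
import torch.nn.functional as F

def Q_upscale(x, h, w):
    """x: (B, 192) normal noise -> (B, 3, h, w) via naive bicubic upscaling."""
    img = x.view(-1, 3, 8, 8)
    return F.interpolate(img, size=(h, w), mode="bicubic", align_corners=False)

z = torch.randn(16, 192)
print(Q_upscale(z, 32, 32).shape)   # torch.Size([16, 3, 32, 32])
# For grayscale data (e.g. MNIST) the paper stacks copies over the channel
# dimension; one plausible (assumed) variant is to average to one channel:
# q = Q_upscale(z, 32, 32).mean(dim=1, keepdim=True)
```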
To quantify the results, in Tables 1 and 2 we give the Inception (Salimans et al., 2016) and FID (Heusel et al., 2017) scores of generated samples. Similarly to (Song & Ermon, 2019, Appendix B.2), we compute them on 50K real and generated samples. Additionally, in Appendix B.4, we test our method on 128×128 CelebA faces. We provide qualitative results (images of generated faces) in Figure 11.

For comparison, we include the scores of existing generative models of three types: (1) OT map as the generative model; (2) OT cost as the loss; (3) not OT-based. Note that models of the first type compute OT in the latent space of an autoencoder, in contrast to our approach. According to our evaluation, the performance of our method is better than or comparable to that of existing alternatives.

Table 1: Results on the CIFAR10 dataset.
  Model       Related Work              Inception    FID
  NVAE        Vahdat & Kautz (2020)     -            51.71
  PixelIQN    Ostrovski et al. (2018)   5.29         49.46
  EBM         Du & Mordatch (2019)      6.02         40.58
  DCGAN       Radford et al. (2016)     6.64 ± 0.14  37.70
  NCSN        Song & Ermon (2019)       8.87 ± 0.12  25.32
  NCP-VAE     Aneja et al. (2021)       -            24.08
  WGAN        Arjovsky et al. (2017)    -            55.2
  WGAN-GP     Gulrajani et al. (2017)   6.49 ± 0.09  39.40
  3P-WGAN     Nhan Dam et al. (2019)    7.38 ± 0.08  28.8
  AE-OT       An et al. (2020a)         -            28.5
  AE-OT-GAN   An et al. (2020b)         -            17.1
  OTM         Ours                      7.42 ± 0.06  21.78

Table 2: Results on the CelebA dataset.
  Model       Related Work              FID
  DCGAN       Radford et al. (2016)     52.0
  DRAGAN      Kodali et al. (2017)      42.3
  BEGAN       Berthelot et al. (2017)   38.9
  NVAE        Vahdat & Kautz (2020)     13.4
  NCP-VAE     Aneja et al. (2021)       5.2
  WGAN        Arjovsky et al. (2017)    41.3
  WGAN-GP     Gulrajani et al. (2017)   30.0
  WGAN-QC     Liu et al. (2019)         12.9
  AE-OT       An et al. (2020a)         28.6
  AE-OT-GAN   An et al. (2020b)         7.8
  OTM         Ours                      6.5

5.2 UNPAIRED IMAGE RESTORATION

In this subsection, we consider unpaired image restoration tasks on the CelebA faces dataset. In this case, the input distribution µ consists of degraded images, while ν consists of clean images. In all cases, the embedding Q is the identity embedding Q(x) ≡ x. In image restoration, optimality of the restoration map is desired, since the output (restored) image is expected to be close to the input (degraded) one, minimizing the transport cost. Note that GANs do not seek an optimal mapping. However, in practice, due to implicit inductive biases such as convolutional architectures, GANs still tend to fit low-transport-cost maps (Bézenac et al., 2021).

The experimental setup is shown in Figure 6. We split the dataset into 3 parts A, B, C containing 90K, 90K, and 22K samples respectively. To each image we apply a degradation transform (decolorization, noising, or occlusion) and obtain the degraded dataset consisting of 3 respective parts A, B, C. For unpaired training we use part A of the degraded images and part B of the clean images. For testing, we use parts C.

Figure 6: The training/testing scheme that we use for unpaired restoration tasks.

To quantify the results, we compute FID of the restored images w.r.t. the clean images of part C. The scores for denoising, inpainting, and colorization are given in Table 3; details of each experiment and qualitative results are given below. As a baseline, we include WGAN-GP. For a fair comparison, we fit it using exactly the same hyperparameters as in our method OTM-GP. This is possible due to the similarities between our method's and WGAN-GP's training procedures, see the discussion in §4.3. In OTM, there is no GP (§B.3).

Table 3: FID on the test part C in image restoration experiments.
  Model          Denoising   Colorization   Inpainting
  Input          166.59      32.12          47.65
  WGAN-GP        25.49       7.75           16.51
  OTM-GP (ours)  10.95       5.66           9.96
  OTM (ours)     5.92        5.65           8.13
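For reference, a minimal sketch (ours) of the three degradations used in the experiments below: additive white Gaussian noise with σ = 0.3, channel-averaged decolorization, and zeroing out the right half of the image. The image tensors and their value range are assumptions on our part.

```python
import torch

def degrade_noise(img, sigma=0.3):
    """Additive white Gaussian noise, sigma = 0.3 as in the denoising task."""
    return img + sigma * torch.randn_like(img)

def degrade_grayscale(img):
    """Decolorization: average the RGB values of each pixel."""
    gray = img.mean(dim=1, keepdim=True)
    return gray.repeat(1, img.shape[1], 1, 1)   # keep a 3-channel layout

def degrade_occlude(img):
    """Occlusion: replace the right half of the image with zeros."""
    out = img.clone()
    out[..., out.shape[-1] // 2:] = 0.0
    return out
```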
Denoising. To create noisy images, we add white normal noise with σ = 0.3 to each pixel. Figure 7 illustrates image denoising using our OTM approach on the test part of the dataset. We show additional qualitative results for varying σ in Figure 15 of Appendix B.4.

(a) Noisy samples. (b) Pushforward samples. (c) Original samples.
Figure 7: OTM for image denoising on the test C part of CelebA, 64×64.

Colorization. To create grayscale images, we average the RGB values of each pixel. Figure 8 illustrates image colorization using OTM on the test part of the dataset.

(a) Grayscale samples. (b) Pushforward samples. (c) Original samples.
Figure 8: OTM for image colorization on the test C part of CelebA, 64×64.

Inpainting. To create incomplete images, we replace the right half of each clean image with zeros. Figure 9 illustrates image inpainting using OTM on the test part of the dataset.

(a) Occluded samples. (b) Pushforward samples. (c) Original samples.
Figure 9: OTM for image inpainting on the test C part of CelebA, 64×64.

6 CONCLUSION

Our method fits OT maps for the embedded quadratic transport cost between probability distributions. Unlike its predecessors, it scales well to high dimensions, enabling applications of OT maps directly in ambient spaces, such as spaces of images. The performance is comparable to other existing generative models, while the complexity of training is similar to that of popular WGANs.

Limitations. For distributions µ, ν we assume the existence of an OT map between them. In practice, this might not hold for all real-world µ, ν. Working with equal dimensions, we focus on the quadratic ground cost ½‖x − y‖². Nevertheless, our approach extends to other costs c(·,·), see Fan et al. (2021). When the dimensions are unequal, we restrict our analysis to the embedded quadratic cost ½‖Q(x) − y‖², where Q equalizes the dimensions. Choosing the embedding Q might not be straightforward in some practical problems, but our evaluation (§5.1) shows that even naive choices of Q work well.

Potential impact and ethics. Real-world image restoration problems often do not have paired datasets, which limits the application of supervised techniques. In these practical unpaired learning problems, we expect our optimal transport approach to improve the performance of existing models. However, biases in the data might lead to biases in the pushforward samples. This should be taken into account when using our method in practical problems.

Reproducibility. The PyTorch source code is provided at https://github.com/LituRout/OptimalTransportModeling. The instructions to use the code are included in the README.md file.

7 ACKNOWLEDGMENT

This research was supported by the computational resources provided by the Space Applications Centre (SAC), ISRO. The first author acknowledges the funding by HRD Grant No. 0303T50FM703/SAC/ISRO. The Skoltech RAIC center was supported by the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).

REFERENCES

Matthew Amodio and Smita Krishnaswamy. TraVeLGAN: Image-to-image translation by transformation vector learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8983-8992, 2019.

Brandon Amos, Lei Xu, and J. Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, pp. 146-155. PMLR, 2017.

Dongsheng An, Yang Guo, Na Lei, Zhongxuan Luo, Shing-Tung Yau, and Xianfeng Gu. AE-OT: A new generative model based on extended semi-discrete optimal transport.
In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=HkldyTNYwH.

Dongsheng An, Yang Guo, Min Zhang, Xin Qi, Na Lei, and Xianfang Gu. AE-OT-GAN: Training GANs from data specific latent distribution. In European Conference on Computer Vision, pp. 548-564. Springer, 2020b.

Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with noise contrastive priors. In Advances in Neural Information Processing Systems Conference, 2021.

Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pp. 291-301. PMLR, 2019.

Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.

David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

Emmanuel de Bézenac, Ibrahim Ayed, and Patrick Gallinari. CycleGAN through the lens of (dynamical) optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 132-147. Springer, 2021.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375-417, 1991.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Biwei Dai and Uros Seljak. Sliced iterative normalizing flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3483-3491, 2018.

DC Dowson and BV Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450-455, 1982.

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Jiaojiao Fan, Shu Liu, Shaojun Ma, Yongxin Chen, and Haomin Zhou. Scalable computation of Monge maps with general costs. arXiv preprint arXiv:2106.03812, 2021.

Kilian Fatras, Younes Zine, Rémi Flamary, Rémi Gribonval, and Nicolas Courty. Learning with minibatch Wasserstein: asymptotic and gradient properties. arXiv preprint arXiv:1910.04091, 2019.

Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608-1617. PMLR, 2018.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems Conference, pp. 2674-2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5769-5779, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Leygonie Jacob, Jennifer She, Amjad Almahairi, Sai Rajeswar, and Aaron Courville. W2GAN: Recovering an optimal transport map with a GAN. 2018.

Leonid Vitalevich Kantorovich. On a problem of Monge. Uspekhi Mat. Nauk, pp. 225-226, 1948.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Cheolhyeong Kim, Seungtae Park, and Hyung Ju Hwang. Local stability of Wasserstein GANs with abstract gradient penalty. IEEE Transactions on Neural Networks and Learning Systems, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=bEoxzW_EXsa.

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? A continuous Wasserstein-2 benchmark. arXiv preprint arXiv:2106.01954, 2021b.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng David Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1-21, 2019.

Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein GAN with quadratic transport cost. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Shu Liu, Shaojun Ma, Yongxin Chen, Hongyuan Zha, and Haomin Zhou. Learning high dimensional Wasserstein geodesics. arXiv preprint arXiv:2102.02992, 2021.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Guansong Lu, Zhiming Zhou, Yuxuan Song, Kan Ren, and Yong Yu. Guiding the one-to-one mapping in CycleGAN via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4432-4439, 2019.

Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, and Yong Yu. Large-scale optimal transport via adversarial training with cycle-consistency. arXiv preprint arXiv:2003.06635, 2020.

Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672-6681. PMLR, 2020.

Anton Mallasto, Jes Frellsen, Wouter Boomsma, and Aasa Feragen. (q, p)-Wasserstein GANs: Comparing ground metrics for Wasserstein GANs. arXiv preprint arXiv:1902.03642, 2019.
Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794-2802, 2017.

Robert J. McCann and Brendan Pass. Optimal transportation between unequal dimensions. Archive for Rational Mechanics and Analysis, 238(3):1475-1520, 2020.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Quan Hoang Nhan Dam, Trung Le, Tu Dinh Nguyen, Hung Bui, and Dinh Phung. Three-player Wasserstein GAN via amortised duality. In Proc. of the 28th Int. Joint Conf. on Artificial Intelligence (IJCAI), 2019.

Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. In International Conference on Machine Learning, pp. 3936-3945. PMLR, 2018.

Brendan Pass. Regularity of optimal transportation between spaces with different dimensions. arXiv preprint arXiv:1008.1544, 2010.

Giorgio Patrini, Rianne van den Berg, Patrick Forre, Marcello Carioni, Samarth Bhargav, Max Welling, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. In Uncertainty in Artificial Intelligence, pp. 733-743. PMLR, 2020.

Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of Wasserstein GANs. In International Conference on Learning Representations, 2018.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06434.

R. Tyrrell Rockafellar. Integral functionals, normal integrands and measurable selections. In Nonlinear Operators and the Calculus of Variations, pp. 157-207. Springer, 1976.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29:2234-2242, 2016.

Maziar Sanjabi, Meisam Razaviyayn, Jimmy Ba, and Jason D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. Advances in Neural Information Processing Systems, 2018:7091-7101, 2018.

Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.

Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Amirhossein Taghvaei and Amin Jalali. 2-Wasserstein approximation via restricted convex potentials with application to improved training for GANs. arXiv preprint arXiv:1902.07197, 2019.

Ugo Tanielian and Gerard Biau. Approximating Lipschitz continuous functions with GroupSort neural networks.
In International Conference on Artificial Intelligence and Statistics, pp. 442-450. PMLR, 2021.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems Conference, 2020.

Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, pp. 6882-6892. PMLR, 2019.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

A.1 PROOF OF EQUIVALENCE OF (10) AND (11)

Proof. Pick any T : X → Y. For every point x ∈ X, by the definition of the supremum we have
$$\langle x, T(x)\rangle - \psi\big(T(x)\big) \le \sup_{y\in\mathcal{Y}}\{\langle x, y\rangle - \psi(y)\}.$$
Integrating this expression w.r.t. x ∼ µ yields
$$\int_{\mathcal{X}} \big[\langle x, T(x)\rangle - \psi\big(T(x)\big)\big]\, d\mu(x) \le \int_{\mathcal{X}} \sup_{y\in\mathcal{Y}}\{\langle x, y\rangle - \psi(y)\}\, d\mu(x) = L_1.$$
Since the inequality holds for all T : X → Y, we conclude that
$$L_2 = \sup_{T:\mathcal{X}\to\mathcal{Y}} \int_{\mathcal{X}} \big[\langle x, T(x)\rangle - \psi\big(T(x)\big)\big]\, d\mu(x) \le \int_{\mathcal{X}} \sup_{y\in\mathcal{Y}}\{\langle x, y\rangle - \psi(y)\}\, d\mu(x) = L_1,$$
i.e., L₂ ≤ L₁. Now let us prove that the supremum on the left side actually equals L₁. To do this, we need to show that for every ϵ > 0 there exists T^ϵ : X → Y satisfying
$$\int_{\mathcal{X}} \big[\langle x, T^\epsilon(x)\rangle - \psi\big(T^\epsilon(x)\big)\big]\, d\mu(x) \ge L_1 - \epsilon.$$
First note that for every x ∈ X, by the definition of the supremum there exists y^ϵ = y^ϵ(x) which provides
$$\langle x, y^\epsilon(x)\rangle - \psi\big(y^\epsilon(x)\big) \ge \sup_{y\in\mathcal{Y}}\{\langle x, y\rangle - \psi(y)\} - \epsilon.$$
We take T^ϵ(x) = y^ϵ(x) for all x ∈ X and integrate the previous inequality w.r.t. x ∼ µ. We obtain
$$\int_{\mathcal{X}} \big[\langle x, T^\epsilon(x)\rangle - \psi\big(T^\epsilon(x)\big)\big]\, d\mu(x) \ge \int_{\mathcal{X}} \sup_{y\in\mathcal{Y}}\{\langle x, y\rangle - \psi(y)\}\, d\mu(x) - \epsilon = L_1 - \epsilon,$$
which is the desired inequality.

A.2 PROOF OF LEMMA 4.1

Proof. It is enough to prove that ψ̄*(x) = ⟨T*(x), x⟩ − ψ*(T*(x)) holds µ-almost everywhere, i.e., that T*(x) ∈ arg sup_{y∈R^D} {⟨x, y⟩ − ψ*(y)}. Since ν = T*#µ, we use (9) with ψ ≡ ψ* to derive
$$\begin{aligned}
\mathbb{W}_2^2(\mu,\nu) - \int \tfrac{1}{2}\|x\|^2 d\mu(x) - \int \tfrac{1}{2}\|y\|^2 d\nu(y)
&= -\int_{\mathcal{X}} \overline{\psi^*}(x)\, d\mu(x) - \int_{\mathcal{Y}} \psi^*(y)\, d\nu(y)\\
&= -\int_{\mathcal{X}} \overline{\psi^*}(x)\, d\mu(x) - \int_{\mathcal{X}} \psi^*\big(T^*(x)\big)\, d\mu(x)\\
&= -\int_{\mathcal{X}} \underbrace{\big[\overline{\psi^*}(x) + \psi^*\big(T^*(x)\big)\big]}_{\ge\, \langle T^*(x),\, x\rangle}\, d\mu(x)
\;\le\; -\int_{\mathcal{X}} \langle T^*(x), x\rangle\, d\mu(x) && (17)\\
&= \int \tfrac{1}{2}\|x - T^*(x)\|^2 d\mu(x) - \int \tfrac{1}{2}\|x\|^2 d\mu(x) - \int \tfrac{1}{2}\|T^*(x)\|^2 d\mu(x)\\
&= \mathbb{W}_2^2(\mu,\nu) - \int \tfrac{1}{2}\|x\|^2 d\mu(x) - \int \tfrac{1}{2}\|y\|^2 d\nu(y).
\end{aligned}$$
As a result, inequality (17) becomes an equality; in particular, ψ̄*(x) + ψ*(T*(x)) = ⟨T*(x), x⟩ holds µ-almost everywhere.

A.3 PROOF OF LEMMA 4.2

Proof. Let QW₂² denote the Q-embedded quadratic cost. We use the change of variables formula to derive
$$\mathrm{QW}_2^2(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \int \tfrac{1}{2}\|Q(x) - y\|^2 d\pi(x,y) = \inf_{\pi'\in\Pi(Q_\#\mu,\nu)} \int \tfrac{1}{2}\|x' - y\|^2 d\pi'(x',y) = \mathbb{W}_2^2(Q_\#\mu, \nu), \tag{18}$$
i.e., computing the OT plan for QW₂²(µ, ν) boils down to computing the OT plan for W₂²(Q#µ, ν). It follows that [id_{R^H}, T*∘Q]#µ = [id_{R^H}, G*]#µ is an OT plan for QW₂²(µ, ν), and G* is the OT map. Inclusion (15) now follows from Lemma 4.1.

A.4 PROOF OF THEOREM 4.3

Proof. Pick any G° ∈ arg sup_G L(ψ̂, G) = arg sup_G ∫_X [⟨Q(x), G(x)⟩ − ψ̂(G(x))] dµ(x) or, equivalently, for all x ∈ R^H, G°(x) ∈ arg sup_y {⟨Q(x), y⟩ − ψ̂(y)}. Consequently, for all y ∈ R^D,
$$\langle Q(x), G^\circ(x)\rangle - \hat\psi\big(G^\circ(x)\big) \ge \langle Q(x), y\rangle - \hat\psi(y),$$
which after regrouping the terms yields ψ̂(y) ≥ ψ̂(G°(x)) + ⟨Q(x), y − G°(x)⟩. This means that Q(x) is contained in the subgradient of the convex function ψ̂ at G°(x).
Since ψ̂ is β-strongly convex, for the points Ĝ(x), G°(x) ∈ R^D and Q(x) ∈ ∂ψ̂(G°(x)) we derive
$$\hat\psi\big(\hat G(x)\big) \ge \hat\psi\big(G^\circ(x)\big) + \big\langle Q(x), \hat G(x) - G^\circ(x)\big\rangle + \frac{\beta}{2}\big\|G^\circ(x) - \hat G(x)\big\|^2.$$
Regrouping the terms, this gives
$$\big\langle Q(x), G^\circ(x)\big\rangle - \hat\psi\big(G^\circ(x)\big) \ge \big\langle Q(x), \hat G(x)\big\rangle - \hat\psi\big(\hat G(x)\big) + \frac{\beta}{2}\big\|G^\circ(x) - \hat G(x)\big\|^2.$$
Integrating w.r.t. x ∼ µ yields
$$\epsilon_1 = \mathcal{L}(\hat\psi, G^\circ) - \mathcal{L}(\hat\psi, \hat G) \ge \beta \int_{\mathcal{X}} \tfrac{1}{2}\|G^\circ(x) - \hat G(x)\|^2 d\mu(x) = \frac{\beta}{2}\,\|G^\circ - \hat G\|^2_{L^2(\mu)}. \tag{19}$$
Let G* be the OT map from µ to ν. We use G*#µ = ν to derive
$$\begin{aligned}
\mathcal{L}(\hat\psi, G^\circ) &= \int_{\mathcal{X}} \big[\langle Q(x), G^\circ(x)\rangle - \hat\psi\big(G^\circ(x)\big)\big]\, d\mu(x) + \int_{\mathcal{Y}} \hat\psi(y)\, d\nu(y)\\
&= \int_{\mathcal{X}} \underbrace{\big[\langle Q(x), G^\circ(x)\rangle - \hat\psi\big(G^\circ(x)\big) + \hat\psi\big(G^*(x)\big)\big]}_{\ge\, \langle Q(x),\, G^*(x)\rangle + \frac{\beta}{2}\|G^\circ(x) - G^*(x)\|^2}\, d\mu(x)\\
&\ge \int_{\mathcal{X}} \langle Q(x), G^*(x)\rangle\, d\mu(x) + \beta \int_{\mathcal{X}} \tfrac{1}{2}\|G^\circ(x) - G^*(x)\|^2 d\mu(x). && (20)
\end{aligned}$$
Let ψ* be an optimal potential in (14). Thanks to Lemma 4.2, we have
$$\begin{aligned}
\inf_\psi\sup_G \mathcal{L}(\psi, G) = \mathcal{L}(\psi^*, G^*) &= \int_{\mathcal{X}} \big[\langle Q(x), G^*(x)\rangle - \psi^*\big(G^*(x)\big)\big]\, d\mu(x) + \int_{\mathcal{Y}} \psi^*(y)\, d\nu(y)\\
&= \int_{\mathcal{X}} \big[\langle Q(x), G^*(x)\rangle - \psi^*\big(G^*(x)\big)\big]\, d\mu(x) + \int_{\mathcal{X}} \psi^*\big(G^*(x)\big)\, d\mu(x)
= \int_{\mathcal{X}} \langle Q(x), G^*(x)\rangle\, d\mu(x). && (21)
\end{aligned}$$
By combining (20) with (21), we obtain
$$\epsilon_2 = \mathcal{L}(\hat\psi, G^\circ) - \mathcal{L}(\psi^*, G^*) \ge \beta\int_{\mathcal{X}} \tfrac{1}{2}\|G^\circ(x) - G^*(x)\|^2 d\mu(x) = \frac{\beta}{2}\,\|G^\circ - G^*\|^2_{L^2(\mu)}. \tag{22}$$
The right-hand inequality of (16) follows from the triangle inequality combined with (19) and (22). The middle inequality of (16) follows from (Korotin et al., 2021a, Lemma A.2) and G*#µ = ν. Now we prove the left-hand inequality of (16). Let I be the feature extractor of the pre-trained InceptionV3 neural network. The FID score between the generated (fake) distribution Ĝ#µ and the data distribution ν is
$$\mathrm{FID}(\hat G_\#\mu, \nu) = \mathrm{FD}(I_\#\hat G_\#\mu,\, I_\#\nu) \le 2\,\mathbb{W}_2^2(I_\#\hat G_\#\mu,\, I_\#\nu), \tag{23}$$
where FD(·,·) is the Fréchet distance, which lower bounds 2W₂², see (Dowson & Landau, 1982). Finally, from (Korotin et al., 2021a, Lemma A.1) it follows that
$$\mathbb{W}_2^2(I_\#\hat G_\#\mu,\, I_\#\nu) \le L^2\, \mathbb{W}_2^2(\hat G_\#\mu, \nu). \tag{24}$$
Here L is the Lipschitz constant of I. We combine (23) and (24) to get the left-hand inequality in (16).

B EXPERIMENTAL DETAILS

We use the PyTorch framework. All the experiments are conducted on 2 V100 GPUs. We compute the Inception and FID scores with the official implementations from OpenAI (IS: https://github.com/openai/improved-gan/tree/master/inception_score) and TTUR (FID: https://github.com/bioinf-jku/TTUR). The compared results are taken from the respective papers or publicly available source codes.

B.1 GENERAL TRAINING DETAILS

MNIST (LeCun et al., 1998). On MNIST, we use x ∈ R^192 and y ∈ R^{32×32}. The batch size is 64, the learning rate is 2·10⁻⁴, the optimizer is Adam (Kingma & Ba, 2014) with betas (0, 0.9), the gradient optimality coefficient is λ = 10, and the number of training epochs is T = 30. We observe stable training while updating ψ once per multiple G updates, i.e., K_G = 2 and K_ψ = 1.

CIFAR10 (Krizhevsky et al., 2009). We use all 50000 samples for training. The latent vector is x ∈ R^192 and y ∈ R^{32×32×3}, the batch size is 64, λ = 10, K_G = 1, K_ψ = 1, T = 1000, the optimizer is Adam with betas (0, 0.9), and the learning rate is 2·10⁻⁴ for G and 1·10⁻³ for ψ.

CelebA (Liu et al., 2015). We use x ∈ R^192 and y ∈ R^{64×64×3}. The images are first cropped at the center with size 140 and then resized to 64×64. We consider all 202599 samples. We use Adam with betas (0, 0.9), T = 200, K_G = 2, K_ψ = 1, and learning rate 2·10⁻⁴.

Image restoration. In the unpaired image restoration experiments, we use the Adam optimizer with betas (0, 0.9), K_G = 5, K_ψ = 1, λ = 0, learning rate 1·10⁻⁴, and train for T = 300 epochs.

CelebA 128×128 (Liu et al., 2015). On this dataset, we resize the cropped images as in CelebA to 128×128, i.e., y ∈ R^{128×128×3}. Here, K_G = 5, K_ψ = 1, λ = 0.01, the learning rate is 1·10⁻⁴, and betas are (0.5, 0.999). The batch size is reduced to 16 so as to fit in the GPU memory.

Anime 128×128 (https://www.kaggle.com/reitanaka/alignedanimefaces). This dataset consists of 500000 high-resolution images. We resize the cropped images as in CelebA to 128×128, i.e., y ∈ R^{128×128×3}.
Here, K_G = 5, K_ψ = 1, λ = 0.01, the learning rate is 2·10⁻⁴, the batch size is 16, and betas are (0, 0.9).

Toy datasets. The dimension is D = H = 2, and the total number of samples is 10000. We use batch size 400, λ = 0.1, K_ψ = 1, K_G = 16, and T = 100. The optimizer is Adam with betas (0.5, 0.99) and learning rate 1·10⁻³. We use the following datasets: Gaussian to mixture of Gaussians (https://github.com/AmirTag/OT-ICNN), two moons (sklearn.datasets.make_moons), circles (sklearn.datasets.make_circles), Gaussian to S-curve (sklearn.datasets.make_s_curve), and Gaussian to Swiss roll (sklearn.datasets.make_swiss_roll).

Wasserstein-2 benchmark (Appendix B.2). The dimension is D = H = 64×64×3. We use batch size 64, λ = 0, K_ψ = 1, K_G = 5, learning rate 10⁻⁴, and the Adam optimizer with default betas.

B.2 EVALUATION ON THE CONTINUOUS WASSERSTEIN-2 BENCHMARK

To empirically show that the method recovers optimal transport maps well in equal dimensions, we evaluate it on the recent continuous Wasserstein-2 benchmark by Korotin et al. (2021b). The benchmark provides a number of artificial test pairs (µ, ν) of continuous probability distributions with an analytically known OT map T* between them.

Table 4: L2-UVP of the recovered transport map on the Early images benchmark pair.
  Method       L2-UVP
  MM:R         1.4%
  OTM (ours)   1.32%

For evaluation, we use the Early images benchmark pair (D = 12288), see (Korotin et al., 2021b, §4.1) for details. We adopt the L2 unexplained variance percentage metric (Korotin et al., 2021a, §5.1) to quantify the recovered OT map T̂: L2-UVP(T̂) = 100·‖T̂ − T*‖²_{L²(µ)}/Var(ν) %. For our method the L2-UVP metric is only ≈1%, see Table 4. This is comparable to the best MM:R method which the authors evaluate on their benchmark. The qualitative results are given in Figure 10.

Figure 10: Qualitative results of OTM on the Early images benchmark pair (µ, ν) by Korotin et al. (2021b). The 1st line shows samples x ∼ µ, the 2nd line shows the fitted OT map T̂(x), and the 3rd line shows the corresponding optimal map T*(x) ∼ ν.

B.3 FURTHER DISCUSSION AND EVALUATION

Generative modeling. In the experiments, we use the gradient penalty on ψ for better stability of the optimization. The penalty is intended to make the gradient norm of the optimal WGAN critic equal to 1 (Gulrajani et al., 2017, Corollary 1). This condition does not necessarily hold for the optimal ψ in our case and consequently might introduce bias into the optimization. To address this issue, we additionally tested an alternative regularization which we call gradient optimality. For every optimal potential ψ* and map G* of problem (14), we get from Lemma 4.2:
$$\mathbb{E}_{x\sim\mu}\Big[\nabla\big(\langle Q(x), G^*(x)\rangle - \psi^*\big(G^*(x)\big)\big)\Big] = \mathbb{E}_{x\sim\mu}\big[Q(x)\big] - \mathbb{E}_{x\sim\mu}\big[\nabla\psi^*\big(G^*(x)\big)\big] = 0. \tag{25}$$
Since µ is a normal noise distribution and Q(x) is the naive upscaling (§5.1), the above expression simplifies to E_{x∼µ}[∇ψ*(G*(x))] = 0. Based on this property, we establish the following regularizer, λ‖E_{x∼µ}[∇ψ(G(x))]‖ for λ > 0, and add it to L_ψ in our Algorithm 1. While the gradient penalty considers the expectation of a norm, gradient optimality considers the norm of an expectation. The gradient optimality regularizer is always non-negative and vanishes at the optimal point.
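A sketch (ours) of the gradient optimality regularizer described above: λ times the norm of the expected gradient of ψ at the generated samples, added to the potential loss L_ψ of Algorithm 1.

```python
import torch

def gradient_optimality(psi, fake, lam):
    """Returns lam * || E_{x~mu}[ grad_y psi(G(x)) ] || for a batch fake = G(x)."""
    fake = fake.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(psi(fake).sum(), fake, create_graph=True)
    return lam * grad.mean(dim=0).norm()

# Usage inside the psi update of Algorithm 1 (sketch):
#   loss_psi = psi(y).mean() - psi(fake).mean() + gradient_optimality(psi, fake, lam)
```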
We conduct additional experiments with the gradient optimality regularizer and compare FID scores for different λ in Table 5. It leads to an improvement of the FID score from 7.7 with the gradient penalty to 6.5 with gradient optimality on CelebA (Table 2).

Table 5: Ablation study of gradient optimality in OTM.
  λ       FID
  0.001   16.91
  0.01    16.22
  0.1     16.70
  1.0     10.01
  10      6.50

Unpaired restoration. In the unpaired restoration experiments (§5.2), we test OTM with the gradient penalty to make a fair comparison with the baseline WGAN-GP. We find that OTM without regularization, i.e., λ = 0, works better than OTM-GP (Table 3). In practice, several G updates per single ψ update work fairly well (§B.1).

B.4 ADDITIONAL QUALITATIVE RESULTS

OTM works with both grayscale and color embeddings of the noise in the ambient space.

CelebA 128×128. Figure 11 shows the grayscale embedding Q(x), the recovered transport map Ĝ(x), and independently drawn real samples y ∼ ν.

Figure 11: OTM between 128-dimensional noise and CelebA, 128×128. The 1st line shows the grayscale embedding Q (repeating bicubic upscaling of a noise, 16×8), the 2nd line shows the corresponding generated samples, and the 3rd line shows random samples from the dataset.

Anime 128×128. Figure 12 shows the color embedding Q(x), the recovered transport map Ĝ(x), and independently drawn real samples y ∼ ν.

Figure 12: OTM between 192-dimensional noise and Anime, 128×128. The 1st line shows the color embedding Q (bicubic upscaling of a noise, 3×8×8), the 2nd line shows the corresponding generated samples, and the 3rd line shows random samples from the dataset.

The extended qualitative results with the color embedding on MNIST, CIFAR10, and CelebA are shown in Figures 13a, 13b, and 13c respectively. Table 6 shows quantitative results on MNIST. The color embedding Q is a bicubic upscaling of a noise in R^{3×8×8}. The samples are generated randomly (uncurated) by the fitted optimal transport maps between the noise and the ambient space, e.g., spaces of high-dimensional images. Figure 14 illustrates latent space interpolation between generated samples. Figure 15 shows denoising of images with varying noise levels σ = 0.1, 0.2, 0.3, 0.4 by the model trained with σ = 0.3.

Table 6: Results on the MNIST dataset.
  Model       Related Work               FID
  VAE         Kingma & Welling (2013)    23.8 ± 0.6
  LSGAN       Mao et al. (2017)          7.8 ± 0.6
  BEGAN       Berthelot et al. (2017)    13.1 ± 1.0
  WGAN        Arjovsky et al. (2017)     6.7 ± 0.4
  SIG         Dai & Seljak (2021)        4.5
  AE-OT       An et al. (2020a)          6.2 ± 0.2
  AE-OT-GAN   An et al. (2020b)          3.2
  OTM         Ours                       2.4

(a) MNIST, 32×32, grayscale (b) CIFAR10, 32×32, RGB (c) CelebA, 64×64, RGB
Figure 13: Randomly generated MNIST, CIFAR10, and CelebA samples by our method (OTM).

Figure 14: OTM for latent space interpolation on CelebA, 64×64. Extended samples.

Toy datasets. Figure 16 shows the results of our method and related approaches (§3) on a toy 2D dataset. Figure 17 shows the results of our method applied to other toy datasets.

B.5 NEURAL NETWORK ARCHITECTURES

This section contains the architectures for generation on CIFAR10 (Table 7), CelebA (Table 8), and CelebA 128×128 (Figure 11), for the image restoration tasks (Table 9), and for the evaluation on the toy 2D datasets and the Wasserstein-2 images benchmark (Table 4). In the unpaired restoration tasks (§5.2), we use a UNet architecture for the transport map G and a convolutional architecture for the potential ψ. Similarly to (Song & Ermon, 2019), we use Batch Normalization (BN) and Instance Normalization+ (INorm+) layers. In the ResNet architectures, we use the Residual Block of NCSN (Song & Ermon, 2019).
In the toy 2D examples, we use a simple multi-layer perceptron with 3 hidden layers of 128 neurons each and LeakyReLU activations. The final layer is linear without any activation.

(a) Noisy samples (σ = 0.1, 0.2, 0.3, 0.4). (b) Pushforward samples (σ = 0.1, 0.2, 0.3, 0.4).
Figure 15: OTM for image denoising for varying levels of noise on the test C part of CelebA, 64×64.

(a) LSOT (Seguy et al., 2018) (b) W2GAN (Jacob et al., 2018) (c) WGAN-LP (Petzka et al., 2018) (d) ICNN-OT (Makkuva et al., 2020) (e) W2GN (Korotin et al., 2021a) (f) OTM (ours)
Figure 16: Mapping between a Gaussian and a mixture of 8 Gaussians in 2D by various methods. The colors green, blue, and peru represent input, pushforward, and output samples respectively.

The transport map G and potential ψ architectures on MNIST 32×32, CelebA 128×128, and Anime 128×128 are the generator and discriminator architectures of WGAN-QC (Liu et al., 2019), respectively. In the evaluation on the Wasserstein-2 benchmark, we use the publicly available UNet architecture (https://github.com/milesial/Pytorch-UNet) for the transport map T and the WGAN-QC discriminator's architecture for ψ (Liu et al., 2019). These neural network architectures are the same as those used by the authors of the benchmark.

(a) Two moons (b) Circles (c) S-Curve (d) Swiss Roll
Figure 17: OTM on toy datasets, D = 2. Here, the colors green, blue, and peru represent input, pushforward, and output samples respectively.

Table 7: Architectures for the generation task on CIFAR10, 32×32.
G(·):
  Noise: x ∈ R^128
  Linear, Reshape, output shape: [128 × 4 × 4]
  ResidualBlock Up, output shape: [128 × 8 × 8]
  ResidualBlock Up, output shape: [128 × 16 × 16]
  ResidualBlock Up, output shape: [128 × 32 × 32]
  Conv, Tanh, output shape: [3 × 32 × 32]
ψ(·):
  Target: y ∈ R^{3×32×32}
  ResidualBlock Down, output shape: [128 × 16 × 16]
  ResidualBlock Down, output shape: [128 × 8 × 8]
  ResidualBlock, output shape: [128 × 8 × 8]
  ResidualBlock, output shape: [128 × 8 × 8]
  ReLU, Global sum pooling, output shape: [128 × 1 × 1]
  Linear, output shape: [1]

Table 8: Architectures for the generation task on CelebA, 64×64.
G(·):
  Noise: x ∈ R^128
  ConvTranspose, BN, LeakyReLU, output shape: [256 × 1 × 1]
  ConvTranspose, BN, LeakyReLU, output shape: [512 × 4 × 4]
  Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 8 × 8]
  Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 16 × 16]
  Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 32 × 32]
  ConvTranspose, Tanh, output shape: [3 × 64 × 64]
ψ(·):
  Target: y ∈ R^{3×64×64}
  Conv, output shape: [128 × 64 × 64]
  ResidualBlock Down, output shape: [256 × 32 × 32]
  ResidualBlock Down, output shape: [256 × 16 × 16]
  ResidualBlock Down, output shape: [256 × 8 × 8]
  ResidualBlock Down, output shape: [128 × 4 × 4]
  Conv, output shape: [1]
Table 9: Architectures for the restoration tasks on CelebA, 64×64.
G(·):
  Input: x ∈ R^{3×64×64}
  Conv, BN, LeakyReLU, output shape: [256 × 64 × 64]
  Conv, LeakyReLU, AvgPool, output shape: [256 × 32 × 32], x1
  Conv, LeakyReLU, AvgPool, output shape: [256 × 16 × 16], x2
  Conv, LeakyReLU, AvgPool, output shape: [256 × 8 × 8], x3
  Conv, LeakyReLU, AvgPool, output shape: [256 × 4 × 4], x4
  Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 8 × 8], y3
  Add(y3, x3), output shape: [256 × 8 × 8], y3
  Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 16 × 16], y2
  Add(y2, x2), output shape: [256 × 16 × 16], y2
  Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 32 × 32], y1
  Add(y1, x1), output shape: [256 × 32 × 32], y1
  Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 64 × 64], y
  Add(y, x), output shape: [256 × 64 × 64], y
  ConvTranspose, Tanh, output shape: [3 × 64 × 64]
ψ(·):
  Target: y ∈ R^{3×64×64}
  Conv, LeakyReLU, AvgPool, output shape: [256 × 32 × 32]
  Conv, LeakyReLU, AvgPool, output shape: [256 × 16 × 16]
  Conv, LeakyReLU, AvgPool, output shape: [256 × 8 × 8]
  Conv, LeakyReLU, AvgPool, output shape: [256 × 4 × 4]
  Linear, output shape: [1]
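For completeness, a sketch (ours) of the toy-2D networks described in Appendix B.5: a multi-layer perceptron with 3 hidden layers of 128 units, LeakyReLU activations, and a linear output layer.

```python
import torch.nn as nn

def toy_mlp(d_in=2, d_out=2):
    return nn.Sequential(
        nn.Linear(d_in, 128), nn.LeakyReLU(),
        nn.Linear(128, 128), nn.LeakyReLU(),
        nn.Linear(128, 128), nn.LeakyReLU(),
        nn.Linear(128, d_out))   # final layer is linear, no activation

G = toy_mlp(2, 2)     # transport map G: R^2 -> R^2
psi = toy_mlp(2, 1)   # potential psi: R^2 -> R
```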