WASSERSTEIN-2 GENERATIVE NETWORKS

Alexander Korotin, Skolkovo Institute of Science and Technology, Moscow, Russia, a.korotin@skoltech.ru
Vage Egiazarian, Skolkovo Institute of Science and Technology, Moscow, Russia, vage.egiazarian@skoltech.ru
Arip Asadulaev, ITMO University, Saint Petersburg, Russia, aripasadulaev@itmo.ru
Aleksandr Safin, Skolkovo Institute of Science and Technology, Moscow, Russia, aleksandr.safin@skoltech.ru
Evgeny Burnaev, Skolkovo Institute of Science and Technology, Moscow, Russia, e.burnaev@skoltech.ru

ABSTRACT

We propose a novel end-to-end non-minimax algorithm for training optimal transport mappings for the quadratic cost (Wasserstein-2 distance). The algorithm uses input convex neural networks and a cycle-consistency regularization to approximate the Wasserstein-2 distance. In contrast to popular entropic and quadratic regularizers, cycle-consistency does not introduce bias and scales well to high dimensions. On the theoretical side, we estimate the properties of the generative mapping fitted by our algorithm. On the practical side, we evaluate our algorithm on a wide range of tasks: image-to-image color transfer, latent space optimal transport, image-to-image style transfer, and domain adaptation.

1 INTRODUCTION

The generative learning framework has become widespread over the last couple of years, arguably starting with the introduction of generative adversarial networks (GANs) by Goodfellow et al. (2014). The framework aims to define a stochastic procedure to sample from a given complex probability distribution Q on a space Y ⊆ R^D, e.g. a space of images. The usual generative pipeline includes sampling from a tractable distribution P on a space X and applying a generative mapping g : X → Y that transforms P into the desired Q.

Figure 1: Two possible generative mappings that transform distribution P to distribution Q. (a) An arbitrary mapping. (b) The monotone mapping.

In many cases, for probability distributions P, Q there may exist several different generative mappings. For example, the mapping in Figure 1b seems to be better than the one in Figure 1a and should be preferred: the mapping in Figure 1b is straightforward, well-structured and invertible. Existing generative learning approaches mainly do not focus on the structural properties of the generative mapping. For example, GAN-based approaches, such as f-GAN by Nowozin et al. (2016); Yadav et al. (2017), WGAN by Arjovsky et al. (2017) and others Li et al. (2017); Mroueh & Sercu (2017), approximate the generative mapping by a neural network with a problem-specific architecture.

A reasonable question is how to find a generative mapping g♯P = Q that is well-structured. Typically, the better the structure of the mapping is, the easier it is to find such a mapping. There are many ways to define what a well-structured mapping is, but usually such a mapping is expected to be continuous and, if possible, invertible. One may note that when P and Q are both one-dimensional (X, Y ⊆ R^1), the only class of mappings g : X → Y satisfying these properties is the class of monotone mappings^1, i.e. mappings satisfying (g(x) − g(x′)) · (x − x′) > 0 for all x, x′ ∈ X (x ≠ x′). The intuition of 1-dimensional spaces can be easily extended to X, Y ⊆ R^D by requiring a similar condition to hold true:

    ⟨g(x) − g(x′), x − x′⟩ > 0   for all x, x′ ∈ X (x ≠ x′).   (1)

Condition (1) is called monotonicity, and every surjective function satisfying this condition is invertible.
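The connection between monotonicity and convexity that we exploit below can be illustrated numerically: the gradient of any strongly convex differentiable function satisfies condition (1). The following minimal PyTorch sketch (our own illustration, not code from the paper) checks this for a toy potential ψ(x) = ½‖x‖² + logsumexp(Wx), which is a convenient strongly convex choice; the matrix W and the sample sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
D = 8
W = torch.randn(16, D)

def psi(x):
    # A strongly convex toy potential: 0.5*|x|^2 + logsumexp(W x).
    return 0.5 * (x ** 2).sum(dim=-1) + torch.logsumexp(x @ W.T, dim=-1)

def grad_psi(x):
    # The candidate monotone map g = grad(psi), evaluated with autograd.
    x = x.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(psi(x).sum(), x)
    return g

x, x_prime = torch.randn(1000, D), torch.randn(1000, D)
inner = ((grad_psi(x) - grad_psi(x_prime)) * (x - x_prime)).sum(dim=-1)
# Condition (1): the inner product is strictly positive for all sampled pairs.
print("min <g(x)-g(x'), x-x'> over 1000 random pairs:", inner.min().item())
```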
In one-dimensional case, for any pair of continuous P, Q with non-zero density there exists a unique monotone generative map given by g(x) = F 1 Q FP(x) Mc Cann et al. (1995), where F( ) is the cumulative distribution function of P or Q. However, for D > 1 there might exist more than one generative monotone mapping. For example, when P = Q are standard 2-dimensional Gaussian distributions, all rotations by angles π 2 are monotone and preserve the distribution. One may impose uniqueness by considering only maximal Peyré (2018) monotone mappings g : X Y satisfying N = 2, 3 . . . and N distinct points x1, . . . , x N X (N + 1 1): N X n=1 g(xn), xn xn+1 > 0. (2) The condition (2) is called cycle monotonicity and also implies "usual" monotonicity (1). Importantly, for almost every two continuous probability distributions P, Q on X = Y = RD there exists a unique cycle monotone mapping g : X Y satisfying g P = Q, see Mc Cann et al. (1995). Thus, instead of searching for arbitrary generative mapping, one may significantly reduce the considered approximating class of mappings by using only cycle monotone ones. According to Rockafellar (1966), every cycle monotone mapping g is contained in a sub-gradient of some convex function ψ : X R. Thus, every convex class of functions may produce cycle monotone mappings (by considering sub-gradients of these functions). In practice, deep input convex neural networks (ICNNs, see Amos et al. (2017)) can be used as a class of convex functions. Formally, to fit a cycle monotone generative mapping, one may apply any existing approach, such as GANs Goodfellow et al. (2014), with the set of generators restricted to gradients of ICNN. However, GANs typically require solving a minimax optimization problem. It turns out that the cycle monotone generators are strongly related to Wasserstein-2 distance (W2). The approaches by Taghvaei & Jalali (2019); Makkuva et al. (2019) use dual form of W2 to find the optimal generative mapping which is cycle monotone. The predecessor of both approaches is the gradient-descent algorithm for computing W2 distance by Chartrand et al. (2009). The drawback of all these methods is similar to the one of GANs their optimization objectives are minimax. Cyclically monotone generators require that both spaces X and Y have the same dimension, which poses no practical limitation. Indeed, it is possible to combine a generative mapping with a decoder of a pre-trained autoencoder, i.e. train a generative mapping into a latent space. It should be also noted that the cases with equal dimensions of X and Y are common in computer vision. The typical example is image-to-image style transfer when both the input and the output images have the same size and number of channels. Other examples include image-to-image color transfer, domain adaptation, etc. In this paper, we develop the concept of cyclically monotone generative learning. The main contributions of the paper are as follows: 1. Developing an end-to-end non-minimax algorithm for training cyclically monotone generative maps, i.e. optimal maps for quadratic transport cost (Wasserstein-2 distance). 2. Proving theoretical bound on the approximation properties of the transport mapping fitted by the developed approach. 3. Developing a class of Input Convex Neural Networks whose gradients are used to approximate cyclically monotone mappings. 1We consider only monotone increasing mappings. Decreasing mappings have analogous properties. Published as a conference paper at ICLR 2021 4. 
Demonstrating the performance of the method in practical problems of image-to-image color transfer, mass transport in latent spaces, image-to-image style translation and domain adaptation. Our algorithm extends the approach of Makkuva et al. (2019), eliminates minimax optimization imposing cyclic regularization and solves non-minimax optimization problem. At the result, the algorithm scales well to high dimensions and converges up to 10x times faster than its predecessors. The paper is structured as follows. Section 2 is devoted to Related Work. In Section 3, we give the necessary mathematical tools on Wasserstein-2 optimal transport. In Section 4, we derive our algorithm and state our main theoretical results. In Section 5, we provide the results of computational experiments. In Appendix A, we prove our theoretical results. In Appendix B, we describe the particular architectures of ICNN that we use for experiments. In Appendix C, additional experiments and training details are provided. 2 RELATED WORK Modern generative learning is mainly associated with Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Arjovsky et al. (2017). Basic GAN model consists of two competing networks: generator g and discriminator d. Generator g takes as input samples x from given distribution P and tries to produce realistic samples from real data distribution Q. Discriminator d attempts to distinguish between generated and real distributions g P and Q respectively. Formally, it approximates a dissimilarity measure between g P and Q (e.g. f-divergence Nowozin et al. (2016) or Wasserstein-1 distance Arjovsky et al. (2017)). Although superior performance is reported for many applications of GANs Karras et al. (2017); Mirza & Osindero (2014), training such models is always hard due to the minimax nature of the optimization objective. Another important branch of generative learning is related to the theory of Optimal Transport (OT) Villani (2008); Peyré et al. (2019). OT methods seek generative mapping2 g : X Y, optimal in the sense of the given transport cost c : X Y R: Cost(P, Q) = min g P=Q X c x, g(x) d P(x). (3) Equation (3) is also known as Monge s formulation of optimal transportation Villani (2008). The principal OT generative method Seguy et al. (2017) is based on optimizing the regularized dual form of the transport cost (3). It fits two potentials ψ, ψ (primal and conjugate) and then uses the barycentric projection to establish the desired (third) generative network g. Although the method uses non-minimax optimization objective, it is not end-to-end (consists of two sequential steps). In the case of quadratic transport cost c(x, y) = x y 2 2 , the value (3) is known as the square of Wasserstein-2 distance: W2 2(P, Q) = min g P=Q 2 d P(x). (4) It has been well studied in literature Brenier (1991); Mc Cann et al. (1995); Villani (2003; 2008) and has many useful properties which we discuss in Section 3 in more detail. The optimal mapping for the quadratic cost is cyclically monotone. Several algorithms exist Lei et al. (2019); Taghvaei & Jalali (2019); Makkuva et al. (2019) for finding this mapping. The recent approach by Taghvaei & Jalali (2019) uses the gradient-descent-based algorithm by Chartrand et al. (2009) for computing W2. The key idea is to approximate the optimal potential ψ by an ICNN Amos et al. (2017), and extract the optimal generator g from its gradient ψ . 
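The "generator as a gradient of an input convex network" construction underlying these methods can be sketched in a few lines of PyTorch. The module below is a simplified, fully connected input-convex network of our own (not the exact DenseICNN/ConvICNN of Appendix B): the weights of the convexity-preserving layers are kept non-negative by clamping, a CELU activation is used, and a quadratic term is added so that the potential is strongly convex. The generator is extracted from the potential by automatic differentiation. How the potential is actually trained is what distinguishes the approaches discussed next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleICNN(nn.Module):
    """Minimal input-convex network: psi(x) is convex in x as long as the
    weights of the `convex` layers (and the head) stay non-negative."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # Unconstrained layers applied directly to the input x (skip connections).
        self.lin_in = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(3)])
        # Convexity-preserving layers: non-negative weights, no biases for simplicity.
        self.convex = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(2)])
        self.head = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.celu(self.lin_in[0](x))
        z = F.celu(self.convex[0](z) + self.lin_in[1](x))
        z = F.celu(self.convex[1](z) + self.lin_in[2](x))
        # Extra quadratic term keeps the potential strongly convex.
        return self.head(z).squeeze(-1) + 0.5 * (x ** 2).sum(dim=-1)

    def clamp_convex_weights(self):
        # Project convexity-preserving weights back to [0, +inf).
        for layer in list(self.convex) + [self.head]:
            layer.weight.data.clamp_(min=0.0)

def push_forward(psi, x):
    """Generator g(x) = grad_x psi(x), obtained by automatic differentiation."""
    x = x.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(psi(x).sum(), x, create_graph=True)
    return g

psi = SimpleICNN(dim=16)
x = torch.randn(32, 16)
y = push_forward(psi, x)   # candidate cyclically monotone map applied to a batch
print(y.shape)             # torch.Size([32, 16])
```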
The method is impractical due to high computational complexity: during the main optimization cycle, it solves an additional optimization sub-problem. The inner problem is convex but computationally costly. This was noted in the original paper and de-facto confirmed by the lack of experiments with complex distributions. A refinement of this approach is proposed by Makkuva et al. (2019). The inner optimization sub-problem is removed, and a network is used to approximate its solution. This speeds up the computation, but the problem is still minimax. 2Commonly, in OT it is assumed that dim X = dim Y. Published as a conference paper at ICLR 2021 3 PRELIMINARIES In the section, we recall the properties of W2 distance (4) and its relation to cycle monotone mappings. Throughout the paper, we assume that P and Q are continuous distributions on X = Y = RD with finite second moments.3 This condition guarantees that (3) is well-defined in the sense that the optimal mapping g always exists. It follows from (Villani, 2003, Brenier s Theorem 2.12) that its restriction to the support of P is unique (up to the values on the small sets) and invertible. The symmetric characteristics apply to its inverse (g ) 1, which induces symmetry to definition (4) for quadratic cost. According to Villani (2003), the dual form of (4) is given by W2 2(P, Q) = Z 2 d P(x) + Z 2 d Q(y) | {z } Const(P,Q) min ψ Convex X ψ(x)d P(x) + Z | {z } Corr(P,Q) where the minimum is taken over all the convex functions (potentials) ψ : X R { }, and ψ(y) = maxx X x, y ψ(x) is the convex conjugate Fenchel (1949) to ψ, which is also a convex function, ψ : Y R { }. We call the value of the minimum in (5) cyclically monotone correlations and denote it by Corr(P, Q). By equating (5) with (4), one may derive the formula Corr(P, Q) = max g P=Q X x, g(x) d P(x). (6) Note that Corr(P, Q) can be viewed as an optimal transport cost for bilinear cost function c(x, y) = x, y , see Mc Cann et al. (1995). Thus, searching for optimal transport map g for W2 is equivalent to finding the mapping which maximizes correlations (6). It is known for W2 distance that the gradient g = ψ of optimal potential ψ readily gives the minimizer of (4), see Villani (2003). Being a gradient of a convex function, it is necessarily cycle monotone. In particular, the inverse mapping can be obtained by taking the gradient w.r.t. input of the conjugate of optimal potential ψ (y) Mc Cann et al. (1995). Thus, we have (g ) 1(y) = ψ 1(y) = ψ (y). (7) In fact, one may approximate the primal potential ψ by a parametric class Θ of input convex functions ψθ and optimize correlations min θ Θ Corr(P, Q | ψθ) = min θ Θ X ψθ(x)d P(x) + Z ψθ(y)d Q(y) in order to extract the approximate optimal generator gθ : X Y from the approximate potential ψθ . Note that in general it is not true that gθ P will be equal to Q. However, we prove that if Corr(P, Q | ψθ ) is close to Corr(P, Q), then gθ P Q, see our Theorem A.3 in Appendix A.2. The optimization of (8) can be performed via stochastic gradient descent. It is possible to get rid of conjugate ψθ and extract an analytic formula for the gradient of (8) w.r.t. parameters θ by using ψθ only, see the derivations in Taghvaei & Jalali (2019); Chartrand et al. (2009): Corr(P, Q | ψθ) θ in the second integral is computed at ˆx = ( ψθ) 1(y), i.e. inverse value of y for ψθ. In practice, both integrals are replaced by their Monte Carlo estimates over random mini-batches from P and Q. 
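For concreteness, a sketch of such a Monte Carlo estimator for the single-potential objective (8) is given below (our own illustrative PyTorch code, not the paper's implementation; `psi` is assumed to be an input-convex network returning one scalar per sample, and the step size and iteration count of the inner loop are arbitrary). The conjugate term is handled by an inner gradient-ascent loop over x, which is exactly the costly sub-problem discussed next; note that the gradient of this estimate w.r.t. θ reproduces the formula above, since the recovered x̂ is detached.

```python
import torch

def conjugate(psi, y, n_steps=50, lr=0.1):
    """Approximate psi_bar(y) = max_x <x, y> - psi(x) and the argmax x_hat
    by inner gradient ascent (the sub-problem of eq. (9)), warm-started at x = y."""
    x = y.detach().clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        obj = ((x * y).sum(dim=-1) - psi(x)).sum()
        (g,) = torch.autograd.grad(obj, x)     # ascend on <x, y> - psi(x)
        x = (x + lr * g).detach()
    # x is detached, so gradients w.r.t. theta flow only through psi(x) below.
    return (x * y).sum(dim=-1) - psi(x), x

def corr_estimate(psi, x_batch, y_batch):
    """Monte Carlo estimate of Corr(P, Q | psi_theta), eq. (8)."""
    psi_bar_y, _x_hat = conjugate(psi, y_batch)
    return psi(x_batch).mean() + psi_bar_y.mean()

# x_batch ~ P and y_batch ~ Q are mini-batches; psi is an input-convex network.
```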
Yet to compute the second integral, one needs to recover the inverse values of the current mapping ∇ψθ for all y ∼ Q in the mini-batch. To do this, the following optimization sub-problem has to be solved for each y ∼ Q in the mini-batch:

    x̂ = (∇ψθ)⁻¹(y)   ⟺   x̂ = argmax_{x ∈ X} [⟨x, y⟩ − ψθ(x)].   (9)

The optimization problem (9) is convex but complex because it requires computing the gradient of ψθ multiple times. It is computationally costly since ψθ is in general a large neural network. Besides, a new independent batch of samples arrives at every iteration over θ. This makes it hard to reuse information about the solution of (9) from the previous gradient descent step over θ in (8).

³In practice, the continuity condition can be assumed to hold true. Indeed, widely used heuristics, such as adding small Gaussian noise to data Sønderby et al. (2016), make the considered distributions continuous.

4 AN END-TO-END NON-MINIMAX ALGORITHM

In Subsection 4.1, we describe our novel end-to-end algorithm with a non-minimax optimization objective for fitting cyclically monotone generative mappings. In Subsection 4.2, we state our main theoretical results on the approximation properties of the proposed algorithm.

4.1 ALGORITHM

To simplify the inner optimization procedure for inverting the values of the current ∇ψθ, one may consider the following variational approximation of the main objective:

    min_{ψ ∈ Convex} Corr(P, Q | ψ) = min_{ψ ∈ Convex} [ ∫_X ψ(x) dP(x) + ∫_Y max_{x ∈ X} (⟨x, y⟩ − ψ(x)) dQ(y) ]
                                    = min_{ψ ∈ Convex} [ ∫_X ψ(x) dP(x) + max_{T : Y → X} ∫_Y (⟨T(y), y⟩ − ψ(T(y))) dQ(y) ],   (10)

where the inner maximum over x is exactly the conjugate ψ̄(y), and by considering arbitrary measurable functions T we obtain a variational lower bound which matches the exact value for T = (∇ψ)⁻¹ = ∇ψ̄. Thus, a possible approach is to approximate both primal and dual potentials by two different networks ψθ and ψω and solve the optimization problem w.r.t. parameters θ, ω, e.g. by stochastic gradient descent/ascent Makkuva et al. (2019). Yet such a problem is still minimax. Thus, it suffers from typical problems such as convergence to local saddle points and instabilities during training, and it usually requires a non-trivial choice of hyperparameters.

We propose a method to get rid of the minimax objective by imposing additional regularization. Our key idea is to add a regularization term R_Y(θ, ω) which stimulates cycle consistency Zhu et al. (2017), i.e. the optimized generative mappings gθ = ∇ψθ and g⁻¹_ω = ∇ψω should be mutually inverse:

    R_Y(θ, ω) = ∫_Y ‖gθ ∘ g⁻¹_ω(y) − y‖² dQ(y) = ∫_Y ‖∇ψθ(∇ψω(y)) − y‖² dQ(y).   (11)

From the previous discussion and equation (7), we see that cycle consistency is a quite natural condition for the W2 distance. More precisely, if ∇ψθ and ∇ψω are exactly inverse to each other (assuming ∇ψθ is injective), then ψω is a convex conjugate to ψθ up to a constant. In contrast to the regularization used in Seguy et al. (2017), the proposed penalty uses not the values of the potentials ψθ, ψω themselves but the values of their gradients (generators). This helps to stabilize the value of the regularization term, which in the case of Seguy et al. (2017) may take extremely high values due to the fact that convex potentials grow fast in absolute value.⁴ Our proposed regularization leads to the following non-minimax optimization objective (λ > 0):

    Corr(P, Q | ψθ, ψω; λ) = ∫_X ψθ(x) dP(x) + ∫_Y [⟨∇ψω(y), y⟩ − ψθ(∇ψω(y))] dQ(y) + (λ/2) · R_Y(θ, ω).   (12)

The practical optimization procedure is given in Algorithm 1. We replace all the integrals by Monte Carlo estimates over random mini-batches from P and Q.
To perform optimization, we use the stochastic gradient descent over parameters θ, ω of primal ψθ and dual ψω potentials. We use the automatic differentiation to evaluate ψω, ψθ and the gradients of (12) w.r.t. parameters θ, ω. The time required to compute the gradient of (12) w.r.t. θ, ω is comparable by a constant factor 4For example, in the case of identity map g(x) = ψ(x) = x, we have quadratic growth: ψ = x 2 Published as a conference paper at ICLR 2021 Algorithm 1: Numerical Procedure for Optimizing Regularized Correlations (12) Input :Distributions P, Q with sample access; cycle-consistency regularizer coefficient λ > 0; a pair of input-convex neural networks ψθ and ψω; batch size K > 0; for t = 1, 2, . . . do 1. Sample batches X P and Y Q; 2. Compute the Monte-Carlo estimate of the correlations: x X ψθ(x) + X ψω(y), y ψθ ψω(y) ; 3. Compute the Monte-Carlo estimate of the cycle-consistency regularizer: LCycle := 1 y Y ψθ ψω(y) y 2 2; 4. Compute the total loss LTotal := LCorr + λ 2 LCycle; 5. Perform a gradient step over {θ, ω} by using LTotal to the time required to compute the value of ψθ(x). We empirically measured that this factor roughly equals 8-12, depending on the particular architecture of ICNN ψθ(x). We discuss the time complexity of a gradient step of our method in a more detail in Appendix C.2. In Subsection 5.1, we show that our non-minimax approach converges up to 10x times faster than minimax alternatives by Makkuva et al. (2019) and Taghvaei & Jalali (2019). 4.2 APPROXIMATION PROPERTIES Our gradient-descent-based approach described in Subsection 4.1 computes Corr(P, Q) by approximating it with a restricted sets of convex potentials. Let (ψ , ψ ) be a pair of potentials obtained by the optimization of correlations. Formally, the fitted generators g = ψ and (g ) 1 = ψ are byproducts of optimization (12). We provide guarantees that the generated distribution g P is indeed close to Q as well as the inverse mapping (g ) 1 pushes Q close to P. Theorem 4.1 (Generative Property for Approximators of Regularized Correlations). Let P, Q be two continuous probability distributions on X = Y = RD with finite second moments. Let ψ : X R be the optimal convex potential: ψ = arg min ψ Convex Corr(P, Q|ψ) = arg min ψ Convex X ψ(x)d P(x) + Z ψ(y)d Q(y) . (13) Let two differentiable convex functions ψ : X R and ψ : Y R satisfy for some ϵ R: Corr P, Q | ψ , ψ ; λ Z X ψ (x)d P(x) + Z ψ (y)d Q(y) + ϵ = Corr(P, Q) | {z } Equals (6) Assume that ψ is β -strongly convex (β > 1 λ > 0) and B -smooth (B β ). Assume that ψ has bijective gradient ψ . Then the following inequalities hold true: 1. Correlation Upper Bound (regularized correlations dominate over the true ones) Corr P, Q | ψ , ψ ; λ Corr P, Q (i.e. ϵ 0); 2. Forward Generative Property (mapping g = ψ pushes P to be O(ϵ)-close to Q) W2 2(g P, Q) = W2 2( ψ P, Q) (B )2 ϵ 3. Inverse Generative Property (mapping (g ) 1 = ψ pushes Q to be O(ϵ)-close to P) W2 2 (g ) 1 Q, P = W2 2( ψ Q, P) ϵ β 1 Published as a conference paper at ICLR 2021 Informally, Theorem 4.1 states that the better we approximate correlations between P and Q by potentials ψ , ψ , the closer we expect generated distributions g P and (g ) 1 Q to be to Q and P respectively in the W2 sense. We prove Theorem 4.1 and provide extra discussion on smoothness and strong convexity in Section A.2. Additionally, we derive Theorem A.3 which states analogous generative properties for the mapping obtained by the base method (8) with single potential and no regularization. 
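For concreteness, a minimal PyTorch-style sketch of one optimization step of Algorithm 1 (the objective (12) with the cycle regularizer (11)) is given below. It is our own illustrative code, not the released implementation: `psi_theta` and `psi_omega` are assumed to be input-convex networks returning one scalar per sample (e.g. the SimpleICNN sketched earlier, whose `clamp_convex_weights` method enforces non-negative convexity-preserving weights), `sample_p`/`sample_q` are placeholder samplers, and `opt` is a single optimizer over both parameter sets, e.g. `torch.optim.Adam(list(psi_theta.parameters()) + list(psi_omega.parameters()), lr=1e-3)`.

```python
import torch

def grad_potential(psi, t):
    """Return grad_t psi(t), keeping the graph so we can backprop into parameters."""
    if not t.requires_grad:
        t = t.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(psi(t).sum(), t, create_graph=True)
    return g

def w2gn_step(psi_theta, psi_omega, opt, sample_p, sample_q, batch_size=1024, lam=1.0):
    """One step of Algorithm 1: minimize the regularized correlations (12)."""
    x = sample_p(batch_size)                              # X ~ P
    y = sample_q(batch_size)                              # Y ~ Q

    grad_omega_y = grad_potential(psi_omega, y)           # grad psi_omega(y), approx. inverse map
    push_back = grad_potential(psi_theta, grad_omega_y)   # grad psi_theta(grad psi_omega(y))

    # Monte Carlo estimate of the correlation terms of (12).
    l_corr = psi_theta(x).mean() \
             + ((grad_omega_y * y).sum(dim=-1) - psi_theta(grad_omega_y)).mean()
    # Cycle-consistency regularizer (11).
    l_cycle = ((push_back - y) ** 2).sum(dim=-1).mean()

    loss = l_corr + 0.5 * lam * l_cycle                   # L_Total = L_Corr + (lambda/2) L_Cycle
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Keep the convexity-preserving weights non-negative (weight clipping, Appendix C.1).
    for net in (psi_theta, psi_omega):
        if hasattr(net, "clamp_convex_weights"):
            net.clamp_convex_weights()
    return loss.item()
```

This sketch optimizes the plain objective (12); the detach heuristic for the ⟨∇ψω(y), y⟩ term and the extra regularizer R_X mentioned in Appendix C.1 are omitted for clarity.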
Due to the Forward Generative property of Theorem 4.1, one may view the optimization of regularized correlations (12) as a process of minimizing W2 2 between the forward generated g P and true Q distributions (same applies to the inverse property). Wasserstein-2 distance prevents mode dropping for distant modes due to the quadratic cost. See the experiment in Figure 9 in Appendix C.3. The following Theorem demonstrates that we actually can approximate correlations as well as required if the approximating classes of functions for potentials are large enough. Theorem 4.2 (Approximability of Correlations). Let P, Q be two continuous probability distributions on X = Y = RD with finite second moments. Let ψ : Y R be the optimal convex potential. Let ΨX , ΨY be classes of differentiable convex functions X R and Y R respectively and 1. ψX ΨX with ϵX -close gradient to the forward mapping ψ in L2(X RD, P) sense: ψX ψ 2 P def = Z X ψX (y) ψ (y) 2d P(y) ϵX , and ψX is BX -smooth; 2. ψY ΨY with ϵY-close gradient to the inverse mapping ψ in L2(Y RD, Q) sense: ψY ψ 2 Q def = Z Y ψY(y) ψ (y) 2d Q(y) ϵY. Let (ψ , ψ ) be the minimizers of the regularized correlations within ΨX ΨY: (ψ , ψ ) = arg min ψ ΨX ,ψ ΨY Corr P, Q | ψ, ψ ; λ . (15) Then the regularized correlations for (ψ , ψ ) satisfy the following inequality: Corr P, Q | ψ , ψ ; λ Corr(P, Q) + λ 2 (BX ϵY + ϵX )2 + (BX ϵY + ϵX ) ( ϵY) + BX 2 ϵY , (16) i.e. regularized correlations do not exceed true correlations plus O(ϵX + ϵY) term. By combining Theorem 4.2 with Theorem 4.1, we conclude that solutions ψ , ψ of (15) push P and Q to be O(ϵX + ϵY)-close to Q and P respectively. In practice, it is reasonable to use inputconvex neural networks as classes of functions ΨX , ΨY. Fully-connected ICNNs satisfy universal approximation property Chen et al. (2018). In Appendix A.3, we prove that our method can be applied to the latent space scenario. Theorem A.4 states that the distance between the target and generated (latent space generative map combined with the decoder) distributions can upper bounded by the quality of the latent fit and the reconstruction loss of the auto-encoder. In Appendix A.4, we prove Theorem A.5, which demonstrates how our method can be applied to non-continuous distributions P and Q. 5 EXPERIMENTS In this section, we experimentally evaluate the proposed model.5 In Subsection 5.1, we apply our method to estimate optimal transport maps in the Gaussian setting. In Subsection 5.2, we consider 5The code is written on Py Torch framework and is publicly available at https://github.com/iamalexkorotin/Wasserstein2Generative Networks. Published as a conference paper at ICLR 2021 latent space mass transport. In Subsection 5.3, we experiment with image-to-image style translation. In Appendix C, we provide training details and additional experiments on color transfer, domain adaptation and toy examples. The architectures of input convex networks that we use (Dense/Conv ICNNs) are described in Appendix B. The provided results are not intended to represent the state-ofthe-art for any particular task the goal is to show the feasibility of our approach and architectures. 5.1 OPTIMAL TRANSPORT BETWEEN GAUSSIANS We test our method in the Gaussian setting P, Q = N(µP, ΣP), N(µQ, ΣQ). It is one of the few setups with the ground truth OT mapping existing in a closed form. We compare our method [W2GN] with quadratic regularization approach by Seguy et al. (2017) [LSOT] and minimax approaches by Taghvaei & Jalali (2019) [MM-1] and by Makkuva et al. 
(2019) [MM-2]. To assess the quality of the recovered transport map ∇ψ̂, we consider the unexplained variance percentage:

    L2-UVP(∇ψ̂) = 100 · ‖∇ψ̂ − ∇ψ*‖²_P / Var(Q) %.

Here ∇ψ* is the optimal transport map. For values close to 0%, ∇ψ̂ is a good approximation of the OT map. For values close to 100%, the map ∇ψ̂ is nearly useless. Indeed, a trivial benchmark ∇ψ₀(x) ≡ E_Q[y] provides L2-UVP(∇ψ₀) = 100%.

As seen from Table 1, LSOT leads to a high error which grows drastically with the dimension. The W2GN, MM-1 and MM-2 approaches perform nearly equally in this metric, which is expected since they optimize analogous objectives. These methods compute optimal transport maps with low error (L2-UVP < 3% even in R^4096). However, as seen from the convergence plot in Figure 2, our approach converges several times faster: this naturally follows from the fact that the MM approaches contain an inner optimization cycle. We discuss the experiment in more detail in Appendix C.4.

Dim   |   2    4    8   16   32    64   128   256   512  1024  2048  4096
LSOT  |  <1  3.7  7.5 14.3   23  34.7  46.9   >50     –     –     –     –
MM-1  |  <1   <1   <1   <1   <1   1.2   1.4   1.3   1.5   1.6   1.8   2.7
MM-2  |  <1   <1   <1   <1   <1    <1     1   1.1   1.2   1.3   1.5   2.1
W2GN  |  <1   <1   <1   <1   <1    <1     1   1.1   1.3   1.3   1.8   1.5

Table 1: Comparison of L2-UVP (%) for LSOT, MM-1, MM-2 and W2GN (ours) methods in D = 2, 4, . . . , 2^12.

Figure 2: Comparison of convergence of W2GN (ours), MM-1, MM-2 approaches.

5.2 LATENT SPACE OPTIMAL TRANSPORT

Method                  | FID
AE: Dec(Enc(X))         | 7.5
AE Raw Decode: Dec(Z)   | 31.81
W2GN + AE: Dec(g(Z))    | 17.21
WGAN-QC: Gen(Z)         | 14.41

Table 2: FID scores for 64×64 generated images.

We test our algorithm on CelebA image generation (64×64). First, we construct the latent space distribution by using a non-variational convolutional auto-encoder to encode the images into 128-dimensional latent vectors. Next, we use a pair of DenseICNNs to fit a cyclically monotone mapping that transforms standard normal noise into the latent space distribution. In Figure 3, we present images generated directly by sampling from standard normal noise before (1st row) and after (2nd row) applying our transport map. While our generative mapping does not perform significant changes, its effect is seen visually as well as confirmed by an improvement of the Fréchet Inception Distance (FID, see Heusel et al. (2017)), see Table 2. For comparison, we also provide the score of a recent Wasserstein GAN by Liu et al. (2019). In Appendix C.5, we provide additional examples and a visualization of the latent space.

Figure 3: Images decoded from standard Gaussian latent noise (1st row) and decoded from the same noise transferred by our cycle monotone map (2nd row).

5.3 IMAGE-TO-IMAGE STYLE TRANSLATION

In the problem of unpaired style transfer, the learner gets two image datasets, each with its own attributes, e.g. each dataset consists of landscapes related to a particular season. The goal is to fit a mapping capable of transferring the attributes of one dataset to the other one, e.g. changing a winter landscape to a corresponding summer landscape. The principal model for solving this problem is CycleGAN by Zhu et al. (2017). It uses 4 networks and optimizes a minimax objective to train the model. Our method uses 2 networks and has a non-minimax objective. We experiment with ConvICNN potentials on the publicly available Winter2Summer and Photo2Cezanne datasets containing 256×256 pixel images. We train our model on mini-batches of randomly cropped 128×128 pixel RGB image parts. The results on the Winter2Summer and Photo2Cezanne datasets applied to random 128×128 crops are shown in Figure 4.
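Returning to the Gaussian benchmark of Subsection 5.1: the ground-truth OT map between Gaussians is available in closed form, T(x) = μ_Q + A(x − μ_P) with A = Σ_P^{-1/2} (Σ_P^{1/2} Σ_Q Σ_P^{1/2})^{1/2} Σ_P^{-1/2}, and the L2-UVP metric can be evaluated against it as sketched below (NumPy/SciPy, our own illustrative code; `fitted_map` is a placeholder for the learned ∇ψθ, and Var(Q) is taken here as the trace of Σ_Q).

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu_p, cov_p, mu_q, cov_q):
    """Closed-form W2-optimal map between N(mu_p, cov_p) and N(mu_q, cov_q)."""
    root_p = np.real(sqrtm(cov_p))
    inv_root_p = np.linalg.inv(root_p)
    A = inv_root_p @ np.real(sqrtm(root_p @ cov_q @ root_p)) @ inv_root_p
    return lambda x: mu_q + (x - mu_p) @ A.T

def l2_uvp(fitted_map, true_map, x_samples, var_q):
    """Unexplained variance percentage: 100 * E_P ||fitted - true||^2 / Var(Q)."""
    diff = fitted_map(x_samples) - true_map(x_samples)
    return 100.0 * np.mean(np.sum(diff ** 2, axis=1)) / var_q

# Usage sketch: x_samples ~ N(mu_p, cov_p); var_q = np.trace(cov_q);
# true_map = gaussian_ot_map(mu_p, cov_p, mu_q, cov_q); fitted_map wraps grad psi_theta.
```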
(a) Winter2Summer dataset results. (b) Photo2Cezanne dataset results. Figure 4: Results of image-to-image style transfer by Conv ICNN, 128 128 pixel images. Our generative model fits a cycle monotone mapping. However, the desired style transfer may not be cycle monotone. Thus, our model may transfer only some of the required attributes. For example, for winter-to-summer transfer our model learned to colorize trees to green. Yet the model experiences problems with replacing snow masses with grass. As Seguy et al. (2017) noted, OT is permutation invariant. It does not take into account the relation between dimensions, e.g. neighbouring pixels or channels of one pixel. Thus, OT may struggle to fit the optimal generative mapping via convolutional architectures (designed to preserve the local structure of the image). Figures 4a and 4b demonstrate highlights of our model. Yet we provide examples when the model does not perform well in Appendix C.8. To fix the above mentioned issue, one may consider OT for the quadratic cost defined on the Gaussian pyramid of an image Burt & Adelson (1983) or, similar to perceptual losses used for super-resolution Johnson et al. (2016), consider perceptual quadratic cost. This statement serves as the challenge for our further research. 6 CONCLUSION In this paper, we developed an end-to-end algorithm with a non-minimax objective for training cyclically monotone generative mappings, i.e. optimal transport mappings for the quadratic cost. Additionally, we established theoretical justification for our method from the approximation point of view. The results of computational experiments confirm the potential of the algorithm in various practical problems: latent space mass transport, image-to-image color/style transfer, domain adaptation. ACKNOWLEDGEMENTS The work was partially supported by the Russian Foundation for Basic Research grant 21-51-12005 NNIO_a. Published as a conference paper at ICLR 2021 Pedro C Álvarez-Esteban, E Del Barrio, JA Cuesta-Albertos, and C Matrán. A fixed-point approach to barycenters in wasserstein space. Journal of Mathematical Analysis and Applications, 441(2): 744 762, 2016. Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 146 155. JMLR. org, 2017. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. ar Xiv preprint ar Xiv:1701.07875, 2017. Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 447 463, 2018. Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375 417, 1991. Peter Burt and Edward Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on communications, 31(4):532 540, 1983. Rick Chartrand, Brendt Wohlberg, Kevin Vixie, and Erik Bollt. A gradient descent solution to the monge-kantorovich problem. Applied Mathematical Sciences, 3(22):1071 1080, 2009. Yize Chen, Yuanyuan Shi, and Baosen Zhang. Optimal control via neural networks: A convex approach. ar Xiv preprint ar Xiv:1805.11835, 2018. Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853 1865, 2016. 
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pp. 3730 3739, 2017. Werner Fenchel. On conjugate convex functions. Canadian Journal of Mathematics, 1(1):73 77, 1949. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626 6637, 2017. Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694 711. Springer, 2016. Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic. uchicago. edu/shai/papers/Kakade Shalev Tewari09. pdf, 2(1), 2009. Leonid Kantorovitch. On the translocation of masses. Management Science, 5(1):1 4, 1958. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. ar Xiv preprint ar Xiv:1710.10196, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Published as a conference paper at ICLR 2021 Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng David Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1 21, 2019. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203 2213, 2017. Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein gan with quadratic transport cost. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4832 4841, 2019. Ashok Vardhan Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason D Lee. Optimal transport mapping via input convex neural networks. ar Xiv preprint ar Xiv:1908.10962, 2019. Robert J Mc Cann et al. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309 324, 1995. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. Youssef Mroueh and Tom Sercu. Fisher gan. In Advances in Neural Information Processing Systems, pp. 2513 2523, 2017. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271 279, 2016. François-Pierre Paty, Alexandre d Aspremont, and Marco Cuturi. Regularity as regularization: Smooth and strongly convex brenier potentials in optimal transport. ar Xiv preprint ar Xiv:1905.10812, 2019. Gabriel Peyré. Mathematical foundations of data sciences. def, 1(2π):2π, 2018. Gabriel Peyré, Marco Cuturi, et al. 
Computational optimal transport. Foundations and Trends R in Machine Learning, 11(5-6):355 607, 2019. Julien Rabin, Sira Ferradans, and Nicolas Papadakis. Adaptive color transfer with relaxed optimal transport. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 4852 4856. IEEE, 2014. Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multi-source domain adaptation under target shift. ar Xiv preprint ar Xiv:1803.04899, 2018. Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ar Xiv preprint ar Xiv:1505.05770, 2015. Ralph Rockafellar. Characterization of the subdifferentials of convex functions. Pacific Journal of Mathematics, 17(3):497 510, 1966. Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. ar Xiv preprint ar Xiv:1711.02283, 2017. Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. ar Xiv preprint ar Xiv:1610.04490, 2016. Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. ar Xiv preprint ar Xiv:1902.07197, 2019. Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003. Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein. Stabilizing adversarial nets with prediction methods. ar Xiv preprint ar Xiv:1705.07364, 2017. Published as a conference paper at ICLR 2021 Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223 2232, 2017. In Subsection A.1, we provide important additional properties of Wasserstein-2 distance and related L2-spaces required to prove our main theoretical results. In Subsection A.2, we prove our main Theorems 4.1 and 4.2. Additionally, we show how our proofs can be translated to the simpler case (Theorem A.3), i.e. optimizing correlations with a single potential (8) by using the basic approach of Taghvaei & Jalali (2019). In Subsection A.3, we justify the pipeline of Figure 13. To do this, we prove a theorem that makes it possible to estimate the quality of the latent space generative mapping combined with the decoding part of the encoder. In Subsection A.4, we prove a useful fact which makes it possible to apply our method to distributions which do not have a density. A.1 PROPERTIES OF WASSERSTEIN-2 METRIC AND RELATION TO L2 SPACES To prove our results, we need to introduce Kantorovich s formulation of Optimal Transport Kantorovitch (1958) which extends Monge s formulation (3). For a given transport cost c : X Y R and probability distributions P and Q on X and Y respectively, we define Cost(P, Q) = min µ Π(P,Q) X Y c(x, y)dµ(x, y), (17) where Π(P, Q) is the set of all probability measures on X Y whose marginals are P and Q respectively (transport plans). If the optimal transport solution exists in the form of mapping g : X Y minimizing (3), then the optimal transport plan in (17) is given by µ = [id, g ] P. Otherwise, formulation (17) can be viewed as a relaxation of (3). 
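Kantorovich's relaxation (17) is easiest to see in the discrete case: for empirical measures it becomes a small linear program over transport plans. The sketch below (NumPy/SciPy, our own illustration, not part of the paper's method) solves it directly for the quadratic cost used throughout the paper; the sample sizes are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_plan(x, y, a=None, b=None):
    """Discrete Kantorovich problem (17): min_mu sum_ij c_ij mu_ij
    s.t. row sums of mu equal a, column sums equal b, mu >= 0."""
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n) if a is None else a
    b = np.full(m, 1.0 / m) if b is None else b
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)  # c(x,y) = 1/2 |x - y|^2
    # Marginal constraints on the flattened plan mu (row-major layout).
    A_rows = np.kron(np.eye(n), np.ones((1, m)))   # sum_j mu_ij = a_i
    A_cols = np.kron(np.ones((1, n)), np.eye(m))   # sum_i mu_ij = b_j
    res = linprog(cost.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    # res.fun equals W2^2 in the paper's convention (quadratic cost with the 1/2 factor).
    return res.x.reshape(n, m), float(res.fun)

x = np.random.randn(30, 2)
y = np.random.randn(40, 2) + 3.0
plan, w2_squared = kantorovich_plan(x, y)
print("W2 estimate:", np.sqrt(w2_squared), "plan mass:", plan.sum())
```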
For quadratic cost c(x, y) = 1 2 x y 2, the root of (17) defines Wasserstein-2 distance (W2), a metric on the space of probability distributions. In particular, it satisfies the triangle inequality, i.e. for every triplets of probability distributions P1, P2, P3 on X RD we have W2(P1, P3) W2(P1, P2) + W2(P2, P3). (18) We will also need the following lemma. Lemma A.1 (Lipschitz property of Wasserstein-2 distance). Let P, P be two probability distributions with finite second moments on X1 RD1. Let T : X1 X2 RD2 be a measurable mapping with Lipschitz constant bounded by L. Then the following inequality holds true: W2(T P, T P ) L W2(P, P ), (19) i.e. the distribution distance between P, P mapped by T does not exceed the initial distribution distance multiplied by Lipschitz constant L. Proof. Let µ Π(P1, P2) be the optimal transport plan between P1 and P2. Consider the distribution on X2 X2 given by µ = T µ , where mapping T is applied component-wise. The left and the right marginals of µ are equal to T P1 and T P2 respectively. Thus, µ Π(T P1, T P2) is a transport plan between T P1 and T P2. The cost of µ is not smaller than the optimal cost, i.e. W2 2(T P1, T P2) Z 2 dµ(x, x ). (20) Next, we use the Lipschitz property of T and derive Z 2 dµ(x, x ) = µ = T µ = Z T(x) T(x ) 2 2 dµ (x, x ) Published as a conference paper at ICLR 2021 2 dµ (x, x ) = L2 Z 2 dµ (x, x ) = L2 W2 2(P1, P2). (21) To finish the proof, we combine (20) with (21) and obtain the desired inequality (19). Throughout the rest of the paper, we use L2(X RD, P) to denote the Hilbert space of functions f : X RD with integrable square w.r.t. probability measure P. The corresponding inner product for f1, f2 L2(X RD, P) is denoted by f1, f2 P def = Z X f1(x), f2(x) d P(x). We use P = p , P to denote the corresponding norm induced by the inner product. Lemma A.2 (L2 inequality for Wasserstein-2 distance). Let P be a probability distribution on X1 RD1. Let T1, T2 L2(X1 RD2, P). Then the following inequality holds true: 1 2 T1(x) T2(x) 2 P W2 2(T1 P, T2 P). Proof. We define the transport plan µ = [T1, T2] P between T1 P and T2 P and, similar to the previous Lemma A.1, use the fact that its cost is not smaller than the optimal cost. A.2 PROOFS OF THE MAIN THEORETICAL RESULTS First, we prove our main Theorem 4.1. Then we formulate and prove its analogue (Theorem A.3) for the basic correlation optimization method (8) with single convex potential. Next, we prove our main Theorem 4.2. At the end of the subsection, we discuss the constants appearing in theorems: strong convexity and smoothness parameters. Proof of Theorem 4.1. We split the proof into three subsequent parts. Part 1. Upper Bound on Correlations. First, we establish a lower bound for regularized correlations Corr P, Q | ψ , ψ ; λ omitting the regularization term. Z X ψ (x)d P(x) + Z y, ψ (y) ψ ψ (y) d Q(y) = Z Y ψ ψ (y) d Q(y) + Z y, ψ (y) ψ ψ (y) d Q(y) = (22) Z ψ ψ (y) ψ ψ (y) d Q(y) + Z y, ψ (y) d Q(y) ψ ψ (y), ψ (y) ψ (y) + β 2 ψ (y) ψ (y) 2]d Q(y) + (23) Z Y y, ψ (y) d Q(y) + Z Y y, ψ (y) d Q(y) | {z } Corr(P,Q) Y y, ψ (x) d Q(y) | {z } Corr(P,Q) ψ ψ , ψ ψ Q + β 2 ψ ψ 2 Q id Y, ψ ψ Q + Corr(P, Q) = (25) ψ ψ id Y, ψ ψ Q + β 2 ψ ψ 2 Q + Corr(P, Q) = 1 2β ψ ψ id Y Q + ψ ψ id Y, ψ ψ Q + β 2 ψ ψ 2 Q + Corr(Q, P) 1 2β ψ ψ id Y 2 Q = Published as a conference paper at ICLR 2021 Corr(Q, P) + β ψ ψ id Y + p Q 1 2β ψ ψ id Y 2 Q. (26) In transition to line (22), we use change of variables formula P = ψ Q. 
In line (23), we use β -strong convexity of function ψ and then add zero term in line (24). Next, for simplicity, we replace integral notation with L2(Y RD, Q) notation starting from line (25). We add the omitted regularization term back to (26) and obtain the following bound: Corr P, Q | ψ , ψ ; λ Corr(Q, P) + 1 β ) ψ ψ id Y 2 Q + β ψ ψ id Y + p Since λ > 1 β , the obtained inequality proves that the true correlations Corr(P, Q) are upper bounded by the regularized correlations Corr P, Q | ψ , ψ ; λ . Note that if the optimal map ψ is β strongly convex, the bound (27) is tight. Indeed, it turns into equality when we substitute ψ = ( ψ ) 1 = ψ . Part 2. Inverse Generative Property. We continue the derivations of part 1. Let u = ψ ψ id Y and v = ψ ψ . By matching (27) with (14), we obtain β ) u 2 Q + 1 Now we derive an upper bound for v 2 Q. For a fixed u we have 1 p Next, we apply the triangle inequality: β ) u 2 Q + 1 p β u Q. (29) The expression of the right-hand side of (29) attains its maximal value r 2ϵ 1 1 λβ at u Q = q We conclude that ψ ψ 2 Q = v 2 Q 2ϵ β 1 Finally, we apply L2-inequality of Lemma A.2 to distribution Q, mappings ψ and ψ , and obtain W2 2( ψ P, Q) ϵ β 1 λ , i.e. the desired upper bound on the distance between the generated and target distribution. Part 3. Forward Generative Property. We recall the bound (28). Since all the summands are positive, we derive u 2 Q 2ϵ λ 1 β . (30) We will use (30) to obtain an upper bound on ψ ψ P. Published as a conference paper at ICLR 2021 To begin with, we note that since ψ is β -strongly convex, its convex conjugate ψ is 1 β -smooth. Thus, gradient ψ is 1 β -Lipschitz. We conclude that for all x, x X: ψ (x) ψ (x ) 1 β x x . (31) We raise both parts of (31) into the square, substitute x = ψ ψ (y) and x = ψ ψ (y) and integrate over Y w.r.t Q. The obtained inequality is as follows: Z Y ψ (y) ψ (y) 2d Q(y) 1 (β )2 Y ψ ψ (y) ψ ψ (y) 2d Q(y) (32) Next, we derive ψ ψ 2 Q = Z Y ψ (y) ψ (y) 2d Q(y) =y z }| { ψ ψ (y) ψ ψ (y) | {z } = u(y) 2d Q(y) = u 2 Q (β )2 . (33) In transition to line (33), we use the previously obtained inequality (32). Next, we use the triangle inequality for Q to bound ψ ψ Q ψ ψ Q + ψ ψ | {z } =v Q 2ϵ λ 1 β ( 1 λ β ) = r 2ϵ λβ 1 ( 1 p Next, we derive a lower bound for the left-hand side of (34) by using B -smoothness of ψ . For all x, x X we have ψ (x) ψ (x ) B x x . (35) We raise both parts of (35) to the square, substitute x = ψ (y) and x = ψ (y), and integrate over Y w.r.t. Q: Z Y ψ ψ (y) ψ ψ (y) 2d Q(y) (B )2 Z Y ψ (y) ψ (y) 2d Q(y) (36) Next, we use (36) to derive ψ ψ 2 Q = Z Y ψ (y) ψ (y) 2d Q(y) Z 1 (B )2 ψ ψ (y) | {z } =y ψ ψ (y) 2d Q(y) = Y ψ (x) ψ (x) 2d P(x) 2 (B )2 W2 2( ψ P, Q) (37) In line (37), we use the L2 property of Wasserstein-2 distance (Lemma A.2). We conclude that W2 2( ψ P, Q) (B )2 ϵ and finish the proof. It is quite straightforward to formulate analogous result for the basic optimization method with single potential (8). We summarise the statement in the following Published as a conference paper at ICLR 2021 Theorem A.3 (Generative Property for Approximators of Correlations). Let P, Q be two continuous probability distributions on Y = X = RD with finite second moments. Let ψ : Y R be the convex minimizer of Corr(P, Q|ψ). Let differentiable ψ is β -strongly convex and B -smooth (B β > 0) function ψ : X R satisfy Corr P, Q | ψ Z X ψ (x)d P(x) + Z ψ (y)d Q(y) + ϵ = Corr(P, Q) + ϵ. (38) Then the following inequalities hold true: 1. 
Forward Generative Property (map g = ψ pushes P to be O(ϵ)-close to Q) W2 2(g P, Q) = W2 2( ψ P, Q) B ϵ; 2. Inverse Generative Property (map (g ) 1 = ψ = ( ψ ) 1 pushes Q to be O(ϵ)-close to P) W2 2 (g ) 1 Q, P ϵ Proof. First, we note that ϵ 0 by the definition of ψ . Next, we repeat the first part of the proof of Theorem (4.1) by substituting ψ := ψ . Thus, by using ψ ψ = id Y we obtain the following simple analogue of formula (27): Corr(P, Q | ψ ) Corr(P, Q) + β 2 ψ ψ Q, (39) 2 ψ ψ Q. Thus, by using Lemma (A.2), we immediately derive W2 2( ψ Q, P) ϵ β , i.e. inverse generative property. To derive forward generative property, we note that B -smoothness of ψ means that ψ is 1 B -strongly convex. Thus, due to symmetry of the objective, we can repeat all the derivations w.r.t. ψ instead of ψ in order to prove W2 2( ψ P, Q) B ϵ, i.e. forward generative property. Proof of Theorem 4.2. We repeat the first part of the proof of Theorem 4.1, but instead of exploiting strong convexity of ψX to obtain an upper bound on the regularized correlations, we use BX - smoothness to obtain a lower bound. The resulting analogue (40) to (27) is as follows: Corr P, Q | ψX , ψY, λ Corr(P, Q) 1 2(λ 1 BX ) ψX ψY id Y 2 Q + BX ψX ψY id Y + BX ) ψX ψY id Y 2 Q + BX ψX ψY id Y Q + λ 2 ψX ψY id Y 2 Q + ψX ψY id Y Q ψ ψY Q + BX 2 ψ ψY 2 Q (42) Here in transition from line (40) to (41), we apply the triangle inequality. For every y Y we have ψX ψY(y) ψX ψ (y) BX ψY(y) ψ (y) . We raise both parts of this inequality to the power 2 and integrate over Y w.r.t. Q. We obtain ψX ψY ψX ψ 2 Q (BX )2 ψY ψ 2 Q (BX )2 ϵY. (43) Published as a conference paper at ICLR 2021 Now we recall that since ψ Q = P, we have ψX ψ ψ ψ | {z } id Y 2 Q = ψX ψ 2 P ϵX . (44) Next, we combine (43) and (44) apply the triangle inequality to bound ψX ψY id Y Q ψX ψY(y) ψX ψ (y) Q + ψX ψ ψ ψ | {z } id Y BX ϵY + ϵX . (45) We substitute all the bounds to (40): Corr P, Q | ψX , ψY, λ Corr(P, Q) λ 2 (BX ϵY + ϵX )2 + (BX ϵY + ϵX ) ( ϵY) + BX and finish the proof by using Corr P, Q | ψ , ψ , λ Corr P, Q | ψX , ψY, λ which follows from the definition of ψ , ψ . One may formulate and prove analogous result for the basic optimization method with a single potential (8). However, we do not include this in the paper since a similar result exists Taghvaei & Jalali (2019). All our theoretical results require smoothness or strong convexity properties of potentials. We note that the assumption of smoothness and strong convexity also appears in other papers on Wasserstein-2 optimal transport, see e.g. Paty et al. (2019). The property of B-smoothness of a convex function ψ means that its gradient ψ has Lipshitz constant bounded by B. In our case, constant B serves as a reasonable measure of complexity of the fitted mapping ψ: it estimates how much the mapping can warp the space. Strong convexity is dual to smooothness in the sense that a convex conjugate ψ to β-strongly convex function ψ is 1 β -smooth (and vise-versa) Kakade et al. (2009). In our case, β-strongly convex potential means that its inverse gradient mapping ( ψ) 1 = ψ can not significantly warp the space, i.e. has Lipshitz constant bounded by 1 Recall the setting of our Theorem 4.2. Assume that the optimal transport map ψ between P and Q is a gradient of β-strongly convex (β > 0) and B-smooth (B < ) function. 
In this case, by considering classes ΨX = ΨY equal to all min(β, 1 B)-strongly convex and max(B, 1 β )-smooth functions, by using our method (for any λ > 1 β ) we will exactly compute correlations and find the optimal ψ . A.3 FROM LATENT SPACE TO DATA SPACE In the setting of the latent space mass transport, we fit a generative mapping to the latent space of an autoencoder and combine it with the decoder to obtain a generative model. The natural question is how close decoded distribution is to the real data distribution S used to train an encoder. The following Theorem states that the distribution distance of the combined model can be naturally divided into two parts: the quality of the latent fit and the reconstruction loss of the auto-encoder. Theorem A.4 (Decoding Theorem). Let S be the real data distribution on S RK. Let u : S Y = RD be the encoder and v : Y RK be L-Lipschitz decoder. Assume that a latent space generative model has fitted a map g : X Y that pushes some latent distribution P on X = RD to be ϵ close to Q = u S in W2 2-sense, i.e. W2 2(g P, Q) ϵ. Published as a conference paper at ICLR 2021 Then the following inequality holds true: W2( v g P | {z } Generated data distribuon , S) L ϵ + 1 2 ES s v u(s) 2 2 | {z } Autoencoder s reconstrution loss where v g is the combined generative model. Proof. We apply the triangle inequality and obtain W2(v g P, S) W2(v g P, v Q) + W2(v Q, S). (48) Let P = g P be the fitted latent distribution (W2 2(P , Q) ϵ). We use Lipschitz Wasserstein-2 property of Lemma A.1 and obtain L ϵ L W2(P , Q) W2(v P , v Q) = W2(v g P, v Q). (49) Next, we apply L2-property (Lemma A.2) to mappings id S, v u and distribution S, and derive 1 2ES s v u(s) 2 2 W2 2(S, v Q). (50) The desired inequality (47) immediately results from combining (48) with (49), (50). A.4 EXTENSION TO THE NON-EXISTENT DENSITY CASE Our main theoretical results require distributions P, Q to have finite second moments and density on X = Y = RD. While the existence of second moments is a reasonable condition, in the majority of practical use-cases the density might not exist. Moreover, it is typically assumed that the supports of distributions are manifold of dimension lower than D or even discrete sets. One may artificially smooth distributions P, Q by convolving them with a random white Gaussian noise6 Λ = N(0, σ2ID) and find a generative mapping g : X Y between smoothed P Λ and Q Λ. For Wasserstein-2 distances, it is natural that the generative properties of g as a mapping between P Λ and Q Λ will transfer to generative properties of g as a mapping between P, Q, but with some bias depending on the statistics of Λ. Theorem A.5 (De-smoothing Wasserstein-2 Property). Let P, Q be two probability distributions on X = Y = RD with finite second moments. Let Λ = N(0, σ2ID) be a Gaussian white noise . Let P Λ and Q Λ be versions of P and Q smoothed by Λ. Let T : X Y be a L-Lipschitz measurable map satisfying W2(T [P Λ], [Q Λ]) ϵ. Then the following inequality holds true: W2(T P, Q) (L + 1)σ Proof. We apply the triangle inequality twice and obtain W2(T P, Q) W2(T P, T [P Λ]) + W2(T [P Λ], [Q Λ]) + W2([Q Λ], Q). Consider a transport plan µ(y, y ) Π([Q Λ], Q) satisfying µ(y | y) = Λ(y y). The cost of µ is given by Z 2 dΛ(y y )d Q(y) = Z 2 d Q(y) = Dσ2 Since the plan is not necessarily optimal, we conclude that W2 2([Q Λ], Q) Dσ2 2 . Analogously, we conclude that W2 2(P, [P Λ]) Dσ2 2 . 
Next, we apply Lipschitz property of Wasserstein-2 (Lemma A.1) and obtain: W2(T P, T [P Λ]) L W2(P, [P Λ]) = Lσ 6From the practical point of view, smoothing is equal to adding random noise distributed according to Λ to samples from P, Q respectively. Published as a conference paper at ICLR 2021 Finally, we combine all the obtained bounds and derive W2(T P, Q) (L + 1)σ B NEURAL NETWORK ARCHITECTURES In Subsection B.1, we describe the general architecture of the input convex networks. In this section, we describe particular realisations of the general architecture that we use in experiments: Dense ICNN in Subsection B.2 and Conv ICNN in Subsection B.3. B.1 GENERAL INPUT-CONVEX ARCHITECTURE We approximate convex potentials by Input Convex Neural Networks Amos et al. (2017). The overall architecture is schematically presented in Figure 5. Figure 5: General architecture of an Input Convex Neural Network. Input convex network consists of two principal blocks: 1. Linear (L) block consists of linear layers. Activation functions and pooling operators in the block are also linear, e.g. identity activation or average pooling. 2. Convexity preserving (CP) block consists of linear layers with non-negative weights (excluding biases). Activations and pooling operators in this block are convex and monotone. Within blocks it is possible to use arbitrary skip connections obeying the stated rules. Neurons of L Block can be arbitrarily connected to those of CP block by applying a convex activation7 and adding the result with a positive weight. It comes from the convex function arithmetic that every neuron (including the output one) in the architecture of Figure 5 is a convex function of the input.8 In our case, we expect the network to be able to easily fit the identity generative mapping g(x) = ψ(x) = x, i.e. ψ(x) = 1 2 x 2 + c is a quadratic function. Thus, we mainly insert quadratic activations between L and CP blocks, which differs from Amos et al. (2017) where no activation was used. Gradients of input quadratic functions correspond to linear warps of the input and are intuitively highly useful as building blocks (in particular, for fitting identity mapping). We use specific architectures which fit to the general scheme shown in Figure 5. Conv ICNN is used for image-processing tasks, and Dense ICNN is used otherwise. The exact architectures are described in the subsequent subsections. We use CELU function as a convex and monotone activation (within CP block) in all the networks. We have also tried Soft Plus among some other continuous and differentiable functions, yet this 7Unlike activations within convexity preserving block, convex activation between L and CP block may not be monotone, e.g. σ(x) = x2 can be used as an activation. 8It is possible into insert batch norm and dropout to L and CP blocks as well as between them. These layers do not affect convexity since they can be considered (during inference) as linear layers with non-negative weights. Published as a conference paper at ICLR 2021 negatively impacted the performance. The usage of Re LU function is also possible, but the gradient of the potential in this case will be discontinuous. Thus, it will not be Lipschitz, and the insights of our Theorems 4.1 and 4.2 may not work. As a convex and monotone pooling (within CP block), it is possible to use Average and Log Sum Exp pooling (smoothed max pooling). Pure Max pooling should be avoided for the same reason as Re LU activation. 
However, in Conv ICNN architecture we use convolutions with stride instead of pooling, see Subsection B.3. In order to use insights of Theorem 4.1, we impose strong convexity and smoothness on the potentials. As we noted in Appendix A.2, B-smoothness of a convex function is equal to 1 B strong convexity of its conjugate function (and vise versa). Thus, we make both networks ψθ, ψω to be β := 1 B strongly convex, and cycle regularization keeps 1 β = B smoothness for ψθ (ψω) 1 and ψω (ψθ) 1. In practice, we achieve strong convexity by adding extra value β 2 x 2 to the output of the final neuron of a network. In all our experiments, we set β 1 = 1000000.9 In addition to smoothing, strong convexity guarantees that ψθ and ψω are bijections, which is used in Theorems 4.1, 4.2. B.2 DENSE INPUT CONVEX NEURAL NETWORK For Dense ICNN, we implement Convex Quadratic layer each output neuron of which is a convex quadratic function of input. More precisely, for each input x RNin it outputs (cq1(x), . . . , cq Nout(x)) RNout, with cqn(x) = x, Anx + bn, x + cn for positive semi-definite quadratic form A RNin Nin, vector b RNin and constant c R. Note that for large Nin, the size of such layer grows fast, i.e. O(N 2 in Nout). To fix this issue, we represent each quadratic matrix as a product An = F T n Fn, where F Rr Nin is the matrix of rank at most r. This helps to limit optimization to only positive quadratic forms (and, in particular, symmetric), and reduce the number of weights stored for quadratic part to O(r Nin Nout). Actually, the resulting quadratic forms An will have rank at most r. The architecture is shown in Figure 6. We use Convex Quadratic Layers in Dense ICNN for connecting input directly to layers of a fully connected network. Note that such layers (even at full rank) do not blow the size of the network when the input dimension is low, e.g. in the problem of color transfer. Figure 6: Dense Input Convex Neural Network. The hyperparameters of Dense ICNN are widths of the layers and ranks of the convex input-quadratic layers. For simplicity, we use the same rank r for all the layers. We denote the width of the first convex quadratic layer by h0 and the width of k + 1-th Convex Quadratic and k-th Linear layers by hk. The complete hyperparameter set of the network is given by [r; h0; h1, . . . , h K]. 9Imposing smoothness & strong convexity can be viewed as a regularization of the mapping: it does not perform too large/small warps of the input. See e.g. Paty et al. (2019). Published as a conference paper at ICLR 2021 B.3 CONVOLUTIONAL INPUT CONVEX NEURAL NETWORK We apply convolutional networks to the problem of unpaired image-to-image style transfer. The architecture of Conv ICNN is shown in Figure 7. The network takes an input image (128 128 with 3 RGB channels) and outputs a single value. The gradient of the network w.r.t. the input serves as a generator in our algorithm. Figure 7: Convolutional Input Convex Neural Network. All convolutional layers have 128 channels. Linear and Convexity preserving blocks are successive, and no skip connections are used. Block L consists of two separate parts with stacked convolutions without intermediate activation. The square of the second part is added to the first part and is used as an input for the CP block. All convolutional layers of the network have 128 channels (zero-padding with offset = 1 is used). C EXPERIMENTAL DETAILS AND EXTRA RESULTS In the first subsection, we describe general training details. 
C EXPERIMENTAL DETAILS AND EXTRA RESULTS

In the first subsection, we describe general training details. In the second subsection, we discuss the computational complexity of a single gradient step of our method. Each subsequent subsection corresponds to a particular problem and provides additional experimental results and training details: toy experiments in Subsection C.3, comparison with the minimax approaches in Subsection C.4, latent space optimal transport in Subsection C.5, image-to-image color transfer in Subsection C.6, domain adaptation in Subsection C.7, and image-to-image style transfer in Subsection C.8.

C.1 GENERAL TRAINING DETAILS

The code is written using the PyTorch framework. The networks are trained on a single GTX 1080Ti. The numerical optimization procedure is provided in Algorithm 1 of Section 4.1. In each experiment, both the primal ψθ and the conjugate ψω potentials have the same network architecture. The minimization of (12) is done via mini-batch stochastic gradient descent with weight clipping (excluding biases) in the CP block to [0, +∞).¹⁰ We use the Adam Kingma & Ba (2014) optimizer.

For every particular task, we pretrain the potential network ψθ by minimizing the mean squared error so that ∇ψθ(x) ≈ x, and copy the weights to ψω. This provides a good initialization for the main training, i.e. ∇ψθ and ∇ψω are approximately mutually inverse.

Doing tests, we noted that our method converges faster if we disable back-propagation through the term ∇ψω which appears twice in the second line of (12). In this case, the derivative w.r.t. ω is computed by using the regularization terms only.¹¹ This heuristic allows us to save additional memory and computational time because a smaller computational graph is built. We used the heuristic in all the experiments.

In the experiments with high-dimensional data (latent space optimal transport, domain adaptation and style transfer), we add the following extra regularization term to the main objective (12):
$$R_X(\theta,\omega) = \int_X \big\| g_\omega^{-1}\circ g_\theta(x) - x \big\|^2 d\mathbb{P}(x) = \int_X \big\| \nabla\psi_\omega\circ\nabla\psi_\theta(x) - x \big\|^2 d\mathbb{P}(x). \quad (51)$$
Term (51) is analogous to the term R_Y(θ, ω) given by (11). It also keeps the forward gθ and inverse g_ω⁻¹ generative mappings approximately mutually inverse. From the theoretical point of view, it is straightforward to obtain approximation guarantees similar to those of Theorems 4.1 and 4.2 for the optimization with the two terms R_X and R_Y. However, we do not include R_X in the proofs in order to keep them simple.

¹⁰ We also tried to use softplus, exponent on the weights and regularization instead of clipping, but none of these worked well.
¹¹ The term ⟨∇ψω(y), y⟩ becomes redundant for the optimization. Yet it remains useful for monitoring the convergence.
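A hedged sketch of the identity pretraining and the per-step weight clipping described above is given below (our own illustration, not the exact training code; the sampler sample_p, the clip_cp_weights helper from the sketch in Appendix B, and all hyperparameters are placeholder assumptions).

```python
import torch

def pretrain_to_identity(psi, sample_p, n_iters=1000, batch=1024, lr=1e-3):
    """Pretrain an ICNN potential so that grad psi(x) ~= x, i.e. psi(x) ~= ||x||^2 / 2."""
    opt = torch.optim.Adam(psi.parameters(), lr=lr)
    for _ in range(n_iters):
        x = sample_p(batch).requires_grad_(True)
        # The gradient of the potential w.r.t. its input is the generative map.
        grad = torch.autograd.grad(psi(x).sum(), x, create_graph=True)[0]
        loss = ((grad - x) ** 2).sum(dim=1).mean()     # MSE to the identity map
        opt.zero_grad()
        loss.backward()
        opt.step()
        psi.clip_cp_weights()                          # keep CP weights in [0, +inf)
    return psi
```

After this pretraining, the weights of ψθ would be copied to ψω, e.g. via psi_omega.load_state_dict(psi_theta.state_dict()).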
C.2 COMPUTATIONAL COMPLEXITY

The time required to evaluate the value of (12) and its gradient w.r.t. θ, ω is comparable, up to a constant factor, to that of a single evaluation of ψθ(x). This claim follows from the well-known fact that the gradient evaluation ∇θ hθ(x) of hθ : R^D → R, when parameterized as a neural network, requires time proportional to the size of the computational graph. Hence, gradient computation requires computational time proportional to the time for evaluating the function hθ(x) itself. The same holds true when computing the derivative with respect to the input x. Thus, the number of operations required to compute the different terms in (12), e.g. ∇ψω(y), ψθ(∇ψω(y)) and ∇ψθ(∇ψω(y)), is also linear w.r.t. the computation time of ψθ(x) or, equivalently, ψω(y). As a consequence, the time required for the forward pass of (12) is larger than the forward pass of ψθ(x) only up to a constant factor. Thus, the backward pass for (12) with respect to the ICNN parameters θ and ω is also linear in the computation time of ψθ(x).

We empirically measured that for our Dense ICNN potentials, the computation of the gradient of (12) w.r.t. the parameters θ, ω requires roughly 8-12x more time than the computation of ψθ(x). Evaluating ∇ψθ, ∇ψω takes roughly 3-4x more time than evaluating ψθ(x).

C.3 TOY EXPERIMENTS

In this subsection, we test our algorithm on 2D toy distributions from Gulrajani et al. (2017); Seguy et al. (2017). In all the experiments, distribution P is standard Gaussian noise and Q is a Gaussian mixture or a Swiss roll. Both the primal and the conjugate potentials ψθ and ψω have the Dense ICNN [2; 128; 128, 64] architecture. Each network has roughly 25000 trainable parameters. Some of them vanish during the training because of the weight clipping. For each particular problem the networks are trained for 30000 iterations with 1024 samples in a mini-batch. The Adam optimizer Kingma & Ba (2014) with lr = 10⁻³ is used. We put λ = 1 in our cycle regularization and impose an additional 10⁻¹⁰ L1 regularization on the weights.

For the case when Q is a mixture of 8 Gaussians, the intermediate learned distributions are shown in Figure 8. The overall structure of the forward mapping has already been learned by iteration 200, while the inverse mapping gets learned only by iteration 2000. This can be explained by the smoothness of the desired forward and inverse optimal mappings. The inverse mapping has a large Lipschitz constant because it has to unsqueeze the dense masses of the 8 Gaussians. In contrast, the forward mapping has to squeeze the distribution. Thus, it is expected to have a lower Lipschitz constant (everywhere except the neighbourhood of the central point, which is a fixed point of the mapping due to symmetry).

(a) Initial distributions P and Q. (b) Generated distributions after 200 iterations. (c) Generated distributions after 2000 iterations. (d) Generated distributions after 30000 iterations.
Figure 8: Convergence stages of our algorithm applied to fitting cycle monotone mappings (forward and inverse) between distributions P (Gaussian) and Q (mixture of 8 Gaussians).

Additional examples (Gaussian mixtures & Swiss roll) are shown in Figures 10a, 10b and 9. When Q is a mixture of 100 Gaussians (Figure 9), our model learns all of the modes and does not suffer from mode dropping. We do not state that the fit is perfect, but we emphasize that mode collapse also does not happen.

Figure 9: Mixture of 100 Gaussians Q and the distribution ∇ψθ♯P fitted by our algorithm.
(a) Mixture of 49 Gaussians Q and the distribution ∇ψθ♯P ≈ Q fitted by our algorithm. (b) Swiss Roll distribution Q and the distribution ∇ψθ♯P ≈ Q fitted by our algorithm.
Figure 10: Toy distributions fitted by our algorithm.

Note that all our theoretical results require the distributions P, Q to be smooth. Our method fits a continuous optimal transport map via ICNNs with the CELU activation, which are differentiable w.r.t. the input. At the same time, Makkuva et al. (2019) uses ICNNs with a discontinuous gradient w.r.t. the input, e.g. with ReLU activations, to fit discontinuous generative mappings. We do not know whether our theoretical results can be directly generalized to the case of discontinuous mappings (without using smoothing as we suggested in Subsection A.4). However, we note that the usage of ICNNs with a discontinuous gradient naturally leads to torn generated distributions, see an example in Figure 11. While the fitted mapping is indeed close to the true Swiss Roll in the W2 sense, it clearly suffers from the torn effect. From the practical point of view, this effect seems to be similar to mode collapse, a well-known disease of GANs.

Figure 11: An example of a "torn" generative mapping to a Swiss Roll by a gradient of an ICNN with ReLU activations.
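For reference, evaluating the fitted forward map on the toy data amounts to taking the input-gradient of the trained potential. A minimal sketch (our own illustration, assuming an already trained potential psi_theta):

```python
import torch

def push_forward(psi_theta, x):
    """Map samples x ~ P through g_theta(x) = grad_x psi_theta(x)."""
    x = x.clone().requires_grad_(True)
    y = torch.autograd.grad(psi_theta(x).sum(), x)[0]
    return y.detach()

# Example: map 1024 samples of 2D standard Gaussian noise.
# x = torch.randn(1024, 2)
# y = push_forward(psi_theta, x)   # approximates samples from Q (e.g. an 8-Gaussian mixture)
```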
C.4 GAUSSIAN OPTIMAL TRANSPORT DETAILS AND DISCUSSION

We consider the Gaussian setting P, Q = N(0, Σ_P), N(0, Σ_Q), for which the ground truth OT solution has a closed form, see (Álvarez-Esteban et al., 2016, Theorem 2.3). Considering non-centered P, Q is unnecessary since $\mathbb{W}_2^2(\mathbb{P},\mathbb{Q}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|^2 + \mathbb{W}_2^2(\mathbb{P}_0,\mathbb{Q}_0)$, where P_0, Q_0 are the centered copies of P, Q. In D-dimensional space, $\sqrt{\Sigma_{\mathbb{P}}}$ ($\sqrt{\Sigma_{\mathbb{Q}}}$ analogously) is initialized as $S_{\mathbb{P}}^T \Lambda S_{\mathbb{P}}$, where $S_{\mathbb{P}}\in\mathbb{O}_D$ is a random rotation and Λ is diagonal with eigenvalues $[\tfrac12,\dots,\tfrac12 b^k,\dots,2]$, $b=\sqrt[D-1]{4}$. The ground truth optimal transport map from P to Q is linear and given by
$$\nabla\psi^*(x) = \Sigma_{\mathbb{P}}^{-1/2}\big(\Sigma_{\mathbb{P}}^{1/2}\Sigma_{\mathbb{Q}}\Sigma_{\mathbb{P}}^{1/2}\big)^{1/2}\Sigma_{\mathbb{P}}^{-1/2}\,x.$$

We compare our approach with the method by Seguy et al. (2017) [LSOT] and the minimax approaches by Taghvaei & Jalali (2019) [MM-1] and Makkuva et al. (2019) [MM-2]. We recall the details of LSOT's regularized optimization of OT maps and distances. The objective is given by
$$\int_X \psi_\theta(x)\, d\mathbb{P}(x) + \int_Y \psi_\omega(y)\, d\mathbb{Q}(y) + \frac{1}{4\epsilon}\int \big(\langle x, y\rangle - \psi_\theta(x) - \psi_\omega(y)\big)_+^2\, d(\mathbb{P}\otimes\mathbb{Q})(x,y),$$
where we defined $\psi_\theta(x) = \frac{\|x\|^2}{2} - u_\theta(x)$ and $\psi_\omega(y) = \frac{\|y\|^2}{2} - v_\omega(y)$ to make the LSOT notation (uθ, vω) match our notation (ψθ, ψω). We use the L2 regularizer (not entropy) since it empirically works better (as noted in the LSOT paper). The potentials ψθ, ψω are NOT restricted to be convex. The transport map $\hat{T}$ from P to Q is recovered via the barycentric projection, see (Seguy et al., 2017, Section 4). We set λ = min(D, 50) for our method and ϵ = 0.01 for LSOT (chosen empirically). In all the methods we use the Dense ICNN [1; D; D, D/2] of Subsection B.2. In LSOT, we do not convexify the nets (do not clamp the weights), i.e. they are "usual" unrestricted neural networks (empirically selected as the best option), as originally implied in the LSOT paper.

It follows from Table 1 in Section 5.1 that LSOT leads to a high bias error which grows drastically with the dimension. While theoretically ϵ → 0 should solve the bias problem, in practice small ϵ leads to optimization instabilities. Another drawback of LSOT is that estimation of the regularizer requires sampling from the joint measure P ⊗ Q on R^{2D}. For the majority of pairs (x, y), the L2 regularizer vanishes (the effect worsens as D grows). In contrast to LSOT, our cycle regularizer uses samples only from the marginal measures (Q or P) on R^D. The biasing effect is studied theoretically (our Theorems 4.1, 4.2), and the bias is absent when the optimal potential and its conjugate are contained in the approximating function classes.

Figure 12: Comparison of the convergence speed of the W2GN, MM-1 and MM-2 approaches in dimensions D = 64, 256, 1024, 4096.

The W2GN (ours), MM-1 and MM-2 methods are capable of computing maps and distances with low error (L2-UVP < 3% even in R^{4096}). However, as seen from the convergence plots in Figure 12, our approach converges several times faster: this naturally follows from the fact that the MM-1 and MM-2 approaches contain an inner optimization cycle.
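For reference, the closed-form optimal map between centered Gaussians used as the ground truth above can be computed directly. The sketch below (our own NumPy illustration; the geometric eigenvalue spread in [1/2, 2] is a simplified stand-in for the exact spectrum described above) builds a random covariance and the linear map $x \mapsto \Sigma_{\mathbb{P}}^{-1/2}(\Sigma_{\mathbb{P}}^{1/2}\Sigma_{\mathbb{Q}}\Sigma_{\mathbb{P}}^{1/2})^{1/2}\Sigma_{\mathbb{P}}^{-1/2}x$.

```python
import numpy as np

def sqrtm_psd(m):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def random_cov(dim, rng):
    """Random rotation of a fixed spread of eigenvalues (illustrative spectrum in [1/2, 2])."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # random orthogonal matrix
    eig = np.geomspace(0.5, 2.0, dim)
    return q.T @ np.diag(eig) @ q

def gaussian_ot_map(cov_p, cov_q):
    """Linear Monge map x -> A x between N(0, cov_p) and N(0, cov_q)."""
    p_half = sqrtm_psd(cov_p)
    p_half_inv = np.linalg.inv(p_half)
    middle = sqrtm_psd(p_half @ cov_q @ p_half)
    return p_half_inv @ middle @ p_half_inv

rng = np.random.default_rng(0)
cov_p, cov_q = random_cov(8, rng), random_cov(8, rng)
A = gaussian_ot_map(cov_p, cov_q)
x = rng.multivariate_normal(np.zeros(8), cov_p, size=4096)
y = x @ A.T      # y is (approximately) distributed as N(0, cov_q)
```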
C.5 LATENT SPACE OPTIMAL TRANSPORT DETAILS

We follow the pipeline of Figure 13 below. The latent space distribution is constructed by using a convolutional auto-encoder to encode CelebA images into 128-dimensional latent vectors. To train the auto-encoder, we use a perceptual loss on the features of a pre-trained VGG-16 network.

Figure 13: The pipeline of latent space mass transport.

We use Dense ICNN [4; 256; 256, 128, 64] to fit a cyclically monotone generative mapping that transforms standard normal noise into the latent space distribution. For each problem, the networks are trained for 100000 iterations with 128 samples in a mini-batch. The Adam optimizer with lr = 3·10⁻⁴ is used. We put λ = 100 as the cycle regularization parameter. We provide additional examples of generated images in Figure 14. We also visualize the latent space distribution of the autoencoder and the distribution fitted by the generative map in Figure 15. The FID scores presented in Table 2 are computed via the PyTorch implementation of the FID score.¹² As a benchmark score, we added WGAN-QC by Liu et al. (2019).

Finally, we emphasize that the cyclically monotone generative mapping that we fit is explicit. Similarly to Normalizing Flows Rezende & Mohamed (2015) and in contrast to other methods, such as Lei et al. (2019), it provides a tractable density inside the latent space. Since ∇ψω = (∇ψθ)⁻¹ is differentiable and injective, one may use the change of variables formula for the density,
$$q(y) = \det\big[\nabla^2\psi_\omega(y)\big]\cdot p\big(\nabla\psi_\omega(y)\big),$$
to study the latent space distribution.

¹² https://github.com/mseitzer/pytorch-fid

Figure 14: Images decoded from standard Gaussian latent noise (1st row) and decoded from the same noise transferred by our cycle monotone map (2nd row).
Figure 15: A pair of main principal components of the CelebA autoencoder's latent space. From left to right: Z ∼ N(0, I) [blue], Z mapped by W2GN [red], the true autoencoder's latent space [green]. The PCA decomposition is fitted on the autoencoder's latent space [green].

C.6 COLOR TRANSFER

The problem of color transfer between two images¹³ is to map the color palette of one image into the palette of the other while keeping the result looking and feeling similar to the original. Optimal transport can be applied to color transfer, but it is sensitive to noise and outliers. To avoid these problems, several relaxations were proposed Rabin et al. (2014); Paty et al. (2019). These approaches solve a discrete version of the Wasserstein-2 OT problem. The computation of the optimal transport cost for large images is barely feasible or infeasible at all due to the extreme size of the color palettes. Thus, a reduction of the pixel color palette by k-means clustering is usually performed to make the OT computation feasible. Yet such a reduction may lose color information.

Our algorithm uses mini-batch stochastic optimization. Thus, it has no limitations on the size of the color palettes. During training, we sequentially input mini-batches of image pixels (∈ R³) into potential networks with the Dense ICNN [3; 128; 128, 64] architecture.¹⁴ The networks are trained for 5000 iterations with 1024 pixels in a mini-batch. The Adam optimizer with lr = 10⁻³ is used. We put λ = 3 as the cycle regularization parameter. We impose an extra 10⁻¹⁰ L1-penalty on the weights.

The color transfer results for ≈10-megapixel images are presented in Figure 16a. The corresponding color palettes are given in Figure 16b. Additional examples of color transfer are given in Figure 17.

¹³ Images may have unequal size. Yet they are assumed to have the same number of channels, e.g. RGB ones.
¹⁴ Since our model is parametric, the complexity of the fitted generative mapping g : R³ → R³ between the palettes depends on the size of the potential networks.

(a) Original images (on the left) and images obtained by color transfer (on the right). The sizes of the images are 3300×4856 (first) and 2835×4289 (second). (b) Color palettes (3000 random pixels, best viewed in color) for the original images (on the left) and for the images with transferred color (on the right).
Figure 16: Results of color transfer between high resolution images (≈10 megapixel) by a pixel-wise cycle monotone mapping.
(a) Original images (on the left) and images obtained by color transfer (on the right). (b) Color palettes (3000 random pixels, best viewed in color) for the original images (on the left) and for the images with transferred color (on the right).
Figure 17: Results of color transfer between images by a pixel-wise cycle monotone mapping.
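A sketch of how the fitted pixel-wise map could be applied to a large image is given below (our own illustration; image loading, the trained potential psi_theta and the batch size are placeholder assumptions).

```python
import torch

def transfer_colors(psi_theta, image, batch=65536):
    """Apply g = grad psi_theta pixel-wise to an (H, W, 3) float image with values in [0, 1]."""
    pixels = image.reshape(-1, 3)
    out = torch.empty_like(pixels)
    # No palette-size limit: the map is applied in plain mini-batches of pixels.
    for i in range(0, pixels.shape[0], batch):
        chunk = pixels[i:i + batch].clone().requires_grad_(True)
        out[i:i + batch] = torch.autograd.grad(psi_theta(chunk).sum(), chunk)[0]
    return out.clamp(0, 1).reshape(image.shape)
```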
C.7 DOMAIN ADAPTATION

The domain adaptation problem is to learn a model f (e.g. a classifier) from a source distribution Q. This model has to perform well on a different (but related) target distribution P. Most of the methods based on OT theory solve domain adaptation explicitly by transforming distribution P into Q and then applying the model f to the generated samples. In some cases the mapping g : X → Y (which transforms P to Q) is obtained by solving a discrete OT problem Courty et al. (2016; 2017); Redko et al. (2018), while some approaches adopt neural networks to estimate the mapping g Bhushan Damodaran et al. (2018); Seguy et al. (2017).

We address the unsupervised domain adaptation problem, which is the most difficult variant of this task. Labels are available only in the source domain, so we do not use any label information from the target domain. Our method trains g as the gradient of a convex function. It can be applied to newly arriving samples that are not present in the training set.

We test our model on the MNIST (≈60000 images; 28×28) and USPS (≈10000 images; rescaled to 28×28) digit datasets. We perform USPS → MNIST domain adaptation. To do this, we train a LeNet classifier h on MNIST (99% accuracy). Then, we apply h to both datasets and extract the 84 last-layer features. Thus, we form the distributions Q (features for MNIST) and P (features for USPS).

To fit a cycle monotone domain adaptation mapping, we use Dense ICNN [32; 128; 128, 128] potentials. We train our model on mini-batches of 64 samples for 10000 iterations with cycle regularization λ = 1000. We use the Adam optimizer with lr = 10⁻⁴ and impose a 10⁻⁷ L1-penalty on the weights of the networks.

Similarly to Seguy et al. (2017), we compare the accuracy of an MNIST 1-NN classifier f applied to features x ∼ P of USPS with the same classifier applied to the mapped features g(x). 1-NN is chosen as the classification model in order to eliminate any influence of the base classification model on the domain adaptation and to directly estimate the effect provided by our cycle monotone map. The results of the experiment are presented in Table 3. Since the domain adaptation quality highly depends on the quality of the extracted features, we repeat the experiment 3 times, i.e. we train 3 LeNet MNIST classifiers for feature extraction. We report the results with mean and standard deviation. For benchmarking purposes, we also add the score of the 1-NN classifier applied to the features of USPS transported to the MNIST features by discrete optimal transport. It can be considered as the most straightforward optimal transport map.¹⁵

¹⁵ In contrast to our method, it cannot be directly applied to out-of-train-sample examples. Moreover, its computation is infeasible for large datasets.
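A sketch of the 1-NN evaluation protocol described above (our own illustration; the feature tensors and labels are placeholder assumptions, and scikit-learn is used for the classifier):

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

def eval_domain_adaptation(psi_theta, mnist_feats, mnist_labels, usps_feats, usps_labels):
    """Fit 1-NN on source (MNIST) features; score it on raw and on mapped target (USPS) features."""
    knn = KNeighborsClassifier(n_neighbors=1).fit(mnist_feats.numpy(), mnist_labels.numpy())
    acc_raw = knn.score(usps_feats.numpy(), usps_labels.numpy())

    x = usps_feats.clone().requires_grad_(True)
    mapped = torch.autograd.grad(psi_theta(x).sum(), x)[0].detach()   # g(x) = grad psi_theta(x)
    acc_mapped = knn.score(mapped.numpy(), usps_labels.numpy())
    return acc_raw, acc_mapped
```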
                                       Repeat 1   Repeat 2   Repeat 3   Average (µ ± σ)
Target features                        75.7%      77%        75.4%      76 ± 0.8%
Mapped features (W2GN)                 80.6%      80.3%      82.7%      81.2 ± 1%
Mapped features (Discrete OT)          76%        75.7%      76.1%      75.9 ± 0.4%
Mapped features Seguy et al. (2017)    -          -          -          77.92%

Table 3: 1-NN classification accuracy on the USPS → MNIST domain adaptation problem.

Our reported scores are comparable to the ones reported by Seguy et al. (2017). We did not reproduce their experiments since Seguy et al. (2017) does not provide the source code for domain adaptation. Thus, we refer the reader directly to the paper's reported scores (Table 1 of Seguy et al. (2017), first column with scores).

For visualization purposes, we plot the two main components of the PCA decomposition of the feature spaces (for one of the conducted experiments) in Figure 18: MNIST features, USPS features mapped by our method, and original USPS features.

Figure 18: A pair of main principal components of the feature spaces. From left to right: MNIST feature space, USPS features mapped by W2GN, original USPS feature space. The PCA decomposition is fitted on the MNIST features. Best viewed in color (different colors represent different classes of digits 0-9).

C.8 IMAGE-TO-IMAGE STYLE TRANSFER

We experiment with Conv ICNN potentials on the publicly available¹⁶ Winter2Summer and Photo2Cezanne datasets containing 256×256 pixel images. We train our model on mini-batches of 8 randomly cropped 128×128 pixel RGB image parts. As an additional augmentation, we use random rotations (±π/18), random horizontal flips and the addition of small Gaussian noise (σ = 0.01). The networks are trained for 20000 iterations with cycle regularization λ = 35000. We use the Adam optimizer and impose an additional 10⁻¹ L1-penalty on the weights of the networks.

¹⁶ https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

Our scheme of style transfer between the datasets is presented in Figure 19.

Figure 19: Schematically presented image-to-image style transfer by a pair of Conv ICNNs fitted by our method.

We provide additional results for the Winter2Summer and Photo2Monet datasets in Figures 20b and 20a respectively. In all cases, our networks change colors but preserve the structure of the image. In none of the results did we note that the model removes large snow masses (for the winter-to-summer transform) or covers green trees with white snow (for summer-to-winter). We do not know the exact explanation for this issue, but we suppose that the desired image manipulation simply may not be cycle monotone.

(a) Results of cycle monotone image-to-image style transfer by Conv ICNN on the Photo2Cezanne dataset, 128×128 pixel images. (b) Results of cycle monotone image-to-image style transfer by Conv ICNN on the Winter2Summer dataset, 128×128 pixel images.
Figure 20: Additional results of image-to-image style transfer on the Winter2Summer and Photo2Monet datasets.

For completeness of the exposition, we provide some results of cases when our model does not perform well (Figure 21). In Figure 21a the model simply increases the green color component, while in Figure 21b it decreases this component. Although in many cases this is actually enough to transform winter to summer (or vice versa), sometimes more advanced manipulations are required.

(a) Winter-to-summer transform. (b) Summer-to-winter transform.
Figure 21: Cases when the cycle monotone image-to-image style transfer by Conv ICNN on the Winter2Summer dataset does not work well, 128×128 pixel images.
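For reference, the augmentation pipeline described at the beginning of this subsection could be implemented roughly as follows (our own torchvision sketch; the parameter values follow the text, but the composition order and the noise implementation are assumptions).

```python
import torch
import torchvision.transforms as T

def make_train_transform(noise_std=0.01):
    """Random 128x128 crops, small rotations, horizontal flips and additive Gaussian noise."""
    return T.Compose([
        T.RandomCrop(128),
        T.RandomRotation(degrees=10),          # +-pi/18 radians = +-10 degrees
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        T.Lambda(lambda img: (img + noise_std * torch.randn_like(img)).clamp(0, 1)),
    ])

# transform = make_train_transform()
# Mini-batches of 8 augmented crops would then be assembled by a standard DataLoader.
```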
In the described experiments, we applied our method directly to the original images without any specific preprocessing or feature extraction. The model captures some of the required attributes to transfer, but sometimes it does not produce the expected results. To fix this issue, one may consider OT for the quadratic cost defined on features extracted from the images or on embeddings of the images (similar to the domain adaptation in Subsection C.7 or the latent space mass transport in Section 5.2). This remains a challenge for our further research.