KERNEL NEURAL OPTIMAL TRANSPORT

Published as a conference paper at ICLR 2023

Alexander Korotin
Skolkovo Institute of Science and Technology & Artificial Intelligence Research Institute, Moscow, Russia
a.korotin@skoltech.ru

Daniil Selikhanovych
Skolkovo Institute of Science and Technology, Moscow, Russia
selikhanovychdaniil@gmail.com

Evgeny Burnaev
Skolkovo Institute of Science and Technology & Artificial Intelligence Research Institute, Moscow, Russia
e.burnaev@skoltech.ru

We study the Neural Optimal Transport (NOT) algorithm which uses the general optimal transport formulation and learns stochastic transport plans. We show that NOT with the weak quadratic cost may learn fake plans which are not optimal. To resolve this issue, we introduce kernel weak quadratic costs. We show that they provide improved theoretical guarantees and practical performance. We test NOT with kernel costs on the unpaired image-to-image translation task.

Figure 1: Unpaired image-to-image translation (one-to-many) by Kernel Neural Optimal Transport. (a) Celeba (female) → anime, 128×128. (b) Outdoor → church, 128×128.

1 INTRODUCTION

Neural methods have become widespread in Optimal Transport (OT) starting from the introduction of large-scale OT (Genevay et al., 2016; Seguy et al., 2018) and Wasserstein Generative Adversarial Networks (WGANs) (Arjovsky et al., 2017). Most existing methods employ the OT cost as the loss function to update the generator in GANs (Gulrajani et al., 2017; Sanjabi et al., 2018; Petzka et al., 2018). In contrast to these approaches, (Korotin et al., 2023; Rout et al., 2022; Daniels et al., 2021; Fan et al., 2022a) have recently proposed scalable neural methods to compute the OT plan (or map) and use it directly as the generative model.

In this paper, we focus on the Neural Optimal Transport (NOT) algorithm (Korotin et al., 2023). It is capable of learning optimal deterministic (one-to-one) and stochastic (one-to-many) maps and plans for quite general strong and weak (Gozlan et al., 2017; Gozlan & Juillet, 2020; Backhoff-Veraguas et al., 2019) transport costs. In practice, the authors of NOT test it on the unpaired image-to-image translation task (Korotin et al., 2023, §5) with the weak quadratic cost (Alibert et al., 2019, §5.2).

Contributions. We conduct a theoretical and empirical analysis of the saddle-point optimization problem of the NOT algorithm for the weak quadratic cost. We show that it may have many fake solutions which do not provide an OT plan, and that NOT indeed might recover them (§3.1). We propose weak kernel quadratic costs and prove that they resolve this issue (§3.2). Practically, we show how NOT with kernel costs performs on the unpaired image-to-image translation task (§5).

Notations. We use $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ to denote Polish spaces and $\mathcal{P}(\mathcal{X}), \mathcal{P}(\mathcal{Y}), \mathcal{P}(\mathcal{Z})$ to denote the respective sets of probability distributions on them. For a distribution $\mathbb{P}$, we denote its mean and covariance matrix by $m_{\mathbb{P}}$ and $\Sigma_{\mathbb{P}}$, respectively. We denote the set of probability distributions on $\mathcal{X}\times\mathcal{Y}$ with marginals $\mathbb{P}$ and $\mathbb{Q}$ by $\Pi(\mathbb{P},\mathbb{Q})$. For a measurable map $T:\mathcal{X}\times\mathcal{Z}\to\mathcal{Y}$ (or $T_x:\mathcal{Z}\to\mathcal{Y}$), we denote the associated push-forward operator by $T_{\sharp}$ (or $T_{x\sharp}$). We use $\mathcal{H}$ to denote a Hilbert space (feature space); its inner product is $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, and $\|\cdot\|_{\mathcal{H}}$ is the corresponding norm. For a function $u:\mathcal{Y}\to\mathcal{H}$ (feature map), we denote the respective positive definite symmetric (PDS) kernel by $k(y,y')\stackrel{\text{def}}{=}\langle u(y), u(y')\rangle_{\mathcal{H}}$.
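To make the feature-map notation concrete, the snippet below builds the Gram matrix of the bilinear kernel (the one induced by the identity feature map) and checks that it is positive semi-definite on a finite sample. This is only an illustration under our own naming; it is not part of the paper's implementation, and the characteristic property is defined next.

```python
# Minimal illustration: a feature map u defines a PDS kernel
# k(y, y') = <u(y), u(y')>_H.  Here u(y) = y, so H = R^D and k is bilinear.
# Names `u`, `bilinear_kernel`, `gram` are ours.
import torch

def u(y: torch.Tensor) -> torch.Tensor:
    # Identity feature map.
    return y

def bilinear_kernel(y: torch.Tensor, y_prime: torch.Tensor) -> torch.Tensor:
    # Pairwise k(y, y') = <u(y), u(y')>_H: (N, D) x (M, D) -> (N, M).
    return u(y) @ u(y_prime).T

ys = torch.randn(128, 16)            # 128 points in R^16
gram = bilinear_kernel(ys, ys)       # Gram matrix K_ij = k(y_i, y_j)

# A PDS kernel yields a positive semi-definite Gram matrix on any finite sample.
print("min eigenvalue:", torch.linalg.eigvalsh(gram).min().item())  # ~0 up to numerics
```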
A PDS kernel k : Y Y R is called characteristic if the kernel mean embedding P(Y) µ 7 u(µ) def = R X u(y)dµ(y) H is a one-to-one mapping. For a function ϕ : RD R { }, we denote its convex conjugate by ϕ(y) def = supx RD{ x, y ϕ (x)}. 2 BACKGROUND ON OPTIMAL TRANSPORT Strong OT formulation. For distributions P P(X), Q P(Y) and a cost function c : X Y R, Kantorovich s (Villani, 2008) primal formulation of the optimal transport cost (Figure 2a) is Cost(P, Q) def = inf π Π(P,Q) X Y c(x, y)dπ(x, y), (1) where the minimum is taken over all transport plans π, i.e., distributions on X Y whose marginals are P and Q. The optimal π Π(P, Q) is called the optimal transport plan. A popular example of an OT cost for X = Y = RD is the Wasserstein-2 (W2 2), i.e., formulation (1) for c(x, y) = 1 (a) Strong OT formulation (1). (b) Weak OT formulation (2). Figure 2: Strong (Kantorovich s) and weak (Gozlan et al., 2017) optimal transport formulations. Weak OT formulation. Let C : X P(Y) R be a weak cost (Gozlan et al., 2017), i.e., a function which takes a point x X and a distribution of y Y as inputs. The weak OT cost between P, Q is Cost(P, Q) def = inf π Π(P,Q) X C x, π( |x) dπ(x), (2) where π( |x) denotes the conditional distribution (Figure 2b). Weak OT (2) subsumes strong OT formulation (1) for C(x, µ) = R Y c(x, y)dµ(y). An example of a weak OT cost for X = Y = RD is the γ-weak (γ 0) Wasserstein-2 (W2,γ), i.e., formulation (2) with the γ-weak quadratic cost C2,γ x, µ def = Z 1 2 x y 2dµ(y) γ 2 Var(µ) = Z Y y dµ(y) 2dµ(y) + 1 γ 2 Var(µ), (3) where Var(µ) denotes the variance of µ: Var(µ) def = Z Y y dµ(y ) 2dµ(y) = 1 Y Y y y 2dµ(y)dµ(y ). (4) For γ = 0, the transport cost (3) is strong, i.e., W2 = W2,0. Published as a conference paper at ICLR 2023 If the cost C(x, µ) is lower bounded, lower-semicontinuous and convex in the second argument, then we say that it is appropriate. For appropriate costs, the minimizer π of (2) always exists (Backhoff-Veraguas et al., 2019, M1.3.1). Since Var(µ) is concave and non-negative, cost (3) is appropriate when γ [0, 1]. For appropriate costs the (2) admits the following dual formulation: Cost(P, Q) = sup f X f C(x)d P(x) + Z Y f(y)d Q(y), (5) where f are upper-bounded, continuous and not rapidly decreasing functions, see (Backhoff-Veraguas et al., 2019, M1.3.2), and f C(x) def = infµ P(Y){C(x, µ) R Y f(y)dµ(y)} is the weak C-transform. Figure 3: Implicit representation of a transport plan via function T : X Z Y. Neural Optimal Transport (NOT). In (Korotin et al., 2023), the authors propose an algorithm to implicitly learn an OT plan π with neural nets (Figure 3). They introduce a (latent) atomless distribution S P(Z), e.g., Z = RZ and S = N(0, IZ), and search for a function T : X Z Y (stochastic OT map) which satisfies T x S = π (y|x) for some OT plan π . That is, given x X, function T pushes the distribution S to the conditional distribution π (y|x) of an OT plan π . In particular, T satisfies the distributionpreserving condition T (P S) = Q. To get T , the authors use (5) to derive an equivalent dual form: Cost(P, Q) = sup f inf T C x, Tx S Z Z f Tx(z) d S(z) d P(x) + Z Y f(y)d Q(y), (6) where the inf is taken over measurable functions T : X Z Y. The functional under supf inf T is denoted by L(f, T). For every optimal potential f arg supf inf T L(f, T), it holds that T arg inf T L(f , T), (7) see (Korotin et al., 2023, Lemma 4). Consequently, one may extract optimal maps T from optimal saddle points (f , T ) of problem (6). 
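To make formulation (6) concrete, below is a minimal PyTorch-style sketch of stochastic gradient descent-ascent over a potential network and a stochastic map network for the γ-weak quadratic cost (3), on the toy Gaussian pair used later in the paper. The architectures, optimizers, batch sizes and step counts are illustrative choices of ours and do not reproduce the authors' Algorithm 1 settings.

```python
# Sketch of max-min optimization of (6): inner minimization over T_theta,
# outer maximization over the potential f_omega, for the gamma-weak
# quadratic cost (3) with the unbiased variance estimate (4).
import torch
import torch.nn as nn

D, Z_DIM, GAMMA, K = 2, 2, 0.5, 8       # data dim, latent dim, gamma, latents per x

T = nn.Sequential(nn.Linear(D + Z_DIM, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, D))   # T_theta(x, z)
f = nn.Sequential(nn.Linear(D, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))   # potential f_omega
opt_T = torch.optim.Adam(T.parameters(), lr=1e-4)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)

def sample_P(n): return 0.5 * torch.randn(n, D)      # toy P = N(0, (1/2)^2 I_2)
def sample_Q(n): return torch.randn(n, D)            # toy Q = N(0, I_2)

def weak_quadratic_cost(x, y):
    # x: (B, D); y: (B, K, D), K samples T(x, z) per input x.
    sq = ((x[:, None, :] - y) ** 2).sum(-1).mean(1) / 2        # (1/2) E ||x - y||^2
    pd = ((y[:, :, None, :] - y[:, None, :, :]) ** 2).sum(-1)  # pairwise ||y - y'||^2
    var = pd.sum((1, 2)) / (2 * K * (K - 1))                   # unbiased Var(mu), cf. (4)
    return sq - GAMMA / 2 * var

for step in range(10000):
    for _ in range(10):                                  # inner updates: inf over T
        x = sample_P(64)
        z = torch.randn(64, K, Z_DIM)
        y = T(torch.cat([x[:, None, :].expand(-1, K, -1), z], -1))
        loss_T = (weak_quadratic_cost(x, y) - f(y).mean(1).squeeze(-1)).mean()
        opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    x, q = sample_P(64), sample_Q(64)                    # outer update: sup over f
    z = torch.randn(64, K, Z_DIM)
    with torch.no_grad():
        y = T(torch.cat([x[:, None, :].expand(-1, K, -1), z], -1))
    loss_f = f(y).mean() - f(q).mean()                   # ascend L(f, T) in f
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```

The cost term in (6) does not depend on f, so it is dropped from the potential update; only the two expectations of f enter `loss_f`.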
In practice, saddle point problem (6) can be approached with neural nets fω, Tθ and the stochastic gradient descent-ascent (Korotin et al., 2023, Algorithm 1). The limitation (Korotin et al., 2023, M6) of NOT algorithm is that arg inf T set of f in (7) may contain not only optimal transport maps but other functions as well. As a result, the function T recovered from a saddle point (f , T ) may be not an optimal stochastic map. In this paper, we show that for the γ-weak quadratic cost (3) this may be problematic: the arg inf T sets might contain fake solutions T (M3.1). To resolve the issue, we propose kernel γ-weak quadratic costs (M3.2). Convex order. For two probability distributions P, Q on RD, we write P Q if for all convex functions h : RD R it holds R h(x)d P(x) R h(x)d Q(x). The relation " " is a partial order, i.e., not all P, Q are comparable. If P Q, then m P = m Q and ΣP ΣQ (Scarsini, 1998, Lemma 3). If P, Q are Gaussians, P Q holds if and only if m P = m Q and ΣP ΣQ (Scarsini, 1998, Theorem 4). For P, Q, we define the projection of P onto the convex set of distributions which are Q as Proj Q(P) = arg inf P Q W2 2(P, P ). (8) The infimum is attained uniquely (Gozlan & Juillet, 2020, Proposition 1.1). There exists a 1-Lipschitz, continuously diffirentiable and convex function ϕ : RD R satisfying Proj Q(P) = ϕ P, see (Gozlan & Juillet, 2020, Theorem 1.2) for details. Figure 4: Every optimal restricted potential ϕ satisfies ϕ P = Proj Q( 1 Weak OT with the quadratic cost (γ > 0). For the γ-weak quadratic cost (3) on X = Y = RD, (Gozlan & Juillet, 2020, M5), (Alibert et al., 2019, M5.2) prove that there exists a continuously differentiable convex function ϕ : RD R such that π Π(P, Q) is optimal if and only if R Y y dπ (y|x) = ϕ (x) holds true P-almost surely. In general, π is not unique. We say that ϕ is an optimal restricted potential. It may be not unique as a function RD R, but ϕ is uniquely defined P-almost everywhere. It holds true Published as a conference paper at ICLR 2023 that ϕ is 1 γ -Lipschitz; ϕ P = Proj Q( 1 γ P); it is the OT map between P and Proj Q( 1 γ P) for the strong quadratic cost (Figure 4). The function ϕ maximizes the dual form alternative to (5): W2 2,γ(P, Q) = max ϕ Cvx( 1 2 d P(x)+ Z X ϕ(x)d P(x)+ Z ϕ(y)d Q(y) , (9) where Cvx( 1 γ ) denotes the set of 1 γ -smooth convex functions ϕ : RD R. Duality formula (9) appears in (Alibert et al., 2019, M5.2), (Gozlan & Juillet, 2020, Theorem 1.2 & M5) but with different parametrization. In Appendix F, for completeness of the exposition, we derive (9) from the results of (Gozlan & Juillet, 2020) by the change of variables. 3 SOLVING ISSUES OF NEURAL OPTIMAL TRANSPORT In what follows, we consider X = Y RD. In M3.1, we theoretically derive that arg inf T sets (7) for the γ-weak quadratic cost (3) may indeed contain functions which are not stochastic OT maps. In M3.2, we introduce kernel weak quadratic costs and prove that they do not suffer from this issue, i.e., all functions in sets arg inf T for all optimal f are stochastic OT maps. In M3.3, we discuss the practical aspects of learning with kernels. We give the proofs of all the statements in Appendix G. 3.1 FAKE SOLUTIONS FOR THE WEAK QUADRATIC COST In this subsection, we consider X = Y = RD. We show that arg inf T L(f , T) sets (7) of optimal potentials f in (5) for the γ-weak quadratic cost (3), in general, may contain functions T which are not stochastic OT maps. We call such T fake solutions. 
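Since the examples below repeatedly invoke the Gaussian convex-order criterion recalled above (Scarsini, 1998, Theorem 4), here is a small numeric helper for it. The function name and the example covariances are ours and serve only as an illustration of the criterion.

```python
# For Gaussians, P <=_c Q iff m_P = m_Q and Sigma_Q - Sigma_P is PSD.
import torch

def gaussian_convex_order(m_p, cov_p, m_q, cov_q, tol=1e-8):
    """Return True iff N(m_p, cov_p) <=_c N(m_q, cov_q)."""
    same_mean = torch.allclose(m_p, m_q, atol=tol)
    min_eig = torch.linalg.eigvalsh(cov_q - cov_p).min()   # Sigma_P <= Sigma_Q ?
    return bool(same_mean and min_eig >= -tol)

m = torch.zeros(2)
print(gaussian_convex_order(m, 0.25 * torch.eye(2), m, torch.eye(2)))  # True:  N(0, (1/2)^2 I_2) <=_c N(0, I_2)
print(gaussian_convex_order(m, 4.00 * torch.eye(2), m, torch.eye(2)))  # False: the first Gaussian is too spread out
```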
To show why one should be concerned about fake solutions, we emphasize their key defect below. Lemma 1 (Fake solutions are not distribution-preserving). Let f arg supf inf T L(f, T) and T arg inf T L(f , T) be a fake solution. Then it holds that T (P S) = Q. Throughout the section, we assume that P, Q have finite second moments. We analyse the potentials of the form f (y) = 1 2 y 2 ϕ (y), where ϕ : RD R is an optimal restricted potential. To begin with, we show that such potentials are indeed optimal potentials for dual forms (5) and (6). Lemma 2 (Optimal restricted potentials provide dual form maximizers). Let ϕ : RD R be an optimal restricted potential. Assume that ϕ takes only finite values. Then f (y) def = 1 2 y 2 ϕ (y) maximizes dual formulations (5) and (6). For a convex ψ : RD R, we denote the area around point y RD in which ψ is linear by Uψ(y) = {y RD such that x yψ(y) it holds ψ(y ) = ψ(y) + x, y y } {y}. (10) Proposition 1 (Convexity of sets of local linearity of a convex function). Set Uψ(y) is convex. Our following theorem provides a full characterization of the arg inf T L(f , T) sets in view. Theorem 1 (Characterization of saddle points with optimal restricted f ). Let f (y) = y 2 2 ϕ (y), where ϕ is an optimal restricted potential. Assume that ϕ takes only finite values. Then it holds true that arg inf T L(f , T) is a convex set and T arg inf T L(f , T) R Z T x(z)d S(z) = ϕ (x) holds true P-almost everywhere; T x(z) Uψ ϕ (x) holds true P S-almost everywhere, (11) where ψ(y) def = ϕ (y) γ 2 y 2. Note that ψ is convex since ϕ is 1 γ -smooth (Kakade et al., 2009). We define the optimal barycentric projection T x(z) def = ϕ (x) for (x, z) X Z; it does not depend on z Z. The function depends on the choice of optimal ϕ ; we are interested only in its values in the support of P, where ϕ is unique (M2). From definition (10), we see that ϕ (x) Uψ ϕ (x) and T satisfies both conditions on the right side of (11). Thus, we have T arg inf T L(f , T). Lemma 3 (The barycentric projection is not always a stochastic OT map). The following holds true T is a stochastic OT map (a) Proj Q 1 γ P = Q (b) = T is the unique stochastic OT map. We use the word stochastic but T is actually deterministic since it does not depend on z. From our Lemma 3, we derive that if Proj Q( 1 γ P) = Q, it holds that (f , T ) is a fake saddle point. Published as a conference paper at ICLR 2023 (a) Input and target distributions P and Q. (b) Fitted map ˆTx(z) for γ = 1 (c) Fitted map ˆTx(z) for γ = 3 (d) Fitted map ˆTx(z) for γ = 1. Figure 5: Stochastic maps b T between P and Q fitted by NOT algorithm with costs C2,γ for various γ. Corollary 1 (Existence of fake saddle points). Assume that Proj Q( 1 γ P) = Q. Then problem (6) has optimal saddle points (f , T ) in which T is not a stochastic OT map. Beside T , our Theorem 1 can be used to construct arbitrary many fake solutions which are not OT maps. Let T be any stochastic OT map and T arg inf T L(f , T) satisfy T = T and Var T (P S) Var(Q). For example, T may be another stochastic OT map or the optimal barycentric projection T . For any α (0, 1) consider T α = αT +(1 α)T arg inf T L(f , T). Proposition 2 (Interpolant is not a stochastic OT map). Assume that Proj Q( 1 γ P) = Q. Then T α (P S) = Q. Consequently, T α is not a stochastic OT map between P and Q. It follows that a necessary condition for non-existence of fake saddle points is Proj Q( 1 γ P) = Q. This requirement is very restrictive and naturally prohibits using large values of γ. 
Also, due to our Lemma 3, the OT plan between P, Q must be deterministic. From the practical point of view, this requirement means that there will be no diversity in samples T x(z) for a fixed x and z S. On the other hand, if Proj Q( 1 γ P) = Q, i.e., P is not γ-times more disperse than Q, the optimization may indeed converge to fake solutions. To show this, we consider the following example. Toy 2D example. We consider P = N(0, [ 1 2]2I2), Q = N(0, I2) (Figure 5a) and run NOT (Korotin et al., 2023, Algorihm 1) for γ {1 4, 1}-weak quadratic costs C2,γ. We show the learned stochastic maps ˆTx(z) and their barycentric projections T(x) def = R Z b Tx(z)d S(z) in Figures 5b, 5c, 5d. Good case. When γ 1 2, we have 1 γ P = N(0, [ 1 2γ ]2I2) with 1 2γ 1. Since the distributions 1 γ P and Q are Gaussians and ΣQ Σ 1 γ P, we conclude that Proj Q( 1 γ P) = Q (Gozlan & Juillet, 2020, Corollary 2.1). Next, we use our Lemma 3 and derive that the OT plan is unique, deterministic and equals the barycentric projection. The latter is the OT map between P and Q for the quadratic cost. It is given by ϕ (x) = 2x (Álvarez-Esteban et al., 2016, Theorem 2.3). In Figure 5b (when γ = 1 2), we have b Tx(z) = T(x) 2x = ϕ (x) and b T (P S) Q. Thus, NOT correctly learns the (unique and deterministic) OT plan. Bad case. When γ > 1 2, we have 1 γ P = N(0, [ 1 2γ ]2I2) with 1 2γ < 1. Since 1 γ P and Q are Gaussians and ΣQ Σ 1 γ P, we conclude that 1 γ P Q (recall M2). Thus, Proj Q( 1 γ P = Q by definition of the projection (8). The optimal barycentric projection is the OT map between Gaussians P and 1 γ P for the quadratic cost. It is given by ϕ (x) = 1 γ x (Álvarez-Esteban et al., 2016, Theorem 2.3). In Figures 5c and 5d, we see that the learned T(x) 1 γ x, i.e., b T captures the Published as a conference paper at ICLR 2023 (a) Iteration 10k. (b) Iteration 10k+100. (c) Iteration 10k+200. (d) Iteration 10k+300. Figure 6: The evolution of the learned transport map b T during training on a toy 2D example (γ = 1). conditional expectation of π (y|x) shared by all OT plans π (M2). However, b T (P S) = Q and NOT fails to learn an OT plan. Importantly, we found that when γ > 1 2 (Figures 5c, 5d), the transport map ˆT extremely fluctuates during the optimization rather than converges to a solution. In Figure 6, we visualize the evolution of b T during training (for γ = 1). In all the cases, the barycentric projection T(x) x = ϕ (x) is almost correct. However, the "remaining" part of b T is literally random. To explain the behavior, we integrate ϕ (x) = x and get that ϕ (x) = 1 2 x 2 is an optimal restricted potential. We derive ψ(y) = ϕ (y) 1 2 y 2 0 = Uψ(y) = U0(y) RD for every y RD. From our Theorem 1 it follows that T arg inf T L(f , T) R Z T x(z)d S(z) = ϕ (x) = x holds P-almost everywhere. Thus, a function T recovered from (6) may be literally any function which captures the first conditional moment of a plan π (y|x). This agrees with our practical observations. In Appendix C, we give an additional toy example illustrating the issue with fake solutions. Our results show that the γ-weak quadratic cost C2,γ may be not a good choice for NOT algorithm due to fake solutions. However, prior works on OT (Korotin et al., 2023; Rout et al., 2022; Fan et al., 2022a; Gazdieva et al., 2022; Korotin et al., 2022) use strong/weak quadratic costs and show promising practical performance. Should we really care about solutions being fake? Yes. 
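The following Monte-Carlo check, under the setup of the toy Gaussian example above (γ = 1, so ∇ϕ*(x) = x and Uψ = R^D), makes the point numerically: all three maps below satisfy the conditional-mean condition of Theorem 1, but only the first pushes P × S onto Q. The map `T_star` is one valid stochastic OT map that we construct for this Gaussian pair; all names are our illustrative choices.

```python
# Toy example: P = N(0, (1/2)^2 I_2), Q = N(0, I_2), gamma = 1.
import torch

n = 200_000
x = 0.5 * torch.randn(n, 2)            # x ~ P
z = torch.randn(n, 2)                  # z ~ S = N(0, I_2)

T_star  = x + (3 ** 0.5 / 2) * z       # conditional mean x AND pushes P x S onto Q
T_bar   = x                            # barycentric projection: correct mean, wrong marginal
alpha   = 0.5
T_alpha = alpha * T_star + (1 - alpha) * T_bar   # interpolant from Proposition 2

for name, y in [("T_star", T_star), ("T_bar", T_bar), ("T_alpha", T_alpha)]:
    # All three have E_z[T_x(z)] = x = grad phi*(x), i.e., they satisfy the
    # characterization of Theorem 1, but only T_star is distribution-preserving.
    print(name, y.var(dim=0, unbiased=True).tolist())    # Q has per-coordinate variance (1, 1)
# Up to Monte-Carlo noise: ~1.0 for T_star, ~0.25 for T_bar, ~0.4375 for T_alpha.
```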
First, fake solutions T arg inf T L(f , T) do not satisfy T (P S) = Q, i.e., they are not distribution preserving (Lemma 1). Second, our analysis suggests that fake solutions might be one of the causes for the training instabilities reported in related works (Korotin et al., 2023, Appendix D), (Korotin et al., 2021b, M4): the map b T may fluctuate between fake solutions rather than converge. 3.2 KERNEL WEAK QUADRATIC COST REMOVES FAKE SADDLE POINTS In this section, we introduce kernel weak quadratic costs which generalize weak quadratic cost (3). We prove that for characteristic kernels the costs completely resolve the ambiguity of arg inf T sets. Henceforth, we assume that X =Y RD are compact sets. Let H be a Hilbert space (feature space). Let u : X H be a function (feature map). We define the γ-weak quadratic cost between features: Cu,γ(x, µ) def = 1 Y u(x) u(y) 2 Hdµ(y) γ Y Y u(y) u(y ) 2 Hdµ(y)dµ(y ) . (12) We denote the PDS kernel k : Y Y R with the feature map u by k(y, y ) def = u(y), u(y ) H. Cost (12) can be computed without knowing the map u, i.e., it is enough to know the PDS kernel k. By using u(y) u(y ) 2 H = k(y, y) 2k(y, y ) + k(y , y ), we obtain the equivalent form of (12): Ck,γ(x, µ) def = 1 2k(x, x)+ 1 γ Y k(y, y)dµ(y) Z Y k(x, y)dµ(y)+ γ Y Y k(y, y )dµ(y)dµ(y ). (13) Published as a conference paper at ICLR 2023 (a) Handbag shoes, 128 128. (b) Shoes handbags, 128 128. Figure 7: Unpaired one-to-many translation with kernel Neural Optimal Transport (NOT). We call (12) and (13) the γ-weak kernel cost. The γ-weak quadratic cost (3) is its particular case for H = RD and u(x) = x. The respective kernel k(x, y) = u(x), u(y) = x, y is bilinear. Lemma 4 (Weak kernel costs are appropriate). Let k be a continuous PDS kernel and γ [0, 1]. Then the cost Ck,γ(x, µ) is convex, lower semi-continuous and lower bounded in µ. Corollary 2 (Existence and duality for kernel costs). Let k be a continuous PDS kernel and γ [0, 1]. Then an OT plan π for cost Ck,γ(x, µ) exists and duality formulas (5) and (6) hold true. We focus on characteristic kernels k and show that they resolve the ambiguity in arg inf T sets. Lemma 5 (Uniqueness of the optimal plan for characteristic kernel costs). Let k be a characteristic PDS kernel and γ (0, 1]. Then the OT plan π for cost Ck,γ(x, µ) is unique. Theorem 2 (Optimality of stochastic functions in all optimal saddle points). Let k be a continuous characteristic PDS kernel and γ (0, 1]. Consider weak OT problem (2) with cost Ck,γ and its dual problem (5). For any optimal potential f arg supf inf T L(f, T) it holds that T arg inf T L(f , T) T x S = π (y|x) holds true P-almost surely for all x X , (14) i.e., every optimal saddle point (f , T ) provides a stochastic OT map T . Bilinear kernel k(x, y) = x, y is not characteristic and is not covered by our Theorem 2; its respective γ-weak quadratic cost C2,γ suffers from fake solutions (M3.1). In the next subsection, we give examples of practically interesting kernels k(x, y) which are ideologically similar to the bilinear but are characteristic. Consequently, their respective costs Ck do not have ambiguity in arg inf T sets. 3.3 PRACTICAL ASPECTS OF LEARNING WITH KERNEL COSTS Optimization. To learn the stochastic OT map T for kernel cost (13), we use NOT s training procedure (Korotin et al., 2023, Algorithm 1). It requires stochastic estimation of Ck,γ(x, Tx S) to compute the corresponding term in (6). 
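In code, such an estimate for a single input x and a batch of latent samples can look as follows. This is a sketch of the unbiased estimator (15) stated next, instantiated with the distance-induced kernel used in our experiments; the tensor layout and function names are ours.

```python
# Unbiased Monte-Carlo estimate of C_{k,gamma}(x, T_x # S), cf. (13) and (15).
import torch

def distance_kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # k(a, b) = -||a - b|| / 2, computed pairwise: (N, D) x (M, D) -> (N, M).
    return -torch.cdist(a, b) / 2

def kernel_cost_estimate(x: torch.Tensor, y: torch.Tensor, gamma: float,
                         kernel=distance_kernel) -> torch.Tensor:
    """x: (D,) input point; y: (K, D) samples T_x(z_1), ..., T_x(z_K), K >= 2."""
    K = y.shape[0]
    k_xx = kernel(x[None, :], x[None, :]).squeeze()      # k(x, x)
    k_xy = kernel(x[None, :], y).squeeze(0)              # k(x, T_x(z)), shape (K,)
    k_yy = kernel(y, y)                                  # k(T_x(z), T_x(z')), (K, K)
    diag = k_yy.diagonal()
    off_diag_sum = k_yy.sum() - diag.sum()               # sum over z != z'
    return (0.5 * k_xx
            + (1 - gamma) / (2 * K) * diag.sum()
            - k_xy.mean()
            + gamma / (2 * K * (K - 1)) * off_diag_sum)

# Example: cost of mapping x to a small cloud of K = 16 points.
x = torch.randn(3)
y = x + 0.1 * torch.randn(16, 3)
print(kernel_cost_estimate(x, y, gamma=1 / 3).item())
```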
Similar to the γ-weak quadratic cost (Korotin et al., 2023, Equation 23), it is possible to derive the following unbiased Monte-Carlo estimator b Ck,γ for x X and a batch Z S (|Z| 2): b Ck,γ x, Tx(Z) = 1 2k(x, x) + 1 γ z Z k Tx(z), Tx(z) z Z k(x, Tx(z)) + γ 2|Z|(|Z| 1) z =z k Tx(z), Tx(z ) Ck,γ(x, Tx S). (15) The time complexity of estimator (15) is O(|Z|2) since it requires considering pairs z, z in batch to estimate the variance term. Specifically for the bilinear kernel k(y, y ) = y, y , the variance can be Published as a conference paper at ICLR 2023 (a) Texture shoes, 128 128. (b) Texture handbags, 128 128. Figure 8: Unpaired one-to-many translation with kernel Neural Optimal Transport (NOT). estimated in O(|Z|) operations (Korotin et al., 2023, Equation 23), but NOT algorithm may suffer from fake solutions (M3.1). Kernels. Consider the family of distance-induced kernels k(x, y)= 1 2 x y α. For these kernels, we have u(x) u(x ) 2 H = x x α, i.e., (12), (13) can be expressed as Ck,γ(x, µ) = Cu,γ(x, µ) = 1 Y x y αdµ(y) γ Y Y y y αdµ(y)dµ(y ) . (16) For α = 2 the kernel is bilinear, i.e., k(x, y) = x, y ; it is PDS but not characteristic and (16) simply becomes the γ-weak quadratic cost (3). In the experiments (M5), we focus on the case α = 1; it yields a PDS and characteristic kernel (Sejdinovic et al., 2013, Definition 13 & Proposition 14). 4 RELATED WORK In deep learning, OT costs are primarily used as losses to train generative models. Such approaches are called Wasserstein GANs (Arjovsky & Bottou, 2017); they are not related to our paper since they only compute OT costs but not OT plans. Below we discuss methods to compute OT plans. Existing OT solvers. NOT (Korotin et al., 2023) is the only parametric algorithm which is capable of computing OT plans for weak costs (2). Although NOT is generic, the authors tested it only with the γ-weak quadratic cost (3). The core of NOT is saddle point formulation (6) which subsumes analogs (Korotin et al., 2021b, Eq. 9), (Rout et al., 2022, Eq. 14), (Fan et al., 2022a, Eq. 11), (Henry-Labordere, 2019, Eq. 11), (Gazdieva et al., 2022, Eq. 10), (Korotin et al., 2022, Eq. 7) for strong costs (1). For the strong quadratic cost, (Makkuva et al., 2020), (Taghvaei & Jalali, 2019, Eq. 2.2), (Korotin et al., 2021a, Eq. 10) consider analogous to (9) formulations restricted to convex potentials; they use Input Convex Neural Networks (ICNNs (Amos et al., 2017)) to approximate the potentials. ICNNs are popular in OT (Korotin et al., 2021c; Mokrov et al., 2021; Huang et al., 2020; Alvarez-Melis et al., 2022; Bunne et al., 2022) but recent studies (Korotin et al., 2021b; 2022; Fan et al., 2022b) show that OT algorithms based on them underperform compared to unrestricted formulations such as NOT. In (Genevay et al., 2016; Seguy et al., 2018; Daniels et al., 2021), the authors propose neural algorithms for f-divergence regularized costs (Genevay, 2019). The first two methods suffer from bias in high dimensions (Korotin et al., 2021b, M4.2). Algorithm (Daniels et al., 2021) alleviates the bias but is not end-to-end and is computationally expensive due to using the Langevin dynamics. There also exist GAN-based (Goodfellow et al., 2014) methods (Lu et al., 2020; Xie et al., 2019; González-Sanz et al., 2022) to learn OT plans (or maps) for strong costs. However, they are harder to set up in practice due to the large amount of tunable hyperparameters. Published as a conference paper at ICLR 2023 Kernels in OT. 
In (Zhang et al., 2019; Oh et al., 2020), the authors propose a strong kernel W2 distance and an algorithm to approximate the transport map under the Gaussianity assumption on P, Q. In (Li et al., 2021), the authors generalize Sinkhorn divergences (Genevay et al., 2019) to Hilbert spaces. These papers consider discrete OT formulations and data-to-data matching tasks; they do not use neural networks to approximate the OT map. 5 EVALUATION In Appendix A, we learn OT between toy 1D distributions and perform comparisons with discrete OT. In Appendix B, we conduct tests on toy 2D distributions. In this section, we test our algorithm on an unpaired image-to-image translation task. We perform comparison with principal translation methods in Appendix K. The code is written in Py Torch framework and is available at https://github.com/iamalexkorotin/Kernel Neural Optimal Transport Image datasets. We test the following datasets as P, Q: aligned anime faces1, celebrity faces (Liu et al., 2015), shoes (Yu & Grauman, 2014), Amazon handbags, churches from LSUN dataset (Yu et al., 2015), outdoor images from the MIT places database (Zhou et al., 2014), describable textures (Cimpoi et al., 2014). The size of datasets varies from 5K to 500K images. Train-test split. We pick 90% of each dataset for unpaired training. The rest 10% are considered as the test set. All the results presented here are exclusively for test images, i.e., unseen data. Transport costs. We focus on the γ-weak cost for the kernel k(x, y) = 1 2 x y . For completeness, we test other popular PDS kernels in Appendix E. Other training details (optimizers, architectures, pre-processing, etc.) are given in Appendix I. We learn stochastic OT maps between various pairs of datasets. We rescale images to 128 128 and use γ = 1 3 in the experiments with the kernel cost. Additionally, in Appendix D we analyse how varying parameter γ affects the diversity of generated samples. We provide the qualitative results in Figures 1, 7 and 8; extra results are in Appendix L. Thanks to the first term in (16), our translation map ˆTx(z) tries to minimally change the image content x in the pixel space. At the same time, the second term (kernel variance) in (16) enforces the map to produce diverse outputs for different z S. Datasets (128 128) C2,0 (strong) C2,γ (weak) Ck,γ (weak) Ours Handbags shoes 35.7 33.9 0.2 26.7 0.06 Shoes handbags 39.8 29.51 0.19 Outdoor church 25.5 25.97 0.14 15.16 0.03 Celeba (f) anime 38.73 28.21 0.12 21.96 0.07 Table 1: Test FID of NOT with various costs. We provide quantitative comparison with NOT with the γ-weak quadratic cost C2,γ. We compute FID score (Heusel et al., 2017) between the mapped input test subset and the output test subset (Table 1). For C2,γ, we use the pre-trained models provided by the authors of NOT (Korotin et al., 2023, M5).2 We observe that FID of NOT with kernel cost Ck,γ is better than that of NOT with cost C2,γ. We show qualitative examples in Appendix J. In Appendix H, we perform a detailed comparison of NOT s training stability with the weak quadratic and kernel costs. 6 DISCUSSION Potential impact. Neural OT methods and their usage in generative models constantly advance. We expect our proposed weak kernel quadratic costs to improve applications of OT to unpaired learning. In particular, we hope that our theoretical analysis provides better understanding of the performance. Limitations (theory). In our Theorem 2, we implicitly assume the existence a maximizer f of dual form (5) for kernel costs Ck,γ. 
Deriving precise conditions for existence of such maximizers is a challenging question. We hope that this issue will be addressed in the future theoretical OT research. Limitations (practice). Applying kernel costs to domains of different nature (RGB images depth maps, infrared images RGB images) is not straightforward as it might require selecting meaningful shared features u (or kernel k). Studying this question is a promising avenue for the future research. 1kaggle.com/reitanaka/alignedanimefaces 2https://github.com/iamalexkorotin/Neural Optimal Transport Published as a conference paper at ICLR 2023 Reproducibility. We provide the source code for all experiments and release the checkpoints for all models of M5. The details are given in README.MD in the official repository. ACKNOWLEDGEMENTS. The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). J-J Alibert, Guy Bouchitté, and Thierry Champion. A new class of costs for optimal transport planning. European Journal of Applied Mathematics, 30(6):1229 1263, 2019. Amjad Almahairi, Sai Rajeshwar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, pp. 195 204. PMLR, 2018. Pedro C Álvarez-Esteban, E Del Barrio, JA Cuesta-Albertos, and C Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2): 744 762, 2016. David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=dp OYN7o8Jm. Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 146 155. JMLR. org, 2017. Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017. URL https:// openreview.net/forum?id=Hk4_qw5xe. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. Julio Backhoff-Veraguas, Mathias Beiglböck, and Gudmun Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):1 28, 2019. Charlotte Bunne, Laetitia Papaxanthos, Andreas Krause, and Marco Cuturi. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511 6528. PMLR, 2022. M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014. Grady Daniels, Tyler Maunu, and Paul Hand. Score-based generative neural networks for large-scale optimal transport. Advances in Neural Information Processing Systems, 34, 2021. Jiaojiao Fan, Shu Liu, Shaojun Ma, Yongxin Chen, and Hao-Min Zhou. Scalable computation of monge maps with general costs. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022a. Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational wasserstein gradient flow. 
In International Conference on Machine Learning, pp. 6185 6215. PMLR, 2022b. Milena Gazdieva, Litu Rout, Alexander Korotin, Andrey Kravchenko, Alexander Filippov, and Evgeny Burnaev. An optimal transport perspective on unpaired image super-resolution. ar Xiv preprint ar Xiv:2202.01116, 2022. Aude Genevay. Entropy-regularized optimal transport for machine learning. Ph D thesis, Paris Sciences et Lettres (Com UE), 2019. Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for largescale optimal transport. In Advances in neural information processing systems, pp. 3440 3448, 2016. Published as a conference paper at ICLR 2023 Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd international conference on artificial intelligence and statistics, pp. 1574 1583. PMLR, 2019. Alberto González-Sanz, Lucas De Lara, Louis Béthune, and Jean-Michel Loubes. Gan estimation of lipschitz optimal transport maps. ar Xiv preprint ar Xiv:2202.07965, 2022. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Nathael Gozlan and Nicolas Juillet. On a mixture of brenier and strassen theorems. Proceedings of the London Mathematical Society, 120(3):434 463, 2020. Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327 3405, 2017. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Pierre Henry-Labordere. (martingale) optimal transport and anomaly detection with neural networks: A primal-dual algorithm. Available at SSRN 3370910, 2019. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626 6637, 2017. Chin-Wei Huang, Ricky TQ Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2020. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172 189, 2018. Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic. uchicago. edu/shai/papers/Kakade Shalev Tewari09. pdf, 2(1), 2009. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=b Eoxz W_EXsa. Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. Advances in Neural Information Processing Systems, 34, 2021b. 
Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021c. URL https://openreview.net/forum?id=3t FAs5E-Pe. Alexander Korotin, Vage Egiazarian, Lingxiao Li, and Evgeny Burnaev. Wasserstein iterative networks for barycenter estimation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=Gi Enzx Tna MN. Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural optimal transport. In International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=d8CBRl WNkq H. Published as a conference paper at ICLR 2023 Qian Li, Zhichao Wang, Gang Li, Jun Pang, and Guandong Xu. Hilbert sinkhorn divergence for optimal transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3835 3844, 2021. Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein GAN with quadratic transport cost. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4832 4841, 2019. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, and Yong Yu. Large-scale optimal transport via adversarial training with cycle-consistency. ar Xiv preprint ar Xiv:2003.06635, 2020. Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pp. 6672 6681. PMLR, 2020. Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale wasserstein gradient flows. Advances in Neural Information Processing Systems, 34, 2021. Jung Hun Oh, Maryam Pouryahya, Aditi Iyer, Aditya P Apte, Joseph O Deasy, and Allen Tannenbaum. A novel kernel wasserstein distance on gaussian measures: an application of identifying dental artifacts in head and neck computed tomography. Computers in biology and medicine, 120:103731, 2020. Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of wasserstein GANs. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=B1h YRMb CW. Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pp. 234 241. Springer, 2015. Litu Rout, Alexander Korotin, and Evgeny Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=5Jd LZg346Lw. Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training gans with regularized optimal transport. Advances in Neural Information Processing Systems, 31, 2018. Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. Marco Scarsini. Multivariate convex orderings, dependence, and stochastic equality. Journal of applied probability, 35(1):93 103, 1998. 
Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id= B1zlp1b RW. Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pp. 2263 2291, 2013. Amirhossein Taghvaei and Amin Jalali. 2-Wasserstein approximation via restricted convex potentials with application to improved training for GANs. ar Xiv preprint ar Xiv:1902.07197, 2019. Published as a conference paper at ICLR 2023 Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. volume 97 of Proceedings of Machine Learning Research, pp. 6882 6892, Long Beach, California, USA, 09 15 Jun 2019. PMLR. URL http: //proceedings.mlr.press/v97/xie19a.html. Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192 199, 2014. Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Zhen Zhang, Mianzhi Wang, and Arye Nehorai. Optimal transport in reproducing kernel hilbert spaces: Theory and applications. IEEE transactions on pattern analysis and machine intelligence, 42(7):1741 1754, 2019. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. 2014. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223 2232, 2017. Published as a conference paper at ICLR 2023 A TOY EXPERIMENTS IN 1D AND COMPARISON WITH DISCRETE OT In this section, we learn transport plans between various pairs of toy 1D distributions and compare them with the discrete optimal transport (DOT) considered as the ground truth. We use the distanceinduced kernel and consider γ {1, 10}. All the rest training details (fully-connected architectures, optimizers, etc.) match those of (Korotin et al., 2023, Appendix C). We consider Gaussian N(0, 1) Mixture of 2 Gaussians and Mixture of 3 Gaussians Mixture of 2 Gaussians. In Figure 9, we visualize the pairs P, Q (1st and 2nd columns), the plan ˆπ learned by Kernel NOT (3rd column) and the plan π learned by DOT (4th column). To compute DOT, we sample 103 random points x P, y Q and compute a discrete plan by ot.optim.cg solver from Python OT (POT) library https://pythonot.github.io/. Our learned plan ˆπ and DOT s plan π nearly match. Note also that, as one may expect, with the increase of γ from 1 to 10, the conditional variance of the plan increases and for very high γ = 10 it becomes similar to the trivial plan P Q. This is analogous to the entropic optimal transport, see, e.g., (Peyré et al., 2019, Figure 4.2). (a) Gaussian N(0, 1) Mixture of 2 Gaussians, γ = 1. (b) Gaussian N(0, 1) Mixture of 2 Gaussians, γ = 10. (c) Mixture of 3 Gaussians Mixture of 2 Gaussians, γ = 1. 
(d) Mixture of 3 Gaussians Mixture of 2 Gaussians, γ = 10. Figure 9: Stochastic plans (3rd and 4th columns) between toy 1D distributions (1st and 2nd columns) learned by our Kernel NOT (3rd column) and discrete OT with the weak kernel cost (4th column). Published as a conference paper at ICLR 2023 B TOY EXPERIMENTS IN 2D In this section, we learn transport maps between various common pairs of toy 2D distributions. We use the distance-induced kernel and γ = 1. All the rest training details (fully-connected architectures, optimizers, etc.) exactly match those of (Korotin et al., 2023, Appendix B). We consider Gaussian N(0, 1 2I2) Gaussian N(0, I2) (the same experiment as in Figures 5d and 6), Gaussian N(0, [ 1 2]2I2) Mixture of 8 Gaussians and Gaussian N(0, [ 1 2]2I2) Swiss roll as P, Q pairs. In Figure 10, we provide the learned stochastic (one-to-many) maps. Since the ground truth OT maps for kernel costs are not known, we provide only qualitative results. (a) Gaussian N(0, [ 1 2]2I2) Gaussian N(0, I2). (b) Gaussian N(0, [ 1 2]2I2) Mixture of 8 Gaussians. (c) Gaussian N(0, 1 2I2) Swiss roll. (d) Mixture of 4 Gaussians Uniform. Figure 10: Stochastic (one-to-many) maps learned between toy 2D distributions by Kernel NOT. Published as a conference paper at ICLR 2023 C ADDITIONAL TOY 2D EXAMPLE In this section, we provide an additional toy 2D example demonstrating the issue with fake solutions for the weak quadratic cost. We consider a mixture of 4 Gaussians as P and the uniform distribution on a square as Q. We train NOT with the γ-weak quadratic cost for γ = 0, 1 2, 1 and report the results in Figure 11. For γ = 0 (Figure 11b), we see that NOT with the weak quadratic cost3 learns the target distribution. However, when γ = 1 2 (Figure 11c) and γ = 1 (Figure 11d), the method does not converge and yields fake solutions. In addition, in Figure 12, we show that for γ = 1 the method notably fluctuates between the fake solutions. This is analogous to the toy example with Gaussians in M3.1, see Figure 6. For completeness, we run NOT with our proposed kernel cost (16) on this pair (P, Q) and γ = 1 and show that it learns the distribution Q, see Figure 10d. (a) Input and target distributions P and Q. (b) Fitted map ˆTx(z) for γ = 0. (c) Fitted map ˆTx(z) for γ = 1 (d) Fitted map ˆTx(z) for γ = 1. Figure 11: Stochastic maps b T between P and Q fitted by NOT with costs C2,γ for various γ. (a) Iteration 10k. 2 (b) Iteration 10k+100. (c) Iteration 10k+200. (d) Iteration 10k+300. Figure 12: The evolution of the learned transport map b T during training on a toy 2D example (γ = 1). 3In this case, the cost is the strong quadratic cost. Published as a conference paper at ICLR 2023 D VARIANCE-SIMILARITY TRADE-OFF In this section, we study how parameter γ affects the resulting learned transport map. In (Korotin et al., 2023, Appendix A), the authors empirically show that for the γ-weak quadratic cost C2,γ the variety of samples Tx(z) produced for a fixed x and z S increases with the increase of γ, but their similarity to x decreases. We formalize this statement and generalize it for kernel costs Ck,γ. For a plan π, we define its (feature) conditional variance and (the square of) input-output distance by CVaru(π) def = Z X Var u π(y|x) dπ(x) and Dist2 u(π) def = Z X Y u(x) u(y) 2 Hdπ(x, y), (17) respectively. Recall that Var u π(y|x) = R Y Y u(y) u(y ) 2 Hdπ(y|x)dπ(y |x). We note that the γ-weak kernel cost of a plan π Π(P, Q) is given by Costk,γ(π) = 1 2 Distu(π) γ 2 CVaru(π). 
(18) Our following proposition explains the behaviour of the above mentioned values for OT plans. Figure 13: Texture shoes (64 64) translation with the γ-weak kernel cost for various values γ. Proposition 3 (Behavior of the conditional variance and input-output distance). Let π γ Π(P, Q) be an OT plan for γ-weak kernel cost. Then for γ2 > γ1 0 it holds true CVaru(π γ1) CVaru(π γ2) and Distu(π γ1) Distu(π γ2), (19) Published as a conference paper at ICLR 2023 γ = 0 γ = 1 3 γ = 1 γ = 4 3 CVaru(bπγ) 0.72 15.2 16.28 17.86 18.26 Dist2 u(bπγ) 46.6 48.13 48.75 49.87 51.24 Costk,γ(bπγ) 23.3 21.56 19.0 16.01 13.48 Table 2: The values of CVar(bπγ), Dist(bπγ) and Costk,γ(bπγ) of learned plans bπγ π γ with different values of γ on texture shoes (64 64) translation with the kernel quadratic cost. i.e., for larger γ, the OT plan π γ on average for each x yields more conditionally diverse samples π (y|x) but they are less close to x in features w.r.t. 2 H. The OT cost is non-increasing, i.e., Costk,γ1(π γ1) Costk,γ1(π γ2). (20) The proof is given in Appendix G. We empirically check the proposition by training OT maps for Ck,γ for texture shoes translation (64 64), distance-induced k (M5) and γ {0, 1 3}, see Figure 13 and Table 2. We observe the increase of the variety of samples with the increase of γ. At the same time, with the increase of γ, output samples become less similar to the inputs. E EXPERIMENTS WITH DIFFERENT KERNELS We empirically test several popular kernels on texture handbag translation (64 64), γ = 1 3. We consider bilinear, distance-based, Gaussian and Laplacian kernels. The three latter kernels are characteristic. The quantitative and qualitative results are given in Figure 14. For the kernels in view, the squared feature distance can be expressed as u(x) u(y) 2 H = h( x y ) for some increasing function h : R+ R+. Due to this, all the stochastic transport maps try to preserve the input content in the pixel space. According to FID, the distance-based kernel performs better than the bilinear one, which agrees with the results in M5. Interestingly, both Gaussian and Laplacian kernels are slightly outperformed by the bilinear kernel. We do not know why this happens, but we presume that this might be related to their boundness (0 < k(x, y) 1) and exp operation. F RESTRICTED DUALITY FOR THE WEAK QUADRATIC COST In this section, we derive duality formula (9) for the γ-weak quadratic cost. First, we derive formula (9) for γ = 1 by using (Gozlan & Juillet, 2020, Theorems 1.1, 1.2). Next, following the discussion in (Gozlan & Juillet, 2020, M5.2), we generalize duality formula (9) to arbitrary γ > 0. We note that the constants in our derivations differ from those in (Gozlan & Juillet, 2020) since our quadratic cost C2,γ differs by a constant multiplicative factor. Part 1 (γ = 1). From (Gozlan & Juillet, 2020, Theorems 1.1, 1.2) it follows that there exists a lower-semi-continuous and convex function v : RD R { } which maximizes the following expression: W2 2,1(P, Q) = max v l.s.c. Cvx X inf y RD v(y) + 1 2 x y 2 d P(x) Z Y v(y)d Q(y) . (21) Importantly, ϕ (x) def = ( 2 2 + v )(x) is the optimal restricted potential, i.e., ϕ implements the projection of P to Q and is a 1-smooth convex function. We consider the change of variables ϕ(x) = ( 2 2 + v)(x) in (21). Since y 2 2 + v(y) is 1-strongly convex, its conjugate ϕ(x) is 1-smooth. From the lower-semi-continuity of v it follows that 2 + v (y) = 2 2 + v (y) = 1 2 y 2 + v(y) = v(y) = 1 2 y 2 ϕ(y). 
(22) inf y RD v(y) + 1 2 x y 2 = 1 2 x 2 + inf y RD ϕ(y) x, y = Published as a conference paper at ICLR 2023 (a) Bilinear k(y, y ) = y, y (Korotin et al., 2023), FID = 18.79 0.08. (b) Distance k(y, y )= 1 2 y y , FID = 16.13 0.1. (c) Gaussian k(y, y ) = exp y y 2 2D , FID = 20.89 0.09. (d) Laplacian k(y, y ) = exp y y 2D , FID = 20 0.09. Figure 14: Texture handbags (64 64) translation with the different γ-weak kernel costs. 1 2 x 2 sup y RD x, y ϕ(y) = 1 2 x 2 ϕ(x). (23) We substitute (22) and (23) to (21) and obtain W2 2,1(P, Q) =max ϕ Cvx(1) 2 d P(x)+ Z X ϕ(x)d P(x)+ Z ϕ(y)d Q(y) . (24) We only need to note that (24) exactly matches the desired (9) for γ = 1. Part 2 (arbitrary γ > 0). In (Gozlan & Juillet, 2020, M5.2), the authors show that the OT problem between P and Q for the γ-weak cost becomes the OT problem between 1 γ P and Q for the 1-weak cost. It holds W2 2,γ(P, Q) = γ inf P Q W2 2( 1 γ P, P ) | {z } =W2 2,1( 1 γ P,Q), see (Backhoff-Veraguas et al., 2019, Thm. 1.4) X x 2d P(x) + 1 γ Y y 2d Q(x). (25) Moreover, ϕ (x) is the optimal restricted potential for γ-weak cost between P, Q if and only if ϕ 1(x) = γϕ (γ 1x) is the optimal restricted potential between 1 γ P and Q for the 1-weak quadratic Published as a conference paper at ICLR 2023 cost. Note that ϕ (x) = ϕ 1( 1 γ x), i.e., ϕ (x) first scales x P by 1 γ and then implements projection ϕ 1 of 1 γ P to Q. In particular, the function ϕ is 1 γ -Lipschitz. We reparameterize the duality formula (24) for distributions 1 γ P and Q as follows: γ P, Q) = max ϕ1 Cvx(1) ϕ1(y)d Q(y) = max ϕ1 Cvx(1) 2 d P(x)+ Z γ x)d P(x)+ Z ϕ1(y)d Q(y) = (26) max ϕ Cvx( 1 2 d P(x)+ Z X ϕ(x)d P(x)+ Z ϕ(y)d Q(y) . (27) In transition from (26) to (27), we use the change of variables for ϕ(x) = γϕ1( 1 γ x) known as the right scalar multiplication. It yields ϕ(y) = γϕ1(y). To finish the derivation of dual form (9), we simply substitute (27) to (25). Proof of Lemma 1. Assume the opposite, i.e., T (P S) = Q. Then T implicitly represents some transport plan π Π(P, Q) between P and Q. By the definition of f , T and (6), it holds Cost(P, Q) = L(f , T ) = Z X C(x, T x S)d P(x) Z Z f T x(z) d S(z)d P(x) + Z Y f (y)d Q(y) = (28) Z X C(x, T x S)d P(x) Z Y f (y)d Q(y) + Z Y f (y)d Q(y) = (29) Z X C(x, T x S)d P(x) = Z X C(x, π ( |x))dπ (x), (30) where in transition between lines (28) and (29), we use the change of variables formula for y = T x(z) and the equality T (P S) = Q. From (30) we see that the cost of the plan π Π(P, Q) equals the optimal cost Cost(P, Q). As a result, it is an optimal plan by definition. In turn, T is an stochastic OT map but not a fake solution. This is a contradiction. Thus, it holds that T (P S) = Q. Proof of Lemma 2. We compute the C-transform of f (y) = y 2 2 ϕ (x). We have (f )C(x) = inf µ P(Y) C2,γ(x, µ) Z Y f (y)dµ(y) = Y x y 2dµ(y) γ 2 ϕ (y) dµ(y) = Y y dµ(y) + 1 Y y 2dµ(y) γ 2 ϕ (y) dµ(y) = Y y dµ(y) γ 2 Var(µ) + Z ϕ (y)dµ(y) = 1 2 x 2 + inf µ P(Y) ϕ (y) x, y dµ(y) γ 1 2 x 2 + inf µ P(Y) 2 y 2 x, y dµ(y) + γ Published as a conference paper at ICLR 2023 Pick any µ P(Y) and denote its expectation by m = R Y y dµ(y). Since ϕ is 1 γ -smooth, it holds that ϕ is γ-strongly convex, i.e., ϕ (y) γ 2 y 2 x, y is convex. We use the Jensen s inequality: Z 2 y 2 x, y dµ(y) ϕ (m) γ 2 m 2 x, m . (32) Therefore, we may restrict the feasible set of inf in (31) to Dirac distributions δm, m Y. 
That is, (f )C(x) = 1 2 x 2 + inf m Y 2 m 2 x, m + γ 1 2 x 2 + inf m Y ϕ (m) x, m = 1 2 x 2 sup m Y 2 x 2 ϕ (x), (33) where ϕ (x) = ϕ (x) holds since ϕ is continuous. We substitute (33) to (5) and obtain Z X (f )C(x)d P(x) + Z Y f (y)d Q(y) = 2 ϕ (x) d P(x) + Z 2 ϕ (y) d Q(y) = W2 2,γ(P, Q), (34) where in line (34), we use the optimality of ϕ and (9). We conclude that f maximizes (5), (6). Proof of Proposition 1. For all y , y Uψ(y), α [0, 1], from the convexity of ψ it follows that ψ αy + (1 α)y αψ(y ) + (1 α)ψ(y ). (35) By the definition of the subgradient of a convex function it also holds ψ αy + (1 α)y ψ(y) + x, αy + (1 α)y y = α ψ(y) + x, y y + (1 α) ψ(y) + x, y y = αψ(y ) + (1 α)ψ(y ) (36) for all x yψ(y). Therefore, (35) and (36) are equalities, and αy + (1 α)y Uψ(y). Proof of Theorem 1. It holds that T arg inf T L(f , T) T arg inf T C x, Tx S Z Z f Tx(z) d S(z) d P(x). The latter condition holds if and only if P-almost surely for all x X we have T x arg inf Tx:Z Y C x, T x S Z Z f T x(z) d S(z) . (37) We substitute f = 2 2 ϕ and C = C2,γ. As a result, we derive C2,γ x, T x S Z f T x(z) d S(z) = Z x T x(z) 2d S(z) γ 2 Var(T x S) 1 Z T x(z) 2d S(z) + Z ϕ T x(z) d S(z) = 1 2 x 2 x, Z Z T x(z)d S(z) γ 2 Var(T x S) + Z ϕ T x(z) d S(z) = 1 2 x 2 x, Z Z T x(z)d S(z) + γ Z T x(z)d S(z) 2 T x(z) 2 d S(z) = 1 2 x 2 x, Z Z T x(z)d S(z) + γ Z T x(z)d S(z) Z ψ T x(z) d S(z) (38) 1 2 x 2 x, Z Z T x(z)d S(z) + γ Z T x(z)d S(z) Z T x(z)d S(z) , (39) Published as a conference paper at ICLR 2023 where we substitute ψ = ϕ γ 2 2. Recall that ψ is convex since ϕ is 1 γ -smooth (Kakade et al., 2009). In transition from (38) to (39), we use the convexity of ψ and the Jensen s inequality. Part 1. To begin with, we prove implication "= " in (11). If (37) holds true, then (39) is the equality. If it is not, one may pick T x(z) def = R Z T x(z )d S(z ) which provides smaller value (39) for (37) than T x. Let T be any stochastic OT map and π be its respective OT plan. We know that T arg inf T L(f , T), i.e., the optimal value (39) of (37) equals 1 2 x 2 x, Z Z T x(z)d S(z) + γ Z T x(z)d S(z) Z T x(z)d S(z) = (40) 1 2 x 2 x, ϕ (x) + γ 2 ϕ (x) 2 + ψ ϕ (x) , (41) where we use the equalities ϕ (x) = R Y ydπ (y|x) = R Z T x(z)d S(z), see M2. The function m 7 1 2 x 2 x, m + γ 2 m 2 + ψ m is γ-strongly convex in m, therefore, the minimizer m = ϕ (x) is unique. This yields that R Z T x(z)d S(z) = m = ϕ (x). We know that (38) equals (39), i.e., Z Z ψ T x(z) d S(z) ψ Z Z T x(z)d S(z) = Z Z ψ T x(z) d S(z) ψ ϕ (x) = 0, (42) and the Jensen s gap vanishes. Now we are going to prove that Supp(Tx S) Uψ( ϕ (x)). We need to show that for every x y( ϕ (x)) the following inequality ψ(y) ψ( ϕ (x)) + x , y ϕ (x) (43) is the equality T x S-almost surely for all y Supp(T x S). Assume the opposite, i.e., for some x , (43) holds true not T x S-almost surely. We integrate (43) w.r.t. y T x S and get the strict inequality Z Y ψ(y)d T x S (y) > ψ ϕ (x) + x , Z Y yd T x S (y) ϕ (x) . We use the change of variables for y = T x(z) and obtain Z Z ψ T x(z) d S(z) > ψ ϕ (x) + y , Z Z T x(z)d S(z) ϕ (x) | {z } =0 = ψ ϕ (x) , which contradicts (42). This finishes the proof of implication "= ". Part 2. Now we prove implication " =" in (11). For every T satisfying the conditions on the right-hand side of (11), the Jensen s gap (42) is zero since ψ is linear in Uψ ϕ (x) . Therefore, (38) equals (39). Due to R Z T x(z)d S(z) = ϕ (x), we have that (39) attains the optimal (minimal) values. Consequently, (37) holds. 
Finally, we note that convexity of arg inf T L(f , T) follows from the convexity of sets Uψ( ). Proof of Lemma 3. The plan π Π(P, Q) is optimal if and only if R Y ydπ(y|x) = ϕ (x). That is, a stochastic map T : X Z Y is optimal if and only if it pushes P to Q (represents some plan π Π(P, Q)), i.e., T (P S) = Q, and R Z Tx(z)d S = ϕ (x). The optimal barycentric projection T satisfies the second condition by the definition. Consider implication " (a) = ". Since T is a stochastic OT map, the first condition T (P S) = Q holds true. Recall that by the definition of ϕ and T we have T (P S) = ϕ P = ( 1 Thus, Q = Proj Q( 1 γ P). Consider implication " (a) =". Since Proj Q( 1 γ P) = Q, the first condition T (P S) = Q holds true. Therefore, T is a stochastic OT map. Now we prove implication " (b) = ". Let T be any stochastic OT map. We compute the second moment of T x (P S) = Q below: Z Y y 2d Q(y) = Z Z T x(z) 2d S(z)d P(x) = (44) Published as a conference paper at ICLR 2023 Var(T x S) + Z T x(z)d S(z) 2 d P(x) = Z X Var(T x S)d P(x) + Z X ϕ (x) 2d P(x) = (45) Z X Var(T x S)d P(x) + Z Y y 2d Q(y), (46) where in transition to (45), we use the equality R Z T x(z)d S(z) = ϕ (x); in transition to (46), we use the change of variables formula for y = ϕ (x) and ϕ (x) P = Q. Finally, by comparing (44) and (46), we obtain that Var(Tx S) = 0 holds P-almost surely. That is, T is deterministic (does not depend on z) and (P S)-almost surely matches the optimal barycentric projection T . Proof of Proposition 2. We are going to show that the second moment of T α (P S) is less than that of Q. Consequently, T α (P S) = Q, and T α can not be a stochastic OT map. First, for T we have Z Z T x(z) 2d S(z)d P(x) = Var T (P S) | {z } Var(Q) Z T x(z)d S(z)d P(x) X ϕ (x)d P(x) 2 = Var(Q) m Q 2 = Z Z T x(z) 2d S(z)d P(x), (47) where in the last equality we use T (P S) = Q. Finally, we derive Z Z T α x (z) 2d S(z)d P(x) = Z Z αT x(z) + (1 α)T x(z) 2d S(z)d P(x) < Z α T x(z) 2 + (1 α) T x(z) 2 d S(z)d P(x) = (48) Z T x(z) 2d S(z)d P(x) + (1 α) Z Z T x(z) 2d S(z)d P(x) Z Z T x(z) 2d S(z)d P(x) = [second moment of Q]. (49) In transition to (48), we use the Jensen s inequality for 2. The inequality is strict since 2 is strictly convex and T = T (P S-almost surely). In transition to (49), we use (47). That is, T α (P S) = Q as its second moment is smaller. Remark. The assumption Proj Q( 1 γ P) = Q in Proposition 2 is needed to guarantee the existence of a function T arg inf T L(f , T) which differs from the given stochastic OT map T . Due to our Lemma 3, the optimal barycenteric projection T arg inf T L(f , T) is not an OT map. Thus, T = T is a suitable example. Since T (P S) Q, we also have Var T (P S Var(Q). Proof of Lemma 4. The lower-semi-continuity of Ck,γ(x, µ) = Cu,γ(x, µ) in (x, µ) follows from the continuity of k, compactness of X = Y RD and (Santambrogio, 2015, Lemma 7.3). 4 Y u(x) u(y) 2 Hdµ(y) in (12) is linear in µ and, consequently, convex. The second term in (12) equals to γ 2 Var(u µ). The pushforward operator u is linear and Var( ) is concave. Therefore, the second term is convex in µ (γ 0). As a result, Ck,γ(x, µ) is convex in µ. To prove that Ck,γ is lower-bounded, we rewrite (12) analogously to (3), i.e., Ck,γ(x, µ) = 1 Y u(y)dµ(y) 2 H + 1 γ Y Y u(y) u(y ) 2 Hdµ(y)dµ(y ) = 4We use the lower semi-continuity (in µ) of C(x, µ) w.r.t. the weak convergence of distributions in P(Y). In contrast, Backhoff-Veraguas et al. (2019) work with µ Pp(Y) P(Y), i.e., with the distributions which have a finite p-th moment. 
Proof of Lemma 5. To begin with, we prove that if $k$ is characteristic, then $C_{k,\gamma}(x,\mu) = C_{u,\gamma}(x,\mu)$ is strictly convex in $\mu$. The term $\int_{\mathcal{Y}} \frac{1}{2}\|u(x) - u(y)\|^2_{\mathcal{H}}\, d\mu(y)$ in (12) is linear in $\mu$, so we focus on the second (variance) term $-\frac{\gamma}{2}\mathrm{Var}(u_\sharp\mu)$. We prove that $\mathrm{Var}(u_\sharp\mu)$ is strictly concave. We derive
$$\mathrm{Var}(u_\sharp\mu) = \frac{1}{2}\int_{\mathcal{Y}}\int_{\mathcal{Y}} \|u(y) - u(y')\|^2_{\mathcal{H}}\, d\mu(y)\, d\mu(y') = \frac{1}{2}\int_{\mathcal{Y}} \|u(y)\|^2_{\mathcal{H}}\, d\mu(y) - \int_{\mathcal{Y}}\int_{\mathcal{Y}} \langle u(y), u(y')\rangle_{\mathcal{H}}\, d\mu(y)\, d\mu(y') + \frac{1}{2}\int_{\mathcal{Y}} \|u(y')\|^2_{\mathcal{H}}\, d\mu(y')$$
$$= \int_{\mathcal{Y}} \|u(y)\|^2_{\mathcal{H}}\, d\mu(y) - \Big\langle \int_{\mathcal{Y}} u(y)\, d\mu(y), \int_{\mathcal{Y}} u(y')\, d\mu(y')\Big\rangle_{\mathcal{H}} = \int_{\mathcal{Y}} \|u(y)\|^2_{\mathcal{H}}\, d\mu(y) - \Big\|\int_{\mathcal{Y}} u(y)\, d\mu(y)\Big\|^2_{\mathcal{H}}. \qquad (51)$$
The first term in (51) is linear in $\mu$, so it suffices to prove that $\mu \mapsto \|\int_{\mathcal{Y}} u(y)\, d\mu(y)\|^2_{\mathcal{H}} = \|u(\mu)\|^2_{\mathcal{H}}$ is strictly convex. To do this, we pick any $\mu_1 \neq \mu_2 \in \mathcal{P}(\mathcal{Y})$ and $\alpha \in (0,1)$. Since the kernel $k$ is characteristic, it holds that $u(\mu_1) \neq u(\mu_2)$. The squared norm $\|\cdot\|^2_{\mathcal{H}}$ is strictly convex and the mean embedding $\mu \mapsto u(\mu)$ is linear in $\mu$. As a result, the following strict inequality holds:
$$\alpha\|u(\mu_1)\|^2_{\mathcal{H}} + (1-\alpha)\|u(\mu_2)\|^2_{\mathcal{H}} > \big\|u\big(\alpha\mu_1 + (1-\alpha)\mu_2\big)\big\|^2_{\mathcal{H}},$$
which yields the strict convexity of $\mu \mapsto \|u(\mu)\|^2_{\mathcal{H}}$. Consequently, $C_{k,\gamma}(x,\mu)$ is strictly convex in $\mu$. In this case, the weak OT functional $\pi \mapsto \int_{\mathcal{X}} C_{k,\gamma}\big(x, \pi(\cdot|x)\big)\, d\mathbb{P}(x)$ is also strictly convex and yields the unique minimizer $\pi^* \in \Pi(\mathbb{P},\mathbb{Q})$, which is the OT plan (Backhoff-Veraguas et al., 2019, §1.3.1).

Proof of Theorem 2. To begin with, we expand the functional $L$:
$$T \in \arg\inf_T L(f^*, T) \;\Longleftrightarrow\; T \in \arg\inf_T \int_{\mathcal{X}} \Big[C\big(x, T_{x\sharp}\mathbb{S}\big) - \int_{\mathcal{Z}} f^*\big(T_x(z)\big)\, d\mathbb{S}(z)\Big]\, d\mathbb{P}(x). \qquad (52)$$
Define $\mu^*_x \stackrel{\text{def}}{=} T^*_{x\sharp}\mathbb{S}$. We are going to prove that $\mu^*_x = \pi^*(y|x)$, where $\pi^* \in \Pi(\mathbb{P},\mathbb{Q})$ is the (unique) OT plan; this yields that $T^*$ is a stochastic OT map. Since the optimization over functions in NOT equals the optimization over the distributions that they generate (Korotin et al., 2023, §4.1), we have
$$\{\mu^*_x\} \in \arg\inf_{\{\mu_x\}} \int_{\mathcal{X}} \Big[C(x, \mu_x) - \int_{\mathcal{Y}} f^*(y)\, d\mu_x(y)\Big]\, d\mathbb{P}(x), \qquad (53)$$
where the inf is taken over collections of distributions $\mu_x \in \mathcal{P}(\mathcal{Y})$ indexed by $x \in \mathcal{X}$. Importantly, (53) can be split into independent problems over $x \in \mathcal{X}$, i.e., we have that
$$\mu^*_x \in \arg\inf_{\mu \in \mathcal{P}(\mathcal{Y})} \Big\{C(x, \mu) - \int_{\mathcal{Y}} f^*(y)\, d\mu(y)\Big\}$$
holds true $\mathbb{P}$-almost surely for all $x \in \mathcal{X}$. Note that the functional $\mu \mapsto C(x,\mu) - \int_{\mathcal{Y}} f^*(y)\, d\mu(y)$ consists of the term $C(x,\mu)$, which is strictly convex in $\mu$ by the proof of Lemma 5, and a linear term (the integral of $f^*$ over $\mu$). Therefore, the functional itself is strictly convex. Since $\pi^*(y|x)$ minimizes this functional, it is the unique solution due to the strict convexity. Therefore, $\mu^*_x = \pi^*(y|x)$ holds true $\mathbb{P}$-almost surely for $x \in \mathcal{X}$, and $T^*_{x\sharp}\mathbb{S} = \mu^*_x = \pi^*(y|x)$, i.e., $T^*$ is a stochastic OT map.

Proof of Proposition 3. Since $\pi^*_{\gamma_1}$ is optimal for the $\gamma_1$-weak cost, for all $\pi \in \Pi(\mathbb{P},\mathbb{Q})$ it holds that $\mathrm{Cost}_{k,\gamma_1}(\pi^*_{\gamma_1}) \le \mathrm{Cost}_{k,\gamma_1}(\pi)$. In particular, for $\pi = \pi^*_{\gamma_2}$ it holds true that
$$\mathrm{Dist}^2_u(\pi^*_{\gamma_1}) - \gamma_1 \mathrm{CVar}_u(\pi^*_{\gamma_1}) \le \mathrm{Dist}^2_u(\pi^*_{\gamma_2}) - \gamma_1 \mathrm{CVar}_u(\pi^*_{\gamma_2}). \qquad (55)$$
Analogously, $\pi^*_{\gamma_2}$ is optimal for the $\gamma_2$-weak cost, and the following holds:
$$\mathrm{Dist}^2_u(\pi^*_{\gamma_2}) - \gamma_2 \mathrm{CVar}_u(\pi^*_{\gamma_2}) \le \mathrm{Dist}^2_u(\pi^*_{\gamma_1}) - \gamma_2 \mathrm{CVar}_u(\pi^*_{\gamma_1}). \qquad (56)$$
We sum (55) and (56) and obtain
$$-\gamma_1 \mathrm{CVar}_u(\pi^*_{\gamma_1}) - \gamma_2 \mathrm{CVar}_u(\pi^*_{\gamma_2}) \le -\gamma_1 \mathrm{CVar}_u(\pi^*_{\gamma_2}) - \gamma_2 \mathrm{CVar}_u(\pi^*_{\gamma_1}),$$
or, equivalently, $(\gamma_2 - \gamma_1)\,\mathrm{CVar}_u(\pi^*_{\gamma_1}) \le (\gamma_2 - \gamma_1)\,\mathrm{CVar}_u(\pi^*_{\gamma_2})$, which is equivalent to $\mathrm{CVar}_u(\pi^*_{\gamma_1}) \le \mathrm{CVar}_u(\pi^*_{\gamma_2})$ since $\gamma_2 - \gamma_1 > 0$. Now we multiply (55) and (56) by $\gamma_2$ and $\gamma_1$, respectively, and sum the resulting inequalities. We obtain
$$\gamma_2\, \mathrm{Dist}^2_u(\pi^*_{\gamma_1}) + \gamma_1\, \mathrm{Dist}^2_u(\pi^*_{\gamma_2}) \le \gamma_2\, \mathrm{Dist}^2_u(\pi^*_{\gamma_2}) + \gamma_1\, \mathrm{Dist}^2_u(\pi^*_{\gamma_1}),$$
or, equivalently, $(\gamma_2 - \gamma_1)\,\mathrm{Dist}^2_u(\pi^*_{\gamma_1}) \le (\gamma_2 - \gamma_1)\,\mathrm{Dist}^2_u(\pi^*_{\gamma_2})$, which provides $\mathrm{Dist}_u(\pi^*_{\gamma_1}) \le \mathrm{Dist}_u(\pi^*_{\gamma_2})$. Finally, we note that
$$\mathrm{Cost}_{k,\gamma_2}(\pi^*_{\gamma_2}) = \inf_{\pi \in \Pi(\mathbb{P},\mathbb{Q})} \mathrm{Cost}_{k,\gamma_2}(\pi) \le \mathrm{Cost}_{k,\gamma_2}(\pi^*_{\gamma_1}) = \mathrm{Cost}_{k,\gamma_1}(\pi^*_{\gamma_1}) - \underbrace{(\gamma_2 - \gamma_1)}_{\ge 0}\,\underbrace{\mathrm{CVar}_u(\pi^*_{\gamma_1})}_{\ge 0} \le \mathrm{Cost}_{k,\gamma_1}(\pi^*_{\gamma_1}), \qquad (57)$$
which concludes the proof.
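As a quick empirical companion to the proof of Lemma 5 (ours, not from the paper), the snippet below checks numerically that the distance-induced kernel $k(x,y) = -\frac{1}{2}\|x-y\|$ used later in Appendix H separates distributions, i.e., its mean embeddings of two different distributions differ. The estimator is the standard biased V-statistic for $\|u(\mu_1) - u(\mu_2)\|^2_{\mathcal{H}}$; all variable names are ours.

```python
import numpy as np

def mmd2_distance_kernel(x, y):
    """Biased sample estimate of ||u(mu1) - u(mu2)||_H^2 for the distance-induced
    kernel k(a, b) = -1/2 ||a - b||. Shapes: x is (n, D), y is (m, D)."""
    def k(a, b):
        return -0.5 * np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=(2000, 2))
q = rng.normal(0.5, 1.0, size=(2000, 2))
print(mmd2_distance_kernel(p, q))                                      # positive: different distributions
print(mmd2_distance_kernel(p, rng.normal(0.0, 1.0, size=(2000, 2))))   # close to zero: same distribution
```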
H WEAK QUADRATIC VS. KERNEL COSTS ON REAL DATA

The goal of this section is to demonstrate that our proposed kernel cost (12) consistently outperforms the weak quadratic cost (3) in the downstream task of unpaired image translation. We consider the distance-induced kernel $k(x,y) = -\frac{1}{2}\|x-y\|$. In short, we run NOT (Korotin et al., 2023, Algorithm 1) multiple times (with various random seeds) with the same hyperparameters (Appendix I) for the weak quadratic and kernel costs, and then we compare the obtained FID ($\mu \pm \sigma$).

Datasets. We consider shoes → handbags and celeba (female) → anime translation. We work only with small 32×32 images to speed up the training and be able to run many experiments.

EXPERIMENT 1. We consider the γ-weak quadratic cost; for each γ ∈ {0.5, 1, 1.5, 2} we run NOT with 5 different random seeds and train it for 60k iterations of $f_\omega$. In this experiment, during training, we evaluate the test FID every 1k iterations, and for each experiment we report 3 FID values: the best FID over iterations 25k-60k⁵, indicating the best the model can achieve; the max FID over iterations 25k-60k, indicating the worst the model achieves (because of potential training instabilities); and the last FID at the end of training (60k), showing what the model actually achieved.

The experimental results in Tables 3, 4 provide several important insights. First, we see that with the increase of the parameter γ, the best FID consistently increases. Second, the overall training becomes less stable: the max FID becomes extremely large, which indicates severe fluctuations of the model. In particular, both the mean and the standard deviation of the last FID (as well as the max FID) drastically increase. Why does this happen? We think that with the increase of γ, the sets $U_\psi(\cdot)$ (11) become larger. These sets determine how much a fake solution may vary (Theorem 1). Thus, this naturally leads to high ambiguity of the solutions and results in unstable and unpredictable behaviour.

EXPERIMENT 2. We pick the highest considered value γ = 2 and show that our γ-weak kernel cost performs better than the weak quadratic cost from the previous experiment. We run NOT with the kernel cost 5 times and report the results in the same Tables 3, 4. The results show that even for high γ the issues with the fluctuations are notably softened. This is seen from the fact that for γ = 2 the gap between the last FID (or the max FID) and the best FID is much smaller for the kernel cost than for the quadratic cost. In particular, this gap is comparable to the gap for the γ = 0.5 weak quadratic cost, for which the sets $U_\psi(\cdot)$ are presumably small and provide less ambiguity to the solutions.

Conclusion. Our empirical evaluation shows that NOT with our proposed kernel costs yields more stable behaviour than NOT with the weak quadratic cost. This agrees with our theory, which suggests that one of the reasons for the unstable behaviour and severe fluctuations might be the existence of fake solutions (§3.1). Our weak kernel cost removes all the fake solutions (§3.2).

⁵We choose 25k iterations as the starting point because at this time FID roughly stabilizes at a small level. This indicates that the model has nearly converged and starts fluctuating around the optimum.
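To make the setup concrete, here is a minimal PyTorch-style sketch (ours, not the paper's code) of how $C_{k,\gamma}(x, T_{x\sharp}\mathbb{S})$ with the distance-induced kernel could be estimated from a minibatch of samples $T_x(z_1),\dots,T_x(z_n)$, using $\|u(a)-u(b)\|^2_{\mathcal{H}} = k(a,a)+k(b,b)-2k(a,b) = \|a-b\|$ and $\mathrm{Var}(u_\sharp\mu) = \frac{1}{2}\iint\|y-y'\|\,d\mu(y)\,d\mu(y')$. The function and variable names are ours, and the $\frac{1}{2}$ normalization of the first term is assumed to match (3).

```python
import torch

def kernel_cost_estimate(x, y_samples, gamma):
    """Minibatch estimate of C_{k,gamma}(x, mu) for the distance-induced kernel
    k(a, b) = -1/2 ||a - b||, where y_samples[i] = T_x(z_i).
    Shapes: x is (D,), y_samples is (n, D)."""
    n = y_samples.shape[0]
    # E_mu ||u(x) - u(y)||_H^2 / 2 = E_mu ||x - y|| / 2
    dist_term = 0.5 * (y_samples - x).norm(dim=1).mean()
    # Var(u # mu) = E ||y - y'|| / 2, estimated by the (biased) V-statistic
    var_term = 0.5 * torch.cdist(y_samples, y_samples).sum() / (n * n)
    return dist_term - 0.5 * gamma * var_term

# toy usage: a point x and n = 4 samples T_x(z) around it, as in the |Z_x| = 4 setting
x = torch.randn(3 * 32 * 32)
y = x + 0.1 * torch.randn(4, 3 * 32 * 32)
print(kernel_cost_estimate(x, y, gamma=2.0))
```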
Method    | Weak quadratic, C2,γ                                        | Weak kernel Ck,γ [Ours]
Setting   | γ = 0.5      | γ = 1         | γ = 1.5      | γ = 2         | γ = 2
Best FID  | 16.08 ± 0.37 | 19.87 ± 0.69  | 23.01 ± 5.16 | 25.36 ± 1.26  | 16.50 ± 0.88
Last FID  | 21.98 ± 3.04 | 33.56 ± 10.49 | 30.35 ± 8.85 | 37.18 ± 8.58  | 20.19 ± 2.36
Max FID   | 28.40 ± 3.18 | 52.03 ± 16.00 | 46.59 ± 9.53 | 67.50 ± 30.76 | 29.85 ± 3.75

Table 3: Test FID (µ ± σ) on shoes → handbags, 32×32, of C2,γ and Ck,γ for different γ. Red color highlights µ > 30, orange color highlights σ > 5.

Method    | Weak quadratic, C2,γ                                        | Weak kernel Ck,γ [Ours]
Setting   | γ = 0.5      | γ = 1         | γ = 1.5      | γ = 2         | γ = 2
Best FID  | 18.88 ± 0.57 | 34.71 ± 6.48  | 24.49 ± 0.64 | 40.49 ± 5.57  | 19.39 ± 1.06
Last FID  | 20.26 ± 1.98 | 36.89 ± 8.19  | 28.49 ± 1.66 | 49.66 ± 5.45  | 20.62 ± 1.66
Max FID   | 29.51 ± 1.45 | 81.78 ± 25.93 | 48.44 ± 5.14 | 76.25 ± 4.26  | 36.86 ± 0.45

Table 4: Test FID (µ ± σ) on celeba (female) → anime, 32×32, of C2,γ and Ck,γ for different γ. Red color highlights µ > 40, orange color highlights σ > 4.

I ADDITIONAL TRAINING DETAILS

Pre-processing. In all the cases, we rescale the RGB channels of images from [0, 1] to [-1, 1]. As in (Korotin et al., 2023), we beforehand rescale the anime face images to 512×512 and do a 256×256 crop with the center located 14 pixels above the image center to get the face. Next, for all the datasets except the describable textures, we resize images to the required size (64×64 or 128×128). Specifically for the describable textures dataset (∼5K textures), we augment the samples: we rescale the input textures to a minimal border size of 300, do a random resized crop (from 128 to 300 pixels) and random horizontal & vertical flips. Then we resize the images to the required size (64×64 or 128×128).

Neural networks. We use the WGAN-QC discriminator's ResNet architecture (Liu et al., 2019) for the potential $f$. We use UNet⁶ (Ronneberger et al., 2015) as the stochastic transport map $T(x, z) = T_x(z)$. To condition it on $z$, we insert conditional instance normalization (CondIN) layers after each of UNet's upscaling blocks⁷. We use CondIN from AugCycleGAN⁸ (Almahairi et al., 2018). In the experiments, $z$ is the 128-dimensional standard Gaussian noise.

Optimization. To learn the stochastic OT maps, we use the NOT algorithm (Korotin et al., 2023, Algorithm 1). We use the Adam optimizer (Kingma & Ba, 2014) with the default betas for both $T_\theta$ and $f_\omega$. The learning rate is lr = 1·10⁻⁴. We use the MultiStepLR scheduler, which decreases lr by 2 after [15k, 25k, 40k, 55k, 70k] iterations of $f_\omega$. The batch size is $|X| = 64$, $|Z_x| = 4$. The number of inner iterations is $k_T = 10$. In the toy experiments, we do 10K total iterations of the $f_\omega$ update. In the image-to-image translation experiments, we observe convergence in 70k iterations for the 128×128 datasets and in 40k iterations for the 64×64 datasets. In image-to-image translation, we gradually change γ: starting from γ = 0, we linearly increase it to the desired value (mostly 1/3) during the first 25K iterations of $f_\omega$.

Computational complexity. NOT with kernel costs for 128×128 images converges in 3-4 days on 4 Tesla V100 GPUs (16 GB). This is slightly longer than the respective time for NOT (Korotin et al., 2023, §6) with the quadratic cost due to the reasons discussed in §3.3.

⁶github.com/milesial/Pytorch-UNet
⁷github.com/kgkgzrtk/cUNet-Pytorch
⁸github.com/ErfanMN/Augmented_CycleGAN_Pytorch
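As an illustration of the optimization setup above, here is a minimal PyTorch sketch (ours, not the released code); the stand-in modules and the choice to attach the learning-rate schedule to both networks are our assumptions.

```python
import torch

# Stand-ins for the transport map T_theta(x, z) and the potential f_omega described above.
T_theta = torch.nn.Linear(10, 10)
f_omega = torch.nn.Linear(10, 1)

opt_T = torch.optim.Adam(T_theta.parameters(), lr=1e-4)   # default betas
opt_f = torch.optim.Adam(f_omega.parameters(), lr=1e-4)

milestones = [15_000, 25_000, 40_000, 55_000, 70_000]      # f-iterations at which lr is halved
sched_T = torch.optim.lr_scheduler.MultiStepLR(opt_T, milestones=milestones, gamma=0.5)
sched_f = torch.optim.lr_scheduler.MultiStepLR(opt_f, milestones=milestones, gamma=0.5)

def weak_gamma(step, gamma_target, warmup_iters=25_000):
    """Linear warm-up of the weak-cost parameter gamma over the first f-iterations."""
    return gamma_target * min(step / warmup_iters, 1.0)
```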
J COMPARISON WITH NOT WITH THE QUADRATIC COST

Figure 15: Unpaired translation with Neural Optimal Transport with various costs (C2,0, C2,γ, Ck,γ): (a) outdoor → church, 128×128; (b) celeba (f) → anime, 128×128; (c) handbags → shoes, 128×128.

Figure 16: Celeba (f) → anime (128×128) translation with NOT with costs (a) C2,0 (Korotin et al., 2023, §5.2), (b) C2,γ (Korotin et al., 2023, §5.3), (c) Ck,γ [Ours].

Figure 17: Outdoor → church (128×128) translation with NOT with costs (a) C2,0 (Korotin et al., 2023, §5.2), (b) C2,γ (Korotin et al., 2023, §5.3), (c) Ck,γ [Ours].

Figure 18: Handbags → shoes (128×128) translation with NOT with costs (a) C2,0 (Korotin et al., 2023, §5.2), (b) C2,γ (Korotin et al., 2023, §5.3), (c) Ck,γ [Ours].

K COMPARISON WITH IMAGE-TO-IMAGE TRANSLATION METHODS

We compare NOT with our kernel costs with principal models (one-to-one and one-to-many) for unpaired image-to-image translation. We consider CycleGAN⁹ (Zhu et al., 2017), AugCycleGAN¹⁰ (Almahairi et al., 2018) and MUNIT¹¹ (Huang et al., 2018) for comparison. We use the official or community implementations with the hyperparameters from the respective papers. We consider the outdoor → church and texture → shoes dataset pairs (128×128). The FID scores are given in Table 5. Qualitative examples are shown in Figures 20, 19.

Method             | One-to-one                                      | One-to-many
Datasets (128×128) | CycleGAN (with L1 loss) | CycleGAN (no L1 loss) | AugCycleGAN  | MUNIT        | NOT with Ck,γ (Ours)
Outdoor → church   | 43.74                   | 36.16                 | 51.15 ± 0.19 | 32.14 ± 0.18 | 15.16 ± 0.03
Texture → shoes    | 34.65 ± 0.12            | 50.95 ± 0.12          | N/A          | 43.74 ± 0.16 | 24.84 ± 0.09

Table 5: Test FID of the considered image-to-image translation methods. We do not include the results of AugCycleGAN on texture → shoes as it did not converge on these datasets (FID ≥ 100). We tried tuning its hyperparameters, but this did not yield improvement.

⁹github.com/eriklindernoren/PyTorch-GAN/tree/master/implementations/cyclegan
¹⁰github.com/aalmah/augmented_cyclegan
¹¹github.com/NVlabs/MUNIT

Figure 19: Texture → shoes (128×128) translation with various methods: (a) CycleGAN, (b) CycleGAN (no identity loss), (d) NOT with Ck,γ [Ours].

Figure 20: Outdoor → church (128×128) translation with various methods: (a) CycleGAN (no identity loss), (b) CycleGAN, (d) AugCycleGAN, (e) NOT with Ck,γ [Ours].

L ADDITIONAL EXPERIMENTAL RESULTS

Figures 21-26 provide additional examples; each figure shows (a) input images x and random translated examples Tx(z), and (b) interpolation in the conditional latent space, z = (1 − α)z1 + αz2.

Figure 21: Celeba (female) → anime translation, 128×128. Additional examples.

Figure 22: Outdoor → church, 128×128. Additional examples.

Figure 23: Texture → shoes translation, 128×128. Additional examples.

Figure 24: Texture → handbags translation, 128×128. Additional examples.

Figure 25: Shoes → handbags translation, 128×128. Additional examples.
Figure 26: Handbags → shoes translation, 128×128. Additional examples.