# Light Unbalanced Optimal Transport

Milena Gazdieva (Skolkovo Institute of Science and Technology; Artificial Intelligence Research Institute, Moscow, Russia), milena.gazdieva@skoltech.ru
Arip Asadulaev (ITMO University; Artificial Intelligence Research Institute, Moscow, Russia), asadulaev@airi.net
Evgeny Burnaev (Skolkovo Institute of Science and Technology; Artificial Intelligence Research Institute, Moscow, Russia), e.burnaev@skoltech.ru
Alexander Korotin (Skolkovo Institute of Science and Technology; Artificial Intelligence Research Institute, Moscow, Russia), a.korotin@skoltech.ru

Abstract. While the continuous Entropic Optimal Transport (EOT) field has been actively developing in recent years, it has become evident that the classic EOT problem is prone to several issues, such as sensitivity to outliers and to the imbalance of classes in the source and target measures. This fact inspired the development of solvers for the unbalanced EOT (UEOT) problem, a generalization of EOT which mitigates the mentioned issues by relaxing the marginal constraints. Surprisingly, it turns out that the existing solvers are either based on heuristic principles or are heavyweight, with complex optimization objectives involving several neural networks. We address this challenge and propose a novel theoretically-justified, lightweight, unbalanced EOT solver. Our advancement consists of developing a novel view on the optimization of the UEOT problem which yields a tractable, non-minimax optimization objective. We show that, combined with a light parametrization recently proposed in the field, our objective leads to a fast, simple, and effective solver which allows solving the continuous UEOT problem in minutes on CPU. We prove that our solver provides a universal approximation of UEOT solutions and obtain its generalization bounds. We give illustrative examples of the solver's performance. The code is publicly available at https://github.com/milenagazdieva/LightUnbalancedOptimalTransport

1 Introduction

Computational optimal transport (OT) has proven to be a powerful tool for solving various popular tasks, e.g., image-to-image translation [68, 17, 50, 28], image generation [67, 13, 7] and biological data transfer [5, 42, 66]. Historically, the majority of early works in the field were built upon solving the OT problem between discrete probability measures [10, 53]. Only recently have advances in the field of generative models led to explosive interest from the ML community in developing continuous OT solvers, see [38] for a survey. The setup of this problem assumes that the learner needs to estimate the OT plan between continuous measures given only empirical samples of data from them. Due to convexity-related issues of the OT problem [40], many works consider the EOT problem, i.e., use entropy regularizers which guarantee, e.g., the uniqueness of the learned plans.

Meanwhile, researchers draw attention to other shortcomings of the classic OT problem. It enforces hard constraints on the marginal measures and, thus, does not allow for mass variations. As a result, OT shows high sensitivity to an imbalance of classes and outliers in the source and target measures [4], which are almost inevitable for large-scale datasets. To overcome these issues, it is common to consider extensions of the OT problem, e.g., unbalanced OT/EOT (UOT/UEOT) [8, 43].
The unbalanced OT/EOT formulations allow for variation of the total mass by relaxing the marginal constraints through the use of divergences. The scope of our paper is the continuous UEOT problem. It seems that in this field, a solver that is fast, light, and theoretically justified has not yet been developed. Indeed, many of the existing solvers follow heuristic principles and are based on the solutions of discrete OT. For example, [45] uses a regression to interpolate the discrete solutions, and [16, 36] build flow matching upon them. Almost all of the other solvers [9, 70] employ several neural networks with many hyper-parameters and require time-consuming optimization procedures. We address these shortcomings by introducing a novel lightweight solver that can play the role of a simple baseline for unbalanced EOT.

Contributions. We develop a novel lightweight solver to estimate continuous unbalanced EOT couplings between probability measures (§4). Our solver has a non-minimax optimization objective and employs a Gaussian mixture parametrization for the UEOT plans. We provide generalization bounds for our solver (§4.4) and experimentally test it on several tasks (§5.1, §5.2).

Notations. We work in the Euclidean space $(\mathbb{R}^d,\|\cdot\|)$. We use $\mathcal{P}_{2,ac}(\mathbb{R}^d)$ to denote the set of absolutely continuous (a.c.) Borel probability measures on $\mathbb{R}^d$ with finite second moment and differential entropy. The set of nonnegative measures on $\mathbb{R}^d$ with finite second moment is denoted by $\mathcal{M}_{2,+}(\mathbb{R}^d)$. We use $\mathcal{C}_2(\mathbb{R}^d)$ to denote the space of all continuous functions $\zeta:\mathbb{R}^d\to\mathbb{R}$ for which there exist $a=a(\zeta)$, $b=b(\zeta)$ such that $\forall x\in\mathbb{R}^d: |\zeta(x)|\le a+b\|x\|^2$. Its subspace of functions which are additionally bounded from above is denoted by $\mathcal{C}_{2,b}(\mathbb{R}^d)$. For an a.c. measure $p\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$ (or $\mathcal{M}_{2,+}(\mathbb{R}^d)$), we use $p(x)$ to denote its density at a point $x\in\mathbb{R}^d$. For a given a.c. measure $\gamma\in\mathcal{M}_{2,+}(\mathbb{R}^d\times\mathbb{R}^d)$, we denote its total mass by $\|\gamma\|_1\stackrel{def}{=}\int_{\mathbb{R}^d\times\mathbb{R}^d}\gamma(x,y)\,dx\,dy$. We use $\gamma_x(x)$, $\gamma_y(y)$ to denote the marginals of $\gamma(x,y)$. They satisfy the equality $\|\gamma_x\|_1=\|\gamma_y\|_1=\|\gamma\|_1$. We write $\gamma(\cdot|x)$ to denote the conditional probability measure; each such measure has unit total mass. We use $f^*$ to denote the Fenchel conjugate of a function $f$: $f^*(t)\stackrel{def}{=}\sup_{u\in\mathbb{R}}\{ut-f(u)\}$. We use $I_A$ to denote the convex indicator of a set $A$, i.e., $I_A(x)=0$ if $x\in A$ and $I_A(x)=+\infty$ if $x\notin A$.

2 Background

Here we give an overview of the relevant entropic optimal transport (EOT) concepts. For additional details on balanced EOT, we refer to [10, 24, 53], and for unbalanced EOT to [8, 43].

f-divergences for positive measures. For positive measures $\mu_1,\mu_2\in\mathcal{M}_{2,+}(\mathbb{R}^d)$ and a lower semi-continuous function $f:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$, the f-divergence between $\mu_1,\mu_2$ is defined by
$D_f(\mu_1\|\mu_2)\stackrel{def}{=}\int_{\mathbb{R}^d} f\Big(\frac{\mu_1(x)}{\mu_2(x)}\Big)\mu_2(x)\,dx$ if $\mu_1\ll\mu_2$, and $+\infty$ otherwise.
We consider generators $f(t)$ which are convex, non-negative and attain zero uniquely at $t=1$. In this case, $D_f$ is a valid measure of dissimilarity between two positive measures (see Appendix C for details). This means that $D_f(\mu_1\|\mu_2)\ge 0$ and $D_f(\mu_1\|\mu_2)=0$ if and only if $\mu_1=\mu_2$. In our paper, we also assume that every $f$ we consider is such that $f^*$ is a non-decreasing function. The Kullback-Leibler divergence $D_{KL}$ [8, 62] is a particular case of such an f-divergence for positive measures. It has the generator function $f_{KL}(t)\stackrel{def}{=}t\log t-t+1$, whose convex conjugate is $f_{KL}^*(t)=\exp(t)-1$. Another example is the $\chi^2$-divergence $D_{\chi^2}$, which is generated by $f_{\chi^2}(t)\stackrel{def}{=}(t-1)^2$ when $t\ge 0$ and $+\infty$ otherwise. The convex conjugate of this function is $f_{\chi^2}^*(t)=-1$ if $t<-2$ and $\frac{1}{4}t^2+t$ when $t\ge-2$.
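The two generator/conjugate pairs above are easy to check numerically. The following sketch (ours, not part of the paper's code) brute-forces $f^*(t)=\sup_{u}\{ut-f(u)\}$ on a grid over the effective domain of the generators and compares it with the closed-form conjugates quoted above.

```python
# A quick numerical sanity check of f_KL, f_chi2 and their convex conjugates.
import numpy as np

def f_kl(t):                       # f_KL(t) = t*log(t) - t + 1 (with f_KL(0) = 1)
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t * np.log(np.maximum(t, 1e-300)) - t + 1.0,
                    np.where(t == 0, 1.0, np.inf))

def f_kl_conj(t):                  # f_KL*(t) = exp(t) - 1
    return np.exp(np.asarray(t, dtype=float)) - 1.0

def f_chi2(t):                     # f_chi2(t) = (t - 1)^2 for t >= 0, +inf otherwise
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, (t - 1.0) ** 2, np.inf)

def f_chi2_conj(t):                # f_chi2*(t) = -1 if t < -2, t^2/4 + t otherwise
    t = np.asarray(t, dtype=float)
    return np.where(t < -2.0, -1.0, 0.25 * t ** 2 + t)

def conj_numeric(f, t, u_grid=np.linspace(0.0, 50.0, 200_001)):
    # brute-force sup_u {t*u - f(u)} over the (effective) domain u >= 0
    return np.max(t[:, None] * u_grid[None, :] - f(u_grid)[None, :], axis=1)

t = np.linspace(-4.0, 2.0, 25)
assert np.allclose(conj_numeric(f_kl, t), f_kl_conj(t), atol=1e-3)
assert np.allclose(conj_numeric(f_chi2, t), f_chi2_conj(t), atol=1e-3)
print("closed-form conjugates match the brute-force supremum")
```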
Remark. By their definition, the convex conjugates of $f_{KL}$ and $f_{\chi^2}$ are proper, non-negative and non-decreasing. These properties are used in some of our theoretical results.

Entropy for positive measures. For $\mu\in\mathcal{M}_{2,+}(\mathbb{R}^d)$, its entropy [8] is given by
$H(\mu)\stackrel{def}{=}-\int_{\mathbb{R}^d}\mu(x)\log\mu(x)\,dx+\|\mu\|_1$ if $\mu$ is a.c., and $-\infty$ otherwise. (1)
When $\mu\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$, i.e., $\|\mu\|_1=1$, equation (1) coincides with the usual differential entropy up to an additive constant 1.

Classic EOT formulation (with the quadratic cost). Consider two probability measures $p\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$, $q\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$. For $\varepsilon>0$, the EOT problem between $p$ and $q$ is
$\min_{\pi\in\Pi(p,q)}\int_{\mathbb{R}^d\times\mathbb{R}^d}\frac{\|x-y\|^2}{2}\,\pi(x,y)\,dx\,dy-\varepsilon H(\pi),$ (2)
where $\Pi(p,q)$ is the set of probability measures $\pi\in\mathcal{P}_{2,ac}(\mathbb{R}^d\times\mathbb{R}^d)$ with marginals $p,q$ (transport plans). The plan $\pi^*$ attaining the minimum exists; it is unique and is called the EOT plan. Classic EOT imposes hard constraints on the marginals, which leads to several issues, e.g., sensitivity to outliers [4] and inability to handle measure shifts such as class imbalances in the measures $p,q$. The UEOT problem [70, 9] overcomes these issues by relaxing the marginal constraints [62].

Unbalanced EOT formulation (with the quadratic cost). Let $D_{f_1}$ and $D_{f_2}$ be two f-divergences over $\mathbb{R}^d$. For two probability measures $p\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$, $q\in\mathcal{P}_{2,ac}(\mathbb{R}^d)$ and $\varepsilon>0$, the unbalanced EOT problem between $p$ and $q$ consists of finding a minimizer of
$\inf_{\gamma\in\mathcal{M}_{2,+}(\mathbb{R}^d\times\mathbb{R}^d)}\Big[\int_{\mathbb{R}^d\times\mathbb{R}^d}\frac{\|x-y\|^2}{2}\,\gamma(x,y)\,dx\,dy-\varepsilon H(\gamma)+D_{f_1}(\gamma_x\|p)+D_{f_2}(\gamma_y\|q)\Big],$ (3)
(Figure 1: Unbalanced EOT problem.)
see Fig. 1. Here the minimum is attained at a unique $\gamma^*$ which is called the unbalanced optimal entropic (UEOT) plan. Typical choices of $f_i$ ($i\in\{1,2\}$) are $f_i(t)=\tau_i f_{KL}(t)$ or $f_i(t)=\tau_i f_{\chi^2}(t)$ ($\tau_i>0$), yielding the scaled $D_{KL}$ and $D_{\chi^2}$, respectively. In this case, the bigger $\tau_1$ ($\tau_2$) is, the more $\gamma_x$ ($\gamma_y$) is penalized for not matching the corresponding marginal distribution $p$ ($q$).

Remark 1. While the set $\mathcal{M}_{2,+}(\mathbb{R}^d\times\mathbb{R}^d)$ contains not only a.c. measures, the infimum in problem (3) is attained at an a.c. measure $\gamma^*$, since the $-\varepsilon H(\gamma)$ term turns to $+\infty$ otherwise. Thus, almost everywhere in the paper we are actually interested in the subset of a.c. measures in $\mathcal{M}_{2,+}(\mathbb{R}^d\times\mathbb{R}^d)$.

Remark 2. The balanced EOT problem (2) is a special case of (3). Indeed, let $f_1$ and $f_2$ be the convex indicators of $\{1\}$, i.e., $f_1=f_2=I_{\{1\}}$. Then the f-divergences $D_{f_1}(\gamma_x\|p)$ and $D_{f_2}(\gamma_y\|q)$ become infinite if $p\neq\gamma_x$ or $q\neq\gamma_y$, and vanish otherwise.

Dual form of the unbalanced EOT problem (3) is formulated as follows:
$\sup_{(\varphi,\psi)}\Big[-\varepsilon\int_{\mathbb{R}^d\times\mathbb{R}^d}\exp\Big(\frac{\varphi(x)+\psi(y)-\|x-y\|^2/2}{\varepsilon}\Big)dx\,dy-\int_{\mathbb{R}^d}f_1^*(-\varphi(x))p(x)\,dx-\int_{\mathbb{R}^d}f_2^*(-\psi(y))q(y)\,dy\Big].$ (4)
It is known that there exist two measurable functions $\varphi^*,\psi^*$ delivering the maximum in (4). They have the following connection with the solution of the primal unbalanced problem (3):
$\gamma^*(x,y)=\exp\Big\{\frac{\varphi^*(x)}{\varepsilon}\Big\}\exp\Big\{-\frac{\|x-y\|^2}{2\varepsilon}\Big\}\exp\Big\{\frac{\psi^*(y)}{\varepsilon}\Big\}.$ (5)

Remark. For some of our results, we will need to restrict the potentials $(\varphi,\psi)$ in problem (4) to the space $\mathcal{C}_{2,b}(\mathbb{R}^d)\times\mathcal{C}_{2,b}(\mathbb{R}^d)$. Since established variants of dual forms [8] typically correspond to other functional spaces, we derive and theoretically justify this variant of the dual problem (4) in Appendix A.3. Note that it may hold that the optimal potentials $\psi^*,\varphi^*\notin\mathcal{C}_{2,b}(\mathbb{R}^d)$, i.e., the supremum is not achieved within the considered spaces. This is not essential for our subsequent derivations.

Computational UEOT setup. An analytical solution of the unbalanced EOT problem is, in general, not known.1 Moreover, in real-world setups where unbalanced EOT is applicable, the measures $p,q$ are typically not available explicitly but only through their empirical samples (datasets); for reference, the discrete UEOT problem between such samples can be solved as sketched below.
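The following small sketch is ours (it uses the POT library [21], which the paper cites, but it is not the paper's code): an unbalanced Sinkhorn plan between two empirical samples with the quadratic cost and KL marginal penalties. It returns a plan supported only on the given samples, which is exactly the limitation the continuous setup below is meant to overcome; the exact normalization of POT's `reg_m` relative to the $\tau\,f_{KL}$ penalties in (3) is an assumption here.

```python
# Discrete UEOT between two empirical samples via POT's unbalanced Sinkhorn.
import numpy as np
import ot  # POT: Python Optimal Transport [21]

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))            # samples from p
y = rng.normal(size=(200, 2)) + 2.0      # samples from q

a = np.full(300, 1.0 / 300)              # uniform empirical weights
b = np.full(200, 1.0 / 200)
M = ot.dist(x, y) / 2.0                  # quadratic cost ||x - y||^2 / 2

eps, tau = 0.05, 1.0                     # entropic regularization, KL marginal penalty
gamma = ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg=eps, reg_m=tau)

print(gamma.shape)   # (300, 200): a plan supported on the train samples only
print(gamma.sum())   # total mass; need not equal 1 in the unbalanced setting
```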
We assume that data measures p, q P2,ac(Rd) are unknown and accessible only by a limited number of i.i.d. empirical samples {x0, ..., x N} p, {y0, ..., y M} q. We aim to approximate the optimal UEOT plan γ solving (3) between the entire measures p, q. The recovered plans should allow the out-of-sample estimation, i.e., generation of samples from γ ( |xnew) where xnew is a new test point (not necessarily present in the train data). Optionally, one may require the ability to sample from γ x. 1Analytical solutions are known only for some specific cases. For example, [33] consider Gaussian measures and UEOT problem with DKL divergence instead of the differential entropy. This setup differs from ours. Solver Problem Principles What recovers? Limitations [70] UOT Solves c-transform based semi-dual max-min reformulation of UOT using neural nets Scaling factor γ (x)/p(x) and stochastic OT map T (x, z) Complex max-min objective; 3 neural networks [45] Custom UOT Regression on top of discrete EOT between re-balanced measures combined with ICNN-based solver [47] Scaling factors and OT maps between re-scaled measures Heuristically uses minibatch OT approximations [9] UOT Solves semi-dual max-min reformulation of UOT using neural nets Stochastic UOT map T (x, z) Complex max-min objective; 2 neural networks [16] UEOT Flow Matching on top of discrete UEOT using neural nets Parametrized vector field (vt,θ)t [0,1] to transport the mass Heuristically uses minibatch OT approximations [36] UEOT Conditional Flow Matching on top of discrete EOT between re-balanced measures using neural nets Scaling factors and parametrized conditional vector field (vt,θ)t [0,1] to transport the mass between re-scaled measures Heuristically uses minibatch OT approximations U-Light OT (ours) UEOT Solves non-minimax reformulation of dual UEOT using Gaussian Mixtures Density of UEOT plan γ together with light procedure to sample x γ x( ) and y γ y ( |x) Restricted to Gaussian Mixture parametrization Table 1: Comparison of the principles of existing UOT/UEOT solvers and our proposed light solver. The described setup is typically called the continuous EOT and should not be twisted up with the discrete EOT [53, 10]. There the aim is to recover the (unbalanced) EOT plan between the empirical measures ˆp= 1 N PN i=1δxi, ˆq= 1 M PM j=1δyj and out-of-sample estimations are typically not needed. 3 Related Work Nowadays, the sphere of continuous OT/EOT solvers is actively developing. Some of the early works related to this topic utilize OT cost as the loss function [27, 25, 2]. These approaches are not relevant to us as they do not learn OT/EOT maps (or plans). We refer to [39] for a detailed review. At the same time, there exist a large amount of works within the discrete OT/EOT setup [10, 15, 69, 51], see [53] for a survey. We again emphasize that solvers of this type are not relevant to us as they construct discrete matching between the given (train) samples and typically do not provide a generalization to the new unseen (test) data. Only recently ML community started developing out-ofsample estimation procedures based on discrete/batched OT. For example, [19, 56, 32, 14, 48, 57] mostly develop such estimators using the barycentric projections of the discrete EOT plans. Though these procedures have nice theoretical properties, their scalability remains unclear. Balanced OT/EOT solvers. There exists a vast amount of neural solvers for continuous OT problem. 
Most of them learn the OT maps (or plans) via solving saddle point optimization problems [3, 18, 37, 22, 60, 50]. Though the recent works [28, 61, 11, 41, 30] tackle the EOT problem (2), they consider its balanced version. Hence they are not relevant to us. Among these works, only [41, 30] evade non-trivial training/inference procedures and are ideologically the closest to ours. The difference between them consists of the particular loss function used. In fact, our paper proposes the solver which subsumes these solvers and generalizes them for the unbalanced case. The derivation of our solver is non-trivial and requires solid mathematical apparatus, see M4. Unbalanced OT/EOT solvers. A vast majority of early works in this field tackle the discrete UOT/UEOT setup [6, 20, 54] but the principles behind their construction are not easy to generalize to the continuous setting. Thus, many of the recent papers that tackle the continuous unbalanced OT/EOT setup employ discrete solutions in the construction of their solvers. For example, [45] regress neural network on top of scaling factors obtained using the discrete UEOT while simultaneously learning the continuous OT maps using an ICNN method [47]. In [16] and [36], the authors implement Flow Matching [44, FM] and conditional FM on top of the discrete UEOT plans, respectively. The algorithm of the latter consists of regressing neural networks on top of scaling factors and simultaneously learning a conditional vector field to transport the mass between re-balanced measures. Despite the promising practical performance of these solvers, it is still unclear to what extent such approximations of UEOT plans are theoretically justified. The recent papers [70, 9] are more related to our study as they do not rely on discrete OT approximations of the transport plan. However, they have non-trivial minimax optimization objectives solved using complex GAN-style procedures. Thus, these GANs often lean on heavy neural parametrization, may incur instabilities during training, and require careful hyperparameter selection [46]. For completeness, we also mention other papers which are only slightly relevant to us. For example, [23] considers incomplete OT which relaxes only one of the OT marginal constraints and is less general than the unbalanced OT. Other works [12, 4] incorporate unbalanced OT into the training objective of GANs aimed at generating samples from noise. In contrast to the listed works, our paper proposes a theoretically justified and lightweight solver to the UEOT problem, see Table 1 for the detailed comparison of solvers. 4 Light Unbalanced OT Solver In this section, we derive the optimization objective (M4.1) of our U-Light OT solver, present practical aspects of training and inference procedures (M4.2) and derive the generalization bounds for our solver (M4.4). We provide the proofs of all theoretical results in Appendix A. 4.1 Derivation of the Optimization Objective Following the learning setup described above, we aim to get a parametric approximation γθ,w of the UEOT plan γ . Here θ, ω are the model parameters to learn, and it will be clear later why we split them into two groups. To recover γθ,ω γ , our aim is to learn θ, ω by directly minimizing the DKL divergence between γθ,w and γ : DKL (γ γθ,w) min (θ,w) . (6) The main difficulty of this optimizing objective (6) is obvious: the UEOT plan γ is unknown. Fortunately, below we show that one still can optimize (6) without knowing γ . Recall that the optimal UEOT plan γ has the form (5). 
We first make a change of variables. Specifically, we define $v^*(y)\stackrel{def}{=}\exp\big\{\frac{2\psi^*(y)-\|y\|^2}{2\varepsilon}\big\}$. Formula (5) now reads
$\gamma^*(x,y)=\exp\Big\{\frac{2\varphi^*(x)-\|x\|^2}{2\varepsilon}\Big\}\exp\Big\{\frac{\langle x,y\rangle}{\varepsilon}\Big\}v^*(y).$ (7)
Since the conditional plan has unit mass, we may write
$\gamma^*(y|x)=\frac{\exp\{\langle x,y\rangle/\varepsilon\}\,v^*(y)}{c_{v^*}(x)},$ (8)
where $c_{v^*}(x)\stackrel{def}{=}\int_{\mathbb{R}^d}\exp\{\langle x,y\rangle/\varepsilon\}v^*(y)\,dy$ is the normalization constant ensuring that $\int_{\mathbb{R}^d}\gamma^*(y|x)\,dy=1$. Consider the decomposition $\gamma^*(x,y)=\gamma^*_x(x)\gamma^*(y|x)$. It shows that to obtain a parametrization of the entire plan $\gamma^*(x,y)$, it is sufficient to consider parametrizations of its left marginal $\gamma^*_x$ and the conditional measure $\gamma^*(y|x)$. Meanwhile, equation (8) shows that the conditional measures $\gamma^*(\cdot|x)$ are entirely described by the variable $v^*$. We use these observations to parametrize $\gamma_{\theta,w}$. We set
$\gamma_{\theta,w}(x,y)\stackrel{def}{=}u_\omega(x)\,\gamma_\theta(y|x)=u_\omega(x)\,\frac{\exp\{\langle x,y\rangle/\varepsilon\}\,v_\theta(y)}{c_\theta(x)},$ (9)
where $u_\omega$ and $v_\theta$ parametrize the marginal measure $\gamma^*_x$ and the variable $v^*$, respectively. In turn, the constant $c_\theta(x)\stackrel{def}{=}\int_{\mathbb{R}^d}\exp\{\langle x,y\rangle/\varepsilon\}v_\theta(y)\,dy$ is the parametrization of $c_{v^*}(x)$. Next, we demonstrate our main result, which shows that the optimization of (6) can be done without access to $\gamma^*$.

Theorem 4.1 (Tractable form of DKL minimization). Assume that $\gamma$ is parametrized using (9). Then the following bound holds: $\varepsilon D_{KL}(\gamma^*\|\gamma_{\theta,w})\le\mathcal{L}(\theta,w)-\mathcal{L}^*$, where
$\mathcal{L}(\theta,w)\stackrel{def}{=}\int_{\mathbb{R}^d}f_1^*\Big(-\varepsilon\log\frac{u_\omega(x)}{c_\theta(x)}-\frac{\|x\|^2}{2}\Big)p(x)\,dx+\int_{\mathbb{R}^d}f_2^*\Big(-\varepsilon\log v_\theta(y)-\frac{\|y\|^2}{2}\Big)q(y)\,dy+\varepsilon\|u_\omega\|_1,$ (10)
and the constant $(-\mathcal{L}^*)$ is the optimal value of the dual form (4). The bound is tight in the sense that it turns into $0=0$ when $v_\theta(y)=\exp\{(2\psi^*(y)-\|y\|^2)/2\varepsilon\}$ and $u_\omega(x)=c_\theta(x)\exp\{(2\varphi^*(x)-\|x\|^2)/2\varepsilon\}$.

In fact, (10) is the dual form (4) but with the potentials $(\varphi,\psi)$ expressed through $u_\omega,v_\theta$ (and $c_\theta$):
$\varphi(x)\equiv\varphi_{\theta,\omega}(x)=\varepsilon\log\frac{u_\omega(x)}{c_\theta(x)}+\frac{\|x\|^2}{2},\qquad\psi(y)\equiv\psi_\theta(y)=\varepsilon\log v_\theta(y)+\frac{\|y\|^2}{2}.$
Our result can be interpreted as a bound on the quality of the approximate solution $\gamma_{\theta,\omega}$ of (3) recovered from an approximate solution of the dual problem (4). It could be proved directly in the original $(\varphi,\psi)$ notation of the dual problem, but we use $(u_\omega,v_\theta)$ instead, as with this change of variables the form of the recovered plan $\gamma_{\theta,\omega}$ is more interpretable ($u_w$ defines the first marginal, $v_\theta$ the conditionals). Instead of optimizing (6) to get $\gamma_{\theta,\omega}$, we may optimize the upper bound (10), which is more tractable. Indeed, (10) is a sum of expectations w.r.t. the probability measures $p,q$. We can obtain a Monte-Carlo estimate of (10) from random samples and optimize it with a stochastic gradient descent procedure w.r.t. $(\theta,\omega)$. The main challenge here is the computation of the normalizing constant $c_\theta$ and the term $\|u_\omega\|_1$. Below we propose a smart parametrization for which both quantities can be derived analytically.

4.2 Parametrization and the Optimization Procedure

Parametrization. Recall that $u_\omega$ parametrizes the density of the marginal $\gamma^*_x$, which is unnormalized. Setting $x=0$ in equation (7), we get $\gamma^*(y|x=0)\propto v^*(y)$, which means that $v^*$ also corresponds to an unnormalized density of a measure. This motivates us to use the unnormalized Gaussian mixture parametrization for the potential $v_\theta(y)$ and the measure $u_\omega(x)$:
$v_\theta(y)=\sum_{k=1}^{K}\alpha_k\mathcal{N}(y|r_k,\varepsilon S_k);\qquad u_\omega(x)=\sum_{l=1}^{L}\beta_l\mathcal{N}(x|\mu_l,\varepsilon\Sigma_l).$ (11)
Here $\theta\stackrel{def}{=}\{\alpha_k,r_k,S_k\}_{k=1}^{K}$, $w\stackrel{def}{=}\{\beta_l,\mu_l,\Sigma_l\}_{l=1}^{L}$ with $\alpha_k,\beta_l\ge 0$, $r_k,\mu_l\in\mathbb{R}^d$ and $0\prec S_k,\Sigma_l\in\mathbb{R}^{d\times d}$. The covariance matrices are scaled by $\varepsilon$ just for convenience. For this type of parametrization, it holds that $\|u_\omega\|_1=\sum_{l=1}^{L}\beta_l$. Moreover, there exist closed-form expressions for the normalization constant $c_\theta(x)$ and the conditional plan $\gamma_\theta(y|x)$, see [41, Proposition 3.2]. Specifically, define $r_k(x)\stackrel{def}{=}r_k+S_kx$ and $\widetilde{\alpha}_k(x)\stackrel{def}{=}\alpha_k\exp\big\{\frac{x^TS_kx+2r_k^Tx}{2\varepsilon}\big\}$. It holds that
$c_\theta(x)=\sum_{k=1}^{K}\widetilde{\alpha}_k(x);\qquad\gamma_\theta(y|x)=\frac{1}{c_\theta(x)}\sum_{k=1}^{K}\widetilde{\alpha}_k(x)\,\mathcal{N}(y|r_k(x),\varepsilon S_k).$ (12)
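The following sketch (ours, not the authors' released implementation) shows how (11)-(12) can be computed with diagonal matrices $S_k$: the log-weights $\log\widetilde{\alpha}_k(x)$, the closed-form normalization $c_\theta(x)$, and sampling $y\sim\gamma_\theta(\cdot|x)$ from the resulting Gaussian mixture. The class and parameter names are ours; an analogous module with parameters $\{\beta_l,\mu_l,\Sigma_l\}$ plays the role of $u_\omega$.

```python
# Minimal implementation sketch of the mixture parametrization (11)-(12).
import torch

class MixtureParams(torch.nn.Module):
    def __init__(self, K, dim, eps):
        super().__init__()
        self.eps = eps
        self.log_alpha = torch.nn.Parameter(torch.zeros(K))      # weights alpha_k > 0
        self.r = torch.nn.Parameter(torch.randn(K, dim))         # means r_k
        self.log_S = torch.nn.Parameter(torch.zeros(K, dim))     # diagonals of S_k > 0

    def log_alpha_tilde(self, x):
        # log alpha~_k(x) = log alpha_k + (x^T S_k x + 2 r_k^T x) / (2 eps), see (12)
        S = self.log_S.exp()                                     # (K, d)
        quad = (x ** 2)[:, None, :] * S[None, :, :]              # x^T S_k x (diagonal S_k)
        lin = 2.0 * x[:, None, :] * self.r[None, :, :]           # 2 r_k^T x
        return self.log_alpha[None, :] + (quad + lin).sum(-1) / (2.0 * self.eps)

    def log_c(self, x):
        # log c_theta(x) = logsumexp_k log alpha~_k(x)
        return torch.logsumexp(self.log_alpha_tilde(x), dim=-1)

    @torch.no_grad()
    def sample_conditional(self, x):
        # y ~ gamma_theta(.|x): pick component k with prob. alpha~_k(x)/c_theta(x),
        # then draw from N(r_k + S_k x, eps * S_k).
        w = torch.softmax(self.log_alpha_tilde(x), dim=-1)       # (B, K)
        k = torch.multinomial(w, 1).squeeze(-1)                  # (B,)
        S = self.log_S.exp()[k]                                  # (B, d)
        mean = self.r[k] + S * x
        return mean + (self.eps * S).sqrt() * torch.randn_like(x)

theta = MixtureParams(K=10, dim=2, eps=0.05)   # v_theta; an analogous module gives u_omega
x = torch.randn(4, 2)
print(theta.log_c(x).shape, theta.sample_conditional(x).shape)
```

Picking a component with weights $\widetilde{\alpha}_k(x)/c_\theta(x)$ and drawing one Gaussian sample is all that inference requires, which is why sampling with this parametrization is lightweight.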
Using this result and (11), we get the expression for $\gamma_{\theta,w}$:
$\gamma_{\theta,w}(x,y)=u_\omega(x)\,\gamma_\theta(y|x)=\underbrace{\sum_{l=1}^{L}\beta_l\mathcal{N}(x|\mu_l,\varepsilon\Sigma_l)}_{u_\omega(x)}\cdot\underbrace{\frac{\sum_{k=1}^{K}\widetilde{\alpha}_k(x)\,\mathcal{N}(y|r_k(x),\varepsilon S_k)}{\sum_{k=1}^{K}\widetilde{\alpha}_k(x)}}_{\gamma_\theta(y|x)}.$ (13)

Training. We recall that the measures $p,q$ are accessible only through empirical samples (see the learning setup in §2). Thus, given samples $\{x_1,\dots,x_N\}$ and $\{y_1,\dots,y_M\}$, we optimize the empirical analog of (10):
$\widehat{\mathcal{L}}(\theta,w)\stackrel{def}{=}\frac{1}{N}\sum_{i=1}^{N}f_1^*\Big(-\varepsilon\log\frac{u_\omega(x_i)}{c_\theta(x_i)}-\frac{\|x_i\|^2}{2}\Big)+\frac{1}{M}\sum_{j=1}^{M}f_2^*\Big(-\varepsilon\log v_\theta(y_j)-\frac{\|y_j\|^2}{2}\Big)+\varepsilon\|u_\omega\|_1$ (14)
using a minibatch gradient descent procedure w.r.t. the parameters $(\theta,w)$. In the parametrization of $v_\theta$ and $u_\omega$ (11), we utilize diagonal matrices $S_k,\Sigma_l$. This decreases the number of learnable parameters in $\theta$ and speeds up the computation of the inverse matrices $S_k^{-1},\Sigma_l^{-1}$.

Inference. Sampling from the conditional and marginal measures $\gamma_\theta(y|x)\approx\gamma^*(y|x)$, $u_\omega\approx\gamma^*_x$ is easy and lightweight since they are explicitly parametrized as Gaussian mixtures, see (12), (11). A sketch of a single training step on (14) is given below.
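Continuing the sketch above (again ours, not the released code), a single stochastic step on the empirical objective (14) with scaled-KL divergences $f_i=\tau_i f_{KL}$, whose conjugates are $f_i^*(t)=\tau_i(\exp(t/\tau_i)-1)$, may look as follows. It reuses the `MixtureParams` module defined earlier; the number of components, learning rate, batch size, and the toy Gaussian minibatches are illustrative stand-ins for real samples from $p$ and $q$.

```python
# One stochastic step on the empirical objective (14).
import math
import torch

def log_mixture(points, log_w, means, log_diag, eps):
    # log of the unnormalized mixture sum_k exp(log_w_k) * N(points | means_k, eps * diag_k)
    d = points.shape[-1]
    diag = log_diag.exp()                                        # (K, d)
    diff = points[:, None, :] - means[None, :, :]                # (B, K, d)
    log_comp = -0.5 * (diff ** 2 / (eps * diag[None])).sum(-1) \
               - 0.5 * (d * math.log(2 * math.pi * eps) + log_diag.sum(-1))[None]
    return torch.logsumexp(log_w[None, :] + log_comp, dim=-1)    # (B,)

def f_conj(t, tau):                                              # (tau * f_KL)^*(t)
    return tau * (torch.exp(t / tau) - 1.0)

eps, tau1, tau2 = 0.05, 1.0, 1.0
v_theta = MixtureParams(K=10, dim=2, eps=eps)                    # parametrizes v_theta in (11)
u_omega = MixtureParams(K=10, dim=2, eps=eps)                    # parametrizes u_omega in (11)
opt = torch.optim.Adam(list(v_theta.parameters()) + list(u_omega.parameters()), lr=1e-2)

x = torch.randn(128, 2)                                          # minibatch from p (toy stand-in)
y = torch.randn(128, 2) + 2.0                                    # minibatch from q (toy stand-in)

log_u = log_mixture(x, u_omega.log_alpha, u_omega.r, u_omega.log_S, eps)
log_v = log_mixture(y, v_theta.log_alpha, v_theta.r, v_theta.log_S, eps)
loss = f_conj(-eps * (log_u - v_theta.log_c(x)) - 0.5 * (x ** 2).sum(-1), tau1).mean() \
     + f_conj(-eps * log_v - 0.5 * (y ** 2).sum(-1), tau2).mean() \
     + eps * u_omega.log_alpha.exp().sum()                       # eps * ||u_omega||_1
opt.zero_grad()
loss.backward()
opt.step()
```

For the $\chi^2$-divergence, one only swaps `f_conj` for the conjugate of $\tau f_{\chi^2}$ quoted in §2.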
4.3 Connection to Related Prior Works

The idea of using a Gaussian mixture parametrization for dual potentials in EOT-related tasks first appeared in the EOT/SB benchmark [29]. There it was used to obtain benchmark pairs of probability measures with a known EOT solution between them. In [41], the authors utilized this type of parametrization to obtain a light solver (LightSB) for the balanced EOT. Our solver for unbalanced EOT (10) subsumes their solver for balanced EOT, just as one problem subsumes the other for the special choice of divergences, see the remark in §2. Let $f_1,f_2$ be the convex indicators of $\{1\}$. Then $f_1^*(t)=t$ and $f_2^*(t)=t$, and objective (10) becomes
$\int_{\mathbb{R}^d}\Big(-\varepsilon\log\frac{u_\omega(x)}{c_\theta(x)}-\frac{\|x\|^2}{2}\Big)p(x)dx+\int_{\mathbb{R}^d}\Big(-\varepsilon\log v_\theta(y)-\frac{\|y\|^2}{2}\Big)q(y)dy+\varepsilon\|u_\omega\|_1=$
$-\varepsilon\Big[\int_{\mathbb{R}^d}\log\frac{u_\omega(x)}{c_\theta(x)}p(x)dx+\int_{\mathbb{R}^d}\log v_\theta(y)q(y)dy-\|u_\omega\|_1\Big]-\underbrace{\Big[\int_{\mathbb{R}^d}\frac{\|x\|^2}{2}p(x)dx+\int_{\mathbb{R}^d}\frac{\|y\|^2}{2}q(y)dy\Big]}_{\stackrel{def}{=}\text{Const}(p,q)}=$
$\varepsilon\Big[\int_{\mathbb{R}^d}\log c_\theta(x)p(x)dx-\int_{\mathbb{R}^d}\log v_\theta(y)q(y)dy\Big]$ (15)
$-\varepsilon\int_{\mathbb{R}^d}\log u_\omega(x)p(x)dx+\varepsilon\|u_\omega\|_1-\text{Const}(p,q).$ (16)
Here line (15) depends exclusively on $\theta$ and coincides with LightSB's objective, see [41, Proposition 8]. At the same time, line (16) depends only on $\omega$, and its minimum is attained when $u_\omega=p$. Thus, this part is not actually needed in the balanced case, see [41, Appendix C].

4.4 Generalization Bounds and Universal Approximation Property

Theoretically, to recover the UEOT plan $\gamma^*$, one needs to solve the problem $\mathcal{L}(\theta,\omega)\to\min_{\theta,\omega}$ as stated in our Theorem 4.1. In practice, the measures $p$ and $q$ are accessible via empirical samples $X\stackrel{def}{=}\{x_1,\dots,x_N\}$ and $Y\stackrel{def}{=}\{y_1,\dots,y_M\}$; thus, one needs to optimize the empirical counterpart $\widehat{\mathcal{L}}(\theta,\omega)$ of $\mathcal{L}(\theta,\omega)$, see (14). Besides, the available potentials $u_\omega,v_\theta$ over which one optimizes the objective come from a restricted class of functions. Specifically, we consider unnormalized Gaussian mixtures $v_\theta,u_\omega$ with $K$ and $L$ components, respectively. Thus, one may naturally wonder: how close is the recovered plan $\gamma_{\widehat{\theta},\widehat{\omega}}$ to the UEOT plan $\gamma^*$, given that $(\widehat{\theta},\widehat{\omega})=\arg\min_{\theta,\omega}\widehat{\mathcal{L}}(\theta,\omega)$? To address this question, we study the generalization error $\mathbb{E}\,D_{KL}(\gamma^*\|\gamma_{\widehat{\theta},\widehat{\omega}})$, i.e., the expectation of $D_{KL}$ between $\gamma^*$ and $\gamma_{\widehat{\theta},\widehat{\omega}}$ taken w.r.t. the random realization of the train datasets $X\sim p$, $Y\sim q$. Let $(\bar{\theta},\bar{\omega})=\arg\min_{(\theta,\omega)\in\Theta\times\Omega}\mathcal{L}(\theta,\omega)$ denote the best approximators of $\mathcal{L}(\theta,\omega)$ in the admissible class. From Theorem 4.1, it follows that $\mathbb{E}\,D_{KL}(\gamma^*\|\gamma_{\widehat{\theta},\widehat{\omega}})$ can be upper bounded using $\mathbb{E}(\mathcal{L}(\widehat{\theta},\widehat{\omega})-\mathcal{L}^*)$. The latter can be decomposed into the estimation and approximation errors:
$\mathbb{E}(\mathcal{L}(\widehat{\theta},\widehat{\omega})-\mathcal{L}^*)=\underbrace{\mathbb{E}[\mathcal{L}(\widehat{\theta},\widehat{\omega})-\mathcal{L}(\bar{\theta},\bar{\omega})]}_{\text{Estimation error}}+\underbrace{[\mathcal{L}(\bar{\theta},\bar{\omega})-\mathcal{L}^*]}_{\text{Approximation error}}.$
We establish a quantitative bound for the estimation error in the proposition below.

Proposition 4.2 (Bound for the estimation error). Let $p,q$ be compactly supported and assume that $f_1^*,f_2^*$ are Lipschitz. Assume that the considered parametric classes $\Theta,\Omega$ ($\ni(\theta,\omega)$) consist of unnormalized Gaussian mixtures with $K$ and $L$ components, respectively, with bounded means $\|r_k\|,\|\mu_l\|\le R$ (for some $R>0$), covariances $sI\preceq S_k,\Sigma_l\preceq SI$ (for some $0<s\le S$) and weights $a\le\alpha_k,\beta_l\le A$ (for some $0<a\le A$). Then the following holds:
$\mathbb{E}\big[\mathcal{L}(\widehat{\theta},\widehat{\omega})-\mathcal{L}(\bar{\theta},\bar{\omega})\big]\le O\Big(\frac{1}{\sqrt{N}}+\frac{1}{\sqrt{M}}\Big),$
where $O(\cdot)$ hides constants depending only on $K,L,R,s,S,a,A,p,q,\varepsilon$ but not on the sample sizes $M,N$.

This proposition allows us to conclude that the estimation error converges to zero at the usual parametric rate as $N$ and $M$ tend to infinity. It remains to clarify the question: can we make the approximation error arbitrarily small? We answer this question positively in the theorem below.

Theorem 4.3 (Gaussian mixture parametrization for the variables provides universal approximation of UEOT plans). Let $p$ and $q$ be compactly supported and assume that $f_1^*,f_2^*$ are Lipschitz. Then for every $\delta>0$ there exist Gaussian mixtures $v_\theta,u_\omega$ (11) with scalar covariances $S_k=\lambda_k I_d\succ 0$, $\Sigma_l=\zeta_l I_d\succ 0$ of their components that satisfy $D_{KL}(\gamma^*\|\gamma_{\theta,\omega})\le\varepsilon^{-1}(\mathcal{L}(\theta,\omega)-\mathcal{L}^*)<\delta$.

In summary, the results of this section show that one can make the generalization error arbitrarily small given a sufficiently large number of samples and components in the Gaussian parametrization. This means that, theoretically, our solver can recover the UEOT plan arbitrarily well.

5 Experiments

In this section, we test our U-LightOT solver on several setups from the related works. The code is written using the PyTorch framework and is publicly available at https://github.com/milenagazdieva/LightUnbalancedOptimalTransport. The experiments are provided as convenient *.ipynb notebooks. Each experiment requires several minutes of training on a CPU with 4 cores. Technical training details are given in Appendix B.

5.1 Example with the Mixture of Gaussians

We provide an illustrative Gaussian mixture example in 2D to demonstrate the ability of our solver to deal with the imbalance of classes in the source and target measures. We follow the experimental setup proposed in [16, Figure 2] and define the probability measures $p,q$ as follows (Fig. 2a):
$p(x)=\tfrac{1}{4}\mathcal{N}(x|(-3,3),0.1\cdot I_2)+\tfrac{3}{4}\mathcal{N}(x|(1,3),0.1\cdot I_2),\qquad q(y)=\tfrac{3}{4}\mathcal{N}(y|(-3,0),0.1\cdot I_2)+\tfrac{1}{4}\mathcal{N}(y|(1,0),0.1\cdot I_2).$
(Figure 2: Conditional plans $\gamma_{\theta,\omega}(y|x)$ learned by our solver in the Gaussian mixture experiment with unbalancedness parameter $\tau\in\{10^0,10^1,10^2\}$; panels: (a) input and target measures, (b) U-LightOT, $\tau=10^2$, (c) U-LightOT, $\tau=10^1$, (d) U-LightOT, $\tau=10^0$. Here $p_\omega$ denotes the normalized first marginal $u_\omega$, i.e., $p_\omega=u_\omega/\|u_\omega\|_1$.)
We test our U-LightOT solver with scaled $D_{KL}$ divergences, i.e., $f_1(t),f_2(t)$ are defined by $\tau f_{KL}(t)=\tau(t\log t-t+1)$ where $\tau>0$. We provide the learned plans for $\tau\in\{1,10^1,10^2\}$; a sketch of sampling from the data-generating mixtures above is given below.
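For completeness, here is a small sketch (ours) of drawing train data from the two mixtures defined above; the helper name `sample_mixture` and the sample sizes are illustrative.

```python
# Toy 2D data for the Sec. 5.1 experiment: imbalanced two-component mixtures.
import numpy as np

def sample_mixture(n, weights, means, var, rng):
    # draw n points from sum_k weights_k * N(means_k, var * I_2)
    idx = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=np.asarray(means)[idx], scale=np.sqrt(var))

rng = np.random.default_rng(0)
x = sample_mixture(1024, [0.25, 0.75], [(-3.0, 3.0), (1.0, 3.0)], 0.1, rng)  # source p
y = sample_mixture(1024, [0.75, 0.25], [(-3.0, 0.0), (1.0, 0.0)], 0.1, rng)  # target q
print(x.shape, y.shape)   # (1024, 2) each
```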
The results in Fig. 2 show that the parameter $\tau$ can be used to control the level of unbalancedness of the learned plans. For $\tau=1$, our U-LightOT solver truly learns unbalanced plans, see Fig. 2d. When $\tau$ increases, the solver no longer transports the mass from the input Gaussians to the closest target ones. Actually, for $\tau=10^2$, our solutions are similar to those of the [41, LightSB] solver, which approximates balanced EOT plans between the measures. Hereafter, we treat $\tau$ as the unbalancedness parameter. In Appendix C, we test the performance of our solver with the $D_{\chi^2}$ divergence.

Remark. Here we conduct all experiments with the entropy regularization parameter $\varepsilon=0.05$. The parameter $\varepsilon$ is responsible for the stochasticity of the learned transport $\gamma_\theta(\cdot|x)$. Since we are mostly interested in the correct transport of the mass (controlled by $f_1,f_2$) rather than the stochasticity, we do not pay much attention to $\varepsilon$ throughout the paper.

5.2 Unpaired Image-to-Image Translation

Another popular testbed usually considered in OT/EOT papers is the unpaired image-to-image translation task [71]. Since our solver uses a parametrization based on Gaussian mixtures, it may be hard to apply U-LightOT for learning the translation directly in the image space. Fortunately, nowadays it is common to use autoencoders [59] for more efficient translation. We follow the setup of [41, Section 5.4] and use the pre-trained ALAE autoencoder [55] for the 1024×1024 FFHQ dataset [34] of human faces. We consider different subsets of the FFHQ dataset (Adult, Young, Man, Woman) and all variants of translations between them: Adult↔Young and Woman↔Man.

Table 2: Number of train FFHQ images for each subset.
| Class | Man | Woman |
|-------|-----|-------|
| Young | 15K | 23K |
| Adult | 7K  | 3.5K |

The main challenge of the described translations is the imbalance of classes in the images from the source and target subsets, see Table 2. Let us consider the Adult→Young translation in more detail. In the FFHQ dataset, the amount of adult men significantly outnumbers the adult women, while the amount of young men is smaller than that of young women. Thus, balanced OT/EOT solvers are expected to translate some of the adult man representatives to young woman ones. At the same time, solvers based on unbalanced transport are supposed to alleviate this issue.

Baselines. We perform a comparison with the recent procedure [16, UOT-FM] which considers roughly the same setup and demonstrates good performance. This method interpolates the results of unbalanced discrete OT to the continuous setup using flow matching [44]. For completeness, we include the comparison with LightSB and balanced optimal transport-based flow matching (OT-FM) [44] to demonstrate the issues of the balanced solvers. We also consider neural network based solvers relying on adversarial learning, such as UOT-SD [9] and UOT-GAN [70].

(Figure 3: Visualization of pairs of accuracies (keep-target) for our U-LightOT solver and other OT/EOT methods in the image translation experiment. The values of the unbalancedness parameters for our U-LightOT solver ($\tau$) and [16, UOT-FM] ($\lambda$ = reg_m) are specified directly on the plots.)

Table 3: Comparison of wall-clock running times of unbalanced OT/EOT solvers in the Young→Adult translation. The best result is in bold, the second best is underlined.
| Experiment | [16, UOT-FM] | [9, UOT-SD] | [70, UOT-GAN] | U-LightOT (ours) |
|------------|--------------|-------------|---------------|------------------|
| Time (min:sec) | 03:21 | 18:11 | 16:30 | 02:38 |

Metrics. We aim to assess the ability of the solvers to perform the translation of latent codes while keeping the class of the input images unchanged, e.g., keeping the gender in the Adult→Young translation. Thus, we train an MLP gender classifier (99% accuracy) on the latent codes of images. We use it to compute the accuracy of preserving the gender during the translation; we denote this number by accuracy (keep) and sketch its computation below.
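A hedged sketch (ours) of this metric follows; `gender_clf`, `src_latents`, and `translate` are hypothetical placeholders for the pre-trained latent-space MLP classifier, the test latent codes of the source class, and the learned map $x\mapsto y\sim\gamma_\theta(\cdot|x)$, respectively.

```python
# Fraction of translated latents whose predicted attribute matches the input's.
import torch

@torch.no_grad()
def accuracy_keep(gender_clf, src_latents, translate):
    labels_in = gender_clf(src_latents).argmax(dim=-1)             # attribute before translation
    labels_out = gender_clf(translate(src_latents)).argmax(dim=-1) # attribute after translation
    return (labels_in == labels_out).float().mean().item()
```

The accuracy (target) metric described next is computed the same way, with a domain classifier in place of the attribute classifier.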
Meanwhile, it is also important to ensure that the generated images belong to the distribution of the target images rather than the source ones, e.g., belong to the distribution of young people in our example. To monitor this property, we use another 99% MLP classifier to identify whether each of the generated images belong to the target domain or to the source one. Then we calculate the fraction of the generated images belonging to the target domain which we denote as accuracy (target). Results. Unbalanced OT/EOT approaches are equipped with some kind of unbalancedness parameter (like τ in our solver) which influences the methods results: with the increase of unbalancedness, accuracy of keeping the class increases but the accuracy of mapping to the target decreases. The latter is because of the relaxation of the marginal constraints in UEOT. For a fair comparison, we aim to compute the trade-offs between the keep/target accuracies for different values of the unbalancedness. We do this for our solver and UOT-FM. We found that GAN-based unbalanced approaches show unstable behaviour, so we report UOT-SD and UOT-GAN s results only for one parameter value. For convenience, we visualize the accuracies pairs for our solver and its competitors in Fig. 3. The evaluation shows that our U-Light OT method can effectively solve translation tasks in high dimensions (d = 512) and outperforms its alternatives in dealing with class imbalance issues. Namely, we see that our solver allows for achieving the best accuracy of keeping the attributes of the input images than other methods while it provides good accuracy of mapping to the target class. While balanced methods and some of the unbalanced ones provide really high accuracy of mapping to the target, their corresponding accuracies of keeping the attributes are worse than ours meaning that we are better in dealing with class imbalance issues. As expected, for large parameter τ, the results of our U-Light OT solver coincide with those for Light SB which is a balanced solver. Meanwhile, our solver has the lowest wall-clock running time among the existing unbalanced solvers, see Table 3 for comparison. We demonstrate qualitative results of our solver and baselines in Fig. 4. The choice of unbalancedness parameter for visualization of our method and UOT-FM is detailed in Appendix B. We present the results of the quantitative comparison in the form of tables in Appendix D.1. In Appendix C, we perform the ablation study of our solver focusing on the selection of parameters τ, ε and number of Gaussian mixtures components. 6 Discussion Potential impact. Our light and unbalanced solver has a lot of advantages in comparison with the other existing UEOT solvers. First, it does not require complex max-min optimization. Second, it provides the closed form of the conditional measures γθ(y|x) γ (y|x) of the UEOT plan. (a) Young Adult (b) Adult Young (c) Man Woman (d) Woman Man Figure 4: Unpaired translation with Light SB, OT-FM, UOT-FM and our U-Light OT solvers applied in the latent space of ALAE [55] for FFHQ images [34] (1024 1024). Moreover, it allows for sampling both from the conditional measure γθ(y|x) and marginal measure uω(x) γ x(x). Besides, the decisive superiority of our lightweight and unbalanced solver is its simplicity and convenience of use. Indeed, it has a straightforward and non-minimax optimization objective and avoids heavy neural parametrization. As a result, our lightweight and unbalanced solver converges in minutes on CPU. 
We expect that these advantages could boost the usage of our solver as a standard and easy baseline for UEOT task with applications in different spheres. The limitations and broader impact of our solver are discussed in Appendix E. ACKNOWLEDGEMENTS. The work of Skoltech was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). We thank Kirill Sokolov, Mikhail Persiianov and Petr Mokrov for providing valuable feedback and suggestions for improving the proofs and clarity of our paper. [1] R. Agrawal and T. Horel. Optimal bounds between f-divergences and integral probability metrics. Journal of Machine Learning Research, 22(128):1 59, 2021. [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017. [3] A. Asadulaev, A. Korotin, V. Egiazarian, and E. Burnaev. Neural optimal transport with general cost functionals. In The Twelfth International Conference on Learning Representations, 2024. [4] Y. Balaji, R. Chellappa, and S. Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33: 12934 12944, 2020. [5] C. Bunne, S. G. Stark, G. Gut, J. S. Del Castillo, M. Levesque, K.-V. Lehmann, L. Pelkmans, A. Krause, and G. Rätsch. Learning single-cell perturbation responses using neural optimal transport. Nature Methods, 20(11):1759 1768, 2023. [6] L. Chapel, R. Flamary, H. Wu, C. Févotte, and G. Gasso. Unbalanced optimal transport through non-negative penalized linear regression. Advances in Neural Information Processing Systems, 34:23270 23282, 2021. [7] T. Chen, G.-H. Liu, and E. Theodorou. Likelihood training of schrödinger bridge using forwardbackward sdes theory. In International Conference on Learning Representations, 2021. [8] L. Chizat. Unbalanced optimal transport: Models, numerical methods, applications. Ph D thesis, Université Paris sciences et lettres, 2017. [9] J. Choi, J. Choi, and M. Kang. Generative modeling through the semi-dual formulation of unbalanced optimal transport. In Advances in Neural Information Processing Systems, volume 36, 2024. [10] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292 2300, 2013. [11] M. Daniels, T. Maunu, and P. Hand. Score-based generative neural networks for large-scale optimal transport. Advances in neural information processing systems, 34:12955 12965, 2021. [12] Q. Dao, B. Ta, T. Pham, and A. Tran. Robust diffusion gan using semi-unbalanced optimal transport. ar Xiv preprint ar Xiv:2311.17101, 2023. [13] V. De Bortoli, J. Thornton, J. Heng, and A. Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. [14] N. Deb, P. Ghosal, and B. Sen. Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections. Advances in Neural Information Processing Systems, 34: 29736 29753, 2021. [15] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 10760 10770, 2018. [16] L. Eyring, D. Klein, T. Uscidda, G. Palla, N. Kilbertus, Z. Akata, and F. Theis. 
Unbalancedness in neural monge maps improves unpaired domain translation. In The Twelfth International Conference on Learning Representations, 2024. [17] J. Fan, A. Taghvaei, and Y. Chen. Scalable computations of Wasserstein barycenter via input convex neural networks. ar Xiv preprint ar Xiv:2007.04462, 2020. [18] J. Fan, S. Liu, S. Ma, H.-M. Zhou, and Y. Chen. Neural monge map estimation and its applications. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=2m ZSl Qscj3. Featured Certification. [19] K. Fatras, Y. Zine, R. Flamary, R. Gribonval, and N. Courty. Learning with minibatch wasserstein: asymptotic and gradient properties. In AISTATS 2020-23nd International Conference on Artificial Intelligence and Statistics, volume 108, pages 1 20, 2020. [20] K. Fatras, T. Séjourné, R. Flamary, and N. Courty. Unbalanced minibatch optimal transport; applications to domain adaptation. In International Conference on Machine Learning, pages 3186 3197. PMLR, 2021. [21] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, et al. Pot: Python optimal transport. The Journal of Machine Learning Research, 22(1):3571 3578, 2021. [22] M. Gazdieva, L. Rout, A. Korotin, A. Kravchenko, A. Filippov, and E. Burnaev. An optimal transport perspective on unpaired image super-resolution. ar Xiv preprint ar Xiv:2202.01116, 2022. [23] M. Gazdieva, A. Korotin, D. Selikhanovych, and E. Burnaev. Extremal domain translation with neural optimal transport. In Advances in Neural Information Processing Systems, volume 36, 2023. [24] A. Genevay. Entropy-Regularized Optimal Transport for Machine Learning. Theses, PSL University, Mar. 2019. URL https://theses.hal.science/tel-02319318. [25] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608 1617. PMLR, 2018. [26] N. Gozlan, C. Roberto, P.-M. Samson, and P. Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327 3405, 2017. [27] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767 5777, 2017. [28] N. Gushchin, A. Kolesov, A. Korotin, D. Vetrov, and E. Burnaev. Entropic neural optimal transport via diffusion processes. In Advances in Neural Information Processing Systems, 2023. [29] N. Gushchin, A. Kolesov, P. Mokrov, P. Karpikova, A. Spiridonov, E. Burnaev, and A. Korotin. Building the bridge of schr\" odinger: A continuous entropic optimal transport benchmark. ar Xiv preprint ar Xiv:2306.10161, 2023. [30] N. Gushchin, S. Kholkin, E. Burnaev, and A. Korotin. Light and optimal schr\"odinger bridge matching. In ar Xiv preprint ar Xiv:2402.03207, 2024. [31] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge university press, 1952. [32] J.-C. Hütter and P. Rigollet. Minimax estimation of smooth optimal transport maps. 2021. [33] H. Janati, B. Muzellec, G. Peyré, and M. Cuturi. Entropic optimal transport between unbalanced gaussian measures has a closed form. Advances in neural information processing systems, 33: 10468 10479, 2020. [34] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401 4410, 2019. [35] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. [36] D. Klein, T. Uscidda, F. Theis, and M. Cuturi. Generative entropic neural optimal transport to map within and across spaces. ar Xiv preprint ar Xiv:2310.09254, 2023. [37] A. Korotin, L. Li, A. Genevay, J. M. Solomon, A. Filippov, and E. Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. Advances in Neural Information Processing Systems, 34:14593 14605, 2021. [38] A. Korotin, L. Li, J. Solomon, and E. Burnaev. Continuous wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=3t FAs5E-Pe. [39] A. Korotin, A. Kolesov, and E. Burnaev. Kantorovich strikes back! wasserstein gans are not optimal transport? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. [40] A. Korotin, D. Selikhanovych, and E. Burnaev. Kernel neural optimal transport. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=Zuc_MHt Uma4. [41] A. Korotin, N. Gushchin, and E. Burnaev. Light schr\" odinger bridge. In The Twelfth International Conference on Learning Representations, 2024. [42] T. Koshizuka and I. Sato. Neural lagrangian schr\"{o} dinger bridge: Diffusion modeling for population dynamics. In The Eleventh International Conference on Learning Representations, 2022. [43] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new hellinger kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969 1117, 2018. [44] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. [45] F. Lübeck, C. Bunne, G. Gut, J. S. del Castillo, L. Pelkmans, and D. Alvarez-Melis. Neural unbalanced optimal transport via cycle-consistent semi-couplings. ar Xiv preprint ar Xiv:2209.15621, 2022. [46] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? a large-scale study. In Advances in neural information processing systems, pages 700 709, 2018. [47] A. Makkuva, A. Taghvaei, S. Oh, and J. Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning, pages 6672 6681. PMLR, 2020. [48] T. Manole, S. Balakrishnan, J. Niles-Weed, and L. Wasserman. Plugin estimation of smooth optimal transport maps. ar Xiv preprint ar Xiv:2107.12364, 2021. [49] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018. [50] P. Mokrov, A. Korotin, and E. Burnaev. Energy-guided entropic neural optimal transport. In The Twelfth International Conference on Learning Representations, 2024. [51] Q. M. Nguyen, H. H. Nguyen, Y. Zhou, and L. M. Nguyen. On unbalanced optimal transport: Gradient methods, sparsity and approximation error. ar Xiv preprint ar Xiv:2202.03618, 2022. [52] T. T. Nguyen, H. D. Nguyen, F. Chamroukhi, and G. J. Mc Lachlan. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7(1):1750861, 2020. [53] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. [54] K. Pham, K. Le, N. Ho, T. Pham, and H. Bui. 
On unbalanced optimal transport: An analysis of sinkhorn algorithm. In International Conference on Machine Learning, pages 7673 7682. PMLR, 2020. [55] S. Pidhorskyi, D. A. Adjeroh, and G. Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104 14113, 2020. [56] A.-A. Pooladian and J. Niles-Weed. Entropic estimation of optimal transport maps. ar Xiv preprint ar Xiv:2109.12004, 2021. [57] P. Rigollet and A. J. Stromme. On the sample complexity of entropic optimal transport. ar Xiv preprint ar Xiv:2206.13472, 2022. [58] R. Rockafellar. Duality and stability in extremum problems involving convex functions. Pacific Journal of Mathematics, 21(1):167 187, 1967. [59] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684 10695, 2022. [60] L. Rout, A. Korotin, and E. Burnaev. Generative modeling with optimal transport maps. In International Conference on Learning Representations, 2022. [61] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018. [62] T. Séjourné, G. Peyré, and F.-X. Vialard. Unbalanced optimal transport, from theory to numerics. ar Xiv preprint ar Xiv:2211.08775, 2022. [63] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [64] A. Taghvaei and A. Jalali. 2-Wasserstein approximation via restricted convex potentials with application to improved training for GANs. ar Xiv preprint ar Xiv:1902.07197, 2019. [65] A. Tong, J. Huang, G. Wolf, D. Van Dijk, and S. Krishnaswamy. Trajectorynet: A dynamic optimal transport network for modeling cellular dynamics. In International conference on machine learning, pages 9526 9536. PMLR, 2020. [66] F. Vargas, P. Thodoroff, A. Lamacraft, and N. Lawrence. Solving schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021. [67] G. Wang, Y. Jiao, Q. Xu, Y. Wang, and C. Yang. Deep generative learning via schrödinger bridge. In International Conference on Machine Learning, pages 10794 10804. PMLR, 2021. [68] Y. Xie, M. Chen, H. Jiang, T. Zhao, and H. Zha. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, pages 6882 6892. PMLR, 2019. [69] Y. Xie, Y. Luo, and X. Huo. An accelerated stochastic algorithm for solving the optimal transport problem. ar Xiv preprint ar Xiv:2203.00813, 2022. [70] K. D. Yang and C. Uhler. Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations, 2018. [71] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. Advances in neural information processing systems, 30, 2017. A.1 Proof of Theorem 4.1 Proof. For convenience, we split this proof into several steps. Step 1. To begin with, we recall that the primal UEOT problem (3) attains a minimizer γ . 
Then by substituting γ directly in (3), we get 2 γ (x, y)dxdy εH(γ )+ Df1 (γ x p) + Df2 γ y q = Z 2 γ (x, y)dxdy + ε Z Rd γ (x, y)log γ (x, y)dxdy ε γ 1 + Z γ x(x) p(x) γ y(y) q(y) q(y)dy = ε γ 1 + Z 2 γ (x, y)dxdy + ε Z Rd γ (x, y) ε 1 ϕ (x)+ψ (y) x y 2 | {z } log γ (x,y)= dxdy + (18) γ x(x) p(x) γ y(y) q(y) q(y)dy = ε γ 1 + Rdγ x(x)ϕ (x)dx+ Z Rd γ y(y)ψ (y)dy+ Z γ x(x) p(x) γ y(y) q(y) γ x(x) p(x) +ϕ (x)γ x(x) p(x) γ y(y) q(y) +ψ (y)γ y(y) q(y) q(y)dy. (19) Here in line (18), we use equation (5) which establishes the connection between γ and the optimal potentials ϕ , ψ delivering maximum to the dual UEOT problem (4). Substituting these optimal potentials to the dual problem, we get ε(ϕ (x)+ψ (y) x y 2 2 )}dxdy | {z } = γ 1 Rdf 1( ϕ (x))p(x)dx Rdf 2( ψ (y))q(y)dy = ε γ 1 Z Rdf 1( ϕ (x))p(x)dx Z Rdf 2( ψ (y))q(y)dy. (20) Note that (19) equals to (20) due to the equivalence between primal and dual UEOT problems, see M2. We will use this fact in further derivations. For every function f, its convex conjugate f(t) = sup u R {ut f(u)} = f( t) = sup u R { ut f(u)} = inf u R{ut + f(u)} = f( t) ut + f(u) u R. (21) Then for arbitrary x Rd, taking f = f1, u = γ x(x) p(x) , t = ϕ (x), we get f1( ϕ (x)) γ x(x) p(x) ϕ (x) + f1 γ x(x) p(x) Rd f1(ϕ (x))p(x)dx Z γ x(x) p(x) ϕ (x) + f1 γ x(x) p(x) p(x)dx. (23) Similarly, for arbitrary y Rd Rd f2(ψ (y))q(y)dy Z γ y(y) q(y) ψ (y) + f2 γ y(y) q(y) q(y)dy. (24) Since (19)=(20), inequalities (22), (23), (24) actually turn to equalities. Step 2. Since we consider convex, lower semi-continuous functions f, it holds that f = f. Then, substituting f in formula (21), we derive f( t) = f( t) = inf u R{ut + f(u)} = f( t) ut + f(u) u R. (25) Thus, for arbitrary x Rd, taking f = f1, t = γ x(x) p(x) , we get that f1 γ x(x) p(x) u γ x(x) p(x) + f1( u) u R. At the same time, from step 1, we know that f1 γ x(x) p(x) = γ x(x) p(x) ϕ (x) + f1( ϕ (x)), see (22). Then γ x(x) p(x) ϕ (x) + f1( ϕ (x)) uγ x(x) p(x) + f1( u) u R = γ x(x) p(x) (ϕ (x) u) (f1( u) f1( ϕ (x)) u R. (26) Similarly, for arbitrary y Rd γ y(y) q(y) (ψ (y) u) (f2( u) f2( ψ (y)) u R. (27) Step 3. Now we are ready to prove the main result. We note that: DKL (γ γθ,ω)= Z Rd Rd γ (x, y) log γ (x, y)dxdy Z Rd Rdγ (x, y) log γθ,ω(x, y)dxdy + γθ,ω 1 γ 1. (28) While the ground-truth UEOT plan γ (x, y) and the optimal dual variables ϕ (x), ψ (y) are connected via equation (5), our parametrized plan γθ,ω(x, y) can be expressed using ϕθ,ω(x), ψθ(y) as γθ,ω(x, y) = exp{ε 1(ϕθ,ω(x) + ψθ(y) x y 2/2)}, see (9). Then (28) = ε 1 Z Rd Rdγ (x, y) ϕ (x)+ψ (y) x y 2/2 dxdy Rd Rdγ (x, y) ϕθ,ω(x)+ψθ(y) x y 2/2 dxdy+ γθ,ω 1 γ 1 = Rd Rd γ (x, y) ϕ (x) + ψ (y) ϕθ,ω(x) ψθ(y) dxdy + γθ,ω 1 γ 1 = Rd γ x(x) p(x) ϕ (x) ϕθ,ω(x) p(x)dx + Rd γ y(y) q(y) ψ (y) ψθ(y) q(y)dy + γθ,ω 1 γ 1 (26) (27) Rd(f 1( ϕθ,ω(x)) f 1( ϕ (x)))p(x)dx+ Rd(f 2( ψθ(y)) f 2( ψ (y)))q(y)dy+ γθ,ω 1 γ 1 = Rdf 1( ϕθ,ω(x))p(x)dx+ Z Rdf 2( ψθ(y))q(y)dy+ε γθ,ω 1 Rdf 1( ϕ (x))p(x)dx+ Z Rdf 2( ψ (y))q(y)dy+ε γ 1) | {z } ε 1(L(θ, ω) L ). A.2 Proof of Proposition 4.2 We begin with proving the auxiliary theoretical results (Propositions A.1, A.2) which are needed to prove the main proposition of this section. Proposition A.1 (Rademacher bound on the estimation error). It holds that E L(bθ) L(θ) 4RN(F1, p) + 4RM(F2, q), where F1 = {f 1( ϕθ,ω)|(θ, ω) Θ Ω}, F2 = {f 2( ψθ)|θ Θ} for ϕθ,ω(x) = ε log uω(x) 2 , ψθ(y) = ε log vθ(y) + y 2 2 , and RN(F1, p), RM(F2, q) denote the Rademacher complexity [63, M26] of the functional classes U, V w.r.t. 
to the sample sizes N, M of distributions p, q. Proof of Proposition A.1. The derivation of this fact is absolutely analogous to [41, Proposition B.1], [50, Theorem 4] or [64, Theorem 3.4]. Proposition A.2 (Bound on the Rademacher complexity of the considered classes). Let 0 < a A, let 0 < u U, let 0 < w W and V > 0. Consider the class of functions F1 = x 7 f 1( ε log uω(x) + ε log cθ(x) x 2 F2 = y 7 f 2( ε log vθ(y) y 2 uω(x), vθ(y), cθ(x) belong to the class k=1 αk exp x T Ukx + v T k x + wk) with (29) u I Uk = U T k UI; vk V ; w wk W; a αk A . Following [41, Proposition B.2], we call the functions of the class V as constrained log-sum-exp quadratic functions. We assume that f 1, f 2 are Lipshitz functions and measures p, q are compactly supported with the supports lying in a zero-centered ball of a radius R > 0. Then RN(F1, p) C0 N , RM(F2, q) C1 where the constants C0, C1 do not depend on sizes N, M of the empirical samples from p, q. Proof of Proposition A.2. Thanks to [41, Proposition B.2], the Rademacher complexities of constrained log-sum-exp quadratic functions x 7 log uω(x), x 7 log cθ(x) and y 7 log vθ(y) are known to be bounded by O( 1 N ) or O( 1 M ) respectively. According to the definition of Rademacher complexity, for single quadratic functions x 7 x T x 2 (y 7 y T y 2 ) it is just equal to zero. Then, using the well-known scaling and additivity properties of the Rademacher complexity [63], we get that x 7 ε log uω(x) + ε log cθ(x) x 2 2 and y 7 ε log vθ(y) y 2 2 are bounded by O( 1 M ) respectively. The remaining step is to recall that f 1(x) and f 2(y) are Lipschitz. Therefore, according to Talagrand s contraction principle [49], the Rademaher complexities of F1 and F2 are also bounded by O( 1 N ) and O( 1 M ), respectively. Proof of Proposition 4.2. The proof directly follows from Propositions A.1 and A.2. A.3 Proof of Theorem 4.3 To begin with, we provide a quick reminder about the Fenchel-Rockafellar theorem which is needed to derive the dual form of problem (3). Theorem A.3 (Fenchel-Rockafellar [58]). Let (E, E ) and (F, F ) be two couples of topologically paired spaces. Let A : E 7 F be a continuous linear operator and A : F 7 E be its adjoint. Let f and g be lower semi-continuous and proper convex functions defined on E and F respectively. If there exists x domf s.t. g is continuous at Ax, then sup x E f( x) g(Ax) = min y F f(A y) + g(y) and the min is attained. Moreover, if there exists a maximizer x E then there exists y F satisfying Ax g(y) and Ay f( x). We note that in the below theorem C2(Rd) does not denote the space of twice differentiable continuous functions. The exact definition of this space and C2,b(Rd) is given in notations part. Theorem A.4 (Dual form of problem (3)). The primal UEOT problem (3) has the dual counterpart (4) where the potentials (ϕ, ψ) belong to the space C2,b(Rd) C2,b(Rd). The minimum of (3) is attained for a unique γ M2,+(Rd Rd). Proof. We recall that in the primal form, the minimization is performed over functions γ belonging to M2,+(Rd Rd). In this proof, we suppose that this space is endowed with a coarsest topology σ(M2,+(Rd Rd)) which makes continuous the linear functionals γ 7 R ζdγ, ζ C2(Rd Rd). Then the topological space M2,+(Rd Rd), σ(M2,+(Rd Rd)) has a topological dual M2,+(Rd Rd), σ(M2,+(Rd Rd)) which, actually, is (linear) isomorphic to the space C2(Rd Rd), see [26, Lemma 9.9]. This fact opens an opportunity to apply the well-celebrated Fenchel- Rockafellar theorem. 
For this purpose, we will consider the following spaces: E def = C2(Rd) C2(Rd), F def = C2(Rd Rd) and their duals E def = M2,+(Rd) M2,+(Rd) and F def = M2,+(Rd Rd). Step 1. Recall that the convex conjugate of any function g : M2,+(Rd Rd) R {+ } is defined for each ζ (M2,+(Rd Rd)) = C2(Rd Rd) as g(ζ) = supγ M2,+(Rd Rd){ γ, ζ g(γ)}. For the convenience of further derivations, we introduce additional functionals corresponding to the summands in the primal UEOT problem (3): P(γ) def = Z 2 γ(x, y)dxdy εH(γ); F1(γx) def = Df1 (γx p) ; F2(γy) def = Df2 (γy q) . (30) For our purposes, we need to calculate the convex conjugates of these functionals. Fortunately, convex conjugates of f-divergences F1(γx) and F2(γy) are well-known, see [1, Proposition 23], and equal to F1(ϕ) def = Z Rd f1(ϕ(x))p(x)dx, F2(ψ) def = Z Rd f2(ψ(y))q(y)dy. To proceed, we calculate the convex conjugate of P(γ): 2 γ(x, y)dxdy εH(γ)= = sup γ M2,+(Rd Rd) Rd ζ(x, y)γ(x, y)dxdy Z 2 γ(x, y)dxdy sup γ M2,+(Rd Rd) ζ(x, y) x y 2 2 γ(x, y)dxdy + Rd γ(x, y) log γ(x, y)dxdy + γ 1 | {z } =H(γ) sup γ M2,+(Rd Rd) ε n Z Rd γ(x, y) ζ(x, y) x y 2 2 ε log γ(x, y) dxdy + γ 1 o = sup γ M2,+(Rd Rd) ( ε) n Z Rd γ(x, y) log γ(x, y) ζ(x, y) x y 2 2 ε dxdy γ 1 o = sup γ M2,+(Rd Rd) ( ε) n Z Rd γ(x, y) log γ(x, y) ζ(x, y) x y 2 2 ε dxdy γ 1 + Rd exp{ζ(x, y) x y 2 2 ε }dxdy Z Rd exp{ζ(x, y) x y 2 2 ε }dxdy | {z } =0 sup γ M2,+(Rd Rd) ( ε) γ exp{ζ(x, y) x y 2 Rd exp{ζ(x, y) x y 2 Here in the transition from (31) to (32), we keep in mind our prior calculations of DKL in (28). Recall that DKL is non-negative and attains zero at the unique point γ(x, y) = exp{ζ(x, y) x y 2 2 ε }. (33) Thus, we get Rd exp{ζ(x, y) x y 2 2 ε }dxdy. (34) Step 2. Now we are ready to apply the Fenchel-Rockafellar theorem in our case. To begin with, we show that this theorem is applicable to problem (3), i.e., that the functions under consideration satisfy the necessary conditions. Indeed, it is known that the convex conjugate of any functional (e.g., F1( ), F2( ), P( )) is lower semi-continuous and convex. Besides, the listed functionals are proper convex2. Indeed, the properness of F1( ) and F2( ) follows from the fact that f-divergences are known to be lower-semicontinuous and proper themselves, while properness of P( ) is evident from (32). Now we consider the linear operator A : C2(Rd) C2(Rd) 7 C2(Rd Rd) which is defined as A(ϕ, ψ) : (x, y) 7 ϕ(x) + ψ(y). It is continuous, and its adjoint is defined on M2,+(Rd Rd) as A(γ) = (γx, γy). Indeed, A(γ), (u, v) = γ, A(u, v) = R Rd γ(x, y)(u(x) + v(y))dxdy = R Rd γx(x)u(x)dx + R Rd γy(y)v(y)dy. Thus, the strong duality and the existence of minimizer for (3) follows from the Fenchel-Rockafellar theorem which states that problems sup (ϕ,ψ) C2(Rd) C2(Rd) { P(A(ϕ, ψ)) F1( ϕ) F2( ψ)} (35) min γ M2,+(Rd Rd){P(γ) + F1(γx) + F2(γy)} (36) are equal. The uniqueness of the minimizer for (3) comes from the strict convexity of P( ) (which holds thanks to the entropy term). Note that the conjugate of the sum of F1 and F2 is equal to the sum of their conjugates since they are defined for separate non-intersecting groups of parameters. Next we prove that the supremum can be restricted to C2,b(Rd Rd). Here we use to denote the operation of taking minimum between the function f : Rd R and real value k: (f k)(x) def = min(f(x), k). Then analogously to [26, Theorem 9.6], we get: 2Proper convex function is a real-valued convex function which has a non-empty domain, never attains the value ( ) and is not identically equal to (+ ). 
Next we prove that the supremum can be restricted to $C_{2,b}(\mathbb{R}^d) \times C_{2,b}(\mathbb{R}^d)$. Here we use $\wedge$ to denote the operation of taking the minimum between a function $f : \mathbb{R}^d \to \mathbb{R}$ and a real value k: $(f \wedge k)(x) \stackrel{def}{=} \min(f(x), k)$. Then analogously to [26, Theorem 9.6], we get:
$$\sup_{(\phi,\psi) \in C_2(\mathbb{R}^d) \times C_2(\mathbb{R}^d)}\big\{{-P^{*}(A(\phi,\psi))} - F_1^{*}(-\phi) - F_2^{*}(-\psi)\big\} = \sup_{(\phi,\psi) \in C_2(\mathbb{R}^d) \times C_2(\mathbb{R}^d)}\lim_{k_1,k_2\to\infty}\big\{{-P^{*}(A(\phi\wedge k_1, \psi\wedge k_2))} - F_1^{*}(-(\phi\wedge k_1)) - F_2^{*}(-(\psi\wedge k_2))\big\}$$
$$\leq \sup_{(\phi,\psi) \in C_{2,b}(\mathbb{R}^d) \times C_{2,b}(\mathbb{R}^d)}\big\{{-P^{*}(A(\phi,\psi))} - F_1^{*}(-\phi) - F_2^{*}(-\psi)\big\}. \tag{37}$$
Since the other inequality is obvious, the two quantities are equal, which completes the proof. Proof of Theorem 4.3. Our aim is to prove that for all δ > 0 there exist unnormalized Gaussian mixtures $u_\omega$ and $v_\theta$ s.t. $\mathcal{L}(\theta,\omega) - \mathcal{L}^{*} < \delta\varepsilon$. We define
$$\mathcal{J}(\phi,\psi) \stackrel{def}{=} \varepsilon\int \exp\Big\{\frac{\phi(x) + \psi(y) - \frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy + \int_{\mathbb{R}^d} f_1^{*}(-\phi(x))p(x)dx + \int_{\mathbb{R}^d} f_2^{*}(-\psi(y))q(y)dy.$$
Then from (4) and Theorem A.4, it follows that $\mathcal{L}^{*} = \inf_{(\phi,\psi) \in C_{2,b}(\mathbb{R}^d) \times C_{2,b}(\mathbb{R}^d)} \mathcal{J}(\phi,\psi)$. Finally, using the definition of the infimum, we get that for all δ > 0 there exist some functions $(\hat{\phi},\hat{\psi}) \in C_{2,b}(\mathbb{R}^d) \times C_{2,b}(\mathbb{R}^d)$ such that $\mathcal{J}(\hat{\phi},\hat{\psi}) < \mathcal{L}^{*} + \delta$. Step 1. Since $\hat{\phi}, \hat{\psi} \in C_{2,b}(\mathbb{R}^d)$, they are bounded from above by some constants $\hat{a}$ and $\hat{b}$, respectively. Consider zero-centered balls of radius R > 0 which contain the supports of p and q respectively. Then we define
$$\tilde{\phi}(x) \stackrel{def}{=} \hat{\phi}(x) - \max\{0, \max\{\|x\|^2 - R^2, \|x\|^4 - R^4\}\} \leq \hat{\phi}(x) \leq \hat{a}; \qquad \tilde{\psi}(y) \stackrel{def}{=} \hat{\psi}(y) - \max\{0, \|y\|^2 - R^2\} \leq \hat{\psi}(y) \leq \hat{b}.$$
We get that $\tilde{\phi}(x) \leq \hat{\phi}(x)$ and $\tilde{\psi}(y) \leq \hat{\psi}(y)$ imply
$$\int \exp\Big\{\frac{\tilde{\phi}(x) + \tilde{\psi}(y) - \frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy \leq \int \exp\Big\{\frac{\hat{\phi}(x) + \hat{\psi}(y) - \frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy. \tag{38}$$
Importantly, for all x and y within the supports of p and q it holds that $\tilde{\phi}(x) = \hat{\phi}(x)$ and $\tilde{\psi}(y) = \hat{\psi}(y)$, respectively. Then
$$\int_{\mathbb{R}^d} f_1^{*}(-\hat{\phi}(x))p(x)dx = \int_{\mathbb{R}^d} f_1^{*}(-\tilde{\phi}(x))p(x)dx, \qquad \int_{\mathbb{R}^d} f_2^{*}(-\hat{\psi}(y))q(y)dy = \int_{\mathbb{R}^d} f_2^{*}(-\tilde{\psi}(y))q(y)dy. \tag{39}$$
Combining (38) and (39), we get that $\mathcal{J}(\tilde{\phi},\tilde{\psi}) \leq \mathcal{J}(\hat{\phi},\hat{\psi}) < \mathcal{L}^{*} + \delta$. Before moving on, we note that the functions $\exp\{\tilde{\phi}(x)/\varepsilon\}$ and $\exp\{\tilde{\psi}(y)/\varepsilon\}$ are continuous and non-negative. Therefore, since the measures p and q are compactly supported, there exist some constants $e_{min}, h_{min} > 0$ such that $\exp\{\tilde{\phi}(x)/\varepsilon\} > e_{min}$ and $\exp\{\tilde{\psi}(y)/\varepsilon\} > h_{min}$ for all x and y from the supports of the measures p and q, respectively. We keep these constants for future steps. Step 2. This step of our proof is similar to [41, Theorem 3.4]. We get that
$$\exp\{\tilde{\phi}(x)/\varepsilon\} \leq \exp\Big\{\frac{\hat{a} - \max\{0, \|x\|^2 - R^2\}}{\varepsilon}\Big\} \leq \exp\Big\{\frac{\hat{a} + R^2}{\varepsilon}\Big\}\exp(-\|x\|^2/\varepsilon), \tag{40}$$
$$\exp\{\tilde{\psi}(y)/\varepsilon\} \leq \exp\Big\{\frac{\hat{b} - \max\{0, \|y\|^2 - R^2\}}{\varepsilon}\Big\} \leq \exp\Big\{\frac{\hat{b} + R^2}{\varepsilon}\Big\}\exp(-\|y\|^2/\varepsilon).$$
From this we deduce that $y \mapsto \exp(\tilde{\psi}(y)/\varepsilon)$ is a normalizable density since it is bounded by an unnormalized Gaussian density. Moreover, we see that it vanishes at infinity. Thus, using the result [52, Theorem 5a], we get that for all δ' > 0 there exists an unnormalized Gaussian mixture $\tilde{v}_\theta = \tilde{v}_\theta(y)$ such that
$$\|\tilde{v}_\theta - \exp(\tilde{\psi}/\varepsilon)\|_{\infty} = \sup_{y \in \mathbb{R}^d}|\tilde{v}_\theta(y) - \exp(\tilde{\psi}(y)/\varepsilon)| < \delta'. \tag{41}$$
Following the mentioned theorem, we can set all the covariances in $\tilde{v}_\theta$ to be scalar, i.e., define $\tilde{v}_\theta(y) = \sum_{k=1}^{K}\tilde{\alpha}_k\mathcal{N}(y|\tilde{r}_k, \varepsilon\tilde{\lambda}_k I_d)$ for some K and $\tilde{\alpha}_k \in \mathbb{R}_{+}$, $\tilde{r}_k \in \mathbb{R}^d$, $\tilde{\lambda}_k \in \mathbb{R}_{+}$ ($k \in \{1,\dots,K\}$). For our future needs, we pick δ' small enough so that
$$\delta'\Big[\frac{L_1\varepsilon}{e_{min}} + \frac{L_2\varepsilon}{h_{min}} + \varepsilon(2\pi\varepsilon)^{\frac{d}{2}}\Big(1 + (\pi\varepsilon)^{\frac{d}{2}}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}\Big)\Big] \leq \delta.$$
For simplicity, we consider the other mixture $v_\theta(y) \stackrel{def}{=} \tilde{v}_\theta(y)\exp(-\frac{\|y\|^2}{2\varepsilon})$ which is again unnormalized and has scalar covariances, see the proof of [41, Theorem 3.4] for an explanation. We denote the weights, means, and covariances of this mixture by $\alpha_k \in \mathbb{R}_{+}$, $r_k \in \mathbb{R}^d$ and $\lambda_k \in \mathbb{R}_{+}$, respectively.
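As a purely illustrative aside (not part of the proof), an unnormalized scalar-covariance Gaussian mixture of the above form and its reweighting $v_\theta(y) = \tilde{v}_\theta(y)\exp(-\|y\|^2/(2\varepsilon))$ can be evaluated numerically as follows; all function names, shapes, and toy parameters below are our own assumptions:

```python
import numpy as np

def unnormalized_gaussian_mixture(y, weights, means, lambdas, eps):
    """Evaluate v_tilde(y) = sum_k alpha_k * N(y | r_k, eps*lambda_k*I_d).

    y: (n, d) points; weights: (K,) positive, not required to sum to 1;
    means: (K, d); lambdas: (K,) positive scalars; eps: entropic regularizer.
    """
    n, d = y.shape
    diffs = y[:, None, :] - means[None, :, :]            # (n, K, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)                # (n, K)
    var = eps * lambdas                                  # (K,) scalar covariances eps*lambda_k
    log_norm = -0.5 * d * np.log(2.0 * np.pi * var)      # (K,) Gaussian normalizing constants
    log_comp = log_norm[None, :] - 0.5 * sq_dist / var[None, :]
    return np.sum(weights[None, :] * np.exp(log_comp), axis=1)  # (n,)

def reweighted_mixture(y, weights, means, lambdas, eps):
    """v(y) = v_tilde(y) * exp(-||y||^2 / (2*eps)); still an unnormalized density."""
    v_tilde = unnormalized_gaussian_mixture(y, weights, means, lambdas, eps)
    return v_tilde * np.exp(-0.5 * np.sum(y ** 2, axis=1) / eps)

# Tiny usage example with made-up parameters (d = 2, K = 3).
rng = np.random.default_rng(0)
y = rng.normal(size=(5, 2))
w, r, lam = np.array([0.5, 1.2, 0.3]), rng.normal(size=(3, 2)), np.array([1.0, 0.5, 2.0])
print(reweighted_mixture(y, w, r, lam, eps=0.05))
```

Multiplying by the Gaussian factor keeps the mixture form with scalar covariances (only the weights, means and scales change), which is why the proof can continue to work with the parameters $\alpha_k, r_k, \lambda_k$.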
We derive that
$$\varepsilon\int \exp\Big\{\frac{\tilde{\phi}(x) + \tilde{\psi}(y) - \frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy = \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\int_{\mathbb{R}^d}\exp\Big\{{-\frac{\|x-y\|^2}{2\varepsilon}}\Big\}\exp\Big\{\frac{\tilde{\psi}(y)}{\varepsilon}\Big\}dy\,dx \stackrel{(41)}{\geq} \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\int_{\mathbb{R}^d}\exp\Big\{{-\frac{\|x-y\|^2}{2\varepsilon}}\Big\}\big(\tilde{v}_\theta(y) - \delta'\big)dy\,dx$$
$$= \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}\int_{\mathbb{R}^d}\exp\Big\{\frac{\langle x,y\rangle}{\varepsilon}\Big\}\underbrace{\exp\Big\{{-\frac{\|y\|^2}{2\varepsilon}}\Big\}\tilde{v}_\theta(y)}_{=v_\theta(y)}dy\,dx - \delta'\varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\underbrace{\int_{\mathbb{R}^d}\exp\Big\{{-\frac{\|x-y\|^2}{2\varepsilon}}\Big\}dy}_{=(2\pi\varepsilon)^{d/2}}dx$$
$$= \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}\underbrace{\int_{\mathbb{R}^d}\exp\Big\{\frac{\langle x,y\rangle}{\varepsilon}\Big\}v_\theta(y)dy}_{=c_\theta(x)}dx - \delta'\varepsilon(2\pi\varepsilon)^{\frac{d}{2}}\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}dx$$
$$\stackrel{(40)}{>} \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}c_\theta(x)dx - \delta'\varepsilon(2\pi\varepsilon)^{\frac{d}{2}}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}\underbrace{\int_{\mathbb{R}^d}\exp\{-\|x\|^2/\varepsilon\}dx}_{=(\pi\varepsilon)^{d/2}} = \varepsilon\int_{\mathbb{R}^d}\exp\Big\{\frac{\tilde{\phi}(x)}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}c_\theta(x)dx - \delta'\,2^{\frac{d}{2}}\pi^{d}\varepsilon^{d+1}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}. \tag{42}$$
Step 3. At this point, we show that for every δ' > 0 there exists an unnormalized Gaussian mixture $\tilde{u}_\omega$ which is δ'-close to $\exp\{\tilde{\phi}(x)/\varepsilon\}c_\theta(x)$. Using the closed-form expression for $c_\theta(x)$ from [41, Proposition 3.2], we get that
$$c_\theta(x) = \sum_{k=1}^{K}\alpha_k\exp\Big\{{-\frac{\|r_k\|^2}{2\varepsilon\lambda_k}}\Big\}\exp\Big\{\frac{\|r_k + \lambda_k x\|^2}{2\varepsilon\lambda_k}\Big\},$$
$$\exp\{\tilde{\phi}(x)/\varepsilon\}\,c_\theta(x) \leq \exp\Big\{\frac{\hat{a} - \max\{0, \|x\|^4 - R^4\}}{\varepsilon}\Big\}c_\theta(x) \leq \exp\Big\{\frac{\hat{a}+R^4}{\varepsilon}\Big\}\exp(-\|x\|^4/\varepsilon)\,c_\theta(x)$$
$$= \sum_{k=1}^{K}\alpha_k\exp\Big\{\frac{\hat{a}+R^4}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|r_k\|^2}{2\varepsilon\lambda_k}}\Big\}\exp\Big({-\frac{\|x\|^4}{\varepsilon}}\Big)\exp\Big\{\frac{\|r_k+\lambda_k x\|^2}{2\varepsilon\lambda_k}\Big\} = \sum_{k=1}^{K}\alpha_k\exp\Big\{\frac{\hat{a}+R^4}{\varepsilon}\Big\}\exp\Big\{{-\frac{\|r_k\|^2}{2\varepsilon\lambda_k}}\Big\}\exp\Big\{\frac{\frac{\|r_k+\lambda_k x\|^2}{2\lambda_k} - \|x\|^4}{\varepsilon}\Big\}.$$
From this, we see that $\exp\{\tilde{\phi}(x)/\varepsilon\}c_\theta(x)$ tends to zero as $\|x\|$ approaches infinity. It means that $x \mapsto \exp\{\tilde{\phi}(x)/\varepsilon\}c_\theta(x)$ corresponds to a normalizable density. Using [52, Theorem 5a], we get that for all δ' > 0 there exists an unnormalized Gaussian mixture $\tilde{u}_\omega$ such that
$$\|\tilde{u}_\omega - \exp(\tilde{\phi}/\varepsilon)c_\theta\|_{\infty} = \sup_{x \in \mathbb{R}^d}|\tilde{u}_\omega(x) - \exp(\tilde{\phi}(x)/\varepsilon)c_\theta(x)| < \delta'. \tag{43}$$
Analogously to $\tilde{v}_\theta$, we can set all the covariances in $\tilde{u}_\omega$ to be scalar, i.e., define $\tilde{u}_\omega = \sum_{l=1}^{L}\tilde{\beta}_l\mathcal{N}(x|\tilde{\mu}_l, \varepsilon\tilde{\zeta}_l I_d)$ for some L, $\tilde{\mu}_l \in \mathbb{R}^d$, $\tilde{\zeta}_l \in \mathbb{R}_{+}$ ($l \in \{1,\dots,L\}$). Moreover, we consider $u_\omega(x) = \tilde{u}_\omega(x)\exp\{-\frac{\|x\|^2}{2\varepsilon}\}$ which is again an unnormalized density with scalar covariances. We denote the weights, means and covariances of this mixture by $\beta_l \in \mathbb{R}_{+}$, $\mu_l \in \mathbb{R}^d$, $\zeta_l \in \mathbb{R}_{+}$, respectively. Next, we recall equation (42) and get that
$$\varepsilon\int \exp\Big\{\frac{\tilde{\phi}(x)+\tilde{\psi}(y)-\frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy \stackrel{(42),(43)}{>} \varepsilon\int_{\mathbb{R}^d}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}\big(\tilde{u}_\omega(x) - \delta'\big)dx - \delta'2^{\frac{d}{2}}\pi^{d}\varepsilon^{d+1}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}$$
$$= \varepsilon\int_{\mathbb{R}^d}\underbrace{\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}\tilde{u}_\omega(x)}_{=u_\omega(x)}dx - \delta'\varepsilon\underbrace{\int_{\mathbb{R}^d}\exp\Big\{{-\frac{\|x\|^2}{2\varepsilon}}\Big\}dx}_{=(2\pi\varepsilon)^{d/2}} - \delta'2^{\frac{d}{2}}\pi^{d}\varepsilon^{d+1}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\} = \varepsilon\|u_\omega\|_1 - \varepsilon\delta'(2\pi\varepsilon)^{\frac{d}{2}}\Big(1 + (\pi\varepsilon)^{\frac{d}{2}}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}\Big). \tag{44}$$
Step 4. Now we turn to the other expressions. Using the property that the function $t \mapsto \log t$ is $\frac{1}{t_{min}}$-Lipschitz on the interval $[t_{min}, +\infty)$, we get
$$\log\big(\exp\{\tfrac{\tilde{\psi}(y)}{\varepsilon}\}\big) - \log\big(\exp\{\tfrac{\tilde{\psi}(y)}{\varepsilon}\} - \delta'\big) \leq \frac{\delta'}{h_{min}}, \qquad \text{i.e.,} \qquad \log\big(\exp\{\tfrac{\tilde{\psi}(y)}{\varepsilon}\} - \delta'\big) \geq \frac{\tilde{\psi}(y)}{\varepsilon} - \frac{\delta'}{h_{min}}. \tag{45}$$
Similarly, we get that
$$\log\big(\exp\{\tfrac{\tilde{\phi}(x)}{\varepsilon}\}c_\theta(x) - \delta'\big) \geq \frac{\tilde{\phi}(x)}{\varepsilon} + \log c_\theta(x) - \frac{\delta'}{e_{min}}. \tag{46}$$
We use these inequalities, the monotonicity of the logarithm function, and (41), (43) to derive
$$\tilde{v}_\theta(y) \stackrel{(41)}{>} \exp\{\tfrac{\tilde{\psi}(y)}{\varepsilon}\} - \delta' \Longrightarrow \log\tilde{v}_\theta(y) > \log\big(\exp\{\tfrac{\tilde{\psi}(y)}{\varepsilon}\} - \delta'\big) \stackrel{(45)}{\geq} \frac{\tilde{\psi}(y)}{\varepsilon} - \frac{\delta'}{h_{min}} \Longrightarrow -\tilde{\psi}(y) > -\varepsilon\log\tilde{v}_\theta(y) - \frac{\varepsilon\delta'}{h_{min}}; \tag{47}$$
$$\tilde{u}_\omega(x) \stackrel{(43)}{>} \exp\{\tfrac{\tilde{\phi}(x)}{\varepsilon}\}c_\theta(x) - \delta' \Longrightarrow \log\tilde{u}_\omega(x) > \log\big(\exp\{\tfrac{\tilde{\phi}(x)}{\varepsilon}\}c_\theta(x) - \delta'\big) \stackrel{(46)}{\geq} \frac{\tilde{\phi}(x)}{\varepsilon} + \log c_\theta(x) - \frac{\delta'}{e_{min}} \Longrightarrow -\tilde{\phi}(x) > -\varepsilon\log\tilde{u}_\omega(x) + \varepsilon\log c_\theta(x) - \frac{\varepsilon\delta'}{e_{min}}. \tag{48}$$
Recall that $f_1^{*}$ and $f_2^{*}$ are non-decreasing functions. Moreover, they are Lipschitz with the constants $L_1, L_2$, respectively. Thus, we get
$$f_2^{*}(-\tilde{\psi}(y)) \geq f_2^{*}\Big({-\varepsilon}\log\tilde{v}_\theta(y) - \frac{\varepsilon\delta'}{h_{min}}\Big) = f_2^{*}\Big({-\varepsilon}\Big(\log v_\theta(y) + \frac{\|y\|^2}{2\varepsilon}\Big) - \frac{\varepsilon\delta'}{h_{min}}\Big) = f_2^{*}\Big({-\varepsilon}\log v_\theta(y) - \frac{\|y\|^2}{2} - \frac{\varepsilon\delta'}{h_{min}}\Big) \geq f_2^{*}\Big({-\varepsilon}\log v_\theta(y) - \frac{\|y\|^2}{2}\Big) - \frac{L_2\varepsilon\delta'}{h_{min}}; \tag{49}$$
$$f_1^{*}(-\tilde{\phi}(x)) \geq f_1^{*}\Big({-\varepsilon}\log\tilde{u}_\omega(x) + \varepsilon\log c_\theta(x) - \frac{\varepsilon\delta'}{e_{min}}\Big) = f_1^{*}\Big({-\varepsilon}\log u_\omega(x) - \frac{\|x\|^2}{2} + \varepsilon\log c_\theta(x) - \frac{\varepsilon\delta'}{e_{min}}\Big) \geq f_1^{*}\Big({-\varepsilon}\log u_\omega(x) + \varepsilon\log c_\theta(x) - \frac{\|x\|^2}{2}\Big) - \frac{L_1\varepsilon\delta'}{e_{min}}. \tag{50}$$
Integrating these inequalities over all x and y in the supports of p and q respectively, we get
$$\int_{\mathbb{R}^d} f_2^{*}(-\tilde{\psi}(y))q(y)dy \geq \int_{\mathbb{R}^d} f_2^{*}\Big({-\varepsilon}\log v_\theta(y) - \frac{\|y\|^2}{2}\Big)q(y)dy - \frac{L_2\varepsilon\delta'}{h_{min}}; \tag{51}$$
$$\int_{\mathbb{R}^d} f_1^{*}(-\tilde{\phi}(x))p(x)dx \geq \int_{\mathbb{R}^d} f_1^{*}\Big({-\varepsilon}\log u_\omega(x) + \varepsilon\log c_\theta(x) - \frac{\|x\|^2}{2}\Big)p(x)dx - \frac{L_1\varepsilon\delta'}{e_{min}}. \tag{52}$$
Finally, combining (44), (51) and (52), we get
$$\mathcal{L}(\theta,\omega) = \mathcal{J}\Big(\varepsilon\log\frac{u_\omega(x)}{c_\theta(x)} + \frac{\|x\|^2}{2},\ \varepsilon\log v_\theta(y) + \frac{\|y\|^2}{2}\Big) = \varepsilon\|u_\omega\|_1 + \int_{\mathbb{R}^d} f_1^{*}\Big({-\varepsilon}\log u_\omega(x) + \varepsilon\log c_\theta(x) - \frac{\|x\|^2}{2}\Big)p(x)dx + \int_{\mathbb{R}^d} f_2^{*}\Big({-\varepsilon}\log v_\theta(y) - \frac{\|y\|^2}{2}\Big)q(y)dy$$
$$< \varepsilon\int \exp\Big\{\frac{\tilde{\phi}(x) + \tilde{\psi}(y) - \frac{\|x-y\|^2}{2}}{\varepsilon}\Big\}dxdy + \int_{\mathbb{R}^d} f_1^{*}(-\tilde{\phi}(x))p(x)dx + \int_{\mathbb{R}^d} f_2^{*}(-\tilde{\psi}(y))q(y)dy + \delta'\Big[\frac{L_1\varepsilon}{e_{min}} + \frac{L_2\varepsilon}{h_{min}} + \varepsilon(2\pi\varepsilon)^{\frac{d}{2}}\Big(1 + (\pi\varepsilon)^{\frac{d}{2}}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}\Big)\Big]$$
$$= \mathcal{J}(\tilde{\phi},\tilde{\psi}) + \delta'\Big[\frac{L_1\varepsilon}{e_{min}} + \frac{L_2\varepsilon}{h_{min}} + \varepsilon(2\pi\varepsilon)^{\frac{d}{2}}\Big(1 + (\pi\varepsilon)^{\frac{d}{2}}\exp\Big\{\frac{\hat{a}+R^2}{\varepsilon}\Big\}\Big)\Big] < \mathcal{L}^{*} + \delta + \delta'\Big[\dots\Big] \leq \mathcal{L}^{*} + 2\delta,$$
where the last inequality holds by the choice of δ'. Since δ > 0 is arbitrary, rescaling δ yields $\mathcal{L}(\theta,\omega) - \mathcal{L}^{*} < \delta\varepsilon$ and, consequently, $D_{KL}(\gamma_{\theta,\omega}\|\gamma^{*}) \leq \varepsilon^{-1}(\mathcal{L}(\theta,\omega) - \mathcal{L}^{*}) < \delta$, which completes the proof. Remark. In fact, the assumption of the Lipschitzness of $f_1^{*}$, $f_2^{*}$ can be omitted. Indeed, under the "everything is compact" assumptions of Proposition 4.2 and Theorem 4.3, the inputs to $f_1^{*}(\cdot)$, $f_2^{*}(\cdot)$ also always belong to certain compact sets. Convex functions are known to be Lipschitz on compact subsets of $\mathbb{R}$, see [31, Chapter 3, M18], and we actually do not need the Lipschitzness on the entire $\mathbb{R}$. B Experiments Details B.1 General Details To minimize our objective (14), we parametrize αk, rk, Sk of vθ and βl, µl, Σl of uω in (11). Here we follow the standard practices in deep learning and parametrize the logarithms log αk, log βl instead of directly parameterizing αk, βl. In turn, the variables rk, µl are parametrized directly as multidimensional vectors. We consider diagonal matrices Sk, Σl and parametrize them via their diagonal values log(Sk)i,i and log(Σl)i,i, respectively. We initialize the parameters following the scheme in [41]. In all our experiments, we use the Adam optimizer. B.2 Details of the Experiment with Gaussian Mixtures We use K = L = 5, ε = 0.05, lr = 3e-4 and batch size 128. We do 2·10^4 gradient steps. For the Light SB algorithm, we use the parameters presented by the authors in the official repository. B.3 Details of the Image Translation Experiment We use the code and decoder model from https://github.com/podgorskiy/ALAE. We download the data and the neural-network-extracted attributes for the FFHQ dataset from https://github.com/ngushchin/LightSB/. In the Adult class we include the images with the attribute Age ≥ 44; in the Young class, those with Age ∈ [16, 44). We excluded the images with faces of children to increase the accuracy of classification per gender attribute. For the experiments with our solver, we use the weighted DKL divergence with the parameters τ specified in Appendix C, and set K = L = 10, ε = 0.05, lr = 1, and the batch size to 128. We do 5·10^3 gradient steps using the Adam optimizer [35] and the MultiStepLR scheduler with parameter γ = 0.1 and milestones = [500, 1000]. For testing the [41, Light SB] solver, we use the official code (see the link above) and the instructions provided by the authors. Baselines. For the OT-FM and UOT-FM methods, we parameterize the vector field (vt,θ)t∈[0,1] for mass transport using a 2-layer feed-forward network with 512 hidden neurons and ReLU activation. An additional sinusoidal embedding [65] was applied for the parameter t. The learning rate for the Adam optimizer was set to 1e-4. These methods are built on the solutions (plans π*(x, y)) of discrete OT problems; to obtain them, we use discrete OT solvers from the POT [21] package. Specifically for UOT-FM, we use ot.unbalanced.sinkhorn with the regularization equal to 0.05. We set the number of training and inference time steps equal to 100. To obtain the results of UOT-FM for Fig. 3, we run this method for 3K epochs with the parameter reg_m ∈ {5e-4, 5e-3, 5e-2, 0.5, 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6} and report the mean values of the final metrics over 3 independent launches with different seeds.
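For reference, a minimal sketch of how such a discrete unbalanced entropic plan can be obtained with the POT package; this is our own illustrative snippet (toy data, and we use POT's sinkhorn_unbalanced as one concrete option), not necessarily the exact call or settings used for the UOT-FM baseline:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Toy empirical batches standing in for samples from p and q (our own toy data).
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 2))          # source batch
y = rng.normal(size=(128, 2)) + 1.0    # target batch

a = np.full(len(x), 1.0 / len(x))      # uniform source weights
b = np.full(len(y), 1.0 / len(y))      # uniform target weights
M = ot.dist(x, y)                      # squared Euclidean cost matrix

reg = 0.05      # entropic regularization (epsilon)
reg_m = 0.005   # marginal relaxation strength (the reg_m swept in the experiments)

# Entropic unbalanced OT plan; its marginals only approximately match a and b.
plan = ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg, reg_m)
print(plan.shape, plan.sum())          # (128, 128); total mass need not equal 1
```

Smaller reg_m relaxes the marginal constraints more strongly, mirroring the role of the unbalancedness parameter in the experiments above.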
In Tables 20, 21, 22, for each translation we report the results for one chosen parameter specified in Appendix D.1. We use the corresponding checkpoints of UOT-FM to visualize its performance in Fig. 4.
Figure 5: Conditional plans γθ,ω(y|x) learned by our solver with scaled Dχ2 divergences in the Gaussian Mixture experiment, τ ∈ {1, 2, 5, 10}: (a) U-Light OT, τ = 10; (b) U-Light OT, τ = 5; (c) U-Light OT, τ = 2; (d) U-Light OT, τ = 1.
For [9, UOT-SD] and [70, UOT-GAN] we use the official code provided by the authors, see the links: https://github.com/Jae-Moo/Unbalanced-Optimal-Transport-Generative-Model, https://github.com/uhlerlab/unbalanced_ot. Since both the UOT-SD and UOT-GAN methods were not previously applied to the FFHQ dataset, we set up a grid search for the parameters and followed the instructions provided by the authors for the parameter settings. For both UOT-SD and UOT-GAN, we used 3-layer neural networks with 512 hidden neurons and ReLU activations for the generator networks and for the potential and discriminator networks, respectively. The Adam optimizer [35] with lr = 10^-5 and lr = 10^-4 was used to train the networks in UOT-SD and UOT-GAN, respectively. We train the methods for 10K iterations and set the batch size to 128. For UOT-SD, we used the DKL divergence with their unbalancedness parameter τ = 0.002. For the other parameters, we used the default values provided by the authors for CIFAR-10 generation tasks. For all baseline models which use entropy regularization, we set ε = 0.05. C Additional Discussion & Experiments with f-divergences Details about f-divergences between positive measures. In the classic form, f-divergences are defined as measures of dissimilarity between two probability measures. This definition should be revised when dealing with measures of arbitrary masses. In the paragraph below we show that if the function f is convex, non-negative, and attains zero uniquely at the point t = 1, then Df(µ1‖µ2) is a valid measure of dissimilarity between two positive measures. Let µ1, µ2 ∈ M2,+(Rd) be two positive measures. The f-divergence satisfies Df(µ1‖µ2) ≥ 0, which is obvious from the non-negativity of f. From the definition of Df(µ1‖µ2) and the fact that the function f attains zero uniquely at the point t = 1, we obtain that Df(µ1‖µ2) = 0 if and only if µ1(x) = µ2(x) holds µ2-almost everywhere. Actually, µ1(x) = µ2(x) should hold for all x, as µ1 must be absolutely continuous w.r.t. µ2 (otherwise Df(µ1‖µ2) is assumed to be equal to +∞). The usage of the Dχ2 divergence. We tested the performance of our solver with scaled DKL divergences in the main text, see M5.1, M5.2. For completeness, here we evaluate our solver with the Dχ2 divergence in the Gaussian Mixture experiment. We use the same experimental setup as in M5.1 and present the qualitative results in Fig. 5. Interestingly, the solver's results differ from those which we obtain for the DKL divergence. For the Dχ2 divergence, the supports of the learned plan's marginals cover only parts of the supports of the source and target measures when τ = 1. The issue disappears with a slight increase of τ, i.e., for τ = 2. At the same time, a further increase of τ is not useful, since the learned plans then fail to deal with the class imbalance issue. Thus, the parameter τ should be adjusted heuristically. In the case of the DKL divergence, the supports coincide for all τ, see Fig. 2. This motivates us to use DKL divergences in our main experiments. Parameter τ in the Gaussian Mixture experiment. An ablation study on the unbalancedness parameter τ in the Gaussian Mixture experiment is conducted in M5.1 and above in this section.
For completeness, we also perform the quantitative assessment of our solver with the DKL divergence for different unbalancedness parameters τ. We compute the normalized OT cost $\mathbb{E}_{x\sim p}\,\mathbb{E}_{y\sim\gamma(y|x)}\frac{\|x-y\|^2}{2}$ between the source and generated distributions, and the Wasserstein distance between the generated and target distributions (computed by a discrete OT solver). For completeness, we additionally calculate the metrics for the balanced [41, Light SB] approach. The results are presented in Table 4.
Table 4: Normalized OT cost between the input and learned distributions, and Wasserstein distance between the learned and target distributions in the Gaussian mixture experiment.
Method | U-Light OT (τ = 1) | U-Light OT (τ = 10) | U-Light OT (τ = 50) | U-Light OT (τ = 100) | Light SB
OT cost (↓) | 2.023 | 2.913 | 3.874 | 3.931 | 3.952
W2-distance (↓) | 2.044 | 1.107 | 0.138 | 0.091 | 0.088
Results. Recall that the unbalanced nature of our solver leads to two important properties. Firstly, our solver better preserves the properties of the input objects than the balanced approaches: indeed, it allows for preserving object attributes (classes) even in the case of class imbalance. Secondly, due to the relaxed boundary condition for the target distribution, the distribution generated by our solver is naturally less similar to the target distribution than for balanced methods. The above intuitive reasoning is confirmed by the metrics we obtained. Indeed, as the parameter τ increases and our method becomes more and more similar to the balanced approaches, the normalized OT cost between the source and generated distributions increases, and the Wasserstein distance between the learned and target distributions decreases. The Light SB [41] baseline, which is a purely balanced approach, shows the best quality in terms of the Wasserstein distance and the worst in terms of the OT cost. Parameters τ, ε in image experiments. The effect of the entropy regularization parameter ε is well studied, see, e.g., [28, 41]. Namely, increasing the parameter ε stimulates the conditional distributions γθ(y|x) to become more dispersed. Still, below we provide an additional quantitative analysis of its influence on the learned translation. Besides, we address the question of how the parameter τ influences the performance of our solver in the image translation experiments. To address this question, we learn the translations Young → Adult and Man → Woman on the FFHQ dataset varying the parameters τ, ε; see M5.2 for the experimental setup details. We test our solver with the scaled DKL divergence, training it for 5K iterations. Other hyperparameters are in Appendix B. In Tables 5, 8, 11, 14, we report the accuracy of keeping the attributes of the source images (e.g., gender in the Young → Adult translation). In Tables 6, 9, 12, 15, we report the accuracy of mapping to the correct target class (e.g., adult people in the Young → Adult translation). In Tables 7, 10, 13, 16, we report the FD metric, which is defined as the Fréchet distance between the means and covariances of the learned and target measures. For convenience, we additionally illustrate the results of the ablation studies in Fig. 6.
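A minimal sketch of how these three metrics (normalized OT cost, Wasserstein-2 distance via a discrete OT solver, and Fréchet distance between means and covariances) could be computed from samples. This is our own illustration: the function names, the use of POT's exact solver, and reading "normalized" as a per-sample average are assumptions, not necessarily the paper's evaluation code.

```python
import numpy as np
import ot
from scipy.linalg import sqrtm

def normalized_ot_cost(x, y_generated):
    """Monte-Carlo estimate of E_{x~p} E_{y~gamma(.|x)} ||x - y||^2 / 2,
    with y_generated[i] drawn from gamma(.|x[i])."""
    return 0.5 * np.mean(np.sum((x - y_generated) ** 2, axis=1))

def wasserstein2(y_generated, y_target):
    """W2 distance between two empirical samples via POT's exact (EMD) solver."""
    a = np.full(len(y_generated), 1.0 / len(y_generated))
    b = np.full(len(y_target), 1.0 / len(y_target))
    M = ot.dist(y_generated, y_target)        # squared Euclidean costs
    return np.sqrt(ot.emd2(a, b, M))          # sqrt of the optimal total cost

def frechet_distance(y_generated, y_target):
    """FD between Gaussian fits: ||m1-m2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2)."""
    m1, m2 = y_generated.mean(0), y_target.mean(0)
    s1, s2 = np.cov(y_generated, rowvar=False), np.cov(y_target, rowvar=False)
    covmean = sqrtm(sqrtm(s1) @ s2 @ sqrtm(s1)).real
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```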
ε τ 10 20 50 102 250 103 106 0.01 96.37 0.07 96.12 0.25 93.44 0.22 90.26 0.18 85.15 0.82 81.79 0.98 79.27 1.11 0.05 96.03 0.08 95.55 0.13 93.19 0.55 88.93 0.94 84.49 1.57 80.59 0.55 78.42 0.73 0.1 95.20 0.21 94.92 0.24 92.30 0.25 87.99 0.83 82.91 1.01 78.76 0.50 78.25 0.87 0.5 88.53 0.26 87.46 0.41 83.41 0.80 80.14 0.56 76.00 0.84 73.36 0.23 71.84 0.27 1.0 81.26 1.09 77.84 0.33 74.33 0.54 71.01 1.13 67.43 0.15 66.71 0.13 Table 5: Test accuracy ( ) of keeping the class in Young Adult translation. ε τ 10 20 50 102 250 103 106 0.01 38.94 0.91 51.92 0.80 67.68 1.06 75.49 0.30 81.34 1.06 83.67 0.74 85.27 1.35 0.05 40.09 0.16 53.19 0.49 69.01 0.74 77.41 0.67 81.78 0.33 84.80 0.90 85.63 0.76 0.1 44.18 1.50 56.74 0.58 71.77 0.57 78.34 0.91 83.70 0.54 87.07 0.37 88.21 0.23 0.5 50.51 0.34 65.61 2.04 81.50 1.17 87.48 0.06 92.21 0.29 92.93 0.14 93.80 0.55 1.0 70.81 2.69 83.82 2.45 89.56 0.50 93.78 0.35 95.04 0.20 95.50 0.41 Table 6: Test accuracy ( ) of mapping to the target in Young Adult translation. ε τ 10 20 50 102 250 103 106 0.01 35.55 1.24 26.77 1.59 17.39 0.19 16.42 1.96 12.28 1.16 11.46 0.70 10.85 0.12 0.05 42.05 3.04 31.47 0.03 23.90 1.49 18.31 0.16 17.15 0.48 16.27 0.62 19.89 5.51 0.1 50.05 2.48 40.67 0.73 31.15 0.32 27.49 0.56 25.86 0.88 24.57 0.06 26.30 0.15 0.5 114.49 1.06 104.42 0.33 98.21 4.21 92.48 0.98 89.42 0.10 88.74 0.12 89.55 1.18 1.0 165.39 1.50 149.84 0.86 142.39 0.18 137.40 0.21 135.85 0.17 135.25 0.17 Table 7: Test FD ( ) of generated latent codes in Young Adult translation. Results show that increase of ε negatively influences both accuracy of keeping the attributes of the source images and FD of generated latent codes which is caused by an increased dispersity of γθ(y|x). ε τ 10 20 50 102 103 106 0.01 96.14 0.08 95.96 0.03 93.60 0.72 89.88 0.74 81.49 0.89 80.19 2.17 0.05 95.36 0.41 95.26 0.24 92.77 0.48 89.48 0.60 81.34 0.35 79.99 0.42 0.1 94.73 0.22 94.33 0.21 92.65 0.52 89.12 0.40 81.06 0.82 77.73 1.21 0.5 88.21 0.82 88.43 0.15 86.13 0.94 83.89 0.67 74.92 0.53 70.32 1.46 1.0 81.67 0.94 79.71 1.28 77.15 1.65 67.80 0.30 65.31 0.90 Table 8: Test accuracy ( ) of keeping the class in Adult Young translation. ε τ 10 20 50 102 103 106 0.01 67.45 0.51 76.38 0.36 84.86 0.23 87.87 0.37 91.56 0.40 92.29 0.37 0.05 67.21 0.10 76.66 1.22 84.43 0.40 87.79 0.38 91.98 0.50 92.29 0.71 0.1 65.29 1.54 76.58 1.10 84.29 0.73 88.65 0.64 92.33 0.49 92.87 0.11 0.5 69.13 3.43 76.74 1.39 82.33 6.53 86.86 2.89 94.06 0.62 94.91 0.45 1.0 80.13 1.89 85.87 0.53 91.00 0.62 95.66 0.41 96.37 0.52 Table 9: Test accuracy ( ) of mapping to the target in Adult Young translation. ε τ 10 20 50 102 103 106 0.01 45.58 7.90 31.24 0.62 22.44 0.32 18.23 0.24 17.30 2.99 24.29 12.53 0.05 60.09 13.23 38.88 3.33 27.85 0.07 30.79 8.58 24.59 4.48 26.44 7.23 0.1 54.45 1.07 46.52 0.70 41.98 5.22 34.02 0.25 44.25 9.33 40.16 7.51 0.5 128.91 1.09 121.00 0.91 116.78 10.57 111.49 5.02 102.51 0.04 102.31 0.11 1.0 187.33 1.64 173.11 1.67 171.32 13.12 156.25 4.40 152.73 0.38 Table 10: Test FD ( ) of generated latent codes in Adult Young translation. ε τ 10 20 50 102 103 106 0.01 93.75 0.10 93.28 0.07 92.01 0.16 90.85 0.12 88.70 0.36 87.62 0.34 0.05 93.40 0.17 93.20 0.14 91.91 0.25 90.30 0.39 88.05 0.20 87.77 0.23 0.1 92.99 0.18 92.78 0.15 91.24 0.46 89.57 0.06 87.45 0.62 87.21 0.11 0.5 89.96 0.40 88.34 0.37 87.24 0.26 86.75 1.06 84.09 0.31 83.54 0.41 1.0 84.94 0.54 83.54 0.41 81.80 0.14 80.72 0.20 79.87 0.47 Table 11: Test accuracy ( ) of keeping the class in Man Woman translation. 
ε τ 10 20 50 102 103 106 0.01 61.38 0.29 74.57 0.78 85.78 0.87 90.32 0.70 93.79 0.26 94.19 0.40 0.05 60.97 1.14 74.99 0.32 85.01 0.64 90.23 0.50 93.87 0.25 94.44 0.07 0.1 61.36 0.60 73.52 0.24 86.38 0.40 90.38 0.39 94.19 0.28 94.73 0.36 0.5 65.74 0.87 73.61 0.58 86.45 0.48 89.80 0.66 95.14 0.58 95.59 0.61 1.0 78.18 0.18 86.85 0.74 91.50 0.74 95.63 0.14 95.93 0.17 Table 12: Test accuracy ( ) of mapping to the target in Man Woman translation. ε τ 10 20 50 102 103 106 0.01 54.84 4.12 37.87 0.55 27.27 3.17 22.69 3.48 27.73 4.28 17.32 1.29 0.05 51.61 2.58 51.25 11.78 34.28 1.04 27.29 2.60 30.63 1.67 25.77 2.22 0.1 59.45 3.60 51.17 3.28 43.05 3.68 38.02 1.64 34.42 0.91 37.54 1.79 0.5 132.45 6.07 119.82 1.07 107.16 1.48 107.23 6.76 103.24 2.32 100.71 1.66 1.0 182.26 2.20 164.51 1.50 156.41 2.04 146.73 0.28 146.05 0.10 Table 13: Test FD ( ) of generated latent codes in Man Woman translation. ε τ 10 20 50 102 103 106 0.01 95.70 0.05 95.31 0.10 93.68 0.09 92.46 0.13 89.08 0.32 88.81 0.05 0.05 95.55 0.08 95.05 0.09 93.67 0.10 92.32 0.27 89.66 0.17 88.39 0.42 0.1 95.21 0.26 94.83 0.19 93.14 0.30 91.96 0.26 89.09 0.53 87.38 0.40 0.5 93.02 0.22 92.53 0.29 91.13 0.01 89.73 0.44 85.87 1.20 85.24 0.41 1.0 90.00 0.78 88.83 0.73 87.46 0.30 83.32 1.03 83.16 0.55 Table 14: Test accuracy ( ) of keeping the class in Woman Man translation. Interestingly, accuracy of mapping to the correct target class does not have an evident dynamics w.r.t. ε. At the same time, when τ increases, the learned plans provide worse accuracy for keeping the input latents class but better FD of generated latent codes and accuracy of mapping to the target ε τ 10 20 50 102 103 106 0.01 51.54 0.48 64.67 1.09 77.59 0.39 82.52 0.49 88.61 0.48 88.99 0.18 0.05 49.26 0.72 64.09 1.08 76.88 0.17 83.47 0.52 88.59 0.55 89.14 0.55 0.1 49.74 0.78 63.99 0.82 76.96 0.54 82.45 0.22 89.29 0.55 89.14 0.38 0.5 48.64 2.31 60.88 0.55 77.58 0.28 82.10 1.17 89.82 0.99 90.63 1.06 1.0 62.62 0.24 74.95 1.17 82.42 0.73 90.20 0.34 90.75 0.37 Table 15: Test accuracy ( ) of mapping to the target in Woman Man translation. ε τ 10 20 50 102 103 106 0.01 47.00 1.74 33.65 0.67 23.85 1.15 20.83 1.27 16.48 0.09 18.78 3.44 0.05 52.30 1.29 39.79 1.25 29.66 1.81 27.23 3.45 24.68 2.80 23.43 1.91 0.1 58.40 0.62 48.66 0.73 37.55 0.35 37.74 2.39 31.63 0.42 32.83 2.02 0.5 131.17 0.78 120.63 0.76 108.44 0.60 104.85 1.17 101.29 0.11 102.26 1.58 1.0 186.46 0.92 169.64 0.63 160.52 0.42 152.78 0.13 152.37 0.09 Table 16: Test FD ( ) of generated latent codes in Woman Man translation. Figure 6: Visualization of ablation studies on parameters τ, ε in image translation experiment. class. It is an expected behavior since for bigger τ, the constraints on the marginals of the learned plans become more strict. That is, we enforce the marginals of the learned plans to be closer to source and target measures which allows learning more accurate mappings to target measure but does not allow keeping the source classes in the case of imbalance issues. Interestingly, in Adult Young translation, FD of learned latents and accuracy of mapping to the target do not change much for τ 102 while accuracy of keeping the attributes exhibits a significant drop between τ = 102 and τ = 103. Thus, we can treat τ = 102 as optimal since it provides the best trade-off between the quality of learned translations and their ability to keep the features of input latents. In the case of Young Adult translation, the values of accuracy and FD exhibit significant differences for considered τ. 
Thus, we may consider a more detailed scale and choose τ = 2.5 102 as the optimal one. It is important to note that our method offers a flexible way to select a domain translation configuration that allows for better preserving the properties of the original objects or generating a distribution closer to the target one. The final optimal configuration selection remains at the discretion of the user. The highlighted values in the Tables are used for comparison with other approaches in MD.1. Remark. FD should be treated as a relative measure of similarity between learned and target measures. The results obtained by balanced solver [41, Light SB] (equivalent to ours for big τ) are considered as a gold standard. Number of Gaussian components in potentials. For completeness, we perform an ablation study of our U-Light OT solver with different number on Gaussian components (K, L) in potentials vθ and uω, respectively. We run the solver in Young Adult translation with 5K steps, ε = 0.05 and set τ = 250. The quantitative results (accuracy of keeping the class, accuracy of mapping to the target, FD of generated latents vs target latents) are presented in the Tables below. The results show that in the considered task, our solver provides good performance even for small number of Gaussian components. This can be explained by the smoothness of the latent representations of data in ALAE autoencoder. K L 1 2 4 8 16 32 64 128 1 84.17 0.45 84.73 1.27 84.25 1.08 85.08 0.78 84.84 0.92 85.82 0.70 85.52 0.74 83.70 0.48 2 84.99 0.59 84.54 0.43 84.56 0.39 84.18 0.63 83.87 2.03 83.56 1.37 84.66 0.44 86.74 0.54 4 84.50 0.86 84.81 0.46 84.60 0.37 84.65 0.77 83.75 1.01 84.00 1.52 83.51 0.90 83.99 0.63 8 83.88 1.01 84.08 0.81 83.71 0.60 82.76 1.98 84.69 0.38 84.30 0.39 85.05 1.59 83.03 1.02 16 84.65 0.33 85.00 1.62 84.28 0.78 84.76 1.23 83.66 0.33 85.14 0.49 83.77 1.16 84.34 0.15 32 83.48 0.40 86.02 1.22 84.79 0.60 84.44 0.42 85.24 0.84 84.06 1.00 84.73 0.51 84.54 0.27 64 85.24 0.27 85.23 0.59 84.12 0.12 84.64 0.64 84.01 1.95 84.10 1.26 84.77 1.25 83.76 1.44 128 85.21 0.29 85.16 0.07 84.63 1.15 84.64 0.49 84.12 0.57 84.11 0.76 84.22 0.68 84.64 0.93 Table 17: Test accuracy ( ) of keeping the attributes in Young Adult translation for our U-Light OT solver with different number of Gaussian components in potentials. K L 1 2 4 8 16 32 64 128 1 83.35 0.85 82.03 0.11 82.69 0.82 83.34 0.08 82.95 0.46 82.80 0.58 82.49 0.64 82.54 0.36 2 83.26 0.98 81.80 0.81 81.82 0.66 82.35 0.92 82.77 0.45 82.26 0.98 82.36 0.37 82.84 0.38 4 82.76 0.17 81.47 0.29 82.32 0.13 82.60 0.66 82.70 0.64 82.48 0.37 83.23 1.24 82.44 0.68 8 82.35 0.39 81.63 0.38 82.23 0.52 82.19 0.72 82.64 0.58 82.55 0.18 82.85 0.28 82.87 0.09 16 82.43 0.61 82.12 1.10 81.30 0.79 82.11 0.93 82.56 0.26 81.88 0.85 83.34 0.57 82.90 0.39 32 81.57 0.47 81.39 0.62 81.76 0.27 81.28 0.57 82.47 1.30 82.06 0.41 82.35 0.55 82.56 0.89 64 82.01 0.79 82.33 0.83 81.97 0.61 82.01 0.26 82.44 0.74 81.61 0.75 82.06 0.51 83.44 0.38 128 82.39 0.92 82.18 0.56 81.74 0.19 82.22 0.74 81.58 0.23 82.66 0.33 83.07 0.30 82.85 0.45 Table 18: Test accuracy ( ) of mapping to the target in Young Adult translation for our U-Light OT solver with different number of Gaussian components in potentials. 
K L 1 2 4 8 16 32 64 128 1 17.41 0.28 17.42 0.36 18.10 1.41 17.83 0.41 17.97 0.69 18.42 0.86 17.69 0.37 18.07 0.61 2 19.07 0.81 17.47 0.83 16.87 0.25 16.93 0.56 21.05 1.86 17.38 0.69 17.22 0.76 17.83 0.70 4 17.07 0.53 16.66 0.14 17.02 0.84 18.01 1.79 16.71 0.06 17.07 0.95 16.59 0.41 16.42 0.10 8 16.55 0.22 23.56 7.59 16.37 0.23 16.81 0.95 17.21 1.01 18.37 1.47 17.13 0.77 16.96 0.86 16 17.58 1.58 18.06 1.61 18.25 2.46 17.69 1.87 18.00 0.34 17.19 0.71 17.62 0.93 18.97 2.01 32 17.85 0.10 17.15 0.71 16.50 0.10 21.91 4.05 20.04 2.78 19.31 3.22 18.79 2.03 22.56 1.66 64 21.39 1.32 21.91 4.23 21.00 4.47 18.02 1.24 21.63 3.14 19.56 0.99 17.53 0.47 20.26 4.46 128 22.09 4.91 35.68 21.07 33.80 21.67 20.46 2.39 24.80 5.33 22.73 3.03 22.09 1.25 22.72 7.41 Table 19: Test FD ( ) of generated latent codes in Young Adult translation for our U-Light OT solver with different number of Gaussian components in potentials. D Additional Experimental Results D.1 Quantitative comparison with other methods in Image-to-Image translation experiment In this section, we provide additional results of quantitative comparison of our U-Light OT solver and other unbalanced and balanced OT/EOT solvers in image translation experiment. Tables 20, 21, 22, provide values used for plotting Fig. 3 in the main text. The unbalancedness parameters used for our U-Light OT solver and [16, UOT-FM] are specified in th Tables below. For obtaining the result of [9, UOT-SD], we use their unbalancedness parameter τ = 0.002. For other details on methods parameters used to obtain the results below, see Appendix B. Note that we do not include FID metric for assessing the quality of generated images since we found that it is not a representative metric for assessing the performance of models performing the translation of ALAE latent codes. Experiment OT-FM [16] [41, Light SB] or [30, Light SBM] [16, UOT-FM] UOT-SD [9] UOT-GAN [70] U-Light OT (ours) Young Adult 67.71 78.16 84.46 (reg_m = 0.005) 45.71 73.85 84.49 (τ = 250) Adult Young 53.79 80.25 76.05 (reg_m = 0.005) 49.30 74.74 89.48 (τ = 102) Man Woman 76.05 87.82 84.42 (reg_m = 0.005) 75.50 84.04 90.30 (τ = 102) Woman Man 72.40 88.10 86.10 (reg_m = 0.05) 72.03 84.56 89.66 (τ = 103) Table 20: Comparison of accuracies of keeping the attributes of the source images. Experiment OT-FM [16] [41, Light SB] or [30, Light SBM] [16, UOT-FM] UOT-SD [9] UOT-GAN [70] U-Light OT (ours) Young Adult 93.28 85.97 74.10 (reg_m = 0.005) 87.33 84.25 81.78 (τ = 250) Adult Young 96.12 93.10 89.30 (reg_m = 0.005) 97.39 95.88 87.79 (τ = 102) Man Woman 93.33 94.37 92.35 (reg_m = 0.005) 98.16 97.38 90.23 (τ = 102) Woman Man 94.27 89.66 88.53 (reg_m = 0.05) 94.96 92.91 88.59 (τ = 103) Table 21: Comparison of accuracies of mapping to the target. Experiment OT-FM [16] [41, Light SB] or [30, Light SBM] [16, UOT-FM] UOT-SD [9] UOT-GAN [70] U-Light OT (ours) Young Adult 11.93 15.50 11.57 (reg_m = 0.005) 13.28 11.23 17.15 (τ = 250) Adult Young 14.10 21.41 17.00 (reg_m = 0.005) 18.44 14.94 30.79 (τ = 102) Man Woman 16.20 20.91 10.31 (reg_m = 0.005) 16.13 22.41 27.29 (τ = 102) Woman Man 11.42 30.87 6.99 (reg_m = 0.05) 13.23 10.55 24.68 (τ = 103) Table 22: Comparison of FD between the generated and learned latents. D.2 Outlier robustness property of U-Light OT solver To show the outlier robustness property of our solver, we conduct the experiment on Gaussian Mixtures with added outliers and visualize the results in Fig. 7. 
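A minimal sketch of how such toy measures with outlier components could be constructed (illustrative only: the exact means, weights, and outlier placement behind Fig. 7 are not specified here, so all numbers below are our own assumptions):

```python
import numpy as np

def sample_mixture(n, means, stds, weights, rng):
    """Draw n points from a 2D Gaussian mixture with the given components."""
    means, stds, weights = map(np.asarray, (means, stds, weights))
    idx = rng.choice(len(weights), size=n, p=weights / weights.sum())
    return means[idx] + stds[idx, None] * rng.normal(size=(n, 2))

rng = np.random.default_rng(0)

# Source p: two main classes plus a small, low-weight outlier blob (all values assumed).
x = sample_mixture(
    1024,
    means=[(-4.0, 4.0), (-4.0, -4.0), (8.0, 8.0)],   # last component = outliers
    stds=[0.7, 0.7, 0.3],
    weights=[0.475, 0.475, 0.05],
    rng=rng,
)

# Target q: imbalanced classes plus its own small outlier blob.
y = sample_mixture(
    1024,
    means=[(4.0, 4.0), (4.0, -4.0), (-8.0, -8.0)],
    stds=[0.7, 0.7, 0.3],
    weights=[0.8, 0.15, 0.05],
    rng=rng,
)
print(x.shape, y.shape)  # (1024, 2), (1024, 2)
```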
The setup of the experiment, in general, follows the Gaussian mixtures experiment setup described in section 5.2 of our paper. The difference consists in outliers (small Gaussians) added to the input and output measures.
Figure 7: Conditional plans γθ,ω(y|x) learned by our solver (τ = 1) and Light SB in the Gaussian Mixtures with outliers experiment: (a) measures p, q; (b) Light SB; (c) U-Light OT (ours).
The results show that our U-Light OT solver successfully eliminates the outliers and manages to simultaneously handle the class imbalance issue. At the same time, the balanced Light SB [41] solver fails to deal with either of these problems. E Limitations and Broader Impact Limitations. One limitation of our solver is the usage of the Gaussian Mixture parametrization which might restrict its scalability. This points to the necessity of developing ways to optimize objective (4) with a more general parametrization, e.g., neural networks. Broader impact. Our work aims to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Neur IPS Paper Checklist 1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: For each contribution listed in the abstract and introduction, we provide links to the sections about them. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss the limitations of our solver in M6. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide the proofs and the assumptions in Appendix A. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide a full list of the experimental details in Appendix B. We use publicly available datasets. The code for the experiments is provided in the supplementary material. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code for our solver is included in the supplementary material and will be made public after acceptance of our paper. We use publicly available datasets. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide the experimental details in Appendix B. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: The focus of the paper is mostly theoretical. The experiments are provided just for illustration of the derived theoretical results. 8.
Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The computational resources are discussed in M5. 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conforms with Neur IPS Code of Ethics. We discuss the societal impact of our paper in M6. 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We discuss the societal impact of our paper in M6. 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The research does not release data or models that have a high risk for misuse and does not need safeguards. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We cited the creators all of the used assets. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: The code is included in supplementary material. The license for the code will be provided after the paper acceptance. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not include crowdsourcing experiments or research with human subjects. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not include crowdsourcing experiments or research with human subjects.