# universal_neural_optimal_transport__08d9a4b8.pdf Universal Neural Optimal Transport Jonathan Geuter 1 2 * Gregor Kornhardt 3 * Ingimar Tomasson 3 * Vaios Laschos 4 Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this problem, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures for a given cost function. UNOT builds on Fourier Neural Operators, a universal class of neural networks that map between function spaces and that are discretization-invariant, which enables our network to process measures of variable resolutions. The network is trained adversarially using a second, generating network and a self-supervised bootstrapping loss. We ground UNOT in an extensive theoretical framework. Through experiments on Euclidean and non-Euclidean domains, we show that our network not only accurately predicts OT distances and plans across a wide range of datasets, but also captures the geometry of the Wasserstein space correctly. Furthermore, we show that our network can be used as a stateof-the-art initialization for the Sinkhorn algorithm with speedups of up to 7.4 , significantly outperforming existing approaches. 1. Introduction Optimal Transport (Villani, 2009; Peyr e & Cuturi, 2019) plays an increasing role in various areas in machine learning, such as domain adaptation (Courty et al., 2017), singlecell genomics (Schiebinger et al., 2019), imitation learning (Dadashi et al., 2020), imaging (Schmitz et al., 2018), dataset adaptation (Alvarez-Melis & Fusi, 2021), and signal processing (Kolouri et al., 2017). Oftentimes, an entropic *Equal contribution 1Harvard John A. Paulson School of Engineering and Applied Sciences 2Kempner Institute at Harvard University 3Department of Mathematics, Technische Universit at Berlin, Germany 4Weierstrass Institute, Berlin, Germany. Correspondence to: Jonathan Geuter . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 Relative Error 28x28 28x28 28x28 64x64 64x64 64x64 Ones Gauss UNOT Figure 1. Errors on the OT distance after a single Sinkhorn iteration for the default initialization (Ones), the Gaussian one (Thornton & Cuturi, 2022), and ours (UNOT), for c(x, y) = x y 2. regularizer is added, as this allows for efficient computation of the solution via the Sinkhorn algorithm (Cuturi, 2013). The entropic OT problem between probability measures µ P(X), ν P(Y) on Polish spaces X, Y, given a cost function c : X Y R { }, is defined as OTϵ(µ, ν) = inf π Π(µ,ν) X Y c dπ ϵKL(π||µ ν), (1) where Π(µ, ν) is the set of all transport plans (i.e. measures on X Y that admit µ resp. ν as their marginals), KL(π||µ ν) = R log(π/µ ν)dπ is the KL divergence of π from µ ν, and ϵ > 0 is a regularizing coefficient.1 Many of these applications require solving problem (1) repeatedly, such as in single-cell perturbations (Bunne et al., 2022; 2023b;a), Natural Language Processing (Xu et al., 2018), flow matching with OT couplings (Tong et al., 2024; Pooladian et al., 2023), or even seismology (Engquist & Froese, 2013). However, solving OT problems is computationally expensive, and fast approximation methods are an active area of research. Variations of the transport problem, such as (generalized) sliced Wasserstein distances (Kolouri et al., 2015; 2019), reduce computational complexity at 1For a background on OT, see Appendix A. Universal Neural Optimal Transport the cost of accuracy via random projections. Two previous works are aimed at predicting good initializations for the Sinkhorn algorithm, which can iteratively solve problem (1). In (Thornton & Cuturi, 2022), initializations are computed from OT problems between Gaussians. (Amos et al., 2023) train a neural network to predict transport plans and costs via the entropic dual OT problem (Section 2). While their framework shares some similarities with ours (see Section 5), it is inherently limited to measures of fixed dimension from the training dataset. Instead, we present a Universal Neural OT (UNOT) solver which, given discrete measures µ P(X) and ν P(Y) of variable resolution (viewed as discretizations of continuous measures; see Section 3.1), can accurately predict the OT cost and plan associated with problem (1). To this end, we leverage Fourier Neural Operators (FNOs) (Kovachki et al., 2024), a discretization-invariant class of neural networks that can process inputs of variable sizes. An FNO Sϕ is trained to predict a solution to the dual OT problem (Section 2) given two measures (µ, ν), from which the primal problem (1) can be solved. Training is self-supervised with an adversarial generator network Gθ which creates training distributions Gθ(z) = (µ, ν) from z ρz = N(0, I) (see Section 3.4). We want to highlight that in contrast to most neural OT frameworks, such as (Bunne et al., 2023a; Uscidda & Cuturi, 2023; Korotin et al., 2023), we generalize across OT problems (given a fixed cost). Ge ONet (Gracyk & Chen, 2024) also uses Neural Operators to learn Wasserstein geodesics. In Section 4, we show that UNOT significantly outperforms Ge ONet on approximating geodesics, despite not being trained on them. Our contributions are as follows: present UNOT, the first neural OT solver capable of generalizing across datasets and input dimensions introduce a generator Gθ (Section 3.3) which can provably generate any discrete distribution (of fixed dimension) during training (in fact, we prove this result for a very general class of residual networks, see Theorem 3 and Corollary 4) propose a self-supervised bootstrapping loss which provably minimizes the loss against the ground truth dual potentials (Proposition 5) show that UNOT can accurately predict OT distances across various datasets, costs, and domains of different dimensions up to a few percent error, and that it accurately captures the geometry of the Wasserstein space by approximating barycenters (Section 4) approximate Wasserstein geodesics through barycenters and OT plans predicted by UNOT (Section 4) demonstrate how UNOT sets a new state-of-the-art for initializing the Sinkhorn algorithm while maintaining its desirable properties, such as parallelizability and differentiability (Section 4) 2. Background We give a brief overview of optimal transport, and how it relates to UNOT. For a more thorough introduction, see Appendix A.1 or (Peyr e & Cuturi, 2019; Villani, 2009). Notation. We write vectors in bold (x, u, etc.) and matrices in capitals (X, U, etc.). 1n Rn denotes the all-ones vector, n 1 the simplex in Rn, and all elements in n 1 with positive entries are denoted by n 1 >0 . P(X) denotes the space of probability measures on X, and Pp(X) the set of probability measures with finite p-th moments. For µ P(X) and a map T, we denote by T#µ the pushforward of µ under T, i.e. the measure µ T 1. S2 = {x R3 : x = 1} is the unit sphere in R3. 2.1. Optimal Transport Unregularized Optimal Transport. The unregularized problem takes the form infπ Π(µ,ν) R X Y c(x, y)dπ(x, y), akin to (1) without the regularization term. In the case where X = Y, c(x, y) = d(x, y)p for p 1 and a metric d on X, the Wasserstein-p distance is defined as Wp(µ, ν) = inf π Π(µ,ν) X X d(x, y)pdπ(x, y) 1 which is indeed a distance on the space Pp(X) of Borel measures with finite p-th moments (Villani, 2009). Dual Optimal Transport Problem. The regularized Kantorovich problem (1) admits a dual formulation: sup f L1(µ),g L1(ν) X f(x)dµ(x) + Z Y g(y)dν(y) ıϵ(f, g), ıϵ(f, g) = ϵ Z X Y e 1 ϵ (f(x)+g(y) c(x,y)) 1dµ(x)dν(y). It can be shown that if c L1(µ ν), the values of the primal (1) and the dual (3) coincide (Nutz, 2022). Disrete Optimal Transport. We want to apply UNOT to discretizations of measures µ P(X), ν P(Y) (see Section 3.1). To this end, consider measures µ and ν that are supported on finitely many points x1, ..., xm X, y1, ..., yn Y resp, i.e. µ = Pm i=1 aiδxi, ν = Pn j=1 bjδyj. By abusing notation, we can write µ Rm 0 and ν Rn 0.2 Note that the dual potentials in problem (3) are elements in L1(µ) resp. L1(ν), hence we can abuse notation again and consider the potentials f Rm and g Rn to be vectors as well. This point of view gives rise to discrete optimal 2Whenever we view a discrete measure or a function as a vector, we will use bold characters. Universal Neural Optimal Transport transport. We set C Rm n via Cij = c(xi, yj), and view transport plans as matrices Π Rm n (see Appendix A.1 for a more thorough introduction to discrete OT). The following proposition shows that the plan Π can be recovered from the dual vectors f and g (Peyr e & Cuturi, 2019). Proposition 1. Define the Gibbs kernel K = exp( C/ϵ). The unique solution Π of the discrete OT problem is given by Π = diag(u)Kdiag(v) (4) for two positive scaling vectors u and v unique up to a scaling constant (i.e. λu, 1 λv for λ > 0). Furthermore, (u, v) are linked to a solution (f, g) of the dual problem via (u, v) = (exp(f/ϵ), exp(g/ϵ)) . In Section 3.1, we show how the solution to the entropic dual between discrete (µn, νn) converges to the solution of the continuous dual (3) as (µn, νn) converge to continuous measures (µ, ν) in some way, which will be crucial for the design of our network Sϕ. 2.2. The Sinkhorn Algorithm The Sinkhorn Algorithm 1 can iteratively solve the discrete dual problem and was introduced in (Cuturi, 2013). It requires an initialization v0 Rn, which is typically set to 1n, and µ and ν to be positive everywhere. Algorithm 1 Sinkhorn(µ, ν > 0, K = exp( C/ϵ), ϵ, v0) 1: for l = 0, ..., N do 2: ul+1 µ./Kvl 3: vl+1 ν./K ul+1 4: end for 5: Π diag(ul)Kdiag(vl), OTϵ(µ, ν) C, Π 6: return u, v, Π, OTϵ(µ, ν) In the algorithm, ./ is to be understood as element-wise division. Sinkhorn and Knopp (Sinkhorn & Knopp, 1967) showed that the iterates ul and vl from the algorithm converge to the vectors u and v from Proposition 1. 2.3. Predicting Dual Potentials Given discrete measures µ and ν, UNOT should ultimately be used to approximate the associated transport plan and cost. However, given an optimal dual potential v, the corresponding potential u can be computed as u = µ./Kv, (5) which also holds at convergence of the Sinkhorn algorithm. Thus, solving for the m n-dimensional plan Π can be reduced to a n-dimensional problem over v. Since computations in the log space tend to be more stable (Peyr e & Cuturi, 2019), we will instead let UNOT predict the dual potential g = ϵ log(v), i.e. Sϕ(µ, ν) = g, µ P(X), ν P(Y). The prediction g can then be used to solve the entropic OT problem via the relationship (5) and Proposition 1, or to initialize the Sinkhorn algorithm via v0 = exp(g/ϵ). Note that the solution to the entropic dual is not unique (see Proposition 1). How we account for this non-uniqueness is explained in Section 3.4. However, when endowing Rm Rn with the equivalence relation (u1, v1) (u2, v2) λ > 0 : (u1, v1) = (λu2, 1 λv2) (i.e. accounting for the non-uniqueness of the dual solution), the map (µ, ν) 7 v, mapping two measures to the associated dual potential in the quotient space, is Lipschitz continuous (Carlier et al., 2022), which supports its learnability by a neural network. 3. Universal Neural Optimal Transport Consider the OT problem between two (grayscale) images, encoded as vectors in µn, νn Rn. These can be viewed as discrete measures on P([0, 1]2), which discretize continuous measures µ, ν P([0, 1]2), where the discretization depends on the resolution of the image, and the continuous measures correspond to the images at infinite resolution. UNOT should predict the corresponding dual potential gn Rn solving (3) independent of the resolution n.3 In Section 3.1, we establish a convergence result for the dual potentials as n , which justifies the use of Neural Operators (Kovachki et al., 2024) as a parametrization of Sϕ; also see Section 3.2. Furthermore, as we want UNOT to work across datasets, we require a generator Gθ that can provably generate any pair of distributions during training (Section 3.3). In Section 3.4, we construct an adversarial training objective for Sϕ and Gθ. Further details about hyperparameter and architecture choices can be found in Appendix C. The implementation and model weights are available at https: //github.com/Gregor Kornhardt/UNOT. 3.1. Convergence of Dual Potentials In this section, we prove convergence of the discrete dual potentials gn as n goes to infinity. For brevity, this section is kept informal; see Appendix B for a formal treatment. Assume now that X = Y RN is compact, and c(x, y) is Lipschitz continuous in both its arguments. For absolutely continuous µ, ν P(X), denote by (µn)n N, (νn)n N P(X) discretizing sequences of µ and ν (formally defined in 3Note that while we consider images as an example, the learning task is the same for any setting where discrete measures of varying resolution share an underlying continuous cost function, which arises in settings such as single-cell genomics, fluid dynamics, point cloud processing, or economics. Universal Neural Optimal Transport Appendix B). While a solution (fn, gn) of the discrete dual problem between µn and νn is only defined µn - resp. νn - a.e., it can be canonically extended to all of X (Feydy et al., 2018) (see Appendix B for details). The following proposition shows that the extended potentials (fn, gn) converge to the solution (f, g) of the continuous entropic problem. Proposition 2. (Informal) Let (µn)n N, (νn)n N be discretizing sequences for absolutely continuous µ, ν P(X). Let (fn, gn) be the (unique) extended dual potentials of (µn, νn) such that fn(x0) = 0 for some x0 X and all n. Let (f, g) be the (unique) dual potentials of (µ, ν) such that f(x0) = 0. Then fn and gn converge uniformly to f and g on all of X. A formal version and its proof can be found in Appendix B. This proposition is crucial in designing our network Sϕ, as we discuss in the following section. 3.2. Fourier Neural Operators Fourier Neural Operators (FNOs) (Kovachki et al., 2024) are neural networks mapping between infinite-dimensional function spaces. More precisely, a neural operator is a map F : A U between Banach spaces A and U of functions a A : Da Rd a and u : Du Rd u respectively, for bounded domains Da Rda and Du Rdu. An input function a A evaluated at points x1, ..., xn Rda can be encoded as a vector a = [a(x1), ..., a(xn)] Rn d a; the same applies to the output function u U, which can be written as u = [u(y1), ..., u(ym)] Rm d u, corresponding to the values at y1, ..., ym Rdu. At its core, an FNO applies a sequence of L kernel layers to the input vector a. In each of these layers, a fixed number of Fourier features of the discrete Fourier transform of the input is computed, the features are transformed by a (C-valued) linear layer (we use a two-layer network in practice instead, as we found it to improve performance), and then mapped back by the inverse Fourier transform. Importantly, neural operators are by construction discretization-invariant when inputs and outputs correspond to discretizations of underlying functions. This is exactly what Proposition 2 guarantees: the dual potentials corresponding to measures µn and νn converge uniformly to the continuous potentials corresponding to the limiting distributions µ and ν as the resolution of µn and νn increases. Hence, FNOs are a natural choice of architecture in our setting. More details on FNOs, and how we implemented Sϕ, can be found in Appendix A.5. 3.3. Generating Measures for Training UNOT is trained on pairs of distributions generated by a generator network Gθ of the following form: Gθ : Rd P(X) P(X) z ρz 7 R [Re LU (NNθ(z) + λ Id,d (z)) + δ] , (6) where ρz = N(0, Id) is a Gaussian prior, NNθ is a trainable neural network (in practice, we use a 5-layer fully connected MLP, see Appendix C), Id,d is an interpolation operator matching the generator s output dimension d and acting as a skip connection reminiscent of Res Nets (He et al., 2016), and λ > 0 is a constant for the skip connection. δ > 0 is a small constant needed to generate our targets with the Sinkhorn algorithm, as outlined in Section 3.4. R denotes renormalizing to two probability measures and downsampling them to random dimensions in a set range, such that Sϕ trains on measures of varying resolutions, which is known to improve NO training (Li et al., 2024a). More specifically, if we write [x1, x2] = Re LU (NNθ(z) + λ Id,d (z)) + δ for two vectors x1 and x2 of equal size (say both with n samples), R first maps them to [x1/ P i(x1)i, x2/ P i(x2)i] and then uses 2D bilinear interpolation to downsample them to m samples each. The generator is universal in the following sense: Theorem 3. Let 0 < λ 1 and Gθ : Rd Rd be defined via Gθ(z) = Re LU (NNθ(z) + λz) , where z ρz = N(0, I), and where NNθ : Rd Rd is Lipschitz continuous with Lip(NNθ) = L < λ. Then Gθ is Lipschitz continuous with Lip(q) < L + λ, and G(z) := NNθ(z) + λz is invertible on Rd. Furthermore, for any x Rd 0 it holds ρGθ#ρz(x) 1 (L + λ)d N Gθ 1(x)|0, I . In other words, Gθ#ρz has positive density at any nonnegative x Rd 0. This shows that any pair of discrete probability measures (µ, ν) of joint dimension d can be generated by Gθ. A direct consequence of the theorem is an extension to functions that are compositions of functions Gθ as above, which covers a wide class of Res Nets. Both proofs can be found in Appendix B. Corollary 4. Let Gθ = Gθ1 Gθ1 ... GθR be a composition of functions Gθi, each of which is of the form as in Theorem 3. Let z ρz = N(0, I). Then ρ Gθ#ρz(x) 1 (L + λ)Rd N Gθ 1(x)|0, I for any x Rd. As in Theorem 3, this also holds for any x Rd 0 if Gθ is followed by a Re LU activation. Although the more general Corollary 4 is not needed for our purposes, it might be of independent interest to the research community. Note that the generator in Theorem 3 does not exactly match our generator s architecture. A discussion of how the theorem relates to our setting, as well as further details on the generator, can be found in Appendix C. Universal Neural Optimal Transport Figure 2. Generated pair of training samples (lighter=more mass). Figure 2 shows a pair of samples generated by Gθ. The generator seems to layer highly structured shapes with more blurry ones. More examples, as well as an analysis of the performance of Sϕ on samples generated by Gθ over the course of training, can be found in Appendix D.6. 3.4. UNOT Training Algorithm Given a pair of distributions (µ, ν) = Gθ(z) (in this section, we will remove the subscript n for clarity), Sϕ(µ, ν) =: gϕ should predict the true dual potential g associated with µ and ν. Hence, we could simply compute the true potential g with the Sinkhorn algorithm and use L2(gϕ, g) := gϕ g 2 2 as our training loss. However, it would be prohibitively expensive to run the Sinkhorn algorithm until convergence. Hence, we instead employ a bootstrapping loss on the prediction gϕ. Let τk : (µ, ν, gϕ) 7 gτk denote running the Sinkhorn algorithm on (µ, ν) with initialization v0 = exp(gϕ/ϵ) for a very small number of iterations k, i.e. warmstarting the Sinkhorn algorithm with the current prediction gϕ, and returning ϵ log v = gϕ.4 To ensure uniqueness and improve training, we shift gτk to have zero sum; this corresponds to the non-uniqueness of the dual potentials, see Proposition 1. Minimizing L2(gϕ, gτk) implies minimizing the ground truth loss L2(gϕ, g) against the true potential g. Proposition 5. For two discrete measures (µ, ν) with n particles, let g be an optimal dual potential, gϕ = Sϕ(µ, ν), and gτk = τk(µ, ν, gϕ). Without loss of generality, assume that P i gτk i = 0. Then L2(gϕ, g) c(K, k, n) L2(gϕ, gτk) for some constant c(K, k, n) > 1 depending only on the Gibbs kernel K, k and n. The proposition shows that minimizing L2(gϕ, gτk) implies minimizing L2(gϕ, g), i.e. the loss between the prediction and the ground truth potential. The proof is based on the Hilbert projective metric (Peyr e & Cuturi, 2019) and can be found in Appendix B. Training objective. Having defined the loss for Sϕ, as well 4The Sinkhorn algorithm requires input measures to be positive; this is the reason we add the constant δ > 0 in the generator. Algorithm 2 UNOT Training Algorithm 1: in cost c, reg parameter ϵ, prior ρz, learning rates {αi}i, {βi}i, Sinkhorn target generator τk 2: for i = 1, 2, ..., T do 3: z sample(ρz) 4: (µ, ν) Gθ(z) 5: for mini-batch (µb, νb) in (µ, ν) do 6: gϕ Sϕ(µb, νb) 7: gτk τk(µb, νb, gϕ) 8: ϕ ϕ αi ϕ L2(gτk, gϕ) 9: end for 10: for mini-batch zb in z do 11: (µθ, νθ) Gθ(zb) 12: gθ Sϕ(µθ, νθ) 13: gτk τk(µθ, νθ, gθ) 14: θ θ + βi θ L2(gτk, gθ) 15: end for 16: end for as the target generation procedure, the training objective for Sϕ and Gθ consists of Sϕ trying to minimize the loss L2(gϕ, g), while Gθ attempts to maximize it, similar to the training objective in GANs (Goodfellow et al., 2014). Putting everything together, our adversarial training objective for Sϕ and Gθ reads max θ min ϕ Ez ρz [L2 (τk (G(z), S(G(z))) , Sϕ(Gθ(z)))] , (7) where S and G without subscripts denote no gradient tracking, as the target is not backpropagated through. The training algorithm can be seen in Algorithm 2. In practice, training will be batched, which we omitted for clarity. Note that vectors g with subscripts θ or ϕ are backpropagated through with respect to these parameters, whereas target vectors (with subscript τk) are not. 4. Experiments Training Details. We test UNOT in three different settings: a) with c(x, y) = x y 2 2 on the unit square X = [0, 1]2; b) with c(x, y) = x y 2 on [0, 1]2; c) with the spherical distance c(x, y) = arccos( x, y ) on the unit sphere S2 = {x R3 : x = 1}. For each of these settings, we train a separate model on 200M samples z, where training samples (µ, ν) are between 10 10 and 64 64 dimensional (randomly downsampled in Gθ). Training takes around 35h on an H100 GPU. Sϕ is an FNO with 26M parameters optimized with Adam W (Loshchilov & Hutter, 2019); Gθ is a fully connected MLP with 272k parameters optimized with Adam (Kingma & Ba, 2017). In the spherical setting c(x, y) = arccos( x, y ) we parametrize Sϕ with a Spherical FNO (SFNO) (Bonev et al., 2023) instead, which is essentially an FNO adapted to the sphere; for more details Universal Neural Optimal Transport Table 1. Mean number of iterations needed to achieve 0.01 relative error on the OT distance for c(x, y) = x y 2. UNOT (OURS) ONES GAUSS MNIST 3 5 16 9 10 7 CIFAR 3 6 80 22 52 19 CIFAR-MNIST 4 4 32 15 13 9 LFW 7 8 78 20 35 14 BEAR 4 6 41 16 25 13 LFW-BEAR 4 6 53 18 29 13 Table 2. Relative speedup of Sinkhorn with UNOT and cost c(x, y) = x y 2. Time in s to achieve 0.01 relative error on the OT distance. UNOT (OURS) ONES SPEEDUP MNIST 1.2 10 3 1.5 10 3 1.25 CIFAR 9.5 10 4 7.1 10 3 7.4 CIFAR-MNIST 1.3 10 3 2.7 10 3 2.07 LFW 3.0 10 3 1.5 10 2 5 BEAR 2.6 10 3 1.0 10 2 3.8 LFW-BEAR 2.7 10 3 1.2 10 2 4.4 on FNOs and SFNOs see Appendix A.5. We highlight that Sϕ is relatively small, such that its runtime vanishes compared to the runtime of even just a few Sinkhorn iterations, making it much cheaper to run than Sinkhorn (see Section 4.1). We set k (the number of Sinkhorn iterations in τk) to 5, and ϵ = 0.01. Additional training details can be found in Appendix C. We demonstrate the performance of the three models on various tasks, such as predicting transport distances, initializing the Sinkhorn algorithm, computing Sinkhorn divergence barycenters, and approximating Wasserstein geodesics. For the Euclidean settings a) and b) (from above), we view images as discrete distributions on the unit square, and test on MNIST (28 28), grayscale CIFAR10 (28 28), the teddy bear class from the Google Quick, Draw! dataset (64 64), and Labeled Faces in the Wild (LFW, 64 64), as well as cross-datasets CIFAR-MNIST and LFW-Bear (where µ comes from one dataset and ν from the other). For the spherical setting c), we project these images onto the unit sphere in R3 (for details, see Appendix D.1). Unless otherwise noted, we perform a single Sinkhorn iteration on g = Sϕ(µ, ν) in all experiments in order to compute the second potential f. Errors are averaged over 500 samples. Additional experiments, including a sweep over input sizes 10 10 to 64 64, as well as variants of UNOT for fixed input dimension or variable ϵ, can be found in Appendix D. 4.1. Predicting Transport Distances We compare the convergence of the Sinkhorn algorithm in terms of relative error on the transport distance OTϵ(µ, ν) for our learned initialization v0 = exp(Sϕ(µ, ν)/ϵ) to 0.3 MNIST(28x28) CIFAR(28x28) MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.3 LFW(64x64) 0 10 20 30 40 50 Iteration BEAR(64x64) 0 10 20 30 40 50 Iteration LFW-BEAR(64x64) Relative Error UNOT Gauss Ones Figure 3. Relative error on the OT distance for Sinkhorn with our initialization (UNOT), compared to the default (Ones) and Gaussian initialization (Gauss) (Thornton & Cuturi, 2022). the default initialization 1n and the Gaussian initialization from (Thornton & Cuturi, 2022), which is based on closed-form solutions for Gaussian distributions. Note that the Gaussian initialization is only valid for c(x, y) = x y 2, hence we omit it when c(x, y) = x y or c(x, y) = arccos( x, y ).5 We do not compare to Meta OT (Amos et al., 2023) here, as their approach is inherently dataset dependent and breaks down when testing on out-of-distribution data.6 For completeness, we include a detailed comparison in Appendix D.2, which shows that UNOT significantly outperforms Meta OT on all datasets except MNIST, the training dataset of Meta OT. Surprisingly, UNOT also almost matches Meta OT on MNIST, despite not having seen any MNIST samples during training, while Meta OT was explicitly trained on them. Figure 1 (Section 1) shows the relative error on OTϵ(µ, ν) after a single Sinkhorn iteration for c(x, y) = x y 2, and Figure 4 shows the same plot for c(x, y) = x y on the square, and c(x, y) = arccos( x, y ) on the sphere. In Figure 3, we plot the relative error on the OT distance over the number of Sinkhorn iterations for c(x, y) = x y 2 (for the equivalent plots for the other cost functions, please see Appendix D.7), demonstrating that UNOT can be used as a state-of-the-art initialization. Table 1 shows the average number of Sinkhorn iterations needed to achieve 0.01 relative error on OTϵ(µ, ν) for c(x, y) = x y 2. In Table 2 we show the relative speedup achieved by initializing the Sinkhorn algorithm with UNOT implemented in JAX over the default initialization (on a batch size of 64 in float32 5If the cost function is not x y 2, the Gaussian initialization is not theoretically justified. Empirically, we noted that it behaves similar to the default initialization in these cases. 6We note that it should be possible to finetune UNOT on specific datasets as well; however, we have not tested this. Universal Neural Optimal Transport MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 Relative Error 28x28 28x28 28x28 64x64 64x64 64x64 MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 Relative Error 28x28 28x28 28x28 64x64 64x64 64x64 Figure 4. Error on the OT distance after a single Sinkhorn iteration with UNOT vs. the default initialization (Ones) for cost x y on the square [0, 1]2 (left) and arccos( x, y ) on the sphere {x R3 : x = 1} (right). Figure 5. Sinkhorn divergence barycenters computed with UNOT via eq. (10) (top) vs. ground truth (bottom) of between 5 to 10 MNIST samples of the same digit per barycenter. on an NVIDIA 4090). We achieve an average speedup of 3.57 on 28 28 datasets and 4.4 on 64 64 datasets.7 For comparison, the relative speedup achieved in (Amos et al., 2023) was 1.96 (for a model trained only on MNIST). 4.2. Sinkhorn Divergence Barycenters The Wasserstein barycenter for a set of measures {ν1, ..., νN} P2(X) and λ n 1 is defined as µ = arg min µ P2(X) i λi W 2 2 (µ , νi). (8) 7We did not optimize the network Sϕ much for efficiency, and more efficient implementations likely exist. Note that FNOs process complex numbers, but Py Torch is heavily optimized for real number operations. With kernel support for complex numbers, UNOT will likely be much faster. In addition, computation times can vary significantly across hardware, batch sizes, precision, etc. To make this problem tractable, consider the Sinkhorn divergence barycenter µ = arg min µ P2(X) i λi SDϵ(µ , νi), (9) where the Sinkhorn divergence8 between µ and ν is SDϵ(µ, ν) = OTϵ(µ, ν) 1 2 OTϵ(µ, µ) 1 2 OTϵ(ν, ν). Now for discrete measures µ, ν, denote by (f, g) the dual potentials for OTϵ(µ, ν), and by p that for OTϵ(µ, µ).9 Writing µ = P i aiδxi for some a Rn, the gradient of (9) w.r.t a is given by (cf. (Feydy et al., 2018)): a SDϵ(µ, ν) = f p. (10) Hence, we can solve (9) with (projected) gradient descent, where Sϕ predicts f and p in (10).10 Further details and a pseudocode can be found in Appendix A.2. Throughout this section, we set c(x, y) = x y 2, and always run 200 gradient steps using gradients from (10). Figure 5 shows UNOT barycenters vs. the true barycenters (computed with the POT library) of between 5 and 10 MNIST samples of the same digit per barycenter. In Appendix D.7, we also provide quantitative results for barycenters with different initializations. In Figure 6, we show barycenters computed between four shapes. UNOT accurately predicting barycenters demonstrates it captures the geometry of the Wasserstein space beyond predicting distances. 8It can be seen as a debiased version of OTϵ(µ, ν), and we use it as an approximation of the squared Wasserstein distance. 9If both measures are identical, the dual potentials can be chosen to be identical as well. 10To be precise, this solves the barycenter problem on the discrete space {x1, ..., xn}. Universal Neural Optimal Transport Figure 6. UNOT barycenters computed between four shapes (corners) by linearly interpolating λ = (λ1, λ2, λ3, λ4) from eq. (9) between the four unit vectors, and solving via eq. (10) with UNOT. 4.3. Calculating Geodesics Let µ, ν P2(X) be two measures such that ν = T#µ for an optimal transport map T : X X (which exists for the non-entropic optimal transport problem under certain conditions, see Appendix A.1). The Wasserstein geodesic between µ and ν, also called Mc Cann interpolation, is the constant-speed geodesic between µ and ν and given by µt : [0, 1] P2(X), t 7 [(1 t)Id + t T]#µ. It can be interpreted as the shortest path between µ and ν. The Wasserstein barycenter (8) between (µ, (1 t)) and (ν, t) (i.e. where 1 t and t are the weights λi from equation (8)) turns out to be equal to µt (Agueh & Carlier, 2011). This gives us two methods to approximate the Wasserstein geodesic between µ and ν: Either by iteratively computing barycenters as in Section 4.2, or by computing the (entropic) transport plan from equation (4) as an approximation to T (we are leaving out some technicalities for brevity here, which can be found in Appendix A.3). We compare the geodesics computed by UNOT to the ground truth geodesic (obtained from the true OT plan), as well as to Ge ONet (Gracyk & Chen, 2024), a recently proposed framework that also uses Neural Operators to learn Wasserstein geodesics directly by parametrizing a coupled PDE system encoding the optimality conditions of the dynamic OT problem. Akin to (Amos et al., 2023), Ge ONet is inherently dataset dependent. Figure 8 shows the Mc Cann interpolation between two MNIST digits using the ground truth OT plan, the OT plan computed by UNOT, barycenters computed by UNOT, and the Ge ONet geodesic, where we use the UNOT model trained with c(x, y) = x y 2 again. We see that despite Figure 7. Sinkhorn divergence particle flow between distributions of images, from noise to LFW (64x64). Gradients computed via eq. (11) and (10) with UNOT. Bottom row is target images. Ge ONet being trained to predict geodesics on MNIST, while UNOT does not train on geodesics, nor on MNIST, both geodesics computed by UNOT are significantly closer to the ground truth than the Ge ONet geodesic. 4.4. Wasserstein on Wasserstein Gradient Flow Oftentimes in machine learning, the distributions of interest are not images, but distributions over images, such as in generative modeling. In this experiment, we show that UNOT can successfully transport distributions over images as well. Let ˆµ, ˆν P2((P2([0, 1]2), c), W2), i.e. the space of distributions over images equipped with the Wasserstein distance (and P([0, 1]2) being equipped with c(x, y) = x y 2). Denote by ˆ SDϵ(ˆµ, ˆν) the Sinkhorn divergence between ˆµ and ˆν, where we use SDϵ(µ, ν) as the ground cost between µ, ν P2([0, 1]2) as an approximation of W 2 2 (µ, ν). Writing ˆµ = 1 n Pn i δµi, ˆν = 1 n Pn j δνj for µi, νj P2([0, 1]2), we let UNOT approximate the particle flow t ˆµt = ˆ µt[ ˆ SDϵ(ˆµt, ˆν)], for which we can derive the gradient via (see (Li et al., 2024b)): ˆ SDϵ(ˆµ, ˆν) SDϵ(µk, νj) µk Πkj, k = 1, ..., n, (11) where Πkj is an optimal transport plan between µk and νj. These gradients can be approximated by UNOT via equation (10) as before; further details can be found in Appendix A.4. In Figure 7, we plot the particle flow from Gaussian noise ˆµ to a distribution ˆν over 10 images, where we visualize ˆµt after every 10 gradient steps (using Adam W (Loshchilov & Hutter, 2019) with gradients computed via equation (10)). We can see that the UNOT flow converges quickly. Universal Neural Optimal Transport Figure 8. Mc Cann interpolations computed with the true OT plan, UNOT OT plan, UNOT barycenters, and Ge ONet (top to bottom). 5. Related Work Neural OT. Typically, neural OT approaches aim at solving individual instances of (high-dimensional) OT problems. In (Korotin et al., 2023), a maximin formulation for the dual problem is derived and two networks, parametrizing the transport plan and the dual potential resp., are trained adversarially. In (Bunne et al., 2023a), transport maps between continuous input distributions conditioned on a context variable are learned. Another interesting recent paper (Uscidda & Cuturi, 2023) suggests a universal regularizer, called the Monge gap, to learn OT maps and distances. Unlike these works, we focus on generalizing across OT problems. Initializing Sinkhorn. There exists very little literature on initializing the Sinkhorn algorithm. (Thornton & Cuturi, 2022) propose using dual vectors recovered from the unregularized 1D optimal transport problem, or from closed-form transport maps in a Gaussian (mixture) setting, and were able to significantly speed up convergence. (Amos et al., 2023) propose a neural approach, training a single network to predict the optimal dual potential f of the discrete dual problem, and their loss is simply the (negative) dual objective (3). This approach works well when training on low-dimensional datasets such as MNIST, and is elegant as it does not require ground truth potentials, i.e. is fully unsupervised, but it is not able to generalize to out-of-distribution data, and can only be used for input measures of fixed size. OT for Machine Learning. Leveraging OT to formulate new machine learning methods has seen a surge in popularity in recent years, and it has been applied to a wide range of problems. Relevant works include the celebrated Wasserstein GAN (Arjovsky et al., 2017), multi-label learning (Frogner et al., 2015), inverse problems in physics (Engquist & Yang, 2019), point cloud processing (Geuter et al., 2025; Fishman et al., 2025), or few-shot image classifica- tion to compute distances between images (Zhang et al., 2020). In flow matching OT can be used to straighten paths (Lipman et al., 2023; Tong et al., 2024; Pooladian et al., 2023). Approximating Wasserstein gradient flows with the JKO scheme has been explored in numerous works (Alvarez Melis & Fusi, 2021; Alvarez-Melis et al., 2022; Bunne et al., 2022; Choi et al., 2024). The theory of Wasserstein gradient flows has also been used to study learning dynamics in various settings, such as for overparametrized two-layer networks (Chizat & Bach, 2018) or simplified transformers (Geshkovski et al., 2024). Generative Adversarial Networks. GANs (Goodfellow et al., 2014), like other types of generative models, aim at generating samples from a distribution ρdata, given access to a finite number of samples. In contrast, we do not have access to samples from the target distribution. However, our loss function (7) shares similarities with the adversarial GAN loss. Given prior samples z ρz and data samples x ρdata, the GAN objective for a generator G is min G max D Ex ρdata [log D(x)]+Ez ρz [log(1 D(G(z)))] , where D is the discriminator, which predicts the probability that a sample came from the target distribution rather than the generator. Note that while our generator maximizes the objective, the GAN generator minimizes it. 6. Discussion We presented UNOT, a neural OT solver capable of solving entropic OT problems universally across datasets, for a given cost function. Leveraging Neural Operators, UNOT can process distributions of varying resolutions supported on grids. UNOT s training involves a generator network Gθ producing synthetic training samples for the predictive network Sϕ, where both networks are trained jointly via a self-supervised adversarial loss. Sϕ predicts the potential of the dual OT problem, and our training objective provably minimizes the loss w.r.t. the ground truth potentials. We show that UNOT is universal in the sense that the generator can create any discrete distributions during training, and empirically verify this through experiments on Euclidean and non-Euclidean image datasets of varying resolutions. UNOT consistently predicts OT distances up to 1-3% relative error, and approximates barycenters and geodesics in Wasserstein space by solving for the OT plan. Furthermore, we demonstrate that UNOT can be used as a state-of-the-art initialization for the Sinkhorn algorithm, achieving speedups of up to 7.4 . Current limitations are that UNOT does not extrapolate well to measures with significantly higher resolutions than the training samples, nor generalizes to cost functions other than the training cost. Scaling UNOT to higher resolutions, as well as applying it to other data modalities or non-uniform grids, are interesting directions for future research. Universal Neural Optimal Transport Acknowledgements For this work, VL has been funded by Deutsche Forschungsgemeinschaft (DFG) - Project-ID 318763901 - SFB1294. GK acknowledges funding within the BMBF project VI-Screen-PRO (Teilvorhaben: Mathematische Bildverarbeitung und maschinelles Lernen) VC3-23/13N17309. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, as optimal transport has a vast range of applications, but none of these we feel must be specifically highlighted here. Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2): 904 924, 2011. doi: 10.1137/100805741. URL https: //doi.org/10.1137/100805741. Alvarez-Melis, D. and Fusi, N. Dataset Dynamics via Gradient Flows in Probability Space. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 219 230. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr. press/v139/alvarez-melis21a.html. Alvarez-Melis, D., Schiff, Y., and Mroueh, Y. Optimizing Functionals on the Space of Probabilities with Input Convex Neural Networks. Transactions on Machine Learning Research, 2022. URL https://openreview.net/ forum?id=dp OYN7o8Jm. Amos, B., Luise, G., Cohen, S., and Redko, I. Meta Optimal Transport. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 791 813. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/ v202/amos23a.html. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214 223. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr. press/v70/arjovsky17a.html. Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible Residual Networks. Proceedings of the International Conference on Machine Learning, 2019. doi: 10.48550/ARXIV.1811.00995. URL https://arxiv.org/abs/1811.00995. Bonciocat, A.-I. and Sturm, K.-T. Mass transportation and rough curvature bounds for discrete spaces. Journal of Functional Analysis, 256(9):2944 2966, 2009. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2009.01. 029. URL https://www.sciencedirect.com/ science/article/pii/S0022123609000305. Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., and Anandkumar, A. Spherical Fourier Neural Operators: Learning Stable Dynamics on the Sphere, 2023. URL https://arxiv.org/abs/ 2306.03838. Bunne, C., Papaxanthos, L., Krause, A., and Cuturi, M. Proximal Optimal Transport Modeling of Population Dynamics. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 6511 6528. PMLR, 28 30 Mar 2022. URL https://proceedings.mlr.press/ v151/bunne22a.html. Bunne, C., Krause, A., and Cuturi, M. Supervised Training of Conditional Monge Maps, 2023a. URL https:// arxiv.org/abs/2206.14262. Bunne, C., Stark, S. G., Gut, G., et al. Learning singlecell perturbation responses using neural optimal transport. Nature Methods, 20:1759 1768, 2023b. doi: 10.1038/ s41592-023-01969-x. URL https://doi.org/10. 1038/s41592-023-01969-x. Carlier, G., Chizat, L., and Laborde, M. Lipschitz Continuity of the Schr odinger Map in Entropic Optimal Transport, 2022. Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks, 2017. Chewi, S., Niles-Weed, J., and Rigollet, P. Statistical optimal transport, 2024. URL https://arxiv.org/ abs/2407.18163. Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport, 2018. URL https://arxiv.org/abs/ 1805.09545. Choi, J., Choi, J., and Kang, M. Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport, 2024. URL https:// arxiv.org/abs/2402.05443. Universal Neural Optimal Transport Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings. neurips.cc/paper/2017/file/ 0070d23b06b1486a538c0eaa45dd167a-Paper. pdf. Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings. neurips.cc/paper/2013/file/ af21d0c97db2e27e13572cbf59eb343d-Paper. pdf. Dadashi, R., Hussenot, L., Geist, M., and Pietquin, O. Primal Wasserstein Imitation Learning, 2020. URL https://arxiv.org/abs/2006.04678. Engquist, B. and Froese, B. D. Application of the Wasserstein metric to seismic signals, 2013. URL https: //arxiv.org/abs/1311.4581. Engquist, B. and Yang, Y. Seismis imaging and optimal transport. Communications in Information and Systems, 19(2):95 145, 2019. URL https://www.intlpress.com/site/pub/ pages/journals/items/cis/content/ vols/0019/0002/a001/index.php. Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016. URL https://www.ams.org/journals/jams/ 2016-29-04/S0894-0347-2016-00852-4/. Feydy, J., S ejourn e, T., Vialard, F.-X., ichi Amari, S., Trouv e, A., and Peyr e, G. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences, 2018. URL https://arxiv.org/abs/1810.08278. Fishman, N., Gowri, G., Yin, P., Gootenberg, J., and Abudayyeh, O. Generative Distribution Embeddings, 2025. URL https://arxiv.org/abs/2505.18150. Franklin, J. and Lorenz, J. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114-115:717 735, mar-apr 1989. URL https://doi. org/10.1016/0024-3795(89)90490-4. Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein Loss. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings. neurips.cc/paper/2015/file/ a9eb812238f753132652ae09963a05e9-Paper. pdf. Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. A mathematical perspective on transformers, 2024. URL https://arxiv.org/abs/2312.10794. Geuter, J., Bonet, C., Korba, A., and Alvarez-Melis, D. DDEQs: Distributional Deep Equilibrium Models through Wasserstein Gradient Flows. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS 2025), 2025. URL https://arxiv.org/abs/2503.01140. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings. neurips.cc/paper/2014/file/ 5ca3e9b122f61f8f06494c97b1afccf3-Paper. pdf. Goswami, S., Bora, A., Yu, Y., and Karniadakis, G. E. Physics-informed deep neural operator networks, 2022. URL https://arxiv.org/abs/2207.05748. Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of Neural Networks by Enforcing Lipschitz Continuity, 2020. Gracyk, A. and Chen, X. Ge ONet: a neural operator for learning the Wasserstein geodesic, 2024. URL https: //arxiv.org/abs/2209.14440. Hashan, A. M. Facial expression images, 2022. URL https://www.kaggle.com/ds/2366449. He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. doi: 10.1109/CVPR.2016.90. Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i-Rev Net: Deep Invertible Networks, 2018. Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization, 2017. URL https://arxiv.org/ abs/1412.6980. Kolouri, S., Zou, Y., and Rohde, G. K. Sliced Wasserstein Kernels for Probability Distributions, 2015. URL https://arxiv.org/abs/1511.03198. Universal Neural Optimal Transport Kolouri, S., Park, S. R., Thorpe, M., Slepcev, D., and Rohde, G. K. Optimal Mass Transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43 59, 2017. doi: 10.1109/MSP.2017. 2695801. Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. Generalized Sliced Wasserstein Distances. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ f0935e4cd5920aa6c7c996a5ee53a70f-Paper. pdf. Korotin, A., Selikhanovych, D., and Burnaev, E. Neural optimal transport, 2023. URL https://arxiv.org/ abs/2201.12220. Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., and Anandkumar, A. Neural Operator: Learning Maps Between Function Spaces, 2024. URL https://arxiv.org/abs/2108.08481. Li, S., Yu, X., Xing, W., Kirby, M., Narayan, A., and Zhe, S. Multi-Resolution Active Learning of Fourier Neural Operators, 2024a. URL https://arxiv.org/abs/ 2309.16971. Li, X., Lu, F., Tao, M., and Ye, F. X. F. Robust first and second-order differentiation for regularized optimal transport, 2024b. URL https://arxiv.org/abs/ 2407.02015. Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Neural operator: Graph kernel network for partial differential equations, 2020. URL https://arxiv.org/abs/ 2003.03485. Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier neural operator for parametric partial differential equations, 2021. URL https://arxiv.org/abs/ 2010.08895. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747. Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization, 2019. URL https://arxiv.org/ abs/1711.05101. Nutz, M. Introduction to Entropic Optimal Transport, 2022. URL https://www.math.columbia.edu/ mnutz/docs/EOT_lecture_notes.pdf. Peyr e, G. and Cuturi, M. Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. ISSN 1935-8237. doi: 10.1561/2200000073. URL http://dx.doi.org/10.1561/2200000073. Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. T. Q. Multisample flow matching: Straightening flows with minibatch couplings, 2023. URL https://arxiv.org/abs/ 2304.14772. Santambrogio, F. Optimal transport for applied mathematicians. Birk auser, NY, 55(58-63):94, 2015. Santambrogio, F. Euclidean, Metric, and Wasserstein gradient flows: an overview, 2016. URL https://arxiv. org/abs/1609.03890. Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., Lee, L., Chen, J., Brumbaugh, J., Rigollet, P., Hochedlinger, K., Jaenisch, R., Regev, A., , and Lander, E. S. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4):928 943, 2019. Schmitz, M. A., Heitz, M., Bonneel, N., Ngol e, F., Coeurjolly, D., Cuturi, M., Peyr e, G., and Starck, J.-L. Wasserstein Dictionary Learning: Optimal Transport-Based Unsupervised Nonlinear Dictionary Learning. SIAM Journal on Imaging Sciences, 11(1):643 678, jan 2018. doi: 10.1137/17m1140431. URL https://doi.org/10. 1137%2F17m1140431. Serrurier, M., Mamalet, F., Fel, T., B ethune, L., and Boissin, T. On the explainable properties of 1-Lipschitz Neural Networks: An Optimal Transport Perspective, 2023. Sinkhorn, R. and Knopp, P. Concerning nonnegative Matrices and doubly stochastic Matrices. Pacific Journal of Mathematics, 21(2), 1967. Thornton, J. and Cuturi, M. Rethinking Initialization of the Sinkhorn Algorithm, 2022. URL https://arxiv. org/abs/2206.07630. Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport, 2024. URL https: //arxiv.org/abs/2302.00482. Uscidda, T. and Cuturi, M. The Monge Gap: A Regularizer to Learn All Transport Maps, 2023. Villani, C. Optimal Transport Old and New. Springer, 2009. Universal Neural Optimal Transport Xu, H., Wang, W., Liu, W., and Carin, L. Distilled Wasserstein Learning for Word Embedding and Topic Modeling. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 22fb0cee7e1f3bde58293de743871417-Paper. pdf. Zhang, C., Cai, Y., Lin, G., and Shen, C. Deep EMD: Few-Shot Image Classification With Differentiable Earth Mover s Distance and Structured Classifiers. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12200 12210, 2020. doi: 10.1109/CVPR42600.2020.01222. Universal Neural Optimal Transport The Appendix is structured as follows: In Appendix A we give a more thorough background on OT, as well as additional technical details for our experiments on computing barycenters and geodesics. Appendix B contains all proofs omitted in the paper. In Appendix C, we provide additional training details, and Appendix D contains additional experiments and materials. A. Background A.1. Optimal Transport In this section, we recall some properties of optimal transport. First, we define the unregularized continuous problems for completeness. Problem A.1 (Kantorovich Optimal Transport Problem). For µ P(X), ν P(Y), and a cost c : X Y R { } the Kantorovich problem takes the form inf π Π(µ,ν) Z c(x, y) dπ(x, y) (12) The infimum in (12) is called the transport cost, and the minimizer π, if it exists, the optimal transport plan. The continuous dual problem is similar to the regularized dual (3). For a more thorough overview of OT, we refer the reader to (Villani, 2009; Peyr e & Cuturi, 2019; Chewi et al., 2024). Problem A.2 (Dual Optimal Transport Problem). For µ, ν and c as before, the dual problem reads sup f L1(µ),g L1(ν) f+g c X f(x)dµ(x) + Z Y g(y)dν(y), where f + g c is to be understood as f(x) + g(y) c(x, y) for all x, y. An important concept in optimal transport are transport maps. Definition A.3 (Transport Maps). A map T : X Y is called a transport map between µ and ν if ν = T#µ. If there exists an optimal transport plan π such that π = (Id, T)#µ, T is called an optimal transport map. Of course, not every transport plan admits a transport map; however, every transport map yields an optimal transport plan via π = (Id, T)#µ. For sufficient conditions for the existence of both transport plans and maps, we refer the reader to (Villani, 2009). In the paper we mentioned that if µ and ν are supported on finitely many points, one can rewrite the problems A.1 and A.2 with vectors. We now define the discrete problems carefully. Problem A.4 (Discrete Optimal Transport Problem). For two discrete measures µ m 1 and ν n 1, and a cost matrix C Rm n, the discrete OT problem is defined as L(µ, ν) := min Π Π(µ,ν) C, Π Here, Π(µ, ν) denotes the set of all transport plans between µ and ν, i.e. matrices Π Rm n 0 s.t. Π1n = µ and Π 1m = ν. The problem has a dual formulation: Problem A.5 (Discrete Dual Optimal Transport Problem). For two discrete measures µ m 1 and ν n 1, and a cost matrix C Rm n, the discrete dual OT problem is defined as D(µ, ν) := max f Rm, g Rn f+g C f, µ + g, ν Here, f + g C is to be understood as fi + gj Cij for all i Jm K, j Jn K. In the special case where X = Y and C corresponds to a metric, i.e. Cij = d(xi, yj), the Wasserstein distance of order p between µ and ν for p [1, ) is defined as: min π Π(µ,ν) i,j Cp ijΠij Universal Neural Optimal Transport This definition coincides with the definition from the paper for continuous measures, if they are supported on finitely many points. For completeness, we also state the entropically regularized primal and dual problem in the discrete case. The discrete problem is typically formulated with an entropy term instead of the KL divergence as in equation (1), but the two can be shown to be equivalent (Chewi et al., 2024). Definition A.6 (Entropy). For a matrix P = [pij]ij Rm n, we define its entropy H(P) as j=1 pij log pij if all entries are positive, and H(P) := if at least one entry is negative. For entries pij = 0, we use the convention 0 log 0 = 0, as x log x x 0 0. The entropic optimal transport problem is defined as follows. Problem A.7 (Entropic Discrete Optimal Transport Problem). For ϵ > 0, the entropic optimal transport problem is defined as: OTϵ(µ, ν) := min Π Π(µ,ν) C, Π ϵ H(Π). Note that this is identical to the unregularized optimal transport problem, except that the unregularized one does not contain the regularization term ϵH(Π). As the objective in Problem A.7 is ϵ-strongly convex, the problem admits a unique solution (Peyr e & Cuturi, 2019). The Gibbs kernel is defined as K = exp( C/ϵ). Then the entropic dual problem reads: Problem A.8 (Entropic Discrete Dual Problem). max fϵ Rm, gϵ Rn fϵ, µ + gϵ, ν ϵ D efϵ/ϵ, Kegϵ/ϵE . Again, without the regularization term ϵ efϵ/ϵ, Kegϵ/ϵ , this equals the regular optimal transport dual; note, however, that the unregularized dual is subject to the constraint f + g C. In both the continuous, as well as the discrete setting, there is duality, i.e. the optima of the primal and dual problems coincide. In addition, the optimizers are intrinsically linked, akin to Proposition 1 for the discrete entropic problem. We refer the reader to (Villani, 2009; Peyr e & Cuturi, 2019) for more details. A.2. Calculating the Barycenter Recall the Sinkhorn divergence barycenter for a set of discrete measures {ν1, ..., νN} P2(X), µ = arg min µ P2(X) i αi SDϵ(µ , νi). For a solution (f, g) to the dual problem A.8 between two measures µ = P i aiδxi and ν = P j bjδyj, it holds SDϵ(µ, ν) = µ, f p + ν, g q , see (Feydy et al., 2018). Here, p and q are the optimal potentials for (µ, µ) and (ν, ν) resp. (if both measures in the OT problem are the same, the dual potentials can be chosen to be equal). From this identity, we immediately get a SDϵ(µ, ν) = f p. Note that this is not a gradient with respect to the measure µ; instead, we view µ as a vector, and compute the gradient w.r.t. the entries in that vector. This means we essentially compute the barycenter on the discrete space {x1, ..., xn}. Universal Neural Optimal Transport A pseudocode for how to approximate the Sinkhorn divergence barycenter with UNOT is given in Algorithm 3. Note that instead of using softmax to project back to probability measures, one could also just rescale; however, softmax proved better in practice. We also run a single Sinkhorn iteration on the output of Sϕ in practice, as it improved visual quality of the barycenters; however, this is not strictly needed. Algorithm 3 Barycenter Computation 1: in set of measures {νi}i n 1, initial µ0 n 1, weights λ n 1 2: µ µ0 3: for i = 1, 2, ..., T do 4: p Sϕ(µ, µ) 5: for νi in {νi}i do 6: fi Sϕ(νi, µ) //switch the order of the arguments to get fi instead of gi 7: end for 8: µ softmax (µ P i λi(fi pk)) 9: end for Barycenters can be far when Transport Distances are close. We now give a simple example that illustrates that merely predicting transport distances accurately does not necessarily imply predicting barycenters accurately, at least in the nonregularized setting. Let µ be the measure with mass 1/2 at the two points (0, 1) and (0, 1) in R2. Let ν1 be the measure with mass 1/2 at each of the points ( 1, 0) and (1, ϵ) for a small ϵ > 0, and ν2 a measure with mass 1/2 at each of the two points ( 1, 0) and (1, ϵ). Then as ϵ goes to 0, the transport distances between between µ and ν1 resp. µ and ν2 become arbitrarily close. However, the unique Wasserstein-p barycenter (for p > 1) between µ and ν1 has mass 1/2 at each of the points ( 1/2, 1/2) and (1/2, (1 + ϵ)/2), whereas the barycenter between µ and ν2 has mass 1/2 at each of the points ( 1/2, 1/2) and (1/2, (1 + ϵ)/2), so no matter how small ϵ gets, the barycenters will always be far apart. A.3. Geodesics In Section 4.3, we saw that the Mc Cann interpolation between two measures µ, ν P2(X) is a constant-speed geodesic. In this section, we provide additional background on constant-speed geodesics, and establish a connection between constantspeed geodesics in P2([0, 1]2) and the notion of strong-ϵ quasi-geodesics in the discretized space P( [[n]]2 n ). This makes our approximation of geodesics as in Section 4.3 more rigorous. First, we recall the definition of constant-speed geodesics. Definition A.9. A curve ω : [0, 1] (P2(X), W2) is called constant-speed geodesic between ω(0) and ω(1) if it satisfies W2(ω(t), ω(s)) = |t s|W2(ω(0), ω(1)), t, s [0, 1]. It turns out that for convex X Rd, constant-speed geodesics are equivalent to push-forwards under transport plans, and if the starting point ω(0) is absolutely continuous, this is equal to the Mc Cann interpolation. Theorem A.10. Let X Rd be convex. Then a curve ω : [0, 1] (P2(X), W2) is a constant-speed geodesic between ω(0) and ω(1) if and only if it is of the form ω(t) = (pt)#π for an optimal transport plan π between ω(0) and ω(1), where the interpolation pt is given by pt(x, y) = (1 t)x + ty. If, in addition, ω(0) is absolutely continuous, then we can write ω(t) = [(1 t) Id +t T]#ω(0) for an optimal transport map T from ω(0) to ω(1). This theorem holds, in fact, for any Wasserstein-p space for p > 1, see (Santambrogio, 2016). Now, denote by µt the Mc Cann interpolation between µ and ν. As mentioned in Section 4.2, we can express µt as the following barycenter: µt = arg min µ P2(X) (1 t)W 2 2 (µ , µ) + t W 2 2 (µ , ν) , Universal Neural Optimal Transport which we approximate by the Sinkhorn Divergence barycenter µt = arg min µ P2(X) ((1 t) SDϵ(µ , µ) + t SDϵ(µ , ν)) , (13) which is justified by the fact that Sinkhorn Divergences converge to the OT cost as ϵ 0, and that they are reliable loss functions, in the sense that weak convergence of a sequence of measures is equivalent to convergence of the Sinkhorn divergence, see (Feydy et al., 2018). As also shown in (Feydy et al., 2018), the gradient of the Sinkhorn Divergence w.r.t. the vector a, when writing a discrete measure µ as µ = P i aiδxi, is given by a SDϵ(µ, ν) = f p. (14) However, if we now do a simple gradient descent on (13) using (14), we are not actually computing the barycenter on the space P2(X) anymore, as we only consider gradients w.r.t. a, which does not allow particles to move, but merely to teleport mass to other particles. In particular, if X is a discrete space, there exist no constant-speed geodesics between different points anymore, as can easily be seen from the following example. Let µ0 = δx0 and µ1 = δx1 be two Dirac measures for some x0, x1 X. Assume there would exist a constant speed geodesic ω joining µ0 and µ1. Then for t > 0, W2(ω(t), ω(0)) = t W2(ω(1), ω(0)). However, since the space is discrete, this implies that x0 = x1, i.e. the only constant-speed geodesics are constant. We therefore work with the following approximation of geodesics. Definition A.11 (Quasi-Isometry). Let (X1, d1) and (X2, d2) be metric spaces. f : X1 X2 is called a (λ, ϵ)-quasiisometry if there exist λ 0 and ϵ > 0 such that for all x, y X1 1 λd1(x, y) ϵ d2(f(x), f(y)) λd1(x, y) + ϵ If in addition there exists a C > 0 such that for all z X2 there exist an x X1 such that d2(f(x), z) C, f is called quasi-isometry. We can then use this to define quasi-geodesics. (Bonciocat & Sturm, 2009) introduced a similar concept called h-rough geodesics, for which they just used the upper bound. Definition A.12 (Strong-ϵ Quasi-Geodesics). A strong-ϵ quasi-geodesic in a metric space (X, d) is a map γ : [0, 1] X such that for all s, t [0, 1], d(γ0, γ1)|t s| ϵ d(γt, γs) d(γ0, γ1)|t s| + ϵ. Now let X = [0, 1]2, and denote by [[n]]2 n [0, 1]2 the discrete space consisting of all xi of the form xi = 1 n, 0 + j 0, 1 n , for k, j = 0, ..., n 1. We can then show that (P2([0, 1]2), W2) is quasi-isometric to (P( [[n]]2 Proposition A.13. The metric space (P2([0, 1]2), W2) is (1, 1 2n)-quasi-isometric to (P( [[n]]2 n ), W2), i.e. there exist an f : (P2([0, 1]2), W2) (P( [[n]]2 n ), W2) such that for all µ, ν P2([0, 1]) it holds that 2n W2(f(µ), f(ν)) W2(µ, ν) + 1 Proof. We split the space [0, 1]2 into squares via N(xi) := (xi + [ 1 2n]2). We define f : P([0, 1]2) P( [[n]]2 Universal Neural Optimal Transport By triangle inequality, we have W2(f(µ), f(ν)) W2(µ, ν) + W2(µ, f(µ)) + W2(ν, f(ν)) W2(µ, ν) W2(f(µ), f(ν)) + W2(µ, f(µ)) + W2(ν, f(ν)) For any measure µ P([0, 1]), denoting by T : [0, 1]2 [[n]]2 n the map that sends each point to the corresponding midpoint xi, we get W 2 2 (µ, f(µ)) Z [0,1]2 |T(x) x|2dµ Z [0,1]2 2 4n2 dµ = 2 4n2 . Therefore we have a (1, 1 2n)-quasi-isometry between both spaces. We also need to show that there exist a C > 0 such that for all µ P( [[n]]2 n ) there exist a ν P([0, 1]2) with W2(f(ν), µ) < C. Choosing C = 1 2n and ν = µ concludes the proof. We immediately get the following corollary. Corollary A.14. Constant-speed geodesics P2([0, 1]2) are strong-ϵ quasi-geodesics in P([[n]]2/n). This justifies doing gradient descent on (13) using the discrete space gradient (14) to approximate the geodesic, as we can approximate the constant-speed geodesic with a strong-ϵ-quasi-geodesic in the discrete space. A.4. Wasserstein on Wasserstein Distance In this section, we provide additional details on how to solve the particle flow t ˆµt = ˆ µt[ ˆ SDϵ(ˆµt, ˆν)] (15) from Section 4.4. Recall that ˆµ, ˆν P2(P2([0, 1]2, c), W2), ˆµ = 1 i δµi, ˆν = 1 j δνj for µi, νj P2([0, 1]2). From (Li et al., 2024b), we get that ˆ SDϵ(ˆµ, ˆν) SDϵ(µk, νj) where Πkj is an optimal transport plan between µk and νj. Now as in the previous section, we can approximate SDϵ(µk,νj) µk = fkj pk, where fkj is the dual potential from OTϵ(µk, νj), and pk that of OTϵ(µk, µk). As before, we can approximate these gradients with UNOT, which lets us solve (15) with a simple gradient descent scheme, as shown in Section A.2. As in Section A.2, we add a single Sinkhorn iteration on the predictions made by Sϕ as it improves visual quality, but this is not strictly necessary. A.5. Fourier Neural Operators In this section, we describe FNOs in more detail. The main breakthrough for Neural Operators came in the combination with approximating solutions to partial differential equations (PDEs) (Li et al., 2020; 2021; Goswami et al., 2022). Many problems, including PDEs, can be numerically solved by discretizing infinite-dimensional input and output functions. Neural Operators are a class of neural networks that parametrize functions F : A U, where A and U are Banach spaces whose elements are functions a : Da Rd a and u : Du Rd u respectively, for bounded domains Da Rda and Du Rdu. One of the main advantages of Neural Operators is that they can generalize over different grid discretizations, unlike traditional neural networks, which makes them particularly well-suited for solving PDEs,11 and they are universal approximators for continuous operators acting on Banach spaces (Kovachki et al., 2024). While our space P([0, 1]2) is not 11For example, an FNO could be used to solve PDEs of the form u = a with Dirichlet boundary conditions, for which we get a unique solution u for every a. The FNO then maps each a A to the corresponding solution u U. Universal Neural Optimal Transport Layer 1 Layer L Layer 2 Figure 9. Fourier Neural Operator architecture, adapted from (Kovachki et al., 2024). The input measures (µ, ν) are passed through a point-wise lifting operator P which is then followed by L Fourier operators and point-wise non-linearity operators. After the last Fourier layer, we project back to the output potential g with a point-wise operator Q. technically a Banach space, the space of finite signed measures with the total variation norm is, and P([0, 1]2) is a subset. We note that approximation theory for Neural Operators usually A neural operator usually has the following form: a 7 Q BL .... B1 P(a), which in our setting becomes Sϕ : P([0, 1]2) P([0, 1]2) L1([0, 1]2) (µ, ν) 7 Q BL .... B1 P(µ, ν). Here, P is a lifting map, Bi are the kernel layers, and Q is a projection back to the target space. Different versions of neural operators have been proposed, which mostly differ in how the kernel layers Bi are defined. Our network Sϕ is parametrized as a Fourier Neural Operator (FNO) (Kovachki et al., 2024), where the kernel layers act on Fourier features of the inputs. We outline details for all the layers in the following. Lifting (P). The lifting map is a pointwise map {a : Da Rd a} 7 {v0 : D0 Rdv0}, which maps the input a to a function v0 by mapping points in Rd a to points in Rdv0. We use a 2D convolutional layer for P, and in our setting, Da = [0, 1]2 [0, 1]2, as we can view elements in P([0, 1]2) as maps [0, 1]2 R when dealing with discretizations of measures. Iterative Fourier Layer (Bi). The network has L Fourier layers Bi. In each of them, we map {vi : Di Rdvi} 7 {vi+1 : Di+1 Rdvi+1} by first applying the (discrete) Fourier transform F from which we select a fixed number of Fourier features, then a neural network NN on these features, and then the inverse Fourier transform F 1. Note that the Fourier features are complex, hence the network NN is also complex (with multiplications in C). Each Fourier layer also contains a bypass layer, which is similar to a skip connection, but contains a layer W which is typically a 2D convolution; cmp Figure 9. Hence, the output of the Fourier layer is given by σ(F 1(NN(F(v)) + b + Wv), where σ is an activation. Projection (Q). The projection Q is the analogue to the lifting layer, mapping the hidden representation to the output function {v L : DL Rdv L} 7 {u : Du Rd u}. In our setting, Du = [0, 1]2. In contrast to (Kovachki et al., 2024), we found that a Fourier layer containing a two-layer neural network NN instead of just a linear layer worked better in practice. Our bypass layer is still a linear layer W. On the unit sphere S2, we use Spherical FNOs (SFNOs) (Bonev et al., 2023) instead of regular FNOs, which respect the geometry of S2. SFNOs leverage the Fourier transform on the sphere FS2, which can be viewed as a change of basis into Universal Neural Optimal Transport an orthogonal basis of L2(S2), instead of the regular Fourier transform F for flat geometries. Everything else about our architecture remains the same. Details on hyperparameter choices can be found in Appendix C. Universal Neural Optimal Transport This section contains all proofs, as well as further technical details omitted in the paper. For convenience, we restate the statements from the paper. We start off by rigorously restating Proposition 2. Let X RN be a compact set. We start off with a natural definition of discretization of a continuous measure, which applies, for example, to discrete images as discretizations of an underlying ground truth continuous image. Definition B.1 (Discretization of Measures). Let µ P(X) be an absolutely continuous measure, and let Xn = {xn 1, ..., xn n} X. The discretization of µ on Xn is defined as the measure µn P(X) supported on Xn, where µn(xn i ) = Z with Ωn i = {x X : x xn i x y y X}. Note that the intersections Ωn i Ωn j have Lebesgue measure zero, so this is well-defined. We cannot guarantee that an arbitrary sequence of discretizations µn converges weakly to µ as n ; simply consider the case where all the xn i are identical for all n and i. Hence, we need to ensure that the discretization is uniform over all of X in some way. Definition B.2 (Uniform Discretization). Let Xn = {xn i , ..., xn n} be subsets of X for all n N. Then we call the sequence (Xn)n N a uniform discretization of X if for all x X, lim n min i=1,...,n x xn i = 0. While this may seem like a pointwise discretization at first, it turns out to be uniform, as an Arzel a-Ascoli type argument shows. Theorem B.3. Let X Rd be compact, and let {Xn}n 1 be a sequence of finite subsets of X with |Xn| = n for each n. The following are equivalent: 1. lim n sup x X min y Xn x y = 0. 2. x X : lim n min y Xn x y = 0. Proof. (1) = (2). If supx X miny Xn x y 0, then in particular for each fixed x X we have min y Xn x y sup z X min y Xn z y 0. (2) = (1). Define fn(x) = min y Xn x y , x X. By hypothesis (2), fn(x) 0 for every x X. Moreover for any x, z X, fn(x) fn(z) = min y Xn x y min y Xn z y x z , so {fn} is equicontinuous on the compact set X. Since fn 0 pointwise, the Arzel a Ascoli theorem upgrades to uniform convergence of the entire sequence (instead of just a subsequence): lim n sup x X fn(x) = lim n sup x X min y Xn x y = 0. Hence (1) holds, completing the proof. Universal Neural Optimal Transport Note that condition (1) in Theorem B.3 is equivalent to Definition 1 of a discrete refinement in (Kovachki et al., 2024). The following lemma holds. Lemma B.4 (Weak Convergence of Discretizations of Measures). Let µ X be absolutely continuous, and (µn)n N be a sequence of discretizations of µ supported on a uniform discretization (Xn)n N of X. Then µn converges weakly to µ. Proof. Let f Cb(X) be a test function. We have to show that Z Since X is compact and f : X R is continuous, by the Heine Cantor theorem f is uniformly continuous. Hence, for every ε > 0 there exists δ > 0 such that x y < δ = |f(x) f(y)| < ε for all x, y X. Since sup x X min 1 i n x xn i n 0, we can choose n such that for all n n , sup x X min 1 i n x xn i < δ. In particular, for each x X, there is some xn i Xn with x xn i < δ, giving |f(x) f(xn i )| < ε whenever x xn i < δ. (16) Let Ωn i = n x X : x xn i x xn j for all j = 1, . . . , n o as above. These sets form a partition of X (up to measure-zero boundaries). Then for all n n , we have (using equation (16)): f(xn i ) f(x) dµ(x) f(xn i ) f(x) dµ(x) i=1 ϵµ(Ωn i ) and letting ϵ 0 finishes the proof. In Proposition 2, we used the canonical extension for dual potentials. For a pair of dual variables (f, g) solving the dual problem (3) between µ and ν, their canonical extensions are defined by f and g satisfying the following conditions: f(x) = ϵ log Z ϵ (g(y) c(x, y)) dν(y), g(x) = ϵ log Z ϵ (f(y) c(x, y)) dµ(y). We refer to (Santambrogio, 2015; Feydy et al., 2018) for more details. We can now state and prove a formal version of Proposition 2. Universal Neural Optimal Transport Proposition 2. (Formal) Let c(x, y) : X X R be Lipschitz continuous in both its arguments, and X RN compact. Let (µn)n N, (νn)n N be discretization sequences for absolutely continuous µ, ν P(X), supported on a uniform discretization (Xn)n N of X. Let (fn, gn) be the (unique) extended dual potentials of (µn, νn) such that fn(x0) = 0 for some x0 X and all n. Let (f, g) be the (unique) dual potentials of (µ, ν) such that f(x0) = 0. Then fn and gn converge uniformly to f and g on all of X. Proof. By Lemma B.4, we know that µn µ and νn ν. The statement now follows immediately from Proposition 13 in (Feydy et al., 2018). Theorem 3. Let 0 < λ 1 and Gθ : Rd Rd be defined via Gθ(z) = Re LU (NNθ(z) + λz) , where z ρz = N(0, I), and where NNθ : Rd Rd is Lipschitz continuous with Lip(NNθ) = L < λ. Then Gθ is Lipschitz continuous with Lip(q) < L + λ, and G(z) := NNθ(z) + λz is invertible on Rd. Furthermore, for any x Rd 0 it holds ρGθ#ρz(x) 1 (L + λ)d N Gθ 1(x)|0, I . In other words, Gθ#ρz has positive density at any non-negative x Rd 0. Proof. Since the Lipschitz constant of the sum of two functions is bounded by the sum of the Lipschitz constants of the two functions, we have Lip( Gθ) L + λ. From Theorem 1 in (Behrmann et al., 2019), it follows that Gθ is invertible, and Lemma 2 therein implies Lip( Gθ 1) 1 λ L. The Lipschitz continuity of Gθ 1 implies that for any h, z Rd with h = 0, we have Gθ(z)h = lim t 0 Gθ(z + th) Gθ(z) Lip( Gθ 1) lim t 0 Gθ 1( Gθ(z + th)) Gθ 1( Gθ(z)) t Lip( Gθ 1) h which shows that Gθ is invertible everywhere. Hence, by the inverse function theorem, we get Gθ 1(x) = Gθ 1( Gθ( Gθ 1(x))) = ( Gθ( Gθ 1(x))) 1 for any x Rd. Furthermore, similar to above, we have Gθ(z)ei = lim t 0 Gθ(z + tei) Gθ(z) Lip( Gθ) lim t 0 where ei is the ith unit vector. Hence, we get from Hadamard s inequality that |det Gθ(z)| Πi Gθ(z)ei Πi(L + λ) = (L + λ)d. Universal Neural Optimal Transport Putting everything together, by change of variables, we get for any x Rd: ρ Gθ#ρz(x) = ρz( Gθ 1(x)) det Gθ 1(x) = ρz( Gθ 1(x)) det Gθ( Gθ 1(x)) 1 1 (L + λ)d ρz( Gθ 1(x)) = 1 (L + λ)d N( Gθ 1(x)|0, I). Now clearly, if x Rd 0, then ρGθ#ρz(x) ρ Gθ#ρz(x), as for any z with G(z) = x, we also have G(z) = x. Thus, we also have ρGθ#ρz(x) 1 (L + λ)d N( Gθ 1(x)|0, I), which finishes the proof. Corollary 4. Let Gθ = Gθ1 Gθ1 ... GθR be a composition of functions Gθi, each of which is of the form as in Theorem 3. Let z ρz = N(0, I). Then ρ Gθ#ρz(x) 1 (L + λ)Rd N Gθ 1(x)|0, I for any x Rd. As in Theorem 3, this also holds for any x Rd 0 if Gθ is followed by a Re LU activation. Proof. Consider the case where Gθ = G1 θ1 G2 θ2. Then for any x Rd, we get from the proof of Theorem 3 above: ρ Gθ#ρz(x) 1 (L + λ)d ρ G2 θ2#ρz(( G1 θ1) 1(x)) 1 (L + λ)2d N G2 θ2 1 G1 θ1 1 (x) |0, I = 1 (L + λ)2d N( Gθ 1(x)|0, I). The claim now follows by induction over the layers of Gθ. Note that if Gθ is followed by a Re LU activation, this inequality also holds for any x Rd 0, similar to Theorem 3. Next, we prove Proposition 5. The proof is based on the Hilbert projective metric. For two vectors u, v Rn +, it is defined as d H(u, v) := max i [log(ui) log(vi)] min i [log(ui) log(vi)], and can be shown to be a distance on the projective cone Rn +/ , where u u if u = ru for some r > 0 (Peyr e & Cuturi, 2019; Franklin & Lorenz, 1989). For f = log(u) and g = log(v), we thus define the following loss: LH(f, g) := max i [fi gi] min i [fi gi]. Lemma B.5. Let f, g Rn. Then LH(f, g) If, in addition, P i gi = 0, then f g 2 n LH(f, g). Universal Neural Optimal Transport Proof. Let h = f g. For the first inequality, observe that LH(f, g) = maxi hi mini hi. Let j and k be the indices achieving maxi hi and mini hi, respectively. Define the vector e such that ej = 1, ek = 1, and ei = 0 for all other i. Then: LH(f, g) = e h e 2 h 2 = Now assume that P i gi = 0. Set M = maxi hi and m = mini hi. If all hi = 0, both statements are trivial. Hence, assume at least one of the hi is not zero. Since P i fi gi = 0, this implies M > 0 and m < 0. For any index i, hi M, and thus (hi)2 M 2 (M m)2 = LH(f, g)2. Summing over all indices, we have: f g 2 2 = h 2 2 = i=1 (hi)2 n LH(f, g)2. Taking the square root yields: f g 2 n LH(f, g). This finishes the proof. Proposition 5. For two discrete measures (µ, ν) with n particles, let g be a potential solving the dual problem, gϕ = Sϕ(µ, ν), and gτk = τk(µ, ν, gϕ) the target. Without loss of generality, assume that P i gτk i = 0. Then L2(gϕ, g) c(K, k, n) L2(gϕ, gτk) for some constant c(K, k, n) > 1 depending only on the Gibbs kernel K, k and n. Proof. We first show a similar inequality as in Proposition 5 for the Hilbert loss. A well-known fact about the Hilbert metric is that positive matrices (in our case, the Gibb s kernel K) act as strict contractions on positive vectors with respect to the Hilbert metric (cf. Theorem 4.1 in (Peyr e & Cuturi, 2019)). More precisely, we have d H(Kv, Kv ) λ(K)d H(v, v ) for any positive vectors v, v Rn, where η(K) + 1 , η(K) := max i,j,k,l Kik Kjl Kjk Kil . The same inequality also holds for K in place of K. Note that by definition, η(K) 1, hence 0 < λ(K) < 1. Now consider a starting vector v0 to the Sinkhorn algorithm, and let vl denote the lth iterate of the vector. Denote by v the limit liml vl of the algorithm. Then (letting / denote element-wise division): d H(vl+1, v ) = d H ν/K ul+1, ν/K u = d H K ul+1, K u λ(K)d H(ul+1, u ) = λ(K)d H µ/Kvl, µ/Kv = λ(K)d H Kvl, Kv λ(K)2d H vl, v , where we used the Hilbert metric inequality twice, once on K and once on K . Iteratively applying this inequality and translating into log-space notation, this gives us LH(gτk, g) λ(K)2k LH(gϕ, g). For now, assume that P i gϕi = 0. By triangle inequality, LH(gϕ, g) LH(gϕ, gτk) + LH(gτk, g) LH(gϕ, gτk) + λ(K)2k LH(gϕ, g), Universal Neural Optimal Transport which gives us LH(gϕ, g) 1 1 λ(K)2k LH(gϕ, gτk) =: c(K, k) LH(gϕ, gτk). Combining this with Lemma B.5 yields gϕ g 2 n LH(gϕ, g) nc(K, k)LH(gϕ, gτk) 2 nc(K, k) gϕ gτk 2 = c(K, k, n) gϕ gτk 2 , (17) from which the claim follows by squaring both sides. We are left with proving the general case when P i gϕi = 0. Write gϕ = ˆgϕ + gϕ, where gϕ is equal to 1 i gϕi in each entry, s.t. ˆgϕ sums to zero. We then get L2(gϕ, g) = ˆgϕ g 2 + gϕ 2 , (18) as ˆgϕ g, gϕ = 0. Similarly, we get L2(gϕ, gτk) = ˆgϕ gτk 2 + gϕ 2 . (19) Combining equations (17), (18) and (19), we get L2(gϕ, g) = ˆgϕ g 2 + gϕ 2 c L2(ˆgϕ, gτk) + gϕ 2 = c L2(gϕ, gτk) gϕ 2 + gϕ 2 = c L2(gϕ, gτk) + (1 c) gϕ 2 c L2(gϕ, gτk), where the last inequality follows from the fact that 1 c < 0. This finishes the proof. Remark B.6. Looking at the proof of Proposition 5, one might wonder why we didn t opt for the Hilbert projective metric as the loss directly. We tried using it instead of L2, and it works quite well, but training with L2 seems to have an edge, probably because the indifference of the Hilbert projetive metric to constant shifts is not a helpful inductive bias for deep learning. Universal Neural Optimal Transport C. Training Details Generator Architecture. Recall that the generator is of the form Gθ : Rd P(X) P(X) z ρz 7 R [Re LU (NNθ(z) + λ Id,d (z)) + δ] , where we set λ = 1.0, δ = 1e-6 (note we first normalize, then add δ, and then normalize again in practice), and z is of size 2 10 10. R normalizes and randomly downsizes output distributions to resolutions between 10 10 and 64 64 (per distribution). This improves generalization of the FNO Sϕ across resolutions, which is true for FNOs in general (Li et al., 2024a). NNθ is a five-layer fully connected MLP, where all hidden layers are of dimension 0.04 642, and the output is of dimension 2 642. All layers except the output layer contain Batch Normalization and ELU activations; the last layer has a sigmoid activation only. We note the architecture might seem strange, as the network is relatively deep, while the hidden layers are relatively narrow. However, this architecture worked best amongst an extensive sweep of architectures. Applying Theorem 3. In the following, we discuss the relation between our generator Gθ and Theorem 3 in more detail. Note that Theorem 3 is not directly applicable to our setting for a few reasons: First, we add a small constant η to the generator s output. This constant ensures that all training samples are positive everywhere, and vastly improves learning speed as it ensures that all inputs are active. However, this is not restrictive of the problem, as the Sinkhorn algorithm requires inputs to be positive anyways. Second, in Theorem 3 both inand outputs to Gθ have the same dimension. This could be achieved in our setting by choosing the input dimension equal to the output dimension, i.e. Id,n equal to the identity. However, in practice, using lower-dimensional inputs achieves significantly better results. This can be argued for by the manifold hypothesis (Fefferman et al., 2016), i.e. the fact that typically, datasets live on low-dimensional manifolds embedded in high-dimensional spaces. Depending on the application, i.e. the expected target dataset dimension, the dimension of the input can be adjusted accordingly. Finally, note that the theorem assumes that NNθ is Lipschitz continuous with Lipschitz constant L < λ, where λ is the scaling factor of the skip connection. We do not enforce this constraint, as not doing so yields empirically better results. Still, Theorem 3 goes to show that our algorithm s performance is not bottlenecked by the generator s inability to generalize. We note that a bound on the Lipschitz constant is not necessary for invertibility of Res Nets; other approaches have been suggested in the literature, e.g. through the lens of ODEs (Chang et al., 2017) or by partitioning input dimensions (Jacobsen et al., 2018). It is also possible to directly divide by the Lipschitz constant of each layer (Serrurier et al., 2023); these approaches could be studied in future research. We will now describe how one can bound the Lipschitz constant of the generator. Since λ = 1.0, we need to make sure that the Lipschitz constant of netθ is smaller than 1 in order for Theorem 3 to be applicable. Since the Lipschitz constant of a composition of functions is bounded by the product of the Lipschitz constants of each component function, this means we have to bound the product of the Lipschitz constants of components of netθ. ELU is Lipschitz continuous with constant 1, whereas sigmoid s Lipschitz constant is 0.25. Furthermore, for a batch normalization layer BN, we have BN(x) BN(y) = x µb where µb and σb denote the empirical mean and standard deviation of the batch. Since we draw our data from a standard normal Gaussian, we have E[σb] = 1, i.e. in expectation, the batch normalization layer is Lipschitz with constant 1. Hence, all that remains is to bound the product of Lipschitz constants of the three linear layers by (any number smaller than) 4 (because the constant of sigmoid is 0.25, this will ensure that the network has a Lipschitz constant smaller than 1), for which it suffices to bound the operator norms of the weight matrix of each layer. In practice, these can be approximated with the power method as in (Gouk et al., 2020) to find a lower bound on the Lipschitz constant of each linear layer, and these bounds can be used to add a soft constraint to the loss. Empirically, this suffices to bound the Lipschitz constant of the generator. Alternatively, one can use a hard constraint as outlined in (Behrmann et al., 2019). However, empirically, this proved detrimental to training, hence we did not control the Lipschitz constant during our training. Yet, Theorem 3 is still of value, as it goes to show that our algorithm s performance is not bottlenecked by the generator s inability to generalize. We leave properly enforcing the Lipschitz constraint for future research. Architecture of Sϕ. Our FNO architecture follows the general structure outlined in Section A.5. We set dvi = 64 for all i; recall this is the hidden dimension in the Fourier layer. We set the number of Fourier features selected from the Fourier transform to 10 10, i.e. 10 along each of the two dimensions of the domain. The (complex) weight matrices of the neural network in Fourier space, i.e. the one acting on the Fourier features, are tensors of shape (dvi, 4dvi, Nmodesx, Nmodesy) = Universal Neural Optimal Transport (64, 256, 10, 10) and (4dvi, dvi, Nmodesx, Nmodesy) = (256, 64, 10, 10) respectively, i.e. the hidden dimension is four times the hidden dimension of vi. Note that since these are complex layers, each layer has two (real) weight tensors of this shape, one for the real and one for the complex part. These layers are the only complex layers in the network Sϕ. The inputs to the layer are of shape (dvi, Nmodesx, Nmodesy) = (64, 10, 10) (in C) and multiplied along all dimensions by the weights, i.e. for input ˆx C64,10,10 and weight matrix A C64,256,10,10 (the first of the two layers): ˆyo,n,m = X i Ai,o,n,mˆxi,n,m. The activation used within this network, as well as after each Fourier block, is Ge LU. The lifting layer P, bypass layer W, and projection layer Q are 2D convolutions with kernel size 1. Hyperparameters. In Table 3 we present all relevant hyperparameters again for convenience. Table 3. Training hyperparameters. Hyperparameter Value # params Gθ 272k # layers Gϕ 5 hidden dims Gϕ (164, 164, 164, 164) δ (eq. (6)) 1e-6 λ (eq. (6)) 1 d (dimension of latent z) 2 10 10 = 200 optimizer Gϕ Adam activations Gϕ ELU β1 (initial learning rate Gθ ) 0.001 learning rate decay Gθ 1 weight decay Gϕ 0 # params Sϕ 26M Number of Fourier layers 4 dvi (dim. in Fourier blocks) 64 hidden dim. of Fourier NN 256 # layers in Fourier NN 2 Nmodesx (# Fourier modes) 10 Nmodesy (# Fourier modes) 10 optimizer Sϕ Adam W σ (activation in Sϕ) Ge LU α1 (initial learning rate Sϕ) 1e-4 learning rate decay Sϕ 0.9999 weight decay Sθ 1e-4 minimum training sample size 10 10 maximum training sample size 64 64 # training samples 200M batch size 5000 mini batch size 64 T (number batches) 40k ϵ (for Sinkhorn targets) 0.01 k (# Sinkhorn iterations for targets) 5 Code. Source code for UNOT, including the weights for the model used in the experiments, can be found at https: //github.com/Gregor Kornhardt/UNOT. Universal Neural Optimal Transport D. Additional Experiments and Materials D.1. Test Sets In Figure 10 we show samples from our test datasets. For some of the experiments in the appendix, we included two additional datasets, the cars class which is also from the Quick, Draw! dataset, and the Facial Expressions dataset (Hashan, 2022), which consists of 48 48-dimensional greyscale images. The datasets are very diverse, and range in dimensionality from very low (MNIST) to fairly low (BEARS, CARS), medium high (CIFAR) and very high (EXPRESSIONS, LFW). Figure 11 shows samples from our spherical datasets (where only part of the sphere is visible here). To create a grid on the sphere, we sample elevation angles θ uniformly in π 2 and azimuthal angles φ uniformly in 0, 2π . Concretely, we set 2 + i n 1 π, φj = 2π j n 1 , i, j = 0, . . . , n 1, and form the n n grid (θi, φj) i,j. Each pair (θi, φj) is mapped to a point on the sphere by x = cos(θi) cos(φj), y = cos(θi) sin(φj), z = sin(θi). EXPRESSIONS Figure 10. Test dataset samples on the unit square. D.2. Comparison with Meta OT We trained a Meta OT (Amos et al., 2023) network with the official Git Hub implementation12 and compared it against UNOT on our test datasets, where we rescaled all datasets to 28 28, as Meta OT does not natively support inputs of varying sizes. In Table D.2, we report the relative errors on the OT distance (in %) after a single Sinkhorn iteration. 12https://github.com/facebookresearch/meta-ot Universal Neural Optimal Transport Figure 11. Test dataset samples on the sphere. Table 4. Relative Errors on the OT distance (in %) after a single Sinkhorn iteration with UNOT s initialization, compared to Meta OT (Amos et al., 2023), the Gaussian initialization (Thornton & Cuturi, 2022), and the default initialization. Datasets rescaled to 28 28 such that the Meta OT network can process them. MNIST CIFAR MNIST-CIFAR LFW BEAR LFW-BEAR UNOT (ours) 2.7 2.4 1.3 1.1 2.8 2.6 1.5 1.3 2.0 1.6 1.8 1.3 Meta OT 2.4 1.8 23.1 15.7 11.4 5.8 24.6 15.7 11.8 8.3 31.0 14.8 Gauss 18.1 10.0 19.7 7.6 32.2 8.7 21.1 6.5 20.4 8.3 19.3 6.4 Ones 39.5 13.4 47.4 20.2 74.5 6.9 56.9 15.4 54.2 13.5 66.4 10.8 We see that UNOT outperforms Meta OT on all datasets except MNIST, which is to be expected, as Meta OT is explicitly trained on MNIST, while UNOT is not trained on any MNIST data. However, surprisingly, we see that UNOT almost matches Meta OT s performance on MNIST, suggesting strong coverage of MNIST-like distributions by our generator network during training. D.3. MLP-UNOT We mention that in applications of fixed-size distributions, one can replace the Neural Operator with an MLP and achieve similar results for a fraction of the training cost. We note that since the MLP acts on a fixed discrete space one does not need to have equispaced samples. In experiments, we found the MLP approach to also be very reliable for fixed-size inputs, and to vastly outperform the standard initialization of the Sinkhorn algorithm. Notably, it can be trained in just a few minutes to relative errors below 5%. Universal Neural Optimal Transport D.4. Generalization across Resolutions In this section, we show that UNOT successfully generalizes across resolutions. To this end, we downsample resp. upsample our test datasets to resolutions between 10 10 and 64 64. Figure 12 shows the relative errors on the transport distance over this range of resolutions after a single Sinkhorn iteration, compared against the default and the Gaussian initializations. (In Section D.5, we also provide some results on upsampling the dimension of the data beyond 64 64, i.e. beyond the largest resolution that the network saw during training.) We see that UNOT generalizes very well across all resolutions between 10 10 and 64 64. MNIST CIFAR BEAR LFW EXPRESSIONS CAR EXPRESSIONS 10 20 30 40 50 60 0.0 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 Resolution Relative Error UNOT Gauss Ones Figure 12. Relative error on the transport distance over the image resolution, ranging from 10 10 to 64 64. Universal Neural Optimal Transport D.5. Variable Epsilon In this section, we provide experimental results on a variant of UNOT that also receives the parameter ϵ as an input. Instead of the pair of measures (µ, ν) encoded as a tensor of size (B, 2, n, n), we use an input size of (B, 3, n, n), where the third channel is equal to ϵ everywhere. During training, we sample epsilon randomly per sample from a distribution with values between 0.01 and 1. Otherwise, training is identical to the training of regular UNOT. In Figure 13, we plot the relative errors over ϵ ranging from 0.01 to 1 on the x-axis, and the resolution of the data ranging from 10 10 to 70 70 on the y-axis (where we downsample resp. upsample the data to these dimensions, cf. Section D.4; note that we still only trained on image resolutions between 10 10 and 64 64). This variant of UNOT seems to do surprisingly well across different values of ϵ and across a wide range of resolutions, with relatively stable performance across different values of ϵ. However, we can see that when the resolution gets smaller than around 15 15, or close to 70 70, the error increases. 10 20 30 40 50 60 70 MNIST CIFAR BEAR LFW EXPRESSIONS CAR 10 20 30 40 50 60 70 10 20 30 40 50 60 70 10 20 30 40 50 60 70 10 20 30 40 50 60 70 EXPRESSIONS 0.01 0.1 1 10 20 30 40 50 60 70 Regularization Parameter Figure 13. Relative error on the transport distance, over the resolution and varying values of ϵ. Universal Neural Optimal Transport D.6. Generated Measures Figure 14 shows images created by the generator. The generator creates very different images over the course of training, including highly structured distributions, large areas of mass, and distributions with mass concentrated in very small areas. Figure 14. Pairs of training samples before and after 20%, 40%, 60%, 80%, and 100% of training, from left to right (lighter=more mass). Top row: actual training images; bottom row: training samples visualized with a smaller skip constant λ to accentuate learned features. In Table D.6, we also report the average OT distance error of samples created by the generator at various stages of training. We can see that the generator indeed creates samples that are initially difficult, but that it quickly picks up on them, and by the end of training is capable of predictions for samples from all stages of training. Table 5. Relative error on OT distance for samples created after 10, 20, ..., 70% of training. Errors for all samples computed at the time of their creation (i.e., after 10, 20, ...% of training) and at the end of training. Error after ...% of Training 0% 10% 20% 30% 40% 50% 60% 70% At Generation 53.2% 3.1% 2.1% 1.6% 1.8% 1.7% 2.1% 1.9% At End of Training 2.0% 1.6% 1.4% 1.1% 1.6% 1.5% 2.0% 1.9% Universal Neural Optimal Transport D.7. Additional Experiments We provide additional results from our experiments in this section. In Table 6, we show the average Wasserstein-2 distance of barycenters computed by gradient descent using equation (10) to the true barycenter, where we compute the gradient in equation (10) from the different initializations and a single Sinkhorn iteration. Figures 15 and 16 show the relative error on the OT distance over Sinkhorn iterations for c(x, y) = x y (on the square) and c(x, y) = arccos( x, y ) (on the sphere) resp., complementing Figure 3. In Figure 17, we plot the relative error on the transport distance w.r.t. computation time when initializing the Sinkhorn algorithm with UNOT, and compare against the default initialization. We see that particularly on higher dimensional data, UNOT is significantly faster than Sinkhorn. However, interestingly, on MNIST the default initialization actually seems to be faster. We note that these results heavily depend on the hardware used, and that we did not optimize our FNO architecture for performance, so a more efficient architecture would probably lead to even more significant speedups. We have not included the initialization from (Thornton & Cuturi, 2022) in the plots, as it was very slow for us, even slower than the standard initialization, despite our best efforts to implement it as efficiently as possible. However, from (Thornton & Cuturi, 2022; Amos et al., 2023) it seems like the speedup should be somewhere between 1.1x and 2x, depending on the dataset, which would make it significantly slower than UNOT on most of our datasets. We mention again that FNOs process complex numbers, but Py Torch is heavily optimized for real number operations. With kernel support for complex numbers, UNOT will likely be much faster. Finally, in Figures 18 and 19, we plot the marginal constraint violation (MCV), defined as 1 mΠ ν 1 + Π1n µ 1 for a transport plan Π, again for a single Sinkhorn iteration (Figure 18) and over iterations (Figure 19). The MCV measures how far the transport plan is from the marginals µ and ν. It is often used as a stopping criterion for the Sinkhorn algorithm, as the ground truth OT distance is unknown in practice. We compute the predicted transport plan for UNOT via equation (4). Table 6. Average W2 distance from the predicted barycenter to the true barycenter on MNIST after 100 gradient steps. W2 Distance UNOT (Ours) 0.021 0.011 Gauss 0.033 0.018 Ones 0.057 0.034 0.30 MNIST(28x28) 0.6 CIFAR(28x28) 0.30 MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.6 LFW(64x64) 0 10 20 30 40 50 Iteration 0.30 BEAR(64x64) 0 10 20 30 40 50 Iteration 0.30 LFW-BEAR(64x64) Relative Error Figure 15. Relative Error on the OT distance on the unit square with c(x, y) = x y for the UNOT initialization compared to the default one, over number of Sinkhorn iterations. Note the y-axis has been rescaled for CIFAR and LFW to fit the curve for the default initialization, and that the Gaussian initialization does not exist for the Euclidean cost function. Universal Neural Optimal Transport 0.3 MNIST(28x28) CIFAR(28x28) MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.3 LFW(64x64) 0 10 20 30 40 50 Iteration BEAR(64x64) 0 10 20 30 40 50 Iteration LFW-BEAR(64x64) Relative Error Figure 16. Relative Error on the OT distance on the unit sphere with c(x, y) = arccos( x, y ) for the UNOT initialization compared to the default one, over number of Sinkhorn iterations. Note that the Gaussian initialization does not exist for the spherical cost function. Figure 17. Comparison of relative errors on the transport distance over computation time in seconds. Evaluated on an NVIDIA 4090. The x-offset of the UNOT curves corresponds to the time needed for the forward pass through Sϕ. Universal Neural Optimal Transport MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.0 Marginal Constraint Violation 28x28 28x28 28x28 64x64 64x64 64x64 Ones Gauss UNOT MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.0 Marginal Constraint Violation 28x28 28x28 28x28 64x64 64x64 64x64 MNIST CIFAR MNIST LFW-BEAR LFW BEAR 0.0 Marginal Constraint Violation 28x28 28x28 28x28 64x64 64x64 64x64 Figure 18. Average marginal constraint violation (see eq. (20)) after a single Sinkhorn iteration, for the unit square domain with c(x, y) = x y 2 (top) and c(x, y) = x y (middle), and the unit sphere with c(x, y) = arccos( x, y ) (bottom). Note that the Gaussian initialization exists only for the squared Euclidean distance cost. Universal Neural Optimal Transport 0.3 MNIST(28x28) CIFAR(28x28) MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.3 LFW(64x64) 0 10 20 30 40 50 Iteration BEAR(64x64) 0 10 20 30 40 50 Iteration LFW-BEAR(64x64) Marginal Constraint Violation UNOT Gauss Ones 0.3 MNIST(28x28) CIFAR(28x28) MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.3 LFW(64x64) 0 10 20 30 40 50 Iteration BEAR(64x64) 0 10 20 30 40 50 Iteration LFW-BEAR(64x64) Marginal Constraint Violation 0.3 MNIST(28x28) CIFAR(28x28) MNIST-CIFAR(28x28) 0 10 20 30 40 50 Iteration 0.3 LFW(64x64) 0 10 20 30 40 50 Iteration BEAR(64x64) 0 10 20 30 40 50 Iteration LFW-BEAR(64x64) Marginal Constraint Violation Figure 19. Average marginal constraint violation (see eq. (20)) over number of Sinkhorn iterations, for the unit square domain with c(x, y) = x y 2 (top) and c(x, y) = x y (middle), and the unit sphere with c(x, y) = arccos( x, y ) (bottom). Note that the Gaussian initialization exists only for the squared Euclidean distance cost.