Published as a conference paper at ICLR 2025

EXPECTED SLICED TRANSPORT PLANS

Xinran Liu¹*, Rocío Díaz Martín²*, Yikun Bai¹, Ashkan Shahbazi¹, Matthew Thorpe³, Akram Aldroubi⁴, Soheil Kolouri¹

¹Department of Computer Science, Vanderbilt University, Nashville, TN 37235
²Department of Mathematics, Tufts University, Medford, MA 02155
³Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK
⁴Department of Mathematics, Vanderbilt University, Nashville, TN 37235
*Equal contribution.

ABSTRACT

The optimal transport (OT) problem has gained significant traction in modern machine learning for its ability to: (1) provide versatile metrics, such as Wasserstein distances and their variants, and (2) determine optimal couplings between probability measures. To reduce the computational complexity of OT solvers, methods like entropic regularization and sliced optimal transport have been proposed. The sliced OT framework improves efficiency by comparing one-dimensional projections (slices) of high-dimensional distributions. However, despite their computational efficiency, sliced-Wasserstein approaches lack a transport plan between the input measures, limiting their use in scenarios requiring explicit coupling. In this paper, we address two key questions: Can a transport plan be constructed between two probability measures using the sliced transport framework? If so, can this plan be used to define a metric between the measures? We propose a lifting operation to extend one-dimensional optimal transport plans back to the original space of the measures. By computing the expectation of these lifted plans, we derive a new transport plan, termed expected sliced transport (EST) plans. We further prove that using the EST plan to weight the sum of the individual Euclidean costs $\|x-y\|^p$ for moving from $x$ to $y$ results in a valid metric between the input discrete probability measures. Finally, we demonstrate the connection between our approach and the recently proposed min-SWGG, along with illustrative numerical examples that support our theoretical findings.

1 INTRODUCTION

The optimal transport (OT) problem (Villani, 2009) seeks the most efficient way to transport a distribution of mass from one configuration to another, minimizing the cost associated with the transportation process. It has found diverse applications in machine learning due to its ability to provide meaningful distances, i.e., the Wasserstein distances (Peyré & Cuturi, 2019), between probability distributions, with applications ranging from supervised learning (Frogner et al., 2015) to generative modeling (Arjovsky et al., 2017). Beyond merely measuring distances between probability measures, the optimal transport plan obtained from the OT problem provides correspondences between the empirical samples of the source and target distributions, which are used in various applications, including domain adaptation (Courty et al., 2014), positive-unlabeled learning (Chapel et al., 2020), texture mixing (Rabin et al., 2011), color transfer (Rabin et al., 2014), image analysis (Basu et al., 2014), and even single-cell and spatial omics (Bunne et al., 2024), to name a few. One of the primary challenges in applying the OT framework to large-scale problems is its computational complexity. Traditional OT solvers for discrete measures typically scale cubically with the number of samples $N$ (i.e., the support size) (Kolouri et al., 2017).
Precisely, when using linear programming to solve the OT problem, the computational complexity is of order $O(N^3\log N)$. This computational burden has spurred significant research efforts to accelerate OT computations. Various approaches have been developed to address this challenge, including the seminal work of Cuturi (2013), which introduces entropic regularization, leveraging the Sinkhorn algorithm to compute OT efficiently with quadratic time complexity; multiscale methods (Schmitzer, 2016); and projection-based techniques such as sliced-Wasserstein distances (Rabin et al., 2011) and robust subspace OT (Paty & Cuturi, 2019). Each of these methods has its own advantages and limitations. For instance, entropic regularized OT is solved via an iterative algorithm (i.e., the Sinkhorn algorithm) with quadratic computational complexity per iteration. However, the number of iterations required for convergence typically increases as the regularization parameter decreases, which can offset the computational benefits of these methods. Precisely, the complexity of the Sinkhorn method is of order $O(N^2\log(N)/\lambda^2)$, where $\lambda$ is the regularization parameter (Dvurechensky et al., 2018; Altschuler et al., 2017). Additionally, while entropic regularization interpolates between Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) and the Wasserstein distance (Feydy et al., 2019), it does not produce a true metric between probability measures. Despite not providing a metric, entropic OT provides a transport plan, i.e., soft correspondences, albeit not the optimal one. On the other hand, sliced-Wasserstein distances offer linearithmic computational complexity, enabling the comparison of discrete measures with millions of samples. These distances are also topologically equivalent to the Wasserstein distance and offer statistical advantages, such as better sample complexity (Nadjahi et al., 2020). However, despite their computational efficiency, the sliced-Wasserstein approaches do not provide a transport plan between the input probability measures, limiting their applicability to problems that require explicit coupling between measures.

In this paper, we address two central questions: First, can a transportation plan be constructed between two probability measures using the sliced transport framework? If so, can the resulting transportation plan be used to define a metric between the two probability measures? Within the sliced transport framework, the "slices" refer to the one-dimensional marginals of the source and target probability measures, for which an optimal transportation plan is computed. Crucially, this optimal transportation plan applies to the marginals (i.e., one-dimensional probability measures) rather than the original measures. To derive a transportation plan between the source and target measures, this optimal plan for the marginals must be "lifted" back to the original space. For discrete measures with equal support size, $N$, and uniform mass, $1/N$, the optimal transportation plan between marginals is represented by a correspondence matrix, specifically an $N\times N$ permutation matrix. Previous works have directly used the correspondence matrix obtained for a slice as a transportation plan in the original space of measures (Rowland et al., 2019; Mahey et al., 2023). This paper provides a holistic and rigorous analysis of this problem for general discrete probability measures. Our specific contributions in this paper are:
1. Introducing a computationally efficient transport plan between discrete probability measures, the Expected Sliced Transport (EST) plan. Motivated by the first question highlighted above, we construct this transport plan as the average of transport plans computed via a lifting scheme involving one-dimensional sliced transport plans. (See Definition 2.4 below.)
2. Providing a distance for discrete probability measures, the Expected Sliced Transport (EST) distance. (See Definition 2.6 and Theorem 2.10 below.)
3. Offering both a theoretical proof and an experimental visualization showing that the EST distance is equivalent to the Wasserstein distance (and to weak convergence) when applied to discrete measures.
4. Illustrating the potential applicability of the proposed distance and transport plan, with a focus on interpolation and classification tasks.

2 EXPECTED SLICED TRANSPORT

2.1 PRELIMINARIES

Given a probability measure $\mu\in\mathcal{P}(\mathbb{R}^d)$ and a unit vector $\theta\in\mathbb{S}^{d-1}\subset\mathbb{R}^d$, we define $\theta_{\#}\mu := \langle\theta,\cdot\rangle_{\#}\mu$ to be the $\theta$-slice of the measure $\mu$, where $\langle\theta,x\rangle = \theta\cdot x = \theta^T x$ denotes the standard inner product in $\mathbb{R}^d$. For any pair of probability measures with finite $p$-moment ($p > 1$), $\mu^1,\mu^2\in\mathcal{P}_p(\mathbb{R}^d)$, one can pose the following two optimal transport (OT) problems. On the one hand, consider the classical OT problem, which gives rise to the $p$-Wasserstein metric:

$$W_p(\mu^1,\mu^2) := \left(\min_{\gamma\in\Gamma(\mu^1,\mu^2)} \int_{\mathbb{R}^d\times\mathbb{R}^d} \|x-y\|^p\,d\gamma(x,y)\right)^{1/p}, \tag{1}$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^d$ and $\Gamma(\mu^1,\mu^2)\subseteq\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ is the subset of all probability measures with marginals $\mu^1$ and $\mu^2$. On the other hand, for a given $\theta\in\mathbb{S}^{d-1}$, consider the one-dimensional OT problem:

$$W_p(\theta_{\#}\mu^1,\theta_{\#}\mu^2) = \left(\min_{\Lambda_\theta\in\Gamma(\theta_{\#}\mu^1,\theta_{\#}\mu^2)} \int_{\mathbb{R}\times\mathbb{R}} |u-v|^p\,d\Lambda_\theta(u,v)\right)^{1/p}. \tag{2}$$

In this case, since the measures $\theta_{\#}\mu^1,\theta_{\#}\mu^2$ can be regarded as one-dimensional probabilities in $\mathcal{P}(\mathbb{R})$, there exists a unique optimal transport plan, which we denote by $\Lambda^{\mu^1,\mu^2}_\theta$ (see, e.g., (Villani, 2021, Thm. 2.18, Remark 2.19), (Maggi, 2023, Thm. 16.1)).

Figure 1: Our goal in this paper is to construct transportation plans between $d$-dimensional measures ($d > 1$) using their one-dimensional marginals, or slices. For two measures $\mu^1$ (represented by green circles) and $\mu^2$ (represented by blue circles), we first derive a one-dimensional transport plan $\Lambda^{\mu^1,\mu^2}_\theta$ from their slices along a unit vector $\theta$ (indicated by triangles). This one-dimensional transport plan is then lifted back to the original $d$-dimensional space to produce a transport plan between the measures, $\gamma^{\mu^1,\mu^2}_\theta$. Finally, these transport plans are aggregated over all directions $\theta\in\mathbb{S}^{d-1}$. In (a), the measures $\mu^1$ and $\mu^2$ are uniform, with their masses non-overlapping when projected in the direction of $\theta$. In (b), $\mu^1$ and $\mu^2$ are non-uniform, resulting in overlapping masses when projected along $\theta$. For further details, see Remark A.1 in Appendix A.1.

2.2 ON SLICING AND LIFTING TRANSPORT PLANS

In this section, given discrete probability measures $\mu^1,\mu^2\in\mathcal{P}_p(\mathbb{R}^d)$, we describe the process of slicing them according to a direction $\theta\in\mathbb{S}^{d-1}$ and lifting the optimal transport plan $\Lambda^{\mu^1,\mu^2}_\theta$, which solves the 1-d OT problem (2), to get a plan in $\Gamma(\mu^1,\mu^2)$. Thus, we obtain a new measure, denoted $\gamma^{\mu^1,\mu^2}_\theta$, in $\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ with first and second marginals $\mu^1$ and $\mu^2$, respectively. For clarity, we first describe the process for discrete uniform measures and then extend it to any pair of discrete measures.
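For concreteness, the following minimal NumPy sketch (an illustration for this exposition, not the implementation used for our experiments; the helper names are ours) computes the slicing step and evaluates the one-dimensional distance in (2) for uniform discrete measures, where the unique monotone optimal plan is obtained by sorting.

```python
import numpy as np

def slice_measure(points, theta):
    """Project d-dimensional support points onto the direction theta,
    giving the support of the theta-slice (a one-dimensional measure)."""
    return points @ theta  # shape (N,)

def sliced_wasserstein_1d(x_proj, y_proj, p=2):
    """1-D p-Wasserstein distance between two uniform discrete measures:
    sort both projections and match them monotonically, solving Eq. (2)."""
    x_sorted = np.sort(x_proj)
    y_sorted = np.sort(y_proj)
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1 / p)

# Toy usage: two uniform point clouds in R^2 and a random slice.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(loc=2.0, size=(5, 2))
theta = rng.normal(size=2)
theta /= np.linalg.norm(theta)  # unit vector on S^{d-1}
print(sliced_wasserstein_1d(slice_measure(X, theta), slice_measure(Y, theta)))
```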
2.2.1 ON SLICING AND LIFTING TRANSPORT PLANS FOR UNIFORM DISCRETE MEASURES

Given $N\in\mathbb{N}$, consider the space $\mathcal{P}^{(N)}(\mathbb{R}^d)$ of uniform discrete probability measures concentrated at $N$ particles in $\mathbb{R}^d$, that is,

$$\mathcal{P}^{(N)}(\mathbb{R}^d) := \left\{\frac{1}{N}\sum_{i=1}^N \delta_{x_i} \;\middle|\; x_i\in\mathbb{R}^d,\; i\in\{1,\dots,N\}\right\}.$$

Let $\mu^1 = \frac{1}{N}\sum_{i=1}^N\delta_{x_i}$, $\mu^2 = \frac{1}{N}\sum_{j=1}^N\delta_{y_j}\in\mathcal{P}^{(N)}(\mathbb{R}^d)$, where $x_i,y_j\in\mathbb{R}^d$ and $\delta_{x_i}$ denotes a Dirac measure located at $x_i$ (respectively for $\delta_{y_j}$). Let us denote by $\mathcal{U}(\mathbb{S}^{d-1})$ the uniform measure on the hypersphere $\mathbb{S}^{d-1}\subset\mathbb{R}^d$. In this case, the $\theta$-slice of $\mu^1$ is represented by $\theta_{\#}\mu^1 = \frac{1}{N}\sum_{i=1}^N\delta_{\theta\cdot x_i}$, and similarly for $\theta_{\#}\mu^2$.

Let $S_N$ denote the symmetric group of all permutations of the elements in the set $[N] := \{1,\dots,N\}$. Let $\zeta_\theta,\tau_\theta\in S_N$ denote the sorting permutations of the projected points $\{\theta\cdot x_i\}_{i=1}^N$ and $\{\theta\cdot y_j\}_{j=1}^N$, respectively, that is,

$$\theta\cdot x_{\zeta_\theta^{-1}(1)} \le \theta\cdot x_{\zeta_\theta^{-1}(2)} \le \cdots \le \theta\cdot x_{\zeta_\theta^{-1}(N)} \quad\text{and}\quad \theta\cdot y_{\tau_\theta^{-1}(1)} \le \theta\cdot y_{\tau_\theta^{-1}(2)} \le \cdots \le \theta\cdot y_{\tau_\theta^{-1}(N)}. \tag{3}$$

The optimal matching from $\theta_{\#}\mu^1$ to $\theta_{\#}\mu^2$ for the problem (2) is induced by the assignment

$$\theta\cdot x_{\zeta_\theta^{-1}(i)} \mapsto \theta\cdot y_{\tau_\theta^{-1}(i)}, \quad 1\le i\le N. \tag{4}$$

We define the lifted transport map $T^{\mu^1,\mu^2}_\theta : \{x_1,\dots,x_N\}\to\{y_1,\dots,y_N\}$ between $\mu^1$ and $\mu^2$ by

$$T^{\mu^1,\mu^2}_\theta(x_i) = y_{\tau_\theta^{-1}(\zeta_\theta(i))}, \quad 1\le i\le N. \tag{5}$$

Rigorously, $T^{\mu^1,\mu^2}_\theta$ is not necessarily a function defined on $\{x_1,\dots,x_N\}$ but on the labels $\{1,\dots,N\}$, as two projected points $\theta\cdot x_i$ and $\theta\cdot x_j$ could coincide for $i\neq j$. As a result, it is more appropriate to work with lifted transport plans. Indeed, the matrix $u^{\mu^1,\mu^2}_\theta\in\mathbb{R}^{N\times N}$ given by

$$u^{\mu^1,\mu^2}_\theta(i,j) = \begin{cases} 1/N & \text{if } j = \tau_\theta^{-1}(\zeta_\theta(i)), \\ 0 & \text{otherwise}, \end{cases} \tag{6}$$

encodes the weights of the optimal transport plan between $\theta_{\#}\mu^1$ and $\theta_{\#}\mu^2$ given by

$$\Lambda^{\mu^1,\mu^2}_\theta = \sum_{i,j=1}^N u^{\mu^1,\mu^2}_\theta(i,j)\,\delta_{(\theta\cdot x_i,\,\theta\cdot y_j)}, \tag{7}$$

as well as the weights of the lifted transport plan between the original measures $\mu^1$ and $\mu^2$ according to the $\theta$-slice, defined by

$$\gamma^{\mu^1,\mu^2}_\theta = \sum_{i,j=1}^N u^{\mu^1,\mu^2}_\theta(i,j)\,\delta_{(x_i,y_j)}. \tag{8}$$

This new measure $\gamma^{\mu^1,\mu^2}_\theta\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ has marginals $\mu^1$ and $\mu^2$. While $\gamma^{\mu^1,\mu^2}_\theta$ is not necessarily optimal for problem (1), it can be interpreted as a transport plan in $\Gamma(\mu^1,\mu^2)$ which is optimal when projecting $\mu^1$ and $\mu^2$ in the direction of $\theta$. See Figure 1 (a) for a visualization.
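For uniform measures, the construction in (6)-(8) reduces to composing two argsorts. A minimal sketch (ours, with hypothetical helper names, not the released implementation) that builds the matrix $u^{\mu^1,\mu^2}_\theta$ and checks its marginals:

```python
import numpy as np

def lifted_plan_uniform(X, Y, theta):
    """Lifted transport plan (Eq. (8)) between two uniform discrete
    measures supported on the rows of X and Y (both of size N),
    encoded as the N x N matrix u_theta of Eq. (6)."""
    N = X.shape[0]
    # argsort returns zeta_theta^{-1} and tau_theta^{-1} as index arrays.
    zeta_inv = np.argsort(X @ theta)
    tau_inv = np.argsort(Y @ theta)
    u = np.zeros((N, N))
    # The i-th smallest projection of X is matched to the i-th smallest of Y.
    u[zeta_inv, tau_inv] = 1.0 / N
    return u

# The row/column sums recover the (uniform) marginals of mu^1 and mu^2.
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
theta = np.array([1.0, 0.0])
u = lifted_plan_uniform(X, Y, theta)
assert np.allclose(u.sum(axis=1), 1 / 4) and np.allclose(u.sum(axis=0), 1 / 4)
```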
2.2.2 ON SLICING AND LIFTING TRANSPORT PLANS FOR GENERAL DISCRETE MEASURES

Consider discrete measures $\mu^1,\mu^2\in\mathcal{P}_p(\mathbb{R}^d)$. In this section, we will use the notation $\mu^1 = \sum_{x\in\mathbb{R}^d} p(x)\delta_x$, where $p(x)\ge 0$ for all $x\in\mathbb{R}^d$, $p(x)\neq 0$ for at most countably many points $x\in\mathbb{R}^d$, and $\sum_{x\in\mathbb{R}^d} p(x) = 1$. Similarly, $\mu^2 = \sum_{y\in\mathbb{R}^d} q(y)\delta_y$ for a non-negative density function $q$ in $\mathbb{R}^d$ with finite or countable support and such that $\sum_{y\in\mathbb{R}^d} q(y) = 1$.

Given $\theta\in\mathbb{S}^{d-1}$, consider the equivalence relation defined by

$$x \sim_\theta x' \quad\text{if and only if}\quad \theta\cdot x = \theta\cdot x'.$$

We denote by $\tilde{x}^\theta$ the equivalence class of $x\in\mathbb{R}^d$. By abuse of notation, we will use interchangeably that $\tilde{x}^\theta$ is a point in the quotient space $\mathbb{R}^d/\!\sim_\theta$, and also the set $\{x'\in\mathbb{R}^d : \theta\cdot x' = \theta\cdot x\}$, which is the hyperplane orthogonal to $\theta$ that intersects $x$. The intended meaning of $\tilde{x}^\theta$ will be clear from the context. Notice that, geometrically, the quotient space $\mathbb{R}^d/\!\sim_\theta$ is the line $\mathbb{R}$ in the direction of $\theta$.

Now, we interpret the projected measures $\theta_{\#}\mu^1,\theta_{\#}\mu^2$ as one-dimensional probability measures in $\mathcal{P}(\mathbb{R}^d/\!\sim_\theta)$ given by $\theta_{\#}\mu^1 = \sum_{\tilde{x}^\theta\in\mathbb{R}^d/\sim_\theta} P(\tilde{x}^\theta)\,\delta_{\tilde{x}^\theta}$, where $P(\tilde{x}^\theta) = \sum_{x'\in\tilde{x}^\theta} p(x')$, and similarly, $\theta_{\#}\mu^2 = \sum_{\tilde{y}^\theta\in\mathbb{R}^d/\sim_\theta} Q(\tilde{y}^\theta)\,\delta_{\tilde{y}^\theta}$, where $Q(\tilde{y}^\theta) = \sum_{y'\in\tilde{y}^\theta} q(y')$.

Remark 2.1. Notice that if $P(\tilde{x}^\theta) = 0$, then $p(x') = 0$ for all $x'\in\tilde{x}^\theta$; or, equivalently, if $p(x)\neq 0$, then $P(\tilde{x}^\theta)\neq 0$ (where $x$ is any representative of the class $\tilde{x}^\theta$). Similarly for $Q$.

Consider the optimal transport plan $\Lambda^{\mu^1,\mu^2}_\theta\in\Gamma(\theta_{\#}\mu^1,\theta_{\#}\mu^2)\subseteq\mathcal{P}(\mathbb{R}^d/\!\sim_\theta \times\,\mathbb{R}^d/\!\sim_\theta)$ between $\theta_{\#}\mu^1$ and $\theta_{\#}\mu^2$, which is unique for the OT problem (2), as we are considering one-dimensional probability measures. Let us define

$$u^{\mu^1,\mu^2}_\theta(x,y) := \begin{cases} \dfrac{p(x)\,q(y)}{P(\tilde{x}^\theta)\,Q(\tilde{y}^\theta)}\,\Lambda^{\mu^1,\mu^2}_\theta(\{(\tilde{x}^\theta,\tilde{y}^\theta)\}) & \text{if } p(x)\neq 0 \text{ and } q(y)\neq 0, \\ 0 & \text{if } p(x) = 0 \text{ or } q(y) = 0, \end{cases}$$

which allows us to generalize the lifted transport plan given in (6) to the general discrete case:

$$\gamma^{\mu^1,\mu^2}_\theta = \sum_{x,y\in\mathbb{R}^d} u^{\mu^1,\mu^2}_\theta(x,y)\,\delta_{(x,y)}. \tag{9}$$

See Figure 1 (b) for a visualization.

Remark 2.2. Notice that this lifting process can be performed starting from any transport plan $\Lambda_\theta\in\Gamma(\theta_{\#}\mu^1,\theta_{\#}\mu^2)$, but in this article we will always consider the optimal transport plan, i.e., $\Lambda_\theta = \Lambda^{\mu^1,\mu^2}_\theta$. We make this choice because it gives rise to a metric between discrete probability measures: the EST distance, which will be defined in Section 2.3.

Lemma 2.3. Given general discrete probability measures $\mu^1$ and $\mu^2$ in $\mathbb{R}^d$, the discrete measure $\gamma^{\mu^1,\mu^2}_\theta$ defined by (9) has marginals $\mu^1$ and $\mu^2$, that is, $\gamma^{\mu^1,\mu^2}_\theta\in\Gamma(\mu^1,\mu^2)\subseteq\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$.

We refer the reader to the appendix for its proof.
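The general lifting can be assembled in two steps: compute the unique monotone one-dimensional plan between the projected atoms (the classical north-west-corner traversal of the sorted masses), then redistribute its entries over the pre-images with the weights $p(x)q(y)/(P(\tilde{x}^\theta)Q(\tilde{y}^\theta))$. The sketch below (ours, for illustration only) assumes the projections $\theta\cdot x_i$ are pairwise distinct, which holds for $\sigma$-almost every $\theta$ (cf. Lemma A.13), so each equivalence class is a single atom and the reweighting factor equals 1.

```python
import numpy as np

def monotone_plan_1d(a, b):
    """Unique monotone (optimal) 1-D coupling of weight vectors a, b
    whose atoms are assumed already sorted; north-west corner rule."""
    plan = np.zeros((len(a), len(b)))
    a, b = a.copy(), b.copy()
    i = j = 0
    while i < len(a) and j < len(b):
        m = min(a[i], b[j])        # ship as much mass as possible
        plan[i, j] = m
        a[i] -= m
        b[j] -= m
        if a[i] <= 1e-15:          # source atom exhausted
            i += 1
        if j < len(b) and b[j] <= 1e-15:  # target bin filled
            j += 1
    return plan

def lifted_plan_general(X, p, Y, q, theta):
    """Lifted plan gamma_theta of Eq. (9) for weighted point clouds,
    assuming no two projections coincide along theta."""
    ix, iy = np.argsort(X @ theta), np.argsort(Y @ theta)
    plan_sorted = monotone_plan_1d(p[ix], q[iy])
    gamma = np.zeros((len(p), len(q)))
    gamma[np.ix_(ix, iy)] = plan_sorted  # undo the sorting of both sides
    return gamma
```

When projections do overlap, one first aggregates the class masses $P(\tilde{x}^\theta)$, $Q(\tilde{y}^\theta)$, couples the classes as above, and splits each entry of the plan proportionally to $p(x)q(y)$, exactly as in the displayed definition of $u^{\mu^1,\mu^2}_\theta$.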
2.3 EXPECTED SLICED TRANSPORT (EST) FOR DISCRETE MEASURES

Leveraging the transport plans $\gamma^{\mu^1,\mu^2}_\theta$ described before, in this section we propose a new transport plan $\bar{\gamma}^{\mu^1,\mu^2}\in\Gamma(\mu^1,\mu^2)$, which will give rise to a new metric on the space of discrete probability measures.

Definition 2.4 (Expected Sliced Transport plan). Let $\sigma\in\mathcal{P}(\mathbb{S}^{d-1})$. For discrete probability measures $\mu^1,\mu^2$ in $\mathbb{R}^d$, we define the Expected Sliced Transport (EST) plan $\bar{\gamma}^{\mu^1,\mu^2}\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ by $\bar{\gamma}^{\mu^1,\mu^2} := \mathbb{E}_{\theta\sim\sigma}[\gamma^{\mu^1,\mu^2}_\theta]$, where each $\gamma^{\mu^1,\mu^2}_\theta$ is given by (9); that is,

$$\bar{\gamma}^{\mu^1,\mu^2}(\{(x,y)\}) = \int_{\mathbb{S}^{d-1}} \gamma^{\mu^1,\mu^2}_\theta(\{(x,y)\})\,d\sigma(\theta), \quad (x,y)\in\mathbb{R}^d\times\mathbb{R}^d. \tag{10}$$

In other words, $\bar{\gamma}^{\mu^1,\mu^2} = \sum_{x,y\in\mathbb{R}^d} U^{\mu^1,\mu^2}(x,y)\,\delta_{(x,y)}$, where the new weights are given by

$$U^{\mu^1,\mu^2}(x,y) = \begin{cases} p(x)\,q(y)\displaystyle\int_{\mathbb{S}^{d-1}} \frac{\Lambda^{\mu^1,\mu^2}_\theta(\{(\tilde{x}^\theta,\tilde{y}^\theta)\})}{P(\tilde{x}^\theta)\,Q(\tilde{y}^\theta)}\,d\sigma(\theta) & \text{if } p(x)\neq 0 \text{ and } q(y)\neq 0, \\ 0 & \text{otherwise.} \end{cases}$$

Remark 2.5. The measure $\bar{\gamma}^{\mu^1,\mu^2}$ is well-defined and, moreover (as an easy consequence of Lemma 2.3), $\bar{\gamma}^{\mu^1,\mu^2}\in\Gamma(\mu^1,\mu^2)$, i.e., it has marginals $\mu^1$ and $\mu^2$. (See Lemma A.5 in the appendix.)

Definition 2.6 (Expected Sliced Transport distance). Let $\sigma\in\mathcal{P}(\mathbb{S}^{d-1})$ with $\mathrm{supp}(\sigma) = \mathbb{S}^{d-1}$. We define the Expected Sliced Transport discrepancy for discrete probability measures $\mu^1,\mu^2$ in $\mathbb{R}^d$ by

$$\mathcal{D}_p(\mu^1,\mu^2) := \left(\sum_{x,y\in\mathbb{R}^d} \|x-y\|^p\,\bar{\gamma}^{\mu^1,\mu^2}(\{(x,y)\})\right)^{1/p}, \quad\text{where } \bar{\gamma}^{\mu^1,\mu^2} \text{ is defined by (10).} \tag{11}$$

Remark 2.7. By defining the following generalization of the Sliced Wasserstein Generalized Geodesics (SWGG) dissimilarity presented in Mahey et al. (2023),

$$\mathcal{D}_p(\mu^1,\mu^2;\theta) := \left(\sum_{x,y\in\mathbb{R}^d} \|x-y\|^p\,\gamma^{\mu^1,\mu^2}_\theta(\{(x,y)\})\right)^{1/p}, \tag{12}$$

we can rewrite (11) as $\mathcal{D}_p(\mu^1,\mu^2) = \mathbb{E}^{1/p}_{\theta\sim\sigma}\!\left[\mathcal{D}^p_p(\mu^1,\mu^2;\theta)\right]$.

Remark 2.8. Since the EST plan $\bar{\gamma}^{\mu^1,\mu^2}$ is a transport plan, we have that $W_p(\mu^1,\mu^2)\le\mathcal{D}_p(\mu^1,\mu^2)$. In Appendix B we show that they define the same topology on the space of discrete probabilities.

Remark 2.9 (EST for discrete uniform measures and the Projected Wasserstein distance). Consider uniform measures $\mu^1 = \frac1N\sum_{i=1}^N\delta_{x_i}$, $\mu^2 = \frac1N\sum_{j=1}^N\delta_{y_j}\in\mathcal{P}^{(N)}(\mathbb{R}^d)$, and for $\theta\in\mathbb{S}^{d-1}$, let $\zeta_\theta,\tau_\theta\in S_N$ be permutations that allow us to order the projected points as in (3). Notice that if $\sigma = \mathcal{U}(\mathbb{S}^{d-1})$, by using the formula (5) for each assignment given $\theta$ and noticing that $\tau_\theta^{-1}\circ\zeta_\theta\in S_N$, we can rewrite (11) as

$$\mathcal{D}_p(\mu^1,\mu^2)^p = \mathbb{E}_{\theta\sim\mathcal{U}(\mathbb{S}^{d-1})}\left[\frac1N\sum_{i=1}^N \left\|x_i - y_{\tau_\theta^{-1}(\zeta_\theta(i))}\right\|^p\right]. \tag{13}$$

Therefore, the expression for $\mathcal{D}_p(\cdot,\cdot)$ given by (13) coincides with the Projected Wasserstein distance proposed in (Rowland et al., 2019, Definition 3.1). Then, by applying (Rowland et al., 2019, Proposition 3.3), the Expected Sliced Transport discrepancy defined in Equation (13) is a metric on the space $\mathcal{P}^{(N)}(\mathbb{R}^d)$. We generalize this in the next theorem.

Theorem 2.10. The Expected Sliced Transport discrepancy $\mathcal{D}_p(\cdot,\cdot)$ defined in (11) is a metric on the space of finite discrete probability measures in $\mathbb{R}^d$.

Sketch of the proof of Theorem 2.10. For the detailed proof, we refer the reader to Appendix A. Here, we present a brief overview of the main ideas and steps involved in the proof. The symmetry of $\mathcal{D}_p(\cdot,\cdot)$ follows from our construction of the transport plan $\bar{\gamma}^{\mu^1,\mu^2}$, which is based on considering a family of optimal 1-d transport plans $\{\Lambda^{\mu^1,\mu^2}_\theta\}_{\theta\in\mathbb{S}^{d-1}}$. The identity of indiscernibles follows essentially from Remark 2.8. To prove the triangle inequality we use the following strategy:

1. We leverage the fact that $\mathcal{D}_p(\cdot,\cdot)$ is a metric for the space $\mathcal{P}^{(N)}(\mathbb{R}^d)$ of uniform discrete probability measures concentrated at $N$ particles in $\mathbb{R}^d$ (Rowland et al., 2019) to prove that $\mathcal{D}_p(\cdot,\cdot)$ is also a metric on the set of measures whose masses are rational. To do so, we establish a correspondence between finite discrete measures with rational weights and finite discrete measures with uniform mass (see the last paragraph of Proposition A.7).
2. Given finite discrete measures $\mu^1,\mu^2$, we approximate them, in terms of the Total Variation norm, by sequences of probability measures with rational weights $\{\mu^1_n\},\{\mu^2_n\}$, supported on the same points as $\mu^1$ and $\mu^2$, respectively. We then turn our attention to how the various plans constructed behave as $n$ increases and show the following convergence results in Total Variation norm:
(a) The sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ converges to $\Lambda^{\mu^1,\mu^2}_\theta$.
(b) The sequence $(\gamma^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ converges to $\gamma^{\mu^1,\mu^2}_\theta$.
(c) The sequence $(\bar{\gamma}^{\mu^1_n,\mu^2_n})_{n\in\mathbb{N}}$ converges to $\bar{\gamma}^{\mu^1,\mu^2}$.
As a consequence, we obtain $\lim_n \mathcal{D}_p(\mu^1_n,\mu^2_n) = \mathcal{D}_p(\mu^1,\mu^2)$.

Finally, given three finite discrete measures $\mu^1,\mu^2,\mu^3$, we proceed as in point 2 by considering sequences of probability measures with rational weights $\{\mu^1_n\},\{\mu^2_n\},\{\mu^3_n\}$, supported on the same points as $\mu^1,\mu^2,\mu^3$, respectively, that approximate the original measures in Total Variation, obtaining

$$\mathcal{D}_p(\mu^1,\mu^2) = \lim_n \mathcal{D}_p(\mu^1_n,\mu^2_n) \le \lim_n \left[\mathcal{D}_p(\mu^1_n,\mu^3_n) + \mathcal{D}_p(\mu^3_n,\mu^2_n)\right] = \mathcal{D}_p(\mu^1,\mu^3) + \mathcal{D}_p(\mu^3,\mu^2),$$

where the equalities follow from point 2 and the middle triangle inequality follows from point 1.

3 EXPERIMENTS

3.1 COMPUTATIONAL EFFICIENCY

Figure 2: Wall-clock time of the Sinkhorn algorithm for entropic OT with varying $\lambda$ and of the EST method for different numbers of slices $L$, as the size $N$ of the empirical measures increases.

In practice, to compute the EST plan (10) and distance (11), we sample $L$ unit vectors $\{\theta_\ell\}_{\ell=1}^L$ (or slices) uniformly from $\mathbb{S}^{d-1}$ to approximate $\bar{\gamma}^{\mu^1,\mu^2} \approx \frac1L\sum_{\ell=1}^L \gamma^{\mu^1,\mu^2}_{\theta_\ell}$. For empirical measures $\mu^1,\mu^2$ in $\mathbb{R}^d$ of size $N$, the computational complexity of our proposed EST method is of order $O((L+d)N^2)$. A detailed analysis is provided in Appendix C. Additionally, Figure 2 presents the wall-clock time of the Sinkhorn algorithm, used to compute entropic OT for various values of the regularization parameter $\lambda$, and of the proposed EST method, evaluated with different numbers $L$ of slices, as the size $N$ of the support of the discrete measures increases. The wall-clock times are averaged over 10 runs.
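A minimal Monte Carlo sketch of this approximation (ours, building on the `lifted_plan_general` helper sketched in Section 2.2.2; it is an illustration, not our released code):

```python
import numpy as np

def est_plan_and_distance(X, p_w, Y, q_w, L=128, p=2, seed=0):
    """Monte Carlo approximation of the EST plan (10) and distance (11):
    average the lifted plans over L uniformly drawn slices."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    gamma_bar = np.zeros((len(p_w), len(q_w)))
    for _ in range(L):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)   # uniform direction on S^{d-1}
        gamma_bar += lifted_plan_general(X, p_w, Y, q_w, theta)
    gamma_bar /= L
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** p
    return gamma_bar, float(np.sum(cost * gamma_bar) ** (1 / p))
```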
3.2 COMPARISON OF TRANSPORT PLANS

Figure 3 illustrates the behavior of different transport schemes: the optimal transport plan for $W_2(\cdot,\cdot)$, the transport plan obtained by solving an entropically regularized transportation problem between the source and target probability measures, and the new expected sliced transport plan $\bar{\gamma}$. We include comparisons with entropic regularization because it is one of the most popular approaches, as it allows for the use of Sinkhorn's algorithm. From the figure, we observe that while $\bar{\gamma}$ promotes mass splitting, this phenomenon is less pronounced than in the entropically regularized OT scheme. This observation will be revisited in Subsection 3.3.

Figure 3: Depiction of transport plans (an optimal transport plan, a plan obtained from solving an entropically regularized transport problem, and the proposed expected sliced transport plan) between source (orange) and target (blue) for four different configurations of masses. The measures in the left and right panels are concentrated on the same particles, respectively; however, the top row depicts measures with uniform mass, while the bottom row depicts measures with random, non-uniform mass. Transport plans are shown as gray assignments and as $n\times m$ heat matrices encoding the amount of mass transported (dark color = no transportation, bright color = more transportation), where $n$ is the number of particles on which the source measure is concentrated, and $m$ ($= 2n$) is the number of particles on which the target measure is concentrated.

3.3 TEMPERATURE APPROACH

Given discrete probability measures $\mu^1,\mu^2$, we perform the new expected sliced transportation scheme by using the following averaging measure $\sigma_\tau\ll\mathcal{U}(\mathbb{S}^{d-1})$ on the sphere:

$$d\sigma_\tau(\theta) = \frac{e^{-\tau\mathcal{D}^p_p(\mu^1,\mu^2;\theta)}}{\int_{\mathbb{S}^{d-1}} e^{-\tau\mathcal{D}^p_p(\mu^1,\mu^2;\theta')}\,d\theta'}\,d\theta, \tag{14}$$

where $\mathcal{D}_p(\mu^1,\mu^2;\theta)$ is given by (12), and $\tau\ge 0$ is a hyperparameter we will refer to as the temperature (note that increasing $\tau$ corresponds to reducing the temperature). If $\tau = 0$, then $\sigma_0 = \mathcal{U}(\mathbb{S}^{d-1})$. However, when $\tau\neq 0$, $\sigma_\tau$ is a probability measure on $\mathbb{S}^{d-1}$ with density given by (14), which depends on the source and target measures $\mu^1$ and $\mu^2$. We have chosen this measure $\sigma_\tau$ because it provides a general parametric framework that interpolates between our proposed scheme with the uniform measure ($\tau = 0$) and min-SWGG (Mahey et al., 2023), as the EST distance approaches min-SWGG when $\tau\to\infty$. For the implementations, we use

$$\sigma_\tau(\theta_\ell) = \frac{e^{-\tau\mathcal{D}^p_p(\mu^1,\mu^2;\theta_\ell)}}{\sum_{\ell'=1}^L e^{-\tau\mathcal{D}^p_p(\mu^1,\mu^2;\theta_{\ell'})}}, \tag{15}$$

where $L$ represents the number of slices or unit vectors $\theta_1,\dots,\theta_L\in\mathbb{S}^{d-1}$. Figure 4 illustrates that as $\tau\to\infty$, the weights used for averaging the lifted transport plans converge to a one-hot vector, i.e., the slice minimizing $\mathcal{D}_p(\mu^1,\mu^2;\theta)$ dominates, leading to a transport plan with fewer mass splits. For the visualization we have used source $\mu^1$ and target $\mu^2$ uniform probability measures concentrated on different numbers of particles. For consistency, the configurations are the same as in Figure 3.

Figure 4: The effect of increasing $\tau$ (i.e., decreasing temperature) on the expected sliced plan. The left-most column shows the OT plan, and the rest of the columns show the expected sliced plan as a function of increasing $\tau$. The right-most column depicts that the expected sliced plan recovers the min-SWGG (Mahey et al., 2023) transportation map.
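In implementation terms, (15) is a softmax of the negated per-slice costs. A minimal sketch (ours, for illustration):

```python
import numpy as np

def temperature_weights(per_slice_costs, tau):
    """Discrete temperature measure of Eq. (15): a softmax of the
    negated per-slice costs D_p^p(mu^1, mu^2; theta_l).
    tau = 0 recovers uniform weights; tau -> inf concentrates on the
    cost-minimizing slice (the min-SWGG regime)."""
    logits = -tau * np.asarray(per_slice_costs)
    logits -= logits.max()          # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

# With these weights, the EST plan becomes sum_l w[l] * gamma_theta_l
# instead of the uniform average (1/L) * sum_l gamma_theta_l.
```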
3.4 INTERPOLATION

We use the Point Cloud MNIST 2D dataset (Garcia, 2023), a reimagined version of the classic MNIST dataset (LeCun, 1998), where each image is represented as a set of weighted 2D point clouds instead of pixel values. In Figure 5, we illustrate the interpolation between two point clouds that represent the digits 7 and 6. Since the point clouds are discrete probability measures with non-uniform mass, we perform three different interpolation schemes via $((1-t)x + ty)_{\#}\gamma$, where $0\le t\le 1$, for different transport plans $\gamma$, namely:

1. $\gamma = \gamma^*$, an optimal transport plan for $W_2(\cdot,\cdot)$;
2. a transport plan $\gamma$ obtained from solving an entropically regularized transportation problem (performed for three different regularization parameters $\lambda$);
3. $\gamma = \bar{\gamma}$: the expected sliced transport plan computed using $\sigma_\tau$ given by formula (14) (or (15) for implementations) for four different values of the temperature parameter $\tau$.

Figure 5: Interpolation between two point clouds via $((1-t)x + ty)_{\#}\gamma$, where $\gamma$ is the optimal transport plan for $W_2(\cdot,\cdot)$ (top left), the transport plan obtained from entropic OT with various regularization parameters (bottom left), and the EST plan for different temperatures $\tau$ (right).

As the temperature parameter $\tau$ increases, the transport plan exhibits less mass splitting, similar to the effect of decreasing the regularization parameter $\lambda$ in entropic OT. However, unlike entropic OT, where smaller regularization parameters require more iterations for convergence, the computation time for expected sliced transport remains unaffected by changes in temperature.
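A minimal sketch (ours) of the pushforward $((1-t)x + ty)_{\#}\gamma$ for a discrete plan stored as a matrix:

```python
import numpy as np

def interpolate_plan(X, Y, gamma, t):
    """Pushforward ((1-t)x + t*y)_# gamma for a discrete plan gamma:
    every pair (x_i, y_j) carrying mass gamma[i, j] > 0 contributes an
    atom at (1-t)*x_i + t*y_j with that mass."""
    i, j = np.nonzero(gamma)
    points = (1 - t) * X[i] + t * Y[j]   # atom locations at time t
    masses = gamma[i, j]                 # atom weights (sum to 1)
    return points, masses

# E.g., sweeping t over np.linspace(0, 1, 9) traces the interpolation
# path induced by any of the plans compared in Figure 5.
```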
3.5 WEAK CONVERGENCE

Figure 6: Discrepancies calculated from transport plans between $\mu_t$ and $\nu$, when $\mu_t\to\nu$, as a function of $t\in[0,1]$. (Top panels: $\mu_0$, $\mu_{0.25}$, $\mu_{0.5}$, $\mu_{0.75}$, $\mu_1 = \nu$.)

Given finite discrete probability measures $\mu$ and $\nu$, we consider $\mu_t$, for $0\le t\le 1$, the Wasserstein geodesic between $\mu$ and $\nu$. In particular, $\mu_t$ is a curve of probability measures that interpolates $\mu$ and $\nu$, that is, $\mu_0 = \mu$ and $\mu_1 = \nu$. Moreover, we have that $W_2(\mu_t,\nu) = (1-t)W_2(\mu,\nu)\to 0$ as $t\to 1$, or equivalently, we can say $\mu_t$ converges in the weak*-topology to $\nu$. Figure 6 illustrates that the expected sliced distance also satisfies $\mathcal{D}_2(\mu_t,\nu)\to 0$ as $t\to 1$. Indeed, this experimental conclusion is justified by the following theoretical result: Let $\mu,\mu_n\in\mathcal{P}(\Omega)$ be discrete measures with finite or countable support, where $\Omega\subset\mathbb{R}^d$ is compact. Assume $\sigma\ll\mathcal{U}(\mathbb{S}^{d-1})$. Then, $\mathcal{D}_p(\mu_n,\mu)\to 0$ if and only if $\mu_n\rightharpoonup\mu$. We present its proof in Appendix B.

For the experiment, $\mu$ and $\nu$ are chosen to be discrete measures with $N$ particles of uniform mass, sampled from two Gaussian distributions (see Figure 6, top). For different values of time, $0\le t\le 1$, we compute different discrepancies, $\sum_{i=1}^N\sum_{j=1}^N \|x_i - y_j\|^2\,\gamma^{\mu_t,\nu}_{ij}$, calculated for various transport plans: (1) the optimal transport plan, (2) the outer product plan $\mu_t\otimes\nu$, (3) the plan obtained from entropic OT with two different regularization parameters $\lambda$, and (4) our proposed expected sliced plan computed with $\sigma_\tau$ given in (14) for four different temperature parameters $\tau$. As $\mu_t$ converges to $\nu$, it is evident that both the OT and our proposed EST discrepancies approach zero, while those of the entropic OT and outer product plans, as expected, do not converge to zero.

3.6 TRANSPORT-BASED EMBEDDING

Following the linear optimal transportation (LOT) framework, also referred to as the Wasserstein or transport-based embedding framework (Wang et al., 2013; Kolouri et al., 2021; Nenna & Pass, 2023; Bai et al., 2023; Martín et al., 2024), we investigate the application of our proposed transport plan in point cloud classification. Let $\mu^0 = \sum_{i=1}^N\alpha_i\delta_{x_i}$ denote a reference probability measure and let $\mu^k = \sum_{j=1}^{N_k}\beta^k_j\delta_{y^k_j}$ denote a target probability measure. Let $\gamma^{\mu^0,\mu^k}$ be a transport plan between $\mu^0$ and $\mu^k$, and define the barycentric projection (Ambrosio et al., 2011, Definition 5.4.2) of this plan as:

$$b_i(\gamma^{\mu^0,\mu^k}) := \frac{1}{\alpha_i}\sum_{j=1}^{N_k} \gamma^{\mu^0,\mu^k}_{ij}\,y^k_j, \quad i\in\{1,\dots,N\}. \tag{16}$$

Note that $b_i(\gamma^{\mu^0,\mu^k})$ represents the center of mass to which $x_i$ from the reference measure is transported according to the transport plan $\gamma^{\mu^0,\mu^k}$. When $\gamma^{\mu^0,\mu^k}$ is the OT plan, the LOT framework of Wang et al. (2013) uses

$$[\phi(\mu^k)]_i := b_i(\gamma^{\mu^0,\mu^k}) - x_i, \quad i\in\{1,\dots,N\}, \tag{17}$$

as an embedding $\phi$ for the measure $\mu^k$. This framework, as demonstrated in Kolouri et al. (2021), can be used to define a permutation-invariant embedding for sets of features and, more broadly, point clouds. More precisely, given a point cloud $Y_k = \{(\beta^k_j, y^k_j)\}_{j=1}^{N_k}$, where $\sum_{j=1}^{N_k}\beta^k_j = 1$ and $\beta^k_j$ represents the mass at location $y^k_j$, we represent this point cloud as a discrete measure $\mu^k$.
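A minimal sketch (ours) of the embedding in (16)-(17) for a plan stored as an $N\times N_k$ matrix:

```python
import numpy as np

def lot_embedding(X_ref, alpha, gamma, Y):
    """Transport-based embedding of Eqs. (16)-(17): barycentric
    projection of the plan gamma, minus the reference locations."""
    # b_i = (1 / alpha_i) * sum_j gamma_ij * y_j  (center of mass)
    barycenters = (gamma @ Y) / alpha[:, None]
    return barycenters - X_ref   # displacement field [phi(mu^k)]_i

# Any of the plans discussed above (OT, entropic, EST) can be plugged
# in as gamma; stacking the flattened embeddings of all point clouds
# yields fixed-size, permutation-invariant features for a classifier.
```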
Figure 7: t-SNE visualization of the embeddings computed using different transport plans, along with the corresponding logistic regression accuracy for each embedding (92.3%, 94.3%, 92.2%, and 96.8%). The t-SNE plots are generated for embeddings with a reference size of $N = 100$, and for EST we used $L = 128$ slices. The table shows the accuracy of the embeddings as a function of reference size $N$. For the table, the regularization parameter for entropic OT is set to $\lambda = 10$, and for EST the temperature is set to $\tau = 0$ with $L = 128$ slices. Lastly, the plot on the bottom left shows the performance of EST, when $N = 100$ and $L = 128$, as a function of the temperature parameter $\tau$.

In this section, we use a reference measure with $N$ particles of uniform mass to embed the digits from the Point Cloud MNIST 2D dataset using various transport plans. We then perform a logistic regression on the embedded digits and present the results in Figure 7. The figure shows a 2D t-SNE visualization of the embedded point clouds using: (1) the OT plan, (2) the entropic OT plan with two different regularization parameters, and (3) our expected sliced plan with two temperature parameters (using $N = 100$ for all methods). In addition, we report the test accuracy of these embeddings for different reference sizes.

Lastly, we make an interesting observation about the embedding computed using EST. As we reduce the temperature, i.e., increase $\tau$, the embedding becomes progressively less informative. We attribute this to the dependence of $\sigma_\tau$ on $\mu^k$. In other words, the embedding is computed with respect to a different $\sigma_\tau$ for each measure, leading to inaccuracies when comparing the embedded measures. This finding also suggests that the min-SWGG framework, while meritorious, may not be well suited for defining a transport-based embedding. Additionally, we refer the reader to Appendix D for a similar classification experiment using a more complex dataset: ModelNet40.

4 CONCLUSIONS

In this paper, we explored the feasibility of constructing transport plans between two probability measures using the computationally efficient sliced optimal transport (OT) framework. We introduced the Expected Sliced Transport (EST) framework and proved that it provides a valid metric for comparing discrete probability measures while preserving the computational efficiency of sliced transport and enabling explicit mass coupling. Through a diverse set of numerical experiments, we illustrated the behavior of this newly introduced transport plan. Additionally, we demonstrated how the temperature parameter in our approach offers a flexible framework that connects our method to the recently proposed min-Sliced Wasserstein Generalized Geodesics (min-SWGG) framework. Finally, the theoretical insights and experimental results presented here open up new avenues for developing efficient transport-based algorithms in machine learning and beyond.

REFERENCES

Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 2017.

Luigi Ambrosio, Edoardo Mainini, and Sylvia Serfaty. Gradient flow of the Chapman-Rubinstein-Schatzman model for signed vortices. In Annales de l'IHP Analyse non linéaire, volume 28, pp. 217-246, 2011.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.

Yikun Bai, Ivan Vladimir Medri, Rocio Diaz Martin, Rana Shahroz, and Soheil Kolouri. Linear optimal partial transport embedding. In International Conference on Machine Learning, pp. 1492-1520. PMLR, 2023.

Saurav Basu, Soheil Kolouri, and Gustavo K. Rohde. Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry. Proceedings of the National Academy of Sciences, 111(9):3448-3453, 2014.

Charlotte Bunne, Geoffrey Schiebinger, Andreas Krause, Aviv Regev, and Marco Cuturi. Optimal transport for single-cell and spatial omics. Nature Reviews Methods Primers, 4(1):58, 2024.

Laetitia Chapel, Mokhtar Z. Alaya, and Gilles Gasso. Partial optimal transport with applications on positive-unlabeled learning. Advances in Neural Information Processing Systems, 33:2903-2913, 2020.

Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014, Proceedings, Part I, pp. 274-289. Springer, 2014.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292-2300, 2013.

Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pp. 1367-1376. PMLR, 2018.

Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681-2690. PMLR, 2019.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. Advances in Neural Information Processing Systems, 28, 2015.

Cristian Garcia. Point Cloud MNIST 2D dataset, 2023. URL https://www.kaggle.com/datasets/cristiangarcia/pointcloudmnist2d. Accessed: 2024-09-28.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepčev, and Gustavo K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43-59, 2017. doi: 10.1109/MSP.2017.2695801.
Soheil Kolouri, Navid Naderializadeh, Gustavo K. Rohde, and Heiko Hoffmann. Wasserstein embedding for graph learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=AAes_3W-2z.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Francesco Maggi. Optimal Mass Transport on Euclidean Spaces, volume 207. Cambridge University Press, 2023.

Guillaume Mahey, Laetitia Chapel, Gilles Gasso, Clément Bonet, and Nicolas Courty. Fast optimal transport through sliced generalized Wasserstein geodesics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=n3XuYdvhNW.

Rocío Díaz Martín, Ivan V. Medri, and Gustavo Kunde Rohde. Data representation with optimal transport. arXiv preprint arXiv:2406.15503, 2024.

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802-20812, 2020.

Luca Nenna and Brendan Pass. Transport type metrics on the space of probability measures involving singular base measures. Applied Mathematics & Optimization, 87(2):28, 2023.

Khai Nguyen and Nhat Ho. Energy-based sliced Wasserstein distance. Advances in Neural Information Processing Systems, 36, 2024.

François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, pp. 5072-5081. PMLR, 2019.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435-446. Springer, 2011.

Julien Rabin, Sira Ferradans, and Nicolas Papadakis. Adaptive color transfer with relaxed optimal transport. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 4852-4856. IEEE, 2014.

Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of Wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 186-195. PMLR, 2019.

Filippo Santambrogio. Optimal Transport for Applied Mathematicians, volume 55. Birkhäuser, NY, 2015.

Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision, 56:238-259, 2016.

Cédric Villani. Optimal Transport: Old and New. Springer, 2009. URL https://www.cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf.

Cédric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2021.

Wei Wang, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101:254-269, 2013.

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912-1920, 2015.
A PROOF OF THEOREM 2.10: METRIC PROPERTY OF THE EXPECTED SLICED DISCREPANCY FOR DISCRETE PROBABILITY MEASURES

A.1 PRELIMINARIES ON EXPECTED SLICED TRANSPORTATION

Remark A.1 (Figure 1). Let us elaborate on Figure 1.

(a) Visualization for uniform discrete measures $\mu^1 = \frac13(\delta_{x_1}+\delta_{x_2}+\delta_{x_3})$ (green circles), $\mu^2 = \frac13(\delta_{y_1}+\delta_{y_2}+\delta_{y_3})$ (blue circles) in $\mathcal{P}^{(N)}(\mathbb{R}^d)$ with $N = 3$. Given a direction $\theta$ (red unit vector), when sorting $\{\theta\cdot x_i\}_{i=1}^3$ and $\{\theta\cdot y_j\}_{j=1}^3$ we use the permutations $\zeta_\theta$ and $\tau_\theta$ given by $\zeta_\theta(2) = 1$, $\zeta_\theta(1) = 2$, $\zeta_\theta(3) = 3$, and $\tau_\theta(1) = 1$, $\tau_\theta(3) = 2$, $\tau_\theta(2) = 3$. The optimal transport map between $\theta_{\#}\mu^1$ (green triangles) and $\theta_{\#}\mu^2$ (blue triangles) is given by the following assignment:

$$\theta\cdot x_{\zeta_\theta^{-1}(1)} = \theta\cdot x_2 \mapsto \theta\cdot y_1 = \theta\cdot y_{\tau_\theta^{-1}(1)}, \quad \theta\cdot x_{\zeta_\theta^{-1}(2)} = \theta\cdot x_1 \mapsto \theta\cdot y_3 = \theta\cdot y_{\tau_\theta^{-1}(2)}, \quad \theta\cdot x_{\zeta_\theta^{-1}(3)} = \theta\cdot x_3 \mapsto \theta\cdot y_2 = \theta\cdot y_{\tau_\theta^{-1}(3)}.$$

This gives rise to the plan $\Lambda^{\mu^1,\mu^2}_\theta$ given in (7), which is represented by solid arrows in the first panel. The lifted plan $\gamma^{\mu^1,\mu^2}_\theta$ defined in (8) is represented in the second panel by dashed assignments.

(b) Visualization for finite discrete measures $\mu^1 = 0.1\delta_{x_1} + 0.3\delta_{x_2} + 0.6\delta_{x_3}$ (green circles), $\mu^2 = 0.5\delta_{y_1} + 0.3\delta_{y_2} + 0.2\delta_{y_3}$ (blue circles). When projecting according to a given direction $\theta$, the locations with green masses 0.3 and 0.6 overlap, as do the locations with blue masses 0.2 and 0.3. Thus, the mass of $\theta_{\#}\mu^1$ is concentrated at two green points (triangles) on the line determined by $\theta$, carrying 0.1 and 0.9 of the total mass, and similarly $\theta_{\#}\mu^2$ is concentrated at two points (blue triangles), each carrying 0.5 of the total mass.

Now, let us prove Lemma 2.3, that is, let us show that each measure $\gamma^{\mu^1,\mu^2}_\theta$ defined in (9) is a transport plan in $\Gamma(\mu^1,\mu^2)$.

Proof of Lemma 2.3. Let $x\in\mathbb{R}^d$. First, if $p(x) = 0$, then $u^{\mu^1,\mu^2}_\theta(x,y) = 0$ for every $y\in\mathbb{R}^d$, and so $\gamma^{\mu^1,\mu^2}_\theta(\{x\}\times\mathbb{R}^d) = 0 = p(x) = \mu^1(\{x\})$. Now, assume that $p(x)\neq 0$; then

$$\begin{aligned}
\sum_{y\in\mathbb{R}^d} u^{\mu^1,\mu^2}_\theta(x,y) &= \frac{p(x)}{P(\tilde{x}^\theta)} \sum_{y\in\mathbb{R}^d:\,q(y)\neq 0} \frac{q(y)}{Q(\tilde{y}^\theta)}\,\Lambda^{\mu^1,\mu^2}_\theta(\{(\tilde{x}^\theta,\tilde{y}^\theta)\}) \\
&= \frac{p(x)}{P(\tilde{x}^\theta)} \sum_{\tilde{y}^\theta\in\mathbb{R}^d/\sim_\theta:\,Q(\tilde{y}^\theta)\neq 0} Q(\tilde{y}^\theta)\,\frac{1}{Q(\tilde{y}^\theta)}\,\Lambda^{\mu^1,\mu^2}_\theta(\{(\tilde{x}^\theta,\tilde{y}^\theta)\}) \\
&= \frac{p(x)}{P(\tilde{x}^\theta)} \sum_{\tilde{y}^\theta\in\mathbb{R}^d/\sim_\theta} \Lambda^{\mu^1,\mu^2}_\theta(\{(\tilde{x}^\theta,\tilde{y}^\theta)\}) = \frac{p(x)}{P(\tilde{x}^\theta)}\,P(\tilde{x}^\theta) = p(x).
\end{aligned}$$

Thus, $\gamma^{\mu^1,\mu^2}_\theta(\{x\}\times\mathbb{R}^d) = p(x) = \mu^1(\{x\})$ for every $x\in\mathbb{R}^d$. Similarly, $\sum_{x\in\mathbb{R}^d} u^{\mu^1,\mu^2}_\theta(x,y) = q(y)$, or equivalently, $\gamma^{\mu^1,\mu^2}_\theta(\mathbb{R}^d\times\{y\}) = q(y) = \mu^2(\{y\})$ for every $y\in\mathbb{R}^d$.

Remark A.2 (Expected Sliced Transport for uniform discrete measures). Let $\mu^1,\mu^2\in\mathcal{P}^{(N)}(\mathbb{R}^d)$ be of the form $\mu^1 = \frac1N\sum_{i=1}^N\delta_{x_i}$, $\mu^2 = \frac1N\sum_{j=1}^N\delta_{y_j}$. Then the expected sliced transport plan between $\mu^1$ and $\mu^2$, $\bar{\gamma}^{\mu^1,\mu^2} = \mathbb{E}_{\theta\sim\sigma}[\gamma^{\mu^1,\mu^2}_\theta]$, defines a discrete measure in $\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ supported on $\{(x_i,y_j)\}_{i,j\in[N]}$, where it takes the values

$$\bar{\gamma}^{\mu^1,\mu^2}(\{(x_i,y_j)\}) = \int_{\mathbb{S}^{d-1}} \gamma^{\mu^1,\mu^2}_\theta(\{(x_i,y_j)\})\,d\sigma(\theta), \quad i\in[N],\, j\in[N]. \tag{18}$$

Thus, it can be regarded as an $N\times N$ matrix whose $(i,j)$-entry is given by (18). Moreover, since each $N\times N$ matrix $u^{\mu^1,\mu^2}_\theta$ defined by (6) can be obtained by swapping rows of the $N\times N$ identity matrix multiplied by $1/N$, there are finitely many such matrices (precisely, $N!$ matrices in total). Hence, the function $\theta\mapsto u^{\mu^1,\mu^2}_\theta$ is a piecewise constant matrix-valued function. Thus, the function $\theta\mapsto\gamma^{\mu^1,\mu^2}_\theta$ (where $\gamma^{\mu^1,\mu^2}_\theta$ is as in (8)) is a measurable function. This can be generalized to any pair of finite discrete measures, as in the following remarks.

Remark A.3 (Expected Sliced Transport for finite discrete measures). Consider arbitrary finite discrete measures $\mu^1 = \sum_{i=1}^n p(x_i)\delta_{x_i}$ and $\mu^2 = \sum_{j=1}^m q(y_j)\delta_{y_j}$, i.e., discrete measures with finite support.
Fix $x_i, y_j\in\mathbb{R}^d$; then the map

$$\theta \mapsto \frac{p(x_i)\,q(y_j)}{P(\tilde{x}_i^\theta)\,Q(\tilde{y}_j^\theta)}$$

equals $1$ for all but a finite number of directions. This is due to the fact that only for finitely many directions $\theta$ do we obtain overlaps of the projected points $\theta\cdot x$. The optimal transport plan $\Lambda^{\mu^1,\mu^2}_\theta$ is given by matching from left to right until filling the target bins: that is, one has to order the points as in (3) and consider an increasing assignment plan. Since the order of $\{\theta\cdot x_i\}_{i=1}^n$ and $\{\theta\cdot y_j\}_{j=1}^m$ changes a finite number of times when varying $\theta\in\mathbb{S}^{d-1}$, the function $\theta\mapsto\Lambda^{\mu^1,\mu^2}_\theta$ takes a finite number of possible transport plan options. Thus, the range of $\theta\mapsto u^{\mu^1,\mu^2}_\theta$ is finite.

Remark A.4 ($\bar{\gamma}^{\mu^1,\mu^2}$ is well-defined for finite discrete measures). First, we notice that for each $\theta$, the support of $\Lambda^{\mu^1,\mu^2}_\theta$ is finite or countable, and so the support of $\gamma^{\mu^1,\mu^2}_\theta$ is also finite or countable. Given an arbitrary point $(x,y)\in\mathbb{R}^d\times\mathbb{R}^d$, we have to justify that the function $\mathbb{S}^{d-1}\ni\theta\mapsto u^{\mu^1,\mu^2}_\theta(x,y)$ is (Borel-)measurable: If the supports of $\mu^1$ and $\mu^2$ are finite, by Remark A.3, $\theta\mapsto u^{\mu^1,\mu^2}_\theta$ is a piecewise constant function, and so it is measurable and integrable on the sphere. For the general case, when the supports of $\mu^1$ and $\mu^2$ are countable, we refer to Remark A.14.

Lemma A.5 ($\bar{\gamma}^{\mu^1,\mu^2}$ is a transport plan between $\mu^1$ and $\mu^2$). We have that $\bar{\gamma}^{\mu^1,\mu^2}\in\Gamma(\mu^1,\mu^2)$, i.e., it has marginals $\mu^1$ and $\mu^2$. This is because for each $\theta\in\mathbb{S}^{d-1}$, $\gamma^{\mu^1,\mu^2}_\theta\in\Gamma(\mu^1,\mu^2)$, and because $\sigma$ is a probability measure on $\mathbb{S}^{d-1}$: $\bar{\gamma}^{\mu^1,\mu^2}$ is a convex combination of the transport plans $\gamma^{\mu^1,\mu^2}_\theta$, and since $\Gamma(\mu^1,\mu^2)$ is a convex set, we obtain $\bar{\gamma}^{\mu^1,\mu^2}\in\Gamma(\mu^1,\mu^2)$. Precisely, for every test function $\phi:\mathbb{R}^d\to\mathbb{R}$,

$$\int_{\mathbb{R}^d\times\mathbb{R}^d} \phi(x)\,d\bar{\gamma}^{\mu^1,\mu^2}(x,y) = \int_{\mathbb{S}^{d-1}}\int_{\mathbb{R}^d\times\mathbb{R}^d} \phi(x)\,d\gamma^{\mu^1,\mu^2}_\theta(x,y)\,d\sigma(\theta) = \int_{\mathbb{S}^{d-1}}\int_{\mathbb{R}^d} \phi(x)\,d\mu^1(x)\,d\sigma(\theta) = \int_{\mathbb{R}^d} \phi(x)\,d\mu^1(x).$$

Similarly, $\int_{\mathbb{R}^d\times\mathbb{R}^d}\psi(y)\,d\bar{\gamma}^{\mu^1,\mu^2}(x,y) = \int_{\mathbb{R}^d}\psi(y)\,d\mu^2(y)$, and so $\bar{\gamma}^{\mu^1,\mu^2}$ has marginals $\mu^1$ and $\mu^2$.

A.2 AN AUXILIARY RESULT

For simplicity, in this paper we consider the strictly convex cost $\|x-y\|^p$ ($1 < p < \infty$). Also, in this section we consider $\sigma = \mathcal{U}(\mathbb{S}^{d-1})$, and in this case we write $d\sigma(\theta) = d\theta$.

Proposition A.6. Let $\Omega\subset\mathbb{R}^d$ be a compact set, and let $\mu^1,\mu^2\in\mathcal{P}(\Omega)$. Let $(\mu^1_n)_{n\in\mathbb{N}}, (\mu^2_n)_{n\in\mathbb{N}}\subseteq\mathcal{P}(\Omega)$ be sequences such that, for $i = 1,2$, $\mu^i_n\rightharpoonup\mu^i$ as $n\to\infty$, where the limit is in the weak*-topology. For each $n\in\mathbb{N}$, consider optimal transport plans $\gamma_n\in\Gamma^*(\mu^1_n,\mu^2_n)$. Then, there exists a subsequence such that $\gamma_{n_k}\rightharpoonup\gamma$, for some optimal transport plan $\gamma\in\Gamma^*(\mu^1,\mu^2)$.

Proof. As $(\gamma_n)_{n\in\mathbb{N}}$ is a sequence of probability measures (their mass is 1), by the Banach-Alaoglu theorem there exists a subsequence such that $\gamma_{n_k}\rightharpoonup\gamma$, for some $\gamma\in\mathcal{P}(\Omega\times\Omega)$. It is easy to see that the limit $\gamma$ has marginals $\mu^1,\mu^2$. Indeed, given any test functions $\phi,\psi\in C(\Omega)$, since each $\gamma_n$ has marginals $\mu^1_n,\mu^2_n$, we have

$$\int_{\Omega\times\Omega}\phi(x)\,d\gamma_n(x,y) = \int_\Omega\phi(x)\,d\mu^1_n(x) \quad\text{and}\quad \int_{\Omega\times\Omega}\psi(y)\,d\gamma_n(x,y) = \int_\Omega\psi(y)\,d\mu^2_n(y),$$

and, taking the limit as $n\to\infty$, we obtain

$$\int_{\Omega\times\Omega}\phi(x)\,d\gamma(x,y) = \int_\Omega\phi(x)\,d\mu^1(x) \quad\text{and}\quad \int_{\Omega\times\Omega}\psi(y)\,d\gamma(x,y) = \int_\Omega\psi(y)\,d\mu^2(y).$$

Now, we only have to prove the optimality of $\gamma$ for the OT problem between $\mu^1$ and $\mu^2$. Since $(x,y)\mapsto\|x-y\|^p$ is continuous and $\gamma_n,\gamma$ are compactly supported, by using that, for each $n\in\mathbb{N}$, $\gamma_n$ is optimal for the OT problem between $\mu^1_n$ and $\mu^2_n$, we have

$$\lim_n W_p(\mu^1_n,\mu^2_n)^p = \lim_n \int_{\Omega\times\Omega}\|x-y\|^p\,d\gamma_n(x,y) = \int_{\Omega\times\Omega}\|x-y\|^p\,d\gamma(x,y) \ge W_p(\mu^1,\mu^2)^p. \tag{19}$$

Also, by hypothesis and (Santambrogio, 2015, Theorem 5.10), we have that, for any $\nu\in\mathcal{P}(\Omega)$, $\lim_n W_p(\mu^1_n,\nu) = W_p(\mu^1,\nu)$ and $\lim_n W_p(\nu,\mu^2_n) = W_p(\nu,\mu^2)$.
So, by using the triangle inequality for the $p$-Wasserstein distance, we get

$$\lim_n W_p(\mu^1_n,\mu^2_n) \le \lim_n W_p(\mu^1_n,\mu^1) + \lim_n W_p(\mu^2,\mu^2_n) + W_p(\mu^1,\mu^2) = 0 + W_p(\mu^1,\mu^2) = W_p(\mu^1,\mu^2). \tag{20}$$

Therefore, from (20) and (19) we have that $\lim_n W_p(\mu^1_n,\mu^2_n) = W_p(\mu^1,\mu^2)$. In particular, in (19) the following equality holds:

$$\int_{\Omega\times\Omega}\|x-y\|^p\,d\gamma(x,y) = W_p(\mu^1,\mu^2)^p.$$

As a result, $\gamma$ is optimal for the OT problem between $\mu^1$ and $\mu^2$.

A.3 FINITE DISCRETE MEASURES WITH RATIONAL WEIGHTS

Let us denote by $\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$ the set of finite discrete probability measures in $\mathbb{R}^d$ with rational weights; that is, $\mu\in\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$ if and only if it is of the form $\mu = \sum_{i=1}^m q_i\delta_{x_i}$ with $x_i\in\mathbb{R}^d$, $q_i\in\mathbb{Q}$ for all $i\in[m]$, for some $m\in\mathbb{N}$, and $\sum_{i=1}^m q_i = 1$. We have $\mathcal{P}^{(N)}(\mathbb{R}^d)\subseteq\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$ for all $N\in\mathbb{N}$. In the definition of a uniform discrete measure $\mu = \frac1N\sum_{i=1}^N\delta_{x_i}\in\mathcal{P}^{(N)}(\mathbb{R}^d)$, one can allow $x_i = x_j$ for some pairs of indices $i\neq j$.

Proposition A.7. $\mathcal{D}_p(\cdot,\cdot)$ defined by (11) is a metric on $\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$.

Proof. This was essentially pointed out in Remark 2.9: when restricting to the space $\mathcal{P}^{(N)}(\mathbb{R}^d)$, our $\mathcal{D}_p(\cdot,\cdot)$ and the Projected Wasserstein distance presented in (Rowland et al., 2019) coincide, and Rowland et al. (2019) prove the metric property. We recall here their main argument, which is used for showing the triangle inequality. Let $\mu^1,\mu^2,\mu^3\in\mathcal{P}^{(N)}(\mathbb{R}^d)$ be of the form $\mu^1 = \frac1N\sum_{i=1}^N\delta_{x_i}$, $\mu^2 = \frac1N\sum_{i=1}^N\delta_{y_i}$, $\mu^3 = \frac1N\sum_{i=1}^N\delta_{z_i}$. Fix $\theta\in\mathbb{S}^{d-1}$, and consider permutations $\zeta_\theta,\tau_\theta,\xi_\theta\in S_N$ so that

$$\theta\cdot x_{\zeta_\theta^{-1}(1)}\le\cdots\le\theta\cdot x_{\zeta_\theta^{-1}(N)}, \qquad \theta\cdot y_{\tau_\theta^{-1}(1)}\le\cdots\le\theta\cdot y_{\tau_\theta^{-1}(N)}, \qquad \theta\cdot z_{\xi_\theta^{-1}(1)}\le\cdots\le\theta\cdot z_{\xi_\theta^{-1}(N)}.$$

Thus, the key idea is that

$$\mathcal{D}_p(\mu^1,\mu^2)^p = \int_{\mathbb{S}^{d-1}} \frac1N\sum_{i=1}^N \left\|x_{\zeta_\theta^{-1}(i)} - y_{\tau_\theta^{-1}(i)}\right\|^p d\theta = \sum_{(\zeta,\tau,\xi)} q(\zeta,\tau,\xi)\,\frac1N\sum_{i=1}^N \left\|x_{\zeta^{-1}(i)} - y_{\tau^{-1}(i)}\right\|^p,$$

$$\mathcal{D}_p(\mu^2,\mu^3)^p = \int_{\mathbb{S}^{d-1}} \frac1N\sum_{i=1}^N \left\|y_{\tau_\theta^{-1}(i)} - z_{\xi_\theta^{-1}(i)}\right\|^p d\theta = \sum_{(\zeta,\tau,\xi)} q(\zeta,\tau,\xi)\,\frac1N\sum_{i=1}^N \left\|y_{\tau^{-1}(i)} - z_{\xi^{-1}(i)}\right\|^p,$$

$$\mathcal{D}_p(\mu^3,\mu^1)^p = \int_{\mathbb{S}^{d-1}} \frac1N\sum_{i=1}^N \left\|z_{\xi_\theta^{-1}(i)} - x_{\zeta_\theta^{-1}(i)}\right\|^p d\theta = \sum_{(\zeta,\tau,\xi)} q(\zeta,\tau,\xi)\,\frac1N\sum_{i=1}^N \left\|z_{\xi^{-1}(i)} - x_{\zeta^{-1}(i)}\right\|^p,$$

where $q\in\mathcal{P}(S_N\times S_N\times S_N)$ is such that $q(\zeta,\tau,\xi)$ is the probability that the tuple of permutations $(\zeta,\tau,\xi) = (\zeta_\theta,\tau_\theta,\xi_\theta)$ is required, given that $\theta$ is drawn from $\mathcal{U}(\mathbb{S}^{d-1})$. With these alternative expressions established by the authors in (Rowland et al., 2019), the triangle inequality follows from the standard Minkowski inequality for weighted finite $L^p$-spaces.

Finally, notice that we have used the fact that each $\mu^1$ is associated with the $N$ indices $\{1,\dots,N\}$, without requiring that the points $\{x_i\}$ do not overlap, i.e., they could be repeated. That is, given $\mu^1 = \frac1N\sum_{i=1}^N\delta_{x_i}\in\mathcal{P}^{(N)}(\mathbb{R}^d)$, one can allow $x_i = x_j$ for some pairs of indices $i\neq j$ (analogously for $\mu^2$ and $\mu^3$). Thus, the proof also holds for measures in $\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$: Indeed, let $\mu^1,\mu^2,\mu^3\in\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$ be of the form $\mu^1 = \sum_{i=1}^{n_1}\frac{r^1_i}{s^1_i}\delta_{x_i}$, $\mu^2 = \sum_{i=1}^{n_2}\frac{r^2_i}{s^2_i}\delta_{y_i}$, $\mu^3 = \sum_{i=1}^{n_3}\frac{r^3_i}{s^3_i}\delta_{z_i}$, with $r^j_i, s^j_i\in\mathbb{N}$. First, consider $n'_1$, the least common multiple of $\{s^1_1,\dots,s^1_{n_1}\}$, and, for each $i\in[n_1]$, let $\tilde{r}^1_i$ be such that $\frac{\tilde{r}^1_i}{n'_1} = \frac{r^1_i}{s^1_i}$. Thus, we can rewrite $\mu^1 = \sum_{i=1}^{n_1}\frac{\tilde{r}^1_i}{n'_1}\delta_{x_i}$. Notice that, since $\mu^1$ is a probability measure, we have $n'_1 = \sum_{i=1}^{n_1}\tilde{r}^1_i$. Now, for each $i\in[n_1]$ such that $\tilde{r}^1_i > 1$, consider $\tilde{r}^1_i$ copies of the corresponding point $x_i$, so that we can rewrite $\mu^1 = \sum_{i=1}^{n'_1}\frac{1}{n'_1}\delta_{x_i}$ (where we recall that $n'_1 = \sum_{i=1}^{n_1}\tilde{r}^1_i$, and the points $x_i$ in the new expression can be repeated, i.e., they are not necessarily all different). Repeat this process to rewrite $\mu^2 = \sum_{i=1}^{n'_2}\frac{1}{n'_2}\delta_{y_i}$ and $\mu^3 = \sum_{i=1}^{n'_3}\frac{1}{n'_3}\delta_{z_i}$. Now, consider $N$, the least common multiple of $n'_1, n'_2, n'_3$, and rewrite the measures as $\mu^1 = \frac1N\sum_{i=1}^N\delta_{x_i}$, $\mu^2 = \frac1N\sum_{i=1}^N\delta_{y_i}$, $\mu^3 = \frac1N\sum_{i=1}^N\delta_{z_i}$, where the points $x_i, y_i, z_i$ can be repeated if needed.
Thus, $\mu^1,\mu^2,\mu^3$ can be regarded as measures in $\mathcal{P}^{(N)}(\mathbb{R}^d)$, where $\mathcal{D}_p(\cdot,\cdot)$ behaves as a metric.

A.4 THE PROOF FOR GENERAL FINITE DISCRETE MEASURES

We first introduce some notation. Consider a finite discrete probability measure $\mu\in\mathcal{P}(\mathbb{R}^d)$ of the form $\mu = \sum_{i=1}^m p_i\delta_{x_i}$, with general weights $p_i\in\mathbb{R}_+$ such that $\sum_{i=1}^m p_i = 1$. For each $i\in\{1,\dots,m-1\}$, consider an increasing sequence of rational numbers $(p^i_n)_{n\in\mathbb{N}}\subseteq\mathbb{Q}$, with $0\le p^i_n\le p_i$, such that $\lim_n p^i_n = p_i$. For $i = m$, consider the sequence $(p^m_n)_{n\in\mathbb{N}}\subseteq\mathbb{Q}$ defined by $0\le p^m_n := 1 - \sum_{i=1}^{m-1} p^i_n \le 1$. Thus, $\lim_n p^m_n = 1 - \lim_n\sum_{i=1}^{m-1} p^i_n = 1 - \sum_{i=1}^{m-1} p_i = p_m$. Define the sequence of probability measures $(\mu_n)_{n\in\mathbb{N}}$ given by $\mu_n := \sum_{i=1}^m p^i_n\delta_{x_i}\in\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$.

It is easy to show that $(\mu_n)_{n\in\mathbb{N}}$ converges to $\mu$ in Total Variation (i.e., uniform convergence or strong convergence): Indeed, let $\varepsilon > 0$. For each $i\in[m]$, let $N_i\in\mathbb{N}$ be such that $|p^i_n - p_i| < \varepsilon/m$ for all $n\ge N_i$, and define $N = \max\{N_1,\dots,N_m\}$. Now, given any set $B\subseteq\mathbb{R}^d$, we obtain, for $n\ge N$,

$$|\mu_n(B) - \mu(B)| = \Bigg|\sum_{i\in[m]:\,x_i\in B} (p^i_n - p_i)\Bigg| \;\;(\mu_n\text{ and }\mu\text{ have the same support}) \;\le\; \sum_{i\in[m]:\,x_i\in B} |p^i_n - p_i| \;\;\text{(triangle inequality)} \;\le\; \sum_{i\in[m]} |p^i_n - p_i| \;\;\text{(sum over all indices, to get independence of the set } B\text{)} \;<\; \varepsilon.$$

This shows that $\lim_n \|\mu_n - \mu\|_{TV} = 0$. Moreover, this shows that in this case, i.e., when approximating a finite discrete measure $\mu$ by a sequence of measures having the same support as $\mu$, we only care about pointwise convergence.
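As an illustration (ours, not part of the formal argument) of one admissible choice of approximating weights, one can take common denominator $n$, round the first $m-1$ weights down, and assign the remainder to the last atom:

```python
from fractions import Fraction

def rational_approximation(weights, n):
    """Approximate real weights (p_1, ..., p_m) by rationals with
    denominator n on the same support, as in Section A.4: floor the
    first m-1 weights and give the remainder to the last atom, so the
    result still sums to 1."""
    approx = [Fraction(int(p * n), n) for p in weights[:-1]]
    approx.append(1 - sum(approx))  # p_m^n := 1 - sum of the others
    return approx

# Since each error is at most m/n, the approximating measures converge
# to mu in total variation as n grows.
print(rational_approximation([0.1, 0.3, 0.6], n=7))  # -> [0, 2/7, 5/7]
```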
We will now introduce some lemmas which, together with the above proposition, will allow us to prove the metric property of $\mathcal{D}_p(\cdot,\cdot)$ for finite discrete probability measures. For all of them we will consider:

- $\mu^1,\mu^2$, two finite discrete probability measures in $\mathbb{R}^d$ given by $\mu^1 = \sum_{i=1}^{m_1}p_i\delta_{x_i}$, $\mu^2 = \sum_{j=1}^{m_2}q_j\delta_{y_j}$;
- $(\mu^1_n)_{n\in\mathbb{N}}, (\mu^2_n)_{n\in\mathbb{N}}$, approximating sequences of probability measures $\mu^1_n = \sum_{i=1}^{m_1}p^i_n\delta_{x_i}$, $\mu^2_n = \sum_{j=1}^{m_2}q^j_n\delta_{y_j}$, with rational weights $\{p^i_n\},\{q^j_n\}$, defined in analogy to what we have already done, i.e., so that, for $k = 1,2$, $(\mu^k_n)_{n\in\mathbb{N}}$ is a sequence of probability measures that converges to $\mu^k$ in Total Variation.

Also, for each $\theta$, $\Lambda^{\mu^1,\mu^2}_\theta$ denotes the unique optimal transport plan between $\theta_{\#}\mu^1$ and $\theta_{\#}\mu^2$; $\gamma^{\mu^1,\mu^2}_\theta$ denotes the lifted transport plan between $\mu^1$ and $\mu^2$ given as in (9); and $\bar{\gamma}^{\mu^1,\mu^2}$ the expected sliced transport plan between $\mu^1$ and $\mu^2$ given as in (10). Similarly, for each $n\in\mathbb{N}$ we consider the plans $\Lambda^{\mu^1_n,\mu^2_n}_\theta$, $\gamma^{\mu^1_n,\mu^2_n}_\theta$, and $\bar{\gamma}^{\mu^1_n,\mu^2_n}$.

Lemma A.8. The sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ converges to $\Lambda^{\mu^1,\mu^2}_\theta$ in Total Variation.

Lemma A.9. The sequence $(\gamma^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ converges to $\gamma^{\mu^1,\mu^2}_\theta$ in Total Variation.

Lemma A.10. The sequence $(\bar{\gamma}^{\mu^1_n,\mu^2_n})_{n\in\mathbb{N}}$ converges to $\bar{\gamma}^{\mu^1,\mu^2}$ in Total Variation.

Lemma A.11. $\lim_n \mathcal{D}_p(\mu^1_n,\mu^2_n) = \mathcal{D}_p(\mu^1,\mu^2)$.

In general, notice that since for every $i\in[m_1]$, $j\in[m_2]$ we have $\lim_n p^i_n = p_i$ and $\lim_n q^j_n = q_j$, we obtain $\lim_n P^i_n = P^i$ and $\lim_n Q^j_n = Q^j$, where $P^i = \sum_{i'\in[m_1]:\,x_{i'}\in\tilde{x}_i^\theta} p_{i'}$, $Q^j = \sum_{j'\in[m_2]:\,y_{j'}\in\tilde{y}_j^\theta} q_{j'}$, and where, for each $n\in\mathbb{N}$, $P^i_n$ and $Q^j_n$ are analogously defined (see Subsection 2.2.2). Thus,

$$\lim_n \frac{p^i_n\,q^j_n}{P^i_n\,Q^j_n} = \frac{p_i\,q_j}{P^i\,Q^j}, \quad \forall\, i\in[m_1],\, j\in[m_2]. \tag{21}$$

Proof of Lemma A.8. The supports of all the measures we are considering are finite, and so the measures have compact support. Hence, we can apply Proposition A.6 to $\theta_{\#}\mu^i$, $(\theta_{\#}\mu^i_n)_{n\in\mathbb{N}}$, $i = 1,2$. Specifically, given $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ with $\Lambda^{\mu^1_n,\mu^2_n}_\theta\in\Gamma^*(\theta_{\#}\mu^1_n,\theta_{\#}\mu^2_n)$, there exists a subsequence $(\Lambda^{\mu^1_{n_k},\mu^2_{n_k}}_\theta)_{k\in\mathbb{N}}$ and $\Lambda_\theta\in\Gamma^*(\theta_{\#}\mu^1,\theta_{\#}\mu^2)$ such that

$$\Lambda^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \Lambda_\theta. \tag{22}$$

As we are in one dimension, the set $\Gamma^*(\theta_{\#}\mu^1,\theta_{\#}\mu^2)$ is a singleton, and so we have that $\Lambda_\theta = \Lambda^{\mu^1,\mu^2}_\theta$ is the unique optimal transport plan. Since the supports of all the measures are the same (that is, $\{(\theta\cdot x_i,\theta\cdot y_j)\}_{i\in[m_1],j\in[m_2]}$), the weak convergence in (22) implies the stronger convergence in Total Variation. Now, suppose that the original sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ does not converge to $\Lambda^{\mu^1,\mu^2}_\theta$ (in Total Variation). Then, there exist $\varepsilon > 0$ and a subsequence $(\Lambda^{\mu^1_{n_j},\mu^2_{n_j}}_\theta)_{j\in\mathbb{N}}$ such that

$$\left\|\Lambda^{\mu^1_{n_j},\mu^2_{n_j}}_\theta - \Lambda^{\mu^1,\mu^2}_\theta\right\|_{TV} > \varepsilon. \tag{23}$$

But again, from Proposition A.6, using that the supports of all the measures involved are the same set $\{(\theta\cdot x_i,\theta\cdot y_j)\}_{i\in[m_1],j\in[m_2]}$, and the fact that $\Gamma^*(\theta_{\#}\mu^1,\theta_{\#}\mu^2) = \{\Lambda^{\mu^1,\mu^2}_\theta\}$ (only one optimal transport plan), we have that there exists a sub-subsequence such that

$$\left\|\Lambda^{\mu^1_{n_{j_i}},\mu^2_{n_{j_i}}}_\theta - \Lambda^{\mu^1,\mu^2}_\theta\right\|_{TV} < \varepsilon,$$

contradicting (23). Since the contradiction is reached by assuming that the whole sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ does not converge to $\Lambda^{\mu^1,\mu^2}_\theta$, we have that, in fact, it does converge to $\Lambda^{\mu^1,\mu^2}_\theta$ in Total Variation.

Proof of Lemma A.9. This holds by looking at (9): Since the supports of $\mu^1_n$ and $\mu^1$ are the same (respectively, of $\mu^2_n$ and $\mu^2$), we only care about the locations $\{(x_i,y_j)\}_{i\in[m_1],j\in[m_2]}$; then, by using (21) and the convergence from Lemma A.8, the result holds true.

Proof of Lemma A.10. As pointed out before, we only care about pointwise convergence: since the supports of the measures involved coincide (they are the same set $\{(x_i,y_j)\}_{i\in[m_1],j\in[m_2]}$), weak convergence, pointwise convergence, and convergence in Total Variation are equivalent. Since $0\le\gamma^{\mu^1_n,\mu^2_n}_\theta(\{(x_i,y_j)\})\le 1$ and $\mathbb{S}^{d-1}$ is compact, by the convergence result from Lemma A.9 and the Dominated Convergence Theorem, we have that for each $i\in[m_1]$, $j\in[m_2]$,

$$\lim_n \bar{\gamma}^{\mu^1_n,\mu^2_n}(\{(x_i,y_j)\}) = \lim_n \int_{\mathbb{S}^{d-1}} \gamma^{\mu^1_n,\mu^2_n}_\theta(\{(x_i,y_j)\})\,d\theta = \int_{\mathbb{S}^{d-1}} \lim_n \gamma^{\mu^1_n,\mu^2_n}_\theta(\{(x_i,y_j)\})\,d\theta = \int_{\mathbb{S}^{d-1}} \gamma^{\mu^1,\mu^2}_\theta(\{(x_i,y_j)\})\,d\theta = \bar{\gamma}^{\mu^1,\mu^2}(\{(x_i,y_j)\}).$$

Proof of Lemma A.11. We have

$$\left|\mathcal{D}_p(\mu^1_n,\mu^2_n)^p - \mathcal{D}_p(\mu^1,\mu^2)^p\right| \le \max_{i\in[m_1],j\in[m_2]}\{\|x_i-y_j\|^p\}\,\left\|\bar{\gamma}^{\mu^1_n,\mu^2_n} - \bar{\gamma}^{\mu^1,\mu^2}\right\|_{TV}, \tag{24}$$

where the right-hand side goes to 0 as $n\to\infty$, due to Lemma A.10.

Theorem A.12. $\mathcal{D}_p(\cdot,\cdot)$ is a metric on the space of finite discrete probability measures in $\mathbb{R}^d$.

Proof. Symmetry: The way we constructed $\mathcal{D}_p(\cdot,\cdot)$ makes it so that $\mathcal{D}_p(\mu^1,\mu^2) = \mathcal{D}_p(\mu^2,\mu^1)$.

Positivity: It is clear that, by definition, $\mathcal{D}_p(\mu^1,\mu^2)\ge 0$.

Identity of indiscernibles: First, if $\mu^1 = \mu^2 =: \mu$, then $\gamma^{\mu^1,\mu^2}_\theta = (\mathrm{id}\times\mathrm{id})_{\#}\mu$ for all $\theta\in\mathbb{S}^{d-1}$. Hence $\bar{\gamma}^{\mu^1,\mu^2} = (\mathrm{id}\times\mathrm{id})_{\#}\mu$, which implies $\mathcal{D}_p(\mu^1,\mu^2) = 0$. Secondly, if $\mu^1,\mu^2$ are such that $\mathcal{D}_p(\mu^1,\mu^2) = 0$, then, by having $W_p(\mu^1,\mu^2)\le\mathcal{D}_p(\mu^1,\mu^2) = 0$, we can use the fact that $W_p(\cdot,\cdot)$ satisfies the identity of indiscernibles by being a distance. That is, $W_p(\mu^1,\mu^2) = 0$ implies $\mu^1 = \mu^2$.

Triangle inequality: Given $\mu^1,\mu^2,\mu^3$ arbitrary finite discrete measures with arbitrary real weights, consider approximating sequences $(\mu^1_n)_{n\in\mathbb{N}}, (\mu^2_n)_{n\in\mathbb{N}}, (\mu^3_n)_{n\in\mathbb{N}}$ in $\mathcal{P}_\mathbb{Q}(\mathbb{R}^d)$ as before. Notice that every subsequence of $(\mu^1_n)_{n\in\mathbb{N}}$ (respectively of $(\mu^2_n)_{n\in\mathbb{N}}$ and $(\mu^3_n)_{n\in\mathbb{N}}$) will converge to $\mu^1$ (respectively, to $\mu^2$ and $\mu^3$), as every subsequence of a convergent sequence is convergent. By Proposition A.7, we have that, for each $n\in\mathbb{N}$,

$$\mathcal{D}_p(\mu^1_n,\mu^2_n) \le \mathcal{D}_p(\mu^1_n,\mu^3_n) + \mathcal{D}_p(\mu^3_n,\mu^2_n).$$

Taking the limit as $n\to\infty$, from Lemma A.11 we obtain $\mathcal{D}_p(\mu^1,\mu^2) \le \mathcal{D}_p(\mu^1,\mu^3) + \mathcal{D}_p(\mu^3,\mu^2)$.

A.5 DISCRETE MEASURES WITH COUNTABLE SUPPORT

Lemma A.13. Let $\mu = \sum_{m\in\mathbb{N}} p_m\delta_{x_m}$ be a discrete probability measure with countable support $\{x_m\}_{m\in\mathbb{N}}$.
Let $\sigma$ be a probability measure on the sphere that is absolutely continuous with respect to the Lebesgue measure (we write $\sigma\ll\mathcal{U}(\mathbb{S}^{d-1})$). Let

$$S_\mu := \{\theta\in\mathbb{S}^{d-1} : \theta\cdot x_m = \theta\cdot x_{m'} \text{ for some pair } (m,m') \text{ with } m\neq m'\}.$$

Then $\sigma(S_\mu) = 0$.

Proof. First, consider distinct points $x_m, x_{m'}$ in the support of $\mu$, and let $S(x_m,x_{m'}) = \{\theta\in\mathbb{S}^{d-1} : \theta\cdot x_m = \theta\cdot x_{m'}\}$. It is straightforward to verify that $S(x_m,x_{m'}) = \mathbb{S}^{d-1}\cap\mathrm{span}(\{x_m - x_{m'}\})^\perp$, where $\mathrm{span}(\{x_m - x_{m'}\})^\perp$ is the subspace orthogonal to the line in the direction of the vector $x_m - x_{m'}$. Thus, $S(x_m,x_{m'})$ is a subset of a $(d-2)$-dimensional sub-sphere of $\mathbb{S}^{d-1}$, and therefore

$$\sigma(S(x_m,x_{m'})) = 0. \tag{25}$$

Since

$$S_\mu = \bigcup_{(x_m,x_{m'})\in M} S(x_m,x_{m'}), \quad\text{where } M = \{(x_m,x_{m'}) : m\neq m'\}, \tag{26}$$

and $M$ is countable (indeed, $|M|\le|\mathrm{supp}(\mu)\times\mathrm{supp}(\mu)|$), we have that $\sigma(S_\mu)\le\sum_{(x_m,x_{m'})\in M}\sigma(S(x_m,x_{m'})) = 0$.

Remark A.14 ($\bar{\gamma}^{\mu^1,\mu^2}$ is well-defined for discrete measures with countable support). Let $\sigma\ll\mathcal{U}(\mathbb{S}^{d-1})$. Given two discrete probability measures $\mu^1 = \sum p(x)\delta_x$ and $\mu^2 = \sum q(y)\delta_y$ with countable support, from Lemma A.13 we have that $\sigma(S_{\mu^i}) = 0$ for $i = 1,2$, and so $\sigma(S_{\mu^1}\cup S_{\mu^2}) = 0$. Therefore, similarly to the case of discrete measures with finite support, given any $x\in\mathrm{supp}(\mu^1)$, $y\in\mathrm{supp}(\mu^2)$, we have that the map $\theta\mapsto \frac{p(x)\,q(y)}{P(\tilde{x}^\theta)\,Q(\tilde{y}^\theta)}$ from $\mathbb{S}^{d-1}$ to $\mathbb{R}$ is equal to the constant function $\theta\mapsto 1$ up to a set of $\sigma$-measure 0. This implies that the function $\theta\mapsto u^{\mu^1,\mu^2}_\theta$ is measurable. Finally, since $|\gamma^{\mu^1,\mu^2}_\theta(\{(x,y)\})|\le 1$ for every $(x,y)$, we have that $\bar{\gamma}^{\mu^1,\mu^2}$ is well-defined.

B EQUIVALENCE WITH WEAK CONVERGENCE

Lemma B.1. Let $\Omega\subset\mathbb{R}^d$ be a compact set, $\mu\in\mathcal{P}(\Omega)$, and consider a sequence of probability measures $(\mu_n)_{n\in\mathbb{N}}$ defined on $\Omega$ such that $\mu_n\rightharpoonup\mu$ as $n\to\infty$. Then, for each $\theta\in\mathbb{S}^{d-1}$, we have that $\theta_{\#}\mu_n\rightharpoonup\theta_{\#}\mu$ as $n\to\infty$.

Proof. Given $\theta\in\mathbb{S}^{d-1}$, notice that $\{\theta\cdot x : x\in\Omega\}$ is a one-dimensional compact set, which contains the supports of $\theta_{\#}\mu$ and $(\theta_{\#}\mu_n)_{n\in\mathbb{N}}$. Thus, when dealing with the weak*-topology we can use continuous functions as test functions. Let $\phi:\mathbb{R}\to\mathbb{R}$ be a continuous test function; then

$$\int_\mathbb{R} \phi(u)\,d\theta_{\#}\mu_n(u) = \int_{\mathbb{R}^d} \phi(\theta\cdot x)\,d\mu_n(x) \xrightarrow{n\to\infty} \int_{\mathbb{R}^d}\phi(\theta\cdot x)\,d\mu(x) = \int_\mathbb{R}\phi(u)\,d\theta_{\#}\mu(u),$$

since the composition $x\mapsto\theta\cdot x\mapsto\phi(\theta\cdot x)$ is a continuous function from $\mathbb{R}^d$ to $\mathbb{R}$.

Lemma B.2. Let $\Omega\subset\mathbb{R}^d$ be a compact set, $\mu^i\in\mathcal{P}(\Omega)$, $i = 1,2$, and consider sequences of probability measures $(\mu^i_n)_{n\in\mathbb{N}}$ defined on $\Omega$ such that $\mu^i_n\rightharpoonup\mu^i$ as $n\to\infty$, for $i = 1,2$. Given $\theta\in\mathbb{S}^{d-1}$, consider $\Lambda^{\mu^1,\mu^2}_\theta$, the unique optimal transport plan between $\theta_{\#}\mu^1$ and $\theta_{\#}\mu^2$, and for each $n\in\mathbb{N}$, consider $\Lambda^{\mu^1_n,\mu^2_n}_\theta$, the unique optimal transport plan between $\theta_{\#}\mu^1_n$ and $\theta_{\#}\mu^2_n$. Then $\Lambda^{\mu^1_n,\mu^2_n}_\theta \rightharpoonup \Lambda^{\mu^1,\mu^2}_\theta$.

Proof. The proof is similar to that of Lemma A.8. From Lemma B.1, Proposition A.6, and the uniqueness of optimal plans in one dimension, there exists a subsequence $(\Lambda^{\mu^1_{n_k},\mu^2_{n_k}}_\theta)_{k\in\mathbb{N}}$ such that $\Lambda^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \Lambda^{\mu^1,\mu^2}_\theta$. Now, suppose that the original sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ does not converge to $\Lambda^{\mu^1,\mu^2}_\theta$ (in the weak*-topology). Thus, there exist a continuous function $\phi:\mathbb{R}\to\mathbb{R}$, an $\varepsilon > 0$, and a subsequence $(\Lambda^{\mu^1_{n_j},\mu^2_{n_j}}_\theta)_{j\in\mathbb{N}}$ with

$$\left|\int_\mathbb{R}\phi(u)\,d\Lambda^{\mu^1_{n_j},\mu^2_{n_j}}_\theta(u) - \int_\mathbb{R}\phi(u)\,d\Lambda^{\mu^1,\mu^2}_\theta(u)\right| > \varepsilon. \tag{27}$$

But again, from Lemma B.1, Proposition A.6, and the uniqueness of optimal plans in one dimension, we have that there exists a sub-subsequence such that

$$\int_\mathbb{R}\phi(u)\,d\Lambda^{\mu^1_{n_{j_i}},\mu^2_{n_{j_i}}}_\theta(u) \xrightarrow{i\to\infty} \int_\mathbb{R}\phi(u)\,d\Lambda^{\mu^1,\mu^2}_\theta(u),$$

contradicting (27). Since the contradiction is reached by assuming that the whole sequence $(\Lambda^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ does not converge to $\Lambda^{\mu^1,\mu^2}_\theta$, we have that, in fact, it does converge to $\Lambda^{\mu^1,\mu^2}_\theta$ in the weak*-topology.

Lemma B.3. Consider probability measures supported in a compact set $\Omega\subset\mathbb{R}^d$ such that $\mu^1_n\rightharpoonup\mu^1$, $\mu^2_n\rightharpoonup\mu^2$.
Lemma B.3. Consider probability measures supported on a compact set $\Omega\subset\mathbb{R}^d$ such that $\mu^1_n \rightharpoonup \mu^1$ and $\mu^2_n \rightharpoonup \mu^2$. Let $\theta\in\mathbb{S}^{d-1}$. Then the sequence of lifted plans $(\gamma^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ admits a subsequence such that $\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \gamma^*$ for some

$$\gamma^* \in \Gamma(\mu^1, \mu^2; \Lambda^{\mu^1,\mu^2}_\theta) := \{\gamma\in\Gamma(\mu^1,\mu^2) : (\theta\otimes\theta)_\#\gamma = \Lambda^{\mu^1,\mu^2}_\theta\},$$

where $\theta\otimes\theta(x, y) := (\theta\cdot x,\ \theta\cdot y)$ for all $(x, y)\in\mathbb{R}^d\times\mathbb{R}^d$.

Proof. Since $\gamma^{\mu^1_n,\mu^2_n}_\theta \in \Gamma(\mu^1_n, \mu^2_n)$, as in the proof of Proposition A.6, by the Banach–Alaoglu Theorem there exists a subsequence $(\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta)_{k\in\mathbb{N}}$ such that $\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \gamma^*$ for some $\gamma^*\in\mathcal{P}(\Omega\times\Omega)$. Again, as in the proof of Proposition A.6, it can be shown that $\gamma^*\in\Gamma(\mu^1, \mu^2)$. In addition, we have

$$(\theta\otimes\theta)_\#\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup (\theta\otimes\theta)_\#\gamma^* \quad\text{and}\quad (\theta\otimes\theta)_\#\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta = \Lambda^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \Lambda^{\mu^1,\mu^2}_\theta$$

(the first convergence follows by arguments similar to those in Lemma B.1, and the identity follows from the definition of the lifted plans). By the uniqueness of the limit (for weak⋆ convergence), we have $(\theta\otimes\theta)_\#\gamma^* = \Lambda^{\mu^1,\mu^2}_\theta$. Thus, $\gamma^*\in\Gamma(\mu^1, \mu^2; \Lambda^{\mu^1,\mu^2}_\theta)$.

Theorem B.4. Let $\sigma \ll \mathrm{Unif}(\mathbb{S}^{d-1})$. Consider discrete probability measures $\mu^1 = \sum_i p_i\delta_{x_i}$, $\mu^2 = \sum_j q_j\delta_{y_j}$ supported on a compact set $\Omega\subset\mathbb{R}^d$, and sequences $(\mu^1_n)_{n\in\mathbb{N}}, (\mu^2_n)_{n\in\mathbb{N}}$ of discrete probability measures defined on $\Omega$ such that $\mu^1_n \rightharpoonup \mu^1$, $\mu^2_n \rightharpoonup \mu^2$. Then, $\sigma$-a.s. we have that $D_p(\mu^1_n, \mu^2_n; \theta) \to D_p(\mu^1, \mu^2; \theta)$ as $n\to\infty$. Moreover, $D_p(\mu^1_n, \mu^2_n) \to D_p(\mu^1, \mu^2)$ as $n\to\infty$.

Proof. Let us define the set

$$S(\mu^1,\mu^2) := \left\{\theta\in\mathbb{S}^{d-1} : \Gamma(\mu^1, \mu^2; \Lambda^{\mu^1,\mu^2}_\theta) = \{\gamma^{\mu^1,\mu^2}_\theta\}\right\}.$$

Since we are considering discrete measures, notice that $\mathbb{S}^{d-1}\setminus S(\mu^1,\mu^2) \subseteq S_{\mu^1}\cup S_{\mu^2}$, where $S_{\mu^1} = \{\theta\in\mathbb{S}^{d-1} : \theta\cdot x_m = \theta\cdot x_{m'} \text{ for some pair } m\ne m'\}$ and $S_{\mu^2} = \{\theta\in\mathbb{S}^{d-1} : \theta\cdot y_n = \theta\cdot y_{n'} \text{ for some pair } n\ne n'\}$. By Lemma A.13, we have $\sigma(\mathbb{S}^{d-1}\setminus S(\mu^1,\mu^2)) \le \sigma(S_{\mu^1}\cup S_{\mu^2}) = 0$. Thus, $\sigma(S(\mu^1,\mu^2)) = 1$.

Let $\theta\in S(\mu^1,\mu^2)$, and consider the lifted plans $\gamma^{\mu^1,\mu^2}_\theta$ and $\gamma^{\mu^1_n,\mu^2_n}_\theta$. By Lemma B.3, there exists a subsequence of $(\gamma^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ such that $\gamma^{\mu^1_{n_k},\mu^2_{n_k}}_\theta \rightharpoonup \gamma^* \in \Gamma(\mu^1, \mu^2; \Lambda^{\mu^1,\mu^2}_\theta)$. Since $\theta\in S(\mu^1,\mu^2)$, the set $\Gamma(\mu^1, \mu^2; \Lambda^{\mu^1,\mu^2}_\theta)$ contains only one element, namely $\gamma^{\mu^1,\mu^2}_\theta$. Hence, $\gamma^* = \gamma^{\mu^1,\mu^2}_\theta$. Moreover, by uniqueness of the weak limit, proceeding as in the proof of Lemma B.2, the whole sequence $(\gamma^{\mu^1_n,\mu^2_n}_\theta)_{n\in\mathbb{N}}$ converges to $\gamma^{\mu^1,\mu^2}_\theta$ in the weak⋆ topology. Therefore, by the definition of weak⋆ convergence for measures supported on a compact set (in our case, $\Omega\times\Omega\subset\mathbb{R}^d\times\mathbb{R}^d$), and since $(x, y)\mapsto \|x - y\|^p$ is a continuous function, we have

$$\lim_{n\to\infty} D_p(\mu^1_n, \mu^2_n; \theta)^p = \lim_{n\to\infty} \int_{\Omega^2} \|x - y\|^p\, d\gamma^{\mu^1_n,\mu^2_n}_\theta(x, y) = \int_{\Omega^2} \|x - y\|^p\, d\gamma^{\mu^1,\mu^2}_\theta(x, y) = D_p(\mu^1, \mu^2; \theta)^p.$$

Combining this with the fact that $\sigma(S(\mu^1,\mu^2)) = 1$ and that $(D_p(\mu^1_n, \mu^2_n; \theta)^p)_{n\in\mathbb{N}}$ is bounded, that is, $|D_p(\mu^1_n, \mu^2_n; \theta)^p| \le \max_{(x,y)\in\Omega\times\Omega} \|x - y\|^p$, by the Dominated Convergence Theorem we obtain

$$\lim_{n\to\infty} D_p(\mu^1_n, \mu^2_n)^p = \lim_{n\to\infty} \int_{\mathbb{S}^{d-1}} D_p(\mu^1_n, \mu^2_n; \theta)^p\, d\sigma(\theta) = \int_{\mathbb{S}^{d-1}} \lim_{n\to\infty} D_p(\mu^1_n, \mu^2_n; \theta)^p\, d\sigma(\theta) = \int_{\mathbb{S}^{d-1}} D_p(\mu^1, \mu^2; \theta)^p\, d\sigma(\theta) = D_p(\mu^1, \mu^2)^p.$$

Corollary B.5. Let $\mu, \mu_n\in\mathcal{P}(\Omega)$, where $\Omega\subset\mathbb{R}^d$ is compact, be of the form $\mu = \sum_{x\in\mathbb{R}^d} p(x)\delta_x$, $\mu_n = \sum_{x\in\mathbb{R}^d} p_n(x)\delta_x$, where $p(x)$ and $p_n(x)$ are $0$ at all but countably many $x\in\mathbb{R}^d$. Assume $\sigma \ll \mathrm{Unif}(\mathbb{S}^{d-1})$. Then, $D_p(\mu_n, \mu)\to 0$ if and only if $\mu_n \rightharpoonup \mu$.

Proof. If $D_p(\mu_n, \mu)\to 0$ as $n\to\infty$, then, by Remark 2.8, $W_p(\mu_n, \mu)\to 0$, and hence $\mu_n \rightharpoonup \mu$. The converse is a corollary of Theorem B.4.

C COMPUTATIONAL EFFICIENCY

To deduce the computational complexity of the proposed method, we consider, for simplicity, two finite discrete probability measures on $\mathbb{R}^d$, $d > 1$, concentrated at $N$ particles: $\mu = \sum_{i=1}^N p_i\delta_{x_i}$ and $\nu = \sum_{j=1}^N q_j\delta_{y_j}$.
Consider $L$ slices, i.e., unit vectors $\{\theta_l\}_{l=1}^L$ in $\mathbb{S}^{d-1}$. We assume that no overlap occurs among the projected points along each direction (as previously noted, this is almost surely the case, as proven in Lemma A.13). The computational complexity of the proposed EST method can be analyzed as follows:

- For each of the $L$ directions $\{\theta_l\}_{l=1}^L$, we project the locations $\{x_i\}_{i=1}^N$ and $\{y_j\}_{j=1}^N$ to obtain the projected measures concentrated at $\{\theta_l\cdot x_i\}_{i=1}^N$ and $\{\theta_l\cdot y_j\}_{j=1}^N$, respectively. This requires $O(LdN)$ operations.
- To compute the one-dimensional plan $\Lambda^{\mu,\nu}_{\theta_l}$, for each $1\le l\le L$, we essentially need to solve a one-dimensional optimal transport problem (i.e., a sorting problem), which is of order $O(N\log(N))$. Thus, over all slices, this step is of order $O(LN\log(N))$.
- The lifting process that gives rise to $\gamma^{\mu,\nu}_{\theta_l}$ requires no additional operations (it is performed as an assignment or correspondence).
- The plan $\gamma^{\mu,\nu}$ can be represented as an $N\times N$ matrix, whose $(i,j)$-entry is given by $\frac{1}{L}\sum_{l=1}^L \gamma^{\mu,\nu}_{\theta_l}(\{(x_i, y_j)\})$, requiring $O(L)$ operations per entry. Thus, the complexity of this step is $O(LN^2)$.
- Finally, once we have $\gamma^{\mu,\nu}$, computing the EST distance requires another $O(N^2 d)$ operations.

In conclusion, the complexity of computing the plan $\gamma^{\mu,\nu}$ is $O(L(Nd + N\log(N) + N^2))$, or simply $O(LN^2 + LNd)$, and that of the EST distance is $O((L + d)N^2)$. Since, as is generally the case, $N$ is much larger than $d$, the computational complexity of both the EST plan and the EST distance is of order $O((L + d)N^2)$ (a minimal implementation sketch of this pipeline is given at the end of this appendix). We recall that the complexity of the linear programming approach for solving the classical optimal transport problem between $\mu$ and $\nu$ is of order $O(N^3\log(N))$. Moreover, for the entropic regularized version solved by iterative algorithms such as Sinkhorn's algorithm, the complexity is of order $O(N^2\log(N)/\lambda^2)$, where $\lambda$ denotes the regularization parameter (see Dvurechensky et al. (2018); Altschuler et al. (2017)).

In Figures 8 and 9, we report the wall-clock times of the Sinkhorn algorithm, used for computing the entropic regularized version of OT (for different values of the regularization parameter $\lambda$), compared to the proposed EST distance (calculated with different numbers of slices $L$) between two $N$-sized empirical distributions. The only difference between the two figures is the scale of the horizontal axis, which has been adjusted for visualization purposes.

Figure 8: Wall-clock time comparison between the Sinkhorn algorithm, applied for computing entropic OT for different values of the regularization parameter $\lambda$, and the proposed EST method, calculated with different numbers $L$ of slices, as the size $N$ of the empirical measures increases.

Figure 9: log2–log10 wall-clock time plot comparing the Sinkhorn algorithm, applied for computing entropic OT for different values of the regularization parameter $\lambda$, and the proposed EST method, calculated with different numbers $L$ of slices, as the size $N$ of the empirical measures increases.

Remark C.1. It is important to note that the computation of the EST method can be parallelized over the number of slices $L$. Currently, our code does not implement parallelization, but incorporating it would significantly speed up the computation of the EST plan and distance.
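To make the step-by-step analysis above concrete, the following is a minimal sketch of the EST pipeline for the special case of uniform measures with equal support size $N$, where every one-dimensional optimal plan is a permutation obtained by sorting. The function name and defaults are for exposition only, not from our implementation:

```python
# Minimal EST sketch for mu = (1/N) sum_i delta_{x_i}, nu = (1/N) sum_j delta_{y_j}:
# project, sort (1-D OT), lift each permutation plan, and average over the slices.
import numpy as np

def est_plan_and_distance(X, Y, L=64, p=2, seed=0):
    """X, Y: (N, d) arrays; both measures weight each point by 1/N."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    thetas = rng.normal(size=(L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # L slices on S^{d-1}

    gamma = np.zeros((N, N))                  # EST plan as an N x N matrix
    for theta in thetas:
        sx = np.argsort(X @ theta)            # projection: O(dN); sorting: O(N log N)
        sy = np.argsort(Y @ theta)
        # lifting: match the k-th smallest projection of X to that of Y,
        # then average the lifted permutation plans over the L slices
        gamma[sx, sy] += 1.0 / (L * N)

    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** p  # O(N^2 d)
    dist = (gamma * cost).sum() ** (1.0 / p)                            # O(N^2)
    return gamma, dist

# Usage: gamma is a valid coupling (both marginals are uniform by construction).
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(256, 3)), rng.normal(size=(256, 3)) + 1.0
gamma, d_est = est_plan_and_distance(X, Y, L=128)
assert np.allclose(gamma.sum(axis=1), 1 / 256) and np.allclose(gamma.sum(axis=0), 1 / 256)
```

Note that the loop over slices is embarrassingly parallel, in line with Remark C.1.

D ADDITIONAL EXPERIMENTAL RESULTS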
To further demonstrate the efficiency of our proposed method for the classification task, we consider ModelNet40 (Wu et al., 2015), a widely used benchmark dataset in 3D computer vision and geometric deep learning. This dataset consists of objects represented as 3D point clouds, with 2,048 points per object, across 40 different object categories. The training set comprises 9,840 samples, and the test set includes 2,468 samples. In Table 1, we report the accuracy of the 1NN classifier using (1) the classical optimal transport (OT) approach computed with linear programming (LP), (2) its entropic regularized version for different regularization parameters $\lambda$ using the Sinkhorn algorithm, and (3) our proposed EST method with different numbers of slices ($L$), i.e., unit vectors in $\mathbb{S}^2$.

Importantly, we used the linearized version of these three methods, as described in Subsection 3.6. Direct computation of the 1NN classifier purely based on OT, its entropic regularized version, and the EST distance would require $2{,}048 \times 9{,}840 = 20{,}152{,}320$ (number of testing samples $\times$ number of training samples) pairwise comparisons for each of the three approaches. Instead, we compute $2{,}048 + 9{,}840 = 11{,}888$ (number of testing samples $+$ number of training samples) transport plans (with the three different methods) and then use a traditional classifier with the Euclidean distance on the approximated Monge maps given by (17). We recall that the linearization technique is based on pivoting with a reference measure $\mu_0$ to reduce the number of pairwise comparisons. Specifically, it involves computing transport plans (using the different transportation approaches: OT, entropic regularization, or EST) between $\mu_0$ and the target samples (training and testing point clouds), applying the barycentric projection (16) to obtain an embedding (17) of all the training and testing samples, and using a classifier in the embedding space.

1NN Classification      | Sinkhorn (λ = 10) | Sinkhorn (λ = 1) | EST (L = 128) | EST (L = 1024) | OT (LP)
Accuracy                | 65.96%            | 78.93%           | 77.30%        | 79.45%         | 82.09%
Time per distance (sec) | 0.423             | 0.594            | 0.185         | 0.368          | 0.883

Table 1: Accuracy and time comparison of the 1NN classifier implemented after the embedding.

In these experiments, for all methods, the reference measure $\mu_0$ is chosen as the uniform measure on the cube $[-1, 1]^3$, and we use a 1NN classifier with the Euclidean distance in the embedding space.

E RELATIONS BETWEEN THE AVERAGING MEASURE $\sigma_\tau$ AND THE SLICING DISTRIBUTION IN THE ENERGY-BASED SLICED WASSERSTEIN DISTANCE

We draw an analogy between our proposed averaging measure $\sigma_\tau$ defined in (14) and the slicing distribution in the energy-based Sliced Wasserstein distance (Nguyen & Ho, 2024). Under the assumption that a higher value of the one-dimensional Wasserstein distance $W_p^p(\theta_\#\mu, \theta_\#\nu)$ indicates a better projecting direction $\theta$, Nguyen & Ho (2024) propose to build on the Sliced Wasserstein distance and define a slicing distribution supported on $\mathbb{S}^{d-1}$ as

$$\sigma_{\mu,\nu}(\theta; f, p) := \frac{f(W_p^p(\theta_\#\mu, \theta_\#\nu))}{\int_{\mathbb{S}^{d-1}} f(W_p^p(\theta'_\#\mu, \theta'_\#\nu))\, d\theta'}$$

for $\mu, \nu\in\mathcal{P}_p(\mathbb{R}^d)$, where $f : [0, \infty)\to(0, \infty)$ is a monotonically increasing energy function. When $f$ is the exponential function $f_e(x) = e^x$, the slicing distribution becomes

$$d\sigma_{\mu,\nu}(\theta; f_e, p) = \frac{e^{W_p^p(\theta_\#\mu, \theta_\#\nu)}}{\int_{\mathbb{S}^{d-1}} e^{W_p^p(\theta'_\#\mu, \theta'_\#\nu)}\, d\theta'}\, d\theta. \qquad (28)$$

Both our proposed $\sigma_\tau$ in Equation 14 and $\sigma_{\mu,\nu}(\theta; f, p)$ in Equation 28 employ the exponential function to reflect the optimality of slices in the resulting distances, with the Wasserstein distance serving as the benchmark.
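For concreteness, a discretized version of the slicing distribution in Equation 28 (with the exponential energy $f_e$, and with the integral over $\mathbb{S}^{d-1}$ replaced by a sum over $L$ sampled directions) amounts to a softmax of the one-dimensional distances. The sketch below assumes uniform empirical measures of equal size, so that each $W_p^p$ on the line is computed by sorting; the function name is ours:

```python
# Sketch of the slicing distribution in Equation 28 with f_e(x) = e^x,
# discretized over L sampled directions (Monte Carlo in place of the integral).
import numpy as np

def energy_based_slice_weights(X, Y, thetas, p=2):
    wpp = np.array([
        np.mean(np.abs(np.sort(X @ t) - np.sort(Y @ t)) ** p)  # W_p^p(theta#mu, theta#nu)
        for t in thetas
    ])
    w = np.exp(wpp - wpp.max())          # numerically stabilized exponentials
    return w / w.sum()                   # larger 1-D distance -> larger weight
```

Directions along which the projected measures are farther apart receive larger weight, consistent with $f$ being monotonically increasing.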
However, it is important to note a key distinction: in contrast to the Sliced Wasserstein distance, which is a lower bound of the Wasserstein distance, the Expected Sliced Transport distance is an upper bound, as for all $\theta\in\mathbb{S}^{d-1}$,

$$W_p(\theta_\#\mu, \theta_\#\nu) \le W_p(\mu, \nu) \le D_p(\mu, \nu; \theta).$$

Thus, the analogous energy function in $\sigma_\tau$ is $g(x) = e^{-\tau x}$ (with $\tau \ge 0$), which is monotonically decreasing, based on the assumption that a better slicing direction $\theta$ will result in a lower $D_p(\mu, \nu; \theta)$.
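This sandwich can be verified numerically for uniform discrete measures of equal size, where the middle term reduces to a linear assignment problem. The following sketch (our illustration, using SciPy's linear_sum_assignment) checks both inequalities for a random direction:

```python
# Numerical check of W_p(theta#mu, theta#nu) <= W_p(mu, nu) <= D_p(mu, nu; theta)
# for uniform measures of equal size N; a sketch under these assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, d, p = 128, 3, 2
X, Y = rng.normal(size=(N, d)), rng.normal(size=(N, d)) + 1.0

cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** p
rows, cols = linear_sum_assignment(cost)               # exact OT for uniform weights
w_exact = cost[rows, cols].mean() ** (1 / p)           # W_p(mu, nu)

theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
lower = np.mean(np.abs(np.sort(X @ theta) - np.sort(Y @ theta)) ** p) ** (1 / p)
sx, sy = np.argsort(X @ theta), np.argsort(Y @ theta)  # lifted 1-D plan (a permutation)
upper = cost[sx, sy].mean() ** (1 / p)                 # D_p(mu, nu; theta)

assert lower <= w_exact + 1e-9 and w_exact <= upper + 1e-9
print(lower, w_exact, upper)

# The decreasing energy g(x) = e^{-tau * x} then downweights directions with a
# large value of D_p(mu, nu; theta), mirroring sigma_tau in Equation 14.
```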