# Tree-Sliced Wasserstein Distance: A Geometric Perspective

Viet-Hoang Tran*1, Trang Pham*2, Tho Tran1, Khoi N.M. Nguyen1, Thanh T. Chu1, Tam Le3, Tan M. Nguyen1

**Abstract**

Many variants of Optimal Transport (OT) have been developed to address its heavy computational cost. Among them, notably, Sliced Wasserstein (SW) is widely used across application domains: it projects the OT problem onto one-dimensional lines and leverages the closed-form expression of univariate OT to reduce the computational burden. However, projecting measures onto low-dimensional spaces can lead to a loss of topological information. To mitigate this issue, in this work, we propose to replace one-dimensional lines with a more intricate structure, called tree systems. This structure is metrizable by a tree metric, which yields a closed-form expression for OT problems on tree systems. We provide an extensive theoretical analysis to formally define tree systems with their topological properties, introduce the concept of splitting maps, which operate as the projection mechanism onto these structures, and finally propose a novel variant of the Radon transform for tree systems and verify its injectivity. This framework leads to an efficient metric between measures, termed the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL). Through a variety of experiments on gradient flows, image style transfer, and generative models, we illustrate that our proposed approach performs favorably compared to SW and its variants.

*Equal contribution. Co-last authors. 1National University of Singapore; 2Movian AI; 3The Institute of Statistical Mathematics. Correspondence to: Viet-Hoang Tran.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## 1. Introduction

Optimal transport (OT) (Villani, 2008; Peyré et al., 2019) is a naturally geometric metric for comparing probability distributions. Intuitively, OT lifts a ground cost metric on the supports of the input measures to a metric between the input measures themselves. OT has been applied in many research fields, including machine learning (Nguyen et al., 2021b; Bunne et al., 2022; Fan et al., 2022; Hua et al., 2023; Le et al., 2024a;b; Nguyen & Ho, 2024; Kessler et al., 2025; Chapel et al., 2025; Chapel & Tavenard, 2025), statistics (Mena & Niles-Weed, 2019; Weed & Berthet, 2019; Liu et al., 2022; Nguyen et al., 2022; Nietert et al., 2022; Wang et al., 2022; Pham et al., 2024), multimodal learning (Park et al., 2024; Luong et al., 2024), and computer vision and graphics (Rabin et al., 2011; Solomon et al., 2015; Lavenant et al., 2018; Nguyen et al., 2021a; Saleh et al., 2022; Vu et al., 2025). However, OT has a supercubic computational complexity in the number of supports of the input measures (Peyré et al., 2019). To address this issue, Sliced Wasserstein (SW) (Rabin et al., 2011; Bonneel et al., 2015) exploits the closed-form expression of one-dimensional OT to reduce the computational complexity. More concretely, SW projects the supports of the input measures onto a random line and leverages the fast computation of OT on one-dimensional lines.
SW is widely used in various applications, such as gradient flows (Bonet et al., 2022; Liutkus et al., 2019), clustering (Kolouri et al., 2018; Ho et al., 2017), domain adaptation (Courty et al., 2017), and generative models (Deshpande et al., 2018; Wu et al., 2019; Nguyen & Ho, 2022), thanks to its computational efficiency. However, because it relies on one-dimensional projections, SW has limited capacity to capture the topological structure of the input measures, especially in high-dimensional domains.

**Related work.** Prior studies have aimed to enhance the Sliced Wasserstein (SW) distance (Nguyen et al., 2024a; 2020; Nguyen & Ho, 2024) or to explore variants of SW (Bai et al., 2023; Kolouri et al., 2019; Quellmalz et al., 2023). These works primarily concentrate on improving existing components within the SW framework, including the sampling process (Nguyen et al., 2024a; 2020; Nadjahi et al., 2021), determining optimal lines for projection (Deshpande et al., 2019), and modifying the projection mechanism (Kolouri et al., 2019; Bonet et al., 2023). However, few studies have focused on replacing the one-dimensional lines, which play the role of integration domains, with more complex domains such as one-dimensional manifolds (Kolouri et al., 2019) or low-dimensional subspaces (Alvarez-Melis et al., 2018; Bonet et al., 2023; Paty & Cuturi, 2019; Niles-Weed & Rigollet, 2022; Lin et al., 2021; Huang et al., 2021; Muzellec & Cuturi, 2019). In this paper, we concentrate on the latter approach, aiming to discover novel geometric domains that meet two key criteria: (i) pushing high-dimensional measures forward onto these domains can be done in a meaningful manner, and (ii) OT problems on these domains can be solved efficiently, ideally with a closed-form solution.

**Contribution.** Our contributions are three-fold:

- We introduce the concept of tree systems, which consist of copies of the real line equipped with additional structure, and study their topology. A key property of tree systems is that they form well-defined metric spaces whose metrics are tree metrics. This property is sufficient to guarantee that OT problems on tree systems admit closed-form solutions.
- We define the space of integrable functions and probability measures on a tree system, and introduce a novel transform, called the Radon Transform on Systems of Lines. This transform naturally maps measures supported in a high-dimensional space onto tree systems, and generalizes the original Radon transform. The injectivity of this variant holds, as for other Radon transform variants in the literature.
- We propose the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) and analyze its efficiency through the closed-form solution of the OT problem on tree systems, achieving a computational cost similar to that of traditional SW.

**Organization.** The remainder of the paper is organized as follows. Section 2 provides the necessary background on the SW distance and Wasserstein distances on tree metric spaces. Section 3 gives a brief, intuitive introduction to tree systems and studies their properties, and Section 4 introduces the Radon Transform on Systems of Lines. The novel Tree-Sliced Wasserstein distance on Systems of Lines is proposed in Section 5. Finally, Section 6 contains empirical results for TSW-SL. Formal constructions, theoretical proofs of key results, and additional materials are presented in the Appendix.
## 2. Preliminaries

In this section, we review the Sliced Wasserstein (SW) distance and Wasserstein distances on metric spaces with tree metrics (TW).

**Wasserstein Distance.** Let $\Omega$ be a measurable space with a metric $d$ on $\Omega$, and let $\mu, \nu$ be two probability distributions on $\Omega$. Let $\mathcal{P}(\mu, \nu)$ be the set of probability distributions $\pi$ on the product space $\Omega \times \Omega$ such that $\pi(A \times \Omega) = \mu(A)$ and $\pi(\Omega \times B) = \nu(B)$ for all measurable sets $A, B$. For $p \ge 1$, the $p$-Wasserstein distance $W_p$ between $\mu$ and $\nu$ (Villani, 2008) is defined as:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \mathcal{P}(\mu, \nu)} \int_{\Omega \times \Omega} d(x, y)^p \, \mathrm{d}\pi(x, y) \right)^{\frac{1}{p}}. \tag{1}$$

**Sliced Wasserstein Distance.** For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, the Sliced $p$-Wasserstein distance (SW) (Rabin et al., 2011; Bonneel et al., 2015) between $\mu$ and $\nu$ is defined by:

$$\mathrm{SW}_p(\mu, \nu) = \left( \int_{\mathbb{S}^{d-1}} W_p^p\big(\mathcal{R}f_\mu(\cdot, \theta), \mathcal{R}f_\nu(\cdot, \theta)\big) \, \mathrm{d}\sigma(\theta) \right)^{\frac{1}{p}}, \tag{2}$$

where $\sigma = \mathcal{U}(\mathbb{S}^{d-1})$ is the uniform distribution on $\mathbb{S}^{d-1}$, the operator $\mathcal{R} \colon L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$ is the Radon Transform (Helgason, 2011):

$$\mathcal{R}f(t, \theta) = \int_{\mathbb{R}^d} f(x) \, \delta\big(t - \langle x, \theta \rangle\big) \, \mathrm{d}x, \tag{3}$$

and $f_\mu, f_\nu$ are the probability density functions of $\mu, \nu$, respectively. The Max Sliced Wasserstein (Max SW) distance is discussed in Appendix C.

**Monte Carlo estimation for SW.** The Monte Carlo method is usually employed to approximate the intractable integral in Equation (2) as follows:

$$\widehat{\mathrm{SW}}_p(\mu, \nu) = \left( \frac{1}{L} \sum_{l=1}^{L} W_p^p\big(\mathcal{R}f_\mu(\cdot, \theta_l), \mathcal{R}f_\nu(\cdot, \theta_l)\big) \right)^{\frac{1}{p}}, \tag{4}$$

where $\theta_1, \ldots, \theta_L$ are drawn independently from $\mathcal{U}(\mathbb{S}^{d-1})$. Using the closed-form expression of the one-dimensional Wasserstein distance, when $\mu$ and $\nu$ are discrete measures with at most $n$ supports, the computational complexity of $\widehat{\mathrm{SW}}_p$ is $\mathcal{O}(Ln \log n + Ldn)$ (Peyré et al., 2019).

**Tree Wasserstein Distances.** Consider a rooted tree $(\mathcal{T}, r)$ ($\mathcal{T}$ is a tree as a graph, with one designated node $r$ called the root) with non-negative edge lengths, and the ground metric $d_{\mathcal{T}}$, i.e., the length of the unique path between two nodes. For two distributions $\mu, \nu$ supported on the nodes of $\mathcal{T}$, the Wasserstein distance with ground cost $d_{\mathcal{T}}$, i.e., the tree-Wasserstein distance (TW) (Le et al., 2019; Indyk & Thaper, 2003), admits a closed-form expression:

$$W_{d_{\mathcal{T}},1}(\mu, \nu) = \sum_{e \in \mathcal{T}} w_e \, \big| \mu(\Gamma(v_e)) - \nu(\Gamma(v_e)) \big|, \tag{5}$$

where $v_e$ is the endpoint of edge $e$ farther from $r$, $w_e$ is the length of $e$, and $\Gamma(v_e)$ is the subtree of $\mathcal{T}$ rooted at $v_e$, i.e., the subtree consisting of all nodes $x$ whose unique path to $r$ contains $v_e$.
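To make the closed form of Equation (5) concrete, the following is a minimal numpy sketch (our own illustrative code, not from the paper) that evaluates the tree-Wasserstein distance for a tree given by parent pointers; it accumulates the subtree masses $\mu(\Gamma(v_e))$ bottom-up.

```python
import numpy as np

def tree_wasserstein(parent, edge_len, mu, nu):
    """Tree-Wasserstein distance of Equation (5).

    parent  : parent[v] is the parent node of v; the root r has parent[r] = -1
    edge_len: edge_len[v] is the length w_e of the edge (parent[v], v)
    mu, nu  : probability masses assigned to the nodes
    """
    n = len(parent)
    depth = [0] * n
    for v in range(n):                      # depth of each node (root = 0)
        p = v
        while parent[p] >= 0:
            p = parent[p]
            depth[v] += 1
    sub_mu = np.array(mu, dtype=float)      # subtree masses mu(Gamma(v_e)),
    sub_nu = np.array(nu, dtype=float)      # accumulated children-first below
    total = 0.0
    for v in sorted(range(n), key=lambda v: -depth[v]):
        if parent[v] >= 0:
            total += edge_len[v] * abs(sub_mu[v] - sub_nu[v])
            sub_mu[parent[v]] += sub_mu[v]
            sub_nu[parent[v]] += sub_nu[v]
    return total

# Example: a root (node 0) with two children at edge lengths 1 and 2; moving
# unit mass from node 1 to node 2 traverses both edges: distance 1 + 2 = 3.
print(tree_wasserstein([-1, 0, 0], [0.0, 1.0, 2.0],
                       [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]))  # 3.0
```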
Figure 1: Adding a tree structure to a system of lines. Left: an example of a system of 5 lines in $\mathbb{R}^2$, where the lines intersect, making the system connected. Right: adding a tree structure to the connected system. In this example, only four pairs of lines are adjacent, shown by intersections, while the remaining pairs are disconnected, represented by gaps. This structure is derived by taking a spanning tree of a graph with five nodes (representing the five lines), with edges connecting nodes whose lines intersect.

## 3. System of Lines with Tree Structures

This section provides a brief, intuitive introduction to systems of lines and the tree structures added to them. These structures form metric spaces, called tree systems, which serve as a generalization of the one-dimensional lines in the Sliced Wasserstein framework. We then explore the topological properties and the construction of tree systems. The ideas are illustrated in Figures 1, 2, and 3, and a complete formal construction with theoretical proofs is presented in Appendix A.

### 3.1. System of Lines and Tree System

A line in $\mathbb{R}^d$ can be fully described by specifying its direction and a point it passes through. Specifically, a line is determined by $(x, \theta) \in \mathbb{R}^d \times \mathbb{S}^{d-1}$ and is parameterized as $x + t\theta$ for $t \in \mathbb{R}$.

**Definition 3.1** (Line and System of Lines in $\mathbb{R}^d$). A line in $\mathbb{R}^d$ is an element $(x, \theta)$ of $\mathbb{R}^d \times \mathbb{S}^{d-1}$. For $k \ge 1$, a system of $k$ lines in $\mathbb{R}^d$ is a set of $k$ lines in $\mathbb{R}^d$.

We denote a line in $\mathbb{R}^d$ as $l = (x_l, \theta_l)$. Here, $x_l$ and $\theta_l$ are called the source and the direction of $l$, respectively. Denote $(\mathbb{R}^d \times \mathbb{S}^{d-1})^k$ by $\mathbb{L}^d_k$, which is the space of systems of $k$ lines in $\mathbb{R}^d$; an element of $\mathbb{L}^d_k$ is usually denoted by $L$. The ground set of a system of lines $L$ is defined by:

$$\overline{L} := \left\{ (x, l) \in \mathbb{R}^d \times L \,:\, x = x_l + t_x \theta_l \text{ for some } t_x \in \mathbb{R} \right\}.$$

For each element of $\overline{L}$, we sometimes write $(x, l)$ as $(t_x, l)$, where $t_x \in \mathbb{R}$ is the parameterization of $x$ on $l$ via $x = x_l + t_x \theta_l$. By a point of $L$, we mean a point of the ground set $\overline{L}$.

Now consider a system of distinct lines $L$ in $\mathbb{R}^d$. $L$ is said to be connected if its points form a connected set in $\mathbb{R}^d$. In this case, $L$ naturally admits certain tree structures. Figure 1 gives an example of a system of lines with an added tree structure. A pair $(L, \mathcal{T})$ consisting of a connected system of lines $L$ and a tree structure $\mathcal{T}$ of $L$ is called a tree system. We also denote it by $\mathcal{L}$ for short.

### 3.2. Topological Properties of Tree Systems

A tree system $\mathcal{L}$ can be intuitively understood as a system of lines that are connected in certain ways. It naturally forms a topological space by taking the disjoint union of copies of $\mathbb{R}$ and then taking the quotient at the intersections of these copies. The disjoint union is straightforward, and the quotient follows the tree structure of $\mathcal{L}$. The resulting topological space is called the topological space of the tree system $\mathcal{L}$ and is denoted by $\Omega_{\mathcal{L}}$. By construction, $\Omega_{\mathcal{L}}$ naturally carries a measure induced from the standard measure on each copy of $\mathbb{R}$; this measure is denoted by $\mu_{\mathcal{L}}$. Notice that, due to the tree structure, a unique path exists between any two points of $\Omega_{\mathcal{L}}$. This leads to an important result regarding the metrizability of $\Omega_{\mathcal{L}}$.

**Theorem 3.2** ($\Omega_{\mathcal{L}}$ is metrizable by a tree metric). Consider $d_{\mathcal{L}} \colon \Omega_{\mathcal{L}} \times \Omega_{\mathcal{L}} \to [0, \infty)$ defined by:

$$d_{\mathcal{L}}(a, b) := \mu_{\mathcal{L}}(P_{a,b}), \qquad a, b \in \Omega_{\mathcal{L}}, \tag{6}$$

where $P_{a,b}$ is the unique path between $a$ and $b$ in $\Omega_{\mathcal{L}}$. Then $d_{\mathcal{L}}$ is a metric on $\Omega_{\mathcal{L}}$, which makes $(\Omega_{\mathcal{L}}, d_{\mathcal{L}})$ a metric space. Moreover, $d_{\mathcal{L}}$ is a tree metric, and the topology on $\Omega_{\mathcal{L}}$ induced by $d_{\mathcal{L}}$ is identical to the topology of $\Omega_{\mathcal{L}}$.

The proof is presented in Theorem A.11. Figure 2 illustrates an example of a unique path between two points on a tree system, providing an intuitive explanation of why $d_{\mathcal{L}}$ is indeed a metric.

Figure 2: The same tree system $\mathcal{L}$ shown in Figure 1 naturally has a topology derived from five copies of $\mathbb{R}$. Consider three points $a, b, c$. The red zigzag line shows the unique path from $a$ to $b$; the distance between $a$ and $b$, i.e., $d_{\mathcal{L}}(a, b)$, is the sum of the four red line segments. Similarly for the paths between $b$ and $c$, and between $a$ and $c$. This demonstrates that the triangle inequality is satisfied for $d_{\mathcal{L}}$.
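For intuition, the sketch below (our own, assuming the chain-like tree systems constructed in Section 3.3 below, where line $i$'s source lies on line $i-1$ at parameter $\mathrm{src}[i]$ and at parameter $0$ of itself) computes $d_{\mathcal{L}}$ of Theorem 3.2 by walking the unique path through the line intersections.

```python
def chain_tree_distance(a, b, src):
    """Tree metric d_L on a chain-like tree system (Theorem 3.2).

    A point is a pair (t, i): parameter t on line i. For i >= 1, line i is
    glued to line i-1 at parameter src[i] of line i-1 and parameter 0 of
    itself (the canonical convention of Appendix A). The unique path between
    two points walks through these intersections; d_L is its total length.
    """
    (ta, i), (tb, j) = a, b
    if i > j:                        # order the two points by line index
        (ta, i), (tb, j) = (tb, j), (ta, i)
    if i == j:                       # both points on the same copy of R
        return abs(ta - tb)
    dist = abs(ta - src[i + 1])      # line i: from a to the next intersection
    for m in range(i + 1, j):        # intermediate lines: from their source
        dist += abs(src[m + 1])      #   (param 0) to the next intersection
    dist += abs(tb)                  # line j: from its source to b
    return dist

# Example with three lines: line 1 meets line 0 at t = 2.0, line 2 meets
# line 1 at t = -1.0. Distance between t = 0.5 on line 0 and t = 3.0 on
# line 2 is |0.5 - 2| + |-1| + |3| = 5.5.
src = [None, 2.0, -1.0]
print(chain_tree_distance((0.5, 0), (3.0, 2), src))  # 5.5
```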
### 3.3. Construction of Tree Systems and Sampling Process

A tree system can be built inductively by sampling lines, ensuring that each new line intersects one of the previously sampled lines. We introduce a straightforward method to construct a tree system: start by sampling a line, and at each subsequent step, sample a new line that intersects the previously selected line. Specifically, the process is as follows:

Step 1. Sample $x_1 \sim \mu_1$ for some $\mu_1 \in \mathcal{P}(\mathbb{R}^d)$, then $\theta_1 \sim \nu_1$ for some $\nu_1 \in \mathcal{P}(\mathbb{S}^{d-1})$. The pair $(x_1, \theta_1)$ forms the first line.

Step i. At step $i$, sample $x_i = x_{i-1} + t_i \theta_{i-1}$ where $t_i \sim \mu_i$ for some $\mu_i \in \mathcal{P}(\mathbb{R})$, then $\theta_i \sim \nu_i$ for some $\nu_i \in \mathcal{P}(\mathbb{S}^{d-1})$. The pair $(x_i, \theta_i)$ forms the $i$-th line.

The tree system produced by this construction has a chain-like tree structure, where the $i$-th line intersects the $(i+1)$-th line. A general approach for sampling tree systems is provided in Appendix A.4. In practice, we simply assume all the distributions $\mu$'s and $\nu$'s to be independent, and let:

- $\mu_1$ be a distribution on a bounded subset of $\mathbb{R}^d$, for instance, the uniform distribution on the $d$-dimensional cube $[-1, 1]^d$, i.e., $\mathcal{U}([-1, 1]^d)$;
- $\mu_i$ for $i > 1$ be a distribution on a bounded subset of $\mathbb{R}$, for instance, the uniform distribution on the interval $[-1, 1]$, i.e., $\mathcal{U}([-1, 1])$;
- each $\nu_i$ be a distribution on $\mathbb{S}^{d-1}$, for instance, the uniform distribution $\mathcal{U}(\mathbb{S}^{d-1})$.

Using the distributions $\mu$'s and $\nu$'s, we obtain a distribution over the space of all tree systems that can be sampled in this manner. The sampling procedure is summarized in Algorithm 1 and illustrated in Figure 3.

**Algorithm 1** Sampling (chain-like) tree systems.
Input: the number of lines in each tree system, $k$.
Sample $x_1 \sim \mathcal{U}([-1, 1]^d)$ and $\theta_1 \sim \mathcal{U}(\mathbb{S}^{d-1})$.
for $i = 2$ to $k$ do
  Sample $t_i \sim \mathcal{U}([-1, 1])$ and $\theta_i \sim \mathcal{U}(\mathbb{S}^{d-1})$.
  Compute $x_i = x_{i-1} + t_i \theta_{i-1}$.
end for
Return: $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_k, \theta_k)$.

Figure 3: Illustration of the (chain-like) tree systems sampled by Algorithm 1.
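A minimal numpy sketch of Algorithm 1 (function and variable names are our own):

```python
import numpy as np

def sample_chain_tree_system(k, d, rng=None):
    """Algorithm 1: sample a chain-like tree system of k lines in R^d.

    Returns (sources, directions), two arrays of shape (k, d); line i is
    parameterized as sources[i] + t * directions[i].
    """
    rng = rng or np.random.default_rng()
    sources = np.empty((k, d))
    directions = np.empty((k, d))
    sources[0] = rng.uniform(-1.0, 1.0, size=d)   # x_1 ~ U([-1, 1]^d)
    for i in range(k):
        g = rng.standard_normal(d)                # a normalized Gaussian vector
        directions[i] = g / np.linalg.norm(g)     # is uniform on S^{d-1}
        if i + 1 < k:
            t = rng.uniform(-1.0, 1.0)            # the next source lies on the
            sources[i + 1] = sources[i] + t * directions[i]  # current line
    return sources, directions

sources, directions = sample_chain_tree_system(k=4, d=2)
```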
## 4. Radon Transform on Systems of Lines

In this section, we introduce the space of Lebesgue integrable functions and the Radon Transform for systems of lines. Let $L \in \mathbb{L}^d_k$ be a system of $k$ lines. Denote by $L^1(\mathbb{R}^d)$ the space of Lebesgue integrable functions on $\mathbb{R}^d$ with norm $\|\cdot\|_1$, i.e.,

$$L^1(\mathbb{R}^d) = \left\{ f \colon \mathbb{R}^d \to \mathbb{R} \,:\, \|f\|_1 = \int_{\mathbb{R}^d} |f(x)| \, \mathrm{d}x < \infty \right\}. \tag{7}$$

Two functions $f_1, f_2 \in L^1(\mathbb{R}^d)$ are considered identical if $f_1(x) = f_2(x)$ almost everywhere on $\mathbb{R}^d$. As a counterpart, a Lebesgue integrable function on $L$ is a function $f \colon \overline{L} \to \mathbb{R}$ such that:

$$\sum_{l \in L} \int_{\mathbb{R}} |f(t_x, l)| \, \mathrm{d}t_x < \infty. \tag{8}$$

The space of Lebesgue integrable functions on $L$ is denoted by $L^1(L)$. Two functions $f_1, f_2 \in L^1(L)$ are considered identical if $f_1(x) = f_2(x)$ almost everywhere on $\overline{L}$. The space $L^1(L)$ with norm $\|\cdot\|_L$ is a Banach space.

Recall that $L$ has $k$ lines. Denote the $(k-1)$-dimensional standard simplex by $\Delta_{k-1} = \{(a_l)_{l \in L} : a_l \ge 0 \text{ and } \sum_{l \in L} a_l = 1\} \subset \mathbb{R}^k$, and denote by $C(\mathbb{R}^d, \Delta_{k-1})$ the space of continuous maps from $\mathbb{R}^d$ to $\Delta_{k-1}$. A map in $C(\mathbb{R}^d, \Delta_{k-1})$ is referred to as a splitting map.

Let $L$ be a system of $k$ lines in $\mathbb{L}^d_k$ and let $\alpha$ be a splitting map in $C(\mathbb{R}^d, \Delta_{k-1})$. We define an operator associated with $\alpha$ that transforms a Lebesgue integrable function on $\mathbb{R}^d$ into a Lebesgue integrable function on $L$, analogous to the original Radon Transform. For $f \in L^1(\mathbb{R}^d)$, define:

$$\begin{aligned} \mathcal{R}^\alpha_L f \colon \overline{L} &\to \mathbb{R} \\ (x, l) &\mapsto \int_{\mathbb{R}^d} f(y) \, \alpha(y)_l \, \delta\big(t_x - \langle y - x_l, \theta_l \rangle\big) \, \mathrm{d}y, \end{aligned} \tag{9}$$

where $\delta$ is the 1-dimensional Dirac delta function. For $f \in L^1(\mathbb{R}^d)$, we can show that $\mathcal{R}^\alpha_L f \in L^1(L)$; moreover, $\|\mathcal{R}^\alpha_L f\|_L \le \|f\|_1$. In other words, the operator $\mathcal{R}^\alpha_L \colon L^1(\mathbb{R}^d) \to L^1(L)$ is well-defined and linear. The proof of these properties is presented in Theorem B.2. We now propose a novel variant of the Radon Transform for systems of lines.

**Definition 4.1** (Radon Transform on Systems of Lines). For $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$, the operator

$$\begin{aligned} \mathcal{R}^\alpha \colon L^1(\mathbb{R}^d) &\to \prod_{L \in \mathbb{L}^d_k} L^1(L) \\ f &\mapsto \big(\mathcal{R}^\alpha_L f\big)_{L \in \mathbb{L}^d_k} \end{aligned}$$

is called the Radon Transform on Systems of Lines.

**Remark.** An illustration of splitting maps and the Radon Transform on Systems of Lines is presented in Figure 4. Intuitively, the splitting map $\alpha$ indicates how the mass at a given point is distributed across the lines of a system of lines. In the case $k = 1$, the only splitting map is the constant function $1$, and the Radon Transform for $\mathbb{L}^d_1$ is identical to the traditional Radon Transform.

Figure 4: An illustration of the Radon Transform on Systems of Lines. Given $f \in L^1(\mathbb{R}^d)$ such that $f(x) = 0.6$ and $f(y) = 0.4$, and a system $L$ of 3 lines. For a splitting map $\alpha$ with $\alpha(x) = (1/6, 3/6, 2/6)$ and $\alpha(y) = (1/4, 2/4, 1/4)$, $f$ is transformed to $\mathcal{R}^\alpha_L f$. By Equation (9), for instance, the value of $\mathcal{R}^\alpha_L f$ at the projection of $x$ onto line (2) of $L$ is $f(x) \cdot \alpha(x)_2 = 0.3$.

Many variants of the Radon transform require the transform to be injective. In the case of systems of lines, injectivity also holds for $\mathcal{R}^\alpha$.

**Theorem 4.2.** $\mathcal{R}^\alpha$ is injective for all splitting maps $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$.

The proof of this theorem is presented in Theorem B.1. Denote by $\mathcal{P}(\mathbb{R}^d)$ the space of all probability distributions on $\mathbb{R}^d$, and define a probability distribution on $L$ to be a function $f \in L^1(L)$ such that $f \colon \overline{L} \to [0, \infty)$ and $\|f\|_L = 1$. The space of probability distributions on $L$ is denoted by $\mathcal{P}(L)$. Then $\mathcal{R}^\alpha_L$ maps a distribution in $\mathcal{P}(\mathbb{R}^d)$ to a distribution in $\mathcal{P}(L)$. In other words, the restricted operator $\mathcal{R}^\alpha_L \colon \mathcal{P}(\mathbb{R}^d) \to \mathcal{P}(L)$ is also well-defined.
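As a concrete illustration of how $\mathcal{R}^\alpha_L$ acts on discrete measures (the case used in Section 5.2 below), the following numpy sketch, with names of our own choosing, projects each support point onto every line and splits its mass according to $\alpha$:

```python
import numpy as np

def push_forward(X, masses, sources, directions, alpha):
    """Push a discrete measure sum_i masses[i] * delta(X[i]) onto a system
    of k lines via R^alpha_L (Equation (9)).

    X          : (n, d) support points
    masses     : (n,)   point masses
    sources    : (k, d) line sources x_l
    directions : (k, d) unit direction vectors theta_l
    alpha      : (n, k) splitting weights, each row on the simplex Delta_{k-1}
    Returns t (n, k), the coordinate of each projected point on each line,
    and mass (n, k) with mass[i, l] = masses[i] * alpha[i, l].
    """
    # t[i, l] = <X[i] - x_l, theta_l>: parameter of the projection of X[i] on l
    t = np.einsum('nld,ld->nl', X[:, None, :] - sources[None, :, :], directions)
    mass = masses[:, None] * alpha
    return t, mass

# The example of Figure 4: two points with masses 0.6 and 0.4 on 3 lines;
# mass[0, 1] = 0.6 * 3/6 = 0.3, the value at the projection of x onto line (2).
rng = np.random.default_rng(0)
X, masses = rng.normal(size=(2, 2)), np.array([0.6, 0.4])
alpha = np.array([[1/6, 3/6, 2/6], [1/4, 2/4, 1/4]])
sources, directions = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
t, mass = push_forward(X, masses, sources, directions, alpha)
```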
## 5. Tree-Sliced Wasserstein Distance on Systems of Lines

In this section, we present the novel Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL). Let $\mathbb{T}$ be the space of tree systems consisting of $k$ lines in $\mathbb{R}^d$ that can be sampled by Algorithm 1. By the remark at the end of Subsection 3.3, we have a distribution $\sigma$ on the space $\mathbb{T}$. The general cases of $\mathbb{T}$, as in Appendix A.4, are handled in a similar manner. For simplicity and convenience, we occasionally use the same notation to represent both a measure and its probability density function, provided the context makes the meaning clear.

### 5.1. Tree-Sliced Wasserstein Distance on Systems of Lines

Consider a splitting map $\alpha$ in $C(\mathbb{R}^d, \Delta_{k-1})$. Given two probability distributions $\mu, \nu$ in $\mathcal{P}(\mathbb{R}^d)$ with density functions $f_\mu, f_\nu$, respectively, and a tree system $\mathcal{L} \in \mathbb{T}$. By the Radon Transform $\mathcal{R}^\alpha_{\mathcal{L}}$ of Definition 4.1, $f_\mu$ and $f_\nu$ are transformed to $\mathcal{R}^\alpha_{\mathcal{L}} f_\mu$ and $\mathcal{R}^\alpha_{\mathcal{L}} f_\nu$ in $\mathcal{P}(\mathcal{L})$. By Theorem 3.2, $\mathcal{L}$ carries a tree metric $d_{\mathcal{L}}$, so we can compute the Wasserstein distance $W_{d_{\mathcal{L}},1}(\mathcal{R}^\alpha_{\mathcal{L}} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}} f_\nu)$ via Equation (5).

**Definition 5.1** (Tree-Sliced Wasserstein Distance on Systems of Lines). The Tree-Sliced Wasserstein distance on Systems of Lines between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by:

$$\mathrm{TSW\text{-}SL}(\mu, \nu) := \int_{\mathbb{T}} W_{d_{\mathcal{L}},1}\big(\mathcal{R}^\alpha_{\mathcal{L}} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}} f_\nu\big) \, \mathrm{d}\sigma(\mathcal{L}). \tag{10}$$

**Remark.** The definition of TSW-SL depends on the space of sampled tree systems $\mathbb{T}$, the distribution $\sigma$ on $\mathbb{T}$, and the splitting map $\alpha$; to simplify the notation, we omit them. TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$; the proof of the theorem below is provided in Appendix D.1.

**Theorem 5.2.** TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$.

**Remark.** If the tree systems in $\mathbb{T}$ consist of only one line, i.e., $k = 1$, then the splitting map $\alpha$ in Definition 4.1 is the constant map $1$, and the Radon Transform $\mathcal{R}^\alpha$ becomes identical to the original Radon Transform, as pushing measures forward onto lines then depends only on their directions. Also, according to the sampling process described in Subsection 3.3, $\sigma$ becomes the distribution of $\theta_1$, which is $\mathcal{U}(\mathbb{S}^{d-1})$. In this case, TSW-SL in Equation (10) is identical to SW in Equation (2). Furthermore, in Appendix C, we introduce the Max Tree-Sliced Wasserstein Distance on Systems of Lines (Max TSW-SL), an analog of Max SW (Deshpande et al., 2019).

### 5.2. Computing TSW-SL

We employ the Monte Carlo method to estimate the intractable integral in Equation (10) as follows:

$$\widehat{\mathrm{TSW\text{-}SL}}(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} W_{d_{\mathcal{L}_l},1}\big(\mathcal{R}^\alpha_{\mathcal{L}_l} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}_l} f_\nu\big), \tag{11}$$

where $\mathcal{L}_1, \ldots, \mathcal{L}_L \overset{\text{i.i.d.}}{\sim} \sigma$ are referred to as projecting tree systems. We now discuss how to compute $W_{d_{\mathcal{L}},1}(\mathcal{R}^\alpha_{\mathcal{L}} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}} f_\nu)$ for $\mathcal{L} \in \mathbb{T}$. In applications, consider $f_\mu, f_\nu \in \mathcal{P}(\mathbb{R}^d)$ given as discrete measures:

$$\mu(x) = \sum_{i=1}^{n} u_i \, \delta(x - a_i) \quad \text{and} \quad \nu(x) = \sum_{i=1}^{m} v_i \, \delta(x - b_i). \tag{12}$$

$\mathcal{R}^\alpha_{\mathcal{L}}$ pushes $f_\mu, f_\nu$ onto $\mathcal{L}$, resulting in discrete measures $\mathcal{R}^\alpha_{\mathcal{L}} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}} f_\nu$ in $\mathcal{P}(\mathcal{L})$. In detail, by the definition of $\mathcal{R}^\alpha_{\mathcal{L}} f_\mu$, the support of $\mathcal{R}^\alpha_{\mathcal{L}} f_\mu$ is the set of all projections of the support of $\mu$ onto the lines of $\mathcal{L}$, and the mass of $\mathcal{R}^\alpha_{\mathcal{L}} f_\mu$ at the projection of $a_i$ onto line $l$ equals $\alpha(a_i)_l \cdot u_i$; similarly for $\mathcal{R}^\alpha_{\mathcal{L}} f_\nu$. From this description, together with Equation (5), we derive a closed-form expression:

$$W_{d_{\mathcal{L}},1}\big(\mathcal{R}^\alpha_{\mathcal{L}} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}} f_\nu\big) = \sum_{e \in \mathcal{T}} w_e \, \big| \mathcal{R}^\alpha_{\mathcal{L}} f_\mu(\Gamma(v_e)) - \mathcal{R}^\alpha_{\mathcal{L}} f_\nu(\Gamma(v_e)) \big|. \tag{13}$$

This expression enables an efficient and highly parallelizable implementation of TSW-SL, as it relies on fundamental operations like matrix multiplication and sorting. Appendix B.3 provides a detailed illustration of the underlying mechanism of this expression.

**Remark.** Assuming $n \ge m$, the time complexity of TSW-SL is $\mathcal{O}(Lkn \log n + Lkdn)$, since it primarily involves projecting onto $L \cdot k$ lines and sorting $n$ projections on each line. This complexity is equivalent to that of SW when the number of projecting directions is the same. Therefore, in our experiments, we ensure a fair comparison by evaluating the performance of TSW-SL against SW and its variants using the same number of projecting directions. We summarize this section with Algorithm 2 for computing TSW-SL.

**Algorithm 2** Tree-Sliced Wasserstein distance on Systems of Lines.
Input: $\mu$ and $\nu$ in $\mathcal{P}(\mathbb{R}^d)$, the number of lines in each tree system $k$, the number of tree systems $L$, a splitting map $\alpha \colon \mathbb{R}^d \to \Delta_{k-1}$.
for $l = 1$ to $L$ do
  Sample a tree system $\mathcal{L}_l = \{(x^{(l)}_1, \theta^{(l)}_1), \ldots, (x^{(l)}_k, \theta^{(l)}_k)\}$.
  Project $\mu$ and $\nu$ onto $\mathcal{L}_l$ to get $\mathcal{R}^\alpha_{\mathcal{L}_l} f_\mu$ and $\mathcal{R}^\alpha_{\mathcal{L}_l} f_\nu$.
  Compute $W_{d_{\mathcal{L}_l},1}(\mathcal{R}^\alpha_{\mathcal{L}_l} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}_l} f_\nu)$.
end for
Compute $\widehat{\mathrm{TSW\text{-}SL}}(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} W_{d_{\mathcal{L}_l},1}(\mathcal{R}^\alpha_{\mathcal{L}_l} f_\mu, \mathcal{R}^\alpha_{\mathcal{L}_l} f_\nu)$.
Return: $\widehat{\mathrm{TSW\text{-}SL}}(\mu, \nu)$.
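To make Equations (11) and (13) concrete, here is a minimal, readability-first numpy sketch of one Monte Carlo term for the chain-like tree systems of Algorithm 1. It builds the tree of Corollary A.12 (projected points plus the sources where consecutive lines meet) and then accumulates subtree masses as in Equation (5). The function name and the node bookkeeping are our own illustrative choices, not the paper's reference implementation, which instead vectorizes these steps with matrix multiplication and sorting.

```python
import numpy as np

def tsw_sl_one_tree(X, u, Y, v, sources, directions, alpha_x, alpha_y):
    """One Monte Carlo term of Equation (11): the tree-Wasserstein distance
    (Equation (13)) between the pushforwards of two discrete measures onto
    a chain-like tree system sampled by Algorithm 1.

    X: (n, d), u: (n,)  -- supports and masses of mu
    Y: (m, d), v: (m,)  -- supports and masses of nu
    sources, directions: (k, d) -- the tree system
    alpha_x: (n, k), alpha_y: (m, k) -- splitting weights on Delta_{k-1}
    """
    k = len(sources)
    # Projections t = <p - x_l, theta_l> on each line (Equation (9)).
    tx = np.einsum('nkd,kd->nk', X[:, None, :] - sources[None], directions)
    ty = np.einsum('mkd,kd->mk', Y[:, None, :] - sources[None], directions)
    mu_mass, nu_mass = u[:, None] * alpha_x, v[:, None] * alpha_y
    # src_t[i] = parameter of line (i+1)'s source on line i (chain joints).
    src_t = np.einsum('kd,kd->k', sources[1:] - sources[:-1], directions[:-1])

    parents, weights, node_mu, node_nu = [], [], [], []
    attach = [None] * k          # node id, on line i-1, where line i attaches
    for i in range(k):
        # Events on line i: its own source (param 0), all projections, and
        # the joint where line i+1 will attach.
        ts = [(0.0, 0.0, 0.0, 'self')]
        ts += [(tx[a, i], mu_mass[a, i], 0.0, '') for a in range(len(X))]
        ts += [(ty[b, i], 0.0, nu_mass[b, i], '') for b in range(len(Y))]
        if i + 1 < k:
            ts.append((src_t[i], 0.0, 0.0, 'child'))
        ts.sort(key=lambda e: e[0])
        base = len(parents)
        anchor = next(j for j, e in enumerate(ts) if e[3] == 'self')
        for j, (t, m_mu, m_nu, tag) in enumerate(ts):
            if j == anchor:      # same point of Omega_L as attach[i]: glue
                parents.append(-1 if i == 0 else attach[i])
                weights.append(0.0)
            elif j < anchor:     # edges point rightward, toward the anchor
                parents.append(base + j + 1)
                weights.append(ts[j + 1][0] - t)
            else:                # edges point leftward, toward the anchor
                parents.append(base + j - 1)
                weights.append(t - ts[j - 1][0])
            node_mu.append(m_mu)
            node_nu.append(m_nu)
            if tag == 'child':
                attach[i + 1] = base + j

    # Equation (5): accumulate subtree masses bottom-up (children first).
    node_mu, node_nu = np.array(node_mu), np.array(node_nu)
    depth = np.zeros(len(parents), dtype=int)
    for idx in range(len(parents)):
        p = idx
        while parents[p] >= 0:
            p = parents[p]
            depth[idx] += 1
    total = 0.0
    for idx in np.argsort(-depth):
        p = parents[idx]
        if p >= 0:
            total += weights[idx] * abs(node_mu[idx] - node_nu[idx])
            node_mu[p] += node_mu[idx]
            node_nu[p] += node_nu[idx]
    return total
```

Averaging `tsw_sl_one_tree` over $L$ tree systems drawn with `sample_chain_tree_system` from the sketch above, with, e.g., the constant splitting map $\alpha = (1/k, \ldots, 1/k)$, yields the estimator $\widehat{\mathrm{TSW\text{-}SL}}$ of Equation (11), i.e., Algorithm 2.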
Table 1: Average Wasserstein distance between source and target distributions over 10 runs on the Swiss Roll and 25 Gaussians datasets. All methods use 100 projecting directions.

Swiss Roll:

| Method | Iter. 500 | Iter. 1000 | Iter. 1500 | Iter. 2000 | Iter. 2500 | Time/Iter (s) |
|---|---|---|---|---|---|---|
| SW | 5.73e-3 | 2.04e-3 | 1.23e-3 | 1.11e-3 | 1.05e-3 | 0.009 |
| Max SW | 2.47e-2 | 1.03e-2 | 6.10e-3 | 4.47e-3 | 3.45e-3 | 2.46 |
| SWGG | 3.84e-2 | 1.53e-2 | 1.02e-2 | 4.49e-3 | 3.57e-5 | 0.011 |
| LCVSW | 7.28e-3 | 1.40e-3 | 1.38e-3 | 1.38e-3 | 1.36e-3 | 0.010 |
| TSW-SL | 9.41e-3 | 2.03e-7 | 9.63e-8 | 4.44e-8 | 3.65e-8 | 0.014 |
| Max TSW-SL | 2.75e-6 | 8.24e-7 | 5.14e-7 | 5.02e-7 | 5.00e-7 | 2.53 |

25 Gaussians:

| Method | Iter. 500 | Iter. 1000 | Iter. 1500 | Iter. 2000 | Iter. 2500 | Time/Iter (s) |
|---|---|---|---|---|---|---|
| SW | 1.61e-1 | 9.52e-2 | 3.44e-2 | 2.56e-2 | 2.20e-2 | 0.006 |
| Max SW | 5.09e-1 | 2.36e-1 | 1.33e-1 | 9.70e-2 | 8.48e-2 | 2.38 |
| SWGG | 3.10e-1 | 1.17e-1 | 3.38e-2 | 3.58e-3 | 2.54e-4 | 0.009 |
| LCVSW | 3.38e-1 | 6.64e-2 | 3.06e-2 | 3.06e-2 | 3.02e-2 | 0.009 |
| TSW-SL | 3.49e-1 | 9.06e-2 | 2.96e-2 | 1.20e-2 | 3.03e-7 | 0.010 |
| Max TSW-SL | 1.12e-1 | 8.28e-3 | 1.61e-6 | 7.32e-7 | 5.19e-7 | 2.49 |

## 6. Experimental Results

In this section, we present empirical results demonstrating the advantages of our TSW-SL distance over the traditional SW distance and its variants, and show how Max TSW-SL enhances the original Max SW (Deshpande et al., 2019) through optimized tree construction. The splitting map $\alpha$ is selected either as a trainable constant vector or as a random vector, while the tree systems are sampled such that the root is positioned near the mean of the target distribution, i.e., the data mean. It is worth noting that this paper presents a simple alternative that substitutes the lines in SW with tree systems, so we focus mainly on comparing TSW-SL with the original SW, without expecting TSW-SL to outperform more recent SW variants. Further improvements to TSW-SL could be made by incorporating advanced techniques developed for SW, but we leave this for future research, choosing instead to focus on the fundamental aspects of TSW-SL.

To provide a comprehensive assessment, our experiments focus on key tasks where SW has been widely applied in the literature: gradient flows and generative models (including GANs and denoising diffusion models). We first demonstrate TSW-SL's performance on gradient flow tasks using the 25 Gaussians and Swiss Roll datasets, highlighting its ability to capture complex topological properties, with additional higher-dimensional results in the Appendix. Next, we evaluate generative models across various datasets, ranging from the simple CIFAR-10 to the more complex STL-10, assessing scalability and robustness. We refer the reader to Appendix E for additional experiments on color transfer and ablation studies on the effect of the number of lines $k$ on the performance of generative adversarial networks.

### 6.1. Gradient Flows

First, we conduct experiments to compare the effectiveness of our methods with the baselines on the gradient flow task. In this task, we aim to minimize $\mathrm{TSW\text{-}SL}(\mu, \nu)$, where $\nu$ is the target distribution and $\mu$ represents the source distribution. The optimization is carried out iteratively as

$$\partial_t \mu_t = -\nabla \, \mathrm{TSW\text{-}SL}(\mu_t, \nu),$$

where $\partial_t \mu_t$ represents the change in the source distribution over time and $\nabla \, \mathrm{TSW\text{-}SL}(\mu_t, \nu)$ is the gradient of TSW-SL with respect to $\mu_t$. We initialize with $\mu_0 = \mathcal{N}(0, 1)$ and iteratively update $\mu_t$ over 2500 iterations. To compare the effectiveness of the various distance metrics, we employ alternative distances as loss functions (SW (Bonneel et al., 2015), Max SW (Deshpande et al., 2019), SWGG (Mahey et al., 2023), and LCVSW (Nguyen & Ho, 2023)) in place of TSW-SL. Over the 2500 timesteps, we evaluate the Wasserstein distance between the source and target distributions at iterations 500, 1000, 1500, 2000, and 2500. We use $L = 100$ in the SW variants and $L = 25$, $k = 4$ in TSW-SL for a fair comparison. Detailed training settings are presented in Appendix E.1.

We first use the Swiss Roll (a non-linear dataset) and 25 Gaussians (a multimodal dataset), as described in (Kolouri et al., 2019). In Table 1, we report the performance and runtime of the various methods on these datasets, emphasizing the reduction of the Wasserstein distance over iterations. Notably, across both datasets, our TSW-SL method demonstrates superior performance by significantly reducing the Wasserstein distance. Moreover, our Max TSW-SL method shows a significant decrease in the Wasserstein distance compared to Max SW, highlighting its improved performance and effectiveness.
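For illustration, a minimal PyTorch-style sketch of the explicit-Euler gradient flow described above, acting on particle locations; `tsw_sl` here is a hypothetical differentiable implementation of Equation (11) (e.g., the numpy sketches above ported to torch), not an API provided by the paper:

```python
import torch

def gradient_flow(tsw_sl, Y, n=256, d=2, steps=2500, lr=1e-2):
    """Explicit-Euler gradient flow: d/dt mu_t = -grad TSW-SL(mu_t, nu).

    tsw_sl: assumed differentiable estimator of Equation (11) (hypothetical)
    Y     : (m, d) tensor of target supports (nu, with uniform masses)
    """
    X = torch.randn(n, d, requires_grad=True)   # mu_0 = N(0, I), as particles
    opt = torch.optim.SGD([X], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = tsw_sl(X, Y)                     # one Monte Carlo estimate
        loss.backward()                         # gradient w.r.t. particle locations
        opt.step()
    return X.detach()
```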
Furthermore, we provide additional results from experiments on 10-, 50-, 75-, 100-, 150-, and 200-dimensional Gaussian distributions, where the target distribution supports are sampled from these high-dimensional spaces, to showcase the empirical advantages of TSW-SL in capturing topological properties. In this setting, we compare the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) with the Sliced Wasserstein distance (SW) to demonstrate the effectiveness of TSW-SL when the distribution supports lie in high-dimensional spaces. The results in Table 2 highlight TSW-SL's superior ability to preserve the topological properties of the original data compared to SW.

Table 2: Average Wasserstein distance between source and target distributions over 10 runs on high-dimensional datasets. Each cell reports SW / TSW-SL.

| Dimension | Iter. 500 | Iter. 1000 | Iter. 1500 | Iter. 2000 | Iter. 2500 | Time/Iter (s) |
|---|---|---|---|---|---|---|
| 10 | 4.32e-3 / 2.81e-3 | 2.94e-3 / 2.00e-3 | 2.81e-3 / 1.55e-3 | 2.23e-3 / 1.59e-3 | 2.28e-3 / 1.75e-3 | 0.010 / 0.015 |
| 50 | 50.41 / 39.26 | 45.69 / 21.91 | 42.56 / 11.91 | 38.81 / 4.08 | 35.75 / 1.72 | 0.014 / 0.018 |
| 75 | 92.39 / 79.71 | 90.79 / 67.99 | 90.07 / 53.92 | 86.58 / 44.91 | 90.31 / 31.61 | 0.015 / 0.018 |
| 100 | 130.12 / 117.66 | 128.13 / 103.23 | 128.58 / 93.41 | 129.80 / 80.46 | 128.29 / 75.28 | 0.018 / 0.019 |
| 150 | 214.09 / 203.30 | 213.71 / 190.62 | 215.05 / 186.77 | 212.90 / 183.52 | 216.32 / 182.63 | 0.020 / 0.022 |
| 200 | 302.84 / 289.83 | 301.35 / 283.34 | 303.07 / 276.94 | 302.70 / 279.24 | 301.51 / 279.08 | 0.020 / 0.021 |

Table 3: Average FID and IS scores (mean ± std over 3 runs) on CelebA and STL-10 with SNGAN.

| Method | CelebA (64×64) FID (↓) | STL-10 (96×96) FID (↓) | STL-10 (96×96) IS (↑) |
|---|---|---|---|
| SW (L = 50) | 9.97 ± 1.02 | 69.46 ± 0.21 | 9.08 ± 0.06 |
| SW (L = 500) | 9.62 ± 0.42 | 53.52 ± 0.61 | 10.56 ± 0.05 |
| TSW-SL (L = 10, k = 5) | 9.63 ± 0.46 | 61.15 ± 0.37 | 10.00 ± 0.03 |
| TSW-SL (L = 100, k = 5) | 8.90 ± 0.49 | 51.81 ± 1.02 | 10.74 ± 0.13 |
| TSW-SL (L = 17, k = 3) | 8.98 ± 0.75 | 65.91 ± 0.64 | 9.75 ± 0.10 |
| TSW-SL (L = 167, k = 3) | 8.90 ± 0.38 | 52.27 ± 0.96 | 10.62 ± 0.18 |

### 6.2. Generative Adversarial Networks

We then explore the capabilities of the proposed TSW-SL framework in the context of generative adversarial networks (GANs). We employ the SNGAN architecture (Miyato et al., 2018); in detail, our approach is based on the methodology of the Sliced Wasserstein generator (Deshpande et al., 2018), with details provided in Appendix E.3. Specifically, we conduct deep generative modeling experiments on the non-cropped CelebA dataset (Krizhevsky, 2009) with image size 64×64, and on the STL-10 dataset (Wang & Tan, 2016) with image size 96×96. To demonstrate the empirical advantage of our method in enhancing generative adversarial networks, we employ two primary metrics: the Fréchet Inception Distance (FID) score (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016). We omit the IS for the CelebA dataset, as it does not effectively capture the perceptual quality of face images (Heusel et al., 2017).

Table 3 presents the results of the SW and TSW-SL methodologies on the CelebA and STL-10 datasets, using FID and IS as metrics. We conduct experiments with two budgets of projecting directions: for 50 projecting directions, we use $L = 50$ in SW compared to $L = 10, k = 5$ and $L = 17, k = 3$ in TSW-SL; for 500 projecting directions, we use $L = 500$ in SW compared to $L = 100, k = 5$ and $L = 167, k = 3$ in TSW-SL. Our results reveal that TSW-SL consistently outperforms SW, with a considerable performance gap on both datasets in terms of IS and FID.
We provide additional qualitative results and an ablation study with respect to the number of lines in Appendix E.3.

### 6.3. Denoising Diffusion Models

Finally, we turn to denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), which are among the most complex generative frameworks for image generation. Diffusion models consist of a forward process that gradually adds Gaussian noise to data and a reverse process that learns to denoise the data. The forward process is defined as a Markov chain of $T$ steps, where each step adds noise according to a predefined schedule. The reverse process, parameterized by $\theta$, aims to learn the denoising distribution. Traditionally, these models are trained via maximum likelihood by optimizing the evidence lower bound (ELBO). To accelerate generation, however, denoising diffusion GANs (Xiao et al., 2021) introduce an implicit denoising model and employ adversarial training. In our work, we build upon the framework of (Nguyen et al., 2024b), replacing the Augmented Generalized Mini-batch Energy distance with our novel TSW-SL distance as the kernel, and conduct experiments on the CIFAR-10 dataset (Krizhevsky, 2009). For a detailed description of the model architecture and training loss, we refer readers to Appendix E.4.

Table 4 demonstrates that our TSW-SL loss function significantly enhances FID performance compared to conventional SW: an improvement of 22.43% over DDGAN and 2.4% over SW-DD. This improvement underscores the efficacy of our approach in generating high-quality samples with improved fidelity.

Table 4: Results for unconditional generation on CIFAR-10 with denoising diffusion models.

| Model | FID (↓) | Time/Epoch (s) (↓) |
|---|---|---|
| DDGAN (Xiao et al., 2021) | 3.64 | 136 |
| SW-DD (Nguyen et al., 2024b) | 2.90 | 140 |
| TSW-SL-DD (Ours) | 2.83 | 163 |

## 7. Conclusion

This paper proposes a novel method called Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL), which replaces the traditional one-dimensional lines in the Sliced Wasserstein (SW) framework with tree systems, providing a more geometrically meaningful space. This key innovation enables TSW-SL to capture more detailed structural information and geometric relationships within the data than SW, while preserving computational efficiency. We rigorously develop the theoretical basis for our approach, verifying the essential properties of the Radon Transform, and empirically demonstrate the benefits of TSW-SL across a range of application tasks. As this paper introduces a straightforward alternative that replaces the one-dimensional lines in SW with tree systems, our primary comparison is between TSW-SL and the original SW, without anticipating that TSW-SL will surpass more recent SW variants. Adapting recent advanced techniques from the SW framework to TSW-SL remains an open direction and is anticipated to lead to improved performance for Sliced Optimal Transport overall.

## Acknowledgements

We thank the area chairs and anonymous reviewers for their comments. TL gratefully acknowledges the support of JSPS KAKENHI Grant number 23K11243, and a Mitsui Knowledge Industry Co., Ltd. grant.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Alvarez-Melis, D., Jaakkola, T., and Jegelka, S. Structured optimal transport.
In International Conference on Artificial Intelligence and Statistics, pp. 1771-1780. PMLR, 2018.

Bai, Y., Schmitzer, B., Thorpe, M., and Kolouri, S. Sliced optimal partial transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13681-13690, 2023.

Bonet, C., Courty, N., Septier, F., and Drumetz, L. Efficient gradient flows in sliced-Wasserstein space. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=Au1LNKmRvh.

Bonet, C., Chapel, L., Drumetz, L., and Courty, N. Hyperbolic sliced-Wasserstein via geodesic and horospherical projections. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 334-370. PMLR, 2023.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22-45, 2015.

Bunne, C., Papaxanthos, L., Krause, A., and Cuturi, M. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511-6528. PMLR, 2022.

Chapel, L. and Tavenard, R. One for all and all for one: Efficient computation of partial Wasserstein distances on the line. In International Conference on Learning Representations, 2025.

Chapel, L., Tavenard, R., and Vaiter, S. Differentiable generalized sliced Wasserstein plans. arXiv preprint arXiv:2505.22049, 2025.

Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.

Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3483-3491, 2018.

Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648-10656, 2019.

Fan, J., Haasler, I., Karlsson, J., and Chen, Y. On the complexity of the optimal transport problem with graph-structured cost. In International Conference on Artificial Intelligence and Statistics, pp. 9147-9165. PMLR, 2022.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020.

Hatcher, A. Algebraic Topology. 2005.

Helgason, S. The Radon transform on R^n. Integral Geometry and Radon Transforms, pp. 1-62, 2011.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Ho, N., Nguyen, X., Yurochkin, M., Bui, H. H., Huynh, V., and Phung, D. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501-1509. PMLR, 2017.

Hua, X., Nguyen, T., Le, T., Blanchet, J., and Nguyen, V. A. Dynamic flows on curved space generated by labeled data. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pp. 3803-3811, 2023.

Huang, M., Ma, S., and Lai, L.
A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning, pp. 4446-4455. PMLR, 2021.

Indyk, P. and Thaper, N. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, volume 2, pp. 5, 2003.

Kessler, S., Le, T., and Nguyen, V. SAVA: Scalable learning-agnostic data valuation. In The Thirteenth International Conference on Learning Representations, 2025.

Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolouri, S., Rohde, G. K., and Hoffmann, H. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427-3436, 2018.

Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. URL https://api.semanticscholar.org/CorpusID:18268744.

Lavenant, H., Claici, S., Chien, E., and Solomon, J. Dynamical optimal transport on discrete surfaces. In SIGGRAPH Asia 2018 Technical Papers, pp. 250. ACM, 2018.

Le, T., Yamada, M., Fukumizu, K., and Cuturi, M. Tree-sliced variants of Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Le, T., Nguyen, T., Phung, D., and Nguyen, V. A. Sobolev transport: A scalable metric for probability measures with graph metrics. In International Conference on Artificial Intelligence and Statistics, pp. 9844-9868. PMLR, 2022.

Le, T., Nguyen, T., and Fukumizu, K. Generalized Sobolev transport for probability measures on a graph. In Forty-first International Conference on Machine Learning, 2024a.

Le, T., Nguyen, T., and Fukumizu, K. Optimal transport for measures with noisy tree metric. In International Conference on Artificial Intelligence and Statistics. PMLR, 2024b.

Lin, T., Zheng, Z., Chen, E., Cuturi, M., and Jordan, M. I. On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics, pp. 262-270. PMLR, 2021.

Lin, Y.-W. E., Coifman, R. R., Mishne, G., and Talmon, R. Tree-Wasserstein distance for high dimensional data with a latent feature hierarchy. In The Thirteenth International Conference on Learning Representations, 2025.

Liu, L., Pal, S., and Harchaoui, Z. Entropy regularized optimal transport independence criterion. In International Conference on Artificial Intelligence and Statistics, pp. 11247-11279. PMLR, 2022.

Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., and Stöter, F.-R. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pp. 4104-4113. PMLR, 2019.

Luong, M., Nguyen, K., Ho, N., Haf, R., Phung, D., and Qu, L. Revisiting deep audio-text retrieval through the lens of transportation. arXiv preprint arXiv:2405.10084, 2024.

Mahey, G., Chapel, L., Gasso, G., Bonet, C., and Courty, N. Fast optimal transport through sliced generalized Wasserstein geodesics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=n3XuYdvhNW.

Mena, G. and Niles-Weed, J. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems, pp.
4541-4551, 2019.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Munkres, J. R. Elements of Algebraic Topology. CRC Press, 2018.

Muzellec, B. and Cuturi, M. Subspace detours: Building transport plans that are optimal on subspace projections. Advances in Neural Information Processing Systems, 32, 2019.

Nadjahi, K., Durmus, A., Jacob, P. E., Badeau, R., and Simsekli, U. Fast approximation of the sliced-Wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34:12411-12424, 2021.

Nguyen, K. and Ho, N. Amortized projection optimization for sliced Wasserstein generative models. Advances in Neural Information Processing Systems, 35:36985-36998, 2022.

Nguyen, K. and Ho, N. Sliced Wasserstein estimation with control variates. arXiv preprint arXiv:2305.00402, 2023.

Nguyen, K. and Ho, N. Energy-based sliced Wasserstein distance. Advances in Neural Information Processing Systems, 36, 2024.

Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. arXiv preprint arXiv:2002.07367, 2020.

Nguyen, K., Bariletto, N., and Ho, N. Quasi-Monte Carlo for 3D sliced Wasserstein. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=Wd47f7HEXg.

Nguyen, K., Zhang, S., Le, T., and Ho, N. Sliced Wasserstein with random-path projecting directions. arXiv preprint arXiv:2401.15889, 2024b.

Nguyen, T., Pham, Q.-H., Le, T., Pham, T., Ho, N., and Hua, B.-S. Point-set distances for learning representations of 3D point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10478-10487, 2021a.

Nguyen, T. D., Trippe, B. L., and Broderick, T. Many processors, little time: MCMC for partitions via optimal transport couplings. In International Conference on Artificial Intelligence and Statistics, pp. 3483-3514. PMLR, 2022.

Nguyen, V., Le, T., Yamada, M., and Osborne, M. A. Optimal transport kernels for sequential and parallel neural architecture search. In International Conference on Machine Learning, pp. 8084-8095. PMLR, 2021b.

Nietert, S., Goldfeld, Z., and Cummings, R. Outlier-robust optimal transport: Duality, structure, and statistical analysis. In International Conference on Artificial Intelligence and Statistics, pp. 11691-11719. PMLR, 2022.

Niles-Weed, J. and Rigollet, P. Estimation of Wasserstein distances in the spiked transport model. Bernoulli, 28(4):2663-2688, 2022.

Park, J., Lee, J., and Sohn, K. Bridging vision and language spaces with assignment prediction. arXiv preprint arXiv:2404.09632, 2024.

Paty, F.-P. and Cuturi, M. Subspace robust Wasserstein distances. In International Conference on Machine Learning, pp. 5072-5081. PMLR, 2019.

Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Pham, T., Shimizu, S., Hino, H., and Le, T. Scalable counterfactual distribution estimation in multivariate causal models. In Conference on Causal Learning and Reasoning (CLeaR), 2024.

Quellmalz, M., Beinert, R., and Steidl, G. Sliced optimal transport on the sphere. Inverse Problems, 39(10):105005, 2023.

Rabin, J., Peyré, G., Delon, J., and Bernot, M.
Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435-446, 2011.

Rotman, J. J. An Introduction to Algebraic Topology, volume 119. Springer Science & Business Media, 2013.

Saleh, M., Wu, S.-C., Cosmo, L., Navab, N., Busam, B., and Tombari, F. Bending graphs: Hierarchical shape matching using gated optimal transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11757-11767, 2022.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NIPS'16, pp. 2234-2242, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.

Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

Tran, H., Vo, T., Huu, T., Nguyen The, A., and Nguyen, T. Monomial matrix group equivariant neural functional networks. In Advances in Neural Information Processing Systems, volume 37, pp. 48628-48665. Curran Associates, Inc., 2024a.

Tran, H. V., Chu, T., Nguyen-Nhat, M.-K., Pham, H. T., Le, T., and Nguyen, T. M. Spherical tree-sliced Wasserstein distance. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=FPQzXME9NK.

Tran, H. V., Nguyen-Nhat, M.-K., Pham, H. T., Chu, T., Le, T., and Nguyen, T. M. Distance-based tree-sliced Wasserstein distance. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=OiQttMHwce.

Tran, T., Tran, H. V., Chu, T., Pham, T., Ghaoui, L. E., Le, T., and Nguyen, T. M. Tree-sliced Wasserstein distance with nonlinear projection. In Forty-second International Conference on Machine Learning, 2025c. URL https://openreview.net/forum?id=SwXYV7CIw9.

Tran, V.-H., Pham, T., Tran, T., Le, T., and Nguyen, T. M. Tree-sliced Wasserstein distance on a system of lines. arXiv preprint arXiv:2406.13725, 2024b.

Tran, V.-H., Vo, T. N., Huu, T. T., and Nguyen, T. M. A Clifford algebraic approach to E(n)-equivariant high-order graph neural networks. arXiv preprint arXiv:2410.04692, 2024c.

Tran, V.-H., Vo, T. N., The, A. N., Huu, T. T., Nguyen-Nhat, M.-K., Tran, T., Pham, D.-T., and Nguyen, T. M. Equivariant neural functional networks for transformers. arXiv preprint arXiv:2410.04209, 2024d.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Vo, T. N., Tran, V.-H., Huu, T. T., The, A. N., Tran, T., Nguyen-Nhat, M.-K., Pham, D.-T., and Nguyen, T. M. Equivariant polynomial functional networks. arXiv preprint arXiv:2410.04213, 2024.

Vu, A.-K. N., Do, T.-T., Nguyen, V.-T., Le, T., Tran, M.-T., and Nguyen, T. V. Few-shot object detection via synthetic features with optimal transport. Computer Vision and Image Understanding (CVIU), pp. 104350, 2025.

Wang, D. and Tan, X. Unsupervised feature learning with C-SVDDNet.
Pattern Recognition, 60:473-485, 2016.

Wang, J., Gao, R., and Xie, Y. Two-sample test with kernel projected Wasserstein distance. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pp. 8022-8055. PMLR, 2022.

Weed, J. and Berthet, Q. Estimation of smooth densities in Wasserstein distance. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pp. 3118-3119, 2019.

Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel, D. P., and Gool, L. V. Sliced Wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3713-3722, 2019.

Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804, 2021.

Table of notations used in this paper:

| Symbol | Meaning |
|---|---|
| $\mathbb{R}^d$ | $d$-dimensional Euclidean space |
| $\|\cdot\|_2$ | Euclidean norm |
| $\langle \cdot, \cdot \rangle$ | standard dot product |
| $\mathbb{S}^{d-1}$ | $(d-1)$-dimensional hypersphere |
| $\theta$ | unit vector |
| $\sqcup$ | disjoint union |
| $L^1(X)$ | space of Lebesgue integrable functions on $X$ |
| $\mathcal{P}(X)$ | space of probability distributions on $X$ |
| $\mu, \nu$ | measures |
| $\delta(\cdot)$ | 1-dimensional Dirac delta function |
| $\mathcal{U}(\mathbb{S}^{d-1})$ | uniform distribution on $\mathbb{S}^{d-1}$ |
| $\#$ | pushforward (measure) |
| $C(X, Y)$ | space of continuous maps from $X$ to $Y$ |
| $d(\cdot, \cdot)$ | metric in a metric space |
| $W_p$ | $p$-Wasserstein distance |
| $\mathrm{SW}_p$ | Sliced $p$-Wasserstein distance |
| $\Gamma$ | (rooted) subtree |
| $e$ | edge in a graph |
| $w_e$ | weight of an edge in a graph |
| $l$ | line, index of a line |
| $L$ | system of lines, tree system |
| $\overline{L}$ | ground set of a system of lines / tree system |
| $\Omega_L$ | topological space of a system of lines |
| $\mathbb{L}^d_k$ | space of systems of $k$ lines in $\mathbb{R}^d$ |
| $\mathcal{T}$ | tree structure on a system of lines |
| $L$ | number of tree systems |
| $k$ | number of lines in a system of lines or a tree system |
| $\mathcal{R}$ | original Radon Transform |
| $\mathcal{R}^\alpha$ | Radon Transform on Systems of Lines |
| $\Delta_{k-1}$ | $(k-1)$-dimensional standard simplex |
| $\alpha$ | splitting map |
| $\mathbb{T}$ | space of tree systems |
| $\sigma$ | distribution on the space of tree systems |

## Supplement for "Tree-Sliced Wasserstein Distance: A Geometric Perspective"

## A. Tree System

In this section, we introduce the notion of a tree system, beginning with a collection of unstructured lines and progressively adding a tree structure to form a well-defined metric space with a tree metric. It is important to note that while some statements here differ slightly from those in the main paper, the underlying ideas remain the same.

### A.1. System of Lines

We define lines by parameterization. Observe that a line in $\mathbb{R}^d$ is completely determined by a pair $(x, \theta) \in \mathbb{R}^d \times \mathbb{S}^{d-1}$ via $x + t\theta$, $t \in \mathbb{R}$.

**Definition A.1** (Line and System of Lines in $\mathbb{R}^d$). A line in $\mathbb{R}^d$ is an element $(x, \theta)$ of $\mathbb{R}^d \times \mathbb{S}^{d-1}$, and the image of a line $(x, \theta)$ is defined by:

$$\mathrm{Im}(x, \theta) := \{x + t\theta \,:\, t \in \mathbb{R}\} \subset \mathbb{R}^d. \tag{14}$$

For $k \ge 1$, a system of $k$ lines in $\mathbb{R}^d$ is a sequence of $k$ lines.

**Remark.** A line in $\mathbb{R}^d$ is usually denoted, or indexed, by $l = (x_l, \theta_l) \in \mathbb{R}^d \times \mathbb{S}^{d-1}$. Here, $x_l$ and $\theta_l$ are called the source and the direction of $l$, respectively. Denote $(\mathbb{R}^d \times \mathbb{S}^{d-1})^k$ by $\mathbb{L}^d_k$, which is the collection of systems of $k$ lines in $\mathbb{R}^d$; an element of $\mathbb{L}^d_k$ is usually denoted by $L$.

**Definition A.2** (Ground Set). The ground set of a system of lines $L$ is defined by:

$$\overline{L} := \left\{ (x, l) \in \mathbb{R}^d \times L \,:\, x = x_l + t_x \theta_l \text{ for some } t_x \in \mathbb{R} \right\}.$$

For each element $(x, l) \in \overline{L}$, we sometimes write $(x, l)$ as $(t_x, l)$, where $t_x \in \mathbb{R}$ presents the parameterization of $x$ on $l$ by source $x_l$ and direction $\theta_l$, as $x = x_l + t_x \theta_l$.

**Remark.** In other words, the ground set $\overline{L}$ is the disjoint union of the images of the lines in $L$:

$$\overline{L} = \bigsqcup_{l \in L} \mathrm{Im}(l).$$

This notation may seem redundant, but it will be helpful when we define functions on $\overline{L}$.
### A.2. System of Lines with Tree Structures (Tree System)

Consider a finite system of lines $L$ in $\mathbb{R}^d$. Assume that these lines are geometrically distinct, i.e., their images are distinct. Define the graph $G_L$ associated with $L$, where $L$ is the set of nodes of $G_L$, and two nodes are adjacent if the corresponding lines intersect each other. Here, saying that two lines in $\mathbb{R}^d$ intersect means their images have exactly one point in common.

**Definition A.3** (Connected system of lines). $L$ is called connected if its associated graph $G_L$ is connected.

**Remark.** Intuitively, each edge of $G_L$ represents the intersection of its endpoints. If $L$ is connected, then for any two points, each lying on some line of $L$, one can travel from one to the other along lines in $L$. From now on, we only consider the case where $L$ is connected.

Recall the notion of a spanning tree of a graph $G$: a subgraph of $G$ that contains all nodes of $G$ and is itself a tree.

**Definition A.4** (Tree system of lines). Let $L$ be a connected system of lines. A spanning tree $\mathcal{T}$ of $G_L$ is called a tree structure of $L$. A pair $(L, \mathcal{T})$ consisting of a connected system of lines $L$ and a tree structure $\mathcal{T}$ of $L$ is called a tree system of lines.

**Remark.** For short, we usually refer to a tree system of lines as a tree system. In a tree system $(L, \mathcal{T})$, the images of two lines of $L$ can intersect each other even when the lines are not adjacent in $\mathcal{T}$. Let $r$ be an arbitrary line of $L$. Denote by $\mathcal{T}_r$ the tree $\mathcal{T}$ rooted at $r$, and denote the (rooted) tree system by $(L, \mathcal{T}_r)$ when we want to specify the root.

**Definition A.5** (Depth of lines in a tree system). Let $(L, \mathcal{T}_r)$ be a tree system. For each $m \ge 0$, a line $l \in L$ is called a line of depth $m$ if the (unique) path from $r$ to $l$ in $\mathcal{T}$ has length $m$. Denote by $L_m$ the set of lines of depth $m$.

**Remark.** Note that $L_0 = \{r\}$. Let $T$ be the maximum length of paths in $\mathcal{T}$ starting from $r$, which is called the depth of the line system. $L$ has a partition $L = L_0 \sqcup L_1 \sqcup \ldots \sqcup L_T$. For $l \in L$ that is not the root, denote by $\mathrm{pr}(l) \in L$ the parent of $l$, i.e., the (unique) node on the unique path from $l$ to $r$ that is adjacent to $l$. Note that, by definition, $l$ and $\mathrm{pr}(l)$ intersect each other. We sometimes omit the root when the context is clear.

**Definition A.6** (Canonical tree system). A tree system $(L, \mathcal{T})$ is called a canonical tree system if for every $l \in L$ that is not the root, the intersection of $l$ and $\mathrm{pr}(l)$ is the source $x_l$ of $l$.

**Remark.** In other words, in a canonical tree system, every line other than the root has its source lying on its parent. For the rest of the paper, a tree system $(L, \mathcal{T})$ is assumed to be a canonical tree system.
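As an illustration of Definitions A.3 and A.4, the sketch below (our own; it assumes the lines are constructed to intersect, as in Algorithm 1, since in $d > 2$ generic lines do not meet) builds the intersection graph $G_L$ and extracts one spanning tree by breadth-first search:

```python
import numpy as np

def intersects(x1, th1, x2, th2, tol=1e-9):
    """True if the images of lines (x1, th1) and (x2, th2) share exactly one point."""
    # Solve x1 + t*th1 = x2 + s*th2 in the least-squares sense and check that
    # the residual vanishes; parallel (or identical) lines are rejected first.
    A = np.stack([th1, -th2], axis=1)                  # shape (d, 2)
    if np.linalg.matrix_rank(A, tol=tol) < 2:
        return False
    ts, *_ = np.linalg.lstsq(A, x2 - x1, rcond=None)
    return np.linalg.norm(x1 + ts[0] * th1 - (x2 + ts[1] * th2)) < tol

def spanning_tree(sources, directions):
    """One tree structure of a connected system of lines (Definition A.4):
    a BFS spanning tree of the intersection graph G_L, rooted at line 0.
    Returns parent[i], the parent line index (-1 for the root)."""
    k = len(sources)
    adj = [[j for j in range(k) if j != i and
            intersects(sources[i], directions[i], sources[j], directions[j])]
           for i in range(k)]
    parent = [-2] * k                                   # -2 marks "not visited"
    parent[0], queue = -1, [0]
    while queue:
        i = queue.pop(0)
        for j in adj[i]:
            if parent[j] == -2:
                parent[j] = i
                queue.append(j)
    assert all(p != -2 for p in parent), "system of lines is not connected"
    return parent
```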
### A.3. Topological Properties of Tree Systems

We now introduce the notion of the topological space of a tree system. Let $(L, \mathcal{T})$ be a (canonical) tree system. Consider the graph whose nodes are the elements of $\overline{L}$, where $(x, l)$ and $(x', l')$ are adjacent if and only if one of the following conditions holds:

1. $l = \mathrm{pr}(l')$, $x = x'$, and $x'$ is the source of $l'$;
2. $l' = \mathrm{pr}(l)$, $x = x'$, and $x$ is the source of $l$.

Let $\sim$ be the relation on $\overline{L}$ such that $(x, l) \sim (x', l')$ if and only if $(x, l)$ and $(x', l')$ are connected in the above graph. By design, $\sim$ is an equivalence relation on $\overline{L}$. Denote the set of all equivalence classes in $\overline{L}$ with respect to the equivalence relation $\sim$ by $\Omega_L = \overline{L} / \sim$.

**Remark.** In other words, we identify the source of each line with the corresponding point on its parent.

We recall the notions of the disjoint union topology and the quotient topology from (Hatcher, 2005). For a line $l$ in $\mathbb{R}^d$, the image $\mathrm{Im}(l)$ is a topological space, moreover a metric space, that is homeomorphic and isometric to $\mathbb{R}$ via the map $t \mapsto x_l + t\theta_l$. The metric on $\mathrm{Im}(l)$ is $d_l(x, x') = |t_x - t_{x'}|$ for all $x, x' \in \mathrm{Im}(l)$. For each $l \in L$, consider the injection map:

$$f_l \colon \mathrm{Im}(l) \to \bigsqcup_{l \in L} \mathrm{Im}(l) = \overline{L}, \qquad x \mapsto (x, l).$$

$\overline{L} = \bigsqcup_{l \in L} \mathrm{Im}(l)$ becomes a topological space with the disjoint union topology, i.e., the finest topology on $\overline{L}$ such that the map $f_l$ is continuous for all $l \in L$. Also, consider the quotient map:

$$\pi \colon \overline{L} \to \Omega_L, \qquad (x, l) \mapsto [(x, l)].$$

$\Omega_L$ becomes a topological space with the quotient topology, i.e., the finest topology on $\Omega_L$ such that the map $\pi$ is continuous.

**Definition A.7** (Topological space of a tree system). The topological space $\Omega_L$ is called the topological space of the tree system $(L, \mathcal{T})$.

**Remark.** In other words, $\Omega_L$ is formed by gluing the images $\mathrm{Im}(l)$ along the relation $\sim$. We show that the topological space $\Omega_L$ is metrizable.

**Definition A.8** (Paths in $\Omega_L$). For $a$ and $b$ in $\Omega_L$ with $a \neq b$, a path from $a$ to $b$ in $\Omega_L$ is a continuous injective map $\gamma \colon [0, 1] \to \Omega_L$ with $\gamma(0) = a$ and $\gamma(1) = b$. By convention, for $a$ in $\Omega_L$, the path from $a$ to $a$ in $\Omega_L$ is the constant map $\gamma \colon [0, 1] \to \Omega_L$ with $\gamma(t) = a$ for all $t \in [0, 1]$. For a path $\gamma$ from $a$ to $b$, the image of $\gamma$ is defined by:

$$\mathrm{Im}(\gamma) := \gamma([0, 1]) \subset \Omega_L. \tag{15}$$

**Theorem A.9** (Existence and uniqueness of paths in $\Omega_L$). For all $a$ and $b$ in $\Omega_L$, there exists a path $\gamma$ from $a$ to $b$ in $\Omega_L$. Moreover, $\gamma$ is unique up to a re-parameterization, i.e., if $\gamma$ and $\gamma'$ are two paths from $a$ to $b$ in $\Omega_L$, there exists a homeomorphism $\varphi \colon [0, 1] \to [0, 1]$ such that $\gamma' = \gamma \circ \varphi$.

*Proof.* All previously known results used in this proof can be found in (Munkres, 2018; Rotman, 2013; Hatcher, 2005). For two points $a, b$ on the real line $\mathbb{R}$, all paths from $a$ to $b$ are homotopic to each other. In other words, all paths from $a$ to $b$ are homotopic to the canonical path:

$$\gamma_{a,b} \colon [0, 1] \to \mathbb{R}, \qquad t \mapsto (1 - t)a + tb.$$

Now consider two points $a, b$ in the space $\Omega_L$. Observe that $\Omega_L$ is path-connected, by construction and by the fact that $\mathbb{R}$ is path-connected. Consider a curve from $a$ to $b$ on $\Omega_L$, i.e., a continuous map $f \colon [0, 1] \to \Omega_L$, and consider the set consisting of the sources of lines in $L$ that lie on the curve $f$, i.e., all the sources that belong to $f([0, 1])$. We choose a curve $f$ with the smallest such set of sources. By the tree structure added to $L$, every curve from $a$ to $b$ has a set of sources that contains the set of sources of $f$. We denote the sources belonging to this set of $f$ by $s_1, \ldots, s_{k-1}$, and define:

$$x_i = \inf f^{-1}(s_i) \quad \text{for all } 1 \le i \le k - 1.$$

We reindex the $s_i$ such that:

$$x_1 \le \ldots \le x_{k-1}.$$

By convention, we define $s_0 = a$ and $s_k = b$. By design, for $i = 0, \ldots, k - 1$, the points $s_i$ and $s_{i+1}$ lie on the same line of $L$. So, by the result on paths in $\mathbb{R}$, there exists a path $\gamma_i$ from $s_i$ to $s_{i+1}$ on $\Omega_L$. Gluing $\gamma_0, \gamma_1, \ldots, \gamma_{k-1}$ yields a path $\gamma$ from $s_0 = a$ to $s_k = b$ on $\Omega_L$:

$$\gamma \colon [0, 1] \to \Omega_L, \qquad t \mapsto \gamma_i(kt - i) \quad \text{if } t \in \left[\tfrac{i}{k}, \tfrac{i+1}{k}\right], \; i = 0, \ldots, k - 1.$$

It is straightforward to check that $\gamma$ is a path from $a$ to $b$ on $\Omega_L$, and the uniqueness (up to re-parameterization) of $\gamma$ follows from the homotopy of paths in $\mathbb{R}$.

**Remark.** The image of a path from $a$ to $b$ does not depend on the chosen path $\gamma$, by the uniqueness property. Indeed, for a homeomorphism $\varphi \colon [0, 1] \to [0, 1]$, we have $\gamma([0, 1]) = \gamma \circ \varphi([0, 1])$. We denote the image of any path from $a$ to $b$ by $P_{a,b}$.

Let $\mu$ be the standard Borel measure on $\mathbb{R}$, i.e., $\mu((a, b]) = b - a$ for every half-open interval $(a, b]$ in $\mathbb{R}$. For $l \in L$, denote by $\mu_l$ the pushforward of $\mu$ by the map $t \mapsto x_l + t\theta_l$, which is a Borel measure on $\mathrm{Im}(l)$.
A.4. Construction of Tree Systems

We present a way to construct a tree system in $\mathbb{R}^d$. First, we describe the structure of a rooted tree by a sequence of vectors.

Definition A.13 (Tree representation). Let $T$ be a nonnegative integer, and let $n_1, \ldots, n_T$ be $T$ positive integers. A sequence $s = \{x_i\}_{i=0}^{T}$ with $x_0 = [1]$, where $x_i$ is a vector of $n_i$ nonnegative integers for $1 \le i \le T$, is called a tree representation if, for all $1 \le i \le T$, $n_i$ is equal to the sum of all entries of the vector $x_{i-1}$.

Example A.14. For $T = 5$ and $\{n_i\}_{i=1}^{5} = \{1, 3, 4, 2, 3\}$, the sequence:

$$s \colon x_0 = [1] \to x_1 = [3] \to x_2 = [2, 1, 1] \to x_3 = [1, 0, 1, 0] \to x_4 = [1, 2] \to x_5 = [0, 0, 1]$$

is a tree representation.

For a tree representation $s = \{x_i\}_{i=0}^{T}$, a tree system of type $s$ is a tree system that is inductively constructed step-by-step as follows:

Step 0. Sample a point $x_r \in \mathbb{R}^d$ and a direction $\theta_r \in S^{d-1}$. Define $r$ as the line passing through $x_r$ with direction $\theta_r$. We call $r$ the line of depth 0.

Step i. On the $j$-th line of depth $(i-1)$, sample $(x_i)_j$ points, where $(x_i)_j$ is the $j$-th entry of the vector $x_i$. For each of these points, denoted $x_l$, sample a direction $\theta_l \in S^{d-1}$, and define $l$ as the line passing through $x_l$ with direction $\theta_l$. We call the set of all lines sampled at this step the set of lines of depth $i$, ordered by the order in which they were sampled.

By this construction, we obtain a system of lines $L$ in $\mathbb{R}^d$, together with a tree structure $\mathcal{T}_r$.
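The construction above translates directly into a sampling routine. The sketch below is ours; the uniform and Gaussian sampling choices are illustrative placeholders, made precise by the examples that follow. It returns the per-line data used throughout this appendix:

```python
import numpy as np

def sample_direction(d, rng):
    # Uniform direction on S^{d-1}: normalize a standard Gaussian vector.
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def sample_tree_system(s, d, rng=None):
    """Sample a tree system of type s = [x_0, ..., x_T] (a tree representation).

    Returns per-line data: sources (points in R^d), directions (in S^{d-1}),
    parent indices (-1 for the root), and t_src (coordinate of each source on
    its parent line). The distributional choices here are illustrative only.
    """
    if rng is None:
        rng = np.random.default_rng()
    sources = [rng.uniform(0.0, 1.0, size=d)]  # x_r, e.g. from U([0,1]^d)
    directions = [sample_direction(d, rng)]
    parent, t_src = [-1], [0.0]
    prev_depth = [0]                           # indices of the lines of depth i-1
    for x_i in s[1:]:
        cur_depth = []
        for j, count in enumerate(x_i):
            p = prev_depth[j]
            for _ in range(int(count)):
                t = rng.uniform(0.0, 1.0)      # source coordinate on the parent line
                cur_depth.append(len(sources))
                sources.append(sources[p] + t * directions[p])
                directions.append(sample_direction(d, rng))
                parent.append(p)
                t_src.append(t)
        prev_depth = cur_depth
    return np.stack(sources), np.stack(directions), parent, t_src
```

For instance, `sample_tree_system([[1], [3], [2, 1, 1], [1, 0, 1, 0], [1, 2], [0, 0, 1]], d=2)` draws a tree system with the shape of Example A.14.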
The pair $(L, \mathcal{T}_r)$ forms a tree system, which is canonical by design, and is said to be of type $s$. Denote by $\mathbb{T}_s$ the set of all tree systems of type $s$.

Remark. A tree system of type $s$ has $k = \sum_{i=0}^{T} \sum_{j=1}^{n_i} (x_i)_j$ lines (with the convention $n_0 = 1$). Observe that constructing a tree system of type $s$ depends only on sampling $k$ points and $k$ directions, so, under suitable assumptions on the probability distributions of these points and directions, we obtain a probability distribution on $\mathbb{T}_s$. Note that:

1. $x_r$ is sampled from a distribution on $\mathbb{R}^d$;
2. For all $l \neq r$, $x_l$ is sampled from a distribution on $\mathbb{R}$ (as a coordinate on the parent line);
3. For all $l$, $\theta_l$ is sampled from a distribution on $S^{d-1}$.

We give some examples of tree representations $s$ and distributions on $\mathbb{T}_s$.

Example A.15 (Lines passing through the origin). Consider the tree representation:

$$s \colon [1], \quad (17)$$

and the distribution on $\mathbb{T}_s$ determined by:

1. $x_r = 0 \in \mathbb{R}^d$;
2. $\theta_r \sim \mathcal{U}(S^{d-1})$.

In this case, $\mathbb{T}_s$ is identical to the set of lines passing through the origin $0$.

Example A.16 (Concurrent lines). Consider the tree representation:

$$s \colon [1] \to [k-1], \quad (18)$$

and the distribution on $\mathbb{T}_s$ determined by:

1. $x_r \sim \mu_r$ for a $\mu_r \in \mathcal{P}(\mathbb{R}^d)$;
2. For all $l \neq r$, $x_l = x_r$;
3. For all $l$, $\theta_l \sim \mathcal{U}(S^{d-1})$;
4. $x_r$ and all the $\theta_l$'s are pairwise independent.

In this case, $\mathbb{T}_s$ is identical to the set of all tuples of $k$ concurrent lines.

Example A.17 (Series of lines). Consider the tree representation:

$$s \colon [1] \to [1] \to \ldots \to [1], \quad (19)$$

and the distribution on $\mathbb{T}_s$ determined by:

1. $x_r \sim \mu_r$ for a $\mu_r \in \mathcal{P}(\mathbb{R}^d)$;
2. For all $l \neq r$, $x_l \sim \mu_l$ for a $\mu_l \in \mathcal{P}(\mathbb{R})$;
3. For all $l$, $\theta_l \sim \mathcal{U}(S^{d-1})$;
4. All the $x_l$'s and all the $\theta_l$'s are pairwise independent.

In this case, we call $\mathbb{T}_s$ the set of all series of $k$ lines. This is the same as the sampling process in Subsection 3.3 and Algorithm 1.

Example A.18. For an arbitrary tree representation $s$, a distribution on $\mathbb{T}_s$ is determined by:

1. $x_r$ sampled from the uniform distribution on a bounded subset of $\mathbb{R}^d$, for instance $\mu_r = \mathcal{U}([0,1]^d)$;
2. For $l \neq r$, $x_l$ sampled from the uniform distribution on a bounded subset of $\mathbb{R}$, for instance $\mu_l = \mathcal{U}([0,1])$;
3. For all $l$, $\theta_l$ sampled from the uniform distribution on $S^{d-1}$, i.e. $\theta_l \sim \mathcal{U}(S^{d-1})$;
4. Together with independence assumptions between all the $x_l$'s and all the $\theta_l$'s.

B. Radon Transform on Systems of Lines

We introduce the notions of the space of Lebesgue integrable functions and the space of probability distributions on a system of lines. Let $L$ be a system of $k$ lines.

B.1. Space of Lebesgue integrable functions on a system of lines

Denote by $L^1(\mathbb{R}^d)$ the space of Lebesgue integrable functions on $\mathbb{R}^d$ with the norm $\|\cdot\|_1$:

$$L^1(\mathbb{R}^d) = \left\{ f \colon \mathbb{R}^d \to \mathbb{R} \,:\, \|f\|_1 = \int_{\mathbb{R}^d} |f(x)| \, dx < \infty \right\}. \quad (20)$$

As usual, two functions $f_1, f_2 \in L^1(\mathbb{R}^d)$ are considered identical if $f_1(x) = f_2(x)$ almost everywhere on $\mathbb{R}^d$.

Definition B.1 (Lebesgue integrable function on a system of lines). A Lebesgue integrable function on $L$ is a function $f \colon \overline{L} \to \mathbb{R}$ such that:

$$\sum_{l \in L} \int_{\mathbb{R}} |f(t_x, l)| \, dt_x < \infty. \quad (21)$$

The space of Lebesgue integrable functions on $L$ is denoted by:

$$L^1(L) = \left\{ f \colon \overline{L} \to \mathbb{R} \,:\, \|f\|_L = \sum_{l \in L} \int_{\mathbb{R}} |f(t_x, l)| \, dt_x < \infty \right\}.$$

Remark. As an analogue of integrable functions on $\mathbb{R}^d$, two functions $f_1, f_2 \in L^1(L)$ are considered identical if $f_1(x) = f_2(x)$ almost everywhere on $\overline{L}$. The space $L^1(L)$ with the norm $\|\cdot\|_L$ is a Banach space.

Recall that $L$ has $k$ lines. Denote the $(k-1)$-dimensional standard simplex by:

$$\Delta_{k-1} = \left\{ (a_l)_{l \in L} \,:\, a_l \ge 0 \text{ and } \sum_{l \in L} a_l = 1 \right\} \subset \mathbb{R}^k.$$

Let $C(\mathbb{R}^d, \Delta_{k-1})$ be the space of continuous functions from $\mathbb{R}^d$ to $\Delta_{k-1}$. A map in $C(\mathbb{R}^d, \Delta_{k-1})$ is called a splitting map.
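As a concrete (and entirely illustrative) instance of a splitting map, one can weight each line by the softmax of its negative distance to the input point, so that nearby lines receive more mass; continuity of the distance function and of the softmax makes this a valid element of $C(\mathbb{R}^d, \Delta_{k-1})$. This particular choice is ours, not one prescribed by the paper:

```python
import numpy as np

def softmax_splitting_map(y, sources, directions):
    """An example splitting map alpha: R^d -> Delta_{k-1}.

    y: point in R^d; sources: (k, d) array of x_l; directions: (k, d) unit theta_l.
    Returns a vector of k nonnegative weights summing to one.
    """
    diff = y[None, :] - sources                   # (k, d)
    t = np.einsum('kd,kd->k', diff, directions)   # coordinates <y - x_l, theta_l>
    resid = diff - t[:, None] * directions        # component orthogonal to each line
    dist = np.linalg.norm(resid, axis=1)          # Euclidean distance from y to each line
    w = np.exp(-dist + dist.min())                # softmax of -dist, numerically stabilized
    return w / w.sum()
```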
Let $L$ be a system of lines in $\mathbb{L}^d_k$ (the set of systems of $k$ lines in $\mathbb{R}^d$), and let $\alpha$ be a map in $C(\mathbb{R}^d, \Delta_{k-1})$. We define an operator associated with $\alpha$ that transforms a Lebesgue integrable function on $\mathbb{R}^d$ into a Lebesgue integrable function on $L$. For $f \in L^1(\mathbb{R}^d)$, define:

$$\mathcal{R}^{\alpha}_L f \colon \overline{L} \to \mathbb{R}, \qquad (x, l) \mapsto \int_{\mathbb{R}^d} f(y)\, \alpha(y)_l \, \delta\left(t_x - \langle y - x_l, \theta_l \rangle\right) dy, \quad (22)$$

where $\delta$ is the 1-dimensional Dirac delta function.

Theorem B.2. For $f \in L^1(\mathbb{R}^d)$, we have $\mathcal{R}^{\alpha}_L f \in L^1(L)$. Moreover, $\|\mathcal{R}^{\alpha}_L f\|_L \le \|f\|_1$. In other words, the operator:

$$\mathcal{R}^{\alpha}_L \colon L^1(\mathbb{R}^d) \to L^1(L), \quad (23)$$

is well-defined, and is a linear operator.

Proof. Let $f \in L^1(\mathbb{R}^d)$. We show that $\|\mathcal{R}^{\alpha}_L f\|_L \le \|f\|_1$. Indeed,

$$\begin{aligned}
\|\mathcal{R}^{\alpha}_L f\|_L &= \sum_{l \in L} \int_{\mathbb{R}} |\mathcal{R}^{\alpha}_L f(t_x, l)| \, dt_x \quad (24) \\
&= \sum_{l \in L} \int_{\mathbb{R}} \left| \int_{\mathbb{R}^d} f(y)\, \alpha(y)_l \, \delta(t_x - \langle y - x_l, \theta_l \rangle) \, dy \right| dt_x \quad (25) \\
&\le \sum_{l \in L} \int_{\mathbb{R}} \int_{\mathbb{R}^d} |f(y)|\, \alpha(y)_l \, \delta(t_x - \langle y - x_l, \theta_l \rangle) \, dy \, dt_x \quad (26) \\
&= \sum_{l \in L} \int_{\mathbb{R}^d} |f(y)|\, \alpha(y)_l \int_{\mathbb{R}} \delta(t_x - \langle y - x_l, \theta_l \rangle) \, dt_x \, dy \quad (27) \\
&= \sum_{l \in L} \int_{\mathbb{R}^d} |f(y)|\, \alpha(y)_l \, dy \quad (28) \\
&= \int_{\mathbb{R}^d} |f(y)| \sum_{l \in L} \alpha(y)_l \, dy \quad (29) \\
&= \int_{\mathbb{R}^d} |f(y)| \, dy \quad (30) \\
&= \|f\|_1 < \infty. \quad (31)
\end{aligned}$$

So the operator $\mathcal{R}^{\alpha}_L$ is well-defined, and it is clearly a linear operator.

Definition B.3 (Radon transform on systems of lines). For $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$, the operator:

$$\mathcal{R}^{\alpha} \colon L^1(\mathbb{R}^d) \to \prod_{L \in \mathbb{L}^d_k} L^1(L), \qquad f \mapsto \left(\mathcal{R}^{\alpha}_L f\right)_{L \in \mathbb{L}^d_k},$$

is called the Radon transform on systems of lines.

Many variants of the Radon transform require the transform to be injective. We show that injectivity holds for the Radon transform on systems of lines.

Theorem B.4. $\mathcal{R}^{\alpha}$ is injective.

Proof. Since $\mathcal{R}^{\alpha}$ is linear, it suffices to show that $\mathcal{R}^{\alpha} f = 0$ implies $f = 0$. Let $f \in L^1(\mathbb{R}^d)$ with $\mathcal{R}^{\alpha} f = 0$, which means $\mathcal{R}^{\alpha}_L f = 0$ for all $L \in \mathbb{L}^d_k$. Fix a line index $l$ and consider:

$$\int_{\mathbb{R}^d} f(y)\, \alpha(y)_l \, \delta(t_x - \langle y - x_l, \theta_l \rangle) \, dy = 0, \qquad \forall\, t_x \in \mathbb{R},\; (x_l, \theta_l) \in \mathbb{R}^d \times S^{d-1}. \quad (32)$$

Note that, for the fixed index $l$, $f(y)\,\alpha(y)_l$ is a function of $y$. Fixing $x_l$ and letting $\theta_l$ vary over $S^{d-1}$, the injectivity of the usual Radon transform (Helgason, 2011) gives $f(x)\,\alpha(x)_l = 0$ for all $x \in \mathbb{R}^d$. This holds for every line index $l$, so $f(x) = \sum_l f(x)\,\alpha(x)_l = 0$. Hence $\mathcal{R}^{\alpha}$ is injective.

Remark. The proof yields a stronger result: let $A$ be a subset of $\mathbb{L}^d_k$ such that, for every index $l$ and every $\theta \in S^{d-1}$, there exists $L \in A$ with $\theta_l = \theta$, where $\theta_l$ is the direction of the line with index $l$ in $L$; roughly speaking, the set of directions realized by $A$ is all of $(S^{d-1})^k$. Then the restriction $f \mapsto (\mathcal{R}^{\alpha}_L f)_{L \in A}$ remains injective.

Remark. Observations in (Tran et al., 2024b; 2025b;c) suggest that imposing equivariance conditions on the splitting maps may enhance the performance of the distance. Nonetheless, since the primary focus of these works is the geometric formulation of the tree-sliced distance, this aspect is not explored in detail. More broadly, equivariance and invariance are prevalent themes in machine learning, with applications across various domains, including equivariant models (Tran et al., 2024c) and equivariant metanetworks (Vo et al., 2024; Tran et al., 2024a;d).

B.2. Probability distributions on a system of lines

Denote by $\mathcal{P}(\mathbb{R}^d)$ the space of all probability distributions on $\mathbb{R}^d$:

$$\mathcal{P}(\mathbb{R}^d) = \left\{ f \colon \mathbb{R}^d \to [0, \infty) \,:\, \|f\|_1 = 1 \right\} \subset L^1(\mathbb{R}^d).$$

Definition B.5 (Probability distribution on a system of lines). Let $L$ be a system of lines. A probability distribution on $L$ is a function $f \in L^1(L)$ such that $f \colon \overline{L} \to [0, \infty)$ and $\|f\|_L = 1$. The space of probability distributions on $L$ is defined by:

$$\mathcal{P}(L) := \left\{ f \colon \overline{L} \to [0, \infty) \,:\, \|f\|_L = 1 \right\} \subset L^1(L). \quad (33)$$

Corollary B.6. For $f \in \mathcal{P}(\mathbb{R}^d)$, we have $\mathcal{R}^{\alpha}_L f \in \mathcal{P}(L)$. In other words, the restriction of the Radon transform:

$$\mathcal{R}^{\alpha}_L \colon \mathcal{P}(\mathbb{R}^d) \to \mathcal{P}(L), \quad (34)$$

is well-defined.

Proof. Let $f \in \mathcal{P}(\mathbb{R}^d)$. It is clear that $\mathcal{R}^{\alpha}_L f \colon \overline{L} \to [0, \infty)$. We show that $\|\mathcal{R}^{\alpha}_L f\|_L = 1$. Indeed,

$$\begin{aligned}
\|\mathcal{R}^{\alpha}_L f\|_L &= \sum_{l \in L} \int_{\mathbb{R}} \mathcal{R}^{\alpha}_L f(t_x, l) \, dt_x \quad (35) \\
&= \sum_{l \in L} \int_{\mathbb{R}} \int_{\mathbb{R}^d} f(y)\, \alpha(y)_l \, \delta(t_x - \langle y - x_l, \theta_l \rangle) \, dy \, dt_x \quad (36) \\
&= \int_{\mathbb{R}^d} f(y) \, dy = 1. \quad (37)
\end{aligned}$$

So $\mathcal{R}^{\alpha}_L f \in \mathcal{P}(L)$, and $\mathcal{R}^{\alpha}_L$ is well-defined.
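For discrete measures, which is the case in all experiments, $\mathcal{R}^{\alpha}_L$ takes a simple explicit form: an input mass $m_i$ at $y_i$ contributes mass $m_i\,\alpha(y_i)_l$ at coordinate $t = \langle y_i - x_l, \theta_l \rangle$ on line $l$. The sketch below is ours; `alpha` stands for any callable splitting map, e.g. `lambda y: softmax_splitting_map(y, sources, directions)` with the example above:

```python
import numpy as np

def radon_project_empirical(points, masses, sources, directions, alpha):
    """Push an empirical measure sum_i m_i * delta_{y_i} onto a system of k lines.

    points: (n, d); masses: (n,); sources, directions: (k, d); alpha: y -> simplex.
    Returns coords (n, k) and weights (n, k), where weights[i, l] is the mass
    placed at coordinate coords[i, l] on line l.
    """
    diff = points[:, None, :] - sources[None, :, :]     # (n, k, d)
    coords = np.einsum('nkd,kd->nk', diff, directions)  # t = <y_i - x_l, theta_l>
    A = np.stack([alpha(y) for y in points])            # (n, k) splitting weights
    weights = masses[:, None] * A                       # m_i * alpha(y_i)_l
    return coords, weights
```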
B.3. An Illustration of TSW-SL

In this section, we provide a visual illustration of TSW-SL through four figures: Figures 5, 6, 7, and 8. Each figure includes an explanatory caption; Figure 7 is additionally accompanied by the detailed explanation below.

The tree in Figure 7 consists of 9 nodes. Of these, 6 correspond to projections of the original points, while the remaining 3 nodes, $x_1$, $x_2$, and $x_3$, are structural nodes introduced during the construction of the sampled tree. Their roles are as follows: $x_1$ is the root of the tree; $x_2$ is the source of line 2 and lies on line 1, representing the intersection of lines 1 and 2; $x_3$ is the source of line 3 and lies on line 2, representing the intersection of lines 2 and 3.

We compute the aggregated mass $\mathcal{R}^{\alpha}_L f_\mu(\Gamma(e_i))$ for selected edges $e_i$, where $\Gamma(e_i)$ denotes the subtree rooted at the endpoint of $e_i$ that is farther from the root $x_1$. Two examples are provided below.

For edge $e_3$: $\Gamma(e_3) = \{x_2, a_2, b_2, x_3, b_3, a_3\}$, so

$$\mathcal{R}^{\alpha}_L f_\mu(\Gamma(e_3)) = f_\mu(x_2) + f_\mu(a_2) + f_\mu(b_2) + f_\mu(x_3) + f_\mu(b_3) + f_\mu(a_3).$$

For edge $e_5$: $\Gamma(e_5) = \{b_2, x_3, b_3, a_3\}$, so

$$\mathcal{R}^{\alpha}_L f_\mu(\Gamma(e_5)) = f_\mu(b_2) + f_\mu(x_3) + f_\mu(b_3) + f_\mu(a_3).$$

Similar computations apply to the remaining edges of the tree.

Figure 5: This figure presents the projections of two points, $x$ and $y$, onto the three lines labeled 1, 2, and 3. The projections of $x$ are denoted by $a_i$, and those of $y$ by $b_i$, for $i = 1, 2, 3$.

Figure 6: This figure presents the mass at each projection of $x$ and $y$. For example, the mass at $a_1$ is given by $\mathcal{R}^{\alpha}_L f_\mu(a_1) = f_\mu(x)\,\alpha(x)_1 = 0.6 \cdot 1$.

Figure 7: Resulting tree after projection, with 9 nodes, including projection nodes and sampled tree nodes. The detailed explanation for this figure is provided in Section B.3.

Figure 8: This figure presents the outcome when $x$ and $y$ are placed differently in space compared to Figure 5. The projection and mass computation steps remain unchanged (though $\alpha(y)$ may differ due to the new location of $y$). However, the resulting tree structure changes: for example, the subtree $\Gamma(e_3)$ now contains only the point $b_1$, while $\Gamma(e_5)$ includes $b_2$, $x_3$, $b_3$, and $a_3$.

C. Max Tree-Sliced Wasserstein Distance on Systems of Lines

Max Sliced Wasserstein distance. The Max Sliced Wasserstein (Max SW) distance (Deshpande et al., 2019) uses a single maximal projecting direction instead of multiple projecting directions as in SW:

$$\text{Max-SW}_p(\mu, \nu) := \max_{\theta \in S^{d-1}} \, W_p\left(\mathcal{R} f_\mu(\cdot, \theta), \mathcal{R} f_\nu(\cdot, \theta)\right). \quad (38)$$

Max SW requires an optimization procedure to find the projecting direction. It is a metric on the space of probability distributions on $\mathbb{R}^d$. We define the Max Tree-Sliced Wasserstein distance on Systems of Lines (Max TSW-SL) analogously.

Definition C.1 (Max Tree-Sliced Wasserstein Distance on Systems of Lines). The Max Tree-Sliced Wasserstein Distance on Systems of Lines between two probability distributions $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by:

$$\text{Max TSW-SL}(\mu, \nu) := \max_{L \in \mathbb{T}} \, W_{d_L, 1}\left(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu\right). \quad (39)$$

Max TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$; the proof of the theorem below is given in Appendix D.2.

Theorem C.2. The Max TSW-SL distance is a metric on $\mathcal{P}(\mathbb{R}^d)$.

We provide an algorithm to compute Max TSW-SL in Algorithm 3.

Algorithm 3 Max Tree-Sliced Wasserstein distance on Systems of Lines.
Input: Probability measures $\mu$ and $\nu$, the number of lines in the tree system $k$, a splitting map $\alpha \colon \mathbb{R}^d \to \Delta_{k-1}$, learning rate $\eta$, maximum number of iterations $T$.
Initialize $x_1 \in \mathbb{R}^d$, $t_2, \ldots, t_k \in \mathbb{R}$, $\theta_1, \ldots, \theta_k \in S^{d-1}$.
Compute $L$ corresponding to $(x_1, t_2, \ldots, t_k, \theta_1, \ldots, \theta_k)$.
while $L$ has not converged and the iteration count is below $T$ do
  $x_1 = x_1 + \eta \cdot \nabla_{x_1} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu)$.
  for $i = 2$ to $k$ do
    $t_i = t_i + \eta \cdot \nabla_{t_i} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu)$.
  end for
  for $i = 1$ to $k$ do
    $\theta_i = \theta_i + \eta \cdot \nabla_{\theta_i} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu)$.
    Normalize $\theta_i = \theta_i / \|\theta_i\|_2$.
  end for
end while
Compute $L$ corresponding to $(x_1, t_2, \ldots, t_k, \theta_1, \ldots, \theta_k)$.
Compute $W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu)$.
Return: $L$, $W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu)$.
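The quantity $W_{d_L,1}$ appearing in Algorithm 3 is the tree-Wasserstein distance on the tree of Corollary A.12, which admits the closed form $\sum_{e} w_e\,|\mu(\Gamma(e)) - \nu(\Gamma(e))|$ illustrated in Figures 7 and 8. A minimal sketch (ours) of this closed form, given node masses on a rooted tree:

```python
import numpy as np

def tree_wasserstein_1(masses_mu, masses_nu, parent, edge_len):
    """W_{d_L,1} between two measures supported on the nodes of a rooted tree.

    parent[i]: parent of node i (-1 for the root); edge_len[i]: length of the
    edge e_i = (i, parent[i]). Nodes are assumed ordered so that parent[i] < i.
    """
    n = len(parent)
    diff = np.asarray(masses_mu, dtype=float) - np.asarray(masses_nu, dtype=float)
    # Bottom-up pass: after the loop, diff[i] = mu(Gamma(e_i)) - nu(Gamma(e_i)),
    # the mass difference of the subtree hanging below edge e_i.
    for i in range(n - 1, 0, -1):
        diff[parent[i]] += diff[i]
    return sum(edge_len[i] * abs(diff[i]) for i in range(1, n))
```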
D. Theoretical Proof for Injectivity of TSW-SL

We omit almost-surely conditions in the proofs, as they are straightforward to verify, and including them would unnecessarily complicate the presentation.

D.1. Proof of Theorem 5.2

Proof. We need to show that:

$$\text{TSW-SL}(\mu, \nu) := \int_{\mathbb{T}} W_{d_L,1}\left(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu\right) d\sigma(L) \quad (40)$$

is a metric on $\mathcal{P}(\mathbb{R}^d)$.

Positive definiteness. For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, one has $\text{TSW-SL}(\mu, \mu) = 0$ and $\text{TSW-SL}(\mu, \nu) \ge 0$. If $\text{TSW-SL}(\mu, \nu) = 0$, then $W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu) = 0$ for all $L \in \mathbb{T}$. Since $W_{d_L,1}$ is a metric on $\mathcal{P}(L)$, we have $\mathcal{R}^{\alpha}_L \mu = \mathcal{R}^{\alpha}_L \nu$ for all $L \in \mathbb{T}$. Since $\mathbb{T}$ is a subset of $\mathbb{L}^d_k$ satisfying the condition in the remark at the end of the proof of Theorem B.4, we conclude that $\mu = \nu$.

Symmetry. For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, we have:

$$\begin{aligned}
\text{TSW-SL}(\mu, \nu) &= \int_{\mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu) \, d\sigma(L) \quad (41) \\
&= \int_{\mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \nu, \mathcal{R}^{\alpha}_L \mu) \, d\sigma(L) \quad (42) \\
&= \text{TSW-SL}(\nu, \mu). \quad (43)
\end{aligned}$$

So $\text{TSW-SL}(\mu, \nu) = \text{TSW-SL}(\nu, \mu)$.

Triangle inequality. For $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\mathbb{R}^d)$, we have:

$$\begin{aligned}
&\text{TSW-SL}(\mu_1, \mu_2) + \text{TSW-SL}(\mu_2, \mu_3) \quad (44) \\
&= \int_{\mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_2) \, d\sigma(L) + \int_{\mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_2, \mathcal{R}^{\alpha}_L \mu_3) \, d\sigma(L) \quad (45) \\
&= \int_{\mathbb{T}} \left[ W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_2) + W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_2, \mathcal{R}^{\alpha}_L \mu_3) \right] d\sigma(L) \quad (46) \\
&\ge \int_{\mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_3) \, d\sigma(L) \quad (47) \\
&= \text{TSW-SL}(\mu_1, \mu_3). \quad (48)
\end{aligned}$$

The triangle inequality holds for TSW-SL. We conclude that TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$.

D.2. Proof of Theorem C.2

Proof. We need to show that:

$$\text{Max TSW-SL}(\mu, \nu) = \max_{L \in \mathbb{T}} W_{d_L,1}\left(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu\right) \quad (49)$$

is a metric on $\mathcal{P}(\mathbb{R}^d)$.

Positive definiteness. For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, one has $\text{Max TSW-SL}(\mu, \mu) = 0$ and $\text{Max TSW-SL}(\mu, \nu) \ge 0$. If $\text{Max TSW-SL}(\mu, \nu) = 0$, then $W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu) = 0$ for all $L \in \mathbb{T}$. Since $W_{d_L,1}$ is a metric, we have $\mathcal{R}^{\alpha}_L \mu = \mathcal{R}^{\alpha}_L \nu$ for all $L \in \mathbb{T}$. Since $\mathbb{T}$ is a subset of $\mathbb{L}^d_k$ satisfying the condition in the remark at the end of the proof of Theorem B.4, we conclude that $\mu = \nu$.

Symmetry. For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, we have:

$$\begin{aligned}
\text{Max TSW-SL}(\mu, \nu) &= \max_{L \in \mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu, \mathcal{R}^{\alpha}_L \nu) \quad (50) \\
&= \max_{L \in \mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \nu, \mathcal{R}^{\alpha}_L \mu) \quad (51) \\
&= \text{Max TSW-SL}(\nu, \mu). \quad (52)
\end{aligned}$$

So $\text{Max TSW-SL}(\mu, \nu) = \text{Max TSW-SL}(\nu, \mu)$.

Triangle inequality. For $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\mathbb{R}^d)$, we have:

$$\begin{aligned}
&\text{Max TSW-SL}(\mu_1, \mu_2) + \text{Max TSW-SL}(\mu_2, \mu_3) \quad (53) \\
&= \max_{L \in \mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_2) + \max_{L' \in \mathbb{T}} W_{d_{L'},1}(\mathcal{R}^{\alpha}_{L'} \mu_2, \mathcal{R}^{\alpha}_{L'} \mu_3) \quad (54) \\
&\ge \max_{L \in \mathbb{T}} \left[ W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_2) + W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_2, \mathcal{R}^{\alpha}_L \mu_3) \right] \quad (55) \\
&\ge \max_{L \in \mathbb{T}} W_{d_L,1}(\mathcal{R}^{\alpha}_L \mu_1, \mathcal{R}^{\alpha}_L \mu_3) \quad (56) \\
&= \text{Max TSW-SL}(\mu_1, \mu_3). \quad (57)
\end{aligned}$$

The triangle inequality holds for Max TSW-SL. We conclude that Max TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$.

E. Experimental Details and Additional Results

E.1. Gradient flows

Gradient flow is a concept in differential geometry and dynamical systems that describes the evolution of a point or a curve under a given vector field. In the Sliced Wasserstein literature, gradient flow is a synthetic task used to track the Wasserstein distance between a source and a target distribution while a chosen distance (Mahey et al., 2023; Kolouri et al., 2019) is minimized as the loss function.
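Concretely, the task evolves a set of source particles by gradient descent on the chosen sliced loss. A minimal sketch (ours), where `tsw_sl` stands for a hypothetical differentiable estimator of TSW-SL between the empirical measures of two point clouds (not defined here):

```python
import torch

def gradient_flow(Z, Y, loss_fn, lr=5e-3, n_steps=2500):
    """Evolve source particles Z toward target particles Y by minimizing loss_fn.

    loss_fn is any differentiable sliced distance between two point clouds,
    e.g. lambda a, b: tsw_sl(a, b, n_trees=25, k=4) for a TSW-SL estimator.
    """
    Z = Z.clone().requires_grad_(True)
    opt = torch.optim.SGD([Z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss_fn(Z, Y).backward()
        opt.step()
    return Z.detach()
```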
Datasets. We use the Swiss Roll, 25-Gaussians, and high-dimensional Gaussian datasets as target distributions, as in (Kolouri et al., 2019). These datasets are described as follows.

The Swiss Roll dataset is a popular synthetic dataset in machine learning, particularly for visualizing and testing dimensionality-reduction techniques. It is generated with the make_swiss_roll function of scikit-learn, which creates a non-linear, three-dimensional dataset resembling a swiss roll or spiral shape. In the original version, it is a three-dimensional dataset, each dimension being one coordinate of the points. To simplify it, we follow (Kolouri et al., 2019) and consider a two-dimensional Swiss Roll obtained by dropping the second coordinate and retaining only the first and third coordinates. We set the number of samples to 100.

The 25-Gaussians dataset is obtained by first creating a grid of 25 points spaced evenly in a 5 × 5 arrangement. For each grid point, we generate a cluster by sampling points from a Gaussian distribution centered at that grid point, with a small standard deviation. The points from the 25 clusters are combined, shuffled randomly, and scaled to form the final dataset.

The high-dimensional Gaussian datasets are generated by initializing a mean vector $\mu_s$ of ones across all dimensions; each element of this mean vector is scaled by a random value to introduce variability. The covariance matrix $\Sigma$ is an identity matrix scaled by a constant, ensuring independence among dimensions. Points are then sampled from the multivariate normal distribution with these parameters, resulting in a dataset of $N$ points in the specified high-dimensional space. The Swiss Roll and 25-Gaussians datasets are shown in Figure 9.

Figure 9: The Swiss Roll and 25-Gaussians datasets for the gradient flows task.

Hyperparameters. For TSW-SL, we use $L = 25$ trees with $k = 4$ lines each in all experiments, while $L = 100$ lines are used for SW and the SW variants, with 100 points generated per distribution across datasets. Following (Mahey et al., 2023), the global learning rate for all baselines is $5 \times 10^{-3}$. For our methods, we use $5 \times 10^{-3}$ for the 25-Gaussians and Swiss Roll datasets, and $5 \times 10^{-2}$ for the high-dimensional Gaussian datasets. We also follow (Mahey et al., 2023) in setting 100 iterations for both Max SW and Max TSW-SL, with a learning rate of $1 \times 10^{-4}$ for both methods.

Evaluation metrics. We use the Wasserstein distance as a neutral metric to evaluate how close the model distribution $\mu(t)$ is to the target distribution $\nu$. Over 2500 timesteps, we evaluate the Wasserstein distance between the source and target distributions at iterations 500, 1000, 1500, 2000, and 2500. We utilize source code adapted from (Mahey et al., 2023) for this task.

E.2. Color Transfer

In addition to the experimental results presented in the main text, we extend our results to show the empirical advantage of TSW-SL over SW for transferring color between images, producing results that closely match the color distributions of the target images. Given a source image and a target image, we represent their respective color palettes as matrices $X$ and $Y$, each of dimension $n \times 3$ (where $n$ denotes the number of pixels). Following Nguyen et al. (2024a), we first define the curve

$$\dot{Z}(t) = -n \nabla_{Z(t)} \left[ \mathrm{SW}_2\left(P_{Z(t)}, P_Y\right) \right],$$

where $P_X$ and $P_Y$ are the empirical distributions over $X$ and $Y$, respectively. The curve starts at $Z(0) = X$ and ends at $Y$. We then reduce the number of colors in the images to 1000 using K-means clustering.
After that, we iterate along the curve between the empirical distribution $P_X$ of colors in the source image and the empirical distribution $P_Y$ of colors in the target image using the approximate Euler method. However, since the color palette values (RGB) lie within the set $\{0, \ldots, 255\}$, an additional rounding step is necessary during the final Euler iterations. We increase the number of iterations to 2000 and use a step size of 1, as in (Nguyen et al., 2024a), for the baselines, and a step size of 16 for our experiments. During optimization, we minimize the loss $\mathrm{Loss}(Z(t), Y)$ to bring the color distribution of the obtained image close to that of the target image $Y$.

We evaluate the color-transferred images obtained with various losses, including SW (Bonneel et al., 2015), Max SW (Deshpande et al., 2019), and the SW variants proposed in (Nguyen et al., 2024a), comparing them with our TSW-SL and Max TSW-SL approaches. For a fair comparison, we set $L = 100$ for the SW variants and $L = 25$, $k = 4$ for TSW-SL. We report the Wasserstein distances at the final time step, along with the corresponding transferred images from the various baselines, in Figure 10. TSW-SL produces images that most closely resemble the target, with a significant reduction compared to SW and the variants mentioned above for the same number of lines. In addition, Max TSW-SL improves upon the original Max SW, as highlighted by both the qualitative and quantitative results.

Figure 10: Color-transferred images from various baselines with 100 projecting directions.

Evaluation metrics. We present the Wasserstein distances at the final time step alongside the corresponding transferred images to evaluate the performance of the different methods. We utilize source code adapted from (Nguyen et al., 2021a) for this task.
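A minimal sketch (ours) of this Euler scheme, with `sliced_loss` standing for any differentiable sliced distance between two palettes (e.g. an SW or TSW-SL estimator, assumed given); for brevity, the rounding is applied once at the end rather than during the final iterations:

```python
import torch

def color_transfer(Z, Y, sliced_loss, step=16.0, n_iters=2000):
    """Euler discretization of the color-transfer curve from palette Z toward Y."""
    Z = Z.clone().requires_grad_(True)
    n = Z.shape[0]
    for _ in range(n_iters):
        loss = sliced_loss(Z, Y)
        grad, = torch.autograd.grad(loss, Z)
        with torch.no_grad():
            Z -= step * n * grad              # one Euler step along the curve
    # Round back to valid RGB values.
    return Z.detach().clamp(0, 255).round()
```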
E.3. Generative adversarial network

Architectures. We denote by $\mu$ our data distribution. We formulate the model distribution $\nu_\phi$ as the probability measure obtained by applying a neural network $G_\phi$ to transform a unit multivariate Gaussian $\epsilon$ into the data space. Additionally, we employ another neural network $T_\beta$ that maps from the data space to a single scalar. More specifically, $T_{\beta_1}$ denotes the sub-network of $T_\beta$ that maps from the data space to a feature space (the output of the last ResBlock), while $T_{\beta_2}$ maps from the feature space (the image of $T_{\beta_1}$) to a single scalar; formally, $T_\beta = T_{\beta_2} \circ T_{\beta_1}$. We use the following neural network architectures for $G_\phi$ and $T_\beta$:

- $G_\phi$: $z \in \mathbb{R}^{128}$ ($\sim \mathcal{N}(0, 1)$) → 4 × 4 × 256 (Dense, Linear) → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → BN, ReLU → 3 × 3 conv, 3 Tanh.
- $T_{\beta_1}$: $x \in [-1, 1]^{32 \times 32 \times 3}$ → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock 128 → ResBlock 128.
- $T_{\beta_2}$: $x \in \mathbb{R}^{128 \times 8 \times 8}$ → ReLU → Global sum pooling (128) → 1 (Spectral normalization). $T_\beta(x) = T_{\beta_2}(T_{\beta_1}(x))$.

- $G_\phi$: $z \in \mathbb{R}^{128}$ ($\sim \mathcal{N}(0, 1)$) → 4 × 4 × 256 (Dense, Linear) → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → BN, ReLU → 3 × 3 conv, 3 Tanh.
- $T_{\beta_1}$: $x \in [-1, 1]^{32 \times 32 \times 3}$ → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock 128 → ResBlock 128.
- $T_{\beta_2}$: $x \in \mathbb{R}^{128 \times 8 \times 8}$ → ReLU → Global sum pooling (128) → 1 (Spectral normalization). $T_\beta(x) = T_{\beta_2}(T_{\beta_1}(x))$.

- $G_\phi$: $z \in \mathbb{R}^{128}$ ($\sim \mathcal{N}(0, 1)$) → 3 × 3 × 256 (Dense, Linear) → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → BN, ReLU → 3 × 3 conv, 3 Tanh.
- $T_{\beta_1}$: $x \in [-1, 1]^{32 \times 32 \times 3}$ → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock 128 → ResBlock 128.
- $T_{\beta_2}$: $x \in \mathbb{R}^{128 \times 8 \times 8}$ → ReLU → Global sum pooling (128) → 1 (Spectral normalization). $T_\beta(x) = T_{\beta_2}(T_{\beta_1}(x))$.

Figure 11: Randomly generated images of different methods (SW with $L = 50$ and $L = 500$; TSW-SL with 50 and 500 lines) on CIFAR10 and CelebA with SN-GAN.

We train the neural networks with the following bi-optimization problem:

$$\min_{\beta_1, \beta_2} \; -\left( \mathbb{E}_{x \sim \mu}\left[\min\left(0, -1 + T_\beta(x)\right)\right] + \mathbb{E}_{z \sim \epsilon}\left[\min\left(0, -1 - T_\beta(G_\phi(z))\right)\right] \right),$$

$$\min_{\phi} \; \mathbb{E}_{X \sim \mu^m,\, Z \sim \epsilon^m}\left[ \mathcal{S}\left( \bar{T}_{\beta_1,\beta_2} \sharp P_X, \; \bar{T}_{\beta_1,\beta_2} \sharp G_\phi \sharp P_Z \right) \right],$$

where $\bar{T}_{\beta_1,\beta_2}(x) = [T_{\beta_1}(x), T_{\beta_2}(T_{\beta_1}(x))]$ is the concatenation of $T_{\beta_1}(x)$ and $T_{\beta_2}(T_{\beta_1}(x))$, and $\mathcal{S}$ is an estimator of the sliced Wasserstein distance.

Training setup. In our experiments, we set the number of training iterations to 100,000 for CIFAR10 and STL-10, and 50,000 for CelebA. The generator $G_\phi$ is updated every 5 iterations, while the feature function $T_\beta$ is updated every iteration. Across all datasets, we maintain a consistent mini-batch size of 128. We use the Adam optimizer (Kingma, 2014) with parameters $(\beta_1, \beta_2) = (0, 0.9)$ for both $G_\phi$ and $T_\beta$, with learning rate 0.0002. Furthermore, we use 50,000 random samples generated by the generator to compute the FID and Inception scores. For FID evaluation, the dataset statistics are computed using all training samples.

Results. For qualitative analysis, Figure 11 displays a selection of randomly generated images produced by the trained models. Utilizing our TSW-SL as the generator loss noticeably enhances the quality of the generated images, and increasing the number of projections further improves the visual quality across all estimators. This improvement is corroborated by the FID and IS scores presented in Table 3. We utilize source code adapted from (Miyato et al., 2018) for this task.

Table 5: Performance of different methods on the STL-10 dataset with the SN-GAN architecture.

Methods | Total lines | Lines per tree | No. of trees | FID (↓) | IS (↑)
SW | 50 | - | - | 69.46 | 9.08
Orthogonal-SW | 50 | - | - | 63.61 | 9.63
TSW-SL (ours) | 51 | 3 | 17 | 65.93 | 9.75
TSW-SL (ours) | 52 | 4 | 13 | 62.91 | 9.95
TSW-SL (ours) | 50 | 5 | 10 | 61.15 | 10.00

Additional results. To fully show the empirical advantage of our methods, we conducted additional experiments with adversarial networks on the CIFAR-10 and STL-10 datasets. First, Table 6 presents the average FID and IS scores of the different methods on CIFAR-10. For 50 projecting directions, our TSW-SL method with 10 trees of 5 lines each ($L = 10$, $k = 5$) achieves the best performance, outperforming the standard SW method. Similarly, for 500 projecting directions, TSW-SL ($L = 100$, $k = 5$) shows superior results compared to SW. This demonstrates the consistent effectiveness of our approach across different numbers of projecting directions.
Additionally, Table 5 reports the performance of generative models on the STL-10 (96 × 96) dataset for different numbers of trees and lines, compared with SW and Orthogonal-SW. In these experiments, we use the SN-GAN architecture (Miyato et al., 2018) for STL-10. For SW and Orthogonal-SW, we use 50 projecting directions. Our TSW-SL method is tested in three distinct configurations: 10 trees with 5 lines each, 13 trees with 4 lines each, and 17 trees with 3 lines each. All hyperparameters remain consistent with those used in the main paper. To evaluate the models, we generate 50,000 random images.

Table 6: Average FID (↓) and IS (↑) scores over 3 runs on CIFAR-10 with SN-GAN.

50 projecting directions:
SW ($L = 50$): FID 16.80 ± 0.45, IS 7.97 ± 0.05
TSW-SL ($L = 10$, $k = 5$): FID 15.44 ± 0.42, IS 8.14 ± 0.05
TSW-SL ($L = 17$, $k = 3$): FID 15.9 ± 0.35, IS 8.10 ± 0.04

500 projecting directions:
SW ($L = 500$): FID 14.23 ± 0.84, IS 8.25 ± 0.05
TSW-SL ($L = 100$, $k = 5$): FID 13.27 ± 0.23, IS 8.30 ± 0.01
TSW-SL ($L = 167$, $k = 3$): FID 14.18 ± 0.38, IS 8.28 ± 0.07

E.4. Denoising diffusion models

In this section, we provide details about denoising diffusion models, a class of generative models that has shown remarkable success in producing high-quality samples. We first describe the forward and reverse processes that form the foundation of these models. We then introduce the concept of denoising diffusion GANs, which aim to accelerate the generation process. Finally, we explain how our proposed TSW-SL distance can be integrated into this framework.

The process in diffusion models is typically divided into two main parts: the forward process and the reverse process. The forward process is defined as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right),$$

where the variance schedule $\beta_1, \ldots, \beta_T$ consists of constant or learned hyperparameters. The reverse process is defined as:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right),$$

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are functions providing the mean and covariance of the Gaussian, defined using MLPs. The model is trained by maximizing the variational lower bound on the negative log-likelihood:

$$\mathbb{E}_q\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right].$$

While traditional diffusion models have successfully generated high-quality images without the need for adversarial training, their sampling process involves simulating a Markov chain for multiple steps, which can be time-consuming. To accelerate the generation process by reducing the number of steps $T$, denoising diffusion GANs (Xiao et al., 2021) propose utilizing an implicit denoising model:

$$p_\theta(x_{t-1} \mid x_t) = \int p_\theta(x_{t-1} \mid x_t, \epsilon)\, G_\theta(x_t, \epsilon)\, d\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Subsequently, adversarial training is employed (Xiao et al., 2021) to optimize the model parameters:

$$\sum_{t \ge 1} \mathbb{E}_{q(x_t)}\left[ D_{\mathrm{adv}}\left( q(x_{t-1} \mid x_t) \,\|\, p_\phi(x_{t-1} \mid x_t) \right) \right],$$

where $D_{\mathrm{adv}}$ refers to either the GAN objective or the Jensen-Shannon divergence (Goodfellow et al., 2020). We follow the Augmented Generalized Mini-batch Energy distances proposed by Nguyen et al. (2024b) and leverage our TSW-SL distance for $D_{\mathrm{adv}}$. More specifically, as described by Nguyen et al. (2024b), the adversarial loss is replaced by the augmented generalized Mini-batch Energy (AGME) distance. For two distributions $\mu$ and $\nu$, with a mini-batch size $m \ge 1$, the AGME distance using a Sliced Wasserstein (SW) kernel is defined as:

$$\mathrm{AGME}^2_b(\mu, \nu; g) = \mathrm{GME}^2_b(\bar{\mu}, \bar{\nu}),$$

where $\bar{\mu} = f \sharp \mu$ and $\bar{\nu} = f \sharp \nu$, with the mapping $f(x) = (x, g(x))$ for a nonlinear function $g \colon \mathbb{R}^d \to \mathbb{R}$. The GME is the generalized Mini-batch Energy distance (Salimans et al., 2018), given by:

$$\mathrm{GME}^2_b(\mu, \nu) = 2\,\mathbb{E}\left[D(P_X, P_Y)\right] - \mathbb{E}\left[D(P_X, P_{X'})\right] - \mathbb{E}\left[D(P_Y, P_{Y'})\right],$$

where $X, X' \overset{\text{i.i.d.}}{\sim} \mu^m$, $Y, Y' \overset{\text{i.i.d.}}{\sim} \nu^m$, and $P_X = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}$ for $X = (x_1, \ldots, x_m)$.
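A minimal sketch (ours) of a one-sample estimate of these distances; `D` is any distance between mini-batches (e.g. an SW or TSW-SL estimator), `g` maps a batch to one scalar feature per sample, and in practice the expectations are averaged over many mini-batch draws:

```python
import torch

def gme_squared(X, X2, Y, Y2, D):
    # One-sample estimate of GME^2_b(mu, nu) from two independent mini-batches
    # of each distribution: 2 D(P_X, P_Y) - D(P_X, P_X') - D(P_Y, P_Y').
    return 2 * D(X, Y) - D(X, X2) - D(Y, Y2)

def agme_squared(X, X2, Y, Y2, D, g):
    # AGME^2_b(mu, nu; g) = GME^2_b(f#mu, f#nu) with f(x) = (x, g(x)):
    # each sample is augmented with its scalar feature before measuring GME.
    lift = lambda B: torch.cat([B, g(B)], dim=1)   # g(B) has shape (batch, 1)
    return gme_squared(lift(X), lift(X2), lift(Y), lift(Y2), D)
```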
In the definition of GME above, $D$ denotes a distance function between two probability measures. To evaluate how well TSW-SL compares to other SW variants in capturing topological information, particularly when the supports lie in high-dimensional spaces, we instantiate $D$ with both TSW-SL and SW variants. We then train the generative model to assess which distance metric better quantifies the divergence between two probability distributions; a lower FID score indicates a more effective distance measure.

Experimental setup. For our experiments, we adopted the architecture and hyperparameters from Nguyen et al. (2024b), training our models for 1800 epochs. For TSW-SL, we employed the hyperparameters $L = 2500$, $k = 4$. For the vanilla SW and its variants, we adhered to the approach outlined in Nguyen et al. (2024b), using $L = 10000$. This consistent setup allowed us to effectively compare the performance of our proposed methods against existing approaches while maintaining experimental integrity.

FID plot. Figure 12 illustrates the FID scores of SW-DD and TSW-SL-DD across epochs. Due to the wide range of FID values, from over 400 in the initial epoch to less than 3.0 in the final epochs, we present the results on a logarithmic scale for improved visualization. The plot shows that TSW-SL-DD achieves a greater reduction in FID scores compared to SW-DD during the final 300 epochs.

Figure 12: FID scores over epochs for SW-DD and TSW-SL-DD.

E.5. Computational infrastructure

The experiments on gradient flows, color transfer, and generative adversarial networks were conducted on a single NVIDIA A100 GPU. Training generative adversarial networks on CIFAR10 requires 14 hours, while CelebA training takes 22 hours. For gradient flows, the experiments for each dataset take approximately 3.5 hours. For color transfer, the runtime is 15 minutes. The denoising diffusion experiments were conducted in parallel on 2 NVIDIA A100 GPUs, and each run takes 81.5 hours.

F. Further Discussions

The proposed approach, Tree-Sliced Wasserstein distance on tree-structured systems of lines (TSW-SL), lays a foundation for further research on TSW on tree systems, especially for applications with dynamic-support measures. For example, Tran et al. (2025b) propose a novel splitting map for TSW-SL that takes into account the relative positions between supports and lines in the tree systems; Tran et al. (2025c) derive the non-linear Radon transform for TSW-SL; and Tran et al. (2025a) extend TSW-SL to probability measures supported on a sphere. Notice that the QuadTree (i.e., partition-based) and clustering-based tree-metric sampling approaches for tree-Wasserstein (Le et al., 2019; Indyk & Thaper, 2003; Lin et al., 2025) may not be applicable in applications with dynamic-support measures, e.g., diffusion models and generative models, since the constructed or sampled trees are bounded, i.e., limited to a finite number of nodes from the root to the leaves. Additionally, these approaches may not admit Radon transforms with guaranteed injectivity, which plays a key role in applications with dynamic-support measures.
For p-order Wasserstein with p > 1. The tree-Wasserstein (TW) is the 1-order Wasserstein distance for probability measures on a tree. For more general settings, e.g., $p > 1$, or for probability measures on a graph, one may consider the scalable Sobolev transport (Le et al., 2022), which also yields a closed-form expression for fast computation and generalizes TW (i.e., for $p = 1$, Sobolev transport is equal to TW).

G. Broader Impact

The novel Tree-Sliced Wasserstein distance on a System of Lines (TSW-SL) introduced in this paper holds significant potential for societal advancement. By refining optimal transport methodologies, TSW-SL enhances their accuracy and versatility across diverse practical domains. This approach, which synthesizes elements from both Sliced Wasserstein (SW) and Tree-Sliced Wasserstein (TSW), offers enhanced resilience and adaptability, particularly in dynamic scenarios. The resulting improvements in gradient flows, color manipulation, and generative modeling yield more potent computational tools. These advancements promise to catalyze progress across multiple sectors. In healthcare, for instance, refined image processing could elevate the precision of medical diagnostics. The creative industries stand to benefit from more sophisticated generative models, potentially revolutionizing artistic expression. Moreover, TSW-SL's proficiency in handling dynamic environments opens new avenues for real-time analytics and decision-making in fields ranging from finance to environmental monitoring. By expanding the applicability of advanced computational techniques to a wider array of real-world challenges, TSW-SL contributes to technological innovation and, consequently, to the enhancement of societal welfare.