Published as a conference paper at ICLR 2025

DISTANCE-BASED TREE-SLICED WASSERSTEIN DISTANCE

Viet-Hoang Tran, Department of Mathematics, National University of Singapore, hoang.tranviet@u.nus.edu
Khoi N.M. Nguyen, FPT Software AI Center, khoinnm1@fpt.com
Trang Pham, Qualcomm AI Research, tranpham@qti.qualcomm.com
Thanh T. Chu, Department of Computer Science, National University of Singapore, thanh.chu@u.nus.edu
Tam Le, The Institute of Statistical Mathematics & RIKEN AIP, tam@ism.ac.jp
Tan M. Nguyen, Department of Mathematics, National University of Singapore, tanmn@nus.edu.sg

To overcome the computational challenges of Optimal Transport (OT), several variants of Sliced Wasserstein (SW) have been developed in the literature. These approaches exploit the closed-form expression of univariate OT by projecting measures onto (one-dimensional) lines. However, projecting measures onto low-dimensional spaces can lead to a loss of topological information. The Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) has emerged as a promising alternative that replaces these lines with a more advanced structure called tree systems. The tree structures enhance the ability of the metric to capture topological information while preserving computational efficiency. However, at the core of TSW-SL, the splitting maps, which serve as the mechanism for pushing forward measures onto tree systems, focus solely on the position of the measure supports while disregarding the projecting domains. Moreover, the specific splitting map used in TSW-SL leads to a metric that is not invariant under Euclidean transformations, a property typically expected of OT on Euclidean space. In this work, we propose a novel class of splitting maps that generalizes the existing one studied in TSW-SL, enabling the use of all positional information from input measures and resulting in a novel Distance-based Tree-Sliced Wasserstein (Db-TSW) distance.
In addition, we introduce a simple tree sampling process better suited for Db-TSW, leading to an efficient GPU-friendly implementation for tree systems, similar to the original SW. We also provide a comprehensive theoretical analysis of the proposed class of splitting maps to verify the injectivity of the corresponding Radon Transform, and demonstrate that Db-TSW is a Euclidean-invariant metric. We empirically show that Db-TSW significantly improves accuracy compared to recent SW variants while maintaining low computational cost via a wide range of experiments on gradient flows, image style transfer, and generative models. The code is publicly available at https://github.com/Fsoft-AIC/DbTSW.

Co-first authors. Co-last authors. Qualcomm Vietnam Company Limited. Correspondence to: hoang.tranviet@u.nus.edu & tanmn@nus.edu.sg

1 INTRODUCTION

Optimal transport (OT) (Villani, 2008; Peyré et al., 2019) is a framework designed to compare probability distributions by extending the concept of a ground cost metric, originally defined between the supports of input measures, to a metric between entire probability measures. OT has found utility across a diverse array of fields, including machine learning (Bunne et al., 2022; Hua et al., 2023; Nguyen et al., 2021b; Fan et al., 2022), data valuation (Just et al., 2023; Kessler et al., 2025), multimodal data analysis (Park et al., 2024; Luong et al., 2024), statistics (Mena & Niles-Weed, 2019; Weed & Berthet, 2019; Wang et al., 2022; Pham et al., 2024; Liu et al., 2022; Nguyen et al., 2022; Nietert et al., 2022), computer vision, and graphics (Lavenant et al., 2018; Nguyen et al., 2021a; Saleh et al., 2022; Solomon et al., 2015). However, OT suffers from a computational burden due to its supercubic complexity with respect to the number of supports in the input measures (Peyré et al., 2019).
To alleviate this issue, Sliced Wasserstein (SW) (Rabin et al., 2011; Bonneel et al., 2015) leverages the closed-form solution of OT in the one-dimensional case to reduce computational demands by projecting the supports of the input measures onto random lines. The vanilla SW distance has been continuously developed and enhanced by refining existing components and introducing meaningful additions to achieve better performance. Examples include improvements in the sampling process (Nadjahi et al., 2021; Nguyen et al., 2024a), determining optimal projection lines (Deshpande et al., 2019), and modifying the projection mechanism (Kolouri et al., 2019; Bonet et al., 2023b).

Related work. Relying solely on one-dimensional projections can result in the loss of essential topological structures in high-dimensional data. To address this, an alternative approach has emerged that replaces these one-dimensional lines with different domains, applying to OT on Euclidean spaces (Alvarez-Melis et al., 2018; Paty & Cuturi, 2019; Niles-Weed & Rigollet, 2022), tree metric spaces (Indyk & Thaper, 2003; Le & Nguyen, 2021; Tran et al., 2025c;d), graph metric spaces (Le et al., 2022; 2023; 2024), spheres (Quellmalz et al., 2023; Bonet et al., 2023a; Tran et al., 2024b), and hyperbolic spaces (Bonet et al., 2023b). Specifically, Tran et al. (2025d) introduced an integration domain known as the tree system, which functions similarly to lines but with a more advanced structural design. This approach is proven to better capture topological information while maintaining the computational efficiency of the SW method. However, the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) proposed in (Tran et al., 2025d), derived from the tree system framework, fails to satisfy the Euclidean invariance property, which is typically expected for OT on Euclidean spaces (Alvarez-Melis et al., 2019).
In this paper, we address these issues by generalizing the class of splitting maps introduced in (Tran et al., 2025d), simultaneously resolving the lack of invariance and the insufficient accounting for positional information. A new class of splitting maps raises its own challenges: in particular, without the injectivity of the associated Radon Transform, the construction would yield only a pseudo-metric on the space of measures.

Contribution. In summary, our contributions are three-fold:

1. We analyze the Euclidean invariance of the 2-Wasserstein and Sliced p-Wasserstein distances between measures on Euclidean spaces, and then discuss why the Tree-Sliced Wasserstein distance on Systems of Lines fails to satisfy Euclidean invariance. To address this issue, we introduce a larger class of splitting maps that captures positional information from both points and tree systems, generalizing the previous class while incorporating an additional invariance property.

2. We introduce a novel variant of the Radon Transform on Systems of Lines, developing a new class of invariant splitting maps that generalizes the previous version in (Tran et al., 2025d). By providing a comprehensive theoretical analysis with rigorous proofs, we demonstrate how our new class of invariant splitting maps ensures the injectivity of the Radon Transform.

3. We propose the novel Distance-based Tree-Sliced Wasserstein (Db-TSW) distance, which is a Euclidean-invariant metric between measures. By analyzing the choice of splitting maps and tree systems in Db-TSW, we demonstrate that Db-TSW enables a highly parallelizable implementation, achieving an efficiency similar to that of the original SW.

Organization. The structure of the paper is as follows: Section 2 recalls variants of the Wasserstein distance. Section 3 outlines the essential background of the Tree-Sliced Wasserstein distance on Systems of Lines and discusses Euclidean invariance. Section 4 introduces a new class of splitting maps and discusses the corresponding Radon Transform.
Section 5 proposes the Distance-based Tree-Sliced Wasserstein (Db-TSW) distance and discusses the choices of components in Db-TSW. Finally, Section 6 evaluates the performance of Db-TSW. A complete theoretical framework of Db-TSW and supplemental materials are provided in the Appendix.

2 PRELIMINARIES

We review the Wasserstein distance, the Sliced Wasserstein (SW) distance, the Wasserstein distance on metric spaces with tree metrics (TW), and the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL).

Wasserstein Distance. Let $\mu, \nu$ be two probability distributions on $\mathbb{R}^d$. Let $\mathcal{P}(\mu, \nu)$ be the set of probability distributions $\pi$ on the product space $\mathbb{R}^d \times \mathbb{R}^d$ such that $\pi(A \times \mathbb{R}^d) = \mu(A)$ and $\pi(\mathbb{R}^d \times A) = \nu(A)$ for all measurable sets $A$. For $p \geq 1$, the p-Wasserstein distance $W_p$ (Villani, 2008) between $\mu, \nu$ is defined as:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \mathcal{P}(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_p^p \, d\pi(x, y) \right)^{1/p}. \quad (1)$$

Sliced Wasserstein Distance. The Sliced p-Wasserstein distance (SW) (Bonneel et al., 2015) between two probability distributions $\mu, \nu$ on $\mathbb{R}^d$ is defined by:

$$\mathrm{SW}_p(\mu, \nu) := \left( \int_{\mathbb{S}^{d-1}} W_p^p\big(\mathcal{R}f_\mu(\cdot, \theta), \mathcal{R}f_\nu(\cdot, \theta)\big) \, d\sigma(\theta) \right)^{1/p}, \quad (2)$$

where $\sigma = \mathcal{U}(\mathbb{S}^{d-1})$ is the uniform distribution on the unit sphere $\mathbb{S}^{d-1}$, the operator $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$ is the Radon Transform (Helgason, 2011), defined by $\mathcal{R}f(t, \theta) = \int_{\mathbb{R}^d} f(x) \, \delta(t - \langle x, \theta \rangle) \, dx$, and $f_\mu, f_\nu$ are the probability density functions of $\mu, \nu$, respectively. The one-dimensional p-Wasserstein distance in Equation (2) has the closed form $W_p^p(\theta_\sharp \mu, \theta_\sharp \nu) = \int_0^1 |F^{-1}_{\mathcal{R}f_\mu(\cdot,\theta)}(z) - F^{-1}_{\mathcal{R}f_\nu(\cdot,\theta)}(z)|^p \, dz$, where $F_{\mathcal{R}f_\mu(\cdot,\theta)}$ and $F_{\mathcal{R}f_\nu(\cdot,\theta)}$ are the cumulative distribution functions of $\mathcal{R}f_\mu(\cdot, \theta)$ and $\mathcal{R}f_\nu(\cdot, \theta)$, respectively. To approximate the intractable integral in Equation (2), the Monte Carlo method is used as follows:

$$\widehat{\mathrm{SW}}_p(\mu, \nu) = \left( \frac{1}{L} \sum_{l=1}^{L} W_p^p\big(\mathcal{R}f_\mu(\cdot, \theta_l), \mathcal{R}f_\nu(\cdot, \theta_l)\big) \right)^{1/p}, \quad (3)$$

where $\theta_1, \ldots, \theta_L$ are drawn independently from the uniform distribution on $\mathbb{S}^{d-1}$, i.e., $\mathcal{U}(\mathbb{S}^{d-1})$.

Tree Wasserstein Distances.
Let $\mathcal{T}$ be a rooted tree (as a graph) with non-negative edge lengths, and the ground metric $d_{\mathcal{T}}$, i.e., the length of the unique path between two nodes. Given two probability distributions $\mu$ and $\nu$ supported on nodes of $\mathcal{T}$, the Wasserstein distance with ground metric $d_{\mathcal{T}}$ (TW) (Le et al., 2019) has the closed form:

$$W_{d_{\mathcal{T}},1}(\mu, \nu) = \sum_{e \in \mathcal{T}} w_e \, \big| \mu(\Gamma(v_e)) - \nu(\Gamma(v_e)) \big|, \quad (4)$$

where $v_e$ is the endpoint of edge $e$ that is farther away from the tree root, $\Gamma(v_e)$ is the subtree of $\mathcal{T}$ rooted at $v_e$, and $w_e$ is the length of $e$.

Tree-Sliced Wasserstein Distance on Systems of Lines. The Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) (Tran et al., 2025d) is proposed as a combination of the projecting mechanism via the Radon Transform in SW and the closed form of the Wasserstein distance with ground tree metrics in TW. In TSW-SL, tree systems, which are well-defined measure spaces and metric spaces with tree metrics, are proposed to replace the role of directions in SW, and the corresponding variant of the Radon Transform is also presented. Details of TSW-SL are outlined in Section 3.

3 THE ORIGINAL TSW-SL AND THE LACK OF E(d)-INVARIANCE

In this section, we outline the details of TSW-SL in (Tran et al., 2025d) and discuss the invariance under Euclidean transformations of TSW-SL and some variants of the Wasserstein distance on Euclidean spaces.

3.1 REVIEW ON THE ORIGINAL TSW-SL

We recall the notion and theoretical results of the Radon Transform on Systems of Lines and the corresponding Tree-Sliced Wasserstein distance from (Tran et al., 2025d), with some modifications on the notations. We start with a function $f \in L^1(\mathbb{R}^d)$. Usually, $d$ is the dimension of data, and $f$ is the distribution of data. A line in $\mathbb{R}^d$ is an element in $\mathbb{R}^d \times \mathbb{S}^{d-1}$, and a system of $k$ lines in $\mathbb{R}^d$ is an element of $(\mathbb{R}^d \times \mathbb{S}^{d-1})^k$. We denote a system of lines by $L$, a line in $L$ (also its index) by $l$, and the space of all systems of $k$ lines by $\mathbb{L}^d_k$.
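As a concrete illustration (added here, not part of the original text), a system of $k$ lines can be stored as two $(k, d)$ arrays of sources and unit directions; the sampling distributions below (standard normal sources, uniform directions on the sphere) are our own illustrative assumptions:

```python
import numpy as np

def sample_system_of_lines(k, d, rng=None):
    """Sample a system of k lines in R^d, i.e. an element of (R^d x S^{d-1})^k.

    Each line l is parameterized as x_l + t * theta_l with t in R, so the
    system is represented by two (k, d) arrays: sources and unit directions.
    """
    rng = np.random.default_rng() if rng is None else rng
    sources = rng.normal(size=(k, d))
    # normalizing i.i.d. Gaussians yields uniform directions on S^{d-1}
    directions = rng.normal(size=(k, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return sources, directions

sources, directions = sample_system_of_lines(k=4, d=10)
# every direction lies on the unit sphere S^{d-1}
assert np.allclose(np.linalg.norm(directions, axis=1), 1.0)
```

This representation is also the one reused implicitly by the root-concurrent tree systems of Section 5, where all rows of `sources` coincide.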
The ground set of $L$ is defined by:

$$\bar{L} := \big\{ (x, l) \in \mathbb{R}^d \times L : x = x_l + t_x \theta_l \text{ for some } t_x \in \mathbb{R} \big\}, \quad (5)$$

where $x_l + t\theta_l$, $t \in \mathbb{R}$, is the parameterization of $l$. By abuse of notation, we sometimes index lines by $i = 1, \ldots, k$ and denote the line of index $i$ by $l_i$. The source and direction of $l_i$ are denoted by $x_i$ and $\theta_i$, respectively. A tree system is a system of lines $L$ with an additional tree structure $\mathcal{T}$. It is a well-defined metric measure space, denoted by $(L, \mathcal{T})$, or by $L$ if the tree structure does not need to be specified. A space of trees (i.e., a collection of all tree systems with the same tree structure) is denoted by $\mathbb{T}$, with a probability distribution $\sigma$ on $\mathbb{T}$ that comes from the tree sampling process. For $L \in \mathbb{L}^d_k$, the space of Lebesgue integrable functions on $L$ is

$$L^1(L) = \Big\{ f : \bar{L} \to \mathbb{R} \, : \, \|f\|_L = \sum_{l \in L} \int_{\mathbb{R}} |f(t_x, l)| \, dt_x < \infty \Big\}.$$

Consider a splitting map $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$, which is a continuous map from $\mathbb{R}^d$ to the $(k-1)$-dimensional standard simplex $\Delta_{k-1}$. For $f \in L^1(\mathbb{R}^d)$, we define

$$\mathcal{R}^\alpha_L f : \bar{L} \to \mathbb{R} \quad (6)$$
$$(x, l) \mapsto \int_{\mathbb{R}^d} f(y) \, \alpha(y)_l \, \delta\big(t_x - \langle y - x_l, \theta_l \rangle\big) \, dy. \quad (7)$$

The function $\mathcal{R}^\alpha_L f$ is in $L^1(L)$. The operator

$$\mathcal{R}^\alpha : L^1(\mathbb{R}^d) \to \prod_{L \in \mathbb{L}^d_k} L^1(L), \qquad f \mapsto (\mathcal{R}^\alpha_L f)_{L \in \mathbb{L}^d_k},$$

is called the Radon Transform on Systems of Lines. This operator is injective. The Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by

$$\text{TSW-SL}(\mu, \nu) = \int_{\mathbb{T}} W_{d_L,1}\big(\mathcal{R}^\alpha_L \mu, \mathcal{R}^\alpha_L \nu\big) \, d\sigma(L). \quad (8)$$

The TSW-SL is a metric on $\mathcal{P}(\mathbb{R}^d)$. Leveraging the closed-form expression of OT problems on metric spaces with tree metrics (Le et al., 2019) and the Monte Carlo method, TSW-SL in Equation (8) can be efficiently approximated by a closed-form expression.

Remark 1. The Radon Transform $\mathcal{R}^\alpha$ depends on the choice of $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$. Intuitively, the splitting map $\alpha$ represents how the mass at a specific point is distributed across all lines in a system of lines. In the context of the original Radon Transform $\mathcal{R}$, only one line is involved, so $\alpha$ is simply the constant function 1.
As a result, $\mathcal{R}^\alpha$ is a meaningful and nontrivial generalization of $\mathcal{R}$.

3.2 E(d)-INVARIANCE IN OPTIMAL TRANSPORT ON EUCLIDEAN SPACES

Consider $\mathbb{R}^d$ with the Euclidean norm $\|\cdot\|_2$. We consider some groups whose actions preserve the Euclidean norm and the Euclidean distance between two points in $\mathbb{R}^d$. We then discuss the E(d)-invariance of some Wasserstein distance variants.

Euclidean group E(d) and its action on $\mathbb{R}^d$. For $a \in \mathbb{R}^d$, the translation corresponding to $a$ is the map $\mathbb{R}^d \to \mathbb{R}^d$, $x \mapsto x + a$. The translation group $T(d)$ is the group of all translations in $\mathbb{R}^d$. Note that $T(d)$ is isomorphic to the additive group $\mathbb{R}^d$. The orthogonal group $O(d)$ is the group of all linear transformations of $\mathbb{R}^d$ that preserve the Euclidean norm:

$$O(d) = \big\{ \text{linear transformation } f : \mathbb{R}^d \to \mathbb{R}^d \, : \, \|x\|_2 = \|f(x)\|_2 \text{ for all } x \in \mathbb{R}^d \big\}. \quad (9)$$

Note that $O(d)$ is isomorphic to the group of all orthogonal matrices:

$$O(d) = \big\{ Q \text{ is a } d \times d \text{ real matrix} \, : \, Q^\top Q = Q Q^\top = I_d \big\}. \quad (10)$$

Figure 1: An illustration highlighting the distinction between the old and new Radon Transform on Systems of Lines, specifically focusing on two different definitions of splitting maps. Left: the old splitting map relies solely on the location of points, leading to the same distribution, independent of the position of the line systems. Right: the new splitting map considers the configuration of systems of lines, leading to varied mass distributions depending on each system.

The Euclidean group E(d) is the group of all transformations of $\mathbb{R}^d$ that preserve the Euclidean distance between any two points. Formally, E(d) is the semidirect product of $T(d)$ and $O(d)$, i.e., $E(d) \cong T(d) \rtimes O(d)$. An element $g$ of E(d) is denoted by a pair $g = (Q, a)$, where $a \in \mathbb{R}^d$ and $Q \in O(d)$. The action of $g$ on $\mathbb{R}^d$ is $y \mapsto gy = Qy + a$.

Group actions of E(d) on $\mathbb{L}^d_k$ and $\mathbb{T}$. The canonical group action of E(d) on $\mathbb{R}^d$ naturally induces a group action on the set of all lines in $\mathbb{R}^d$, i.e., $\mathbb{R}^d \times \mathbb{S}^{d-1}$.
Given a line $l = (x, \theta) \in \mathbb{R}^d \times \mathbb{S}^{d-1}$ and $g = (Q, a) \in E(d)$, we define

$$gl := (Qx + a, Q\theta) \in \mathbb{R}^d \times \mathbb{S}^{d-1}. \quad (11)$$

For $L = \{l_i = (x_i, \theta_i)\}_{i=1}^k \in \mathbb{L}^d_k$, the action of E(d) on $\mathbb{L}^d_k$ is similarly defined:

$$gL = \{gl_i = (Qx_i + a, Q\theta_i)\}_{i=1}^k \in \mathbb{L}^d_k. \quad (12)$$

In (Tran et al., 2025d), a tree system is a system of lines with an additional tree structure, and by design, this tree structure is preserved under the action of E(d). Thus, if $L \in \mathbb{T}$ is a tree system, then $gL$ is also a tree system. The group action of E(d) on $\mathbb{L}^d_k$ therefore induces a group action of E(d) on $\mathbb{T}$.

E(d)-invariance in Optimal Transport. E(d)-invariance is natural in the context of Optimal Transport on the Euclidean space $\mathbb{R}^d$, since the distance between two measures should remain unchanged when a distance-preserving transformation is applied to the underlying space. In detail, let $\mu \in \mathcal{P}(\mathbb{R}^d)$ be a measure on $\mathbb{R}^d$. For $g \in E(d)$, denote the pushforward of $\mu$ via $g : \mathbb{R}^d \to \mathbb{R}^d$ by $g_\sharp \mu$. This canonically defines a group action of E(d) on $\mathcal{P}(\mathbb{R}^d)$. We have the following result.

Proposition 3.1. The 2-Wasserstein distance and the Sliced p-Wasserstein distance are E(d)-invariant. In other words, for every $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ and $g \in E(d)$, we have

$$W_2(\mu, \nu) = W_2(g_\sharp \mu, g_\sharp \nu) \quad \text{and} \quad \mathrm{SW}_p(\mu, \nu) = \mathrm{SW}_p(g_\sharp \mu, g_\sharp \nu). \quad (13)$$

Moreover, for the Sliced Wasserstein distance, we have the following result for each projecting direction.

Proposition 3.2. Let $g = (Q, a) \in E(d)$. For every $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ and $\theta \in \mathbb{S}^{d-1}$, we have

$$W_p\big(\mathcal{R}f_\mu(\cdot, \theta), \mathcal{R}f_\nu(\cdot, \theta)\big) = W_p\big(\mathcal{R}f_{g_\sharp\mu}(\cdot, Q\theta), \mathcal{R}f_{g_\sharp\nu}(\cdot, Q\theta)\big). \quad (14)$$

The proofs for Proposition 3.1 and Proposition 3.2 can be found in Appendix B.1.

Remark 2. In the setting of TSW-SL, E(d)-invariance and the analogue of Proposition 3.2 cannot be derived for a general splitting map. Assumptions on the invariance of splitting maps will be made to achieve the invariance of TSW-SL.
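Proposition 3.2 can be checked numerically on empirical measures. The sketch below is our own illustration (it uses $p = 1$, uniform weights, and equal sample sizes, so the one-dimensional distance reduces to sorting): projecting $g_\sharp\mu$ onto $Q\theta$ gives the same one-dimensional distance as projecting $\mu$ onto $\theta$.

```python
import numpy as np

def w1_1d(u, v):
    """1-D Wasserstein-1 distance between two empirical measures with
    uniform weights and equal sample sizes (closed form via sorting)."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

rng = np.random.default_rng(0)
d, n = 5, 200
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d)) + 1.0

# a random Euclidean transformation g = (Q, a): Q orthogonal, a translation
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
a = rng.normal(size=d)
gX, gY = X @ Q.T + a, Y @ Q.T + a          # x -> Qx + a applied to supports

theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
# Proposition 3.2: matching the direction as Q theta compensates for g
lhs = w1_1d(X @ theta, Y @ theta)
rhs = w1_1d(gX @ (Q @ theta), gY @ (Q @ theta))
assert np.isclose(lhs, rhs)
```

The identity holds because $\langle Qx + a, Q\theta \rangle = \langle x, \theta \rangle + \langle a, Q\theta \rangle$: both projected samples are shifted by the same constant, which the one-dimensional Wasserstein distance ignores.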
4 RADON TRANSFORM WITH E(d)-INVARIANT SPLITTING MAPS

In (Tran et al., 2025d), the splitting maps are selected as continuous maps from $\mathbb{R}^d$ to $\Delta_{k-1}$, which means that they depend solely on the position of points and ignore the tree systems.

Figure 2: An illustration demonstrating the E(d)-invariance of splitting maps. Starting with a point and a system of lines, two Euclidean transformations are applied, resulting in two additional pairs of points and systems of lines. An E(d)-invariant splitting map handles all three pairs identically, leading to the same mass distribution from the point to the lines within each system.

It is intuitively better for splitting maps to account for positional information from both points and tree systems, rather than from points alone. With this motivation, we introduce a larger class of splitting maps with the corresponding Radon Transform, and then discuss the injectivity of this Radon Transform variant.

4.1 RADON TRANSFORM ON SYSTEMS OF LINES

Denote by $C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$ the space of continuous maps from $\mathbb{R}^d \times \mathbb{L}^d_k$ to $\Delta_{k-1}$, and refer to maps in $C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$ as splitting maps. Here, we use the same name as in (Tran et al., 2025d), since the new class of splitting maps contains the class of splitting maps in (Tran et al., 2025d) via the one-to-one correspondence

$$C(\mathbb{R}^d, \Delta_{k-1}) \longleftrightarrow \big\{ \alpha \in C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1}) \, : \, \alpha(x, L) = \alpha(x, L') \text{ for all } x \in \mathbb{R}^d \text{ and } L, L' \in \mathbb{L}^d_k \big\}.$$

Let $L$ be a system of lines in $\mathbb{L}^d_k$ and $\alpha$ a splitting map in $C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$. We define an operator associated with $\alpha$ that transforms a Lebesgue integrable function on $\mathbb{R}^d$ to a Lebesgue integrable function on $L$. For $f \in L^1(\mathbb{R}^d)$, define

$$\mathcal{R}^\alpha_L f : \bar{L} \to \mathbb{R} \quad (15)$$
$$(x, l) \mapsto \int_{\mathbb{R}^d} f(y) \, \alpha(y, L)_l \, \delta\big(t_x - \langle y - x_l, \theta_l \rangle\big) \, dy, \quad (16)$$

where $\delta$ is the one-dimensional Dirac delta function. We have $\mathcal{R}^\alpha_L f \in L^1(L)$ for $f \in L^1(\mathbb{R}^d)$, and moreover $\|\mathcal{R}^\alpha_L f\|_L \leq \|f\|_1$. In other words, the linear operator $\mathcal{R}^\alpha_L : L^1(\mathbb{R}^d) \to L^1(L)$ is well-defined. The proof of these properties can be found in Appendix B.2.
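For discrete measures, $\mathcal{R}^\alpha_L$ admits a direct sketch: each support point $y_j$ with weight $w_j$ contributes, on line $l = (x_l, \theta_l)$, mass $w_j \, \alpha(y_j, L)_l$ at coordinate $t = \langle y_j - x_l, \theta_l \rangle$. The NumPy code below is our own illustration; for concreteness it uses the distance-based softmax splitting map introduced later in Section 5, and the array layout (points as rows, lines as `(sources, directions)` pairs) is an assumption:

```python
import numpy as np

def project_onto_lines(points, weights, sources, directions, delta=1.0):
    """Push a discrete measure sum_j w_j delta_{y_j} through R^alpha_L.

    points:  (n, d) support points, weights: (n,) nonnegative masses.
    sources, directions: (k, d) arrays, directions of unit norm.
    Returns coordinates t (n, k) on each line and masses (n, k).
    """
    diff = points[:, None, :] - sources[None, :, :]            # (n, k, d)
    t = np.einsum('nkd,kd->nk', diff, directions)              # <y - x_l, theta_l>
    # point-to-line distance d(y, L)_l: norm of the component orthogonal to theta_l
    dist = np.linalg.norm(diff - t[..., None] * directions[None], axis=-1)
    # distance-based softmax splitting map (Section 5), delta a tuning scalar
    logits = delta * dist
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)                  # rows lie in the simplex
    mass = weights[:, None] * alpha                            # mass sent to each line
    return t, mass
```

Because each row of `alpha` sums to one, the total mass of the projected measure equals the total mass of the input measure, mirroring the fact that $\mathcal{R}^\alpha_L$ maps probability measures to probability measures on $L$.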
We introduce a novel Radon Transform on Systems of Lines, which is a generalization of the variant in (Tran et al., 2025d).

Definition 4.1 (Radon Transform on Systems of Lines). For $\alpha \in C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$, the operator

$$\mathcal{R}^\alpha : L^1(\mathbb{R}^d) \to \prod_{L \in \mathbb{L}^d_k} L^1(L), \quad (17)$$
$$f \mapsto (\mathcal{R}^\alpha_L f)_{L \in \mathbb{L}^d_k}, \quad (18)$$

is called the Radon Transform on Systems of Lines.

Remark 3. We use the same name and notation for $\mathcal{R}^\alpha$ as in (Tran et al., 2025d). Figure 1 emphasizes the difference between the old and new Radon Transform on Systems of Lines. The injectivity of $\mathcal{R}^\alpha$ will be discussed in the next part. Surprisingly, the E(d)-invariance of $\alpha$ is a sufficient condition for the injectivity of $\mathcal{R}^\alpha$.

4.2 E(d)-INVARIANCE AND INJECTIVITY OF RADON TRANSFORM

Variants of the Radon Transform usually require the transform to be injective. In the case of $\mathcal{R}^\alpha$, since we introduce a larger class of splitting maps $\alpha$, the proof of injectivity for the previous variant in (Tran et al., 2025d) is not applicable to this new variant. We find that if $\alpha$ is E(d)-invariant, then the injectivity of the Radon Transform holds. We first give the definition of E(d)-invariance of splitting maps.

Definition 4.2. A splitting map $\alpha \in C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$ is said to be E(d)-invariant if

$$\alpha(gy, gL) = \alpha(y, L) \quad (19)$$

for all $(y, L) \in \mathbb{R}^d \times \mathbb{L}^d_k$ and $g \in E(d)$.

A visualization of E(d)-invariant splitting maps is presented in Figure 2. Using this property of splitting maps, we obtain a result on the injectivity of our Radon Transform variant.

Theorem 4.3. For an E(d)-invariant splitting map $\alpha \in C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$, $\mathcal{R}^\alpha$ is injective.

The proof of Theorem 4.3 is presented in Appendix B.3.

Remark 4. Since the action of E(d) on $\mathbb{R}^d$ is transitive, i.e., for $x, y \in \mathbb{R}^d$ there exists $g \in E(d)$ such that $gx = y$, any E(d)-invariant splitting map $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$ must be constant. Thus, imposing invariance on the previous splitting maps in (Tran et al., 2025d) significantly limits the class of maps.
Candidates for E(d)-invariant splitting maps will be presented in the next section.

5 DISTANCE-BASED TREE-SLICED WASSERSTEIN DISTANCE

In this section, we present the novel Distance-based Tree-Sliced Wasserstein (Db-TSW) distance. Let us consider the space $\mathbb{T}$ of all tree systems of $k$ lines and a distribution $\sigma$ on $\mathbb{T}$.

5.1 TREE-SLICED WASSERSTEIN DISTANCE WITH E(d)-INVARIANT SPLITTING MAPS

For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, a tree system $L \in \mathbb{T}$, and an E(d)-invariant splitting map $\alpha \in C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$, the Radon Transform $\mathcal{R}^\alpha_L$ in Definition 4.1 transforms the probability measures $\mu$ and $\nu$ to $\mathcal{R}^\alpha_L \mu$ and $\mathcal{R}^\alpha_L \nu$ in $\mathcal{P}(L)$. Notice that $L$ is a metric space with tree metric $d_L$ (Tran et al., 2025d), so we can compute the Wasserstein distance $W_{d_L,1}(\mathcal{R}^\alpha_L \mu, \mathcal{R}^\alpha_L \nu)$ between $\mathcal{R}^\alpha_L \mu$ and $\mathcal{R}^\alpha_L \nu$ by Equation (4).

Definition 5.1 (Distance-based Tree-Sliced Wasserstein Distance). The Distance-based Tree-Sliced Wasserstein distance between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by

$$\text{Db-TSW}(\mu, \nu) := \int_{\mathbb{T}} W_{d_L,1}\big(\mathcal{R}^\alpha_L \mu, \mathcal{R}^\alpha_L \nu\big) \, d\sigma(L). \quad (20)$$

Remark 5. Note that Db-TSW depends on the space of tree systems $\mathbb{T}$, the distribution $\sigma$ over $\mathbb{T}$, and the E(d)-invariant splitting map $\alpha$. For simplicity, these are omitted from the notation. The choice of $\alpha$ will be discussed in the next part and is the reason for the name Distance-based.

The Monte Carlo method is utilized to estimate the intractable integral in Equation (20) as follows:

$$\widehat{\text{Db-TSW}}(\mu, \nu) = \frac{1}{L} \sum_{i=1}^{L} W_{d_{L_i},1}\big(\mathcal{R}^\alpha_{L_i} \mu, \mathcal{R}^\alpha_{L_i} \nu\big), \quad (21)$$

where $L_1, \ldots, L_L$ are drawn independently from the distribution $\sigma$ on $\mathbb{T}$. The Db-TSW distance is, indeed, a metric on $\mathcal{P}(\mathbb{R}^d)$. Moreover, Db-TSW is E(d)-invariant.

Theorem 5.2. Db-TSW is an E(d)-invariant metric on $\mathcal{P}(\mathbb{R}^d)$.

The proof of Theorem 5.2 is presented in Appendix B.4. We discuss the computation of Db-TSW in detail in the next part.

5.2 COMPUTING DB-TSW

We construct the space of tree systems and E(d)-invariant splitting maps that are suited for Db-TSW.

Choices for the space of tree systems.
In (Tran et al., 2025d), the space of tree systems $\mathbb{T}$ is the collection of all tree systems whose tree structure is a chain of $k$ nodes (as a graph). We discovered that this approach complicates the implementation of Db-TSW, resulting in a considerable increase in computation time. Here, we propose a method for sampling tree systems that simplifies the implementation in practice. Let $\mathbb{T}$ be the space of all tree systems consisting of $k$ lines with the same source. In detail, $L \in \mathbb{T}$ can be presented as

$$L = \big\{ (x, \theta_1), (x, \theta_2), \ldots, (x, \theta_k) \big\}. \quad (22)$$

In other words, the tree system $L$ consists of $k$ concurrent lines with the same root. The distribution $\sigma$ on $\mathbb{T}$ is simply the joint distribution of $k + 1$ independent distributions, namely $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\mu_1, \mu_2, \ldots, \mu_k \in \mathcal{P}(\mathbb{S}^{d-1})$. In detail, to sample a tree system $L \in \mathbb{T}$, we:

1. Sample $x \sim \mu$, where $\mu$ is a distribution on a bounded subset of $\mathbb{R}^d$, for instance, the uniform distribution on the $d$-dimensional hypercube $[-1, 1]^d$, i.e., $\mathcal{U}([-1, 1]^d)$; and,

2. For $i = 1, \ldots, k$, sample $\theta_i \sim \mu_i$, where $\mu_i$ is a distribution on $\mathbb{S}^{d-1}$, for instance, the uniform distribution $\mathcal{U}(\mathbb{S}^{d-1})$.

We also concentrate on a subcollection of $\mathbb{T}$ that takes into account the angles between pairs of lines. Assuming $k \leq d$, denote by $\mathbb{T}^\perp$ the space of tree systems $L$ consisting of $k$ concurrent lines with the same root whose directions are mutually orthogonal, i.e., $L = \{(x, \theta_1), (x, \theta_2), \ldots, (x, \theta_k)\}$, where $\langle \theta_i, \theta_j \rangle = 0$ for all $1 \leq i < j \leq k$. The intuitive motivation for this choice is to ensure that the sampled tree systems do not include lines with similar directions.

Remark 6. The condition $k \leq d$ arises because $\{\theta_1, \ldots, \theta_k\}$ forms a subset of an orthogonal basis of $\mathbb{R}^d$. In practice, this has little impact, as the dimension $d$ is typically large, while the number of directions $k$ is selected to be small.

Choices for E(d)-invariant splitting maps.
Since the action of E(d) preserves the distance between two points in $\mathbb{R}^d$, it also preserves the distance between a point and a line in $\mathbb{R}^d$. For $x \in \mathbb{R}^d$ and $L \in \mathbb{L}^d_k$, we denote the distance between $x$ and line $l$ of $L$ by $d(x, L)_l$, given by:

$$d(x, L)_l = \inf_{y \in l} \|x - y\|_2. \quad (23)$$

The quantity $d(x, L)_l$ is E(d)-invariant. As a result, a splitting map $\alpha$ of the form

$$\alpha(x, L) = \beta\big( \{d(x, L)_l\}_{l \in L} \big), \quad (24)$$

where $\beta$ is an arbitrary map $\mathbb{R}^k \to \Delta_{k-1}$, is E(d)-invariant. In practice, we empirically observed that choosing $\beta$ to be the softmax function, together with scaling by a scalar, works well in applications. In detail, we choose $\alpha$ as follows:

$$\alpha(x, L) = \mathrm{softmax}\big( \{\delta \cdot d(x, L)_l\}_{l \in L} \big). \quad (25)$$

Here, $\delta \in \mathbb{R}$ is considered to be a tuning parameter.¹ The intuition behind this choice of $\alpha$ is that it reflects the proximity of points to lines in tree systems. As $|\delta|$ grows, the output of $\alpha$ tends to become more sparse, emphasizing the importance of each line in the tree system relative to a specific point. We summarize the computation of Db-TSW in Algorithm 1.

Remark 7. The above construction of $\alpha$ is mainly based on the distance between points and lines in tree systems. That is the reason for the name Distance-based Tree-Sliced Wasserstein.

6 EXPERIMENTAL RESULTS

In this section, we experimentally demonstrate the advantages of our Db-TSW methods over the traditional SW distance and its variants across three key domains: unconditional image synthesis, gradient flows, and color transfer. Detailed experimental settings are provided in Appendix C.1. Our evaluation aims to establish that: (i) Db-TSW and Db-TSW⊥² consistently enhance performance across various generative and optimization tasks; (ii) Db-TSW⊥ significantly outperforms not only the baselines but also Db-TSW in all tasks, highlighting its superiority; and (iii) Db-TSW and Db-TSW⊥ are universally applicable and can be seamlessly integrated into any Optimal Transport task.

¹We abuse the notation $\delta \in \mathbb{R}$ for the splitting map function $\alpha$.
²The superscript ⊥ stands for using orthogonal directions $\theta$.

Table 1: Results of different DDGAN variants for unconditional generation on CIFAR-10.

Model | FID (↓) | Time/Epoch (s) (↓)
DDGAN (Xiao et al., 2022) | 3.64 | 136
SW-DD (Nguyen et al., 2024b) | 2.90 | 140
DSW-DD (Nguyen et al., 2024b) | 2.88 | 1059
EBSW-DD (Nguyen et al., 2024b) | 2.87 | 145
RPSW-DD (Nguyen et al., 2024b) | 2.82 | 159
IWRPSW-DD (Nguyen et al., 2024b) | 2.70 | 152
TSW-SL-DD (Tran et al., 2025d) | 2.83 | 163
Db-TSW-DD (Ours) | 2.60 | 160
Db-TSW⊥-DD (Ours) | 2.525 | 162

We maintain a consistent total number of sampled lines across all methods to ensure a fair comparison. Db-TSW and Db-TSW⊥ offer two important advantages. First, these methods contain richer information compared to SW for the same number of data points, as they incorporate positional information from both points and tree systems. This allows for more effective sampling, specifically by sampling trees located around the support of the target distributions (ideally near their means). Second, the root-concurrent tree system design enables an algorithm with linear runtime and high parallelizability, keeping the wall-clock time of Db-TSW and Db-TSW⊥ approximately equal to that of vanilla SW and surpassing some SW variants.

6.1 DIFFUSION MODELS

This experiment explores the efficacy of denoising diffusion models for unconditional image synthesis. We employ a variant of the Denoising Diffusion Generative Adversarial Network (DDGAN) (Xiao et al., 2022), as introduced by Nguyen et al. (2024b), which incorporates a Wasserstein distance within the Augmented Generalized Mini-batch Energy (AGME) loss function. A detailed explanation of this Optimal Transport-based DDGAN is provided in Appendix C.2. We evaluate our proposed methods, Db-TSW-DD and Db-TSW⊥-DD, against DDGAN and various OT-based DDGAN variants, as enumerated in Table 1. All models undergo training for 1800 epochs on the CIFAR-10 dataset (Krizhevsky et al., 2009).
For vanilla SW and its variants, we adopt the parameters from Nguyen et al. (2024b) and set L = 10000. For the Db-TSW-DD and Db-TSW⊥-DD models, we set L = 2500, k = 4, and δ = 10. Tree sampling is conducted from a Gaussian distribution $N(m_t, \sigma^2 I)$, where $m_t$ represents the mean of all training samples and $\sigma = 0.1$. Table 1 presents the Fréchet Inception Distance (FID) scores and per-epoch training times on an Nvidia V100 GPU for our methods and the baselines. Lower FID scores indicate superior model performance. The results demonstrate that both Db-TSW-DD and Db-TSW⊥-DD achieve notable improvements in FID compared to all baselines. Specifically, they outperform the current state-of-the-art OT-based DDGAN, IWRPSW-DD (Nguyen et al., 2024b), by margins of 0.1 and 0.175, respectively. Importantly, our methods maintain computational efficiency, with only a modest 6.5% increase in training time compared to the current state-of-the-art.

6.2 GRADIENT FLOWS

The gradient flow task aims to minimize the distance between a source and a target distribution using a gradient descent process. The objective is to iteratively reduce this distance by following the flow

$$\partial_t \mu_t = -\nabla_{\mu_t} D(\mu_t, \nu),$$

with the initial condition $\mu_0 = N(0, 1)$, where $\mu_t$ represents the source distribution at time $t$, $\partial_t \mu_t$ denotes the change in the source distribution over time, and $\nabla_{\mu_t} D(\mu_t, \nu)$ is the gradient of $D$ with respect to $\mu_t$. Here, $D$ refers to our proposed distance metric or one of the SW baselines. We evaluate our approaches, Db-TSW and Db-TSW⊥, against several established techniques, including vanilla SW (Bonneel et al., 2015), Max SW (Deshpande et al., 2019), LCVSW (Nguyen & Ho, 2024), SWGG (Mahey et al., 2023), and TSW-SL (Tran et al., 2025d), on the Swiss Roll and Gaussian 20d datasets. To assess the effectiveness of these methods, we report the average Wasserstein distance between the source and target distributions over 10 runs at iterations 500, 1000, 1500, 2000, and 2500.
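The gradient flow loop above can be sketched as follows. This is our own illustration, not the paper's implementation: it uses the vanilla SW₂ distance as a stand-in for D, uniform-weight empirical measures of equal size (so one-dimensional OT reduces to sorting), and an arbitrary step size.

```python
import numpy as np

def sw2_sq_and_grad(X, Y, n_dirs=30, rng=None):
    """Monte Carlo estimate of SW_2^2(P_X, P_Y) and its gradient w.r.t. X
    (equal-size, uniform-weight empirical measures; 1-D OT by sorting)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    val, grad = 0.0, np.zeros_like(X)
    for _ in range(n_dirs):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        px, py = X @ theta, Y @ theta
        ix, iy = np.argsort(px), np.argsort(py)
        diff = px[ix] - py[iy]                 # optimal 1-D matching
        val += np.mean(diff ** 2)
        g = np.zeros(n)
        g[ix] = 2.0 * diff / n                 # d(mean diff^2)/d(px)
        grad += np.outer(g, theta)             # chain rule back to X
    return val / n_dirs, grad / n_dirs

# gradient flow: move the source particles along the negative gradient
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(300, 2)), rng.normal(size=(300, 2)) + 5.0
lr = 0.3
for _ in range(300):
    _, g = sw2_sq_and_grad(X, Y, n_dirs=30, rng=rng)
    X -= lr * len(X) * g   # rescale by n: the gradient of the mean is O(1/n)
```

Replacing `sw2_sq_and_grad` by a (differentiable) Db-TSW estimate recovers the flow evaluated in Table 2; in practice this is done with automatic differentiation rather than a hand-derived gradient.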
The results in Table 2 show that Db-TSW and Db-TSW⊥ record the best results on almost all steps on the Swiss Roll dataset, with a final-step W2 of 3.78e-8, compared with 1.05e-3 for vanilla SW and 3.57e-5 for SWGG.

Table 2: Average Wasserstein distance between source and target distributions over 10 runs on the Swiss Roll and Gaussian 20d datasets. All methods use 100 projecting directions. Columns report the distance at iterations 500, 1000, 1500, 2000, and 2500, followed by the time per iteration (s).

Swiss Roll:
Methods | 500 | 1000 | 1500 | 2000 | 2500 | Time/Iter (s)
SW | 5.73e-3 | 2.04e-3 | 1.23e-3 | 1.11e-3 | 1.05e-3 | 0.009
Max SW | 2.47e-2 | 1.03e-2 | 6.10e-3 | 4.47e-3 | 3.45e-3 | 2.46
SWGG | 3.84e-2 | 1.53e-2 | 1.02e-2 | 4.49e-3 | 3.57e-5 | 0.011
LCVSW | 7.28e-3 | 1.40e-3 | 1.38e-3 | 1.38e-3 | 1.36e-3 | 0.010
TSW-SL | 9.41e-3 | 2.03e-7 | 9.63e-8 | 4.44e-8 | 3.65e-8 | 0.014
Db-TSW | 5.47e-3 | 8.04e-8 | 5.29e-8 | 3.92e-8 | 3.01e-8 | 0.006
Db-TSW⊥ | 7.55e-3 | 2.65e-7 | 4.90e-8 | 4.18e-8 | 3.78e-8 | 0.009

Gaussian 20d:
Methods | 500 | 1000 | 1500 | 2000 | 2500 | Time/Iter (s)
SW | 18.24 | 16.48 | 14.66 | 12.60 | 10.30 | 0.006
Max SW | 13.24 | 13.71 | 13.46 | 12.71 | 11.83 | 2.38
SWGG | 8.51 | 8.71 | 7.71 | 8.48 | 6.45 | 0.009
LCVSW | 17.26 | 14.70 | 12.00 | 9.04 | 6.15 | 0.009
TSW-SL | 3.13 | 9.67e-3 | 6.81e-3 | 6.15e-3 | 5.71e-3 | 0.010
Db-TSW | 3.27 | 1.12e-2 | 7.21e-3 | 5.63e-3 | 4.60e-3 | 0.006
Db-TSW⊥ | 2.46 | 7.96e-3 | 6.11e-3 | 5.22e-3 | 5.03e-3 | 0.009

A similar trend is observed on the Gaussian 20d dataset, with Db-TSW⊥ and Db-TSW consistently achieving the second-best and best Wasserstein distances, respectively, outperforming the next best method, SWGG, by three orders of magnitude (5.03e-3 and 4.60e-3 vs. 7.34) and vanilla SW by nearly four orders of magnitude (10.30) at the final iteration. Notably, Db-TSW and Db-TSW⊥ maintain competitive computational efficiency, with runtimes equal to the fast LCVSW variant (0.009 seconds per iteration) while boosting performance significantly.

6.3 COLOR TRANSFER

The color transfer task involves transferring color from a source image to a target image.
Considering the source and target color palettes as $X$ and $Y$, we define a curve $Z(t)$ by

$$\dot{Z}(t) = -n \nabla_{Z(t)} \left[ D\left( P_{Z(t)}, P_Y \right) \right],$$

where $P_X$ and $P_Y$ are empirical distributions over $X$ and $Y$, and $D$ is a customizable Wasserstein distance. Here, the curve starts from $Z(0) = X$ and ends at $Y$. We iterate along this curve to perform the transfer between the probability distributions $P_X$ and $P_Y$. We evaluate our method against SW, TSW-SL, Max-SW, and various Quasi-Sliced Wasserstein (QSW) variants (Nguyen et al., 2024a). We follow the experimental settings of (Nguyen et al., 2024a), detailed in Appendix C.4. We set L = 33 and k = 3 for Db-TSW and Db-TSW⊥, and L = 100 for all baselines. In Figure 5 (Appendix), we compare the Wasserstein distances and visualizations of various color transfer methods. Notably, Db-TSW and Db-TSW⊥ achieve near-best performance (0.12 and 0.21, respectively) without the randomization overhead of the top-performing randomized methods RRQDSW and RQDSW (both at 0.08), while significantly outperforming traditional approaches such as SW and Max-SW (both at 9.58).

7 CONCLUSION

The paper presents the Distance-based Tree-Sliced Wasserstein (Db-TSW) distance for comparing measures in Euclidean spaces. Built on the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) framework, Db-TSW enhances the class of splitting maps that are central to TSW-SL. This new class of splitting maps resolves several challenges in TSW-SL, such as the lack of Euclidean invariance, insufficient consideration of positional information, and slower computational performance. To address the theoretical gaps regarding these new maps, we provide a comprehensive theoretical framework with rigorous proofs for the key properties of Db-TSW. Our experiments show that Db-TSW outperforms recent Sliced Wasserstein (SW) variants on a wide range of tasks, such as gradient flows and generative models, including GANs and diffusion models.
In comparison to these SW variants, Db-TSW primarily focuses on refining the integration domain via tree systems and the associated measure projection mechanism through splitting maps, neither of which exists in the SW framework. Therefore, adapting to Db-TSW several techniques used in analyzing SW variants is expected to be beneficial for future research. Additionally, the tree system of TSW provides a more advanced geometric structure than the lines of SW, which is a promising research direction for further investigation, e.g., the concurrent TSW-SL on the sphere (Tran et al., 2025b).

ACKNOWLEDGMENTS

This research/project is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-012-SGIL). This research/project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2023) (A-8002040-00-00, A-8002039-00-00). This research/project is also supported by the NUS Presidential Young Professorship Award (A-0009807-01-00) and the NUS Artificial Intelligence Institute Seed Funding (A-8003062-00-00). We thank the area chairs and anonymous reviewers for their comments. TL acknowledges the support of JSPS KAKENHI Grant number 23K11243, and the Mitsui Knowledge Industry Co., Ltd. grant.

Ethics Statement. Given the nature of the work, we do not foresee any negative societal or ethical impacts of our work.

Reproducibility Statement. Source code for our experiments is provided in the supplementary materials of the paper. The details of our experimental settings and computational infrastructure are given in Section 6 and the Appendix. All datasets used in the paper are published and easily accessible on the Internet.

REFERENCES

David Alvarez-Melis, Tommi Jaakkola, and Stefanie Jegelka. Structured optimal transport. In International Conference on Artificial Intelligence and Statistics, pp. 1771–1780. PMLR, 2018.
David Alvarez-Melis, Stefanie Jegelka, and Tommi S Jaakkola. Towards optimal transport with global invariances. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1870–1879. PMLR, 2019.

Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh-Tan Pham. Spherical sliced-Wasserstein. In The Eleventh International Conference on Learning Representations, 2023a.

Clément Bonet, Laetitia Chapel, Lucas Drumetz, and Nicolas Courty. Hyperbolic sliced-Wasserstein via geodesic and horospherical projections. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 334–370. PMLR, 2023b.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22–45, 2015.

Charlotte Bunne, Laetitia Papaxanthos, Andreas Krause, and Marco Cuturi. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511–6528. PMLR, 2022.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999. PMLR, 2016.

Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648–10656, 2019.

Jiaojiao Fan, Isabel Haasler, Johan Karlsson, and Yongxin Chen. On the complexity of the optimal transport problem with graph-structured cost. In International Conference on Artificial Intelligence and Statistics, pp. 9147–9165. PMLR, 2022.

Allen Hatcher. Algebraic Topology. 2005.

Sigurdur Helgason. The Radon transform on Rn. Integral Geometry and Radon Transforms, pp. 1–62, 2011.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.
Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Xinru Hua, Truyen Nguyen, Tam Le, Jose Blanchet, and Viet Anh Nguyen. Dynamic flows on curved space generated by labeled data. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pp. 3803–3811, 2023.

Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, volume 2, pp. 5, 2003.

Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, and Ruoxi Jia. LAVA: Data valuation without pre-specified learning algorithms. In The Eleventh International Conference on Learning Representations, 2023.

Samuel Kessler, Tam Le, and Vu Nguyen. SAVA: Scalable learning-agnostic data valuation. In The Thirteenth International Conference on Learning Representations, 2025.

Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Hugo Lavenant, Sebastian Claici, Edward Chien, and Justin Solomon. Dynamical optimal transport on discrete surfaces. In SIGGRAPH Asia 2018 Technical Papers, pp. 250. ACM, 2018.

Tam Le and Truyen Nguyen. Entropy partial transport with tree metrics: Theory and practice. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research, pp. 3835–3843. PMLR, 2021.

Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced variants of Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Tam Le, Truyen Nguyen, Dinh Phung, and Viet Anh Nguyen. Sobolev transport: A scalable metric for probability measures with graph metrics.
In International Conference on Artificial Intelligence and Statistics, pp. 9844–9868. PMLR, 2022.

Tam Le, Truyen Nguyen, and Kenji Fukumizu. Scalable unbalanced Sobolev transport for measures on a graph. In International Conference on Artificial Intelligence and Statistics, pp. 8521–8560. PMLR, 2023.

Tam Le, Truyen Nguyen, and Kenji Fukumizu. Generalized Sobolev transport for probability measures on a graph. In Forty-first International Conference on Machine Learning, 2024.

Lang Liu, Soumik Pal, and Zaid Harchaoui. Entropy regularized optimal transport independence criterion. In International Conference on Artificial Intelligence and Statistics, pp. 11247–11279. PMLR, 2022.

Manh Luong, Khai Nguyen, Nhat Ho, Reza Haf, Dinh Phung, and Lizhen Qu. Revisiting deep audio-text retrieval through the lens of transportation. In The Twelfth International Conference on Learning Representations, 2024.

Guillaume Mahey, Laetitia Chapel, Gilles Gasso, Clément Bonet, and Nicolas Courty. Fast optimal transport through sliced generalized Wasserstein geodesics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems, pp. 4541–4551, 2019.

Kimia Nadjahi, Alain Durmus, Pierre E Jacob, Roland Badeau, and Umut Simsekli. Fast approximation of the sliced-Wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34:12411–12424, 2021.

Khai Nguyen and Nhat Ho. Sliced Wasserstein estimation with control variates. In The Twelfth International Conference on Learning Representations, 2024.

Khai Nguyen, Nicola Bariletto, and Nhat Ho. Quasi-Monte Carlo for 3d sliced Wasserstein. In The Twelfth International Conference on Learning Representations, 2024a.
Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced Wasserstein with random-path projecting directions. International Conference on Machine Learning, 2024b.

Tin D Nguyen, Brian L Trippe, and Tamara Broderick. Many processors, little time: MCMC for partitions via optimal transport couplings. In International Conference on Artificial Intelligence and Statistics, pp. 3483–3514. PMLR, 2022.

Trung Nguyen, Quang-Hieu Pham, Tam Le, Tung Pham, Nhat Ho, and Binh-Son Hua. Point-set distances for learning representations of 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10478–10487, 2021a.

Vu Nguyen, Tam Le, Makoto Yamada, and Michael A Osborne. Optimal transport kernels for sequential and parallel neural architecture search. In International Conference on Machine Learning, pp. 8084–8095. PMLR, 2021b.

Sloan Nietert, Ziv Goldfeld, and Rachel Cummings. Outlier-robust optimal transport: Duality, structure, and statistical analysis. In International Conference on Artificial Intelligence and Statistics, pp. 11691–11719. PMLR, 2022.

Jonathan Niles-Weed and Philippe Rigollet. Estimation of Wasserstein distances in the spiked transport model. Bernoulli, 28(4):2663–2688, 2022.

Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Bridging vision and language spaces with assignment prediction. In The Twelfth International Conference on Learning Representations, 2024.

François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Proceedings of the 36th International Conference on Machine Learning, pp. 5072–5081, 2019.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Thong Pham, Shohei Shimizu, Hideitsu Hino, and Tam Le. Scalable counterfactual distribution estimation in multivariate causal models. In Conference on Causal Learning and Reasoning (CLeaR), 2024.
Michael Quellmalz, Robert Beinert, and Gabriele Steidl. Sliced optimal transport on the sphere. Inverse Problems, 39(10):105005, 2023.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446, 2011.

Mahdi Saleh, Shun-Cheng Wu, Luca Cosmo, Nassir Navab, Benjamin Busam, and Federico Tombari. Bending graphs: Hierarchical shape matching using gated optimal transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11757–11767, 2022.

Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.

Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pp. 9323–9332. PMLR, 2021.

Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pp. 6394–6400. IEEE, 2022.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

Hoang Tran, Thieu Vo, Tho Huu, Tan Nguyen, et al. Monomial matrix group equivariant neural functional networks.
Advances in Neural Information Processing Systems, 37:48628–48665, 2025a.

Hoang V Tran, Thanh Chu, Minh-Khoi Nguyen-Nhat, Huyen Trang Pham, Tam Le, and Tan Minh Nguyen. Spherical tree-sliced Wasserstein distance. In The Thirteenth International Conference on Learning Representations, 2025b.

Hoang-Viet Tran, Thieu N Vo, Tho Tran Huu, and Tan Minh Nguyen. A Clifford algebraic approach to E(n)-equivariant high-order graph neural networks. arXiv preprint arXiv:2410.04692, 2024a.

Huy Tran, Yikun Bai, Abihith Kothapalli, Ashkan Shahbazi, Xinran Liu, Rocio P Diaz Martin, and Soheil Kolouri. Stereographic spherical sliced Wasserstein distances. In Forty-first International Conference on Machine Learning, 2024b.

Thanh Tran, Viet-Hoang Tran, Thanh Chu, Trang Pham, Laurent El Ghaoui, Tam Le, and Tan M Nguyen. Tree-sliced Wasserstein distance with nonlinear projection. arXiv preprint arXiv:2505.00968, 2025c.

Viet-Hoang Tran, Thieu N Vo, An Nguyen The, Tho Tran Huu, Minh-Khoi Nguyen-Nhat, Thanh Tran, Duy-Tung Pham, and Tan Minh Nguyen. Equivariant neural functional networks for transformers. arXiv preprint arXiv:2410.04209, 2024c.

Viet-Hoang Tran, Trang Pham, Tho Tran, Minh Khoi Nguyen Nhat, Thanh Chu, Tam Le, and Tan M. Nguyen. Tree-sliced Wasserstein distance: A geometric perspective, 2025d. URL https://arxiv.org/abs/2406.13725.

C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Thieu N Vo, Viet-Hoang Tran, Tho Tran Huu, An Nguyen The, Thanh Tran, Minh-Khoi Nguyen Nhat, Duy-Tung Pham, and Tan Minh Nguyen. Equivariant polynomial functional networks. arXiv preprint arXiv:2410.04213, 2024.

Robin Walters, Jinxi Li, and Rose Yu. Trajectory prediction using equivariant continuous convolution. arXiv preprint arXiv:2010.11344, 2020.

Jie Wang, Rui Gao, and Yao Xie. Two-sample test with kernel projected Wasserstein distance.
In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pp. 8022–8055. PMLR, 2022.

Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pp. 3118–3119, 2019.

Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022.

NOTATION

| Symbol | Meaning |
| $\mathbb{R}^d$ | d-dimensional Euclidean space |
| $\|\cdot\|_2$ | Euclidean norm |
| $\langle \cdot, \cdot \rangle$ | standard dot product |
| $S^{d-1}$ | (d-1)-dimensional hypersphere |
| $\theta$ | unit vector |
| $\sqcup$ | disjoint union |
| $L^1(X)$ | space of Lebesgue integrable functions on X |
| $P(X)$ | space of probability distributions on X |
| $\mu, \nu$ | measures |
| $\delta(\cdot)$ | 1-dimensional Dirac delta function |
| $U(S^{d-1})$ | uniform distribution on $S^{d-1}$ |
| $\sharp$ | pushforward (measure) |
| $C(X, Y)$ | space of continuous maps from X to Y |
| $d(\cdot, \cdot)$ | metric in a metric space |
| $T(d)$ | translation group of order d |
| $O(d)$ | orthogonal group of order d |
| $E(d)$ | Euclidean group of order d |
| $g$ | element of a group |
| $W_p$ | p-Wasserstein distance |
| $SW_p$ | Sliced p-Wasserstein distance |
| $\Gamma$ | (rooted) subtree |
| $e$ | edge in a graph |
| $w_e$ | weight of an edge in a graph |
| $l$ | line, index of a line |
| $L$ | system of lines, tree system |
| $L$ | ground set of a system of lines, tree system |
| $\Omega_L$ | topological space of a system of lines |
| $\mathbb{L}^d_k$ | space of systems of k lines in $\mathbb{R}^d$ |
| $T$ | tree structure in a system of lines |
| $L$ | number of tree systems |
| $k$ | number of lines in a system of lines or a tree system |
| $R$ | original Radon Transform |
| $R^\alpha$ | Radon Transform on Systems of Lines |
| $\Delta_{k-1}$ | (k-1)-dimensional standard simplex |
| $\alpha$ | splitting map |
| $\delta$ | tuning parameter in splitting maps |
| $\mathbb{T}$ | space of tree systems |
| $\mathbb{T}^\perp$ | space of orthogonal tree systems |
| $\sigma$ | distribution on the space of tree systems |

Supplement for Distance-Based Tree-Sliced Wasserstein Distance

Table of Contents

A Background for Tree-Sliced Wasserstein distance on Systems of Lines
  A.1 Tree System
  A.2 Sampling Process of Tree Systems
  A.3 Radon Transform on Systems of Lines
  A.4 Tree-Sliced Wasserstein Distance on Systems of Lines (TSW-SL)
B Theoretical Proofs
  B.1 Proof of Proposition 3.1 and Proposition 3.2
  B.2 $R^\alpha_L f$ is integrable
  B.3 Proof of Theorem 4.3
  B.4 Proof of Theorem 5.2
C Experimental details
  C.1 Algorithm of Db-TSW
  C.2 Denoising Diffusion Models
  C.3 Gradient Flow
  C.4 Color Transfer
  C.5 Computational infrastructure
  C.6 Computation and memory complexity of Db-TSW
  C.7 Runtime and memory analysis
  C.8 Ablation study for L, k and δ

A BACKGROUND FOR TREE-SLICED WASSERSTEIN DISTANCE ON SYSTEMS OF LINES

This section provides background for the Tree-Sliced Wasserstein distance on Systems of Lines. For completeness, we recall essential definitions and theoretical results. Proofs and further details can be found in (Tran et al., 2025d).

A.1 TREE SYSTEM

A line in $\mathbb{R}^d$ is determined by a pair $(x, \theta) \in \mathbb{R}^d \times S^{d-1}$ and is parameterized as $x + t\theta$, where $t \in \mathbb{R}$. A line in $\mathbb{R}^d$ is denoted or indexed by $l = (x_l, \theta_l) \in \mathbb{R}^d \times S^{d-1}$.
Here, $x_l$ and $\theta_l$ are referred to as the source and direction of $l$, respectively. For $k \geq 1$, a system of $k$ lines in $\mathbb{R}^d$ is a set of $k$ lines. We denote $(\mathbb{R}^d \times S^{d-1})^k$ by $\mathbb{L}^d_k$, which represents the space of systems of $k$ lines in $\mathbb{R}^d$, and an element of $\mathbb{L}^d_k$ is typically denoted by $L$. The system $L$ is said to be connected if the points on the lines form a connected subset of $\mathbb{R}^d$. By removing some intersections between lines, we obtain a tree system $L$, where there is a unique path between any two points of $L$. For a quick visualization of tree systems, kindly refer to Figure 3.

Remark 8. The term tree system is used because there is a unique path between any two points, analogous to the definition of trees in graph theory.

By identifying all remaining intersections, together with the notions of disjoint union topology and quotient topology (Hatcher, 2005), we obtain a topology on a tree system as the gluing of copies of $\mathbb{R}$. Analyzing this topology, we find that tree systems are metric spaces with a tree metric.

A.2 SAMPLING PROCESS OF TREE SYSTEMS

The space of tree systems exhibits significant diversity due to the various possible choices for tree structures (as graphs). (Tran et al., 2025d) outlines a comprehensive approach applicable to general tree structures, and focuses on the implementation of chain-like tree structures. The process of sampling tree systems based on this chain-like structure is as follows:

Step 1. Sample $x_1 \sim \mu_1$ and $\theta_1 \sim \nu_1$ for $\nu_1 \in P(S^{d-1})$ and $\mu_1 \in P(\mathbb{R}^d)$.

Step i. Sample $x_i = x_{i-1} + t_i \cdot \theta_{i-1}$, where $t_i \sim \mu_i$ and $\theta_i \sim \nu_i$ for $\mu_i \in P(\mathbb{R})$ and $\nu_i \in P(S^{d-1})$.

We assume all $\mu$'s and $\nu$'s are independent, and let:

1. $\mu_1$ be a distribution on a bounded subset of $\mathbb{R}^d$, for instance, the uniform distribution on the d-dimensional cube $[-1, 1]^d$, i.e. $U([-1, 1]^d)$;

2. $\mu_i$ for $i > 1$ be a distribution on a bounded subset of $\mathbb{R}$, for example, the uniform distribution on the interval $[-1, 1]$, i.e. $U([-1, 1])$;
3. $\nu_i$ be a distribution on $S^{d-1}$, for example, the uniform distribution $U(S^{d-1})$.

We derive a distribution, denoted by $\sigma$, over the space of all tree systems that can be sampled in this way, denoted by $\mathbb{T}$. The tree system shown in Figure 3 is an example of a tree system with a chain-like structure.

A.3 RADON TRANSFORM ON SYSTEMS OF LINES

Denote $L^1(\mathbb{R}^d)$ as the space of Lebesgue integrable functions on $\mathbb{R}^d$ with norm $\|\cdot\|_1$. Let $L \in \mathbb{L}^d_k$ be a system of $k$ lines. A Lebesgue integrable function on $L$ is a function $f : L \to \mathbb{R}$ such that:

$$\sum_{l \in L} \int_{\mathbb{R}} |f(t_x, l)| \, dt_x < \infty. \quad (26)$$

Denote $L^1(L)$ as the space of Lebesgue integrable functions on $L$. Denote $C(\mathbb{R}^d, \Delta_{k-1})$ as the space of continuous maps from $\mathbb{R}^d$ to the $(k-1)$-dimensional standard simplex $\Delta_{k-1} = \{(a_l)_{l \in L} : a_l \geq 0 \text{ and } \sum_{l \in L} a_l = 1\} \subset \mathbb{R}^k$. A map in $C(\mathbb{R}^d, \Delta_{k-1})$ is called a splitting map.

Let $L \in \mathbb{L}^d_k$ and $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$. We define a linear operator $R^\alpha_L$ that transforms a function in $L^1(\mathbb{R}^d)$ to a function in $L^1(L)$. For $f \in L^1(\mathbb{R}^d)$, define:

$$R^\alpha_L f : L \to \mathbb{R}, \qquad (x, l) \mapsto \int_{\mathbb{R}^d} f(y) \, \alpha(y)_l \, \delta\left( t_x - \langle y - x_l, \theta_l \rangle \right) dy. \quad (27)$$

The (old) Radon Transform for Systems of Lines $R^\alpha$ in (Tran et al., 2025d) is defined as follows:

$$R^\alpha : L^1(\mathbb{R}^d) \to \prod_{L \in \mathbb{L}^d_k} L^1(L), \qquad f \mapsto \left( R^\alpha_L f \right)_{L \in \mathbb{L}^d_k}.$$

As shown in (Tran et al., 2025d), $R^\alpha$ is injective for all splitting maps $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$.

A.4 TREE-SLICED WASSERSTEIN DISTANCE ON SYSTEMS OF LINES (TSW-SL)

Given the space of tree systems $\mathbb{T}$, a distribution $\sigma$ on $\mathbb{T}$, and $\alpha \in C(\mathbb{R}^d, \Delta_{k-1})$, the Tree-Sliced Wasserstein distance on Systems of Lines (Tran et al., 2025d) between $\mu, \nu \in P(\mathbb{R}^d)$ is defined by

$$\text{TSW-SL}(\mu, \nu) = \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu, R^\alpha_L \nu \right) d\sigma(L). \quad (28)$$

Figure 3: An illustration of constructing a tree system: starting with a collection of lines with no structure (left), we consider the intersections of all pairs of lines (middle), then remove some of the intersections to obtain a tree system (right). There exists a unique path between any two points, since passing is only allowed through the remaining intersections.

TSW-SL is indeed a metric on $P(\mathbb{R}^d)$.
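Before turning to the proofs, the chain-like sampling process of Appendix A.2 and the role of a splitting map can be sketched concretely. In the sketch below, `sample_chain_tree_system` follows the sampling steps above, while `splitting_map` is a hypothetical stand-in for a distance-based splitting map (a softmax of negative point-to-line distances with temperature `delta`; the paper's concrete choice is defined in the main text):

```python
import numpy as np

def sample_chain_tree_system(rng, d, k):
    """Step 1 / Step i of the chain-like sampling process: the source of each
    new line lies on the previous line, so the system glues into a tree."""
    lines = []
    x = rng.uniform(-1.0, 1.0, size=d)          # x_1 ~ U([-1, 1]^d)
    for _ in range(k):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # theta_i ~ U(S^{d-1})
        lines.append((x.copy(), theta))
        t = rng.uniform(-1.0, 1.0)              # t_{i+1} ~ U([-1, 1])
        x = x + t * theta                       # x_{i+1} = x_i + t_{i+1} * theta_i
    return lines

def splitting_map(y, lines, delta=1.0):
    """Hypothetical distance-based splitting map: softmax of negative
    point-to-line distances, giving a point alpha(y, L) on the (k-1)-simplex."""
    dists = []
    for xl, th in lines:
        r = y - xl
        dists.append(np.linalg.norm(r - (r @ th) * th))  # orthogonal distance to line
    a = np.exp(-delta * np.asarray(dists))
    return a / a.sum()

rng = np.random.default_rng(0)
d, k = 2, 4
L = sample_chain_tree_system(rng, d, k)
alpha = splitting_map(rng.normal(size=d), L)    # mass shares of y over the k lines
```

Because `splitting_map` outputs a point on the simplex, projecting a discrete measure onto the tree system simply splits the mass of each support point across the k lines according to `alpha`.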
The proof in (Tran et al., 2025d) primarily relies on the injectivity of the (old) Radon Transform on Systems of Lines $R^\alpha$.

B THEORETICAL PROOFS

B.1 PROOF OF PROPOSITION 3.1 AND PROPOSITION 3.2

Invariance and Equivariance Properties in Machine Learning. Equivariant networks (Cohen & Welling, 2016) enhance generalization and improve sample efficiency by embedding task symmetries into the model architecture. They have shown considerable success in a range of domains such as trajectory prediction (Walters et al., 2020), robotics (Simeonov et al., 2022), graph-based models (Satorras et al., 2021; Tran et al., 2024a), and functional networks (Tran et al., 2025a; 2024c; Vo et al., 2024). Utilizing equivariance has been found to boost performance, increase data efficiency, and enhance robustness against out-of-domain generalization.

Proof. For the 2-Wasserstein distance, recall that, for $\mu, \nu \in P(\mathbb{R}^d)$, we have

$$W_2(\mu, \nu) = \inf_{\pi \in P(\mu, \nu)} \left( \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, d\pi(x, y) \right)^{\frac{1}{2}}. \quad (29)$$

We need to show that, for $g \in E(d)$,

$$W_2(\mu, \nu) = W_2(g\sharp\mu, g\sharp\nu). \quad (30)$$

For $g = (Q, a)$, the map $g : \mathbb{R}^d \to \mathbb{R}^d$ is an affine map that preserves the Euclidean norm $\|\cdot\|_2$, with $|\det(Q)| = 1$. By a change of variables, we have

$$\begin{aligned}
W_2(g\sharp\mu, g\sharp\nu) &= \inf_{\pi \in P(g\sharp\mu, g\sharp\nu)} \left( \int \|x - y\|_2^2 \, d\pi(x, y) \right)^{\frac{1}{2}} \\
&= \inf_{\pi \in P(\mu, \nu)} \left( \int \|x - y\|_2^2 \, d(g\sharp\pi)(x, y) \right)^{\frac{1}{2}} \\
&= \inf_{\pi \in P(\mu, \nu)} \left( \int \|gx - gy\|_2^2 \, d\pi(x, y) \right)^{\frac{1}{2}} \\
&= \inf_{\pi \in P(\mu, \nu)} \left( \int \|x - y\|_2^2 \, d\pi(x, y) \right)^{\frac{1}{2}} = W_2(\mu, \nu). \quad (31\text{--}36)
\end{aligned}$$

We finish the proof for the 2-Wasserstein distance. For the Sliced p-Wasserstein distance, we first show that, for $g = (Q, a) \in E(d)$, $\mu, \nu \in P(\mathbb{R}^d)$ and $\theta \in S^{d-1}$,

$$W_p\left( Rf_\mu(\cdot, \theta), Rf_\nu(\cdot, \theta) \right) = W_p\left( Rf_{g\sharp\mu}(\cdot, Q\theta), Rf_{g\sharp\nu}(\cdot, Q\theta) \right). \quad (37)$$

Indeed, we have

$$\begin{aligned}
Rf_{g\sharp\mu}(t, Q\theta) &= \int_{\mathbb{R}^d} f_{g\sharp\mu}(x) \, \delta(t - \langle x, Q\theta \rangle) \, dx \quad (38)\\
&= \int_{\mathbb{R}^d} f_\mu(g^{-1}x) \, \delta(t - \langle x, Q\theta \rangle) \, dx \quad (39)\\
&= \int_{\mathbb{R}^d} f_\mu(x) \, \delta(t - \langle gx, Q\theta \rangle) \, dx \quad (40)\\
&= \int_{\mathbb{R}^d} f_\mu(x) \, \delta(t - \langle Qx + a, Q\theta \rangle) \, dx \quad (41)\\
&= \int_{\mathbb{R}^d} f_\mu(x) \, \delta(t - \langle x + Q^{-1}a, \theta \rangle) \, dx \quad (42)\\
&= \int_{\mathbb{R}^d} f_\mu(x) \, \delta(t - \langle Q^{-1}a, \theta \rangle - \langle x, \theta \rangle) \, dx \quad (43)\\
&= Rf_\mu(t - \langle Q^{-1}a, \theta \rangle, \theta). \quad (44)
\end{aligned}$$
Using the closed form of the one-dimensional Wasserstein distance, we have

$$\begin{aligned}
W_p\left( Rf_{g\sharp\mu}(\cdot, Q\theta), Rf_{g\sharp\nu}(\cdot, Q\theta) \right) &= W_p\left( Rf_\mu(\cdot - \langle Q^{-1}a, \theta \rangle, \theta), Rf_\nu(\cdot - \langle Q^{-1}a, \theta \rangle, \theta) \right) \quad (45)\\
&= W_p\left( Rf_\mu(\cdot, \theta), Rf_\nu(\cdot, \theta) \right), \quad (46)
\end{aligned}$$

where the second line leverages the translation invariance of the ground cost ($L_p$-norm) of $W_p$. So Equation (37) holds. For the rest of the proof, we show that

$$SW_p(\mu, \nu) = SW_p(g\sharp\mu, g\sharp\nu). \quad (47)$$

Indeed, we have

$$\begin{aligned}
SW_p(g\sharp\mu, g\sharp\nu) &= \left( \int_{S^{d-1}} W_p^p\left( Rf_{g\sharp\mu}(\cdot, \theta), Rf_{g\sharp\nu}(\cdot, \theta) \right) d\sigma(\theta) \right)^{\frac{1}{p}} \quad (48)\\
&= \left( \int_{S^{d-1}} W_p^p\left( Rf_{g\sharp\mu}(\cdot, Q\theta), Rf_{g\sharp\nu}(\cdot, Q\theta) \right) d\sigma(Q\theta) \right)^{\frac{1}{p}} \quad (49)\\
&= \left( \int_{S^{d-1}} W_p^p\left( Rf_{g\sharp\mu}(\cdot, Q\theta), Rf_{g\sharp\nu}(\cdot, Q\theta) \right) d\sigma(\theta) \right)^{\frac{1}{p}} \quad (50)\\
&= \left( \int_{S^{d-1}} W_p^p\left( Rf_\mu(\cdot, \theta), Rf_\nu(\cdot, \theta) \right) d\sigma(\theta) \right)^{\frac{1}{p}} = SW_p(\mu, \nu). \quad (51, 52)
\end{aligned}$$

We finish the proof.

B.2 $R^\alpha_L f$ IS INTEGRABLE

Proof. Let $f \in L^1(\mathbb{R}^d)$. We show that $\|R^\alpha_L f\|_L \leq \|f\|_1$. Indeed,

$$\begin{aligned}
\|R^\alpha_L f\|_L &= \sum_{l \in L} \int_{\mathbb{R}} \left| R^\alpha_L f(t_x, l) \right| dt_x \quad (53)\\
&= \sum_{l \in L} \int_{\mathbb{R}} \left| \int_{\mathbb{R}^d} f(y) \, \alpha(y, L)_l \, \delta\left( t_x - \langle y - x_l, \theta_l \rangle \right) dy \right| dt_x \quad (54)\\
&\leq \sum_{l \in L} \int_{\mathbb{R}} \int_{\mathbb{R}^d} |f(y)| \, \alpha(y, L)_l \, \delta\left( t_x - \langle y - x_l, \theta_l \rangle \right) dy \, dt_x \quad (55)\\
&= \sum_{l \in L} \int_{\mathbb{R}^d} |f(y)| \, \alpha(y, L)_l \int_{\mathbb{R}} \delta\left( t_x - \langle y - x_l, \theta_l \rangle \right) dt_x \, dy \quad (56, 57)\\
&= \sum_{l \in L} \int_{\mathbb{R}^d} |f(y)| \, \alpha(y, L)_l \, dy \quad (58)\\
&= \int_{\mathbb{R}^d} |f(y)| \sum_{l \in L} \alpha(y, L)_l \, dy \quad (59)\\
&= \int_{\mathbb{R}^d} |f(y)| \, dy = \|f\|_1 < \infty. \quad (60, 61)
\end{aligned}$$

So we have $R^\alpha_L f \in L^1(L)$. It implies the operator $R^\alpha_L : L^1(\mathbb{R}^d) \to L^1(L)$ is well-defined. Clearly, $R^\alpha_L$ is a linear operator.

B.3 PROOF OF THEOREM 4.3

As we nontrivially expand the class of splitting maps from $C(\mathbb{R}^d, \Delta_{k-1})$ to $C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$, the proof presented in (Tran et al., 2025d) is not applicable to this larger class. By introducing an invariance condition on the splitting maps, we provide a new proof for the injectivity of $R^\alpha$ as follows.

Proof. Recall the notion of the original Radon Transform. Let $R : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times S^{d-1})$ be the operator defined by: for $f \in L^1(\mathbb{R}^d)$, we have

$$Rf(t, \theta) = \int_{\mathbb{R}^d} f(y) \, \delta(t - \langle y, \theta \rangle) \, dy. \quad (62)$$

It is well known that the Radon Transform $R$ is a linear bijection (Helgason, 2011) and its inverse $R^{-1}$ is given by

$$f(x) = R^{-1}(Rf)(x) = \int_{S^{d-1}} \left( Rf(\cdot, \theta) * \eta \right)\left( \langle x, \theta \rangle \right) d\theta, \quad (63, 64)$$

where the convolution kernel $\eta$ satisfies that its Fourier transform is $\hat{\eta}(\omega) = |\omega|$. Back to the problem.
Recall that $\mathbb{L}^d_k$ is the collection of all systems of $k$ lines in $\mathbb{R}^d$:

$$\mathbb{L}^d_k = \left\{ L = \{(x_j, \theta_j)\}_{j=1}^{k} : x_j \in \mathbb{R}^d, \theta_j \in S^{d-1} \right\} = \left( \mathbb{R}^d \times S^{d-1} \right)^k. \quad (65)$$

For an index $i$ with $1 \leq i \leq k$ and a direction $\theta \in S^{d-1}$, define

$$\mathbb{L}^d_k(i, \theta) := \left\{ L = \{(x_j, \theta_j)\}_{j=1}^{k} \in \left( \mathbb{R}^d \times S^{d-1} \right)^k : \theta_i = \theta \right\}. \quad (66)$$

In words, $\mathbb{L}^d_k(i, \theta)$ is the subcollection of $\mathbb{L}^d_k$ consisting of all systems of $k$ lines in which the direction of the $i$th line is equal to $\theta$. It is clear that $\mathbb{L}^d_k$ is the disjoint union of all $\mathbb{L}^d_k(i, \theta)$ for $\theta \in S^{d-1}$:

$$\mathbb{L}^d_k = \bigsqcup_{\theta \in S^{d-1}} \mathbb{L}^d_k(i, \theta). \quad (67)$$

We have some observations on these subcollections.

Result 1. Each $g = (Q, a) \in E(d)$ defines a bijection between $\mathbb{L}^d_k(i, \theta)$ and $\mathbb{L}^d_k(i, Q\theta)$. More precisely, the map $\phi_g$, defined by

$$\phi_g : \mathbb{L}^d_k(i, \theta) \to \mathbb{L}^d_k(i, Q\theta), \qquad L = \{(x_j, \theta_j)\}_{j=1}^{k} \mapsto g \cdot L = \{(Qx_j + a, Q\theta_j)\}_{j=1}^{k}, \quad (68, 69)$$

is well-defined and is a bijection. This can be directly verified from the definitions.

Result 2. For all $1 \leq i \leq k$, $y, y' \in \mathbb{R}^d$ and $\theta, \theta' \in S^{d-1}$, we have

$$\int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL = \int_{\mathbb{L}^d_k(i, \theta')} \alpha(y', L)_i \, dL. \quad (70)$$

Note that the integrals are taken over $\mathbb{L}^d_k(i, \theta)$ and $\mathbb{L}^d_k(i, \theta')$ with measures induced from the measure on $\mathbb{L}^d_k$. We show that Equation (70) holds. Note that, for $\theta, \theta' \in S^{d-1}$, there exists an orthogonal transformation $Q \in O(d)$ such that $Q\theta = \theta'$. Let $a = y' - Qy$, and $g = (Q, a) \in E(d)$. By this definition, we have

$$gy = Qy + a = Qy + y' - Qy = y'. \quad (71)$$

From Result 1, we have a corresponding bijection $\phi_g$ from $\mathbb{L}^d_k(i, \theta)$ to $\mathbb{L}^d_k(i, \theta')$. We have

$$\begin{aligned}
\int_{\mathbb{L}^d_k(i, \theta')} \alpha(y', L)_i \, dL &= \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y', g \cdot L)_i \, d(g \cdot L) && \text{(change of variables)} \quad (72)\\
&= \int_{\mathbb{L}^d_k(i, \theta)} \alpha(gy, g \cdot L)_i \, d(g \cdot L) && \text{(since } y' = gy \text{)} \quad (73)\\
&= \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, d(g \cdot L) && \text{(since } \alpha \text{ is } E(d)\text{-invariant)} \quad (74)\\
&= \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL && \text{(since } |\det(Q)| = 1 \text{)} \quad (75)
\end{aligned}$$

We finish the proof of Result 2.

Result 3. From Result 2, for all $1 \leq i \leq k$, we can define a constant $c_i$ such that

$$c_i = \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL \quad (76)$$

for all $y \in \mathbb{R}^d$ and $\theta \in S^{d-1}$. Then

$$c_1 + c_2 + \ldots + c_k = 1. \quad (77)$$

In particular, there exists $1 \leq i \leq k$ such that $c_i$ is non-zero.
To show this, first recall that $\mathbb{L}^d_k$ is the disjoint union of all $\mathbb{L}^d_k(i, \theta)$ for $\theta \in S^{d-1}$:

$$\mathbb{L}^d_k = \bigsqcup_{\theta \in S^{d-1}} \mathbb{L}^d_k(i, \theta), \quad (78)$$

so we have (with the measures suitably normalized)

$$\int_{\mathbb{L}^d_k} \alpha(y, L)_i \, dL = \int_{S^{d-1}} \left( \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL \right) d\theta = \int_{S^{d-1}} c_i \, d\theta = c_i. \quad (79\text{--}81)$$

Hence

$$c_1 + c_2 + \ldots + c_k = \sum_{j=1}^{k} \int_{\mathbb{L}^d_k} \alpha(y, L)_j \, dL = \int_{\mathbb{L}^d_k} \sum_{j=1}^{k} \alpha(y, L)_j \, dL = \int_{\mathbb{L}^d_k} 1 \, dL = 1. \quad (82\text{--}85)$$

We finish the proof of Result 3.

Consider a splitting map $\alpha$ in $C(\mathbb{R}^d \times \mathbb{L}^d_k, \Delta_{k-1})$ that is $E(d)$-invariant. By Result 3, let $1 \leq i \leq k$ be an index such that

$$c_i = \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL \neq 0. \quad (86)$$

Now, for a system of lines $L$ in $\mathbb{L}^d_k$, we denote the $i$th line by $l_{L:i}$ and its source by $x_{L:i}$. For a function $f \in L^1(\mathbb{R}^d)$, define a function $g \in L^1(\mathbb{R} \times S^{d-1})$ as follows:

$$g : \mathbb{R} \times S^{d-1} \to \mathbb{R}, \qquad (t, \theta) \mapsto \int_{\mathbb{L}^d_k(i, \theta)} R^\alpha_L f\left( t - \langle x_{L:i}, \theta \rangle, \, l_{L:i} \right) dL. \quad (87, 88)$$

From the definition of $R^\alpha_L f$,

$$R^\alpha_L f : L \to \mathbb{R}, \qquad (x, l) \mapsto \int_{\mathbb{R}^d} f(y) \, \alpha(y, L)_l \, \delta\left( t_x - \langle y - x_l, \theta_l \rangle \right) dy, \quad (89, 90)$$

we have

$$\begin{aligned}
g(t, \theta) &= \int_{\mathbb{L}^d_k(i, \theta)} R^\alpha_L f\left( t - \langle x_{L:i}, \theta \rangle, \, l_{L:i} \right) dL \quad (91)\\
&= \int_{\mathbb{L}^d_k(i, \theta)} \int_{\mathbb{R}^d} f(y) \, \alpha(y, L)_i \, \delta\left( t - \langle x_{L:i}, \theta \rangle - \langle y - x_{L:i}, \theta \rangle \right) dy \, dL \quad (92, 93)\\
&= \int_{\mathbb{L}^d_k(i, \theta)} \int_{\mathbb{R}^d} f(y) \, \alpha(y, L)_i \, \delta\left( t - \langle y, \theta \rangle \right) dy \, dL \quad (94)\\
&= \int_{\mathbb{R}^d} f(y) \, \delta\left( t - \langle y, \theta \rangle \right) \left( \int_{\mathbb{L}^d_k(i, \theta)} \alpha(y, L)_i \, dL \right) dy \quad (95, 96)\\
&= c_i \int_{\mathbb{R}^d} f(y) \, \delta\left( t - \langle y, \theta \rangle \right) dy = c_i \, Rf(t, \theta). \quad (97, 98)
\end{aligned}$$

Let $f \in \operatorname{Ker} R^\alpha$, which means $R^\alpha_L f = 0$ for all $L \in \mathbb{L}^d_k$. Then $g = 0 \in L^1(\mathbb{R} \times S^{d-1})$, and since $c_i \neq 0$, it implies $Rf = 0 \in L^1(\mathbb{R} \times S^{d-1})$. Additionally, recalling that the Radon Transform $R$ is a bijection, we conclude that $f = 0 \in L^1(\mathbb{R}^d)$. Hence, $R^\alpha$ is injective. The proof is completed.

Remark 9. A formal proof would require Haar measure theory for compact groups, but we simplify this for brevity and relevance to the scope of the paper.

Remark 10. The injectivity still holds if we restrict $\mathbb{L}^d_k$ to a non-empty subset of $\mathbb{L}^d_k$ that is closed under the action of $E(d)$. Concretely, let $A$ be a non-empty subset of $\mathbb{L}^d_k$ satisfying $g \cdot L \in A$ for all $g \in E(d)$ and $L \in A$. Let $f \in L^1(\mathbb{R}^d)$ be such that $R^\alpha_L f = 0$ for all $L \in A$. Using the same argument, we can demonstrate that $f = 0$.
In particular, for $\mathbb{T}$ and $\mathbb{T}^\perp$ introduced in Subsection 5.2, since both $\mathbb{T}$ and $\mathbb{T}^\perp$ are closed under the action of $E(d)$, we see that a function $f \in L^1(\mathbb{R}^d)$ is equal to $0$ if $R^\alpha_L f = 0$ for all $L \in \mathbb{T}$, or for all $L \in \mathbb{T}^\perp$.

B.4 PROOF OF THEOREM 5.2

Proof. We show that

$$\text{Db-TSW}(\mu, \nu) = \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu, R^\alpha_L \nu \right) d\sigma(L) \quad (99)$$

is a metric on $P(\mathbb{R}^d)$.

Positive definiteness. For $\mu, \nu \in P(\mathbb{R}^d)$, it is clear that $\text{Db-TSW}(\mu, \mu) = 0$ and $\text{Db-TSW}(\mu, \nu) \geq 0$. If $\text{Db-TSW}(\mu, \nu) = 0$, then $W_{d_L,1}(R^\alpha_L \mu, R^\alpha_L \nu) = 0$ for all $L \in \mathbb{T}$. Since $W_{d_L,1}$ is a metric on $P(L)$, we have $R^\alpha_L \mu = R^\alpha_L \nu$ for all $L \in \mathbb{T}$. By the remark at the end of Appendix B.3, this implies $\mu = \nu$.

Symmetry. For $\mu, \nu \in P(\mathbb{R}^d)$, we have:

$$\text{Db-TSW}(\mu, \nu) = \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu, R^\alpha_L \nu \right) d\sigma(L) = \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \nu, R^\alpha_L \mu \right) d\sigma(L) = \text{Db-TSW}(\nu, \mu). \quad (100, 101)$$

So $\text{Db-TSW}(\mu, \nu) = \text{Db-TSW}(\nu, \mu)$.

Triangle inequality. For $\mu_1, \mu_2, \mu_3 \in P(\mathbb{R}^d)$, we have:

$$\begin{aligned}
\text{Db-TSW}(\mu_1, \mu_2) + \text{Db-TSW}(\mu_2, \mu_3) &= \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu_1, R^\alpha_L \mu_2 \right) d\sigma(L) + \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu_2, R^\alpha_L \mu_3 \right) d\sigma(L) \quad (102, 103)\\
&= \int_{\mathbb{T}} \left[ W_{d_L,1}\left( R^\alpha_L \mu_1, R^\alpha_L \mu_2 \right) + W_{d_L,1}\left( R^\alpha_L \mu_2, R^\alpha_L \mu_3 \right) \right] d\sigma(L) \quad (104)\\
&\geq \int_{\mathbb{T}} W_{d_L,1}\left( R^\alpha_L \mu_1, R^\alpha_L \mu_3 \right) d\sigma(L) = \text{Db-TSW}(\mu_1, \mu_3). \quad (105, 106)
\end{aligned}$$

So the triangle inequality holds for Db-TSW. In conclusion, Db-TSW is a metric on the space $P(\mathbb{R}^d)$.

Db-TSW is E(d)-invariant. We need to show that Db-TSW is $E(d)$-invariant, i.e., for all $g \in E(d)$,

$$\text{Db-TSW}(\mu, \nu) = \text{Db-TSW}(g\sharp\mu, g\sharp\nu), \quad (107)$$

where $g\sharp\mu, g\sharp\nu$ are the pushforwards of $\mu, \nu$ via the Euclidean transformation $g : \mathbb{R}^d \to \mathbb{R}^d$, respectively. For a tree system $L \in \mathbb{T}$ with $L = \{l_i = (x_i, \theta_i)\}_{i=1}^{k}$, we have $g \cdot L = \{g \cdot l_i = (Qx_i + a, Q\theta_i)\}_{i=1}^{k}$. (108)
For $g = (Q, a)$, noting that $|\det(Q)| = 1$, we have
$$\mathcal{R}^\alpha_{g\mathcal{L}}(g_\sharp\mu)(gx, gl) = \int_{\mathbb{R}^d} (g_\sharp\mu)(y) \, \alpha(y, g\mathcal{L})_{gl} \, \delta\big(t_{gx} - \langle y - x_{gl}, \theta_{gl} \rangle\big) \, dy \qquad (109)$$
$$= \int_{\mathbb{R}^d} \mu(g^{-1}y) \, \alpha(y, g\mathcal{L})_{gl} \, \delta\big(t_x - \langle y - x_{gl}, \theta_{gl} \rangle\big) \, dy \qquad (110)$$
$$= \int_{\mathbb{R}^d} \mu(g^{-1}gy) \, \alpha(gy, g\mathcal{L})_{gl} \, \delta\big(t_x - \langle gy - x_{gl}, \theta_{gl} \rangle\big) \, d(gy) \qquad (111)$$
$$= \int_{\mathbb{R}^d} \mu(y) \, \alpha(y, \mathcal{L})_{l} \, \delta\big(t_x - \langle gy - x_{gl}, \theta_{gl} \rangle\big) \, dy \qquad (112)$$
$$= \int_{\mathbb{R}^d} \mu(y) \, \alpha(y, \mathcal{L})_{l} \, \delta\big(t_x - \langle Qy + a - Qx_l - a, Q\theta_l \rangle\big) \, dy \qquad (113)$$
$$= \int_{\mathbb{R}^d} \mu(y) \, \alpha(y, \mathcal{L})_{l} \, \delta\big(t_x - \langle Qy - Qx_l, Q\theta_l \rangle\big) \, dy \qquad (114)$$
$$= \int_{\mathbb{R}^d} \mu(y) \, \alpha(y, \mathcal{L})_{l} \, \delta\big(t_x - \langle y - x_l, \theta_l \rangle\big) \, dy \qquad (115)$$
$$= \mathcal{R}^\alpha_{\mathcal{L}}\mu(x, l), \qquad (116)$$
where (112) uses the $\mathrm{E}(d)$-invariance of $\alpha$, i.e. $\alpha(gy, g\mathcal{L})_{gl} = \alpha(y, \mathcal{L})_l$. Similarly, we have
$$\mathcal{R}^\alpha_{g\mathcal{L}}(g_\sharp\nu)(gx, gl) = \mathcal{R}^\alpha_{\mathcal{L}}\nu(x, l). \qquad (117)$$
Moreover, $g$ induces an isometric transformation $\mathcal{L} \to g\mathcal{L}$, so
$$\mathrm{W}_{d_{\mathcal{L}},1}(\mathcal{R}^\alpha_{\mathcal{L}}\mu, \mathcal{R}^\alpha_{\mathcal{L}}\nu) = \mathrm{W}_{d_{g\mathcal{L}},1}(\mathcal{R}^\alpha_{g\mathcal{L}} g_\sharp\mu, \mathcal{R}^\alpha_{g\mathcal{L}} g_\sharp\nu). \qquad (118)$$
Finally, we have
$$\text{Db-TSW}(g_\sharp\mu, g_\sharp\nu) = \int_{\mathbb{T}} \mathrm{W}_{d_{\mathcal{L}},1}(\mathcal{R}^\alpha_{\mathcal{L}} g_\sharp\mu, \mathcal{R}^\alpha_{\mathcal{L}} g_\sharp\nu) \, d\sigma(\mathcal{L}) \qquad (119)$$
$$= \int_{\mathbb{T}} \mathrm{W}_{d_{g\mathcal{L}},1}(\mathcal{R}^\alpha_{g\mathcal{L}} g_\sharp\mu, \mathcal{R}^\alpha_{g\mathcal{L}} g_\sharp\nu) \, d\sigma(g\mathcal{L}) \qquad (120)$$
$$= \int_{\mathbb{T}} \mathrm{W}_{d_{\mathcal{L}},1}(\mathcal{R}^\alpha_{\mathcal{L}} \mu, \mathcal{R}^\alpha_{\mathcal{L}} \nu) \, d\sigma(\mathcal{L}) \qquad (121)$$
$$= \text{Db-TSW}(\mu, \nu). \qquad (122)$$
We conclude that Db-TSW is $\mathrm{E}(d)$-invariant.

Remark 11. We omit the almost-surely conditions in the proofs, as they are simple to verify and their inclusion would only introduce unnecessary complexity.

C EXPERIMENTAL DETAILS

C.1 ALGORITHM OF DB-TSW

Algorithm 1 Distance-based Tree-Sliced Wasserstein distance.
Input: Probability measures $\mu$ and $\nu$ in $\mathcal{P}(\mathbb{R}^d)$, number of tree systems $L$, number of lines per tree system $k$, space of tree systems $\mathbb{T}$ or $\mathbb{T}^\perp$, splitting map $\alpha$ with tuning parameter $\delta \in \mathbb{R}$.
for $i = 1$ to $L$ do
  Sample $x \in \mathbb{R}^d$ and $\theta_1, \ldots, \theta_k$ i.i.d. from $\mathcal{U}(\mathbb{S}^{d-1})$.
  if the space of tree systems is $\mathbb{T}^\perp$ then
    Orthonormalize $\theta_1, \ldots, \theta_k$.
  end if
  Construct the tree system $\mathcal{L}_i = \{(x, \theta_1), \ldots, (x, \theta_k)\}$.
  Project $\mu$ and $\nu$ onto $\mathcal{L}_i$ to get $\mathcal{R}^\alpha_{\mathcal{L}_i}\mu$ and $\mathcal{R}^\alpha_{\mathcal{L}_i}\nu$.
end for
Compute $\widehat{\text{Db-TSW}}(\mu, \nu) = (1/L) \sum_{i=1}^{L} \mathrm{W}_{d_{\mathcal{L}_i},1}(\mathcal{R}^\alpha_{\mathcal{L}_i}\mu, \mathcal{R}^\alpha_{\mathcal{L}_i}\nu)$.
Return: $\widehat{\text{Db-TSW}}(\mu, \nu)$.

C.2 DENOISING DIFFUSION MODELS

Diffusion Models. Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have gained significant attention for their ability to generate high-quality data.
In this experiment, we explore how these models work and the improvements introduced by our method. The diffusion process starts with a sample from a distribution $q(x_0)$ and gradually adds Gaussian noise to the data $x_0$ over $T$ steps, described by $q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})$, where $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$ with a predefined variance schedule $\beta_t$. The denoising diffusion model aims to learn the reverse diffusion process so as to reconstruct the original data from noisy observations. Training estimates the parameters $\theta$ of the reverse process, which is defined by $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$, where $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$. The standard training approach uses maximum likelihood by optimizing the evidence lower bound (ELBO) $\mathcal{L} \le \log p_\theta(x_0)$, which minimizes the Kullback-Leibler divergence between the true posterior and the model's approximation of the reverse diffusion process across all time steps:
$$\mathcal{L} = -\sum_{t=1}^{T} \mathbb{E}_{q(x_t)}\big[\mathrm{KL}\big(q(x_{t-1}|x_t)\,\|\,p_\theta(x_{t-1}|x_t)\big)\big] + C,$$
where KL refers to the Kullback-Leibler divergence and $C$ represents a constant term.

Denoising Diffusion GANs. Original diffusion models, while producing high-quality and diverse samples, are limited by their slow sampling process, which hinders their application in real-world scenarios. Denoising diffusion GANs (Xiao et al., 2022) address this limitation by modeling each denoising step with a multimodal conditional GAN, enabling larger denoising steps and thus reducing the total number of steps required to 4, resulting in sampling speeds up to 2000 times faster than traditional diffusion models while maintaining competitive sample quality and diversity. Denoising diffusion GANs introduce an implicit denoising model
$$p_\theta(x_{t-1}|x_t) = \int p_\theta(x_{t-1}|x_t, \epsilon)\, p(\epsilon)\, d\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
parameterized by a generator $G_\theta(x_t, \epsilon)$. Xiao et al. (2022) employ adversarial training to optimize the model parameters $\theta$.
Its loss is defined by
$$\min_\theta \sum_{t=1}^{T} \mathbb{E}_{q(x_t)}\big[D_{\mathrm{adv}}\big(q(x_{t-1}|x_t)\,\|\,p_\theta(x_{t-1}|x_t)\big)\big],$$
where $D_{\mathrm{adv}}$ is the adversarial loss. Nguyen et al. (2024b) replace the adversarial loss with the augmented generalized Mini-batch Energy distance. For two distributions $\mu$ and $\nu$, with a mini-batch size $m \ge 1$, the augmented generalized Mini-batch Energy distance (AGME) using a Sliced Wasserstein (SW) kernel is expressed as
$$\mathrm{AGME}^2_b(\mu, \nu; g) = \mathrm{GME}^2_b(\bar\mu, \bar\nu),$$
where $\bar\mu = f_\sharp\mu$ and $\bar\nu = f_\sharp\nu$ with $f(x) = (x, g(x))$ for a nonlinear function $g \colon \mathbb{R}^d \to \mathbb{R}$. GME is the generalized Mini-batch Energy distance (Salimans et al., 2018), defined as
$$\mathrm{GME}^2_b(\mu, \nu) = 2\,\mathbb{E}[D(P_X, P_Y)] - \mathbb{E}[D(P_X, P_{X'})] - \mathbb{E}[D(P_Y, P_{Y'})],$$
where $X, X' \overset{\text{i.i.d.}}{\sim} \mu^m$ and $Y, Y' \overset{\text{i.i.d.}}{\sim} \nu^m$, with $P_X = \frac{1}{m}\sum_{i=1}^{m} \delta_{x_i}$ for $X = (x_1, \ldots, x_m)$, and $D$ is any valid distance. Here we replace $D$ by TSW variants (including our proposed Db-TSW) and SW variants.

Setting. We use the same architecture and hyperparameters as in Nguyen et al. (2024b) and train our models for 1800 epochs. For all our methods, including Db-TSW-DD and Db-TSW-DD$^\perp$, we use $L = 2500$, $k = 4$, $\delta = 10$. For vanilla SW and SW variants, we follow Nguyen et al. (2024b) and use $L = 10000$.

Figure 4: Logarithm of the Wasserstein distance over 3 runs on the Gaussian 20d dataset (SW, TSW-SL, Db-TSW, Db-TSW$^\perp$, LCVSW, SWGG).

Table 3: Average Wasserstein distance between source and target distributions over 10 runs on the 25 Gaussians dataset. All methods use 100 projecting directions.
Methods        Iter. 500   1000      1500      2000      2500      Time/Iter (s)
SW             1.61e-1     9.52e-2   3.44e-2   2.56e-2   2.20e-2   0.002
Max SW         5.09e-1     2.36e-1   1.33e-1   9.70e-2   8.48e-2   0.144
SWGG           3.10e-1     1.17e-1   3.38e-2   3.58e-3   2.54e-4   0.002
LCVSW          3.38e-1     6.64e-2   3.06e-2   3.06e-2   3.02e-2   0.001
TSW-SL         3.49e-1     9.06e-2   2.96e-2   1.20e-2   3.03e-7   0.002
Db-TSW-DD      3.84e-1     1.13e-1   2.48e-2   2.96e-3   1.00e-7   0.002
Db-TSW-DD⊥     3.82e-1     1.11e-1   2.73e-2   1.45e-3   9.97e-8   0.003

C.3 GRADIENT FLOW

Additional Results on Multi-modal Synthetic Dataset. Table 3 showcases the performance of our methods on the gradient flow task using the 25 Gaussians (multi-modal) dataset. Our approach consistently achieves the smallest Wasserstein distance, demonstrating robust performance across various dataset types, including non-linear, multi-modal, and high-dimensional cases. These results indicate that our methods outperform the baseline models, further highlighting their effectiveness across different challenging scenarios.

Hyperparameters. For Db-TSW and Db-TSW$^\perp$, we use $L = 25$, $k = 4$, $\delta = 10$ in all our experiments. For the SW and SW-variant baselines, we use $L = 100$. The number of supports generated for each distribution in all datasets is 100. We follow Mahey et al. (2023) and set the global learning rate for all baselines to 5e-3 on all datasets. For our methods, we use a global learning rate of 5e-3 for the 25 Gaussians and Swiss Roll datasets and 5e-2 for Gaussian 20d.

Gaussian 20d detailed results. We show the detailed results of Db-TSW and Db-TSW$^\perp$ in Table 4 and visualize them in Figure 4. Figure 4 reveals distinct performance patterns among the methods, with TSW variants showing faster convergence rates. Notably, Db-TSW$^\perp$ achieves the fastest convergence, followed by Db-TSW and TSW-SL, all exhibiting a sharp performance improvement around iteration 1500.
While these methods show some fluctuation mid-process, as indicated by the shaded variance regions, they maintain remarkable stability in the final iterations (iterations 2000-2500). These results are further supported by the quantitative mean and standard deviation values provided in Table 4.

Table 4: Wasserstein distance between source and target distributions over 3 runs on the Gaussian 20d dataset. All methods use 100 projecting directions.

Methods    500            1000           1500             2000             2500
SW         17.57±2.4e-1   15.86±3.1e-1   13.92±3.7e-1     11.70±4.2e-1     9.22±3.8e-1
SWGG       16.53±1.2e-1   16.64±1.4e-1   16.65±1.7e-1     16.63±1.6e-1     16.64±1.5e-1
LCVSW      16.86±4e-1     14.36±3.4e-1   11.68±3.3e-1     8.80±4e-1        5.91±2.3e-1
TSW-SL     12.61±3e-1     5.68±3.8e-1    6.90e-1±8e-2     6.83e-4±2.6e-5   4.22e-4±1.2e-5
Db-TSW     12.46±4.6e-1   5.12±3.9e-1    4.8e-1±3.3e-1    6.08e-4±6.7e-5   3.84e-4±2.2e-5
Db-TSW⊥    12.15±3.5e-1   4.67±3.6e-1    1.22e-1±1.6e-1   5.74e-4±2e-5     3.78e-4±1.4e-5

Figure 5: Comparison of color-transferred images.

C.4 COLOR TRANSFER

Color transfer. Given a source image and a target image, we represent their respective color palettes as matrices $X$ and $Y$ of size $n \times 3$ (with $n$ denoting the number of pixels). The source and target images employed in this analysis are derived from the work of Nguyen et al. (2024a). We first reduce the number of distinct colors in both images to 1000 using K-means clustering. We then trace a curve $Z(t)$ between the empirical color distributions of the source image ($P_X$) and the target image ($P_Y$) through an approximate Euler integration scheme; the curve is initialized at $Z(0) = X$ and terminates at $Z(T) = Y$. Given that the RGB color values are constrained within the range $\{0, \ldots, 255\}$, a rounding procedure is implemented at each Euler step during the final iterations.

Hyperparameters. The total number of iterations for all methodologies is standardized at 2000.
A step size of 1 is used for all baseline methods, while a step size of 17 is adopted for TSW-SL, Db-TSW, and Db-TSW$^\perp$. For the sliced Wasserstein variants, we set $L = 100$, whereas for TSW-SL and our proposed methods we use $L = 33$ and $k = 3$ to ensure a consistent and fair comparison.

Qualitative results. Figure 6 shows the qualitative results of our methods, including Db-TSW-DD and Db-TSW-DD$^\perp$.

Figure 6: Comparison of generated images from the Db-TSW-DD and Db-TSW-DD$^\perp$ methods. (a) Generated images of Db-TSW-DD; (b) generated images of Db-TSW-DD$^\perp$.

C.5 COMPUTATIONAL INFRASTRUCTURE

The gradient flow and color transfer experiments were carried out on a single NVIDIA A100 GPU. Each gradient flow experiment takes roughly 0.5 hours, and each color transfer experiment runs in about 5 minutes. In contrast, the denoising diffusion experiments were executed in parallel on two NVIDIA A100 GPUs, with each run lasting around 81.5 hours.

C.6 COMPUTATION AND MEMORY COMPLEXITY OF DB-TSW

Since GPUs are now standard in machine learning workflows due to their parallel processing capabilities, and Db-TSW is primarily intended for deep learning tasks such as generative modeling and optimal transport problems, which are typically GPU-accelerated, we provide both a complexity analysis and empirical runtime/memory measurements of Db-TSW in GPU settings.

Computation and memory complexity of Db-TSW. Assume the two input measures have $n$ and $m$ supports with $n \ge m$. The computational complexity and memory complexity of Db-TSW are $O(Lknd + Lkn \log n)$ and $O(Lkn + Lkd + nd)$, respectively.
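These complexity terms can be made concrete with a minimal NumPy sketch of the three dominant steps of the estimator: projection, distance-based weight splitting, and sorting. All function and variable names below are our own illustration, not the released PyTorch implementation; as in Algorithm 1, the root is shared by the $k$ lines of each tree, and a temperature $\delta$ would simply scale the distances before the softmax.

```python
import numpy as np

def db_tsw_core_ops(points, roots, directions):
    """Sketch of the three dominant Db-TSW operations (names are ours).
    points: (n, d) supports; roots: (L, d), one root per tree;
    directions: (L, k, d) unit-norm line directions."""
    # 1) Projection, O(Lknd): coordinate of every support on every line,
    #    i.e. <y - x, theta> for root x and direction theta.
    coords = np.einsum('nd,lkd->lkn', points, directions) \
        - np.einsum('ld,lkd->lk', roots, directions)[:, :, None]   # (L, k, n)
    # 2) Distance-based splitting weights, O(Lknd): softmax over the k
    #    lines of each tree of the negative point-to-line distance.
    sq_norm = ((points[None, :, :] - roots[:, None, :]) ** 2).sum(-1)  # (L, n)
    sq_dist = np.maximum(sq_norm[:, None, :] - coords ** 2, 0.0)       # (L, k, n)
    z = -np.sqrt(sq_dist)
    z -= z.max(axis=1, keepdims=True)                # numerical stability
    weights = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # 3) Sorting, O(Lkn log n): order projected coordinates for 1-D OT.
    order = np.argsort(coords, axis=-1)
    return coords, weights, order
```

The intermediate arrays of shape $(L, k, n)$ are exactly the $Lkn$ terms in the memory bound; the $(L, n, d)$-sized difference tensor in step 2 is what the kernel fusion discussed next keeps out of global memory.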
In detail, the Db-TSW algorithm has three most costly operations, as described in Table 5.

Table 5: Complexity analysis of Db-TSW.

Operation                         Description                                  Computation           Memory
Projection                        Matrix multiplication of points and lines    O(Lknd)               O(Lkd + nd)
Distance-based weight splitting   Distance calculation and softmax             O(Lknd)               O(Lkn + Lkd + nd)
Sorting                           Sorting projected coordinates                O(Lkn log n)          O(Lkn)
Total                                                                          O(Lknd + Lkn log n)   O(Lkn + Lkd + nd)

The kernel fusion trick. In the table, the distance-based weight splitting operation involves: (1) computing the distance vector from each point to each line, (2) calculating the norms of these distance vectors, and (3) applying a softmax over all lines in each tree. The first step costs $O(Lknd)$ computation and $O(Lknd)$ memory. Similarly, the second step costs $O(Lknd)$ computation and $O(Lknd)$ memory. Lastly, the final step costs $O(Lkn)$ computation and $O(Lkn)$ memory. Therefore, this operation theoretically costs $O(Lknd)$ in terms of both computation and memory. However, we leverage the automatic kernel fusion provided by torch.compile (see the PyTorch documentation) to fuse these three steps into a single kernel, keeping the distance vectors of size $Lkn \times d$ in shared GPU memory instead of global memory. In the end, we only need to store two input matrices of size $Lk \times d$ (the line matrix) and $n \times d$ (the support matrix), along with one output matrix of size $Lk \times n$ (the splitting weights), which reduces the GPU global memory footprint to $O(Lkn + Lkd + nd)$.

C.7 RUNTIME AND MEMORY ANALYSIS

In this section, we conduct a runtime and memory comparison of Db-TSW with respect to the number of supports and the support dimension on a single NVIDIA A100 GPU. We fix $L = 100$ and $k = 10$ for all settings and vary $n \in \{500, 1000, 5000, 10000, 50000\}$ and $d \in \{10, 50, 100, 500, 1000\}$.

Runtime evolution.
In Figure 7a, the relationship between runtime and the number of supports is predominantly linear, particularly beyond 10000 supports, with subtle nonlinear behavior in the early portions of the curves for higher dimensions. The performance gap between high- and low-dimensional cases widens as the number of supports increases, with $d = 1000$ taking approximately twice the runtime of $d = 500$, suggesting a linear relationship between dimension and computational time.

Memory evolution. In Figure 7b, the memory consumption analysis of Db-TSW reveals several key patterns which align with the theoretical complexity analysis and suggest predictable scaling behavior. First, there is a clear linear relationship between memory usage and the number of samples across all dimensions. Second, in higher-dimensional cases ($d = 100, 500, 1000$), the memory requirements grow proportionally with the dimension, as evidenced by the approximately equal spacing between these curves. Finally, for lower dimensions ($d = 10, 50, 100$), the memory curves nearly overlap due to the dominance of the $Lkn$ term in the memory complexity $O(Lkn + Lkd + nd)$: with $L = 100$ and $k = 10$, the support component ($Lkn$) has a greater impact than the dimensional components ($Lkd$ and $nd$).

Figure 7: Runtime and memory evolution of Db-TSW with respect to different dimensions. (a) Runtime; (b) memory usage.

C.8 ABLATION STUDY FOR L, k AND δ

In this section, we include the ablation study for $L$, $k$, and $\delta$ on the gradient flow task.
We chose the Gaussian 100d dataset. When not chosen to vary, we used the default values: number of projections $n = 100$, learning rate $lr = 0.005$, number of iterations $n_{iter} = 5000$, number of trees $L = 500$, number of lines per tree $k = 2$, and coefficient $\delta = 10$. We fixed all other values while varying $k \in \{2, 4, 16, 64, 128\}$, $L \in \{100, 500, 1000, 1500, 2000, 3000\}$, and $\delta \in \{1, 2, 5, 10, 20, 50, 100\}$.

Ablation study for k. From Figure 8a, the ablation analysis reveals an interesting relationship between the number of lines per tree ($k$) and the algorithm's performance. When fixing the number of trees ($L$), configurations with fewer lines per tree demonstrate faster convergence to lower Wasserstein distances. This behavior is reasonable because increasing $k$ expands the space of tree systems, which in turn requires more samples to attain similar performance levels. For instance, the $k = 2$ and $k = 4$ configurations reach lower Wasserstein distances than $k = 64$ and $k = 128$. Additionally, as demonstrated in the runtime and memory analysis in Section C.7, configurations with a higher number of lines per tree incur greater computational and memory costs while yielding suboptimal performance.

Ablation study for L. From Figure 8b, the convergence analysis demonstrates a clear relationship between the number of trees ($L$) and the algorithm's performance. When fixing the number of lines per tree ($k$), configurations with more trees achieve significantly better convergence, reaching much lower Wasserstein distances. This is expected because increased sampling in the Monte Carlo estimator used in Db-TSW leads to a more accurate approximation of the distance. Specifically, comparing $L = 3000$ and $L = 100$, we observe that $L = 3000$ achieves a final Wasserstein distance of $10^{-5}$, while $L = 100$ only reaches $10^{2}$. However, there is a computation-accuracy trade-off, since more trees involve more computation and memory.
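The benefit of more trees is the usual Monte Carlo effect: Db-TSW averages $L$ i.i.d. per-tree distances, so the estimation error shrinks at roughly a $1/\sqrt{L}$ rate. A toy sketch of this scaling, using plain averaging of i.i.d. draws as a stand-in for the per-tree Wasserstein terms (the numbers here are purely illustrative, not from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_error(num_trees, repeats=200):
    """Mean absolute error of a num_trees-sample Monte Carlo average of
    i.i.d. standard normal draws (true mean 0), averaged over repeats."""
    draws = rng.normal(0.0, 1.0, size=(repeats, num_trees))
    return float(np.abs(draws.mean(axis=1)).mean())

err_small = avg_error(100)    # few trees  -> larger estimation error
err_large = avg_error(3000)   # many trees -> error smaller by ~sqrt(30) ~ 5.5x
```

The same trade-off as in the ablation appears here: the error reduction per additional tree diminishes, while computation and memory grow linearly in $L$.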
Ablation study for δ. From Figure 8c, the analysis of different $\delta$ values indicates variations in the algorithm's performance. Moderate values ($\delta = 2$ to $10$) improve the convergence speed compared to $\delta = 1$, as they effectively accelerate the optimization process. However, when $\delta$ becomes too large ($\delta = 20$ to $100$), the convergence speed decreases.

Figure 8: Ablation study for $k$, $L$, and $\delta$ on the Gaussian 100d dataset. (a) Number of lines $k$; (b) number of trees $L$; (c) $\delta$.
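This trend is consistent with reading $\delta$ as an inverse temperature in the distance-based splitting map. The sketch below uses our assumed form, a softmax of $-\delta$ times the point-to-line distances (the exact normalization in the released code may differ): small $\delta$ spreads mass almost uniformly over the $k$ lines of a tree, while very large $\delta$ pushes nearly all mass onto the closest line, discarding information from the remaining lines.

```python
import numpy as np

def splitting_weights(dists, delta):
    """Softmax of -delta * distance over the k lines of one tree
    (our assumed form of the distance-based splitting map alpha)."""
    z = -delta * np.asarray(dists, dtype=float)
    z -= z.max()                      # numerical stability
    w = np.exp(z)
    return w / w.sum()

d = np.array([0.1, 0.5, 2.0])            # point-to-line distances, k = 3
soft = splitting_weights(d, delta=0.5)   # mass spread across all lines
sharp = splitting_weights(d, delta=100)  # almost all mass on line 0
```

Moderate $\delta$ keeps the weights informative about all distances while still favoring nearby lines, which matches the ablation's finding that intermediate values converge fastest.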