Published as a conference paper at ICLR 2025

SPHERICAL TREE-SLICED WASSERSTEIN DISTANCE

Viet-Hoang Tran, Department of Mathematics, National University of Singapore, hoang.tranviet@u.nus.edu
Thanh T. Chu, Department of Computer Science, National University of Singapore, thanh.chu@u.nus.edu
Khoi N.M. Nguyen, FPT Software AI Center, khoinnm1@fpt.com
Trang Pham, Qualcomm AI Research, tranpham@qti.qualcomm.com
Tam Le, The Institute of Statistical Mathematics & RIKEN AIP, tam@ism.ac.jp
Tan M. Nguyen, Department of Mathematics, National University of Singapore, tanmn@nus.edu.sg

Footnotes: Co-first authors. Co-last authors. Qualcomm Vietnam Company Limited. Correspondence to: hoang.tranviet@u.nus.edu & tanmn@nus.edu.sg

ABSTRACT

Sliced Optimal Transport (OT) simplifies the OT problem in high-dimensional spaces by projecting supports of input measures onto one-dimensional lines and then exploiting the closed-form expression of univariate OT to reduce the computational burden of OT. Recently, the Tree-Sliced method has been introduced to replace these lines with more intricate structures, known as tree systems. This approach enhances the ability to capture the topological information of integration domains in Sliced OT while maintaining low computational cost. Inspired by this approach, in this paper, we present an adaptation of tree systems to OT problems for measures supported on a sphere. As a counterpart to the Radon transform variant on tree systems, we propose a novel spherical Radon transform with a new integration domain called spherical trees. By leveraging this transform and exploiting the structure of spherical trees, we derive closed-form expressions for OT problems on the sphere. Consequently, we obtain an efficient metric for measures on the sphere, named the Spherical Tree-Sliced Wasserstein (STSW) distance. We provide an extensive theoretical analysis to demonstrate the topology of spherical trees and the well-definedness and injectivity of our Radon transform variant, which leads to an orthogonally invariant distance between spherical measures. Finally, we conduct a wide range of numerical experiments, including gradient flows and self-supervised learning, to assess the performance of our proposed metric, comparing it to recent benchmarks. The code is publicly available at https://github.com/lilythchu/STSW.git.

1 INTRODUCTION

Despite being embedded in high-dimensional Euclidean spaces, in practice, data often reside on low-dimensional manifolds (Fefferman et al., 2016). The hypersphere is one such manifold with various practical applications. The range of applications involving distributions on a hypersphere is remarkably broad, underscoring the significance of spherical geometries across multiple fields. These applications encompass spherical statistics (Jammalamadaka, 2001; Mardia & Jupp, 2009; Ley & Verdebout, 2017; Pewsey & García-Portugués, 2021), geophysical data (Di Marzio et al., 2014), cosmology (Jupp, 1995; Cabella & Marinucci, 2009; Perraudin et al., 2019), texture mapping (Elad et al., 2005; Dominitz & Tannenbaum, 2009), magnetoencephalography imaging (Vrba & Robinson, 2001), spherical image representations (Coors et al., 2018; Jiang et al., 2024), omnidirectional images (Khasanova & Frossard, 2017), and deep latent representation learning (Wu et al., 2018; Chen et al., 2020; Wang & Isola, 2020; Grill et al., 2020; Caron et al., 2020; Davidson et al., 2018; Liu et al., 2017; Yi & Liu, 2023).
Optimal Transport (OT) (Villani, 2008; Peyré et al., 2019) is a geometrically natural metric for comparing probability distributions, and it has received significant attention in machine learning in recent years. However, OT faces a significant computational challenge due to its supercubic complexity in the number of supports of the input measures (Peyré et al., 2019). To alleviate this issue, several variants have been developed to reduce the computational burden, including entropic regularization (Cuturi, 2013), minibatch OT (Fatras et al., 2019), low-rank approaches (Forrow et al., 2019; Altschuler et al., 2019; Scetbon et al., 2021), the Sliced Wasserstein distance (Rabin et al., 2011; Bonneel et al., 2015), the Tree-Sliced Wasserstein distance (Indyk & Thaper, 2003; Le et al., 2019; Le & Nguyen, 2021; Tran et al., 2025b;c;d), and Sobolev transport (Le et al., 2022; 2023; 2024).

Related work. There has been growing interest in utilizing OT to compare spherical probability measures (Cui et al., 2019; Hamfeldt & Turnquist, 2022). To mitigate the computational burden, recent studies have focused on sliced spherical OT (Quellmalz et al., 2023; Bonet et al., 2022; Tran et al., 2024b). Quellmalz et al. (2023) introduced the vertical slice transform and a normalized version of the semicircle transform to define sliced OT on the sphere. The semicircle transform was also employed in (Bonet et al., 2022) to define a spherical sliced Wasserstein distance. Meanwhile, Tran et al. (2024b) utilized stereographic projection to create a spherical distance between measures via univariate OT problems. However, projecting spherical measures onto a line or circle poses challenges due to the loss of topological information. Furthermore, comparing one-dimensional measures on circles is computationally more expensive, as it requires an additional binary search. Notably, Tran et al. (2025b;d) offer an alternative method by substituting the one-dimensional lines in the Sliced Wasserstein framework with more complex domains, referred to as tree systems. These systems operate similarly to lines but have a more advanced and intricate structure. This approach is expected to enhance the capture of topological information while preserving the computational efficiency of one-dimensional OT problems. Inspired by this observation, we propose an adaptation of tree systems to the hypersphere, called spherical trees, to develop a new metric for measures on the hypersphere. Spherical trees satisfy two important criteria: (i) spherical measures can be projected onto spherical trees in a meaningful manner, and (ii) OT problems on spherical trees admit a closed-form expression for fast computation.

Contribution. Our contributions are three-fold:

1. We provide a comprehensive theoretical construction of spherical trees on the sphere, analogous to the notion of tree systems. We demonstrate that spherical trees, as topological spaces, are metric spaces endowed with tree metrics, which ensures that OT problems on these spaces can be solved analytically with closed-form solutions.

2. We propose the Spherical Radon Transform on Spherical Trees, which transforms functions on the sphere to functions on spherical trees. We also present the concept of splitting maps for the sphere, a key component of this new Spherical Radon Transform, which describes how the mass at a point is distributed across the spherical tree.
In addition, we examine the orthogonal invariance of splitting maps, which later proves to be a sufficient condition for the injectivity of the Spherical Radon Transform.

3. We propose the novel Spherical Tree-Sliced Wasserstein (STSW) distance for probability distributions on the sphere. By selecting orthogonally invariant splitting maps, we demonstrate that STSW is an invariant metric under orthogonal transformations. Finally, we derive a closed-form approximation for STSW, enabling an efficient and highly parallelizable implementation.

Organization. The rest of the paper is organized as follows: we review Wasserstein distance variants in Section 2. We propose the notion of spherical trees on the sphere with a formal construction in Section 3. We introduce the Spherical Radon Transform on Spherical Trees and discuss its injectivity in Section 4. In Section 5, we propose the Spherical Tree-Sliced Wasserstein (STSW) distance and derive a closed-form approximation for it. Finally, we evaluate STSW on various tasks in Section 6. Theoretical proofs and experimental details are provided in the Appendix.

2 PRELIMINARIES

In this section, we review the Wasserstein distance, the Sliced Wasserstein distance, Wasserstein distances on tree metric spaces, and the Tree-Sliced Wasserstein distance on Systems of Lines.

Wasserstein Distance. Let $\Omega$ be a measurable space endowed with a metric $d$, and let $\mu, \nu$ be two probability distributions on $\Omega$. Denote by $\mathcal{P}(\mu, \nu)$ the set of probability distributions $\pi$ on the product space $\Omega \times \Omega$ such that $\pi(A \times \Omega) = \mu(A)$ and $\pi(\Omega \times B) = \nu(B)$ for all measurable sets $A, B$. For $p \ge 1$, the $p$-Wasserstein distance $\mathrm{W}_p$ (Villani, 2008) between $\mu, \nu$ is defined as:

$$\mathrm{W}_p(\mu, \nu) = \left( \inf_{\pi \in \mathcal{P}(\mu,\nu)} \int_{\Omega \times \Omega} d(x,y)^p \, \mathrm{d}\pi(x,y) \right)^{1/p}. \quad (1)$$

Sliced Wasserstein Distance. The Radon Transform (Helgason, 2011) is the operator $\mathcal{R} \colon L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$ defined by: for $f \in L^1(\mathbb{R}^d)$, we have $\mathcal{R}f \in L^1(\mathbb{R} \times \mathbb{S}^{d-1})$ such that

$$\mathcal{R}f(t, \theta) = \int_{\mathbb{R}^d} f(x) \, \delta(t - \langle x, \theta \rangle) \, \mathrm{d}x.$$

Note that $\mathcal{R}$ is a bijection. The Sliced $p$-Wasserstein (SW) distance (Bonneel et al., 2015) between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by:

$$\mathrm{SW}_p(\mu, \nu) := \left( \int_{\mathbb{S}^{d-1}} \mathrm{W}_p^p\bigl(\mathcal{R}f_\mu(\cdot, \theta), \mathcal{R}f_\nu(\cdot, \theta)\bigr) \, \mathrm{d}\sigma(\theta) \right)^{1/p}, \quad (2)$$

where $\sigma = \mathcal{U}(\mathbb{S}^{d-1})$ is the uniform distribution on $\mathbb{S}^{d-1}$, and $f_\mu, f_\nu$ are the probability density functions of $\mu, \nu$, respectively.

Tree Wasserstein Distances. Let $\mathcal{T}$ be a rooted tree (as a graph) with non-negative edge lengths and ground metric $d_\mathcal{T}$, i.e., the length of the unique path between two nodes. Given two probability distributions $\mu$ and $\nu$ supported on the nodes of $\mathcal{T}$, the Wasserstein distance with ground metric $d_\mathcal{T}$, i.e., the tree-Wasserstein (TW) distance (Le et al., 2019), admits a closed-form expression:

$$\mathrm{W}_{d_\mathcal{T},1}(\mu, \nu) = \sum_{e \in \mathcal{T}} w_e \, \bigl| \mu(\Gamma(v_e)) - \nu(\Gamma(v_e)) \bigr|, \quad (3)$$

where $v_e$ is the endpoint of edge $e$ that is farther from the tree root, $\Gamma(v_e)$ is the subtree of $\mathcal{T}$ rooted at $v_e$, and $w_e$ is the length of $e$.

Tree-Sliced Wasserstein Distances on Systems of Lines. Tree systems (Tran et al., 2025d) were proposed as replacements for the directions in SW. As topological spaces, they are constructed by joining (gluing) multiple copies of $\mathbb{R}$ based on a tree (graph) structure, forming a metric measure space in which optimal transport problems admit closed-form solutions. By developing a variant of the Radon Transform that maps functions on $\mathbb{R}^d$ to functions on tree systems, the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL) is introduced in a manner similar to SW. The aforementioned closed-form expressions lead to a highly parallelizable implementation for TSW-SL. We next extend tree systems to measures on a sphere.
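To make Equation (3) concrete, the following is a minimal Python sketch of the closed-form tree-Wasserstein computation; the parent-pointer encoding and function names are ours, chosen only for illustration, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the closed-form tree-Wasserstein distance in Eq. (3).
# Hypothetical encoding: nodes 0..n-1 with parent[0] = -1 for the root,
# parent[i] < i (topological order), and w[i] the length of edge (parent[i], i).
def tree_wasserstein(parent, w, mu, nu):
    n = len(parent)
    subtree = mu - nu                  # signed mass difference at each node
    for i in range(n - 1, 0, -1):      # accumulate subtree masses bottom-up
        subtree[parent[i]] += subtree[i]
    # Eq. (3): sum over edges e of w_e * |mu(Gamma(v_e)) - nu(Gamma(v_e))|
    return sum(w[i] * abs(subtree[i]) for i in range(1, n))

# Toy example: a path 0 - 1 - 2 with unit edge lengths.
parent = [-1, 0, 1]
w = np.array([0.0, 1.0, 1.0])
mu = np.array([1.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 1.0])
print(tree_wasserstein(parent, w, mu, nu))  # 2.0: move unit mass along 2 edges
```

The bottom-up accumulation works because, in a rooted tree, the mass crossing an edge is exactly the net mass imbalance of the subtree below it.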
3 SPHERICAL TREES ON THE SPHERE

Let $d$ be a positive integer. Recall the notion of the $d$-dimensional sphere in $\mathbb{R}^{d+1}$:

$$\mathbb{S}^d := \bigl\{ x = (x_0, x_1, \ldots, x_d) \in \mathbb{R}^{d+1} : \|x\|_2 = 1 \bigr\} \subset \mathbb{R}^{d+1}.$$

The sphere $\mathbb{S}^d$ is a complete metric space with metric $d_{\mathbb{S}^d}$ defined as $d_{\mathbb{S}^d}(a, b) = \arccos \langle a, b \rangle_{\mathbb{R}^{d+1}}$ for $a, b \in \mathbb{S}^d$, where $\langle \cdot, \cdot \rangle_{\mathbb{R}^{d+1}}$ is the standard dot product in $\mathbb{R}^{d+1}$. For $x \in \mathbb{S}^d$, let $H_x$ be the hyperplane passing through $0 \in \mathbb{R}^{d+1}$ and orthogonal to $x$, i.e., $H_x = \{y \in \mathbb{R}^{d+1} : \langle x, y \rangle = 0\}$. We consider the stereographic projection corresponding to $x$, denoted by $\varphi_x$, which maps $\mathbb{S}^d \setminus \{x\}$ to $H_x$: for $y \in \mathbb{S}^d \setminus \{x\}$, $\varphi_x(y)$ is the unique intersection between the line passing through $x, y$ and the hyperplane $H_x$. Concretely, the formula for $\varphi_x$ is:

$$\varphi_x \colon \mathbb{S}^d \setminus \{x\} \to H_x, \qquad y \mapsto -\frac{\langle x, y \rangle}{1 - \langle x, y \rangle} \, x + \frac{1}{1 - \langle x, y \rangle} \, y. \quad (4)$$

Figure 1: Illustrations of stereographic projection, rays in R^3, and spherical rays in S^2.

It is well known that $\varphi_x$ is a smooth bijection between $\mathbb{S}^d \setminus \{x\}$ and $H_x$. Moreover, it is convenient to extend $\varphi_x$ to a map, also denoted by $\varphi_x$, from $\mathbb{S}^d$ to $H_x \cup \{\infty\}$, with $\varphi_x(x) = \infty$.

Remark. As a topological space, $H_x$ is homeomorphic to $\mathbb{R}^d$, and $H_x \cup \{\infty\}$ is the one-point compactification of $H_x$, which is homeomorphic to $\mathbb{S}^d$. Also, $H_x \cap \mathbb{S}^d$ is homeomorphic to $\mathbb{S}^{d-1}$.

Definition 3.1 (Spherical rays in $\mathbb{R}^{d+1}$). For $y \in \mathbb{S}^d$, the ray in $\mathbb{R}^{d+1}$ with direction $y$ is defined as the set $\{t \cdot y : t > 0\} \cup \{\infty\}$. For $x \in \mathbb{S}^d$ and $y \in \mathbb{S}^d \cap H_x$, the spherical ray with root $x$ and direction $y$, denoted by $r^x_y$, is defined as the preimage of the ray with direction $y$ under $\varphi_x$, i.e.,

$$r^x_y := \varphi_x^{-1}\bigl(\{t \cdot y : t > 0\} \cup \{\infty\}\bigr).$$

An illustration of stereographic projection, rays, and spherical rays is presented in Figure 1. In words, a spherical ray with root $x$ and direction $y$ is the great semicircle on the hypersphere that passes through $y$ and has $x$ as an endpoint. The ray $r^x_y$ is isometric to the closed interval $[0, \pi]$ via $z \mapsto \arccos\langle x, z \rangle$, which gives a parameterization of $r^x_y$ by pairs $(t, r^x_y)$ for $0 \le t \le \pi$. In particular, $\varphi_x^{-1}(0) = -x \mapsto \pi$ and $\varphi_x^{-1}(\infty) = x \mapsto 0$. A short numerical sanity check of $\varphi_x$ is sketched below.
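Before formalizing spherical trees, here is a minimal sketch of the stereographic projection in Equation (4); the function name and test vectors are ours, chosen only for illustration. It checks two properties implied by the definition: the image lies on $H_x$, and points of $\mathbb{S}^d \cap H_x$ are fixed.

```python
import numpy as np

# Minimal sketch of the stereographic projection phi_x of Eq. (4).
def phi(x, y):
    """Project y in S^d \\ {x} onto the hyperplane H_x through the origin."""
    c = np.dot(x, y)                            # <x, y>, must satisfy c != 1
    return (-c / (1.0 - c)) * x + (1.0 / (1.0 - c)) * y

x = np.array([0.0, 0.0, 1.0])                   # projection pole
y = np.array([0.0, np.sin(0.3), np.cos(0.3)])   # a point of S^2 near x
p = phi(x, y)
assert abs(np.dot(p, x)) < 1e-12                # image lies on H_x
e = np.array([0.0, 1.0, 0.0])                   # a point of S^2 on H_x
assert np.allclose(phi(x, e), e)                # such points are fixed
```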
Let $k$ be a positive integer, $x \in \mathbb{S}^d$, and let $y_1, \ldots, y_k \in \mathbb{S}^d \cap H_x$ be $k$ distinct points. We have $k$ distinct spherical rays $r^x_{y_i}$ with root $x$ and direction $y_i$. Consider the following equivalence relation $\sim$ on the disjoint union $\bigsqcup_{i=1,\ldots,k} r^x_{y_i}$: for $(t, r^x_{y_i})$ and $(t', r^x_{y_j})$, we have $(t, r^x_{y_i}) \sim (t', r^x_{y_j})$ if and only if $i = j$ and $t = t'$, or $t = t' = 0$. In other words, we identify the $k$ points with coordinate $0$ (all equal to the root $x$ in $\mathbb{S}^d$) on the $k$ spherical rays $r^x_{y_i}$, $1 \le i \le k$. Denote by $\mathcal{T}^x_{y_1,\ldots,y_k}$ the set of equivalence classes in $\bigsqcup_{i=1,\ldots,k} r^x_{y_i}$ with respect to $\sim$, i.e., $\mathcal{T}^x_{y_1,\ldots,y_k} := \bigl(\bigsqcup_{i=1,\ldots,k} r^x_{y_i}\bigr) / \sim$.

Recall the notions of disjoint union topology and quotient topology in (Hatcher, 2005). For $i = 1, \ldots, k$, consider the injection

$$f_i \colon r^x_{y_i} \hookrightarrow \bigsqcup_{i=1,\ldots,k} r^x_{y_i}, \qquad (t, r^x_{y_i}) \mapsto (t, r^x_{y_i}).$$

The disjoint union $\bigsqcup_{i=1,\ldots,k} r^x_{y_i}$ becomes a topological space with the disjoint union topology, i.e., the finest topology on $\bigsqcup_{i=1,\ldots,k} r^x_{y_i}$ such that every map $f_i$ is continuous. Consider the quotient map induced by the equivalence relation $\sim$,

$$\pi \colon \bigsqcup_{i=1,\ldots,k} r^x_{y_i} \to \mathcal{T}^x_{y_1,\ldots,y_k} = \Bigl(\bigsqcup_{i=1,\ldots,k} r^x_{y_i}\Bigr) / \sim, \qquad (t, r^x_{y_i}) \mapsto [(t, r^x_{y_i})].$$

$\mathcal{T}^x_{y_1,\ldots,y_k}$ becomes a topological space with the quotient topology, i.e., the finest topology on $\mathcal{T}^x_{y_1,\ldots,y_k}$ such that $\pi$ is continuous. In other words, $\mathcal{T}^x_{y_1,\ldots,y_k}$ is formed by gluing the $k$ spherical rays $r^x_{y_i}$ at the points with coordinate $0$ on each spherical ray.

Definition 3.2 (Spherical Trees in $\mathbb{S}^d$). The topological space $\mathcal{T}^x_{y_1,\ldots,y_k}$ is called a spherical tree on $\mathbb{S}^d$. We say that $x$ is the root and $y_1, \ldots, y_k$ are the edges of $\mathcal{T}^x_{y_1,\ldots,y_k}$.

A visualization of the construction of spherical trees is presented in Figure 2a. The number of edges of a spherical tree is usually denoted by $k$. For simplicity, we sometimes omit the root $x$ and the edges $y_1, \ldots, y_k$ and simply denote a spherical tree by $\mathcal{T}$.

Figure 2: (a) An illustration of 5 spherical rays with the same root x, along with the corresponding spherical tree rooted at x. Note that, even though the endpoints of these spherical rays other than x all coincide at −x on the sphere, the spherical tree treats them as five distinct points and identifies only the root x. (b) An illustration of the Radon Transform on Spherical Trees. Consider a point z. The hyperplane passing through z and orthogonal to x cuts the edges of the spherical tree at 5 points. The mass at z under the operator R^α is distributed across these 5 intersections, depending on α.

The collection of all spherical trees with $k$ edges on $\mathbb{S}^d$ is denoted by $\mathbb{T}^d_k$. Since $\mathbb{S}^d \cap H_x$ is homeomorphic to the sphere $\mathbb{S}^{d-1}$, we have a one-to-one correspondence between $\mathbb{T}^d_k$ and the product $\mathbb{S}^d \times (\mathbb{S}^d \cap H_x)^k$ as follows:

$$\mathcal{T}^x_{y_1,\ldots,y_k} \;\longleftrightarrow\; x \in \mathbb{S}^d \text{ and } (y_1, \ldots, y_k) \in (\mathbb{S}^d \cap H_x)^k \cong (\mathbb{S}^{d-1})^k. \quad (5)$$

From this observation, we can define a distribution $\sigma$ on the space of spherical trees $\mathbb{T}^d_k$ as a joint distribution of distributions on $\mathbb{S}^d$ and $\mathbb{S}^{d-1}$. For the rest of the paper, let $\sigma$ be the joint distribution of $(k+1)$ independent distributions, consisting of one uniform distribution on $\mathbb{S}^d$, i.e., $\mathcal{U}(\mathbb{S}^d)$, and $k$ uniform distributions on $\mathbb{S}^{d-1}$, i.e., $\mathcal{U}(\mathbb{S}^{d-1})$.

The topological space $\mathcal{T}$ is metrizable by the metric $d_\mathcal{T}$ defined as follows: for $a = (t, r^x_{y_i})$ and $b = (t', r^x_{y_j})$ in $\mathcal{T}$,

$$d_\mathcal{T}(a, b) = \begin{cases} |t - t'|, & \text{if } i = j, \\ t + t', & \text{if } i \neq j. \end{cases} \quad (6)$$

Moreover, this metric is a tree metric on $\mathcal{T}$. We verify this by showing that, for every pair of points $a, b$ in $\mathcal{T}$, all paths from $a$ to $b$ in $\mathcal{T}$ are homotopic to each other; then $d_\mathcal{T}(a, b)$ is the length of the shortest path from $a$ to $b$ in $\mathcal{T}$. Moreover, we can define a measure on $\mathcal{T}$ induced from the Borel measure on the closed interval $[0, \pi]$. The proofs of these properties are similar to the proofs in (Tran et al., 2025d). We summarize these results in a theorem.

Theorem 3.3 (Spherical trees are metric spaces with tree metric). $\mathcal{T}$ is a metric space with tree metric $d_\mathcal{T}$. The topology on $\mathcal{T}$ induced by $d_\mathcal{T}$ is identical to the topology of $\mathcal{T}$.

With this design, in the next section, we will define Lebesgue integrable functions on spherical trees.

4 SPHERICAL RADON TRANSFORM ON SPHERICAL TREES

In this section, we introduce the Spherical Radon Transform on Spherical Trees and discuss the injectivity of our spherical Radon transform variant.

4.1 A SPHERICAL RADON TRANSFORM VARIANT

We introduce the spaces of Lebesgue integrable functions on spherical trees. First, denote by $L^1(\mathbb{S}^d)$ the space of Lebesgue integrable functions on $\mathbb{S}^d$ with norm $\|\cdot\|_1$:

$$L^1(\mathbb{S}^d) = \Bigl\{ f \colon \mathbb{S}^d \to \mathbb{R} \;:\; \|f\|_1 = \int_{\mathbb{S}^d} |f(x)| \, \mathrm{d}x < \infty \Bigr\}. \quad (7)$$

Two functions $f_1, f_2 \in L^1(\mathbb{S}^d)$ are considered identical if $f_1(x) = f_2(x)$ for almost every $x \in \mathbb{S}^d$. Consider a spherical tree $\mathcal{T}$ with root $x$ and $k$ edges $y_1, \ldots, y_k$. A Lebesgue integrable function on $\mathcal{T}$ is a function $f \colon \mathcal{T} \to \mathbb{R}$ such that

$$\|f\|_\mathcal{T} := \sum_{i=1}^{k} \int_0^\pi |f(t, r^x_{y_i})| \, \mathrm{d}t < \infty.$$

The space of Lebesgue integrable functions on $\mathcal{T}$ is denoted by $L^1(\mathcal{T})$. Two functions $f_1, f_2 \in L^1(\mathcal{T})$ are considered identical if $f_1 = f_2$ almost everywhere on $\mathcal{T}$. The space $L^1(\mathcal{T})$ with norm $\|\cdot\|_\mathcal{T}$ is a Banach space.
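To illustrate Equation (6), here is a small sketch of the tree metric $d_\mathcal{T}$; the coordinate–ray encoding of points is ours, chosen only for illustration.

```python
import math

# Small sketch of the tree metric d_T in Eq. (6). A point of a spherical
# tree is encoded as (t, i): coordinate t in [0, pi] along the i-th ray.
def d_tree(a, b):
    t, i = a
    s, j = b
    # Same ray: distance along the ray; different rays: path through the root.
    return abs(t - s) if i == j else t + s

assert math.isclose(d_tree((0.5, 0), (0.2, 0)), 0.3)
assert math.isclose(d_tree((0.5, 0), (0.2, 1)), 0.7)  # travel via the root
```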
Let

$$\Delta_{k-1} := \Bigl\{ (a_i)_{i=1}^k \;:\; 0 \le a_i \le 1 \text{ and } \sum_{i=1}^k a_i = 1 \Bigr\} \subset \mathbb{R}^k$$

be the $(k-1)$-dimensional standard simplex. Denote by $C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$ the space of continuous maps from $\mathbb{S}^d \times \mathbb{T}^d_k$ to $\Delta_{k-1}$, and call a map in $C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$ a splitting map. Let $\mathcal{T}$ be a spherical tree with root $x$ and $k$ edges $y_1, \ldots, y_k$, and let $\alpha$ be a splitting map in $C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$. We define an operator associated with $\alpha$ that transforms a Lebesgue integrable function on $\mathbb{S}^d$ into a Lebesgue integrable function on $\mathcal{T}$. For $f \in L^1(\mathbb{S}^d)$, define

$$\mathcal{R}^\alpha_\mathcal{T} f \colon \mathcal{T} \to \mathbb{R}, \quad (8)$$
$$(t, r^x_{y_i}) \mapsto \int_{\mathbb{S}^d} f(y) \, \alpha(y, \mathcal{T})_i \, \delta(t - \arccos\langle x, y \rangle) \, \mathrm{d}y, \quad (9)$$

where $\delta$ is the Dirac delta function. We have $\mathcal{R}^\alpha_\mathcal{T} f \in L^1(\mathcal{T})$ for $f \in L^1(\mathbb{S}^d)$, and moreover, $\|\mathcal{R}^\alpha_\mathcal{T} f\|_\mathcal{T} \le \|f\|_1$. The operator $\mathcal{R}^\alpha_\mathcal{T} \colon L^1(\mathbb{S}^d) \to L^1(\mathcal{T})$ is a well-defined linear operator. The proof of these properties can be found in Appendix A.1. An illustration of $\mathcal{R}^\alpha_\mathcal{T}$ is presented in Figure 2b. We next present a novel spherical Radon Transform variant on spherical trees.

Definition 4.1 (Spherical Radon Transform on Spherical Trees). For $\alpha \in C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$, the operator

$$\mathcal{R}^\alpha \colon L^1(\mathbb{S}^d) \to \prod_{\mathcal{T} \in \mathbb{T}^d_k} L^1(\mathcal{T}), \qquad f \mapsto (\mathcal{R}^\alpha_\mathcal{T} f)_{\mathcal{T} \in \mathbb{T}^d_k},$$

is called the Spherical Radon Transform on Spherical Trees.

4.2 INJECTIVITY OF RADON TRANSFORM ON SPHERICAL TREES

We discuss the injectivity of our spherical Radon Transform variant. Consider the Euclidean norm $\|\cdot\|_2$ on $\mathbb{R}^d$.

Orthogonal group O(d) and its actions. The orthogonal group $\mathrm{O}(d)$ is the group of linear transformations of $\mathbb{R}^d$ that preserve the Euclidean norm $\|\cdot\|_2$:

$$\mathrm{O}(d) = \bigl\{ \text{linear transformation } f \colon \mathbb{R}^d \to \mathbb{R}^d \;:\; \|x\|_2 = \|f(x)\|_2 \text{ for all } x \in \mathbb{R}^d \bigr\}. \quad (10)$$

It is well known that $\mathrm{O}(d)$ is isomorphic to the group of orthogonal matrices under multiplication:

$$\mathrm{O}(d) = \bigl\{ Q \in M_{d \times d}(\mathbb{R}) \;:\; Q^\top Q = Q Q^\top = \mathrm{Id}_d \bigr\}. \quad (11)$$

The canonical group action of $\mathrm{O}(d)$ on $\mathbb{R}^d$ is defined by: for $g = Q \in \mathrm{O}(d)$ and $y \in \mathbb{R}^d$, we have $y \mapsto gy = Q \cdot y$. Since this action preserves the norm, the action of $\mathrm{O}(d+1)$ on $\mathbb{R}^{d+1}$ canonically induces an action of $\mathrm{O}(d+1)$ on the sphere $\mathbb{S}^d$. Moreover, the action of $\mathrm{O}(d+1)$ on $\mathbb{R}^{d+1}$ preserves the standard dot product, so the action of $\mathrm{O}(d+1)$ on $\mathbb{S}^d$ preserves the metric $d_{\mathbb{S}^d}$.

Group action of O(d+1) on the space of spherical trees $\mathbb{T}^d_k$. Under $g \in \mathrm{O}(d+1)$, the spherical ray $r^x_y$ transforms to $r^{gx}_{gy}$. It follows that the action of $\mathrm{O}(d+1)$ on $\mathbb{S}^d$ canonically induces an action of $\mathrm{O}(d+1)$ on $\mathbb{T}^d_k$ as

$$\mathcal{T} = \mathcal{T}^x_{y_1,\ldots,y_k} \;\mapsto\; g \cdot \mathcal{T} := \mathcal{T}^{gx}_{gy_1,\ldots,gy_k}. \quad (12)$$

Moreover, each $g \in \mathrm{O}(d+1)$ induces a morphism $\mathcal{T} \to g \cdot \mathcal{T}$ that is an isometry.

O(d+1)-invariant splitting maps. Given a map $f \colon X \to Y$ and a group $G$ acting on $X$, the map $f$ is called $G$-invariant if $f(gx) = f(x)$ for all $g \in G$ and $x \in X$. We have the following definition of $\mathrm{O}(d+1)$-invariance for splitting maps.

Definition 4.2. A splitting map $\alpha$ in $C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$ is said to be $\mathrm{O}(d+1)$-invariant if

$$\alpha(gy, g \cdot \mathcal{T}) = \alpha(y, \mathcal{T}) \quad (13)$$

for all $(y, \mathcal{T}) \in \mathbb{S}^d \times \mathbb{T}^d_k$ and $g \in \mathrm{O}(d+1)$.

With an $\mathrm{O}(d+1)$-invariant splitting map, our spherical Radon Transform variant is injective.

Theorem 4.3. $\mathcal{R}^\alpha$ is injective for an $\mathrm{O}(d+1)$-invariant splitting map $\alpha$.

The proof of Theorem 4.3 is presented in Appendix A.3. Finally, we present a candidate for $\mathrm{O}(d+1)$-invariant splitting maps. Define the map $\beta \colon \mathbb{S}^d \times \mathbb{T}^d_k \to \mathbb{R}^k$ as follows:

$$\beta(y, \mathcal{T}^x_{y_1,\ldots,y_k})_i = \begin{cases} 0, & \text{if } y = x \text{ or } y = -x, \\[4pt] \arccos\Bigl\langle \dfrac{y - \langle x, y \rangle x}{\sqrt{1 - \langle x, y \rangle^2}}, \, y_i \Bigr\rangle \cdot \sqrt{1 - \langle x, y \rangle^2}, & \text{if } y \neq \pm x. \end{cases} \quad (14)$$

Remark. The construction of $\beta$ is explained in Appendix A.2. The map $\beta$ is continuous and $\mathrm{O}(d+1)$-invariant. The derivation of $\beta$ and the proofs of these properties are presented in Appendix A.2. We choose $\alpha \colon \mathbb{S}^d \times \mathbb{T}^d_k \to \Delta_{k-1}$ as follows:

$$\alpha(y, \mathcal{T}) = \mathrm{softmax}\bigl(\{\zeta \cdot \beta(y, \mathcal{T})_i\}_{i=1,\ldots,k}\bigr). \quad (15)$$

Here, $\zeta \in \mathbb{R}$ is treated as a tuning parameter. The intuition behind this choice of $\alpha$ is that it reflects the proximity of points to the rays of the spherical trees. As $|\zeta|$ increases, the resulting value of $\alpha$ tends to become more sparse, emphasizing the importance of each ray relative to a specific point.
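To make Equations (14) and (15) concrete, here is a minimal sketch of the splitting map; the array names are ours, and it assumes the root x and ray directions ys are unit vectors with each ys[i] lying on $H_x$.

```python
import numpy as np

# Minimal sketch of the splitting map of Eqs. (14)-(15); names are ours.
# x: root, shape (d+1,); ys: ray directions on H_x, shape (k, d+1); y: a point.
def beta(y, x, ys, eps=1e-12):
    c = np.dot(x, y)
    s = np.sqrt(max(1.0 - c * c, 0.0))
    if s < eps:                        # y = x or y = -x: beta vanishes,
        return np.zeros(len(ys))       # so the softmax splits mass evenly
    d_vec = (y - c * x) / s            # direction of y "seen" from H_x
    return np.arccos(np.clip(ys @ d_vec, -1.0, 1.0)) * s   # Eq. (14)

def alpha(y, x, ys, zeta=2.0):
    b = zeta * beta(y, x, ys)
    e = np.exp(b - b.max())            # numerically stable softmax, Eq. (15)
    return e / e.sum()
```

The output of `alpha` always lies on the simplex $\Delta_{k-1}$, matching the definition of a splitting map.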
5 SPHERICAL TREE-SLICED WASSERSTEIN DISTANCE

In this section, we propose our novel Spherical Tree-Sliced Wasserstein Distance (STSW). We also derive a closed-form approximation of STSW that allows an efficient implementation.

5.1 SPHERICAL TREE-SLICED WASSERSTEIN DISTANCE

Given two probability distributions $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$, a tree $\mathcal{T} \in \mathbb{T}^d_k$, and an $\mathrm{O}(d+1)$-invariant splitting map $\alpha \in C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$, the Radon Transform $\mathcal{R}^\alpha_\mathcal{T}$ in Definition 4.1 transforms $\mu$ and $\nu$ into two probability distributions $\mathcal{R}^\alpha_\mathcal{T}\mu$ and $\mathcal{R}^\alpha_\mathcal{T}\nu$ in $\mathcal{P}(\mathcal{T})$. Since $\mathcal{T}$ is a metric space with tree metric $d_\mathcal{T}$ (Tran et al., 2025d), we can compute the Wasserstein distance $\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu)$ between $\mathcal{R}^\alpha_\mathcal{T}\mu$ and $\mathcal{R}^\alpha_\mathcal{T}\nu$ by Equation (3).

Definition 5.1 (Spherical Tree-Sliced Wasserstein Distance). The Spherical Tree-Sliced Wasserstein Distance between $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$ is defined by:

$$\mathrm{STSW}(\mu, \nu) := \int_{\mathbb{T}^d_k} \mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu) \, \mathrm{d}\sigma(\mathcal{T}). \quad (16)$$

Remark. Note that the definition of STSW depends on the space $\mathbb{T}^d_k$, the distribution $\sigma$ on $\mathbb{T}^d_k$, and the splitting map $\alpha$ in Equation (15). We omit them to simplify the notation.

The STSW distance is, indeed, a metric on $\mathcal{P}(\mathbb{S}^d)$.

Theorem 5.2. STSW is a metric on $\mathcal{P}(\mathbb{S}^d)$. Moreover, STSW is invariant under orthogonal transformations: for $g \in \mathrm{O}(d+1)$, we have

$$\mathrm{STSW}(\mu, \nu) = \mathrm{STSW}(g_\#\mu, g_\#\nu), \quad (17)$$

where $g_\#\mu, g_\#\nu$ are the pushforwards of $\mu, \nu$ via the orthogonal transformation $g \colon \mathbb{S}^d \to \mathbb{S}^d$, respectively.

The proof of Theorem 5.2 is presented in Appendix A.4.

5.2 COMPUTATION OF STSW

To approximate the intractable integral in Equation (16), we use the Monte Carlo method:

$$\widehat{\mathrm{STSW}}(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} \mathrm{W}_{d_{\mathcal{T}_l},1}(\mathcal{R}^\alpha_{\mathcal{T}_l}\mu, \mathcal{R}^\alpha_{\mathcal{T}_l}\nu),$$

where $\mathcal{T}_1, \ldots, \mathcal{T}_L$ are drawn independently from the distribution $\sigma$ on $\mathbb{T}^d_k$ and are referred to as projecting spherical trees. We present the way to sample $\mathcal{T}_l$ and to compute $\mathrm{W}_{d_{\mathcal{T}_l},1}(\mathcal{R}^\alpha_{\mathcal{T}_l}\mu, \mathcal{R}^\alpha_{\mathcal{T}_l}\nu)$.

Sampling spherical trees. Recall that $\sigma$ is the joint distribution of $k+1$ independent distributions, consisting of one uniform distribution on $\mathbb{S}^d$ and $k$ uniform distributions on $\mathbb{S}^{d-1}$. This comes from the one-to-one correspondence between $\mathbb{T}^d_k$ and $\mathbb{S}^d \times (\mathbb{S}^{d-1})^k$ in Equation (5). In applications, to sample $\mathcal{T} = \mathcal{T}^x_{y_1,\ldots,y_k} \in \mathbb{T}^d_k$ from $\sigma$, we proceed in two steps:

1. Sample $k+1$ points $x, y_1, \ldots, y_k$ in $\mathbb{R}^{d+1}$. Normalize them so that $x, y_1, \ldots, y_k$ lie on $\mathbb{S}^d$.
2. For each $i$, take the intersection of the line passing through $x, y_i$ with $H_x$, i.e., $\varphi_x(y_i)$, then normalize $\varphi_x(y_i)$ to obtain the new $y_i$ lying on $H_x \cap \mathbb{S}^d$.

This results in a sampling process based on the distribution $\sigma$ (a code sketch follows).
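Here is a minimal sketch of this two-step sampler; it reuses the `phi` projection sketched in Section 3, and all names are ours, chosen only for illustration.

```python
import numpy as np

# Minimal sketch of the two-step sampler for a spherical tree from sigma.
# Assumes phi(x, y) implements the stereographic projection of Eq. (4).
def sample_tree(d, k, rng=np.random.default_rng()):
    # Step 1: sample k+1 Gaussian points and normalize them onto S^d.
    x = rng.standard_normal(d + 1)
    x /= np.linalg.norm(x)
    ys = rng.standard_normal((k, d + 1))
    ys /= np.linalg.norm(ys, axis=1, keepdims=True)
    # Step 2: project each y_i onto H_x, then renormalize onto H_x and S^d.
    ys = np.stack([phi(x, y) for y in ys])
    ys /= np.linalg.norm(ys, axis=1, keepdims=True)
    return x, ys
```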
Computing $\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu)$. In applications, we are given discrete distributions $\mu$ and $\nu$ as

$$\mu(x) = \sum_{j=1}^{n} u_j \, \delta(x - a_j) \quad \text{and} \quad \nu(x) = \sum_{j=1}^{n} v_j \, \delta(x - a_j).$$

We can present $\mu$ and $\nu$ with the same supports by combining their supports and allowing some $u_j$ or $v_j$ to be $0$. For a spherical tree $\mathcal{T} = \mathcal{T}^x_{y_1,\ldots,y_k}$, we want to compute $\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu)$. For $1 \le j \le n$, let $c_j = d_{\mathbb{S}^d}(x, a_j)$, and let $c_0 = 0$. By re-indexing, we may assume that $0 = c_0 \le c_1 \le \ldots \le c_n$. By the Radon Transform in Definition 4.1, $\mu, \nu$ transform to $\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu$ supported on $\{(c_j, r^x_{y_i})\}_{1 \le i \le k, 1 \le j \le n}$ in $\mathcal{T}$, with

$$\mathcal{R}^\alpha_\mathcal{T}\mu(c_j, r^x_{y_i}) = \alpha(a_j, \mathcal{T})_i \cdot u_j \quad \text{and} \quad \mathcal{R}^\alpha_\mathcal{T}\nu(c_j, r^x_{y_i}) = \alpha(a_j, \mathcal{T})_i \cdot v_j. \quad (18)$$

By Equation (3), $\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu)$ admits the closed-form expression

$$\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu) = \sum_{i=1}^{k} \sum_{j=1}^{n} (c_j - c_{j-1}) \, \Bigl| \sum_{p=j}^{n} \alpha(a_p, \mathcal{T})_i \, (u_p - v_p) \Bigr|. \quad (19)$$

The detailed derivation of Equation (19) is presented in Appendix A.5. The closed-form expression in Equation (19) leads to a highly parallelizable implementation of the STSW distance (see Algorithm 1 in Appendix B.1 for pseudo-code of the STSW distance).
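Putting the pieces together, the following is a minimal vectorized sketch of Equation (19) and the Monte Carlo estimator; it reuses the `alpha` and `sample_tree` sketches above, and all names are ours, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of Eq. (19) for one tree, reusing alpha() and sample_tree().
def tsw_one_tree(x, ys, a, u, v, zeta=2.0):
    c = np.arccos(np.clip(a @ x, -1.0, 1.0))        # c_j = d_S(x, a_j)
    order = np.argsort(c)                           # re-index so c is sorted
    c, a, u, v = c[order], a[order], u[order], v[order]
    A = np.stack([alpha(ai, x, ys, zeta) for ai in a])   # (n, k) splitting
    M = A * (u - v)[:, None]                        # alpha(a_p)_i (u_p - v_p)
    tails = np.abs(np.cumsum(M[::-1], axis=0)[::-1])     # |sum over p >= j|
    gaps = np.diff(np.concatenate([[0.0], c]))      # c_j - c_{j-1}
    return float((gaps[:, None] * tails).sum())     # Eq. (19)

def stsw(a, u, v, d, k=10, L=200, zeta=2.0):
    trees = [sample_tree(d, k) for _ in range(L)]   # Monte Carlo over sigma
    return np.mean([tsw_one_tree(x, ys, a, u, v, zeta) for x, ys in trees])
```

The suffix sums over `p >= j` are computed once with a reversed cumulative sum, so the cost per tree is O(nk) after sorting, which is what makes the estimator easy to parallelize over trees.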
6 EXPERIMENTAL RESULTS

In this section, we present the results of our four main tasks: Gradient Flow, Self-Supervised Learning, Earth Density Estimation, and Sliced-Wasserstein Auto-Encoder. We provide a detailed evaluation for each task, including quantitative metrics, visualizations, and a comparison with relevant baseline methods. Experimental details can be found in Appendix B.

6.1 GRADIENT FLOW

Our first experiment focuses on learning a target distribution $\nu$ from a source distribution $\mu$ by minimizing $\mathrm{STSW}(\nu, \mu)$. We solve this optimization using projected gradient descent, as discussed in Bonet et al. (2022). We compare the performance of our method against the baselines: SSW (Bonet et al., 2022) and S3W variants (Tran et al., 2024b). Following Tran et al. (2024b), we use a mixture of 12 von Mises-Fisher distributions (vMFs) as our target $\nu$. The training is conducted over 500 epochs with a full batch size, and each experiment is repeated 10 times. We adopt the evaluation metrics from Tran et al. (2024b), which include the log 2-Wasserstein distance, negative log-likelihood (NLL), and training time. As shown in Table 1, STSW outperforms the baselines on all metrics and achieves faster convergence, as illustrated in Figure 10.

Table 1: Learning a target mixture of 12 vMFs. We use NR = 30 rotations for ARI-S3W and an additional learning rate LR = 0.05 for SSW.

| Method | log W2 ↓ | NLL ↓ | Runtime (s) ↓ |
|---|---|---|---|
| SSW (LR=0.01) | −3.21 ± 0.16 | −4980.01 ± 1.89 | 55.20 ± 0.15 |
| SSW (LR=0.05) | −3.36 ± 0.12 | −4976.58 ± 2.23 | 55.31 ± 0.33 |
| S3W | −2.37 ± 0.21 | −4749.67 ± 84.34 | 1.93 ± 0.06 |
| RI-S3W (1) | −3.12 ± 0.18 | −4964.50 ± 27.98 | 2.03 ± 0.12 |
| RI-S3W (5) | −3.47 ± 0.06 | −4984.80 ± 7.32 | 5.68 ± 0.51 |
| ARI-S3W (30) | −4.39 ± 0.19 | −5020.37 ± 6.35 | 20.25 ± 0.15 |
| STSW | −4.69 ± 0.01 | −5041.13 ± 0.84 | 1.89 ± 0.05 |

6.2 SELF-SUPERVISED LEARNING (SSL)

Normalizing feature vectors to the hypersphere has been shown to improve the quality of learned representations and prevent feature collapse (Chen et al., 2020; Wang & Isola, 2020). In previous work, Wang & Isola (2020) identified two properties of contrastive learning: alignment (bringing positive pairs closer) and uniformity (distributing features evenly on the hypersphere). Adopting the approach in Bonet et al. (2022), we propose replacing the Gaussian kernel uniformity loss with STSW, resulting in the following contrastive objective:

$$\mathcal{L} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \bigl\| z^A_i - z^B_i \bigr\|_2^2}_{\text{Alignment loss}} \;+\; \underbrace{\frac{\lambda}{2} \bigl( \mathrm{STSW}(z^A, \nu) + \mathrm{STSW}(z^B, \nu) \bigr)}_{\text{Uniformity loss}}, \quad (20)$$

where $\nu = \mathcal{U}(\mathbb{S}^d)$, $\lambda > 0$ is a regularization factor, and $z^A, z^B \in \mathbb{R}^{n \times (d+1)}$ are the embeddings of two image augmentations mapped onto $\mathbb{S}^d$.
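Here is a minimal sketch of the uniformity term in Equation (20), assuming the `stsw` estimator sketched earlier and approximating the uniform measure by an empirical sample — a simplification we make for illustration; a real training setup would implement the same computation in a differentiable framework (e.g., PyTorch) so the loss can be backpropagated.

```python
import numpy as np

# Sketch of the uniformity loss in Eq. (20): push embeddings z toward
# U(S^d), with the uniform law approximated by an empirical sample.
def uniformity_loss(zA, zB, d, lam=1.0, m=512, rng=np.random.default_rng()):
    ref = rng.standard_normal((m, d + 1))
    ref /= np.linalg.norm(ref, axis=1, keepdims=True)   # ~ U(S^d) sample
    w = np.full(len(zA), 1.0 / len(zA))                 # uniform weights
    wr = np.full(m, 1.0 / m)
    # stsw(a, u, v, ...) compares two weighted point clouds on S^d; here we
    # stack embeddings and reference points as one shared support, letting
    # some weights be zero as described in Section 5.2.
    def term(z):
        a = np.concatenate([z, ref])
        u = np.concatenate([w, np.zeros(m)])
        v = np.concatenate([np.zeros(len(z)), wr])
        return stsw(a, u, v, d)
    return lam / 2.0 * (term(zA) + term(zB))
```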
Similar to Bonet et al. (2022) and Tran et al. (2024b), we train a ResNet18-based (He et al., 2016) encoder on CIFAR-10 (Krizhevsky et al., 2009) with respect to $\mathcal{L}$. After this, we train a linear classifier on the features extracted from the pre-trained encoder. Table 2 demonstrates the improvement of STSW in comparison to the baselines: Hypersphere (Wang & Isola, 2020), SimCLR (Chen et al., 2020), SW, SSW, and S3W variants. We also conduct experiments with $d = 2$ to visualize the learned representations. Figure 12 illustrates that STSW can effectively distribute encoded features around the sphere while keeping similar ones close together.

Table 2: CIFAR-10 linear evaluation accuracy for encoded (E) features and projected (P) features on S^9, along with pretraining time per epoch. ARI-S3W and RI-S3W use 5 rotations.

| Method | Acc. E (%) ↑ | Acc. P (%) ↑ | Time (s/ep.) |
|---|---|---|---|
| Hypersphere | 79.81 | 74.64 | 10.18 |
| SimCLR | 79.97 | 72.80 | 9.34 |
| SW | 74.39 | 67.80 | 9.65 |
| SSW | 70.23 | 64.33 | 10.59 |
| S3W | 78.59 | 73.83 | 10.14 |
| RI-S3W (5) | 79.93 | 73.95 | 10.22 |
| ARI-S3W (5) | 80.08 | 75.12 | 10.19 |
| STSW | 80.53 | 76.78 | 9.54 |

6.3 EARTH DENSITY ESTIMATION

We now demonstrate the application of STSW to density estimation on $\mathbb{S}^2$. The data used in this task were collected by Mathieu & Nickel (2020) and consist of Fire (EOSDIS, 2020), Earthquake (EOSDIS, 2020), and Flood (Brakenridge, 2017). As in (Bonet et al., 2022), we employ an exponential map normalizing flow model (Rezende et al., 2020), which is an invertible transformation $T$, and aim to minimize $\min_T \mathrm{STSW}(T_\#\mu, p_Z)$, where $\mu$ is the empirical distribution and $p_Z$ is a prior distribution on $\mathbb{S}^2$, for which we use the uniform distribution. The density at any point $x \in \mathbb{S}^2$ is then estimated by $f_\mu(x) = p_Z(T(x)) \, |\det J_T(x)|$, where $J_T(x)$ is the Jacobian of $T$ at $x$. Our baselines are exponential map normalizing flows with SW, SSW, and S3W variants, and stereographic projection-based (Dinh et al., 2016) normalizing flows. As seen in Table 3, STSW, even with fewer epochs and shorter training time (10K epochs over 2h10m for STSW versus 20K epochs over 4h30m for ARI-S3W, both on the Fire dataset), still outperforms or is competitive with SSW and the S3W variants.

Table 3: Negative log-likelihood on test data, averaged over 5 runs with different data splits.

| Method | Quake ↓ | Flood ↓ | Fire ↓ |
|---|---|---|---|
| Stereo | 1.94 ± 0.21 | 1.92 ± 0.04 | 1.31 ± 0.12 |
| SW | 0.99 ± 0.05 | 1.47 ± 0.03 | 0.55 ± 0.21 |
| SSW | 0.84 ± 0.06 | 1.26 ± 0.04 | 0.23 ± 0.20 |
| S3W | 0.89 ± 0.08 | 1.35 ± 0.04 | 0.34 ± 0.05 |
| RI-S3W (1) | 0.80 ± 0.07 | 1.25 ± 0.03 | 0.14 ± 0.06 |
| ARI-S3W (50) | 0.77 ± 0.06 | 1.24 ± 0.03 | 0.08 ± 0.05 |
| STSW | 0.68 ± 0.04 | 1.23 ± 0.03 | −0.07 ± 0.05 |

6.4 SLICED-WASSERSTEIN AUTO-ENCODER (SWAE)

We apply STSW to generative modeling using the Sliced-Wasserstein Auto-Encoder (SWAE) (Kolouri et al., 2018) framework, which regularizes the latent space distribution to match a prior distribution $q$. Let $\phi \colon \mathcal{X} \to \mathbb{S}^d$ and $\psi \colon \mathbb{S}^d \to \mathcal{X}$ be the parametric encoder and decoder. The objective of the SWAE is

$$\min_{\phi, \psi} \; \mathbb{E}_{x \sim p}\bigl[c(x, \psi(\phi(x)))\bigr] + \lambda \cdot \mathrm{STSW}(\phi_\# p, q),$$

where $\lambda$ controls the regularization and $p$ is the data distribution. We use SW, SSW (Bonet et al., 2022), and S3W variants (Tran et al., 2024b) as baselines, Binary Cross Entropy (BCE) as the reconstruction loss, and a mixture of 10 vMFs as the prior, similar to Tran et al. (2024b). We provide the results in Table 4.

Table 4: SWAE results evaluated on latent regularization of CIFAR-10 test data.

| Method | log W2 ↓ | NLL ↓ | BCE ↓ | Time (s/ep.) |
|---|---|---|---|---|
| SW | −3.2943 | −0.0014 | 0.6314 | 3.4060 |
| SSW | −2.2234 | 0.0005 | 0.6309 | 8.2386 |
| S3W | −3.3421 | 0.0013 | 0.6329 | 4.5138 |
| RI-S3W (5) | −3.1950 | −0.0039 | 0.6354 | 4.9682 |
| ARI-S3W (5) | −3.3935 | 0.0012 | 0.6330 | 4.7347 |
| STSW | −3.4191 | −0.0051 | 0.6341 | 3.5460 |

We note that STSW achieves the best results in log 2-Wasserstein and NLL with a competitive training time, though its BCE slightly underperforms the others.

7 CONCLUSION

This paper introduces the Spherical Tree-Sliced Wasserstein (STSW) distance, a novel approach leveraging a new integration domain called spherical trees. In contrast to the traditional one-dimensional lines or great semicircles often used in spherical Sliced Wasserstein variants, STSW utilizes spherical trees to better capture the topology of spherical data and provides closed-form solutions for optimal transport problems on spherical trees, leading to improvements in both performance and efficiency. We rigorously develop the theoretical foundation of this method by introducing the spherical Radon Transform on Spherical Trees and validating its key properties, such as injectivity. STSW is derived from the Radon Transform framework, and through a careful construction of the splitting maps, we obtain a closed-form approximation for the distance. Through empirical tasks on spherical data, we demonstrate that STSW significantly outperforms recent spherical Wasserstein variants. Future research could explore spherical trees further, such as developing new sampling processes for spherical trees or adapting Generalized Radon Transforms to enhance STSW.

ACKNOWLEDGMENTS

This research/project is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-012-SGIL). This research/project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2023) (A-8002040-00-00, A-8002039-00-00). This research/project is also supported by the NUS Presidential Young Professorship Award (A-0009807-01-00) and the NUS Artificial Intelligence Institute Seed Funding (A-8003062-00-00). We thank the area chairs and anonymous reviewers for their comments. TL acknowledges the support of JSPS KAKENHI Grant number 23K11243, and the Mitsui Knowledge Industry Co., Ltd. grant.

Ethics Statement. Given the nature of the work, we do not foresee any negative societal and ethical impacts of our work.

Reproducibility Statement. Source codes for our experiments are provided in the supplementary materials of the paper. The details of our experimental settings and computational infrastructure are given in Section 6 and the Appendix. All datasets used in the paper are published and easy to access on the Internet.

REFERENCES

Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In Advances in Neural Information Processing Systems, pp. 4429–4439, 2019.

Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh-Tan Pham. Spherical sliced-Wasserstein. arXiv preprint arXiv:2206.08780, 2022.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22–45, 2015.

G. Brakenridge. Global active archive of large flood events. http://floodobservatory.colorado.edu/Archives/index.html, 2017.

Paolo Cabella and Domenico Marinucci. Statistical challenges in the analysis of cosmic microwave background radiation. 2009.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999. PMLR, 2016.

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 518–533, 2018.

Li Cui, Xin Qi, Chengfeng Wen, Na Lei, Xinyuan Li, Min Zhang, and Xianfeng Gu. Spherical optimal transportation. Computer-Aided Design, 115:181–193, 2019.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.

Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

Marco Di Marzio, Agnese Panzera, and Charles C Taylor. Nonparametric regression for spherical data. Journal of the American Statistical Association, 109(506):748–763, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Ayelet Dominitz and Allen Tannenbaum. Texture mapping via optimal mass transport. IEEE Transactions on Visualization and Computer Graphics, 16(3):419–433, 2009.

Asi Elad, Yosi Keller, and Ron Kimmel. Texture mapping via spherical multi-dimensional scaling. In Scale Space and PDE Methods in Computer Vision: 5th International Conference, Scale-Space 2005, Hofgeismar, Germany, April 7-9, 2005. Proceedings 5, pp. 443–455. Springer, 2005.

EOSDIS. Active fire data. https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data, 2020. Land, Atmosphere Near real-time Capability for EOS (LANCE) system operated by NASA's Earth Science Data and Information System (ESDIS).

Kilian Fatras, Younes Zine, Rémi Flamary, Rémi Gribonval, and Nicolas Courty. Learning with minibatch Wasserstein: asymptotic and gradient properties. arXiv preprint arXiv:1910.04091, 2019.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.

Aden Forrow, Jan-Christian Hütter, Mor Nitzan, Philippe Rigollet, Geoffrey Schiebinger, and Jonathan Weed. Statistical optimal transport via factored couplings. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2454–2465, 2019.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent—a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

Brittany Froese Hamfeldt and Axel GR Turnquist. A convergence framework for optimal transport on the sphere. Numerische Mathematik, 151(3):627–657, 2022.

Allen Hatcher. Algebraic Topology. 2005.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Sigurdur Helgason. The Radon transform on R^n. Integral Geometry and Radon Transforms, pp. 1–62, 2011.

Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, volume 2, pp. 5, 2003.

SR Jammalamadaka. Topics in Circular Statistics, volume 336. World Scientific, 2001.

San Jiang, Kan You, Yaxin Li, Duojie Weng, and Wu Chen. 3D reconstruction of spherical images: a review of techniques, applications, and prospects. Geo-spatial Information Science, pp. 1–30, 2024.

PE Jupp. Some applications of directional statistics to astronomy. New Trends in Probability and Statistics, 3:123–133, 1995.

Renata Khasanova and Pascal Frossard. Graph-based classification of omnidirectional images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 869–878, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, California, 2015.

Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. URL https://api.semanticscholar.org/CorpusID:18268744.

Tam Le and Truyen Nguyen. Entropy partial transport with tree metrics: Theory and practice. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research, pp. 3835–3843. PMLR, 2021.

Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced variants of Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Tam Le, Truyen Nguyen, Dinh Phung, and Viet Anh Nguyen. Sobolev transport: A scalable metric for probability measures with graph metrics. In International Conference on Artificial Intelligence and Statistics, pp. 9844–9868. PMLR, 2022.

Tam Le, Truyen Nguyen, and Kenji Fukumizu. Scalable unbalanced Sobolev transport for measures on a graph. In International Conference on Artificial Intelligence and Statistics, pp. 8521–8560. PMLR, 2023.

Tam Le, Truyen Nguyen, and Kenji Fukumizu. Generalized Sobolev transport for probability measures on a graph. In Forty-first International Conference on Machine Learning, 2024.

Christophe Ley and Thomas Verdebout. Modern Directional Statistics. Chapman and Hall/CRC, 2017.

Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220, 2017.

Kanti V Mardia and Peter E Jupp. Directional Statistics. John Wiley & Sons, 2009.

Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503–2515, 2020.

Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130–146, 2019.
Arthur Pewsey and Eduardo García-Portugués. Recent advances in directional statistics. Test, 30(1):1–58, 2021.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Michael Quellmalz, Robert Beinert, and Gabriele Steidl. Sliced optimal transport on the sphere. Inverse Problems, 39(10):105005, 2023.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446, 2011.

Danilo Jimenez Rezende, George Papamakarios, Sébastien Racanière, Michael Albergo, Gurtej Kanwar, Phiala Shanahan, and Kyle Cranmer. Normalizing flows on tori and spheres. In International Conference on Machine Learning, pp. 8083–8092. PMLR, 2020.

Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pp. 9323–9332. PMLR, 2021.

Meyer Scetbon, Marco Cuturi, and Gabriel Peyré. Low-rank Sinkhorn factorization. International Conference on Machine Learning (ICML), 2021.

Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pp. 6394–6400. IEEE, 2022.

Hoang Tran, Thieu Vo, Tho Huu, Tan Nguyen, et al. Monomial matrix group equivariant neural functional networks. Advances in Neural Information Processing Systems, 37:48628–48665, 2025a.

Hoang V Tran, Minh-Khoi Nguyen-Nhat, Huyen Trang Pham, Thanh Chu, Tam Le, and Tan Minh Nguyen. Distance-based tree-sliced Wasserstein distance. In The Thirteenth International Conference on Learning Representations, 2025b.

Hoang-Viet Tran, Thieu N Vo, Tho Tran Huu, and Tan Minh Nguyen. A Clifford algebraic approach to E(n)-equivariant high-order graph neural networks. arXiv preprint arXiv:2410.04692, 2024a.

Huy Tran, Yikun Bai, Abihith Kothapalli, Ashkan Shahbazi, Xinran Liu, Rocio P Diaz Martin, and Soheil Kolouri. Stereographic spherical sliced Wasserstein distances. In Forty-first International Conference on Machine Learning, 2024b.

Thanh Tran, Viet-Hoang Tran, Thanh Chu, Trang Pham, Laurent El Ghaoui, Tam Le, and Tan M Nguyen. Tree-sliced Wasserstein distance with nonlinear projection. arXiv preprint arXiv:2505.00968, 2025c.

Viet-Hoang Tran, Thieu N Vo, An Nguyen The, Tho Tran Huu, Minh-Khoi Nguyen-Nhat, Thanh Tran, Duy-Tung Pham, and Tan Minh Nguyen. Equivariant neural functional networks for transformers. arXiv preprint arXiv:2410.04209, 2024c.

Viet-Hoang Tran, Trang Pham, Tho Tran, Minh Khoi Nguyen Nhat, Thanh Chu, Tam Le, and Tan M. Nguyen. Tree-sliced Wasserstein distance: A geometric perspective, 2025d. URL https://arxiv.org/abs/2406.13725.

C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Thieu N Vo, Viet-Hoang Tran, Tho Tran Huu, An Nguyen The, Thanh Tran, Minh-Khoi Nguyen Nhat, Duy-Tung Pham, and Tan Minh Nguyen. Equivariant polynomial functional networks. arXiv preprint arXiv:2410.04213, 2024.

Jiri Vrba and Stephen E Robinson. Signal processing in magnetoencephalography. Methods, 25(2):249–271, 2001.
Robin Walters, Jinxi Li, and Rose Yu. Trajectory prediction using equivariant continuous convolution. arXiv preprint arXiv:2010.11344, 2020.

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. PMLR, 2020.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.

Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805, 2018.

Mingxuan Yi and Song Liu. Sliced Wasserstein variational inference. In Asian Conference on Machine Learning, pp. 1213–1228. PMLR, 2023.

Table of notations:

$\mathbb{R}^d$ — $d$-dimensional Euclidean space
$\|\cdot\|_2$ — Euclidean norm
$\langle \cdot, \cdot \rangle$ — standard dot product
$\mathbb{S}^d$ — $d$-dimensional hypersphere
$\theta$ — unit vector
$\sqcup$ — disjoint union
$\arccos$ — inverse of the cosine function
$L^1(X)$ — space of Lebesgue integrable functions on $X$
$\mathcal{P}(X)$ — space of probability distributions on $X$
$\mu, \nu$ — measures
$\delta(\cdot)$ — one-dimensional Dirac delta function
$\mathcal{U}(\mathbb{S}^d)$ — uniform distribution on $\mathbb{S}^d$
$\#$ — pushforward (measure)
$C(X, Y)$ — space of continuous maps from $X$ to $Y$
$d(\cdot, \cdot)$ — metric in a metric space
$\mathrm{O}(d)$ — orthogonal group of order $d$
$g$ — element of a group
$\mathrm{W}_p$ — $p$-Wasserstein distance
$\mathrm{SW}_p$ — Sliced $p$-Wasserstein distance
$\Gamma$ — (rooted) subtree
$e$ — edge in a graph
$w_e$ — weight of edge $e$ in a graph
$\varphi_x$ — stereographic projection at $x$
$H_x$ — hyperplane through the origin orthogonal to $x$
$r^x_y$ — spherical ray
$\mathcal{T}, \mathcal{T}^x_{y_1,\ldots,y_k}$ — spherical tree
$\mathbb{T}^d_k$ — space of spherical trees with $k$ edges on $\mathbb{S}^d$
$\sigma$ — distribution on the space of spherical trees
$L$ — number of spherical trees
$k$ — number of edges in a spherical tree
$\mathcal{R}$ — original Radon Transform
$\mathcal{R}^\alpha$ — spherical Radon Transform on Spherical Trees
$\Delta_{k-1}$ — $(k-1)$-dimensional standard simplex
$\alpha$ — splitting map
$\zeta$ — tuning parameter in splitting maps

Supplement to "Spherical Tree-Sliced Wasserstein Distance"

Table of Contents
A Theoretical Proofs
  A.1 Properties of $\mathcal{R}^\alpha_\mathcal{T} f$
  A.2 Derivation and properties of splitting maps
  A.3 Proof of Theorem 4.3
  A.4 Proof of Theorem 5.2
  A.5 Derivation for the closed-form approximation of STSW
B Experimental details
  B.1 Implementation
  B.2 Evolution of STSW
  B.3 Runtime Analysis
  B.4 Gradient Flow
  B.5 Self-Supervised Learning
  B.6 Earth Data Estimation
  B.7 Generative Models

A THEORETICAL PROOFS

A.1 PROPERTIES OF $\mathcal{R}^\alpha_\mathcal{T} f$

Proof. Let $f \in L^1(\mathbb{S}^d)$. We show that $\|\mathcal{R}^\alpha_\mathcal{T} f\|_\mathcal{T} \le \|f\|_1$.
Note that $\arccos\langle x, y \rangle \in [0, \pi]$, so we have

$$\|\mathcal{R}^\alpha_\mathcal{T} f\|_\mathcal{T} = \sum_{i=1}^{k} \int_0^\pi \bigl| \mathcal{R}^\alpha_\mathcal{T} f(t, r^x_{y_i}) \bigr| \, \mathrm{d}t = \sum_{i=1}^{k} \int_0^\pi \Bigl| \int_{\mathbb{S}^d} f(y) \, \alpha(y, \mathcal{T})_i \, \delta(t - \arccos\langle x, y \rangle) \, \mathrm{d}y \Bigr| \, \mathrm{d}t$$
$$\le \sum_{i=1}^{k} \int_0^\pi \int_{\mathbb{S}^d} |f(y)| \, \alpha(y, \mathcal{T})_i \, \delta(t - \arccos\langle x, y \rangle) \, \mathrm{d}y \, \mathrm{d}t = \sum_{i=1}^{k} \int_{\mathbb{S}^d} |f(y)| \, \alpha(y, \mathcal{T})_i \int_0^\pi \delta(t - \arccos\langle x, y \rangle) \, \mathrm{d}t \, \mathrm{d}y$$
$$= \sum_{i=1}^{k} \int_{\mathbb{S}^d} |f(y)| \, \alpha(y, \mathcal{T})_i \, \mathrm{d}y = \int_{\mathbb{S}^d} |f(y)| \sum_{i=1}^{k} \alpha(y, \mathcal{T})_i \, \mathrm{d}y = \int_{\mathbb{S}^d} |f(y)| \, \mathrm{d}y = \|f\|_1.$$

It implies that $\mathcal{R}^\alpha_\mathcal{T} f \in L^1(\mathcal{T})$, which means the operator $\mathcal{R}^\alpha_\mathcal{T} \colon L^1(\mathbb{S}^d) \to L^1(\mathcal{T})$ is well-defined. Clearly, $\mathcal{R}^\alpha_\mathcal{T}$ is a linear operator.

A.2 DERIVATION AND PROPERTIES OF SPLITTING MAPS

Invariance and Equivariance Properties in Machine Learning. Equivariant networks (Cohen & Welling, 2016) improve generalization and boost sample efficiency by incorporating task symmetries directly into their architecture. These networks have been notably successful in several fields, including trajectory prediction (Walters et al., 2020), robotics (Simeonov et al., 2022), graph-based models (Satorras et al., 2021; Tran et al., 2024a), and functional networks (Tran et al., 2025a; 2024c; Vo et al., 2024), among others. The use of equivariance has been demonstrated to enhance performance, increase data efficiency, and strengthen robustness to out-of-domain generalization.

We recall the construction of the splitting map $\alpha$ presented in Subsection 4.2. We have a map $\beta \colon \mathbb{S}^d \times \mathbb{T}^d_k \to \mathbb{R}^k$ defined as follows:

$$\beta(y, \mathcal{T}^x_{y_1,\ldots,y_k})_i = \begin{cases} 0, & \text{if } y = x \text{ or } y = -x, \\[4pt] \arccos\Bigl\langle \dfrac{y - \langle x, y \rangle x}{\sqrt{1 - \langle x, y \rangle^2}}, \, y_i \Bigr\rangle \cdot \sqrt{1 - \langle x, y \rangle^2}, & \text{if } y \neq \pm x. \end{cases} \quad (21)$$

Then $\alpha \colon \mathbb{S}^d \times \mathbb{T}^d_k \to \Delta_{k-1}$ is defined as follows:

$$\alpha(y, \mathcal{T}) = \mathrm{softmax}\bigl(\{\zeta \cdot \beta(y, \mathcal{T})_i\}_{i=1,\ldots,k}\bigr). \quad (22)$$

We will show that:
1. where $\beta$ comes from;
2. $\alpha$ is continuous;
3. $\alpha$ is $\mathrm{O}(d+1)$-invariant.

Proof. We prove each part.

1. For $(y, \mathcal{T}^x_{y_1,\ldots,y_k}) \in \mathbb{S}^d \times \mathbb{T}^d_k$, let $N_y$ be the hyperplane passing through $y$ and orthogonal to $x$. Then $N_y$ intersects the spherical ray $r^x_{y_i}$ at a single point $a$, and intersects the line through $0$ and $x$ at a single point $b$. Then $\beta(y, \mathcal{T})_i$ is the length of the arc from $y$ to $a$ on the circle centered at $b$ passing through $y$ and $a$. Indeed, if $y = x$ or $y = -x$, this length equals $0$, matching the definition of $\beta$. If $y \neq \pm x$, let $c$ be the intersection of the line passing through $x, y$ with the hyperplane $H_x$, and let $d$ be the unique intersection of the ray from $0$ through $c$ with $H_x \cap \mathbb{S}^d$. In detail,

$$c = \varphi_x(y) \quad \text{and} \quad d = \frac{c}{\|c\|_2}. \quad (23)$$

Note that the condition $y \neq \pm x$ guarantees that $c \neq 0, \infty$. We compute $c$ and $d$ as follows:

$$c = -\frac{\langle x, y \rangle}{1 - \langle x, y \rangle} x + \frac{1}{1 - \langle x, y \rangle} y, \quad (24)$$

$$\|c\|_2^2 = \frac{\langle x, y \rangle^2 \|x\|_2^2}{(1 - \langle x, y \rangle)^2} + \frac{\|y\|_2^2}{(1 - \langle x, y \rangle)^2} - \frac{2 \langle x, y \rangle^2}{(1 - \langle x, y \rangle)^2} = \frac{1 - \langle x, y \rangle^2}{(1 - \langle x, y \rangle)^2}, \quad (25\text{–}28)$$

$$d = \frac{c}{\|c\|_2} = \frac{-\langle x, y \rangle x + y}{\sqrt{1 - \langle x, y \rangle^2}}. \quad (29)$$

Since $b$ is the projection of $y$ onto the vector $x$, we have

$$b = \langle x, y \rangle x. \quad (30)$$

By similarity of the two circles (both lie in hyperplanes parallel to $H_x$ and are centered on the line through $0$ and $x$),

$$\frac{\text{length of arc from } y \text{ to } a \text{ on the circle centered at } b}{\text{length of arc from } d \text{ to } y_i \text{ on the circle centered at } 0} = \frac{\|y - b\|_2}{\|d - 0\|_2} = \|y - b\|_2. \quad (31)$$

Note that the length of the arc from $d$ to $y_i$ on the unit circle centered at $0$ is

$$d_{\mathbb{S}^d}(d, y_i) = \arccos\langle d, y_i \rangle, \quad (32)$$

so

$$\text{length of arc from } y \text{ to } a = \arccos\langle d, y_i \rangle \cdot \|y - b\|_2 \quad (33\text{–}35)$$
$$= \arccos\Bigl\langle \frac{-\langle x, y \rangle x + y}{\sqrt{1 - \langle x, y \rangle^2}}, \, y_i \Bigr\rangle \cdot \|y - \langle x, y \rangle x\|_2 \quad (36)$$
$$= \arccos\Bigl\langle \frac{y - \langle x, y \rangle x}{\sqrt{1 - \langle x, y \rangle^2}}, \, y_i \Bigr\rangle \cdot \sqrt{\|y\|_2^2 + \langle x, y \rangle^2 \|x\|_2^2 - 2\langle x, y \rangle \langle x, y \rangle} \quad (37)$$
$$= \arccos\Bigl\langle \frac{y - \langle x, y \rangle x}{\sqrt{1 - \langle x, y \rangle^2}}, \, y_i \Bigr\rangle \cdot \sqrt{1 - \langle x, y \rangle^2} \quad (38)$$
$$= \beta(y, \mathcal{T}^x_{y_1,\ldots,y_k})_i. \quad (39)$$

We finish the derivation of $\beta$.
In the context of splitting maps, this is a reasonable choice, since it amounts to evaluating distances from a point to the spherical rays.

2. The derivation of $\beta$ clearly implies that $\beta$ is continuous. One can also check the continuity of $\beta$ directly from its formula. Since $\beta$ is continuous, $\alpha$ is continuous.

3. $\beta$ is $\mathrm{O}(d+1)$-invariant since orthogonal transformations preserve the standard dot product. Since $\beta$ is $\mathrm{O}(d+1)$-invariant, $\alpha$ is $\mathrm{O}(d+1)$-invariant.

A.3 PROOF OF THEOREM 4.3

Proof. Recall the notion of the (vertical) Radon Transform (Quellmalz et al., 2023). Let $\Phi^d$ be the collection of all spherical rays on $\mathbb{S}^d$, i.e.,

$$\Phi^d := \{r^x_y : x \in \mathbb{S}^d, \, y \in \mathbb{S}^d \cap H_x\}. \quad (40)$$

Note that this is the same as the collection of all spherical trees with one edge, i.e., $\mathbb{T}^d_1$. For $f \in L^1(\mathbb{S}^d)$, consider the map $\mathcal{R}_{r^x_y} f \colon r^x_y \cong [0, \pi] \to \mathbb{R}$ defined by

$$\mathcal{R}_{r^x_y} f(t) = \int_{\mathbb{S}^d} f(z) \, \delta(t - \arccos\langle z, x \rangle) \, \mathrm{d}z. \quad (41)$$

Similarly to Appendix A.1, we can show that $\mathcal{R}_{r^x_y} f \in L^1(r^x_y)$. We have an operator

$$\mathcal{R} \colon L^1(\mathbb{S}^d) \to \bigsqcup_{r^x_y \in \Phi^d} L^1(r^x_y), \qquad f \mapsto \bigl(\mathcal{R}_{r^x_y} f\bigr)_{r^x_y \in \Phi^d}. \quad (42\text{–}43)$$

This is exactly the (vertical) Radon Transform for Lebesgue integrable functions on $\mathbb{S}^d$, as in (Quellmalz et al., 2023). It is proved there to be an injective linear operator, so if $\mathcal{R}_{r^x_y} f = 0$ for all $r^x_y \in \Phi^d$, then $f = 0$.

Back to the problem. Recall that $\mathbb{T}^d_k$ is the space of all spherical trees with $k$ edges on $\mathbb{S}^d$,

$$\mathbb{T}^d_k = \{\mathcal{T}^x_{y_1,\ldots,y_k} = (r^x_{y_1}, \ldots, r^x_{y_k}) : x \in \mathbb{S}^d \text{ and } y_1, \ldots, y_k \in \mathbb{S}^d \cap H_x\}. \quad (44)$$

For $1 \le i \le k$ and $r^x_y \in \Phi^d$, define

$$\mathbb{T}^d_k(i, r^x_y) := \bigl\{ \mathcal{T}^x_{y_1,\ldots,y_k} : y_i = y \bigr\}. \quad (45)$$

In words, $\mathbb{T}^d_k(i, r^x_y)$ is the subcollection of $\mathbb{T}^d_k$ consisting of all spherical trees with root $x$ whose $i$-th spherical ray is $r^x_y$. It is clear that $\mathbb{T}^d_k$ is the disjoint union of all $\mathbb{T}^d_k(i, r^x_y)$ for $r^x_y \in \Phi^d$:

$$\mathbb{T}^d_k = \bigsqcup_{r^x_y \in \Phi^d} \mathbb{T}^d_k(i, r^x_y). \quad (46)$$

We make some observations about the subcollections $\mathbb{T}^d_k(i, r^x_y)$.

Result 1. Each orthogonal transformation $g \in \mathrm{O}(d+1)$ defines a bijection between $\mathbb{T}^d_k(i, r^x_y)$ and $\mathbb{T}^d_k(i, r^{gx}_{gy})$. In detail, the map

$$\phi_g \colon \mathbb{T}^d_k(i, r^x_y) \to \mathbb{T}^d_k(i, r^{gx}_{gy}), \qquad \mathcal{T}^x_{y_1,\ldots,y_k} \mapsto \mathcal{T}^{gx}_{gy_1,\ldots,gy_k}, \quad (47\text{–}48)$$

is well-defined and a bijection. This can be verified directly from the definitions.

Result 2. For $1 \le i \le k$ and $r^x_y, r^{x'}_{y'} \in \Phi^d$, we have

$$\int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{\mathbb{T}^d_k(i, r^{x'}_{y'})} \alpha(z', \mathcal{T})_i \, \mathrm{d}\mathcal{T} \quad (49)$$

for all $z, z' \in \mathbb{S}^d$ such that $d_{\mathbb{S}^d}(x, z) = d_{\mathbb{S}^d}(x', z')$. Note that the integrals are taken over $\mathbb{T}^d_k(i, r^x_y)$ and $\mathbb{T}^d_k(i, r^{x'}_{y'})$ with the measures induced from the measure on $\mathbb{T}^d_k$.

To prove Equation (49), we first show it in two specific cases:

Case 1. Assume $x = x'$ and $y = y'$.
Case 2. Assume $z$ lies on $r^x_y$ and $z'$ lies on $r^{x'}_{y'}$.

If Equation (49) holds under the assumptions of cases 1 and 2, then it holds for all $x, y, z, x', y', z'$. Indeed, assume that Equation (49) holds in cases 1 and 2. Then, for all $x, y, z, x', y', z'$, we consider $t \in r^x_y$ and $t' \in r^{x'}_{y'}$ such that

$$d_{\mathbb{S}^d}(x, t) = d_{\mathbb{S}^d}(x, z) = d_{\mathbb{S}^d}(x', z') = d_{\mathbb{S}^d}(x', t'). \quad (50)$$

Then from the results in cases 1 and 2, we have

$$\int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \overset{\text{case 1}}{=} \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(t, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \quad (51)$$
$$\overset{\text{case 2}}{=} \int_{\mathbb{T}^d_k(i, r^{x'}_{y'})} \alpha(t', \mathcal{T})_i \, \mathrm{d}\mathcal{T} \overset{\text{case 1}}{=} \int_{\mathbb{T}^d_k(i, r^{x'}_{y'})} \alpha(z', \mathcal{T})_i \, \mathrm{d}\mathcal{T}. \quad (52)$$

So Equation (49) holds for all $x, y, z, x', y', z'$. Now we prove that it holds in cases 1 and 2. For case 1, from the transitivity of the action of orthogonal transformations on $\mathbb{S}^d$, there exists $g \in \mathrm{O}(d+1)$ such that

$$gx = x, \quad gy = y, \quad gz = z'. \quad (53)$$

From Result 1, there is a corresponding bijection $\phi_g$ from $\mathbb{T}^d_k(i, r^x_y)$ to itself.
We have

$$\int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z', \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z', g \cdot \mathcal{T})_i \, \mathrm{d}(g \cdot \mathcal{T}) \quad \text{(change of variables)} \quad (54)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(gz, g \cdot \mathcal{T})_i \, \mathrm{d}(g \cdot \mathcal{T}) \quad \text{(since } z' = gz\text{)} \quad (55)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}(g \cdot \mathcal{T}) \quad \text{(since } \alpha \text{ is } \mathrm{O}(d+1)\text{-invariant)} \quad (56)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \quad \text{(since } |\det(g)| = 1\text{)}. \quad (57)$$

So Equation (49) holds in case 1. A similar argument works for case 2. From the transitivity of the action of orthogonal transformations on $\mathbb{S}^d$, there exists $h \in \mathrm{O}(d+1)$ such that

$$hx = x', \quad hy = y', \quad hz = z'. \quad (59)$$

From Result 1, there is a corresponding bijection $\phi_h$ from $\mathbb{T}^d_k(i, r^x_y)$ to $\mathbb{T}^d_k(i, r^{x'}_{y'})$. We have

$$\int_{\mathbb{T}^d_k(i, r^{x'}_{y'})} \alpha(z', \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z', h \cdot \mathcal{T})_i \, \mathrm{d}(h \cdot \mathcal{T}) \quad \text{(change of variables)} \quad (60)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(hz, h \cdot \mathcal{T})_i \, \mathrm{d}(h \cdot \mathcal{T}) \quad \text{(since } z' = hz\text{)} \quad (61)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}(h \cdot \mathcal{T}) \quad \text{(since } \alpha \text{ is } \mathrm{O}(d+1)\text{-invariant)} \quad (62)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \quad \text{(since } |\det(h)| = 1\text{)}. \quad (63)$$

We finish the proof of Result 2.

Result 3. From Result 2, for all $1 \le i \le k$ and $t \in [0, \pi]$, we can define a constant $c_i(t)$ such that

$$c_i(t) = \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \quad (65)$$

for all $r^x_y \in \Phi^d$ and $z \in \mathbb{S}^d$ such that $t = d_{\mathbb{S}^d}(x, z) = \arccos\langle x, z \rangle$. Then for all $t \in [0, \pi]$, we have

$$c_1(t) + c_2(t) + \ldots + c_k(t) = 1. \quad (66)$$

To show this, first denote by $\mathbb{T}^d_k(x)$ the collection of all spherical trees with root $x$ on $\mathbb{S}^d$. We have

$$\mathbb{T}^d_k(x) = \bigsqcup_{y \in H_x \cap \mathbb{S}^d} \mathbb{T}^d_k(i, r^x_y), \quad (67)$$

so we have

$$\int_{\mathbb{T}^d_k(x)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{H_x \cap \mathbb{S}^d} \Bigl( \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \Bigr) \, \mathrm{d}y = \int_{H_x \cap \mathbb{S}^d} c_i(\arccos\langle x, z \rangle) \, \mathrm{d}y = c_i(\arccos\langle x, z \rangle). \quad (68\text{–}70)$$

Hence

$$c_1(\arccos\langle x, z \rangle) + \ldots + c_k(\arccos\langle x, z \rangle) = \sum_{i=1}^{k} \int_{\mathbb{T}^d_k(x)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{\mathbb{T}^d_k(x)} \sum_{i=1}^{k} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} = \int_{\mathbb{T}^d_k(x)} 1 \, \mathrm{d}\mathcal{T} = 1. \quad (71\text{–}74)$$

We finish the proof of Result 3.

Consider a splitting map $\alpha$ in $C(\mathbb{S}^d \times \mathbb{T}^d_k, \Delta_{k-1})$ that is $\mathrm{O}(d+1)$-invariant. For a function $f \in L^1(\mathbb{S}^d)$ and each $1 \le i \le k$, define a function $g_i \in L^1([0, \pi] \times \Phi^d)$ as follows:

$$g_i \colon [0, \pi] \times \Phi^d \to \mathbb{R}, \qquad (t, r^x_y) \mapsto \int_{\mathbb{T}^d_k(i, r^x_y)} \mathcal{R}^\alpha_\mathcal{T} f(t, r^x_y) \, \mathrm{d}\mathcal{T}. \quad (75\text{–}76)$$

From the definition of $\mathcal{R}^\alpha_\mathcal{T} f$ in Equations (8)–(9), we obtain

$$g_i(t, r^x_y) = \int_{\mathbb{T}^d_k(i, r^x_y)} \mathcal{R}^\alpha_\mathcal{T} f(t, r^x_y) \, \mathrm{d}\mathcal{T} \quad (79)$$
$$= \int_{\mathbb{T}^d_k(i, r^x_y)} \int_{\mathbb{S}^d} f(z) \, \alpha(z, \mathcal{T})_i \, \delta(t - \arccos\langle x, z \rangle) \, \mathrm{d}z \, \mathrm{d}\mathcal{T} \quad (80)$$
$$= \int_{\mathbb{S}^d} f(z) \, \delta(t - \arccos\langle x, z \rangle) \Bigl( \int_{\mathbb{T}^d_k(i, r^x_y)} \alpha(z, \mathcal{T})_i \, \mathrm{d}\mathcal{T} \Bigr) \, \mathrm{d}z \quad (81\text{–}82)$$
$$= \int_{\mathbb{S}^d} f(z) \, \delta(t - \arccos\langle x, z \rangle) \, c_i(\arccos\langle x, z \rangle) \, \mathrm{d}z \quad (83)$$
$$= c_i(t) \int_{\mathbb{S}^d} f(z) \, \delta(t - \arccos\langle x, z \rangle) \, \mathrm{d}z \quad (84)$$
$$= c_i(t) \cdot \mathcal{R}_{r^x_y} f(t). \quad (85)$$

Therefore

$$\sum_{i=1}^{k} g_i(t, r^x_y) = \Bigl( \sum_{i=1}^{k} c_i(t) \Bigr) \cdot \mathcal{R}_{r^x_y} f(t) = 1 \cdot \mathcal{R}_{r^x_y} f(t) = \mathcal{R}_{r^x_y} f(t). \quad (86\text{–}87)$$

Let $f \in \mathrm{Ker}(\mathcal{R}^\alpha)$, which means $\mathcal{R}^\alpha_\mathcal{T} f = 0$ for all $\mathcal{T} \in \mathbb{T}^d_k$. Then $g_i = 0$ in $L^1([0, \pi] \times \Phi^d)$ for all $1 \le i \le k$. It implies $\mathcal{R}_{r^x_y} f = 0$ in $L^1(r^x_y)$ for all $r^x_y \in \Phi^d$. Since the (vertical) Radon Transform is injective, we conclude that $f = 0$ in $L^1(\mathbb{S}^d)$, so $\mathcal{R}^\alpha$ is injective.

Remark. To fully formalize the proof above, the notion of the Haar measure for compact groups is required. However, we simplify the explanation, as it goes beyond the scope of this paper.

A.4 PROOF OF THEOREM 5.2

Proof. We want to show that

$$\mathrm{STSW}(\mu, \nu) = \int_{\mathbb{T}^d_k} \mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu) \, \mathrm{d}\sigma(\mathcal{T}) \quad (88)$$

is a metric on $\mathcal{P}(\mathbb{S}^d)$.

Positive definiteness. For $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$, it is clear that $\mathrm{STSW}(\mu, \mu) = 0$ and $\mathrm{STSW}(\mu, \nu) \ge 0$. If $\mathrm{STSW}(\mu, \nu) = 0$, then $\mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu) = 0$ for all $\mathcal{T} \in \mathbb{T}^d_k$. Since $\mathrm{W}_{d_\mathcal{T},1}$ is a metric on $\mathcal{P}(\mathcal{T})$, we have $\mathcal{R}^\alpha_\mathcal{T}\mu = \mathcal{R}^\alpha_\mathcal{T}\nu$ for all $\mathcal{T} \in \mathbb{T}^d_k$. By the injectivity of our Radon transform variant, we have $\mu = \nu$.

Symmetry. For $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$, we have:

$$\mathrm{STSW}(\mu, \nu) = \int_{\mathbb{T}^d_k} \mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\mu, \mathcal{R}^\alpha_\mathcal{T}\nu) \, \mathrm{d}\sigma(\mathcal{T}) = \int_{\mathbb{T}^d_k} \mathrm{W}_{d_\mathcal{T},1}(\mathcal{R}^\alpha_\mathcal{T}\nu, \mathcal{R}^\alpha_\mathcal{T}\mu) \, \mathrm{d}\sigma(\mathcal{T}) = \mathrm{STSW}(\nu, \mu). \quad (89\text{–}90)$$

So $\mathrm{STSW}(\mu, \nu) = \mathrm{STSW}(\nu, \mu)$.

Triangle inequality.
A.4 PROOF OF THEOREM 5.2

Proof. We want to show that

$$\mathrm{STSW}(\mu, \nu) = \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) \, d\sigma(\mathcal{T}) \quad (88)$$

is a metric on $\mathcal{P}(\mathbb{S}^d)$.

Positive definiteness. For $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$, it is clear that $\mathrm{STSW}(\mu, \mu) = 0$ and $\mathrm{STSW}(\mu, \nu) \ge 0$. If $\mathrm{STSW}(\mu, \nu) = 0$, then $W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) = 0$ for all $\mathcal{T} \in \mathbb{T}^d_k$. Since $W_{d_{\mathcal{T}},1}$ is a metric on $\mathcal{P}(\mathcal{T})$, we have $\mathcal{R}^\alpha_{\mathcal{T}} \mu = \mathcal{R}^\alpha_{\mathcal{T}} \nu$ for all $\mathcal{T} \in \mathbb{T}^d_k$. By the injectivity of our Radon transform variant, $\mu = \nu$.

Symmetry. For $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$, we have

$$\mathrm{STSW}(\mu, \nu) = \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) \, d\sigma(\mathcal{T}) \quad (89)$$
$$= \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \nu, \mathcal{R}^\alpha_{\mathcal{T}} \mu) \, d\sigma(\mathcal{T}) = \mathrm{STSW}(\nu, \mu). \quad (90)$$

So $\mathrm{STSW}(\mu, \nu) = \mathrm{STSW}(\nu, \mu)$.

Triangle inequality. For $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\mathbb{S}^d)$, we have

$$\mathrm{STSW}(\mu_1, \mu_2) + \mathrm{STSW}(\mu_2, \mu_3) \quad (91)$$
$$= \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu_1, \mathcal{R}^\alpha_{\mathcal{T}} \mu_2) \, d\sigma(\mathcal{T}) + \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu_2, \mathcal{R}^\alpha_{\mathcal{T}} \mu_3) \, d\sigma(\mathcal{T}) \quad (92)$$
$$= \int_{\mathbb{T}^d_k} \big[ W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu_1, \mathcal{R}^\alpha_{\mathcal{T}} \mu_2) + W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu_2, \mathcal{R}^\alpha_{\mathcal{T}} \mu_3) \big] \, d\sigma(\mathcal{T}) \quad (93)$$
$$\ge \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu_1, \mathcal{R}^\alpha_{\mathcal{T}} \mu_3) \, d\sigma(\mathcal{T}) \quad (94)$$
$$= \mathrm{STSW}(\mu_1, \mu_3). \quad (95)$$

So the triangle inequality holds for STSW. We conclude that STSW is a metric on the space $\mathcal{P}(\mathbb{S}^d)$.

$O(d+1)$-invariance of STSW. For $g \in O(d+1)$, we show that

$$\mathrm{STSW}(\mu, \nu) = \mathrm{STSW}(g_\sharp \mu, g_\sharp \nu), \quad (96)$$

where $g_\sharp \mu, g_\sharp \nu$ are the pushforwards of $\mu, \nu$ via the orthogonal transformation $g \colon \mathbb{S}^d \to \mathbb{S}^d$, respectively. For $\mathcal{T} = \mathcal{T}^x_{y_1, \ldots, y_k} \in \mathbb{T}^d_k$, we have $g\mathcal{T} = \mathcal{T}^{gx}_{gy_1, \ldots, gy_k}$. Note that $|\det(g)| = 1$, so

$$\mathcal{R}^\alpha_{g\mathcal{T}}(g_\sharp \mu)(t, r^{gx}_{gy_i}) = \int_{\mathbb{S}^d} g_\sharp \mu(y)\, \alpha(y, g\mathcal{T})_i\, \delta(t - \arccos\langle gx, y\rangle) \, dy \quad (97)$$
$$= \int_{\mathbb{S}^d} \mu(g^{-1} y)\, \alpha(y, g\mathcal{T})_i\, \delta(t - \arccos\langle gx, y\rangle) \, dy \quad (98)$$
$$= \int_{\mathbb{S}^d} \mu(g^{-1} g y)\, \alpha(gy, g\mathcal{T})_i\, \delta(t - \arccos\langle gx, gy\rangle) \, d(gy) \quad (99)$$
$$= \int_{\mathbb{S}^d} \mu(y)\, \alpha(y, \mathcal{T})_i\, \delta(t - \arccos\langle x, y\rangle) \, dy \quad (100)$$
$$= \mathcal{R}^\alpha_{\mathcal{T}} \mu(t, r^x_{y_i}). \quad (101)$$

Similarly, we have

$$\mathcal{R}^\alpha_{g\mathcal{T}}(g_\sharp \nu)(t, r^{gx}_{gy_i}) = \mathcal{R}^\alpha_{\mathcal{T}} \nu(t, r^x_{y_i}). \quad (102)$$

Since $g$ induces an isometry $\mathcal{T} \to g\mathcal{T}$, we have

$$W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) = W_{d_{g\mathcal{T}},1}(\mathcal{R}^\alpha_{g\mathcal{T}} g_\sharp \mu, \mathcal{R}^\alpha_{g\mathcal{T}} g_\sharp \nu). \quad (103)$$

Therefore,

$$\mathrm{STSW}(g_\sharp \mu, g_\sharp \nu) = \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} g_\sharp \mu, \mathcal{R}^\alpha_{\mathcal{T}} g_\sharp \nu) \, d\sigma(\mathcal{T}) \quad (104)$$
$$= \int_{\mathbb{T}^d_k} W_{d_{g\mathcal{T}},1}(\mathcal{R}^\alpha_{g\mathcal{T}} g_\sharp \mu, \mathcal{R}^\alpha_{g\mathcal{T}} g_\sharp \nu) \, d\sigma(g\mathcal{T}) \quad (105)$$
$$= \int_{\mathbb{T}^d_k} W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) \, d\sigma(\mathcal{T}) \quad (106)$$
$$= \mathrm{STSW}(\mu, \nu). \quad (107)$$

So STSW is $O(d+1)$-invariant.

A.5 DERIVATION FOR THE CLOSED-FORM APPROXIMATION OF STSW

We derive the closed-form approximation of STSW for two discrete probability distributions $\mu$ and $\nu$ given as

$$\mu(x) = \sum_{j=1}^n u_j\, \delta(x - a_j) \quad \text{and} \quad \nu(x) = \sum_{j=1}^n v_j\, \delta(x - a_j). \quad (108)$$

We can write $\mu$ and $\nu$ in these forms by combining their supports and allowing some $u_j$ and $v_j$ to be $0$. Consider a spherical tree $\mathcal{T} = \mathcal{T}^x_{y_1, \ldots, y_k}$. For $1 \le j \le n$, let $c_j = d_{\mathbb{S}^d}(x, a_j)$, and also let $c_0 = 0$. By re-indexing, we may assume that the sequence $c_0, \ldots, c_n$ is increasing,

$$0 = c_0 \le c_1 \le c_2 \le \ldots \le c_n. \quad (109)$$

For $0 \le j \le n$ and $1 \le i \le k$, consider the points $x^{(i)}_j = (c_j, r^x_{y_i})$ on the spherical tree $\mathcal{T}$. Since $c_0 = 0$, we have

$$x^{(1)}_0 = x^{(2)}_0 = \ldots = x^{(k)}_0 = x, \quad (110)$$

and for $1 \le j \le n$, $x^{(i)}_j$ is exactly the unique intersection between the hyperplane passing through $a_j$ and orthogonal to $x$, and the spherical ray $r^x_{y_i}$.

We compute $\mathcal{R}^\alpha_{\mathcal{T}} \mu$: for $t \in [0, \pi]$ and $1 \le i \le k$,

$$\mathcal{R}^\alpha_{\mathcal{T}} \mu(t, r^x_{y_i}) = \int_{\mathbb{S}^d} \mu(y)\, \alpha(y, \mathcal{T})_i\, \delta(t - \arccos\langle x, y\rangle) \, dy \quad (111)$$
$$= \int_{\mathbb{S}^d} \sum_{j=1}^n u_j\, \delta(y - a_j)\, \alpha(y, \mathcal{T})_i\, \delta(t - \arccos\langle x, y\rangle) \, dy \quad (112)$$
$$= \sum_{j=1}^n u_j \int_{\mathbb{S}^d} \alpha(y, \mathcal{T})_i\, \delta(y - a_j)\, \delta(t - \arccos\langle x, y\rangle) \, dy. \quad (113)$$

It follows that:

1. If $t \notin \{c_1, \ldots, c_n\}$, then $\mathcal{R}^\alpha_{\mathcal{T}} \mu(t, r^x_{y_i}) = 0$; and,
2. If $t = c_j$ for some $j$, then $\mathcal{R}^\alpha_{\mathcal{T}} \mu(t, r^x_{y_i}) = \mathcal{R}^\alpha_{\mathcal{T}} \mu(c_j, r^x_{y_i}) = \mathcal{R}^\alpha_{\mathcal{T}} \mu(x^{(i)}_j) = \alpha(a_j, \mathcal{T})_i\, u_j$.

Similarly, we have:

1. If $t \notin \{c_1, \ldots, c_n\}$, then $\mathcal{R}^\alpha_{\mathcal{T}} \nu(t, r^x_{y_i}) = 0$; and,
2. If $t = c_j$ for some $j$, then $\mathcal{R}^\alpha_{\mathcal{T}} \nu(t, r^x_{y_i}) = \mathcal{R}^\alpha_{\mathcal{T}} \nu(c_j, r^x_{y_i}) = \mathcal{R}^\alpha_{\mathcal{T}} \nu(x^{(i)}_j) = \alpha(a_j, \mathcal{T})_i\, v_j$.

For $1 \le j \le n$ and $1 \le i \le k$, let

$$u^{(i)}_j = \alpha(a_j, \mathcal{T})_i\, u_j \quad \text{and} \quad v^{(i)}_j = \alpha(a_j, \mathcal{T})_i\, v_j. \quad (114)$$

Consider $\mathcal{T}$ as a graph with nodes $x^{(i)}_j$ for $1 \le i \le k$, $0 \le j \le n$. Note that $x^{(i)}_0 = x$ for all $i$, and we assign this node as the root of $\mathcal{T}$. Two nodes are adjacent if the shortest path between them on $\mathcal{T}$ contains no other node. In other words, the edges of $\mathcal{T}$ are $e^{(i)}_j = (x^{(i)}_j, x^{(i)}_{j-1})$ for $1 \le i \le k$, $1 \le j \le n$, and $e^{(i)}_j$ has length $c_j - c_{j-1}$. For an edge $e^{(i)}_j$, its endpoint farther from the root is $x^{(i)}_j$. Also, for a node $x^{(i)}_j$ with $j > 0$, the corresponding subtree $\Gamma(x^{(i)}_j)$ contains all nodes $x^{(i)}_p$ with $j \le p \le n$.
From the observations above, $\mu$ and $\nu$ transform to $\mathcal{R}^\alpha_{\mathcal{T}} \mu$ and $\mathcal{R}^\alpha_{\mathcal{T}} \nu$ supported on the nodes of $\mathcal{T}$, where the mass at node $x^{(i)}_j$ is $u^{(i)}_j$ and $v^{(i)}_j$, respectively. So, from Equation (3), we have

$$W_{d_{\mathcal{T}},1}(\mathcal{R}^\alpha_{\mathcal{T}} \mu, \mathcal{R}^\alpha_{\mathcal{T}} \nu) = \sum_{e \in \mathcal{T}} w_e \left| \mu(\Gamma(v_e)) - \nu(\Gamma(v_e)) \right| \quad (115)$$
$$= \sum_{i=1}^k \sum_{j=1}^n (c_j - c_{j-1}) \left| \mu(\Gamma(x^{(i)}_j)) - \nu(\Gamma(x^{(i)}_j)) \right| \quad (116)$$
$$= \sum_{i=1}^k \sum_{j=1}^n (c_j - c_{j-1}) \left| \sum_{p=j}^n \mu(x^{(i)}_p) - \sum_{p=j}^n \nu(x^{(i)}_p) \right|$$
$$= \sum_{i=1}^k \sum_{j=1}^n (c_j - c_{j-1}) \left| \sum_{p=j}^n \big( u^{(i)}_p - v^{(i)}_p \big) \right|$$
$$= \sum_{i=1}^k \sum_{j=1}^n (c_j - c_{j-1}) \left| \sum_{p=j}^n \alpha(a_p, \mathcal{T})_i\, (u_p - v_p) \right|.$$

This is identical to Equation (19). We finish the derivation.

[Figure 3: Runtime comparison, averaged over 15 runs.]

B EXPERIMENTAL DETAILS

All our experiments were conducted on a single NVIDIA H100 80G GPU. For all tasks, unless specified otherwise, the hyperparameter $\zeta$ in STSW is set to its default value of 2.

B.1 IMPLEMENTATION

We summarize pseudo-code for the STSW distance computation in Algorithm 1.

Algorithm 1 Spherical Tree-Sliced Wasserstein distance.
Input: $\mu, \nu \in \mathcal{P}(\mathbb{S}^d)$ as $\mu(x) = \sum_{j=1}^n u_j \delta(x - a_j)$, $\nu(x) = \sum_{j=1}^n v_j \delta(x - a_j)$, number of spherical trees $L$, number of rays in spherical trees $k$, splitting map $\alpha$ with weight $\delta \in \mathbb{R}$.
for $l = 1$ to $L$ do
  Sample $x^{(l)}, y^{(l)}_1, \ldots, y^{(l)}_k$ i.i.d. from $\mathcal{N}(0, \mathrm{Id}_{d+1})$.
  Compute $x^{(l)} \leftarrow x^{(l)} / \|x^{(l)}\|_2$ and $y^{(l)}_j \leftarrow \varphi_{x^{(l)}}(y^{(l)}_j) / \|\varphi_{x^{(l)}}(y^{(l)}_j)\|_2$.
  Construct the spherical tree $\mathcal{T}_l = \mathcal{T}^{x^{(l)}}_{y^{(l)}_1, \ldots, y^{(l)}_k}$.
  Compute $W_{d_{\mathcal{T}_l},1}(\mathcal{R}^\alpha_{\mathcal{T}_l} \mu, \mathcal{R}^\alpha_{\mathcal{T}_l} \nu)$ by Equation (19).
end for
Compute $\widehat{\mathrm{STSW}}(\mu, \nu) = (1/L) \sum_{l=1}^L W_{d_{\mathcal{T}_l},1}(\mathcal{R}^\alpha_{\mathcal{T}_l} \mu, \mathcal{R}^\alpha_{\mathcal{T}_l} \nu)$.
Return: $\widehat{\mathrm{STSW}}(\mu, \nu)$.

B.2 EVOLUTION OF STSW

In this section, we examine the evolution of STSW, as well as of other distances, when measuring two distributions. In line with (Bonet et al., 2022; Tran et al., 2024b), we select the source distribution $\mathrm{vMF}(\cdot, 0)$, i.e., the uniform distribution, and the target distribution $\mathrm{vMF}(\mu, \kappa)$. We initialize 500 samples in each distribution. We use $\kappa = 10$, $L = 200$ trees and $k = 10$ lines for STSW, $L = 200$ projections for the other sliced metrics, $N_R = 100$ rotations for RI-S3W and ARI-S3W, and a pool size of 1000 for ARI-S3W in all experiments, unless specified otherwise. Results are averaged over 20 runs.

Evolution w.r.t. $\kappa$. Figure 4 shows the evolution of various methods w.r.t. $\kappa$. As expected, STSW aligns with the trends of S3W and SSW, decreasing with higher dimensions, unlike the KL divergence. Here, we use a derived form of the KL divergence (Davidson et al., 2018; Xu & Durrett, 2018):

$$\mathrm{KL}\big(\mathrm{vMF}(\mu, \kappa) \,\|\, \mathrm{vMF}(\cdot, 0)\big) = \kappa \frac{I_{(d+1)/2}(\kappa)}{I_{(d+1)/2-1}(\kappa)} + \left( \frac{d+1}{2} - 1 \right) \log \kappa - \frac{d+1}{2} \log(2\pi) - \log I_{(d+1)/2-1}(\kappa) + \frac{d+1}{2} \log \pi + \log 2 - \log \Gamma\!\left( \frac{d+1}{2} \right).$$

[Figure 4: Evolution between $\mathrm{vMF}(\mu, \kappa)$ and $\mathrm{vMF}(\cdot, 0)$ w.r.t. $\kappa$ on $\mathbb{S}^{d-1}$ across various methods. We use $\kappa \in \{1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250\}$.]

Evolution w.r.t. rotated vMFs. Next, we evaluate a fixed vMF distribution against its rotation along a great circle. Specifically, we compute each metric between $\mathrm{vMF}((1, 0, 0, \ldots), \kappa)$ and $\mathrm{vMF}((\cos\theta, \sin\theta, 0, \ldots), \kappa)$ for $\theta \in \{k\pi/6\}_{k=0}^{12}$. We plot the results in Figure 5.

[Figure 5: Evolution between rotated vMF distributions, averaged over 100 runs. $d$ denotes the data dimension.]

Evolution of STSW w.r.t. the number of trees, the number of lines, and $\zeta$. Next, we study the effect of the number of trees and lines on STSW. Unless specified otherwise, we fix $\kappa = 10$ and $d = 3$. We present the results in Figure 6, Figure 7, and Figure 8.
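To connect Algorithm 1 with the closed form in Equation (19), the following is a minimal PyTorch sketch of the Monte Carlo STSW estimator. It assumes $\varphi_x$ is the projection onto the orthogonal complement of $x$ and, as in the earlier sketch after Appendix A.3, a softmax-of-inner-products splitting map with weight $\delta$; both are illustrative assumptions rather than the released implementation (see the repository for the exact code).

```python
import torch

def stsw_sketch(a, u, v, L=200, k=10, delta=2.0):
    """Monte Carlo STSW between discrete measures mu = sum_j u_j delta_{a_j} and
    nu = sum_j v_j delta_{a_j} on shared supports a: (n, d+1) unit vectors.
    Follows Algorithm 1, with per-tree distances given by Equation (19)."""
    n, dim = a.shape
    x = torch.randn(L, dim)
    x = x / x.norm(dim=1, keepdim=True)                       # roots x^(l)
    y = torch.randn(L, k, dim)
    y = y - (y * x[:, None, :]).sum(-1, keepdim=True) * x[:, None, :]  # assumed phi_x: project onto x-perp
    y = y / y.norm(dim=-1, keepdim=True)                      # ray directions y^(l)_j
    c = torch.arccos((a @ x.T).clamp(-1 + 1e-7, 1 - 1e-7)).T  # (L, n): c_j = d_{S^d}(x, a_j)
    alpha = torch.softmax(delta * torch.einsum('nd,lkd->lkn', a, y), dim=1)  # assumed splitting map
    diff = alpha * (u - v)                                    # (L, k, n): alpha(a_j, T)_i (u_j - v_j)
    c_sorted, idx = torch.sort(c, dim=1)                      # order supports by distance from the root
    diff = torch.gather(diff, 2, idx[:, None, :].expand(-1, k, -1))
    suffix = diff.flip(2).cumsum(2).flip(2)                   # inner sums over p >= j in Eq. (19)
    gaps = torch.diff(c_sorted, dim=1, prepend=torch.zeros(L, 1))  # c_j - c_{j-1}, with c_0 = 0
    w = (gaps[:, None, :] * suffix.abs()).sum(dim=(1, 2))     # per-tree W_{d_T,1} via Eq. (19)
    return w.mean()                                           # average over the L sampled trees
```

For instance, with a = torch.nn.functional.normalize(torch.randn(500, 3), dim=1) and uniform weights u = v = torch.full((500,), 1/500), stsw_sketch(a, u, v) runs a configuration analogous to the default L = 200 trees and k = 10 lines used in this section.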
[Figure 6: Evolution of STSW between the source and target distributions when varying the number of trees $L \in \{1, 10, 50, 100, 200, 400, 500, 750, 900, 1000\}$.]

[Figure 7: Evolution of STSW between two distributions w.r.t. the number of lines $k \in \{1, 10, 50, 100, 200, 400, 500, 600, 700, 750\}$.]

[Figure 8: Evolution of STSW between two distributions w.r.t. $\zeta \in \{1, 2, 5, 10, 20, 25, 50, 75, 100, 150, 200\}$.]

B.3 RUNTIME ANALYSIS

Runtime Comparison. We now perform a runtime comparison with other commonly used distance metrics, including the traditional Wasserstein distance, Sinkhorn (Cuturi, 2013), Sliced-Wasserstein (SW), Spherical Sliced-Wasserstein (SSW) (Bonet et al., 2022), as well as Stereographic Spherical Sliced Wasserstein (S3W) (Tran et al., 2024b) and its variants (RI-S3W, ARI-S3W). For a fair comparison, we also include SSW2 computed with binary search (BS) and with the closed form available when one distribution is uniform (Unif). We set $L = 200$ projections for all methods. For our STSW, we use $L = 100$ trees and $k = 10$ lines. The runtime of applying each of these methods to two distributions on $\mathbb{S}^2$ is illustrated in Figure 3.

Runtime Evolution. To further assess the performance of STSW, we conduct a runtime analysis to understand the computational cost associated with different configurations. We again choose the uniform distribution and $\mathrm{vMF}(\mu, \kappa)$ with $\kappa = 10$ as our source and target distributions and use STSW to measure the distance between these two probability measures. All experiments are repeated 20 times with default parameters set to $L = 200$ trees, $k = 10$ lines, and $N = 500$ samples, unless otherwise stated. We vary the number of trees $L \in \{200, 400, 500, 750, 900, 1000, 1250, 1500, 1750, 2000\}$ in Figure 9a, adjust the number of lines $k$ across $\{5, 10, 25, 50, 100, 150, 200, 300, 500, 750, 1000\}$ in Figure 9b, and change the number of samples $N$ within $\{500, 1000, 3000, 5000, 7000, 8000, 10000\}$ in Figure 9c. We note that the runtime of STSW scales linearly with these parameters.

[Figure 9: Runtime of STSW w.r.t. the number of trees, lines, and samples.]

B.4 GRADIENT FLOW

The probability density function of the von Mises-Fisher distribution with mean direction $\mu \in \mathbb{S}^d$ is given by

$$f(x; \mu, \kappa) = C_d(\kappa) \exp(\kappa \mu^T x),$$

where $\kappa > 0$ is the concentration parameter and the normalization constant is

$$C_d(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2 - 1}(\kappa)}, \quad p = d + 1,$$

with $p$ the ambient dimension. Our target distribution, a mixture of 12 vMFs with 2400 samples (200 per vMF), has $\kappa = 50$ and mean directions $\mu_1, \ldots, \mu_{12}$ given (up to normalization) by the twelve vertices of a regular icosahedron,

$$(\pm 1, \pm\varphi, 0), \quad (0, \pm 1, \pm\varphi), \quad (\pm\varphi, 0, \pm 1),$$

where $\varphi = \frac{1 + \sqrt{5}}{2}$ is the golden ratio. The projected gradient descent, as described in (Bonet et al., 2022), reads

$$\begin{cases} x^{(k+1)} = x^{(k)} - \gamma\, \nabla_{x^{(k)}} \mathrm{STSW}(\hat{\mu}_k, \nu), \\ x^{(k+1)} \leftarrow x^{(k+1)} / \| x^{(k+1)} \|_2, \end{cases}$$

where $\hat{\mu}_k$ is the empirical measure of the current particles and $\nu$ is the target.

Setup. We fix $L = 200$ trees and $k = 5$ lines. For the other methods, we use $L = 1000$ projections. As in the original setup, ARI-S3W (30) uses 30 rotations with a pool size of 1000, while RI-S3W (1) and RI-S3W (5) use 1 and 5 rotations, respectively. We train with the Adam (Kinga et al., 2015) optimizer with lr = 0.01 over 500 epochs, and with an additional lr = 0.05 for SSW.

Results. As seen from Table 1 and Figure 10, STSW provides better results in log 2-Wasserstein distance and NLL, while also being efficient in terms of both runtime and convergence speed. We perform additional experiments on the most informative sliced methods, including MAX-STSW, MAX-SSW, and MAX-SW. We present in Table 5 the results after training for 1000 epochs with a learning rate of 0.01. Each experiment is repeated 10 times.
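For concreteness, a minimal sketch of one gradient-flow step is given below. It relies on the stsw_sketch estimator sketched in Appendix B.1 (an assumed helper, not the released code), uses Adam as in the setup above rather than the plain projected update, and combines the particle and target supports as in Appendix A.5.

```python
import torch

def flow_step(x, target, opt, L=200, k=5):
    """One gradient-flow step on S^d: descend STSW(mu_hat, nu), then project
    the particles back onto the sphere. `stsw_sketch` is the estimator
    sketched in Appendix B.1 (illustrative assumption)."""
    n, m = x.shape[0], target.shape[0]
    a = torch.cat([x, target])                                   # combined supports (Appendix A.5)
    u = torch.cat([torch.full((n,), 1.0 / n), torch.zeros(m)])   # mass of mu_hat on combined supports
    v = torch.cat([torch.zeros(n), torch.full((m,), 1.0 / m)])   # mass of nu
    opt.zero_grad()
    loss = stsw_sketch(a, u, v, L=L, k=k)
    loss.backward()
    opt.step()
    with torch.no_grad():                                        # projection step of the update rule above
        x /= x.norm(dim=1, keepdim=True)
    return loss.item()

# Usage: x = normalized torch.randn(2400, 3) with requires_grad_(True),
# opt = torch.optim.Adam([x], lr=0.01), then call flow_step for 500 epochs.
```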
Figure 11 illustrates the log 2-Wasserstein distance between the source and target distributions during training. We observe that MAX-STSW performs better than the others.

[Figure 10: Log 2-Wasserstein distance between source and target distributions.]

Table 5: Learning the target distribution of 12 vMFs, trained for 1000 epochs and averaged over 10 runs.

  Method      log W2 (↓)       NLL (↓)
  MAX-SW      -3.10 ± 0.06     -4959.14 ± 12.22
  MAX-SSW     -2.76 ± 0.02     -4868.78 ± 60.51
  MAX-STSW    -3.19 ± 0.03     -5007.72 ± 16.34

B.5 SELF-SUPERVISED LEARNING

Encoder. Consistent with the setup in (Bonet et al., 2022; Tran et al., 2024b), we train a ResNet18 (He et al., 2016) on CIFAR-10 (Krizhevsky, 2009) for 200 epochs using a batch size of 512. We use SGD as our optimizer with an initial lr = 0.05, a momentum of 0.9, and a weight decay of $10^{-3}$. The standard data augmentations used to generate positive pairs are similar to prior works (Wang & Isola, 2020; Bonet et al., 2022; Tran et al., 2024b) and include resizing, cropping, horizontal flipping, color jittering, and random grayscale conversion.

We set $L = 200$ trees and $k = 20$ lines for STSW and fix $L = 200$ projections for all other sliced distances. $N_R = 5$ rotations and a pool size of 100 are used for RI-S3W and ARI-S3W, as in Tran et al. (2024b). For the settings of the regularization coefficient, please refer to Table 6.

Linear Classifier. A linear classifier is then trained on feature representations from the pre-trained encoder. Similar to Bonet et al. (2022), we train it for 100 epochs using the Adam (Kinga et al., 2015) optimizer with a learning rate of $10^{-3}$, decayed by a factor of 0.2 at epochs 60 and 80.

Results. We report in Table 2 the best accuracy of the linear evaluation on features taken before and after projection on $\mathbb{S}^d$, where $d = 9$. Visualizations of the learned representations for $d = 2$ can be found in Figure 12.

B.6 EARTH DATA ESTIMATION

Similar to Bonet et al. (2022) and Tran et al. (2024b), we use an exponential-map normalizing flow model consisting of 48 radial blocks with 100 components each, totaling 24000 parameters. The model is trained with full-batch gradient descent via the Adam optimizer. Dataset details are provided in Table 7.

Setup. Our settings for STSW in this task are $L = 1000$ trees, $k = 100$ lines, and $\zeta = 100$. We use lr = 0.05 for STSW, S3W, RI-S3W, and ARI-S3W, and lr = 0.1 for SW and SSW. We train the other sliced distances for 20,000 epochs, as in the original setup, while our STSW is trained for only 10,000 epochs.

Results. Table 3 highlights the competitive performance of STSW compared to the baseline methods. To further evaluate the efficiency of our approach, we compare the training time of STSW with that of the second-best performer, ARI-S3W, on the Fire dataset. Our findings show that STSW (2 hours 10 minutes) is twice as fast as ARI-S3W (4 hours 30 minutes). We also present in Figure 13 the normalized density maps learned on the test data.

[Figure 11: Log 2-Wasserstein distance between source and target distributions.]

Table 6: Regularization coefficient $\lambda$ across various methods w.r.t. projection on $\mathbb{S}^d$ in the self-supervised learning task.

         STSW        SSW         SW         S3W variants
  d = 9  λ = 10.0    λ = 20.0    λ = 1.0    λ = 0.5
  d = 2  λ = 10.0    λ = 20.0    λ = 1.0    λ = 0.1
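To indicate how STSW enters the training objective, the following is a minimal sketch of a hypersphere SSL loss in the alignment-plus-uniformity style of Wang & Isola (2020), with the uniformity term replaced by an STSW penalty toward $U(\mathbb{S}^d)$ weighted by the coefficient $\lambda$ of Table 6. The exact loss used in our experiments may differ; stsw_sketch refers to the estimator sketched earlier, and the squared-distance alignment term is an assumption.

```python
import torch

def ssl_loss(z1, z2, lam=10.0, L=200, k=20):
    """z1, z2: (n, d+1) L2-normalized embeddings of two views of the same images.
    Aligns positive pairs and pushes embeddings toward the uniform measure on S^d
    via STSW (illustrative objective; `stsw_sketch` is the assumed estimator above)."""
    align = (z1 - z2).pow(2).sum(dim=1).mean()         # alignment of positive pairs
    ref = torch.randn_like(z1)
    ref = ref / ref.norm(dim=1, keepdim=True)          # samples from U(S^d)
    n = z1.shape[0]
    a = torch.cat([z1, ref])                           # combined supports (Appendix A.5)
    u = torch.cat([torch.full((n,), 1.0 / n), torch.zeros(n)])
    v = torch.cat([torch.zeros(n), torch.full((n,), 1.0 / n)])
    return align + lam * stsw_sketch(a, u, v, L=L, k=k)
```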
B.7 GENERATIVE MODELS

Setup. We use the Adam (Kinga et al., 2015) optimizer with learning rate lr = $10^{-3}$. We train with a batch size of 500 over 100 epochs, using the BCE loss as our reconstruction loss. We choose $L = 200$ trees and $k = 10$ lines for STSW. Following the same settings as in Tran et al. (2024b), we fix $L = 100$ projections for the other methods, $N_R = 5$ rotations for RI-S3W and ARI-S3W, and a pool size of 100 random rotations for ARI-S3W. We use a prior of 10 vMFs, $\lambda = 1$ for STSW, $\lambda = 10$ for SSW, and $\lambda = 10^{-3}$ for SW and the S3W variants.

Additional Results on MNIST. For quantitative analysis, we train the SWAE framework on MNIST and report the FID score in Table 8, along with the generated images in Figure 14. We follow the same settings as in Tran et al. (2024b), which use the latent prior $U(\mathbb{S}^2)$ and train the model with a batch size of 500 over 100 epochs. For STSW, we fix $L = 200$ trees and $k = 10$ lines with a learning rate of 0.01 and $\lambda = 1$. For the other sliced methods, we use $L = 100$ projections and a learning rate of $10^{-3}$, as described in Tran et al. (2024b). The FID scores are computed using 10,000 samples from the test set. We use the same model architecture as specified in Tran et al. (2024b).

CIFAR-10 Model Architecture.

Encoder:
$x \in \mathbb{R}^{3 \times 32 \times 32}$ → Conv2d(32) → ReLU → Conv2d(32) → ReLU → Conv2d(64) → ReLU → Conv2d(64) → ReLU → Conv2d(128) → ReLU → Conv2d(128) → Flatten → FC(512) → ReLU → FC(3) → $\ell_2$ normalization → $z \in \mathbb{S}^2$

Decoder:
$z \in \mathbb{S}^2$ → FC(512) → FC(2048) → ReLU → Reshape($128 \times 4 \times 4$) → ConvTranspose2d(128) → ReLU → ConvTranspose2d(64) → ReLU → ConvTranspose2d(64) → ReLU → ConvTranspose2d(32) → ReLU → ConvTranspose2d(32) → ReLU → ConvTranspose2d(3) → Sigmoid

[Figure 12: Distributions of the CIFAR-10 validation set on $\mathbb{S}^2$ after pre-training, across methods (panels include ARI-S3W, SimCLR, and Hypersphere).]

MNIST Model Architecture.

Encoder:
$x \in \mathbb{R}^{28 \times 28}$ → Conv2d(32) → ReLU → Conv2d(32) → ReLU → Conv2d(64) → ReLU → Conv2d(64) → ReLU → Conv2d(128) → ReLU → Conv2d(128) → Flatten → FC(512) → ReLU → FC(3) → $\ell_2$ normalization → $z \in \mathbb{S}^2$

Decoder:
$z \in \mathbb{S}^2$ → FC(512) → FC(512) → ReLU → Reshape($128 \times 2 \times 2$) → ConvTranspose2d(128) → ReLU → ConvTranspose2d(64) → ReLU → ConvTranspose2d(64) → ReLU → ConvTranspose2d(32) → ReLU → ConvTranspose2d(32) → ReLU → ConvTranspose2d(1) → Sigmoid

[Figure 13: Density estimation on earth data (Fire, Flood, Earthquake). The left figures (ground truth) represent training data estimated with KDE. The right ones depict the normalized log-likelihood of the trained models on test data.]

Table 7: Earth datasets.

             Earthquake   Flood   Fire
  Train      4284         3412    8966
  Test       1836         1463    3843
  Data size  6120         4875    12809

Table 8: Average FID over 5 runs on MNIST.

  Method         FID (↓)
  SW             73.35 ± 2.01
  SSW            76.14 ± 2.73
  S3W            75.55 ± 2.80
  RI-S3W (10)    72.80 ± 3.39
  ARI-S3W (30)   70.37 ± 2.58
  STSW           69.16 ± 2.74

[Figure 14: Generated images of different methods (including RI-S3W (10) and ARI-S3W (30)) on MNIST with SWAE.]
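The layer listings above omit kernel sizes, strides, and paddings. Below is one way to realize the MNIST architecture in PyTorch; the 3x3 kernels, the stride pattern, and the transposed-convolution hyperparameters are illustrative assumptions chosen so that the spatial shapes work out to 28 x 28, not the exact released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTEncoder(nn.Module):
    """Conv stack from the listing above; kernel/stride choices are assumptions."""
    def __init__(self):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        layers = []
        for i, (cin, cout) in enumerate(zip(chans[:-1], chans[1:])):
            stride = 2 if i % 2 == 1 else 1             # 28 -> 14 -> 7 -> 4 spatially
            layers += [nn.Conv2d(cin, cout, 3, stride=stride, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, 3))

    def forward(self, x):
        z = self.fc(self.conv(x))
        return F.normalize(z, dim=1)                    # l2 normalization: z lies on S^2

class MNISTDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(3, 512), nn.Linear(512, 512), nn.ReLU())
        self.deconv = nn.Sequential(                    # spatially: 2 -> 4 -> 7 -> 14 -> 28 -> 28 -> 28
            nn.ConvTranspose2d(128, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, 3, 1, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, 1, 1), nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 128, 2, 2)              # Reshape(128 x 2 x 2)
        return self.deconv(h)

# Shape check: MNISTDecoder()(MNISTEncoder()(torch.randn(2, 1, 28, 28))).shape == (2, 1, 28, 28)
```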