Stereographic Spherical Sliced Wasserstein Distances

Huy Tran* 1  Yikun Bai* 1  Abihith Kothapalli* 1  Ashkan Shahbazi 1  Xinran Liu 1  Rocio Diaz Martin 2  Soheil Kolouri 1

Abstract

Comparing spherical probability distributions is of great interest in various fields, including geology, medical domains, computer vision, and deep representation learning. The utility of optimal transport-based distances, such as the Wasserstein distance, for comparing probability measures has spurred active research in developing computationally efficient variations of these distances for spherical probability measures. This paper introduces a high-speed and highly parallelizable distance for comparing spherical measures using the stereographic projection and the generalized Radon transform, which we refer to as the Stereographic Spherical Sliced Wasserstein (S3W) distance. We carefully address the distance distortion caused by the stereographic projection and provide an extensive theoretical analysis of our proposed metric and its rotationally invariant variation. Finally, we evaluate the performance of the proposed metrics and compare them with recent baselines in terms of both speed and accuracy through a wide range of numerical studies, including gradient flows and self-supervised learning. Our code is available at https://github.com/mint-vu/s3wd.

*Equal contribution. 1Department of Computer Science, Vanderbilt University, Nashville, TN. 2Department of Mathematics, Vanderbilt University, Nashville, TN. Correspondence to: Soheil Kolouri.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Applications involving distributions defined on a hypersphere are remarkably diverse, highlighting the importance of spherical geometries across various disciplines. These applications include: 1) mapping the distribution of geographic or geological features on celestial bodies, such as stars and planets (Jupp, 2020; Cabella & Marinucci, 2009; Perraudin et al., 2019); 2) magnetoencephalography (MEG) imaging (Vrba & Robinson, 2001) in medical domains; 3) spherical image representations and 360° images (Coors et al., 2018; Jiang et al., 2023), such as omnidirectional images in computer vision (Khasanova & Frossard, 2017); 4) texture mapping in computer graphics (Elad et al., 2005; Dominitz & Tannenbaum, 2010); and, more recently, 5) deep representation learning, where the latent representation is often mapped to a bounded space, commonly a sphere, on which cosine similarity is utilized for effective representation learning (Chen et al., 2020; Wang & Isola, 2020).

The analysis of distributions on hyperspheres is traditionally approached through directional statistics, also referred to as circular/spherical statistics (Jammalamadaka et al., 2001; Mardia et al., 2000; Ley & Verdebout, 2017; Pewsey & García-Portugués, 2021). This specialized field is dedicated to the statistical analysis of directions, orientations, and rotations. More recently, with the growing application of optimal transport theory (Villani, 2008; Peyré & Cuturi, 2018) in machine learning, due in part to its favorable statistical, geometrical, and topological properties, there has been an increasing interest in using optimal transport to compare spherical probability measures (Cui et al., 2019; Hamfeldt & Turnquist, 2022).
One of the main bottlenecks in optimal transport theory is its high computational cost, generally of cubic complexity. This has sparked extensive research to develop faster solvers (Cuturi, 2013; Scetbon & Cuturi, 2022; Charikar et al., 2023) or computationally superior equivalent distances (Rabin et al., 2012; Bonneel et al., 2015; Kolouri et al., 2019a). Notably, sliced variations of optimal transport distances, such as sliced Wasserstein distances (Rabin et al., 2012) and their various extensions (Kolouri et al., 2019a; Le et al., 2019; Nguyen et al., 2022), have emerged as effective solutions. These methods use integral geometry and the Radon transform (Helgason et al., 2011) to represent high-dimensional distributions via a set of their one-dimensional marginals. By doing so, they take advantage of more efficient optimal transport solvers designed for one-dimensional probability measures, offering a pragmatic approach to comparing probability measures. Owing to their computational efficiency and implementation simplicity, these methods have recently been adapted to spherical measures, leading to the development of spherical sliced optimal transport methods (Bonet et al., 2023a; Quellmalz et al., 2023). This comes as part of a broader effort to extend sliced Wasserstein distances to arbitrary manifolds; in this regard, we also highlight extensions of these methods to measures supported on compact manifolds (Rustamov & Majumdar, 2023), hyperbolic spaces (Bonet et al., 2023b), and symmetric positive definite matrices (Bonet et al., 2023c).

The main challenge in developing spherical sliced optimal transport methods lies in extending the classical Radon transform to its spherical counterparts. In the context of sliced optimal transport, such an extension must: 1) map probability measures on the hypersphere to a family of probability measures on one-dimensional domains (e.g., $\mathbb{R}$ or $\mathbb{S}^1$), and 2) be injective so that a sliced distance can be defined (otherwise, one obtains only a pseudo-metric). These requirements rule out many of the existing extensions of the Radon transform to the sphere, e.g., the classic Funk–Radon transform, which takes integrals along all great circles (Helgason et al., 2011; Quellmalz, 2020). Recently, Quellmalz et al. (2023) proposed using two such spherical extensions of the Radon transform, namely the vertical slice transform (Gindikin et al., 1994b) and a normalized version of the semicircle transform (Groemer, 1998), to define sliced optimal transport on the sphere. Notably, the semicircle transform was also used in (Bonet et al., 2023a) to define a spherical sliced Wasserstein discrepancy for empirical probability measures.

The recent works on sliced spherical optimal transport (Bonet et al., 2023a; Quellmalz et al., 2023) map a distribution defined on a hypersphere into its marginal distributions on a unit circle, thus requiring circular optimal transport to compare these marginals. Importantly, the calculation of optimal transport between two one-dimensional measures defined on a circle is more expensive (requiring an additional binary search) than when the measures are defined on the real line (Martin et al., 2023; Hundrieser et al., 2022). Motivated by this observation, in this paper we explore a spherical Radon transform aimed at converting a spherical distribution into one-dimensional marginals along the real line.
We utilize the stereographic projection, which is a smooth bijection, in composition with an injective (nonlinear) map to transform the hypersphere into a hyperplane, where we then apply the classic Radon transform. The key advantage of the injective map is its ability to manage the distortion of distances introduced by the stereographic projection. We support our proposed method with detailed theoretical analysis and comprehensive numerical studies.

Contributions. Our specific contributions are as follows:

- Introducing a computationally efficient transport distance, the Stereographic Spherical Sliced Wasserstein distance (S3W), for spherical probability measures.
- Providing a rotationally invariant variation of the proposed distance, the Rotationally Invariant Stereographic Spherical Sliced Wasserstein distance (RI-S3W).
- Offering theoretical analysis of the proposed distances.
- Demonstrating the performance, both in terms of speed and accuracy, of the proposed distances in diverse applications, including gradient flows on the sphere, representation learning with Sliced-Wasserstein Autoencoders (SWAEs), spherical density estimation via normalizing flows, sliced-Wasserstein variational inference on the sphere, and self-supervised learning.

2. Background

2.1. Stereographic Projection

Stereographic projection is a mathematical technique for mapping points on a sphere onto a plane. Originating in Greek astronomy, it projects points from a sphere, typically from one of its poles, onto a plane tangent to the opposite pole or onto the equatorial plane. This projection is conformal, preserving angles and shapes locally, making it significant in fields like cartography, complex analysis, and computer graphics. It elegantly translates spherical geometry into planar terms, maintaining the intricate relationships between points and angles found on the sphere's surface.

Let $\mathbb{S}^d$ denote the d-dimensional sphere in $\mathbb{R}^{d+1}$, defined as $\mathbb{S}^d := \{s \in \mathbb{R}^{d+1} : \|s\|_2 = 1\}$. The stereographic projection $\phi : \mathbb{S}^d \setminus \{s_n\} \to \mathbb{R}^d$ maps a point $s \in \mathbb{S}^d$ (excluding the North Pole, $s_n = [0, \dots, 0, 1]$) to a point $x \in \mathbb{R}^d$ by the formula:

$[x]_i = \frac{2[s]_i}{1 - [s]_{d+1}}, \quad \text{for } i = 1, 2, \dots, d.$ (1)

This projection is a bijective and smooth mapping between $\mathbb{S}^d \setminus \{s_n\}$ and $\mathbb{R}^d$, i.e., the hyperplane tangent to the sphere at the South Pole, $s_{d+1} = -1$, and it is commonly used as a way to visualize spherical geometries in Euclidean space.
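To make Eq. (1) concrete, the following is a minimal NumPy sketch of the projection and its inverse (the inverse is obtained by solving Eq. (1) for s). The clamping of the last coordinate only bounds the denominator; it is our own simplification of the ϵ-cap stabilization discussed later in Section 4.

```python
import numpy as np

def stereographic_projection(s, eps=1e-6):
    """Map rows of s on S^d (minus an eps-cap at the north pole) to R^d.

    Implements Eq. (1): [x]_i = 2 [s]_i / (1 - [s]_{d+1}), i.e., projection
    from the north pole onto the hyperplane tangent at the south pole.
    """
    s = np.asarray(s, dtype=float)
    last = np.clip(s[:, -1], -1.0, 1.0 - eps)  # crude eps-cap for stability
    return 2.0 * s[:, :-1] / (1.0 - last)[:, None]

def inverse_stereographic_projection(x):
    """Inverse map R^d -> S^d \\ {s_n}, recovered by solving Eq. (1) for s."""
    x = np.asarray(x, dtype=float)
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    return np.concatenate([4.0 * x / (sq + 4.0), (sq - 4.0) / (sq + 4.0)], axis=1)
```

As a quick sanity check, `inverse_stereographic_projection(stereographic_projection(s))` returns `s` (up to floating-point error) for any point away from the north pole.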
2.2. Radon Transform

The Radon transform is a fundamental tool in integral geometry, widely used for image reconstruction, particularly in computed tomography (CT). It transforms a function defined on the d-dimensional Euclidean space $\mathbb{R}^d$ into an infinite set of its one-dimensional slices, i.e., its integrals over hyperplanes parametrized by unit vectors $\theta \in \mathbb{S}^{d-1}$. The Radon transform is instrumental in medical imaging and has recently attracted ample attention from the machine learning community for its favorable characteristics for measuring distances between probability measures (Bonneel et al., 2015; Kolouri et al., 2016a; 2019a) and, more generally, positive measures (Bai et al., 2023; Séjourné et al., 2023), as well as, more recently, in the theoretical analysis of deep neural networks (Parhi & Nowak, 2023; Unser, 2023).

Figure 1. Depiction of the stereographic projection from $\mathbb{S}^2 \setminus \{s_n\}$ to $\mathbb{R}^2$ (a); the stereographic Radon transform integration surfaces on the sphere, i.e., the level sets of $\langle \phi(x), \theta \rangle$ for a fixed $\theta \in \mathbb{R}^d$ (b); and the generalized stereographic Radon transform integration surfaces on the sphere, i.e., the level sets of $\langle h(\phi(x)), \theta \rangle$ for a fixed $\theta \in \mathbb{R}^{d'}$ (c).

More formally, the classic Radon transform, denoted by $\mathcal{R}$, maps a function f in the Lebesgue space $L^1(\mathbb{R}^d)$ (the space of integrable functions on $\mathbb{R}^d$) to its integrals over the hyperplanes of $\mathbb{R}^d$. Formally, it is defined as:

$\mathcal{R}f(t, \theta) := \int_{\mathbb{R}^d} f(x)\, \delta(t - \langle x, \theta \rangle)\, dx,$ (2)

for $(t, \theta) \in \mathbb{R} \times \mathbb{S}^{d-1}$, where $\delta(\cdot)$ is the one-dimensional Dirac delta function and $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. Here, $\mathcal{R}$ maps from $L^1(\mathbb{R}^d)$ to $L^1(\mathbb{R} \times \mathbb{S}^{d-1})$. Each hyperplane, $H(t, \theta) = \{x \in \mathbb{R}^d \mid \langle x, \theta \rangle = t\}$, corresponds to a level set of the function $g : \mathbb{R}^d \times \mathbb{S}^{d-1} \to \mathbb{R}$ defined as $g(x, \theta) = \langle x, \theta \rangle$. For a fixed θ, the set of all integrals over the hyperplanes orthogonal to θ yields the continuous function $\mathcal{R}f(\cdot, \theta) : \mathbb{R} \to \mathbb{R}$, representing a projection or slice of f. Notably, the Radon transform $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$ is a linear bijection with a closed-form inversion formula. See the Appendix for details.

2.3. Generalized Radon Transform

The Generalized Radon Transform (GRT) extends the foundational concept of the classic Radon transform, introduced by Radon (1917), from integration over hyperplanes in $\mathbb{R}^d$ to integration over more complex structures, namely hypersurfaces or (d−1)-dimensional manifolds. This broader scope of the GRT has been developed and explored in various works (Beylkin, 1984; Denisyuk, 1994; Ehrenpreis, 2003; Gel'fand et al., 1969; Kuchment, 2006; Homan & Zhou, 2017), with applications ranging from impedance and acoustic tomography (Kuchment, 2006) to machine learning (Kolouri et al., 2019a).

The GRT of a function $f \in L^1(\mathbb{R}^d)$ involves the integration of f over hypersurfaces in $\mathbb{R}^d$. These hypersurfaces are defined as the level sets of a defining function (Kolouri et al., 2019a), $g : \mathbb{R}^d \times (\mathbb{R}^{d'} \setminus \{0\}) \to \mathbb{R}$, characterized by $H_{t,\theta} = \{x \in \mathbb{R}^d \mid g(x, \theta) = t\}$. The GRT of f, denoted $\mathcal{G}f$, is formally given by:

$\mathcal{G}f(t, \theta) := \int_{\mathbb{R}^d} f(x)\, \delta(t - g(x, \theta))\, dx,$ (3)

where δ represents the Dirac delta function, enabling integration over the specific level sets defined by g. Note that $g(x, \theta) = \langle x, \theta \rangle$ recovers the classic Radon transform.

The injectivity of the GRT is essential for defining distances between measures through their generalized slices. The choice of g determines whether the GRT is injective. Kolouri et al. (2019a) enumerate a set of necessary conditions on g for constructing a bijective GRT. However, these conditions are not sufficient and are limited in practical utility, as they do not provide guidance on crafting specific defining functions that would ensure injectivity. To achieve a practical injective GRT, Chen et al. (2022) recently introduced a variation of the GRT. They posited that by setting $g(x, \theta) = \langle h(x), \theta \rangle$ for an injective function $h : \mathbb{R}^d \to \mathbb{R}^{d'}$, one effectively applies the classic Radon transform, which is bijective, to the image of h. This approach leads to an injective GRT, thus resolving the issue of invertibility. For consistency of notation with Chen et al. (2022), we denote this variation of the GRT as:

$\mathcal{H}f(t, \theta) := \int_{\mathbb{R}^d} f(x)\, \delta(t - \langle h(x), \theta \rangle)\, dx,$ (4)

where $\mathcal{H} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d'-1})$. Importantly, a neural network can effectively parametrize the injective function h. For instance, h could be a normalizing flow (Kobyzev et al., 2020). Alternatively, to avoid the complexities associated with normalizing flows, one can define h(x) as $[x^T, \rho(x)^T]^T$, i.e., the concatenation of x and ρ(x), where $\rho : \mathbb{R}^d \to \mathbb{R}^{d'-d}$ is any nonlinear function parametrized by a neural network of choice (Chen et al., 2022). In our experiments, we will use $\mathcal{H}$ as the default GRT due to its injectivity.
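As an illustration of the slicing in Eq. (4), the sketch below builds an injective $h(x) = [x^T, \rho(x)^T]^T$ using a fixed random-feature nonlinearity as a hypothetical stand-in for a trained network (injectivity holds because x itself is kept in the concatenation), and computes the slice coordinates $\langle h(x), \theta \rangle$; their empirical law is exactly the slice $\mathcal{H}(\mu)_\theta$ of an empirical measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_h(d, d_extra, rng):
    """Injective h(x) = [x, rho(x)] with a fixed random-feature rho.

    rho is a stand-in for a trained network; any map R^d -> R^{d'-d}
    works here, since injectivity comes from concatenating x unchanged.
    """
    W = rng.normal(size=(d, d_extra))
    return lambda x: np.concatenate([x, np.tanh(x @ W)], axis=1)

def grt_slices(x, h, thetas):
    """Slice coordinates <h(x), theta>: column l holds the 1D samples
    whose empirical law is H(mu)_{theta_l}, cf. Eq. (4)."""
    return h(x) @ thetas.T  # shape (N, L)

d, d_prime = 3, 5
h = make_h(d, d_prime - d, rng)
x = rng.normal(size=(100, d))
thetas = rng.normal(size=(50, d_prime))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
slices = grt_slices(x, h, thetas)
```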
2.4. GRT of Radon Measures

Skipping much of the theoretical detail (we refer the reader to the appendix), for a Radon measure $\mu \in \mathcal{M}(\mathbb{R}^d)$, its Generalized Radon Transform $\mathcal{G}(\mu) = \nu$ with respect to the defining function g is defined as the measure $\nu \in \mathcal{M}(\mathbb{R} \times \mathbb{S}^{d'-1})$ such that for each $\psi \in C_0(\mathbb{R} \times \mathbb{S}^{d'-1})$,

$\int_{\mathbb{R} \times \mathbb{S}^{d'-1}} \psi(t, \theta)\, d\nu(t, \theta) = \int_{\mathbb{R}^d} (\mathcal{G}^*(\psi))(x)\, d\mu(x),$ (5)

where $\mathcal{G}^*$ is the dual operator (aka adjoint operator), which, for any $\psi \in L^\infty(\mathbb{R} \times \mathbb{S}^{d'-1})$, is defined as

$\mathcal{G}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} \psi(g(x, \theta), \theta)\, d\sigma_{d'}(\theta), \quad \forall x \in \mathbb{R}^d,$ (6)

where $\sigma_{d'}$ is the uniform probability measure on $\mathbb{S}^{d'-1}$. Importantly, the dual GRT operator satisfies $\mathcal{G}(\mu)(\psi) = \mu(\mathcal{G}^*(\psi))$. Lastly, with a slight abuse of notation, we denote the corresponding slice for $\theta \in \mathbb{S}^{d'-1}$ as $\mathcal{G}(\mu)_\theta = g(\cdot, \theta)_\# \mu \in \mathcal{M}(\mathbb{R})$. Note that $f_\# \mu$ denotes the pushforward of the measure µ with respect to f, defined by $f_\# \mu(A) = \mu(f^{-1}(A))$. When µ is a positive or a probability measure on $\mathbb{R}^d$, then $\mathcal{G}(\mu)_\theta$ is a positive/probability measure on $\mathbb{R}$.

2.5. Wasserstein and Sliced Wasserstein Distances

Let M denote a Riemannian manifold endowed with the distance $d(\cdot, \cdot) : M \times M \to \mathbb{R}_+$. For $1 \le p < \infty$, let $\mu, \nu \in P_p(M) := \{\mu \in P(M) \mid \int_M d^p(x, x_0)\, d\mu(x) < \infty \text{ for some } x_0 \in M\}$ be two probability measures defined on M with a finite p-th moment. The optimal transport (OT) problem (Villani, 2008) then seeks to transport the mass of µ onto ν such that the expected traversed distance is minimized. This leads to the p-Wasserstein distance:

$W_p^p(\mu, \nu) := \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d^p(x, y)\, d\gamma(x, y),$ (7)

where $\Gamma(\mu, \nu)$ denotes the set of joint probability measures $\gamma \in P(M \times M)$ with marginals µ and ν. Unfortunately, for discrete probability measures with N particles, solving (7) generally has $O(N^3 \log N)$ complexity. However, for $\mu, \nu \in P_p(\mathbb{R})$ the problem can be solved in $O(N \log N)$:

$W_p^p(\mu, \nu) = \int_0^1 \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right|^p dt,$ (8)

where $F_\mu^{-1}$ and $F_\nu^{-1}$ are the quantile functions of µ and ν. Notably, similarly efficient solvers have been developed for $\mu, \nu \in P_p(\mathbb{S}^1)$ (Delon et al., 2010; Hundrieser et al., 2022; Bonet et al., 2023a).

For an injective generalized Radon transform $\mathcal{G}$, such that for $\mu \in P_p(M)$ we have $\mathcal{G}(\mu)_\theta \in P_p(\mathbb{R})$, the generalized sliced Wasserstein distance (Rabin et al., 2012; Kolouri et al., 2019a) can be defined as:

$SW_{\mathcal{G},p}^p(\mu, \nu) := \int_{\mathbb{S}^{d'-1}} W_p^p\big(\mathcal{G}(\mu)_\theta, \mathcal{G}(\nu)_\theta\big)\, d\sigma_{d'}(\theta),$ (9)

where $\sigma_{d'} \in P(\mathbb{S}^{d'-1})$ is a probability measure possessing a non-zero density on the sphere $\mathbb{S}^{d'-1}$, often simply the uniform measure.

Recently, Bonet et al. (2023a) introduced the Spherical Sliced Wasserstein distance, which involves projecting spherical measures onto great circles, resulting in $\mathcal{G}(\mu)_\theta \in P_p(\mathbb{S}^1)$, and utilizing circular OT to measure distances between these slices. Notably, circular OT still necessitates solving an optimization problem to register cumulative distribution functions on a circle (i.e., finding an optimal cut). This results in a slower solver compared to OT on $\mathbb{R}$.
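Eq. (8) is what makes slicing fast in practice: for two empirical measures on $\mathbb{R}$ with the same number of uniformly weighted samples, the quantile matching reduces to sorting. A minimal sketch:

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """W_p between two empirical measures on R with equal sample counts
    and uniform weights. By Eq. (8), the optimal 1D coupling is monotone:
    match order statistics, i.e., the empirical quantile functions."""
    u, v = np.sort(u), np.sort(v)
    return float(np.mean(np.abs(u - v) ** p) ** (1.0 / p))
```

The cost is dominated by the two sorts, i.e., $O(N \log N)$, in contrast to the $O(N^3 \log N)$ cost of the general problem in Eq. (7).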
3. Stereographic Spherical Sliced Wasserstein Distance

Based on the extended definition of the GRT for probability measures in Section 2.4, we now formally introduce the Stereographic Spherical Radon Transform and the corresponding sliced Wasserstein distance, the Stereographic Spherical Sliced Wasserstein distance, for spherical probability measures.

3.1. Stereographic Spherical Radon Transform

Let $\mu \in \mathcal{M}(\mathbb{S}^d)$ denote a Radon measure defined on $\mathbb{S}^d$ that does not assign mass to the North Pole, i.e., $\mu(\{s_n\}) = 0$. We denote the stereographic projection by $\phi : \mathbb{S}^d \setminus \{s_n\} \to \mathbb{R}^d$, which is a bijection, and $\phi_\# \mu$ is then a Radon measure defined on $\mathbb{R}^d$. The Stereographic Spherical Radon transform of µ is defined as

$\mathcal{SR}(\mu) := \mathcal{R}(\phi_\# \mu) \in \mathcal{M}(\mathbb{R} \times \mathbb{S}^{d-1}),$ (10)

and we also define its generalized version as

$\mathcal{SG}(\mu) := \mathcal{G}(\phi_\# \mu) \in \mathcal{M}(\mathbb{R} \times \mathbb{S}^{d'-1}).$ (11)

Similarly, we define $\mathcal{SH}(\mu) := \mathcal{H}(\phi_\# \mu)$, which provides an invertible transformation from $\mathcal{M}(\mathbb{S}^d)$ to $\mathcal{M}(\mathbb{R} \times \mathbb{S}^{d'-1})$.

Proposition 3.1. For $\mu \in \mathcal{M}(\mathbb{S}^d)$ that does not give mass to the North Pole $\{s_n\}$, the Stereographic Spherical Radon transforms $\mathcal{SG}$ and $\mathcal{SH}$ satisfy the following properties:

(1) $\mathcal{SG}(\mu), \mathcal{SH}(\mu) \in \mathcal{M}(\mathbb{R} \times \mathbb{S}^{d'-1})$. In addition, $\mathcal{SG}$ and $\mathcal{SH}$ preserve mass, and if µ is a positive measure, then $\mathcal{SG}(\mu)$ and $\mathcal{SH}(\mu)$ are positive measures too. Finally, if $\mu \in P(\mathbb{S}^d \setminus \{s_n\})$, then $\mathcal{SG}(\mu)$ and $\mathcal{SH}(\mu)$ are probability measures defined on $\mathbb{R} \times \mathbb{S}^{d'-1}$.

(2) The disintegration theorem gives a ($\mathcal{SG}(\mu)$-a.s. unique) set of measures $(\mathcal{SG}(\mu)_\theta)_{\theta \in \mathbb{S}^{d'-1}} \subset \mathcal{M}(\mathbb{R})$, and a ($\mathcal{SH}(\mu)$-a.s. unique) set of measures $(\mathcal{SH}(\mu)_\theta)_{\theta \in \mathbb{S}^{d'-1}} \subset \mathcal{M}(\mathbb{R})$, such that for any $\psi \in C_0(\mathbb{R} \times \mathbb{S}^{d'-1})$,

$\int_{\mathbb{R} \times \mathbb{S}^{d'-1}} \psi(t, \theta)\, d\,\mathcal{SG}(\mu)(t, \theta) = \int_{\mathbb{S}^{d'-1}} \int_{\mathbb{R}} \psi(t, \theta)\, d\,\mathcal{SG}(\mu)_\theta(t)\, d\sigma_{d'}(\theta),$ (12)

and similarly for $\mathcal{SH}$. Then, it holds that

$\mathcal{SG}(\mu)_\theta = (g(\cdot, \theta) \circ \phi)_\# \mu,$ (13)

and the same holds for $\mathcal{SH}$.

(3) $\mathcal{SH}$ is invertible.

The proof of this proposition is included in the appendix Section D. Being equipped with the proposed stereographic spherical Radon transform, we are now ready to define our proposed distance.

3.2. Stereographic Spherical Sliced Wasserstein

Let $\mu, \nu \in P_p(\mathbb{S}^d)$ denote two probability measures on the unit sphere in $\mathbb{R}^{d+1}$. We introduce the formal definition of the Stereographic Spherical Sliced Wasserstein (S3W) distances as follows:

$S3W_{\mathcal{G},p}^p(\mu, \nu) := \int_{\mathbb{S}^{d'-1}} W_p^p\big(\mathcal{SG}(\mu)_\theta, \mathcal{SG}(\nu)_\theta\big)\, d\sigma_{d'}(\theta).$

In this context, $\sigma_{d'} \in P(\mathbb{S}^{d'-1})$ typically represents a probability measure possessing a non-zero density on the sphere $\mathbb{S}^{d'-1} \subset \mathbb{R}^{d'}$. However, for the sake of simplicity and in line with common practice, we opt to consider the uniform measure on $\mathbb{S}^{d'-1}$. Similarly, we can define $S3W_{\mathcal{H},p}^p$ by integrating $W_p^p(\mathcal{SH}(\mu)_\theta, \mathcal{SH}(\nu)_\theta)$.

Theorem 3.2. The proposed $S3W_{\mathcal{G},p}(\cdot, \cdot)$ and $S3W_{\mathcal{H},p}(\cdot, \cdot)$ are well-defined. Furthermore, $S3W_{\mathcal{G},p}(\cdot, \cdot)$ is in general a pseudo-metric on $P_p(\mathbb{S}^d \setminus \{s_n\})$, i.e., it is non-negative, symmetric, and satisfies the triangle inequality. In addition, $S3W_{\mathcal{H},p}(\cdot, \cdot)$ defines a metric on $P_p(\mathbb{S}^d \setminus \{s_n\})$.

The proof is in the appendix Section D.

3.3. Distance Distortion

The stereographic projection, while conformal (i.e., angle-preserving), severely distorts distances. To demonstrate the extent of this distortion, consider the points $s = (\epsilon, 0, \dots, 0, \sqrt{1-\epsilon^2})$ and $s' = (-\epsilon, 0, \dots, 0, \sqrt{1-\epsilon^2})$. Then, as $\epsilon \to 0$, we have $\arccos(\langle s, s' \rangle) \to 0$, while $\|\phi(s) - \phi(s')\| \to \infty$! This distortion implies that the transportation cost after the stereographic projection can be significantly different from the transportation cost on the sphere, making the naive application of this projection with optimal transport unsuitable.
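This blow-up is easy to observe numerically; a short sketch using Eq. (1) (the printed values are only illustrative):

```python
import numpy as np

# As eps -> 0, the arc length between s and s' vanishes while the
# Euclidean distance between their stereographic images diverges (~8/eps).
for eps in [1e-1, 1e-2, 1e-3]:
    s  = np.array([ eps, 0.0, np.sqrt(1 - eps ** 2)])
    sp = np.array([-eps, 0.0, np.sqrt(1 - eps ** 2)])
    arc = np.arccos(np.clip(s @ sp, -1.0, 1.0))
    x, xp = 2 * s[:2] / (1 - s[2]), 2 * sp[:2] / (1 - sp[2])  # Eq. (1)
    print(f"eps={eps:.0e}  arc={arc:.4f}  ||phi(s)-phi(s')||={np.linalg.norm(x - xp):.1f}")
```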
In this section, we aim to construct an injective function h that closely approximates the arclength on the sphere $\mathbb{S}^d$ by the Euclidean distance in the embedded space. Specifically, we seek to satisfy:

$\|h(\phi(s)) - h(\phi(s'))\| \approx \arccos(\langle s, s' \rangle), \quad \forall s, s' \in \mathbb{S}^d.$

We introduce two variants of h: an analytic function $h_1(\cdot)$ and a neural-network-based learnable function $h_{NN}(\cdot)$, both providing a nearly isometric embedding from $\phi(s)$.

We start by defining the analytic function:

$h_1(x) := \arccos\!\left(\frac{4 - \|x\|^2}{4 + \|x\|^2}\right) \frac{x}{\|x\|}, \quad x \in \mathbb{R}^d,$ (14)

and provide the following proposition.

Proposition 3.3. For $s_0, s, s' \in \mathbb{S}^d$, where $s_0$ denotes the South Pole, the stereographic projection φ, and $h_1$ as defined in Eq. (14), we have:

$h_1(\phi(s)) = \angle(s, s_0)\, \frac{s_{[1:d]}}{\|s_{[1:d]}\|}.$

Thus, we have $\|h_1(\phi(s)) - h_1(\phi(s'))\| \le 2\pi$. Moreover, the following inequality holds:

$\arccos(\langle s, s' \rangle) \le \|h_1(\phi(s)) - h_1(\phi(s'))\| \le \arccos(\langle s, s' \rangle) + \epsilon(s, s'),$

where the left equality holds when s, s', and $s_0$ lie on the same great circle, and $\epsilon(s, s') \to 0$ as $d_{\mathbb{S}^d}(s, s') \to 0$. The proof is in the appendix Section H.

Next, we consider training a neural network to obtain a nearly isometric Euclidean embedding. To do so, we define

$h_{NN}(x) := [h_1^T(x)/C,\ \rho^T(x)]^T,$ (15)

where $\rho : \mathbb{R}^d \to \mathbb{R}^{d'-d}$ is a neural network and $C \ge 2\pi$ is a constant. We first note that such an $h_{NN} : \mathbb{R}^d \to \mathbb{R}^{d'}$ is injective; we then train ρ by minimizing:

$\mathcal{L}(\rho) = \mathbb{E}_{s, s'}\!\left[\big(\arccos(\langle s, s' \rangle) - \|h_{NN}(\phi(s)) - h_{NN}(\phi(s'))\|\big)^2\right],$ (16)

where s and s' are sampled according to the uniform distribution on the sphere $\mathbb{S}^d \subset \mathbb{R}^{d+1}$, i.e., $(s, s') \sim \sigma_{d+1} \otimes \sigma_{d+1}$.

Figure 2 illustrates the arclength versus the distance in the embedding for random pairs of samples $s, s' \in \mathbb{S}^d$ under the various scenarios proposed in this section.

Figure 2. Spherical distance (i.e., the arclength) $d_{\mathbb{S}^d}(s, s') = \arccos(\langle s, s' \rangle)$ versus the distance after stereographic projection, where CC denotes Pearson's correlation coefficient. From left to right: the injective function is $h = \mathrm{id}$ and the distance is $\|\phi(s) - \phi(s')\|$ (a, CC = 0.224); $h(x) = h_1(x)$ (see Eq. (14)) and the distance is $\|h(\phi(s)) - h(\phi(s'))\|$ (b, CC = 0.632); $h(x) = h_1(x)$ and the distance is $\min(\|h(\phi(s)) - h(\phi(s'))\|,\ 2\pi - \|h(\phi(s)) - h(\phi(s'))\|)$ (c, CC = 0.824); and $h(x) = h_{NN}(x)$ (see Eq. (15)), where ρ(x) is a trained neural network minimizing Eq. (16) and $C \ge 2\pi$ (d, CC = 0.976).

It is evident that an injective function $h_{NN}$ parameterized with a neural network can yield a nearly isometric embedding. We discuss the use of this neural-network-based $h_{NN}$ further in the appendix Section H.1. To avoid any potential confusion, we emphasize that irrespective of the specific choice of h, as long as it maintains injectivity, $S3W_{\mathcal{H},p}$ remains a valid metric on $P_p(\mathbb{S}^d)$. The discussion in this section is, however, significant, particularly when we aim to ensure that the transportation cost in the embedding closely resembles the spherical distance.
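A NumPy sketch of the analytic map, written to be consistent with the factor-2 projection of Eq. (1) and with Proposition 3.3 (our reconstruction of Eq. (14); the clamp guards arccos against round-off):

```python
import numpy as np

def h1(x):
    """Analytic near-isometric map of Eq. (14): ||h1(phi(s))|| equals the
    angle between s and the south pole, and its direction follows x."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    angle = np.arccos(np.clip((4.0 - sq) / (4.0 + sq), -1.0, 1.0))
    return angle * x / np.maximum(np.sqrt(sq), 1e-12)

# Quick check: embedded distance vs. arc length for a random pair
# (uses stereographic_projection from the sketch after Eq. (1)).
rng = np.random.default_rng(0)
s = rng.normal(size=(2, 3)); s /= np.linalg.norm(s, axis=1, keepdims=True)
emb = h1(stereographic_projection(s))
print(np.linalg.norm(emb[0] - emb[1]),
      np.arccos(np.clip(s[0] @ s[1], -1.0, 1.0)))
```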
3.4. Rotationally Invariant Extension of S3W

Rotational symmetry is a significant property when dealing with probability metrics on the sphere. This symmetry implies that the probability distribution remains unchanged under rotations, making it a key consideration in spherical statistics and related applications. It simplifies calculations and enhances our understanding of the underlying processes on a spherical surface. Notably, spherical OT leads to a rotationally symmetric metric on $P_p(\mathbb{S}^d)$. However, our proposed Stereographic Spherical Sliced Wasserstein (S3W) metric is not rotationally invariant. Here, we propose a rotationally invariant variation of S3W, which leads to a robust and easy-to-implement rotationally invariant metric. Let $SO(d+1)$ denote the special orthogonal group in $\mathbb{R}^{d+1}$, and let $R \in SO(d+1)$ denote a rotation matrix. For $\mu \in P_p(\mathbb{S}^d)$ we denote the rotated measure as $R_\# \mu$.

Given probability measures $\mu, \nu \in P_p(\mathbb{S}^d)$, we define the rotationally invariant extension of S3W as:

$\text{RI-S3W}_{\mathcal{G},p}(\mu, \nu) := \mathbb{E}_{R \sim \omega}\big[S3W_{\mathcal{G},p}(R_\# \mu, R_\# \nu)\big],$ (17)

where ω denotes the normalized Haar measure on $SO(d+1)$ (i.e., the uniform probability measure, which is left-rotation-invariant¹). We similarly define $\text{RI-S3W}_{\mathcal{H},p}$ using $S3W_{\mathcal{H},p}$.

¹ The Haar measure on a locally compact group G is the unique, up to positive constants, left-translation-invariant regular Borel measure on G. If $G = SO(d+1)$, the group translations are rotations. Here, we normalize it to obtain a probability measure.

Theorem 3.4. $\text{RI-S3W}_{\mathcal{G},p}(\cdot, \cdot)$ and $\text{RI-S3W}_{\mathcal{H},p}(\cdot, \cdot)$ are well-defined. Furthermore, $\text{RI-S3W}_{\mathcal{G},p}(\cdot, \cdot)$ is in general a pseudo-metric on $P_p(\mathbb{S}^d)$, i.e., it is non-negative, symmetric, and satisfies the triangle inequality. In addition, $\text{RI-S3W}_{\mathcal{H},p}(\cdot, \cdot)$ defines a metric on $P_p(\mathbb{S}^d)$.

The proof is included in the appendix Section E.

4. Numerical Implementation Details

Stereographic Projection (SP). A key issue in implementing the SP for S3W concerns the numerical handling of points near the North Pole. To ensure numerical stability, we introduce an ϵ-cap around the North Pole, which serves to effectively establish an upper bound for the norm of the projected points. Specifically, any point with $x_{d+1} > 1 - \epsilon$ is first mapped to the cap boundary $x_{d+1} = 1 - \epsilon$ and then projected using SP. We refer to this modified SP as $\phi_\epsilon$. We discuss the stability of this ϵ-cap in the appendix Section I.3.

The S3W Distances. Let $\phi_\epsilon$ denote the Stereographic Projection operator that excludes the ϵ-cap around $s_n \in \mathbb{S}^d$, $h : \mathbb{R}^d \to \mathbb{R}^{d'}$ an injective defining function, L the number of slices, and $\theta_l \in \mathbb{S}^{d'-1}$ a slicing direction. For simplicity, we initially consider two data distributions with an equal number of samples and uniform mass distribution (the general case involving different numbers of samples and non-uniform mass distribution is discussed in the appendix). Let $\hat\mu = \frac{1}{M} \sum_{m=1}^{M} \delta_{x_m}$ and $\hat\nu = \frac{1}{M} \sum_{m=1}^{M} \delta_{y_m}$ denote the empirical distributions, where $\delta_{x_m}$ denotes a Dirac measure centered at $x_m$. Then, $S3W_{\mathcal{H},p}^p(\hat\mu, \hat\nu)$ can be approximated via the following Monte Carlo estimator:

$S3W_{\mathcal{H},p}^p(\hat\mu, \hat\nu) \approx \frac{1}{LM} \sum_{l=1}^{L} \sum_{m=1}^{M} \left| \big\langle h(\phi_\epsilon(x_{\pi_l[m]})), \theta_l \big\rangle - \big\langle h(\phi_\epsilon(y_{\pi'_l[m]})), \theta_l \big\rangle \right|^p,$

where $\pi_l[m]$ and $\pi'_l[m]$ are the sorted indices of the projected samples along $\theta_l$ (see Algorithm 1 for the procedure).

Algorithm 1 S3W
Input: $\{x_i\}_{i=1}^{M} \sim \mu$, $\{y_j\}_{j=1}^{M} \sim \nu$, L projections, order p, ϵ for excluding the ϵ-cap around $s_n$.
Initialize: h (injective map), $\{\theta_l\}_{l=1}^{L}$ (projection directions).
Compute $\{u_i = h(\phi_\epsilon(x_i))\}$ and $\{v_j = h(\phi_\epsilon(y_j))\}$.
Initialize distance d = 0.
for l = 1 to L do
  Compute $u_i^l = \langle u_i, \theta_l \rangle$, $v_j^l = \langle v_j, \theta_l \rangle$.
  Sort $\{u_i^l\}$ and $\{v_j^l\}$ such that $u^l_{\pi_l[i]} \le u^l_{\pi_l[i+1]}$ and $v^l_{\pi'_l[j]} \le v^l_{\pi'_l[j+1]}$.
  $d \leftarrow d + \frac{1}{LM} \sum_{i=1}^{M} \big| u^l_{\pi_l[i]} - v^l_{\pi'_l[i]} \big|^p$
end for
Return $d^{1/p}$

The RI-S3W Distances. The Monte Carlo approximation of the RI-S3W distances can be written as:

$\text{RI-S3W}_p(\hat\mu, \hat\nu) \approx \frac{1}{N_R} \sum_{n=1}^{N_R} S3W_p\big((R_n)_\# \hat\mu, (R_n)_\# \hat\nu\big),$

where $\{R_n\}_{n=1}^{N_R} \subset SO(d+1)$ are random rotation matrices. To generate $R_n$, we adopt the GeoTorch library (Lezcano-Casado, 2019), which provides a direct method to sample from $SO(d+1)$. Our approach is efficient with vectorization and parallel processing on GPU. We note that generating $R_n$ is generally $O(N_R \cdot d^3)$, which can become expensive for high-dimensional data or a large number of rotations. Instead, we can amortize this cost by pre-sampling a rotation pool, which is then subsampled for every distance calculation. We denote this implementation as ARI-S3W; it trades memory for speed and potentially increases bias. In practice, we observe highly favorable performance in a variety of settings. We discuss this further in the appendix Section I.
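Putting the pieces together, here is a NumPy sketch of Algorithm 1 and of the rotation-averaged estimator. It reuses `stereographic_projection` and `h1` from the earlier sketches; the paper samples $SO(d+1)$ with GeoTorch, and we substitute the standard QR-based Haar sampler here.

```python
import numpy as np

def s3w(x, y, h, L=200, p=2, eps=1e-6, rng=None):
    """Monte Carlo S3W_p between two empirical spherical measures with
    equal sample counts (a sketch of Algorithm 1; h is any injective map,
    e.g., h1 from Section 3.3)."""
    rng = np.random.default_rng() if rng is None else rng
    u = h(stereographic_projection(x, eps))
    v = h(stereographic_projection(y, eps))
    thetas = rng.normal(size=(L, u.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    pu = np.sort(u @ thetas.T, axis=0)  # per-slice order statistics
    pv = np.sort(v @ thetas.T, axis=0)
    return float(np.mean(np.abs(pu - pv) ** p) ** (1.0 / p))

def haar_rotation(dim, rng):
    """Haar-distributed rotation via QR of a Gaussian matrix (a standard
    substitute for GeoTorch's SO(dim) sampler)."""
    Q, R = np.linalg.qr(rng.normal(size=(dim, dim)))
    Q *= np.sign(np.diag(R))        # makes the distribution Haar on O(dim)
    if np.linalg.det(Q) < 0:
        Q[:, [0, 1]] = Q[:, [1, 0]]  # restrict to SO(dim)
    return Q

def ri_s3w(x, y, h, n_rot=5, seed=0, **kw):
    """RI-S3W: average S3W over rotations applied to BOTH measures."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rot):
        R = haar_rotation(x.shape[1], rng)  # one rotation per term
        total += s3w(x @ R.T, y @ R.T, h, rng=rng, **kw)
    return total / n_rot
```

An amortized (ARI-S3W) variant would pre-sample a fixed pool of rotations once and draw `n_rot` of them per call instead of regenerating them.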
Computational Complexity. Stereographic projection requires $O(Nd)$, where $N = n + m$ is the total number of data points from the source and target, and $d+1$ is the data dimensionality. If $n \approx m$, we let $N = n$, and the same analysis holds. Applying $h(\cdot) = h_1(\cdot)$ to the projected data also requires $O(Nd)$. Thereafter, slicing the data is done in $O(LNd)$, where L is the number of projections; sorting is done in $O(LN \log N)$; and finally, the distance calculation requires $O(LN)$. The overall time complexity for S3W is therefore $O(LN(d + \log N))$.

For RI-S3W, the cost of calculating S3W for all rotations is $O(N_R L N (d + \log N))$. Generating the random rotations takes an overhead of $O(N_R d^3)$ (which can be amortized), and applying these rotations to the data takes $O(N_R N d^2)$. The total complexity is $O(N_R (d^3 + N d^2 + L N (d + \log N)))$. If we amortize the generation of the rotation matrices, the per-operation cost becomes $O(N_R N (d^2 + L d + L \log N))$.

Figure 3. Runtime comparison, as a function of the number of samples in each distribution (10² to 10⁵), for the Wasserstein distance; the Sinkhorn distance (Cuturi, 2013) with the geodesic distance as the cost function; the SW2 (sliced Wasserstein) distance; the SSW1 distance (using the level-median formula) (Bonet et al., 2023a); the SSW2 distance with binary search (BS) and with the antipodal closed form for the uniform distribution (Bonet et al., 2023a); and our S3W2, RI-S3W2, and ARI-S3W2 distances.

Runtime Comparison. We compare the runtime required to compute the different distances between the uniform distribution and a von Mises–Fisher distribution on $\mathbb{S}^{100}$. The results in Figure 3 are averaged over 50 iterations for varying sample sizes of each distribution. We use L = 200 projections for all sliced methods, $N_R = 10$ rotations for ARI-S3W2 and RI-S3W2, and a pool size of 100 for ARI-S3W2. We do not include the time required to generate the rotation pool in our ARI-S3W2 measurements. Results for Wasserstein and Sinkhorn are based on the Python OT library (Flamary et al., 2021).

5. Experiments

Here, we present key results from our numerical experiments. We defer much of the visualization, discussion, and further results to the appendix Section I. All our experiments were executed on a Linux server with an AMD EPYC 7713 64-Core Processor, 8×32GB DIMM DDR4 at 3200 MHz, and an NVIDIA RTX A6000 GPU.

5.1. Gradient Flow on the Sphere

Similar to the work of Bonet et al. (2023a), we apply our proposed distances as a loss function for the gradient flow problem. We consider a challenging target probability measure ν given by a mixture of 12 von Mises–Fisher distributions (vMFs), and aim to solve $\arg\min_\mu S3W(\mu, \nu)$. Suppose we have access to the target measure only via i.i.d. samples $\{y_j\}_{j=1}^{M}$, i.e., $\hat\nu = \frac{1}{M} \sum_{j=1}^{M} \delta_{y_j}$, where M = 2400. We initialize 2400 particles uniformly sampled on $\mathbb{S}^2$ and directly optimize these particles with full-batch projected gradient descent on the surface of the sphere.
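A compact PyTorch sketch of this experiment follows, with a differentiable version of the S3W estimator (torch sorts propagate gradients through the sorted values). The analytic $h_1$ is used for the embedding; the vMF mixture target is replaced by a crude Gaussian-perturbation stand-in, so the details differ from the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def s3w_loss(xs, ys, L=200, p=2, eps=1e-6):
    """Differentiable S3W_p^p estimator (sketch)."""
    def proj(s):                                   # Eq. (1) with the eps-cap
        last = s[:, -1].clamp(max=1 - eps)
        return 2 * s[:, :-1] / (1 - last).unsqueeze(1)
    def h1(x):                                     # analytic map of Eq. (14)
        sq = (x ** 2).sum(1, keepdim=True)
        ang = torch.arccos(((4 - sq) / (4 + sq)).clamp(-1, 1))
        return ang * x / sq.sqrt().clamp(min=1e-12)
    u, v = h1(proj(xs)), h1(proj(ys))
    th = torch.randn(L, u.shape[1], device=u.device)
    th = th / th.norm(dim=1, keepdim=True)
    pu, _ = torch.sort(u @ th.T, dim=0)
    pv, _ = torch.sort(v @ th.T, dim=0)
    return (pu - pv).abs().pow(p).mean()

# Crude stand-in for the 12-vMF target: noisy copies of 12 random centers.
centers = F.normalize(torch.randn(12, 3), dim=1)
y = F.normalize(centers.repeat_interleave(200, 0) + 0.1 * torch.randn(2400, 3), dim=1)

x = F.normalize(torch.randn(2400, 3), dim=1).requires_grad_()
opt = torch.optim.SGD([x], lr=0.01)
for _ in range(500):
    opt.zero_grad()
    s3w_loss(x, y).backward()
    opt.step()
    with torch.no_grad():                          # project back onto S^2
        x.data = F.normalize(x.data, dim=1)
```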
Figure 4 shows the converged loss curves after 500 iterations and reports the runtime, the negative log-likelihood (NLL), and the logarithm of the 2-Wasserstein distance between the distributions for SSW, S3W, RI-S3W with $N_R \in \{1, 5\}$, and ARI-S3W with $N_R = 30$ and a pool size of 1000.

Figure 4. Learning a mixture of 12 vMFs. ARI-S3W (30) uses 30 rotations and a pool size of 1000. The S3W variants use LR = 0.01; SSW is additionally run with LR = 0.05 for a fairer comparison. The plots show the convergence of the different distances with respect to iterations and runtime. The table summarizes numerical results over 10 independent runs. We provide more details of the plots in the appendix Section I.4.1.

Method | Runtime (s) | NLL | log W2
SSW (LR=0.01) | 94.97 ± 0.96 | −4974.50 ± 2.27 | −3.27 ± 0.17
SSW (LR=0.05) | 94.98 ± 0.27 | −4974.46 ± 3.50 | −3.32 ± 0.13
S3W | 3.32 ± 0.62 | −4753.81 ± 83.70 | −2.57 ± 0.19
RI-S3W (1) | 4.00 ± 0.83 | −4957.22 ± 34.70 | −3.15 ± 0.26
RI-S3W (5) | 10.58 ± 0.27 | −4983.56 ± 6.62 | −3.50 ± 0.17
ARI-S3W (30) | 42.88 ± 0.05 | −5025.37 ± 5.57 | −4.37 ± 0.21

We observe that the proposed distances provide on-par or better performance while being significantly faster than SSW (up to 20×). Additionally, we provide mini-batch results in the appendix Section I.2. When the target is only known up to a constant, we use sliced-Wasserstein variational inference (Yi & Liu, 2022) (see the appendix, Section I.8).

5.2. Self-Supervised Learning (SSL)

We now show that our method can be an effective loss for contrastive SSL on the sphere. We adopt the contrastive objective proposed in (Wang & Isola, 2020), composed of alignment and uniformity loss terms, and replace the Gaussian-kernel uniformity loss with an S3W-based loss:

$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \big\| z_i^A - z_i^B \big\|_2^2 + \frac{\lambda}{2} \big( \mathrm{S3W}_2(z^A, \nu) + \mathrm{S3W}_2(z^B, \nu) \big),$

where $z^A, z^B \in \mathbb{R}^{n \times (d+1)}$ are two encoded views of the same images, $\nu = \mathrm{Unif}(\mathbb{S}^d)$ is the uniform distribution on $\mathbb{S}^d$, and $\lambda > 0$ is the regularization coefficient.

We run our experiments on CIFAR-10 using a ResNet18 encoder. Figure 5 visualizes the learned embeddings when d = 2, and Table 1 assesses the quality of the learned embeddings for d = 9 using the standard linear-classifier evaluation. The details of these experiments are included in the appendix Section I.7.

Figure 5. Projected features on $\mathbb{S}^2$ for CIFAR-10.

Table 1. Standard linear evaluation on CIFAR-10. Accuracy is reported for encoded (E) and projected (P) features on $\mathbb{S}^9$ (the encoder feature dimension is 10). Time reported is per epoch of pretraining.

Method | Acc. (%) E / P | Time (s/ep.)
Supervised | 92.38 / 91.77 | —
(Wang & Isola, 2020) | 79.76 / 74.57 | 24.28
SimCLR (Chen et al., 2020) | 79.69 / 72.78 | 20.94
SSW (Bonet et al., 2023a) | 70.46 / 64.52 | 33.14
SW | 74.45 / 68.35 | 21.09
S3W | 78.54 / 73.84 | 21.36
RI-S3W (5) | 79.97 / 74.27 | 21.59
ARI-S3W (5) | 79.92 / 75.07 | 21.51

We can see that the proposed metrics consistently lead to embeddings that are competitive both in terms of performance and runtime.
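To make the objective above concrete, a minimal sketch of the per-batch loss, assuming the differentiable `s3w_loss` from the gradient-flow sketch in Section 5.1 (the uniform target is represented here by fresh random points on the sphere, a simple substitute for a fixed prior sample):

```python
import torch
import torch.nn.functional as F

def ssl_loss(zA, zB, lam=1.0):
    """Alignment + S3W-uniformity objective of Section 5.2 (sketch).

    zA, zB: the two encoded views of the batch, rows on S^d.
    """
    align = (zA - zB).pow(2).sum(dim=1).mean()      # squared-L2 alignment
    nu = F.normalize(torch.randn_like(zA), dim=1)   # samples ~ Unif(S^d)
    unif = s3w_loss(zA, nu) + s3w_loss(zB, nu)      # uniformity terms
    return align + 0.5 * lam * unif
```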
5.3. Sliced-Wasserstein Autoencoder (SWAE)

We now adopt the SWAE framework proposed by Kolouri et al. (2019b) and demonstrate the application of the S3W distances in generative modeling. Let $\varphi : X \to \mathbb{S}^d$ denote an encoder network and $\psi : \mathbb{S}^d \to X$ denote a decoder network. The SWAE framework aims to enforce that the encoded data follow a prior distribution in the latent space. In our experiments, we use a mixture of vMF distributions with 10 components on $\mathbb{S}^2$ as our prior distribution, which we denote as q. Then, the training objective for the modified SWAE is:

$\min_{\varphi, \psi}\ \mathbb{E}_{x \sim p}\big[c\big(x, \psi(\varphi(x))\big)\big] + \lambda\, \mathrm{S3W}(\varphi_\# p, q),$

where λ is the regularization coefficient, $c(\cdot, \cdot)$ is the reconstruction loss, for which we use the standard Binary Cross-Entropy (BCE) loss, and p denotes the data distribution. Details about the network architectures and results on the MNIST benchmark can be found in the appendix Section I.9.

Table 2. CIFAR-10 results for SWAE. We evaluate the latent regularization (log W2 and NLL), along with the BCE reconstruction loss, on the test data.

Method | log W2 | NLL | BCE | Time (s/ep.)
Supervised | −0.5132 | 0.0060 | 0.6319 | 5.3243
SSW | −2.1949 | 0.0052 | 0.6323 | 15.4651
SW | −3.3229 | −0.0007 | 0.6348 | 5.4661
S3W | −3.3381 | 0.0025 | 0.6318 | 5.7511
RI-S3W (5) | −3.1424 | −0.0043 | 0.6376 | 7.5443
ARI-S3W (5) | −3.3853 | 0.0028 | 0.6332 | 5.8316

5.4. Earth Density Estimation

We extend our S3W distances to the task of density estimation with normalizing flows on $\mathbb{S}^2$. Our focus is on the three datasets introduced by Mathieu & Nickel (2020), which represent the Earth's surface as a perfect spherical manifold: Earthquake (NOAA, 2022), Flood (Brakenridge, 2017), and Fire (EOSDIS, 2020). Similar to Bonet et al. (2023a), we use an exponential-map normalizing flow model (Rezende et al., 2020) (see the appendix Section I.6), optimizing $\min_T \mathrm{S3W}(T_\# \nu, q)$, where T is the transformation induced by the model, ν is the data distribution known via samples $\{y_j\}_{j=1}^{M}$, and q is the prior distribution on $\mathbb{S}^2$. The learned density at $y \in \mathbb{S}^2$ can then be approximated as $f_\nu(y) = q(T(y))\, |\det J_T(y)|$, where $J_T(y)$ denotes the Jacobian of T at y.

Table 3. Earth datasets results. We use a pool size of 100 for ARI-S3W. Stereo denotes the approach introduced by Gemici et al. (2016), which involves stereographically projecting the sphere $\mathbb{S}^d$ onto $\mathbb{R}^d$ and then applying RealNVP (Dinh et al., 2016).

Method | Quake | Flood | Fire
Stereo | 2.04 ± 0.19 | 1.85 ± 0.03 | 1.34 ± 0.11
SW | 1.12 ± 0.07 | 1.58 ± 0.02 | 0.55 ± 0.18
SSW | 0.84 ± 0.05 | 1.26 ± 0.03 | 0.24 ± 0.18
S3W | 0.88 ± 0.09 | 1.33 ± 0.05 | 0.36 ± 0.04
RI-S3W (1) | 0.79 ± 0.07 | 1.25 ± 0.02 | 0.15 ± 0.06
ARI-S3W (50) | 0.78 ± 0.06 | 1.24 ± 0.04 | 0.10 ± 0.04
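For completeness, the change-of-variables formula $f_\nu(y) = q(T(y))\,|\det J_T(y)|$ above can be evaluated for a generic differentiable map $T : \mathbb{S}^2 \to \mathbb{S}^2$ by taking the Jacobian determinant between tangent planes. This is a hypothetical autograd sketch of our own (the exponential-map flows used here admit cheaper closed-form Jacobians); note that a naive ambient 3×3 determinant would vanish, since the map is constrained to the sphere.

```python
import torch
import torch.nn.functional as F

def tangent_basis(s):
    """Orthonormal basis (3x2) of the tangent plane of S^2 at s."""
    a = (torch.tensor([1.0, 0.0, 0.0]) if s[0].abs() < 0.9
         else torch.tensor([0.0, 1.0, 0.0]))
    e1 = F.normalize(a - (a @ s) * s, dim=0)
    e2 = torch.linalg.cross(s, e1)
    return torch.stack([e1, e2], dim=1)

def log_density(y, T, log_q):
    """log f_nu(y) = log q(T(y)) + log |det J_T(y)|, with the determinant
    restricted to the tangent planes at y and T(y)."""
    J = torch.autograd.functional.jacobian(T, y)   # (3, 3) ambient Jacobian
    E, Ft = tangent_basis(y), tangent_basis(T(y))
    det = torch.det(Ft.T @ J @ E).abs()
    return log_q(T(y)) + torch.log(det + 1e-12)
```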
6. Conclusion

We introduced a new class of sliced Wasserstein (SW) distances for spherical data using the stereographic projection (SP). We rigorously addressed the distortion issue caused by SP and presented several high-speed, high-performing variants of our approach. S3W maps data to a generalized hypersurface with minimal distortion (via composing SP with a novel injective function) and efficiently computes SW, over an order of magnitude faster than existing baselines in many settings. RI-S3W encodes rotation invariance into S3W, further boosting its performance. Although RI-S3W is highly parallelizable, we achieved additional efficiency with our implementation of amortization (denoted ARI-S3W). Given that our approach uses the stereographic projection to map the hypersphere into Euclidean space, there may exist novel extensions of our slicing framework to unbalanced settings by leveraging recent advancements in unbalanced and partial OT on $\mathbb{R}$ (Bai et al., 2023; Séjourné et al., 2023). Moreover, our approach opens up interesting directions for bridging the spherical manifold and classical SW literature via pushforwards to appropriate hypersurfaces. Lastly, we highlight recent advancements in improving the projection complexity of sliced Wasserstein distances (Deshpande et al., 2019; Nguyen et al., 2020; Nguyen & Ho, 2022; Nguyen et al., 2022; Nguyen & Ho, 2024; Nguyen et al., 2024). These strategies are compatible with our proposed S3W distances and can be integrated with our method to further enhance the projection efficiency of our metric in comparing high-dimensional spherical measures.

Acknowledgements

This research was supported by the NSF CAREER Award No. 2339898. We are grateful to Clément Bonet for providing insights into the SSW implementation.

Impact Statement

The work presented in this paper aims to advance the field of machine learning, particularly the supplementary theoretical developments and explorations of computational optimal transport. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abouelaz, A. and Daher, R. Sur la transformation de Radon de la sphère S^d. Bulletin de la Société Mathématique de France, 121(3):353–382, 1993. URL http://eudml.org/doc/87670.

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Bai, Y., Schmitzer, B., Thorpe, M., and Kolouri, S. Sliced optimal partial transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13681–13690, 2023.

Beylkin, G. The inversion problem and applications of the generalized Radon transform. Communications on Pure and Applied Mathematics, 37(5):579–599, 1984.

Bonet, C., Berg, P., Courty, N., Septier, F., Drumetz, L., and Pham, M. T. Spherical sliced-Wasserstein. In The Eleventh International Conference on Learning Representations, 2023a.

Bonet, C., Chapel, L., Drumetz, L., and Courty, N. Hyperbolic sliced-Wasserstein via geodesic and horospherical projections. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 334–370. PMLR, 2023b.

Bonet, C., Malézieux, B., Rakotomamonjy, A., Drumetz, L., Moreau, T., Kowalski, M., and Courty, N. Sliced-Wasserstein on symmetric positive definite matrices for M/EEG signals. In International Conference on Machine Learning, pp. 2777–2805. PMLR, 2023c.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

Borradaile, G. Statistics of Earth Science Data, 2003.

Brakenridge, G. Global active archive of large flood events. Dartmouth Flood Observatory, University of Colorado, 2017. URL http://floodobservatory.colorado.edu/Archives/index.html.

Cabella, P. and Marinucci, D. Statistical challenges in the analysis of cosmic microwave background radiation. Ann. Appl. Stat., 3(1):61–95, 2009.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.

Charikar, M., Chen, B., Ré, C., and Waingarten, E. Fast algorithms for a new relaxation of optimal transport. In The Thirty Sixth Annual Conference on Learning Theory, pp. 4831–4862. PMLR, 2023.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Chen, X., Yang, Y., and Li, Y. Augmented sliced Wasserstein distances. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=iMqTLyfwnOO.

Coors, B., Condurache, A. P., and Geiger, A.
SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 518–533, 2018.

Cui, L., Qi, X., Wen, C., Lei, N., Li, X., Zhang, M., and Gu, X. Spherical optimal transportation. Computer-Aided Design, 115:181–193, 2019.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

De Cao, N. and Aziz, W. The power spherical distribution. arXiv preprint arXiv:2006.04437, 2020.

Delon, J., Salomon, J., and Sobolevski, A. Fast transport optimization for Monge costs on the circle. SIAM Journal on Applied Mathematics, 70(7):2239–2258, 2010.

Denisyuk, A. Inversion of the generalized Radon transform. Translations of the American Mathematical Society Series 2, 162:19–32, 1994.

Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648–10656, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Dominitz, A. and Tannenbaum, A. Texture mapping via optimal mass transport. IEEE Transactions on Visualization and Computer Graphics, 16(3):419–433, 2010. doi: 10.1109/TVCG.2009.64.

Doucet, A., De Freitas, N., Gordon, N. J., et al. Sequential Monte Carlo Methods in Practice. Springer, 2001.

Ehrenpreis, L. The Universality of the Radon Transform. Oxford University Press on Demand, 2003.

Elad, A., Keller, Y., and Kimmel, R. Texture mapping via spherical multi-dimensional scaling. In Scale Space and PDE Methods in Computer Vision: 5th International Conference, Scale-Space 2005, Hofgeismar, Germany, April 7–9, 2005. Proceedings 5, pp. 443–455. Springer, 2005.

EOSDIS. Land, Atmosphere Near real-time Capability for EOS (LANCE) system operated by NASA's Earth Science Data and Information System (ESDIS). https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data, 2020.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

Gel'fand, I. M., Graev, M. I., and Shapiro, Z. Y. Differential forms and integral geometry. Functional Analysis and its Applications, 3(2):101–114, 1969.

Gemici, M. C., Rezende, D., and Mohamed, S. Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.

Gindikin, S., Reeds, J., and Shepp, L. Spherical tomography and spherical integral geometry. Tomography, Impedance Imaging, and Integral Geometry, 30:83–92, 1994a.

Gindikin, S., Reeds, J., and Shepp, L. Spherical tomography and spherical integral geometry.
In Tomography, Impedance Imaging, and Integral Geometry: 1993 AMS-SIAM Summer Seminar on the Mathematics of Tomography, Impedance Imaging, and Integral Geometry, June 7–18, 1993, Mount Holyoke College, Massachusetts, volume 30, pp. 83. American Mathematical Society, 1994b.

Groemer, H. On a spherical integral transformation and sections of star bodies. Monatshefte für Mathematik, 126(2):117–124, 1998.

Hamfeldt, B. F. and Turnquist, A. G. A convergence framework for optimal transport on the sphere. Numerische Mathematik, 151(3):627–657, 2022.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Helgason, S. et al. Integral Geometry and Radon Transforms. Springer, 2011.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.

Hielscher, R., Potts, D., and Quellmalz, M. An SVD in spherical surface wave tomography. New Trends in Parameter Identification for Mathematical Models, pp. 121–144, 2018.

Homan, A. and Zhou, H. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515–1529, 2017.

Hundrieser, S., Klatt, M., and Munk, A. The statistics of circular optimal transport. In Directional Statistics for Innovative Applications: A Bicentennial Tribute to Florence Nightingale, pp. 57–82. Springer, 2022.

Jammalamadaka, S. R., Sengupta, A., and Sengupta, A. Topics in Circular Statistics, volume 5. World Scientific, 2001.

Jiang, S., Li, Y., Weng, D., You, K., and Chen, W. 3D reconstruction of spherical images: A review of techniques, applications, and prospects. arXiv preprint arXiv:2302.04495, 2023.

Jupp, P. Some applications of directional statistics to astronomy. In Multivariate Statistics and Matrices in Statistics: Proceedings of the 5th Tartu Conference, Tartu–Pühajärve, Estonia, 23–28 May 1994, pp. 123. Walter de Gruyter GmbH & Co KG, 2020.

Khasanova, R. and Frossard, P. Graph-based classification of omnidirectional images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 869–878, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979, 2020.

Kolouri, S., Park, S. R., and Rohde, G. K. The Radon cumulative distribution transform and its application to image classification. Image Processing, IEEE Transactions on, 25(2):920–934, 2016a.

Kolouri, S., Zou, Y., and Rohde, G. K. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4876–4884, 2016b.

Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019a.

Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=H1xaJn05FQ.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

Kuchment, P. Generalized transforms of Radon type and their applications.
In Proceedings of Symposia in Applied Mathematics, volume 63, pp. 67, 2006.

Le, T., Yamada, M., Fukumizu, K., and Cuturi, M. Tree-sliced variants of Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

Ley, C. and Verdebout, T. Modern Directional Statistics. CRC Press, 2017.

Lezcano-Casado, M. Trivializations for gradient-based optimization on manifolds. In Advances in Neural Information Processing Systems, NeurIPS, pp. 9154–9164, 2019.

Mardia, K. V., Jupp, P. E., and Mardia, K. Directional Statistics, volume 2. Wiley Online Library, 2000.

Martin, R. D., Medri, I., Bai, Y., Liu, X., Yan, K., Rohde, G. K., and Kolouri, S. LCOT: Linear circular optimal transport. arXiv preprint arXiv:2310.06002, 2023.

Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503–2515, 2020.

Nguyen, K. and Ho, N. Amortized projection optimization for sliced Wasserstein generative models. Advances in Neural Information Processing Systems, 35:36985–36998, 2022.

Nguyen, K. and Ho, N. Energy-based sliced Wasserstein distance. Advances in Neural Information Processing Systems, 36, 2024.

Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2020.

Nguyen, K., Ren, T., Nguyen, H., Rout, L., Nguyen, T. M., and Ho, N. Hierarchical sliced Wasserstein distance. In The Eleventh International Conference on Learning Representations, 2022.

Nguyen, K., Ren, T., and Ho, N. Markovian sliced Wasserstein distances: Beyond independent projections. Advances in Neural Information Processing Systems, 36, 2024.

NOAA. Global Significant Earthquake Database, 2020. URL https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153. National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Earthquake Database, NOAA National Centers for Environmental Information.

NOAA. NCEI/WDS Global Significant Earthquake Database. https://www.ncei.noaa.gov/access/metadata/landingpage/bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153, 2022.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Parhi, R. and Nowak, R. D. Deep learning meets sparse regularization: A signal processing perspective. arXiv preprint arXiv:2301.09554, 2023.

Perraudin, N., Defferrard, M., Kacprzak, T., and Sgier, R. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130–146, 2019.

Pewsey, A. and García-Portugués, E. Recent advances in directional statistics. Test, 30(1):1–58, 2021.

Peyré, G. and Cuturi, M. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.

Quellmalz, M. A generalization of the Funk–Radon transform. Inverse Problems, 33(3):035016, 2017.

Quellmalz, M. The Funk–Radon transform for hyperplane sections through a common point. Analysis and Mathematical Physics, 10(3):38, 2020.

Quellmalz, M., Beinert, R., and Steidl, G. Sliced optimal transport on the sphere. arXiv preprint arXiv:2304.09092, 2023.

Quinto, E. T. Null spaces and ranges for the classical and spherical Radon transforms. Journal of Mathematical Analysis and Applications, 90(2):408–420, 1982.

Rabin, J., Peyré, G., Delon, J., and Bernot, M.
Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pp. 435–446. Springer, 2012.

Radon, J. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Berichte Sächsische Akad. Wissenschaft., Math. Phys. Klasse, 69:262, 1917.

Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M., Kanwar, G., Shanahan, P., and Cranmer, K. Normalizing flows on tori and spheres. In International Conference on Machine Learning, pp. 8083–8092. PMLR, 2020.

Rubin, B. Introduction to Radon Transforms, volume 160. Cambridge University Press, 2015.

Rubin, B. On the spherical slice transform. Analysis and Applications, 20(03):483–497, 2022.

Rustamov, R. M. and Majumdar, S. Intrinsic sliced Wasserstein distances for comparing collections of probability distributions on manifolds and graphs. In International Conference on Machine Learning, 2023.

Scetbon, M. and Cuturi, M. Low-rank optimal transport: Approximation, statistics and debiasing. Advances in Neural Information Processing Systems, 35:6802–6814, 2022.

Séjourné, T., Bonet, C., Fatras, K., Nadjahi, K., and Courty, N. Unbalanced optimal transport meets sliced-Wasserstein. arXiv preprint arXiv:2306.07176, 2023.

Snyder, J. P. Flattening the Earth: Two Thousand Years of Map Projections. University of Chicago Press, 1997.

Unser, M. Ridges, neural networks, and the Radon transform. Journal of Machine Learning Research, 24(37):1–33, 2023.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Vrba, J. and Robinson, S. E. Signal processing in magnetoencephalography. Methods, 25(2):249–271, 2001.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. PMLR, 2020.

Wang, X., Lei, Q., and Panageas, I. Fast convergence of Langevin dynamics on manifold: Geodesics meet log-Sobolev. Advances in Neural Information Processing Systems, 33:18894–18904, 2020.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018. doi: 10.1109/CVPR.2018.00393.

Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805, 2018.

Yi, M. and Liu, S. Sliced Wasserstein variational inference. In Asian Conference on Machine Learning, 2022.

A. Notation

- $(\mathbb{R}^d, \|\cdot\|)$: d-dimensional Euclidean space, where $\|\cdot\|$ is the Euclidean norm (or 2-norm): given $x = [x_1, \dots, x_d] \in \mathbb{R}^d$, $\|x\| = \sqrt{x_1^2 + \cdots + x_d^2}$. Sometimes we will write $\|\cdot\|_2$ to stress that we are considering this 2-norm.
- $\langle \cdot, \cdot \rangle$: canonical inner product in Euclidean spaces.
- $L^p(\Omega)$, where $p \ge 1$ and $\Omega \subseteq \mathbb{R}^d$: the function space $L^p(\Omega) := \{f : \Omega \to \mathbb{R} \mid \int_\Omega |f|^p < \infty\}$, endowed with the norm $\|f\|_{L^p(\Omega)} = (\int_\Omega |f|^p)^{1/p}$. In the extreme case $p = \infty$, we have $L^\infty(\Omega) := \{f : \Omega \to \mathbb{R} \mid \sup |f| < \infty\}$. (18) In our case, Ω will be $\mathbb{R}^d$, $\mathbb{R} \times \mathbb{S}^{d-1}$, $\mathbb{R} \times (\mathbb{R}^{d'} \setminus \{0\})$, or $\mathbb{S}^d$.
- $C(X)$: space of real-valued continuous functions defined on the space X.
- $C_0(\mathbb{R}^d)$: set of continuous functions that vanish at infinity.
- $C^\infty$: class of infinitely differentiable functions.
- $\mathbb{S}^{d-1}$: the unit sphere in $\mathbb{R}^d$, defined as $\mathbb{S}^{d-1} := \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$.
- $\sigma_d$: the uniform probability measure on the sphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$.
- $s_n = [0, \dots, 0, 1] \in \mathbb{R}^{d+1}$ is the North Pole and $s_0 = [0, \dots, 0, -1] \in \mathbb{R}^{d+1}$ is the South Pole in $\mathbb{S}^d \subset \mathbb{R}^{d+1}$.
- $\phi$: stereographic projection (SP).
- $\phi_\epsilon$: stereographic projection operator whose domain excludes the ϵ-cap around $s_n$ in the sphere $\mathbb{S}^d$.
- $v^T$: transpose of the vector v.
- $H(t, \theta) = \{x \in \mathbb{R}^d \mid \langle x, \theta \rangle = t\}$: hyperplane.
- $H_{t,\theta} = \{x \in \mathbb{R}^d \mid g(x, \theta) = t\}$: level set of $g : \mathbb{R}^d \times \mathbb{S}^{d'-1} \to \mathbb{R}$ at level t with fixed spherical variable θ.
- $\mathcal{R}(\cdot)$: Radon transform, with $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$.
- $\mathcal{G}(\cdot)$: Generalized Radon Transform (GRT), with $\mathcal{G}(f)(t, \theta) = \int_{\mathbb{R}^d} f(x)\, \delta(t - g(x, \theta))\, dx$. In this formulation, $d' \ge d$ in general, and $g : \mathbb{R}^d \times (\mathbb{R}^{d'} \setminus \{0\}) \to \mathbb{R}$ is a function that satisfies the following:
  (H.1) $g(x, \theta)$ is a $C^\infty$ function on $\mathbb{R}^d \times (\mathbb{R}^{d'} \setminus \{0\})$;
  (H.2) $g(x, \theta)$ is homogeneous of degree one in θ, i.e., $g(x, \lambda\theta) = \lambda g(x, \theta)$ for all $\lambda \in \mathbb{R}$;
  (H.3) $g(x, \theta)$ is non-degenerate with respect to x, in the sense that $\frac{\partial g}{\partial x}(x, \theta) \neq 0$ for all $(x, \theta) \in \mathbb{R}^d \times (\mathbb{R}^{d'} \setminus \{0\})$;
  (H.4) the mixed Hessian of g is strictly positive, i.e., $\det\left(\frac{\partial^2 g}{\partial x_i \partial \theta_j}\right) > 0$.
  By property (H.2), $\mathcal{G} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d'-1})$. We refer to (Beylkin, 1984; Denisyuk, 1994; Gel'fand et al., 1969; Ehrenpreis, 2003; Kuchment, 2006; Homan & Zhou, 2017; Kolouri et al., 2019a) for more details.
- $\mathcal{H}(\cdot)$: a variant and simplified version of the Generalized Radon Transform, defined by $\mathcal{H} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d'-1})$, $\mathcal{H}(f)(t, \theta) = \int_{\mathbb{R}^d} f(x)\, \delta(t - \langle h(x), \theta \rangle)\, dx$, where $h : \mathbb{R}^d \to \mathbb{R}^{d'}$ is injective.
- $h_1$: the function defined by (14).
- $(t, \theta)$: the first and second inputs of the (generalized) Radon transform, where $t \in \mathbb{R}$ and the variable θ lies in a pre-defined sphere: $\theta \in \mathbb{S}^{d-1}$ for the classic Radon transform ($\mathcal{R}$), and $\theta \in \mathbb{S}^{d'-1}$ for the generalized cases $\mathcal{G}$ and $\mathcal{H}$.
- $\mathcal{R}^*, \mathcal{G}^*, \mathcal{H}^*$: the Hermitian adjoint operators of $\mathcal{R}, \mathcal{G}, \mathcal{H}$. In particular, $\mathcal{R}^* : L^\infty(\mathbb{R} \times \mathbb{S}^{d-1}) \to L^\infty(\mathbb{R}^d)$, and $\mathcal{G}^*, \mathcal{H}^* : L^\infty(\mathbb{R} \times \mathbb{S}^{d'-1}) \to L^\infty(\mathbb{R}^d)$, with
  $\mathcal{R}^*(\psi)(x) = \int_{\mathbb{S}^{d-1}} \psi(\langle x, \theta \rangle, \theta)\, d\sigma_d(\theta)$,
  $\mathcal{G}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} \psi(g(x, \theta), \theta)\, d\sigma_{d'}(\theta)$,
  $\mathcal{H}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} \psi(\langle h(x), \theta \rangle, \theta)\, d\sigma_{d'}(\theta)$.
- $\mathcal{M}(\Omega)$: set of all real Radon measures (finite regular Borel measures, not necessarily positive; that is, it includes signed measures) defined on Ω. In this article, Ω can be $\mathbb{R}^d$, $\mathbb{R} \times \mathbb{S}^{d-1}$, or $\mathbb{R} \times \mathbb{S}^{d'-1}$. $\mathcal{M}(\Omega)$ is endowed with the total variation norm $\|\mu\|_{TV} = \mu^+(\Omega) + \mu^-(\Omega)$, where $\mu^\pm$ are the positive and negative parts of µ.
- $\mathcal{M}^+(\Omega)$: set of all positive Radon measures defined on Ω, endowed with the total variation norm $\|\mu\|_{TV} = \mu(\Omega)$.
- $P(\Omega)$: set of all probability measures defined on Ω (i.e., positive measures with $\|\mu\|_{TV} = \mu(\Omega) = 1$).
- $P_p(\Omega)$: set of all probability measures defined on Ω with finite p-th moment, that is, $P_p(\Omega) = \{\mu \in P(\Omega) : \int_\Omega \|x\|^p\, d\mu < \infty\}$. We have the inclusions $P_p(\Omega) \subset P(\Omega) \subset \mathcal{M}^+(\Omega) \subset \mathcal{M}(\Omega)$.
- $T_\# \mu$: push-forward of the measure $\mu \in \mathcal{M}(X)$ by the measurable map $T : X \to Y$, which defines a measure in $\mathcal{M}(Y)$ such that $T_\# \mu(B) = \mu(\{x \in X : T(x) \in B\})$ for every measurable set $B \subseteq Y$.
- $W_p$: p-Wasserstein distance.
- $\hat\mu = \frac{1}{M} \sum_{m=1}^{M} \delta_{x_m}$: empirical distribution.
- $\mathrm{Unif}(X)$, $\sigma(X)$: uniform distribution on a measure space X.
- $F_\mu$, $F_\mu^{-1}$: cumulative distribution function (CDF) and quantile function of the measure µ, respectively.
- $\mathcal{SR}$, $\mathcal{SG}$, $\mathcal{SH}$: Stereographic Spherical Radon Transforms (SSRT), $\mathcal{SR}(\mu) = \mathcal{R}(\phi_\# \mu)$, $\mathcal{SG}(\mu) = \mathcal{G}(\phi_\# \mu)$, $\mathcal{SH}(\mu) = \mathcal{H}(\phi_\# \mu)$.
- $S3W_{\mathcal{R},p}$, $S3W_{\mathcal{G},p}$, and $S3W_{\mathcal{H},p}$: Stereographic Spherical Sliced Wasserstein (S3W) distances,
  $S3W_{\mathcal{R},p}^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_p^p(\mathcal{SR}(\mu)_\theta, \mathcal{SR}(\nu)_\theta)\, d\sigma_d(\theta)$,
  $S3W_{\mathcal{G},p}^p(\mu, \nu) = \int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu)_\theta, \mathcal{SG}(\nu)_\theta)\, d\sigma_{d'}(\theta)$,
  $S3W_{\mathcal{H},p}^p(\mu, \nu) = \int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SH}(\mu)_\theta, \mathcal{SH}(\nu)_\theta)\, d\sigma_{d'}(\theta)$.
$\omega$: Haar probability measure on the compact group $SO(d+1)$.

$\approx$: approximately.

$x \sim \mu$: element $x$ sampled from the distribution $\mu$.

$\nabla_x$: gradient with respect to the variable $x$.

RI-$S3W_{G,p}$: Rotationally Invariant extension of S3W, defined by (17).

ARI-S3W: Amortized Rotationally Invariant extension of S3W used in our experiments.

$d_{\mathbb{S}^d}(\cdot,\cdot)$: the great-circle distance on the sphere $\mathbb{S}^d$, defined by $d_{\mathbb{S}^d}(s_1,s_2) = \arccos(\langle s_1,s_2\rangle)$ for all $s_1,s_2 \in \mathbb{S}^d$.

$\angle(s_1,s_2)$: the angle between the points $s_1$ and $s_2$, defined by $\angle(s_1,s_2) = \arccos(\langle s_1,s_2\rangle) \in [0,\pi]$.

B. Radon Transform for Radon Measures

This section reviews the classic Radon transform in the general Radon measure setting. In the next section, we introduce the Generalized Radon Transform (GRT) for general Radon measures. As discussed in the main section, given a function $f \in L^1(\mathbb{R}^d)$, the Radon transform $\mathcal{R}(f)$ is a function in $L^1(\mathbb{R}\times\mathbb{S}^{d-1})$. In particular, for each $(t,\theta) \in \mathbb{R}\times\mathbb{S}^{d-1}$, we have
$$\mathcal{R}(f)(t,\theta) = \int_{\mathbb{R}^d} f(x)\,\delta(t - \langle x,\theta\rangle)\,dx = \int_{\mathbb{R}^{d-1}} f(t\theta + U_\theta\xi)\,d\xi,$$
where $U_\theta \in \mathbb{R}^{d\times(d-1)}$ is any matrix whose columns form an orthonormal basis of $\theta^\perp$ (the subspace perpendicular to the vector $\theta$); see equation (22) in (Bonneel et al., 2015). It is straightforward to verify that the above definition is well-defined (i.e., $\mathcal{R}(f)$ is independent of the choice of $U_\theta$).

Notably, the Radon transform $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R}\times\mathbb{S}^{d-1})$ is a linear bijection with a closed-form inversion formula. For a function $f \in L^1(\mathbb{R}^d)$, the Radon transform $\mathcal{R}f$ can be inverted using the inverse Radon transform, denoted by $\mathcal{R}^{-1}$. The inversion formula is given by
$$f(x) = \mathcal{R}^{-1}(\mathcal{R}f)(x) = \int_{\mathbb{S}^{d-1}} \left(\mathcal{R}f(\cdot,\theta) * \eta(\cdot)\right)(\langle x,\theta\rangle)\,d\sigma_d(\theta), \quad (19)$$
where $\eta(\cdot)$ is a one-dimensional high-pass filter with corresponding Fourier transform $\mathcal{F}\eta(\omega) = c|\omega|^{d-1}$, which appears due to the Fourier slice theorem (Helgason et al., 2011), and $*$ is the convolution operator. This integral formula effectively reconstructs $f$ from its Radon transform and, in the medical imaging community, is referred to as filtered back-projection.

Besides, the fact that the Radon transform $\mathcal{R}$ is a bounded linear operator from $L^1(\mathbb{R}^d)$ to $L^1(\mathbb{R}\times\mathbb{S}^{d-1})$ yields the definition of the dual operator², denoted $\mathcal{R}^*$, which is generally called back-projection (Helgason et al., 2011; Rubin, 2015). By the definition of the dual operator and the Riesz representation theorem, which identifies the dual space of $L^1$ with $L^\infty$, for each $\psi \in L^\infty(\mathbb{R}\times\mathbb{S}^{d-1})$ we have
$$\int_{\mathbb{R}\times\mathbb{S}^{d-1}} \psi(t,\theta)\,\mathcal{R}(f)(t,\theta)\,dt\,d\sigma_d(\theta) = \int_{\mathbb{R}^d} f(x)\,\mathcal{R}^*(\psi)(x)\,dx. \quad (20)$$
Moreover, $\mathcal{R}^* : L^\infty(\mathbb{R}\times\mathbb{S}^{d-1}) \to L^\infty(\mathbb{R}^d)$ has the closed form: for each $x \in \mathbb{R}^d$ and $\psi \in L^\infty(\mathbb{R}\times\mathbb{S}^{d-1})$,
$$\mathcal{R}^*(\psi)(x) = \int_{\mathbb{S}^{d-1}} \psi(\langle x,\theta\rangle,\theta)\,d\sigma_d(\theta), \quad (21)$$
where $\sigma_d$ is the uniform probability measure on the sphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$.

As discussed in the main section, the classic Radon transform describes a distribution through its projected 1D marginals. Indeed, suppose $f$ is the density of a probability measure $\mu$ and $\theta \in \mathbb{S}^{d-1}$; then the function $\mathcal{R}(f)(\cdot,\theta) : \mathbb{R} \to \mathbb{R}_{\ge 0}$ is exactly the density of the projection of $\mu$ onto the space spanned by $\theta$. We denote this projection by $\theta_\#\mu$, using push-forward notation, where we identify $\theta \in \mathbb{S}^{d-1}$ with the function
$$x \mapsto \langle x,\theta\rangle\,\theta, \quad (22)$$
whose range lies in the line generated by the direction $\theta$ (which we identify with $\mathbb{R}$). Thus, we write
$$\mathcal{R}(f)(\cdot,\theta) = \theta_\#\mu \in \mathcal{M}(\mathbb{R}). \quad (23)$$
However, when $\mu$ is not absolutely continuous (e.g., an empirical distribution), $\theta_\#\mu$ does not admit a density function. To address this limitation and extend $\mathcal{R}$ to any kind of finite measure, we follow the measure-theoretic approach to the Radon transform of (Bonneel et al., 2015), based on equation (20).
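For illustration purposes, the following is a minimal NumPy sketch (ours, not part of the released implementation) of the two primitives used throughout the sliced framework: the slice $\theta_\#\hat\mu$ of an empirical measure, and the closed-form one-dimensional $W_p^p$ via sorted samples (the quantile coupling). The helper names are ours, and we assume equal-weight empirical measures with equal sample sizes.

```python
import numpy as np

def slice_empirical(x, theta):
    """Slice theta_# mu of an empirical measure: project the support
    points x (shape (M, d)) onto the unit direction theta (shape (d,))."""
    return x @ theta  # shape (M,)

def wasserstein1d_pp(u, v, p=2):
    """W_p^p between two equal-size, equal-weight empirical measures on R,
    computed in closed form by sorting (quantile coupling)."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p)

# Example: two empirical measures on R^3 compared along one direction.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.normal(loc=1.0, size=(100, 3))
theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)
print(wasserstein1d_pp(slice_empirical(x, theta), slice_empirical(y, theta)))
```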
We also refer the reader to (Helgason et al., 2011, Ch. 1, Sec. 5). For this extension, we first recall the classical Riesz–Markov representation theorem and a lemma on the identity of Radon measures.

Theorem B.1 (Riesz–Markov representation theorem) Suppose $\Omega$ is a separable and locally compact space. Then the dual space of the Banach space $(C_0(\Omega), \|\cdot\|_\infty)$, denoted $(C_0(\Omega)^*, \|\cdot\|)$, is isomorphic to $(\mathcal{M}(\Omega), \|\cdot\|_{TV})$. In particular, for each bounded linear functional $\xi \in C_0(\Omega)^*$ there exists a unique $\nu \in \mathcal{M}(\Omega)$ such that $\xi(\psi) = \int_\Omega \psi(x)\,d\nu(x)$ for all $\psi \in C_0(\Omega)$, and $\|\xi\| = \|\nu\|_{TV}$.

Lemma B.2 (Identity of Radon measures) Given $\nu_1, \nu_2 \in \mathcal{M}(\Omega)$, where $\Omega \subseteq \mathbb{R}^d$, the following are equivalent:
(1) $\nu_1 = \nu_2$;
(2) $\int_\Omega \psi(x)\,d\nu_1(x) = \int_\Omega \psi(x)\,d\nu_2(x)$ for all $\psi \in C_0(\Omega)$;
(3) $\int_\Omega \psi(x)\,d\nu_1(x) = \int_\Omega \psi(x)\,d\nu_2(x)$ for all $\psi \in L^\infty(\Omega)$.
We will refer to the spaces $C_0(\Omega)$ and $L^\infty(\Omega)$ as spaces of test functions for $\mathcal{M}(\Omega)$.

Proof: If $\nu_1 = \nu_2$, we directly have (2) and (3). Since $C_0(\Omega) \subseteq L^\infty(\Omega)$, (3) implies (2). All we need to prove is (2) $\Rightarrow$ (1). By the Riesz–Markov representation theorem B.1, $\mathcal{M}(\Omega)$ is isometric to $(C_0(\Omega)^*, \|\cdot\|)$. By the identity in the dual space, (2) gives $\nu_1 = \nu_2$. $\square$

²Some references call the back-projection the adjoint (a.k.a. Hermitian conjugate) operator, e.g., (Bonneel et al., 2015; Unser, 2023).

Based on the above theorem, for $\mu \in \mathcal{M}(\mathbb{R}^d)$, its Radon transform $\mathcal{R}(\mu) = \nu$ is defined as the measure $\nu \in \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d-1})$ such that for each $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d-1}} \psi(t,\theta)\,d\nu(t,\theta) = \int_{\mathbb{R}^d} (\mathcal{R}^*(\psi))(x)\,d\mu(x), \quad (24)$$
where the dual operator $\mathcal{R}^*$ is defined in (21). Succinctly, one usually writes $\mathcal{R}(\mu)(\psi) = \mu(\mathcal{R}^*(\psi))$. Note that, by the above lemma, we can choose $L^\infty(\mathbb{R}\times\mathbb{S}^{d-1})$ as the space of test functions. In particular, by setting $\psi = 1_A$ for any Borel set $A \subseteq \mathbb{R}\times\mathbb{S}^{d-1}$, one can verify that $\nu$ is a Radon measure if $\mu$ is a Radon measure. Similarly, $\nu$ is a positive Radon measure if $\mu$ is positive. In addition, one obtains $\|\nu\|_{TV} = \|\mu\|_{TV}$. Thus, if $\mu$ is a probability measure on $\mathbb{R}^d$, then $\nu = \mathcal{R}(\mu)$ is a probability measure on $\mathbb{R}\times\mathbb{S}^{d-1}$.

By the disintegration theorem of classical measure theory, there exists a $\nu$-a.s. unique family of measures $(\nu_\theta)_{\theta\in\mathbb{S}^{d-1}} \subseteq \mathcal{M}(\mathbb{R})$ such that for any $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d-1}} \psi(t,\theta)\,d\nu(t,\theta) = \int_{\mathbb{S}^{d-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\nu_\theta(t)\,d\sigma_d(\theta). \quad (25)$$
Similar to (23) for the classic Radon transform, given a fixed $\theta \in \mathbb{S}^{d-1}$, $\mathcal{R}(\mu)_\theta$ describes the projection of $\mu$ onto the 1D space spanned by $\theta$ (Bonneel et al., 2015, Proposition 6), that is, $\mathcal{R}(\mu)_\theta = \nu_\theta = \theta_\#\mu \in \mathcal{M}(\mathbb{R})$.

C. Generalized Radon Transform for Probability Measures

Extending the Radon transform as in (3) and (4) is natural. In this section, we review those definitions. Given $g : \mathbb{R}^d \times (\mathbb{R}^{d'}\setminus\{0\}) \to \mathbb{R}$ satisfying the technical assumptions (H.1)–(H.4) listed in the Notation section and given in (Beylkin, 1984) and (Kolouri et al., 2019a), the Generalized Radon Transform (GRT) is defined as
$$\mathcal{G}(f)(t,\theta) = \int_{\mathbb{R}^d} f(x)\,\delta(t - g(x,\theta))\,dx, \quad (t,\theta) \in \mathbb{R}\times(\mathbb{R}^{d'}\setminus\{0\}).$$
Note that for each $(t,\theta) \in \mathbb{R}\times(\mathbb{R}^{d'}\setminus\{0\})$, we set the corresponding hypersurface $H_{t,\theta} = \{x \in \mathbb{R}^d \mid g(x,\theta) = t\}$. By the homogeneity property (H.2), $H_{\lambda t,\lambda\theta} = H_{t,\theta}$ for all $\lambda \neq 0$. Then $\mathcal{G}f(\lambda t,\lambda\theta) = \mathcal{G}f(t,\theta)$ for all $(t,\theta) \in \mathbb{R}\times(\mathbb{R}^{d'}\setminus\{0\})$ and $\lambda \neq 0$. Thus, the hypersurfaces can be parametrized by $(t,\theta) \in \mathbb{R}\times\mathbb{S}^{d'-1}$, and we can regard $\mathcal{G}$ as a mapping from the space $L^1(\mathbb{R}^d)$ to $L^1(\mathbb{R}\times\mathbb{S}^{d'-1})$. Furthermore, since $\mathcal{G}$ is a linear operator from $L^1(\mathbb{R}^d)$ to $L^1(\mathbb{R}\times\mathbb{S}^{d'-1})$, we can define the dual operator $\mathcal{G}^*$ of $\mathcal{G}$ as in (Homan & Zhou, 2017) (also called the adjoint operator).
In analogy with the Radon transform, this is done via the duality between $L^1$ and $L^\infty$: for each $\psi \in L^\infty(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\mathcal{G}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} \psi(g(x,\theta),\theta)\,d\sigma_{d'}(\theta), \quad x \in \mathbb{R}^d, \quad (26)$$
where $\sigma_{d'}$ is the uniform probability measure on $\mathbb{S}^{d'-1}$.

In the specific case $g(x,\theta) = \langle h(x),\theta\rangle$, where $h : \mathbb{R}^d \to \mathbb{R}^{d'}$ is an injective function, the induced GRT becomes
$$\mathcal{H}(f)(t,\theta) = \int_{\mathbb{R}^d} f(x)\,\delta(t - \langle h(x),\theta\rangle)\,dx, \quad (t,\theta) \in \mathbb{R}\times\mathbb{S}^{d'-1},$$
with corresponding dual operator
$$\mathcal{H}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} \psi(\langle h(x),\theta\rangle,\theta)\,d\sigma_{d'}(\theta), \quad \psi \in L^\infty(\mathbb{R}\times\mathbb{S}^{d'-1}),\ x \in \mathbb{R}^d. \quad (27)$$
In this case, that is, $g(x,\theta) = \langle h(x),\theta\rangle$ for $h = (h_1,\dots,h_{d'}) : \mathbb{R}^d \to \mathbb{R}^{d'}$, properties (H.1)–(H.4) read as follows. First, notice that the real inner product in $\mathbb{R}^{d'}$ defines, in each variable, a $C^\infty$ function that is homogeneous of degree one; in particular, (H.2) is trivially satisfied. Then $h$ must satisfy:
(a) In place of (H.1): $h \in C^\infty$.
(b) In place of (H.3): for all $(x,\theta) \in \mathbb{R}^d\times\mathbb{S}^{d'-1}$, $\left[\langle\frac{\partial h}{\partial x_1}(x),\theta\rangle, \dots, \langle\frac{\partial h}{\partial x_d}(x),\theta\rangle\right] \neq 0$.
(c) In place of (H.4): $\det\left(\frac{\partial h_i}{\partial x_j}\right)_{1\le i\le d',\,1\le j\le d} > 0$.

In particular, if $d = d'$, let $J_h$ denote the Jacobian of $h = (h_1,\dots,h_d) : \mathbb{R}^d \to \mathbb{R}^d$, that is, the $d\times d$ matrix $J_h = (\partial h_i/\partial x_j)_{ij}$. Then condition (b) can be written as the matrix–vector product $J_h(x)^T\theta \neq 0$, and condition (c) is equivalent to $\det(J_h(x)) > 0$ for all $x \in \mathbb{R}^d$; hence in this case (b) is redundant. Therefore, when $d = d'$, $h : \mathbb{R}^d \to \mathbb{R}^d$ must satisfy $h \in C^\infty$ and $\det(J_h) > 0$.

Inspired by the framework of the classic Radon transform for Radon measures, we can extend these ideas to the GRT setting. In particular, given $\mu \in \mathcal{M}(\mathbb{R}^d)$, we define the measures $\mathcal{G}(\mu)$ and $\mathcal{H}(\mu)$ as follows: for each test function $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{G}(\mu)(t,\theta) = \int_{\mathbb{R}^d} \mathcal{G}^*(\psi)(x)\,d\mu(x), \quad (28)$$
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{H}(\mu)(t,\theta) = \int_{\mathbb{R}^d} \mathcal{H}^*(\psi)(x)\,d\mu(x). \quad (29)$$

Proposition C.1 Given $\mu \in \mathcal{M}(\mathbb{R}^d)$, the measures $\mathcal{G}(\mu)$ and $\mathcal{H}(\mu)$ defined in (28) and (29) have the following properties:
(1) $\mathcal{G}(\mu)$ is well-defined, i.e., $\mathcal{G}(\mu) \in \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(2) If $\mu \in \mathcal{M}^+(\mathbb{R}^d)$, then $\mathcal{G}(\mu) \in \mathcal{M}^+(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(3) $\mathcal{G}$ preserves the total variation norm, i.e., $\|\mu\|_{TV} = \|\mathcal{G}(\mu)\|_{TV}$. In particular, $\mathcal{G}$ preserves mass.
(4) If $\mu \in \mathcal{P}(\mathbb{R}^d)$, then $\mathcal{G}(\mu) \in \mathcal{P}(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(5) Given $\mu \in \mathcal{M}(\mathbb{R}^d)$, the disintegration theorem gives a $\mathcal{G}(\mu)$-a.s. unique family of measures $(\mathcal{G}(\mu)_\theta)_{\theta\in\mathbb{S}^{d'-1}} \subseteq \mathcal{M}(\mathbb{R})$ such that for any $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{G}(\mu)(t,\theta) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{G}(\mu)_\theta(t)\,d\sigma_{d'}(\theta).$$
Then it holds that $\mathcal{G}(\mu)_\theta = g(\cdot,\theta)_\#\mu$.
(6) Properties (1)–(5) hold for $\mathcal{H}$ in place of $\mathcal{G}$.
(7) Given $\mu \in \mathcal{M}(\mathbb{R}^d)$, the disintegration theorem gives an $\mathcal{H}(\mu)$-a.s. unique family of measures $(\mathcal{H}(\mu)_\theta)_{\theta\in\mathbb{S}^{d'-1}} \subseteq \mathcal{M}(\mathbb{R})$ such that for any $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{H}(\mu)(t,\theta) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{H}(\mu)_\theta(t)\,d\sigma_{d'}(\theta).$$
Then it holds that $\mathcal{H}(\mu)_\theta = \theta_\#h_\#\mu$, or equivalently, $\mathcal{H}(\mu)_\theta = (\theta\circ h)_\#\mu$, where the function $\theta$ is defined as in (22).
(8) $\mathcal{H}(\mu) = \mathcal{R}(h_\#\mu)$ for every $\mu \in \mathcal{M}(\mathbb{R}^d)$.
(9) When $h$ is injective, $\mathcal{H}$ is invertible.

Proof:
(1) Let $\mu \in \mathcal{M}(\mathbb{R}^d)$. By (Beylkin, 1984; Homan & Zhou, 2017), (26) is the dual operator of $\mathcal{G}$. Indeed, by standard functional analysis, $L^\infty$ is the dual space of the Banach space $L^1$, and thus $\mathcal{G}^* : L^\infty(\mathbb{R}\times\mathbb{S}^{d'-1}) \to L^\infty(\mathbb{R}^d)$ is well-defined. In particular, $\int_{\mathbb{R}^d} \mathcal{G}^*(\psi)(x)\,d\mu(x)$ is finite for all $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1}) \subseteq L^\infty(\mathbb{R}\times\mathbb{S}^{d'-1})$. Then expression (28) is finite for all $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$ and, moreover, defines $\mathcal{G}(\mu)$ as an element of the dual space of $C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$. By Theorem B.1, $\mathcal{G}(\mu) \in \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(2) Let $\mu \in \mathcal{M}^+(\mathbb{R}^d)$ and $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$ with $\psi(t,\theta) \ge 0$ for all $t,\theta$. Then, by definition (26), $\mathcal{G}^*\psi(x) \ge 0$ for all $x \in \mathbb{R}^d$. Therefore,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{G}\mu(t,\theta) = \int_{\mathbb{R}^d} \mathcal{G}^*(\psi)(x)\,d\mu(x) \ge 0,$$
which follows from the facts $\mathcal{G}^*(\psi) \ge 0$ and $\mu \in \mathcal{M}^+(\mathbb{R}^d)$.
Since the above property holds for all non-negative test functions $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$, we obtain $\mathcal{G}(\mu) \in \mathcal{M}^+(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(3) Set $\psi = 1_{\mathbb{R}\times\mathbb{S}^{d'-1}} \in L^\infty(\mathbb{R}\times\mathbb{S}^{d'-1})$. By the definition of $\mathcal{G}^*$, we have
$$\mathcal{G}^*(\psi)(x) = \int_{\mathbb{S}^{d'-1}} 1\,d\sigma_{d'}(\theta) = 1, \quad \forall x \in \mathbb{R}^d;$$
that is, $\mathcal{G}^*(1_{\mathbb{R}\times\mathbb{S}^{d'-1}}) = 1_{\mathbb{R}^d}$. Now, let $\mu \in \mathcal{M}(\mathbb{R}^d)$. We have
$$\mu(\mathbb{R}^d) = \int_{\mathbb{R}^d} 1_{\mathbb{R}^d}\,d\mu = \int_{\mathbb{R}^d} \mathcal{G}^*(1_{\mathbb{R}\times\mathbb{S}^{d'-1}})\,d\mu = \int_{\mathbb{R}\times\mathbb{S}^{d'-1}} 1_{\mathbb{R}\times\mathbb{S}^{d'-1}}\,d\mathcal{G}(\mu) = \mathcal{G}(\mu)(\mathbb{R}\times\mathbb{S}^{d'-1}). \quad (30)$$
Let $\mu^+$ and $\mu^-$ be the positive and negative parts of $\mu$, that is, $\mu = \mu^+ - \mu^-$ with $\mu^\pm \in \mathcal{M}^+(\mathbb{R}^d)$. By property (2), we have $\mathcal{G}(\mu^\pm) \in \mathcal{M}^+(\mathbb{R}\times\mathbb{S}^{d'-1})$. By linearity of $\mathcal{G}$, we have $\mathcal{G}(\mu) = \mathcal{G}(\mu^+ - \mu^-) = \mathcal{G}(\mu^+) - \mathcal{G}(\mu^-)$, a difference of two positive measures, which by the uniqueness of the Hahn–Jordan decomposition implies that the positive and negative parts of the measure $\mathcal{G}(\mu)$ are $\mathcal{G}(\mu^+)$ and $\mathcal{G}(\mu^-)$, respectively; that is, $\mathcal{G}(\mu)^\pm = \mathcal{G}(\mu^\pm)$. Then, using (30),
$$\|\mathcal{G}(\mu)\|_{TV} = \mathcal{G}(\mu)^+(\mathbb{R}\times\mathbb{S}^{d'-1}) + \mathcal{G}(\mu)^-(\mathbb{R}\times\mathbb{S}^{d'-1}) = \mathcal{G}(\mu^+)(\mathbb{R}\times\mathbb{S}^{d'-1}) + \mathcal{G}(\mu^-)(\mathbb{R}\times\mathbb{S}^{d'-1}) = \mu^+(\mathbb{R}^d) + \mu^-(\mathbb{R}^d) = \|\mu\|_{TV},$$
that is, $\|\mathcal{G}(\mu)\|_{TV} = \|\mu\|_{TV}$.
(4) If $\mu \in \mathcal{P}(\mathbb{R}^d)$, combining the property $\mathcal{G}(\mu) \in \mathcal{M}^+(\mathbb{R}\times\mathbb{S}^{d'-1})$ with the property $\|\mathcal{G}(\mu)\|_{TV} = \|\mu\|_{TV} = 1$, we obtain $\mathcal{G}(\mu) \in \mathcal{P}(\mathbb{R}\times\mathbb{S}^{d'-1})$.
(5) We follow the ideas of (Bonneel et al., 2015, Proposition 6). For any test function $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$ we have
$$\int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{G}(\mu)_\theta(t)\,d\sigma_{d'}(\theta) = \int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{G}(\mu)(t,\theta) \quad \text{(disintegration theorem)}$$
$$= \int_{\mathbb{R}^d} \mathcal{G}^*(\psi)(x)\,d\mu(x) \quad \text{(definition of } \mathcal{G}(\mu)\text{)}$$
$$= \int_{\mathbb{R}^d}\int_{\mathbb{S}^{d'-1}} \psi(g(x,\theta),\theta)\,d\sigma_{d'}(\theta)\,d\mu(x) \quad \text{(definition of } \mathcal{G}^*\text{)}$$
$$= \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}^d} \psi(g(x,\theta),\theta)\,d\mu(x)\,d\sigma_{d'}(\theta) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d(g(\cdot,\theta)_\#\mu)(t)\,d\sigma_{d'}(\theta),$$
where the fourth equality follows from Fubini's theorem and the facts that $\psi$ is bounded and $\mu, \sigma_{d'}$ are finite measures; in the fifth equality, $t = g(x,\theta)$ and we use the definition of the pushforward. By uniqueness of the disintegration of measures, we have $\mathcal{G}(\mu)_\theta = g(\cdot,\theta)_\#\mu$. This can also be shown by choosing an arbitrary $\psi_0 \in C_0(\mathbb{R})$ and defining $\psi(t,\theta) := \psi_0(t)$ for $(t,\theta) \in \mathbb{R}\times\mathbb{S}^{d'-1}$, so that $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$; then
$$\int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi_0(t)\,d\mathcal{G}(\mu)_\theta(t)\,d\sigma_{d'}(\theta) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi_0(t)\,d(g(\cdot,\theta)_\#\mu)(t)\,d\sigma_{d'}(\theta),$$
and therefore $\int_{\mathbb{R}} \psi_0\,d\mathcal{G}(\mu)_\theta = \int_{\mathbb{R}} \psi_0\,d(g(\cdot,\theta)_\#\mu)$ for $\sigma_{d'}$-a.e. $\theta$.
(6) For $\mathcal{H}$, properties (1)–(5) are derived similarly.
(7) As in (5), we follow the ideas of (Bonneel et al., 2015, Proposition 6). For any test function $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$ we have
$$\int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{H}(\mu)_\theta(t)\,d\sigma_{d'}(\theta) = \int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{H}(\mu)(t,\theta) \quad \text{(disintegration theorem)}$$
$$= \int_{\mathbb{R}^d} \mathcal{H}^*(\psi)(x)\,d\mu(x) \quad \text{(definition of } \mathcal{H}(\mu)\text{)}$$
$$= \int_{\mathbb{R}^d}\int_{\mathbb{S}^{d'-1}} \psi(\langle h(x),\theta\rangle,\theta)\,d\sigma_{d'}(\theta)\,d\mu(x) \quad \text{(definition of } \mathcal{H}^*\text{)}$$
$$= \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}^d} \psi(\langle h(x),\theta\rangle,\theta)\,d\mu(x)\,d\sigma_{d'}(\theta) \quad \text{(Fubini's theorem)}$$
$$= \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}^{d'}} \psi(\langle y,\theta\rangle,\theta)\,d(h_\#\mu)(y)\,d\sigma_{d'}(\theta) \quad \text{(definition of pushforward)}$$
$$= \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d(\theta_\#h_\#\mu)(t)\,d\sigma_{d'}(\theta) \quad \text{(definition of pushforward)}.$$
Thus, by the uniqueness of the disintegration of measures, we have $\mathcal{H}(\mu)_\theta = \theta_\#h_\#\mu = (\theta\circ h)_\#\mu$.
(8) Let $\mu \in \mathcal{M}(\mathbb{R}^d)$, and let $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$ be any test function. Following the ideas from (7), we have
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{H}(\mu)(t,\theta) = \int_{\mathbb{R}^d} \mathcal{H}^*(\psi)(x)\,d\mu(x) = \int_{\mathbb{R}^d}\int_{\mathbb{S}^{d'-1}} \psi(\langle h(x),\theta\rangle,\theta)\,d\sigma_{d'}(\theta)\,d\mu(x)$$
$$= \int_{\mathbb{R}^{d'}}\int_{\mathbb{S}^{d'-1}} \psi(\langle y,\theta\rangle,\theta)\,d\sigma_{d'}(\theta)\,d(h_\#\mu)(y) = \int_{\mathbb{R}^{d'}} (\mathcal{R}^*(\psi))(y)\,d(h_\#\mu)(y) = \int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{R}(h_\#\mu)(t,\theta),$$
where in the third equality $y = h(x)$ and we use the definition of the pushforward; the first and last equalities follow from equations (29) and (24), respectively. Therefore, $\mathcal{H}(\mu) = \mathcal{R}(h_\#\mu)$.
(9) By Proposition 7 in (Bonneel et al., 2015), the mapping
$$h_\#\mu \mapsto \mathcal{R}(h_\#\mu) = \mathcal{H}(\mu) \quad (31)$$
is invertible. If $h$ is injective, the map $\mu \mapsto h_\#\mu$ is invertible. Thus, $\mu \mapsto \mathcal{H}(\mu)$ is invertible. $\square$
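To make property (7) tangible, the following is a minimal sketch (ours) of the $\mathcal{H}$-slice of an empirical measure: since $\mathcal{H}(\mu)_\theta = (\theta\circ h)_\#\mu$, the slice is obtained by applying $h$ to the support and projecting onto $\theta$. The toy map `h` below is an illustrative injective choice of ours, not the $h_1$ used later in the paper.

```python
import numpy as np

def h(x):
    """Illustrative injective map (componentwise strictly increasing,
    hence injective); stands in for a generic defining function h."""
    return x + 0.1 * np.tanh(x)

def h_slice(x, theta):
    """Support of H(mu)_theta = <h(.), theta>_# mu for an empirical
    measure with support points x (shape (M, d))."""
    return h(x) @ theta

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))          # support of the empirical measure
theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)
print(h_slice(x, theta)[:5])          # first few points of the 1D slice
```

Note that, by property (8), this is the same as first pushing the support through $h$ and then taking the classical Radon slice, i.e., $\mathcal{H}(\mu) = \mathcal{R}(h_\#\mu)$.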
D. Stereographic Spherical Radon Transform and Stereographic Spherical Sliced Wasserstein Distance for Probability Measures

Based on the extended definition of the GRT for probability measures, in this section we formally introduce the Stereographic Spherical Radon Transform and the corresponding sliced Wasserstein distance, the Stereographic Spherical Sliced Wasserstein distance, for general probability measures. Given $\mu \in \mathcal{M}(\mathbb{S}^d)$ and the stereographic projection $\phi : \mathbb{S}^d\setminus\{s_n\} \to \mathbb{R}^d$, which is a bijection, $\phi_\#\mu$ is a measure on $\mathbb{R}^d$. The generalized Stereographic Spherical Radon transform of $\mu$ has two versions, $\mathcal{SG}(\mu)$ and $\mathcal{SH}(\mu)$, defined as follows:
$$\mathcal{SG}(\mu) := \mathcal{G}(\phi_\#\mu) \quad \text{and} \quad \mathcal{SH}(\mu) := \mathcal{H}(\phi_\#\mu). \quad (32)$$
That is, from (28) and (29), for each test function $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{SG}(\mu)(t,\theta) = \int_{\mathbb{R}^d} \mathcal{G}^*(\psi)(x)\,d\phi_\#\mu(x) = \int_{\mathbb{S}^d\setminus\{s_n\}} \mathcal{G}^*(\psi)(\phi(s))\,d\mu(s), \quad (33)$$
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi(t,\theta)\,d\mathcal{SH}(\mu)(t,\theta) = \int_{\mathbb{R}^d} \mathcal{H}^*(\psi)(x)\,d\phi_\#\mu(x) = \int_{\mathbb{S}^d\setminus\{s_n\}} \mathcal{H}^*(\psi)(\phi(s))\,d\mu(s). \quad (34)$$

Proposition D.1 Let $\mu \in \mathcal{M}(\mathbb{S}^d)$ be a measure that does not give mass to the North Pole $\{s_n\}$. The Stereographic Spherical Radon transforms defined in (32) have the following properties:
(1) $\mathcal{SG}(\mu), \mathcal{SH}(\mu) \in \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d'-1})$. In addition, $\mathcal{SG}$ and $\mathcal{SH}$ preserve mass, and if $\mu$ is a positive measure, then $\mathcal{SG}(\mu)$ and $\mathcal{SH}(\mu)$ are positive measures too. Finally, if $\mu \in \mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$, then $\mathcal{SG}(\mu)$ and $\mathcal{SH}(\mu)$ are probability measures on $\mathbb{R}\times\mathbb{S}^{d'-1}$.
(2) The disintegration theorem gives an $\mathcal{SG}(\mu)$-a.s. unique family of measures $(\mathcal{SG}(\mu)_\theta)_{\theta\in\mathbb{S}^{d'-1}} \subseteq \mathcal{M}(\mathbb{R})$ and an $\mathcal{SH}(\mu)$-a.s. unique family $(\mathcal{SH}(\mu)_\theta)_{\theta\in\mathbb{S}^{d'-1}} \subseteq \mathcal{M}(\mathbb{R})$ such that for any $\psi \in C_0(\mathbb{R}\times\mathbb{S}^{d'-1})$,
$$\int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi\,d\mathcal{SG}(\mu) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{SG}(\mu)_\theta(t)\,d\sigma_{d'}(\theta) \quad \text{and} \quad \int_{\mathbb{R}\times\mathbb{S}^{d'-1}} \psi\,d\mathcal{SH}(\mu) = \int_{\mathbb{S}^{d'-1}}\int_{\mathbb{R}} \psi(t,\theta)\,d\mathcal{SH}(\mu)_\theta(t)\,d\sigma_{d'}(\theta). \quad (35)$$
Then it holds that
$$\mathcal{SG}(\mu)_\theta = (g(\cdot,\theta)\circ\phi)_\#\mu \quad \text{and} \quad \mathcal{SH}(\mu)_\theta = (\theta\circ h\circ\phi)_\#\mu, \quad (36)$$
where we recall that the function $\theta$ is defined as in (22).
(3) $\mathcal{SH}$ is invertible.

Proof:
(1) If $\mu \in \mathcal{M}(\mathbb{S}^d)$ gives no mass to $\{s_n\}$, then $\phi_\#\mu$ is a measure on $\mathbb{R}^d$ with the same total mass. Thus, since $\mathcal{SG}(\mu) = \mathcal{G}(\phi_\#\mu)$, the conclusion follows from (1)–(4) in Proposition C.1; similarly for $\mathcal{SH}(\mu)$.
(2) By the definition of $\mathcal{SG}$ and (5) in Proposition C.1, we have $\mathcal{SG}(\mu)_\theta = g(\cdot,\theta)_\#(\phi_\#\mu) = (g(\phi(\cdot),\theta))_\#\mu$. Similarly, by the definition of $\mathcal{SH}$ and (7) in Proposition C.1, we have $\mathcal{SH}(\mu)_\theta = (\theta\circ h)_\#\phi_\#\mu = (\theta\circ h\circ\phi)_\#\mu$.
(3) Finally, since the stereographic projection $\phi : \mathbb{S}^d\setminus\{s_n\} \to \mathbb{R}^d$ is invertible, the composition
$$\mathcal{M}(\mathbb{S}^d\setminus\{s_n\}) \to \mathcal{M}(\mathbb{R}^d) \to \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d'-1}), \quad \mu \mapsto \phi_\#\mu \mapsto \mathcal{H}(\phi_\#\mu) = \mathcal{SH}(\mu)$$
is invertible, as each mapping is invertible. Similarly,
$$\mathcal{P}(\mathbb{S}^d\setminus\{s_n\}) \to \mathcal{P}(\mathbb{R}^d) \to \mathcal{M}(\mathbb{R}\times\mathbb{S}^{d'-1}), \quad \mu \mapsto \phi_\#\mu \mapsto \mathcal{H}(\phi_\#\mu) = \mathcal{SH}(\mu)$$
is invertible. Thus, $\mu \mapsto \mathcal{SH}(\mu)$ is an invertible mapping, both for general measures and for probability measures. $\square$

Remark D.2 If $g(x,\theta) = \langle x,\theta\rangle$, then the generalized Radon transform $\mathcal{G}$ coincides with the classical Radon transform $\mathcal{R}$, and the corresponding Stereographic Spherical Radon transform coincides with the one defined in (10): $\mathcal{SR}(\mu) = \mathcal{R}(\phi_\#\mu)$ for any Radon measure $\mu \in \mathcal{M}(\mathbb{S}^d)$ that does not assign mass to $\{s_n\}$.

We introduce the notation $\phi_g^\theta, \phi_h^\theta : \mathbb{S}^d\setminus\{s_n\} \to \mathbb{R}^{d'}$,
$$\phi_g^\theta(z) := g(\phi(z),\theta)\,\theta, \quad \phi_h^\theta(z) := \langle h(\phi(z)),\theta\rangle\,\theta,$$
both having range in the line generated by $\theta \in \mathbb{S}^{d'-1}$. The expressions (36) in part (2) of the above proposition can then be written as
$$\mathcal{SG}(\mu)_\theta = \phi^\theta_{g\#}\mu, \quad \mathcal{SH}(\mu)_\theta = \phi^\theta_{h\#}\mu,$$
and we call them slices of the SSRT of $\mu$.
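For concreteness, the following is a minimal sketch (ours) of one such slice for an empirical spherical measure: the support is pushed through the stereographic projection $\phi$ (the equatorial version (46), without the extra factor of 2 in (1)) and through $h$, and then projected onto $\theta$. With `h` set to the identity, this recovers an $\mathcal{SR}$ slice.

```python
import numpy as np

def phi(s):
    """Equatorial stereographic projection (46): S^d \ {s_n} -> R^d,
    applied row-wise to points s of shape (n, d+1)."""
    return s[:, :-1] / (1.0 - s[:, -1:])

def ssrt_slice(s, theta, h):
    """One S3W slice S_H(mu)_theta = <h(phi(.)), theta>_# mu."""
    return h(phi(s)) @ theta

rng = np.random.default_rng(2)
s = rng.normal(size=(200, 4))                      # points on S^3 in R^4
s /= np.linalg.norm(s, axis=1, keepdims=True)
theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)
print(ssrt_slice(s, theta, lambda x: x)[:5])       # h = identity: S_R slice
```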
Given two measures $\mu_1, \mu_2 \in \mathcal{P}(\mathbb{S}^d)$, we introduce the formal definition of the Stereographic Spherical Sliced Wasserstein (S3W) distance as follows:
$$S3W_{G,p}^p(\mu_1,\mu_2) := \int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_2)_\theta)\,d\sigma_{d'}(\theta) = \int_{\mathbb{S}^{d'-1}} W_p^p(\phi^\theta_{g\#}\mu_1, \phi^\theta_{g\#}\mu_2)\,d\sigma_{d'}(\theta), \quad (37)$$
$$S3W_{H,p}^p(\mu_1,\mu_2) := \int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SH}(\mu_1)_\theta, \mathcal{SH}(\mu_2)_\theta)\,d\sigma_{d'}(\theta) = \int_{\mathbb{S}^{d'-1}} W_p^p(\phi^\theta_{h\#}\mu_1, \phi^\theta_{h\#}\mu_2)\,d\sigma_{d'}(\theta), \quad (38)$$
where $\sigma_{d'} \in \mathcal{P}(\mathbb{S}^{d'-1})$ is the uniform measure on $\mathbb{S}^{d'-1}$.

Theorem D.3 The above quantities $S3W_{G,p}(\cdot,\cdot)$ and $S3W_{H,p}(\cdot,\cdot)$ are well-defined. Furthermore, $S3W_{G,p}(\cdot,\cdot)$ is a pseudo-metric on $\mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$, i.e., it is non-negative, symmetric, and satisfies the triangle inequality. In addition, $S3W_{H,p}(\cdot,\cdot)$ defines a metric on $\mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$.

Proof: Given probability measures $\mu_1, \mu_2 \in \mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$, by (1) in Proposition D.1, $\mathcal{SG}(\mu_1)$ and $\mathcal{SG}(\mu_2)$ are probability measures on $\mathbb{R}\times\mathbb{S}^{d'-1}$. Thus, by (2) in Proposition D.1, $\mathcal{SG}(\mu_1)_\theta$ and $\mathcal{SG}(\mu_2)_\theta$ are probability measures on $\mathbb{R}$. Then $W_p^p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_2)_\theta)$ is well-defined, and thus $S3W_{G,p}^p(\cdot,\cdot)$ is well-defined. Analogously, $S3W_{H,p}^p(\cdot,\cdot)$ is well-defined.

Since the Wasserstein distance $W_p(\cdot,\cdot)$ on $\mathcal{P}(\mathbb{R})$ is non-negative and symmetric and $\sigma_{d'}$ is a positive measure, $S3W_{G,p}(\cdot,\cdot)$ and $S3W_{H,p}(\cdot,\cdot)$ are non-negative and symmetric. For the triangle inequality, let $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$:
$$S3W_{G,p}(\mu_1,\mu_3) = \left(\int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_3)_\theta)\,d\sigma_{d'}(\theta)\right)^{1/p}$$
$$\le \left(\int_{\mathbb{S}^{d'-1}} \left(W_p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_2)_\theta) + W_p(\mathcal{SG}(\mu_2)_\theta, \mathcal{SG}(\mu_3)_\theta)\right)^p\,d\sigma_{d'}(\theta)\right)^{1/p}$$
$$\le \left(\int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_2)_\theta)\,d\sigma_{d'}(\theta)\right)^{1/p} + \left(\int_{\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu_2)_\theta, \mathcal{SG}(\mu_3)_\theta)\,d\sigma_{d'}(\theta)\right)^{1/p}$$
$$= S3W_{G,p}(\mu_1,\mu_2) + S3W_{G,p}(\mu_2,\mu_3),$$
where the first inequality holds because $W_p(\cdot,\cdot)$ on $\mathcal{P}(\mathbb{R})$ is a metric, and the second inequality follows from the Minkowski inequality in $L^p(\mathbb{S}^{d'-1})$. Then $S3W_{G,p}(\cdot,\cdot)$ is a pseudo-metric; analogously for $S3W_{H,p}(\cdot,\cdot)$.

In addition, suppose $S3W_{H,p}(\mu_1,\mu_2) = 0$; then $\mathcal{SH}(\mu_1)_\theta = \mathcal{SH}(\mu_2)_\theta$, where the equality holds $\sigma_{d'}(\theta)$-a.s. Then, from (35), we deduce that $\mathcal{SH}(\mu_1) = \mathcal{SH}(\mu_2)$. Since $\mathcal{SH}$ is invertible by Proposition D.1, we have $\mu_1 = \mu_2$. Thus, $S3W_{H,p}(\cdot,\cdot)$ is a metric. $\square$

Lastly, similar to the max-sliced Wasserstein distance, we define the max-S3W distance as
$$\text{Max-}S3W_{G,p}^p(\mu_1,\mu_2) := \sup_{\theta\in\mathbb{S}^{d'-1}} W_p^p(\mathcal{SG}(\mu_1)_\theta, \mathcal{SG}(\mu_2)_\theta), \quad (39)$$
and similarly for $\mathcal{H}$ we define $\text{Max-}S3W_{H,p}^p$.

E. Rotationally Invariant Stereographic Spherical Sliced Wasserstein Distances (RI-S3W)

In this section, we discuss the Rotationally Invariant S3W distance (17):
$$\text{RI-}S3W_{G,p}(\mu,\nu) := \mathbb{E}_{R\sim\omega}[S3W_{G,p}(R_\#\mu, R_\#\nu)].$$
First, we describe the uniform (Haar) distribution on $O(d+1)$, denoted $\omega$. Precisely, we describe how a $(d+1)\times(d+1)$ orthonormal matrix $R$ with rows $v_1,\dots,v_{d+1}$ can be randomly generated as a sample of $\omega$. The procedure relies on the Gram–Schmidt algorithm. First, we randomly select $v_1 \in \mathbb{S}^d$; in particular, we denote this as $v_1 \sim \sigma_{d+1}$, that is, we sample $v_1$ according to $\sigma_{d+1}$, the uniform distribution on $\mathbb{S}^d$. Next, $v_2$ should be selected from $\mathrm{span}(v_1)^\perp = H(v_1,0) = \{v : \langle v_1, v\rangle = 0\}$ and satisfy $\|v_2\| = 1$, where $\mathrm{span}(v_1)$ is the 1D subspace of $\mathbb{R}^{d+1}$ spanned by $v_1$. Denote $\mathbb{S}^{d-1}(v_1) := \mathbb{S}^d \cap H(v_1,0)$; note that it is essentially a $(d-1)$-sphere. We randomly select $v_2 \in \mathbb{S}^{d-1}(v_1)$, denoted $v_2 \sim \sigma_d$ given $v_1$, where $\sigma_d$ denotes the uniform distribution on the sphere $\mathbb{S}^{d-1}(v_1)$. In the third step, similarly, let $\mathbb{S}^{d-2}(v_1,v_2) := \mathbb{S}^d \cap H(v_1,0) \cap H(v_2,0)$, and sample $v_3 \sim \sigma_{d-1}$ given $v_1, v_2$, where $\sigma_{d-1}$ is the uniform distribution on the $(d-2)$-sphere $\mathbb{S}^{d-2}(v_1,v_2)$. Continue this process recursively until step $d+1$. Note that at the last iteration, we have an orthonormal set of vectors
$\{v_1,\dots,v_d\}$, and $\mathbb{S}^0(v_1,\dots,v_d)$ is a 0-dimensional sphere, that is, a set of two points $\{u_1, u_2\}$. We randomly select $v_{d+1}$ as one of those two points of $\mathbb{S}^0(v_1,\dots,v_d)$, denoted $v_{d+1} \sim \sigma_1$, where $\sigma_1$ is the uniform distribution on $\mathbb{S}^0(v_1,\dots,v_d)$ (that is, $\sigma_1 = 0.5\,\delta_{u_1} + 0.5\,\delta_{u_2}$).

Thus, the uniform probability measure $\omega$ on $O(d+1)$ can be described in the following way. Consider an arbitrary Borel set $A \subseteq O(d+1)$, and let $A_i := \{v \in \mathbb{S}^d : v \text{ is the } i\text{-th row of } R,\ R \in A\}$ for $i = 1,2,\dots,d+1$; then
$$\omega(A) = \int_{v_1\in A_1,\dots,v_{d+1}\in A_{d+1}} d\sigma_1(v_{d+1})\,d\sigma_2(v_d)\cdots d\sigma_d(v_2)\,d\sigma_{d+1}(v_1), \quad (40)$$
where we use the above notation; that is, $\sigma_{d+1}$ is the uniform distribution on $\mathbb{S}^d$, and $\sigma_{d+1-k}$ is the uniform distribution on $\mathbb{S}^{d-k}(v_1,\dots,v_k)$ for $1 \le k \le d$ (i.e., $\sigma_d$ is the uniform distribution on $\mathbb{S}^{d-1}(v_1)$, ..., $\sigma_2$ is the uniform distribution on $\mathbb{S}^1(v_1,\dots,v_{d-1})$, and $\sigma_1$ is the uniform distribution on $\mathbb{S}^0(v_1,\dots,v_d)$).

In particular, $\omega$ is uniform in the sense that its density function is constant. Indeed, let $f_{\sigma_{d+1}},\dots,f_{\sigma_2}$ denote the density functions of $\sigma_{d+1},\dots,\sigma_2$ (which are all constant, equal to the reciprocal of the surface area of the corresponding sphere). Then, letting $f_\omega$ denote the density function of the probability measure $\omega$, for each $R = [v_1^T, v_2^T, \dots, v_{d+1}^T] \in O(d+1)$,
$$f_\omega(R) = f_{\sigma_{d+1}}(v_1)\cdots f_{\sigma_2}(v_d)\,f_{\sigma_1}(v_{d+1}), \quad (41)$$
which is a constant independent of $R$ (a product of reciprocals of the surface areas of the corresponding spheres). Thus, we call $\omega$ the uniform distribution. We mention that it is also called a Haar measure for the special orthogonal group $SO(d+1)$. Finally, we can normalize it so that we have a probability distribution.
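In practice, a sample from $\omega$ can be drawn without the explicit recursion above. The following is a minimal sketch (ours) of a standard and equivalent shortcut: QR-factorize a matrix with i.i.d. standard Gaussian entries and correct the column signs, which yields exactly the Haar distribution on $O(d+1)$.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample R ~ Haar on O(n): QR of a Gaussian matrix with the sign of
    the diagonal of r absorbed into the columns of q (standard recipe,
    equivalent to the recursive sphere-by-sphere construction above)."""
    z = rng.normal(size=(n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))   # rescale column j by sign(r_jj)

rng = np.random.default_rng(3)
R = haar_orthogonal(4, rng)           # a random element of O(4)
print(np.allclose(R @ R.T, np.eye(4)))  # True: R is orthogonal
```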
Next, we discuss the proof of Theorem 3.4. We first introduce the following lemma.

Lemma E.1 Consider $s_n = [0,\dots,0,1] \in \mathbb{R}^{d+1}$, the North Pole of $\mathbb{S}^d$. Let $s_1 \in \mathbb{S}^d$ and $A = \{R \in O(d+1) : Rs_1 = s_n\}$. Then $A = \{R \in O(d+1) : \text{the } (d+1)\text{-th row of } R \text{ is } s_1\}$. Furthermore, $\omega(A) = 0$.

Proof: Let $B = \{R \in O(d+1) : \text{the } (d+1)\text{-th row of } R \text{ is } s_1\}$. It is straightforward to verify $B \subseteq A$. For the other direction, for any $R = [v_1^T,\dots,v_{d+1}^T] \in A$, $Rs_1 = s_n$ gives $v_i^T s_1 = 0$ for $i \neq d+1$ and $v_{d+1}^T s_1 = 1$. Since $v_{d+1}$ and $s_1$ are unit vectors, $v_{d+1}^T s_1 = 1$ forces $v_{d+1} = s_1$. Thus we have $A \subseteq B$. Next, we show $\omega(A) = 0$. Let $A' = \{R \in O(d+1) : \text{the first row of } R \text{ is } s_1\}$. We have $\omega(A') = \omega(A)$. Then $A'_1 = \{s_1\}$, and by (40),
$$\omega(A') = \int_{v_1\in\{s_1\},\,v_2,\dots,v_{d+1}\in\mathbb{S}^d} d\sigma_1(v_{d+1})\cdots d\sigma_d(v_2)\,d\sigma_{d+1}(v_1) = 0,$$
since $\sigma_{d+1}(\{s_1\}) = 0$, which completes the proof. $\square$

Proof [of Theorem 3.4]: We first claim that (17) is well-defined. Let $S(\mu,\nu) = \{s \in \mathbb{S}^d : \min(\mu(\{s\}), \nu(\{s\})) > 0\}$. Since $\mu, \nu$ are probability measures, $S(\mu,\nu)$ is at most countable. For each $s \in S(\mu,\nu)$, let $A(s) = \{R \in O(d+1) : Rs = s_n\}$. Note that $S3W_{G,p}(R_\#\mu, R_\#\nu)$ is well-defined for each $R \in O(d+1)\setminus A(\mu,\nu)$, where $A(\mu,\nu) = \bigcup_{s\in S(\mu,\nu)} A(s)$. By Lemma E.1,
$$\omega(A(\mu,\nu)) = \omega\Big(\bigcup_{s\in S(\mu,\nu)} A(s)\Big) \le \sum_{s\in S(\mu,\nu)} \omega(A(s)) = 0.$$
Thus, we have that
$$\text{RI-}S3W_{G,p}(\mu,\nu) := \mathbb{E}_{R\sim\omega}[S3W_{G,p}(R_\#\mu, R_\#\nu)] = \int_{O(d+1)\setminus A(\mu,\nu)} S3W_{G,p}(R_\#\mu, R_\#\nu)\,d\omega(R)$$
is well-defined. Now we verify its properties. By Theorem D.3, $S3W_{G,p}$ is non-negative and symmetric; thus $\text{RI-}S3W_{G,p}(\cdot,\cdot)$ is also non-negative and symmetric. For the triangle inequality, let $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\mathbb{S}^d)$. Similarly to before, let $A(\mu_1,\mu_2,\mu_3) = \bigcup_{s\in S(\mu_1,\mu_2,\mu_3)} A(s)$. Since $\omega(A(\mu_1,\mu_2,\mu_3)) = 0$, we have
$$\text{RI-}S3W_{G,p}(\mu_1,\mu_3) = \int_{O(d+1)} S3W_{G,p}(R_\#\mu_1, R_\#\mu_3)\,d\omega(R) = \int_{O(d+1)\setminus A(\mu_1,\mu_2,\mu_3)} S3W_{G,p}(R_\#\mu_1, R_\#\mu_3)\,d\omega(R)$$
$$\le \int_{O(d+1)\setminus A(\mu_1,\mu_2,\mu_3)} \left(S3W_{G,p}(R_\#\mu_1, R_\#\mu_2) + S3W_{G,p}(R_\#\mu_2, R_\#\mu_3)\right)d\omega(R) \quad \text{(triangle inequality)}$$
$$= \int_{O(d+1)\setminus A(\mu_1,\mu_2,\mu_3)} S3W_{G,p}(R_\#\mu_1, R_\#\mu_2)\,d\omega(R) + \int_{O(d+1)\setminus A(\mu_1,\mu_2,\mu_3)} S3W_{G,p}(R_\#\mu_2, R_\#\mu_3)\,d\omega(R)$$
$$= \int_{O(d+1)} S3W_{G,p}(R_\#\mu_1, R_\#\mu_2)\,d\omega(R) + \int_{O(d+1)} S3W_{G,p}(R_\#\mu_2, R_\#\mu_3)\,d\omega(R)$$
$$= \text{RI-}S3W_{G,p}(\mu_1,\mu_2) + \text{RI-}S3W_{G,p}(\mu_2,\mu_3),$$
where the second and second-to-last equalities hold because $\omega(A(\mu_1,\mu_2,\mu_3)) = 0$. Thus, $\text{RI-}S3W_{G,p}(\cdot,\cdot)$ is a well-defined pseudo-metric. Similarly, one can prove that $\text{RI-}S3W_{H,p}(\cdot,\cdot)$ is a well-defined pseudo-metric.

Finally, it remains to show that $\mu = \nu$ if $\text{RI-}S3W_{H,p}(\mu,\nu) = 0$. If that holds, then for $\omega$-a.e. $R \in O(d+1)\setminus A(\mu,\nu)$ we have $S3W_{H,p}(R_\#\mu, R_\#\nu) = 0$. Pick one such rotation $R$. From Theorem D.3 and the fact that $R_\#\mu, R_\#\nu \in \mathcal{P}(\mathbb{S}^d\setminus\{s_n\})$, we have $R_\#\mu = R_\#\nu$. Since $\mathbb{R}^{d+1} \ni v \mapsto Rv \in \mathbb{R}^{d+1}$ is a bijection, we have $\mu = \nu$. $\square$

F. Funk–Radon Transform and Stereographic Spherical Radon Transform

The Funk–Radon transform (Quinto, 1982; Helgason et al., 2011; Quellmalz, 2017; 2020), also known as the Minkowski–Funk transform or the spherical Radon transform, is a classical and significant mathematical tool in integral geometry, with profound applications in image reconstruction and analysis. It plays a crucial role in fields like tomography, allowing the reconstruction of images from projection data. In this section, we discuss the relation between the Funk transform and the Stereographic Spherical Radon Transform.

First, we review some basic concepts about the Funk transform. In the sphere $\mathbb{S}^d \subset \mathbb{R}^{d+1}$ centered at the origin with radius 1, every $(d-1)$-dimensional sub-sphere ("circle") is the intersection of $\mathbb{S}^d$ with a $d$-dimensional hyperplane, that is,
$$S(s,t) = \{s' \in \mathbb{S}^d \mid \langle s, s'\rangle = t\},$$
where $s \in \mathbb{S}^d$ is normal to the hyperplane and $t \in [-1,1]$ is the signed distance of the hyperplane to the origin. For example, if $t = 1$, $S(s,t)$ consists of only the singleton $\{s\}$. As another example, when $t = 0$, for any $s \in \mathbb{S}^d$ the sub-sphere $S(s,0)$ is a great circle (a sub-sphere of dimension $d-1$ in $\mathbb{R}^{d+1}$ centered at the origin and with radius 1).

The Spherical Transform or Spherical Mean Operator (Quellmalz, 2017) of a function $f$ defined on $\mathbb{S}^d$ is formally defined as a function on $\mathbb{S}^d\times[-1,1]$ by the surface integral
$$\mathcal{U}f(s,t) := \frac{1}{|S(s,t)|_{d-1}} \int_{S(s,t)} f\,dS(s,t) = \frac{1}{|S(s,t)|_{d-1}} \int_{\mathbb{R}^{d+1}} f(s')\,\delta(t - \langle s, s'\rangle)\,ds', \quad (s,t)\in\mathbb{S}^d\times[-1,1], \quad (42)$$
where $|S(s,t)|_{d-1}$ denotes the surface area of the sub-sphere $S(s,t)$, and $dS(s,t)$ is the induced volume form on the surface $S(s,t)$. In the special case $d = 2$, given $f$ defined on $\mathbb{S}^2 \subset \mathbb{R}^3$, the integral in (42) is carried out with respect to the arclength $dS$ of the circle $S(s,t)$, that is,
$$\mathcal{U}f(s,t) = \frac{1}{2\pi\,\mathrm{radius}(S(s,t))} \int_{S(s,t)} f\,dS(s,t) := \frac{1}{2\pi\,\mathrm{radius}(S(s,t))} \int_a^b f(r(u))\,\|r'(u)\|\,du,$$
where $\mathrm{radius}(S(s,t))$ denotes the radius of the circle $S(s,t)$, $r(u)$ for $a \le u \le b$ is a parametrization of $S(s,t)$, and $\|r'(u)\|$ is the magnitude of the tangent vector $r'(u)$ to $S(s,t)$. Besides, in the special case $d = 3$, given $f$ defined on $\mathbb{S}^3 \subset \mathbb{R}^4$, we have that $S(s,t)$ is a 2-dimensional sphere (whose surface area is $4\pi\,\mathrm{radius}(S(s,t))^2$), and we get
$$\mathcal{U}f(s,t) = \frac{1}{4\pi\,\mathrm{radius}(S(s,t))^2} \int_{S(s,t)} f\,dS(s,t) := \frac{1}{4\pi\,\mathrm{radius}(S(s,t))^2} \iint f(r(u,v))\,\|r_u\times r_v\|\,du\,dv,$$
where $\mathrm{radius}(S(s,t))$ denotes the radius of the sphere $S(s,t)$, $r(u,v)$ is a parametrization of $S(s,t)$, and $\|r_u\times r_v\|$ is the magnitude of the cross product of the partial derivatives $r_u$ and $r_v$, which is a normal vector to $S(s,t)$.
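As a quick numerical companion to (42), the following is a minimal Monte Carlo sketch (ours): points of the subsphere $S(s,t)$ are written as $t\,s + \sqrt{1-t^2}\,w$ with $w$ uniform on the unit sphere of $s^\perp$, and $f$ is averaged over them.

```python
import numpy as np

def spherical_mean(f, s, t, n=20000, rng=None):
    """Monte Carlo estimate of U f(s, t): average f over the subsphere
    S(s, t) = {s' in S^d : <s, s'> = t}."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g = rng.normal(size=(n, s.size))
    w = g - np.outer(g @ s, s)                     # project onto s-perp
    w /= np.linalg.norm(w, axis=1, keepdims=True)  # uniform on sphere of s-perp
    pts = t * s + np.sqrt(1.0 - t * t) * w         # samples on S(s, t)
    return f(pts).mean()

s = np.array([0.0, 0.0, 1.0])                      # a point on S^2
# Prints exactly 0.25: the subsphere S(s, 0.5) lies at constant height 0.5.
print(spherical_mean(lambda p: p[:, 2] ** 2, s, 0.5))
```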
The inversion of this Spherical Mean Operator from all of its values is an over-determined problem (for instance, $\mathcal{U}f(s,1) = f(s)$ for all $s \in \mathbb{S}^d$). In applications, one has access to restrictions of $\mathcal{U}f(s,t)$ to specific families of sub-spheres $S(s,t)$, and this yields different transforms. One important example arises when we restrict $\mathcal{U}f(s,t)$ to great circles $S(s,0)$, obtaining the classical Funk–Radon transform $\mathcal{F}_1$, that is,
$$\mathcal{F}_1 f(s) := \mathcal{U}f(s,0), \quad s \in \mathbb{S}^d. \quad (43)$$
Another example, studied in (Abouelaz & Daher, 1993; Gindikin et al., 1994a; Helgason et al., 2011), arises when considering the restriction to the family of sub-spheres containing the North Pole $s_n = [0,\dots,0,1] \in \mathbb{R}^{d+1}$.

Similarly to (Quellmalz, 2017; 2020), we will be interested in a generalized Funk–Radon transform given by the following restriction of $\mathcal{U}$. Let $s = [s_1,\dots,s_{d+1}] \in \mathbb{S}^d \subset \mathbb{R}^{d+1}$ and $\xi \in [-1,1]$. Consider the point $[0,\dots,0,\xi s_{d+1}] \in \mathbb{R}^{d+1}$, located on the same axis as the North Pole $s_n = [0,\dots,0,1]$ and inside the sphere $\mathbb{S}^d$.³ Then, we restrict $\mathcal{U}f$ to the integrals over the intersections of the sphere $\mathbb{S}^d$ with hyperplanes containing the point $[0,\dots,0,\xi s_{d+1}]$, that is,
$$\zeta_\xi^s := S(s,\xi s_{d+1}) = \{s' \in \mathbb{S}^d \mid \langle s, s'\rangle = \xi s_{d+1}\}, \quad \xi\in[-1,1],\ s\in\mathbb{S}^d.$$
Then, the generalized Funk–Radon transform is defined by
$$\mathcal{U}_\xi f(s) := \mathcal{U}f(s,\xi s_{d+1}) = \frac{1}{|\zeta_\xi^s|_{d-1}} \int_{\zeta_\xi^s} f\,d\zeta_\xi^s = \frac{1}{|\zeta_\xi^s|_{d-1}} \int_{\mathbb{R}^{d+1}} f(s')\,\delta(\xi s_{d+1} - \langle s, s'\rangle)\,ds', \quad (44)$$
where $|\zeta_\xi^s|_{d-1}$ is the volume ("surface area") of the sub-sphere (circle, if $d = 2$) $\zeta_\xi^s$, and $d\zeta_\xi^s$ denotes integration over the surface (path, if $d = 2$) $\zeta_\xi^s$.

³In (Quellmalz, 2017; 2020), $\xi \in [0,1)$, and so $[0,\dots,0,\xi s_{d+1}]$ lies strictly inside the sphere.

Remark F.1 Note that the generalized Funk–Radon transform has been further generalized by (Rubin, 2022) under the name spherical slice transform. In particular, choose $\alpha \in \mathbb{R}^{d+1}$ and let $\mathcal{T}(\alpha,k)$ denote the set of all $k$-dimensional affine planes that pass through $\alpha$ and intersect $\mathbb{S}^d$:
$$\mathcal{T}(\alpha,k) = \{H(a) = \alpha + H : H \text{ is a subspace},\ \dim(H) = k,\ H(a)\cap\mathbb{S}^d \neq \emptyset\}.$$
Pick $H \in \mathcal{T}(\alpha,k)$; then $H\cap\mathbb{S}^d$ is a $(k-1)$-dimensional sub-sphere (or a point). The spherical slice transform of a function $f \in L^1(\mathbb{S}^d)$ is defined as
$$\mathcal{SS}_\alpha f(H) = \int_{H\cap\mathbb{S}^d} f(s)\,d(H\cap\mathbb{S}^d), \quad H\in\mathcal{T}(\alpha,k). \quad (45)$$
When $k = d$, pick $s \in \mathbb{S}^d$ and $\xi \in [-1,1]$, and define the affine plane $H(s,\xi s_n) = \xi s_n + s^\perp$; we have $H(s,\xi s_n) \in \mathcal{T}(\xi s_n, d)$ and $\zeta_\xi^s = H(s,\xi s_n)\cap\mathbb{S}^d$. Thus, $\mathcal{SS}_{\xi s_n}f(H(s,\xi s_n)) = |\zeta_\xi^s|_{d-1}\,\mathcal{U}_\xi(f)(s)$. That is, up to normalization, the generalized Funk transform (44) is a special case of the spherical slice transform (45).

As said before, the goal of this section is to discuss the relationship between the generalized Funk–Radon transform (44) and the Stereographic Radon transform. To do so, let us first consider the stereographic projection $\phi : \mathbb{S}^d\setminus\{s_n\} \to \mathbb{R}^d$ given by the formula
$$\phi(s) = \frac{s[1:d]}{1 - s_{d+1}}, \quad s = [s_1,\dots,s_d,s_{d+1}] \in \mathbb{S}^d, \quad (46)$$
where $s[1:d] := [s_1,\dots,s_d] \in \mathbb{R}^d$. This corresponds to the intersection with the equatorial plane $x_{d+1} = 0$, rather than (1), which corresponds to the projection onto the plane $x_{d+1} = -1$ (and carries an extra factor of 2 with respect to (46)). It is straightforward to verify that the inverse of $\phi$, denoted $\phi^{-1}$, is as follows:
$$\phi^{-1}(x) = [(1-s_{d+1})x,\ s_{d+1}], \quad \text{where } s_{d+1} = \frac{\|x\|^2-1}{\|x\|^2+1}, \quad x\in\mathbb{R}^d. \quad (47)$$
See Figures 6, 7 and 8.

Figure 6. Visualization for $d = 1$ of the stereographic projection $\phi$ defined by formula (1). The ambient space $\mathbb{R}^{d+1}$ is depicted in green, while the unit sphere (circle) $\mathbb{S}^d$ is in blue. The North and South Poles are labeled as $s_n$ and $s_0$, respectively.
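The following is a minimal sketch (ours) of a round-trip sanity check for the pair (46)/(47): $\phi^{-1}(\phi(s)) = s$ for points $s \neq s_n$.

```python
import numpy as np

def phi(s):
    """Equatorial stereographic projection, eq. (46), applied row-wise."""
    return s[:, :-1] / (1.0 - s[:, -1:])

def phi_inv(x):
    """Inverse stereographic projection, eq. (47), applied row-wise."""
    n2 = np.sum(x * x, axis=1, keepdims=True)
    s_last = (n2 - 1.0) / (n2 + 1.0)
    return np.concatenate([(1.0 - s_last) * x, s_last], axis=1)

rng = np.random.default_rng(4)
s = rng.normal(size=(5, 3))                       # random points on S^2
s /= np.linalg.norm(s, axis=1, keepdims=True)
print(np.allclose(phi_inv(phi(s)), s))            # True
```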
The projected space $\mathbb{R}^d = \phi(\mathbb{S}^d\setminus\{s_n\})$ is highlighted in red, corresponding to the plane (line) in $\mathbb{R}^{d+1}$ of all points with last coordinate equal to $-1$. Two points, $\theta_1$ and $\theta_2$ (blue dots), on $\mathbb{S}^d$ are projected through $\phi$, obtaining the points $\phi(\theta_1)$ and $\phi(\theta_2)$ (red dots).

Figure 7. Visualization for $d = 1$ of the stereographic projection $\phi$ defined by formula (46). The ambient space $\mathbb{R}^{d+1}$ is depicted in green, while the unit sphere (circle) $\mathbb{S}^d$ is in blue. The North and South Poles are labeled as $s_n$ and $s_0$, respectively. The projected space $\mathbb{R}^d = \phi(\mathbb{S}^d\setminus\{s_n\})$ coincides with the hyperplane $x_{d+1} = 0$, which in the figure is nothing but the horizontal axis. Two points, $\theta_1$ and $\theta_2$ (blue dots), on $\mathbb{S}^d$ are projected through $\phi$, obtaining the points $\phi(\theta_1)$ and $\phi(\theta_2)$ (red dots). The South Pole is projected to the origin of $\mathbb{R}^d$, and the points on the Equator are fixed points of this stereographic projection.

Figure 8. Visualization for $d = 2$ of the stereographic projection $\phi$ defined by formula (46). The ambient space is the 3D space $\mathbb{R}^3$. The sphere $\mathbb{S}^2$ is depicted in blue. The projected space $\mathbb{R}^2 = \phi(\mathbb{S}^2\setminus\{s_n\})$ is depicted in grey. The points on the Equator are fixed points of this stereographic projection, and circles parallel to the Equator are mapped to circles in the plane, as shown in the plot: if the height of the circle is positive, its projection is a circle with radius greater than 1, as in the case depicted in the figure; if the height is negative, its projection is a circle with radius smaller than 1.

When we set $\xi = 1$, we have the following relations between the sub-spheres $\zeta_1^s$ and the hyperplanes $H(t,\theta)$ through the stereographic projection $\phi$.

Proposition F.2 Consider the sub-sphere (circle) $\zeta_1^s = \{s' \in \mathbb{S}^d \mid \langle s', s\rangle = s_{d+1}\}$, where $s = [s_1,\dots,s_d,s_{d+1}] \in \mathbb{S}^d$ is such that $s \neq s_n = [0,\dots,0,1]$. Then $\phi(\zeta_1^s)$ is a hyperplane in $\mathbb{R}^d$; precisely,
$$\phi(\zeta_1^s) = \{x\in\mathbb{R}^d \mid \langle x,\theta\rangle = t\}, \quad \text{where } \theta = \frac{s[1:d]}{\|s[1:d]\|} \text{ and } t = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}.$$
Conversely, given the hyperplane $H(t,\theta) = \{x\in\mathbb{R}^d \mid \langle x,\theta\rangle = t\}$ for $\theta\in\mathbb{S}^{d-1}$ and $t\in\mathbb{R}$, we have $\phi^{-1}(H(t,\theta)) = \zeta_1^s$, where $s = [\sqrt{1-s_{d+1}^2}\,\theta,\ s_{d+1}]$ with $s_{d+1} = \mathrm{sign}(t)\sqrt{\frac{t^2}{t^2+1}}$.

Proof: Let $s' \in \zeta_1^s$, and let $x' = \phi(s') = \frac{s'[1:d]}{1-s'_{d+1}}$. Then we can write $s' = [(1-s'_{d+1})x',\ s'_{d+1}]$. Besides, since $s_1^2+\dots+s_d^2+s_{d+1}^2 = 1$, by defining $\theta = \frac{s[1:d]}{\|s[1:d]\|}$ we can write $s = [\sqrt{1-s_{d+1}^2}\,\theta,\ s_{d+1}]$. Then we have
$$s_{d+1} = \langle s', s\rangle = s'_{d+1}s_{d+1} + (1-s'_{d+1})\sqrt{1-s_{d+1}^2}\,\langle x',\theta\rangle. \quad (48)$$
That is, $\langle x',\theta\rangle = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}$, so
$$\phi(\zeta_1^s) \subseteq \left\{x\in\mathbb{R}^d \;\Big|\; \langle x,\theta\rangle = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}\right\}.$$
For the converse, given $\theta\in\mathbb{S}^{d-1}$ and $t\in\mathbb{R}$, consider $x\in H(t,\theta)$. By using (47), let
$$s' = \phi^{-1}(x) = [(1-s'_{d+1})x,\ s'_{d+1}], \quad \text{where } s'_{d+1} = \frac{\|x\|^2-1}{\|x\|^2+1}.$$
We set $s = [\sqrt{1-s_{d+1}^2}\,\theta,\ s_{d+1}]$ with $s_{d+1} = \mathrm{sign}(t)\sqrt{\frac{t^2}{t^2+1}}$. Then we have $s\in\mathbb{S}^d$ because $(1-s_{d+1}^2)\underbrace{\|\theta\|^2}_{=1} + s_{d+1}^2 = 1$, and also $t = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}$. Thus, we have $\langle x,\theta\rangle = t = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}$, and by reversing the above computation (see (48)), this implies that $s_{d+1} = \langle s', s\rangle$, and so $s' \in \zeta_1^s$. Combining this with the fact that $\phi$ is invertible, we have $\phi^{-1}(H(t,\theta)) = \zeta_1^s$. $\square$

Lemma F.3 Let $s\in\mathbb{S}^d$ be different from the North Pole $s_n = [0,\dots,0,1]$ and the South Pole $s_0 = [0,\dots,0,-1]$, and consider $S_0$ the unique great circle (meridian) that passes through the North Pole $s_n$, the South Pole $s_0$, and the point $s$. For each $\hat s\in S_0$, we define the functions
$$\theta(\hat s) := \frac{\hat s[1:d]}{\|\hat s[1:d]\|}, \quad t(\hat s) := \frac{\hat s_{d+1}}{\sqrt{1-\hat s_{d+1}^2}}.$$
Then, $\theta(\hat s) = \theta(s)$ for all $\hat s\in S_0$. In addition, $t : S_0\setminus\{s_n\}\to\mathbb{R}$ is a bijection.
Proof: Without loss of generality, we can suppose $s = [\cos\alpha, 0,\dots,0,\sin\alpha]$ for some $\alpha\in(-\frac{3\pi}{2},\frac{\pi}{2})$. Thus, $\theta(s) = [1,0,\dots,0]$ and
$$S_0 = \left\{\hat s_{\hat\alpha} = [\cos\hat\alpha, 0,\dots,0,\sin\hat\alpha] : \hat\alpha\in\left(-\tfrac{3\pi}{2},\tfrac{\pi}{2}\right]\right\}.$$
Pick $\hat s_{\hat\alpha}\in S_0\setminus\{s_n\}$; we have $\theta(\hat s_{\hat\alpha}) = [1,0,\dots,0]$. In addition, $t(\hat s_{\hat\alpha}) = \frac{\sin\hat\alpha}{|\cos\hat\alpha|}$ is a one-to-one mapping. $\square$

Notice that for $\mathbb{S}^2 \subset \mathbb{R}^3$, if we visualize $\mathbb{S}^2$ as the Earth and pick the point $s\in\mathbb{S}^2$ to be the Royal Observatory in the town of Greenwich, London, England, then $S_0$ is the Greenwich meridian (or "prime meridian").

Finally, we discuss the relation between the variant of the Funk transform defined in (44) with $\xi = 1$ (i.e., $\mathcal{U}_1$) and the Stereographic Spherical Radon transform defined in (10) as $\mathcal{SR}(\mu) = \mathcal{R}(\phi_\#\mu)$ for any Radon measure $\mu\in\mathcal{M}(\mathbb{S}^d)$ that does not assign mass to $\{s_n\}$. In particular, we can define the Stereographic Spherical Radon transform for $L^1$ functions $f : \mathbb{S}^d\setminus\{s_n\}\to\mathbb{R}$, as they can be viewed as densities of absolutely continuous measures. Indeed, let $\mu$ be the measure corresponding to $f$, that is,
$$\int_{\mathbb{S}^d\setminus\{s_n\}} \psi_0(s)\,f(s)\,d\sigma_{d+1}(s) = \int_{\mathbb{S}^d\setminus\{s_n\}} \psi_0(s)\,d\mu(s), \quad \forall\psi_0\in C(\mathbb{S}^d),$$
where $\sigma_{d+1}$ is the uniform measure on $\mathbb{S}^d\subset\mathbb{R}^{d+1}$. Let $\hat f$ denote the density of $\phi_\#\mu$. By the definition of the push-forward measure, we have
$$\int_{\mathbb{S}^d\setminus\{s_n\}} \psi(\phi(s))\,f(s)\,d\sigma_{d+1}(s) = \int_{\mathbb{S}^d\setminus\{s_n\}} \psi(\phi(s))\,d\mu(s) = \int_{\mathbb{R}^d} \psi(x)\,d\phi_\#\mu(x) = \int_{\mathbb{R}^d} \psi(x)\,\hat f(x)\,dx \quad (49)$$
for all test functions $\psi\in C_0(\mathbb{R}^d)$ that decay to zero. In short, we usually write $d\mu = f\,d\sigma_{d+1}$ and $d\phi_\#\mu = \hat f\,dx$, where $dx$ represents the Lebesgue measure on $\mathbb{R}^d$. Thus, $\mathcal{SR}(f)$ is defined as
$$\mathcal{SR}(f)(t,\theta) = \mathcal{R}(\hat f)(t,\theta) = \int_{\mathbb{R}^d} \hat f(x)\,\delta(t-\langle x,\theta\rangle)\,dx = \int_{H(t,\theta)} \hat f\,dH(t,\theta), \quad (50)$$
where $dH(t,\theta)$ is the surface/line integral over the hyperplane/plane/line $H(t,\theta)$.

Proposition F.4 Let $f\in L^1(\mathbb{S}^d\setminus\{s_n\})$, and let $s\in\mathbb{S}^d$. Then
$$|\zeta_1^s|_{d-1}\,\mathcal{U}_1 f(s) = \mathcal{SR}f(t,\theta), \quad \theta = \frac{s[1:d]}{\|s[1:d]\|},\ t = \frac{s_{d+1}}{\sqrt{1-s_{d+1}^2}}.$$

Proof: For convenience, we extend $f$ to $\mathbb{S}^d$ by setting $f(s_n) = 0$. Without loss of generality, we suppose $s = [\cos\alpha, 0,\dots,0,\sin\alpha]$ with $\alpha\in(-\frac{3\pi}{2},\frac{\pi}{2})$. As before, let $\mu$ be the measure having density $f$ and let $\hat f$ denote the density of $\phi_\#\mu$. Since $\phi$ is invertible, it is straightforward to verify that $(\phi^{-1})_\#(\phi_\#\mu) = \mu$, that is,
$$\int_{\mathbb{R}^d} \psi_0(\phi^{-1}(x))\,\hat f(x)\,dx = \int_{\mathbb{S}^d} \psi_0(s')\,f(s')\,d\sigma_{d+1}(s'), \quad \forall\psi_0\in C(\mathbb{S}^d). \quad (51)$$
In short, $d\mu = f\,d\sigma_{d+1}$ and $d\phi_\#\mu = \hat f\,dx$. Let $S_0$ denote the unique great circle (meridian) that passes through the North Pole $s_n = [0,\dots,0,1]$, the South Pole $s_0 = [0,\dots,0,-1]$, and the point $s$. It is straightforward to verify that
$$\mathbb{S}^d = \bigcup_{\hat s\in S_0} \zeta_1^{\hat s}$$
and that for any distinct $\hat s_1, \hat s_2\in S_0$, $\zeta_1^{\hat s_1}\cap\zeta_1^{\hat s_2} = \{s_n\}$. Similarly, $\mathbb{R}^d = \bigcup_{\hat t\in\mathbb{R}} H(\hat t,\theta)$, and for any distinct $\hat t_1, \hat t_2\in\mathbb{R}$, $H(\hat t_1,\theta)\cap H(\hat t_2,\theta) = \emptyset$. Thus, we can slice and integrate in the following way:
$$\int_{\mathbb{S}^d} \psi_0\,d\sigma_{d+1} = \int_{S_0}\int_{\zeta_1^{\hat s}} \psi_0\,d\zeta_1^{\hat s}\,dS_0(\hat s), \quad (52)$$
where $d\zeta_1^{\hat s}$ denotes the surface/path integral over $\zeta_1^{\hat s}$, and $dS_0$ denotes the path integral over $S_0$. Also, by using Proposition F.2, we can slice and integrate in the following way:
$$\int_{\mathbb{R}^d} \psi(x)\,dx = \int_{S_0}\int_{H(t(\hat s),\theta(\hat s))} \psi\,dH(t(\hat s),\theta(\hat s))\,d\hat s, \quad (53)$$
where for each $\hat s\in S_0$ we define $\theta(\hat s) := \frac{\hat s[1:d]}{\|\hat s[1:d]\|}$ and $t(\hat s) := \frac{\hat s_{d+1}}{\sqrt{1-\hat s_{d+1}^2}}$, and where $dH(t(\hat s),\theta(\hat s))$ denotes the surface/line integral over $H(t(\hat s),\theta(\hat s))$, and $d\hat s$ denotes the integral over $\mathbb{R} = \phi(S_0\setminus\{s_n\})$. (Notice that if $d = 2$, (53) is nothing but the Fubini–Tonelli expression $\int_{\mathbb{R}^2}\psi(z)\,dz = \int_{\mathbb{R}}\int_{\mathbb{R}}\psi(x,y)\,dx\,dy$, where $dz$ is the Lebesgue measure on $\mathbb{R}^2$ and $dx, dy$ the Lebesgue measure on $\mathbb{R}$.) Refer to Figure 9 for an illustrative visualization.
Figure 9. Representation of the unit sphere $\mathbb{S}^2$ in $\mathbb{R}^3$, featuring the North and South Poles ($s_n$ and $s_0$, in blue). The sketch includes five great circles passing through a point $s$: four intersecting $s_n$ and one meridian labeled $S_0$. The stereographic projections of these five great circles are depicted as straight lines, which lie on the same plane. For the stereographic projection $\phi$ we used formula (1). The four circles passing through both $s_n$ and $s$ project onto parallel lines. One such circle is highlighted in red and labeled $\zeta_1^s$, with its corresponding projection also in red, labeled $H(t(s),\theta(s))$. These four parallel lines are perpendicular to the line labeled $\phi(S_0)$, representing the stereographic projection of the meridian $S_0$.

By using Lemma F.3 combined with identity (51), together with (52) and (53), we have
$$\int_{\mathbb{R}^d} \psi_0(\phi^{-1}(x))\,\hat f(x)\,dx = \int_{S_0}\int_{H(t(\hat s),\theta(\hat s))} (\psi_0\circ\phi^{-1})\,\hat f\,dH(t(\hat s),\theta(\hat s))\,d\hat s$$
and
$$\int_{\mathbb{S}^d} \psi_0(s')\,f(s')\,d\sigma_{d+1}(s') = \int_{S_0}\int_{\zeta_1^{\hat s}} \psi_0\,f\,d\zeta_1^{\hat s}\,dS_0(\hat s).$$
Since, by (51), the above two left-hand sides agree for all $\psi_0\in C(\mathbb{S}^d)$, by setting $\psi_0\equiv 1$ we obtain
$$\mathcal{SR}f(t,\theta) = \int_{H(t,\theta)} \hat f\,dH(t,\theta) = \int_{\zeta_1^s} f\,d\zeta_1^s = |\zeta_1^s|_{d-1}\,\mathcal{U}_1 f(s),$$
and we complete the proof. $\square$

G. Related Work: Vertical Slice Transform, Semicircle Transform, and Related Wasserstein Distances

This section introduces the vertical slice transform (Gindikin et al., 1994b), the semicircle transform (Groemer, 1998), and the related optimal transport distances (Quellmalz et al., 2023).

G.1. Vertical Slice Transform and Vertical Sliced Wasserstein Distance

We first introduce the vertical slice transform technique introduced in (Quellmalz et al., 2023), which is originally defined only on $\mathbb{S}^2$. Next, we extend it to $\mathbb{S}^d$ for $d \ge 2$. Finally, we discuss the vertical sliced Wasserstein distance and its relation to the classical sliced Wasserstein distance.

Parametrization of $\mathbb{S}^2$. Each point $s\in\mathbb{S}^2$ can be parametrized as
$$s = [\cos\alpha\sin\beta,\ \sin\alpha\sin\beta,\ \cos\beta], \quad (54)$$
where $\alpha\in[0,2\pi)$, $\beta\in[0,\pi]$. When $s$ is the north (or south) pole, i.e., $s = [0,0,1]$ (or $s = [0,0,-1]$), the above parametrization is not uniquely determined; in this case, we set $\alpha = 0, \beta = 0$ (or $\alpha = 0, \beta = \pi$). Otherwise, $\alpha$ and $\beta$ are uniquely determined. In particular, $\alpha$ is called the azimuth angle, denoted $\alpha = \mathrm{azi}(s)$, and $\beta$ is called the zenith angle, denoted $\beta = \mathrm{zen}(s)$. In summary, the following mapping is bijective:
$$\Phi : \big((0,2\pi)\times(0,\pi)\big)\cup\big(\{0\}\times\{0,\pi\}\big) \to \mathbb{S}^2, \quad (\alpha,\beta)\mapsto\Phi(\alpha,\beta) = s = [\cos\alpha\sin\beta,\ \sin\alpha\sin\beta,\ \cos\beta]. \quad (55)$$

Vertical Slice Transform and Vertical Sliced Wasserstein Distance. The equator of $\mathbb{S}^2$ can be defined as
$$E := \{\Phi(\alpha,\tfrac{\pi}{2}) = [\cos\alpha,\sin\alpha,0] : \alpha\in[0,2\pi)\}. \quad (56)$$
Each $\theta = \Phi(\alpha,\frac{\pi}{2})\in E$ and $t\in[-1,1]$ determine a vertical circle
$$\mathrm{VS}(\alpha,t) := \mathrm{VS}(\theta,t) := \{s\in\mathbb{S}^2 : \langle\theta,s\rangle = t\}, \quad (57)$$
characterizing a circle centered at $t\theta$ with radius $\sqrt{1-t^2}$; thus $|\mathrm{VS}(\theta,t)|_1 = 2\pi\sqrt{1-t^2}$. Given $f\in L^1(\mathbb{S}^2)$, the vertical slice transform $\mathcal{V}$ of $f$, denoted $\mathcal{V}f$, is an $L^1$ function defined on $E\times[-1,1]$:
$$\mathcal{V}f(\theta,t) := \frac{1}{|\mathrm{VS}(\theta,t)|_1}\int_{\mathrm{VS}(\theta,t)} f(s)\,d\mathrm{VS}(\theta,t)(s) = \begin{cases} \frac{1}{2\pi\sqrt{1-t^2}}\int_{\mathbb{S}^2} f(s)\,\delta(t-\langle\theta,s\rangle)\,ds & \text{if } t\in(-1,1), \\ f(\theta) & \text{if } t = 1, \\ f(-\theta) & \text{if } t = -1. \end{cases} \quad (58)$$
Its adjoint operator $\mathcal{V}^* : L^1(E\times[-1,1])\to L^1(\mathbb{S}^2)$ has the closed form
$$\mathcal{V}^*g(s) = \int_E g(\theta,\langle\theta,s\rangle)\,dE(\theta). \quad (59)$$
Thus, we can extend the vertical slice transform to any (positive Radon) measure as follows. For each $\mu\in\mathcal{M}^+(\mathbb{S}^2)$, $\mathcal{V}(\mu)$ is defined by
$$\int_{E\times[-1,1]} \psi(\theta,t)\,d\mathcal{V}\mu(\theta,t) = \int_{\mathbb{S}^2} \mathcal{V}^*\psi(s)\,d\mu(s), \quad \forall\psi\in C_0(E\times[-1,1]). \quad (60)$$
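For an empirical spherical measure, a vertical slice is just a linear projection restricted to equatorial directions. The following is a minimal sketch (ours) of one slice $\mathcal{V}(\mu)_\theta = \langle\theta,\cdot\rangle_\#\mu$ with $\theta$ drawn uniformly from $E$.

```python
import numpy as np

rng = np.random.default_rng(5)
s = rng.normal(size=(100, 3))                  # empirical measure on S^2
s /= np.linalg.norm(s, axis=1, keepdims=True)

alpha = rng.uniform(0.0, 2.0 * np.pi)          # uniform azimuth
theta = np.array([np.cos(alpha), np.sin(alpha), 0.0])  # theta on the equator E
slice_support = s @ theta                      # support of V(mu)_theta
print(slice_support.min(), slice_support.max())  # values lie in [-1, 1]
```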
Note that $\mathcal{V}(\mu)$ can be equivalently defined by
$$\int_{E\times[-1,1]} \psi(\theta,t)\,d\mathcal{V}(\mu)(\theta,t) = \int_E\int_{-1}^1 \psi(\theta,t)\,d\mathcal{V}(\mu)_\theta(t)\,d\sigma_E(\theta), \quad (61)$$
where $\sigma_E$ is the uniform probability measure defined on $E$, and $\mathcal{V}(\mu)_\theta = \langle\theta,\cdot\rangle_\#\mu$, for $\theta\in E$, is the disintegration of $\mathcal{V}\mu$ with respect to $\sigma_E$. In the discrete case $\mu = \sum_{i=1}^n p_i\delta_{x_i}$, the above definition becomes $\mathcal{V}(\mu)_\theta = \sum_{i=1}^n p_i\delta_{\langle\theta,x_i\rangle}$.

Given $\mu,\nu\in\mathcal{P}(\mathbb{S}^2)$, the vertical sliced Wasserstein distance is defined by
$$VSW_p^p(\mu,\nu) := \int_E W_p^p(\mathcal{V}(\mu)_\theta, \mathcal{V}(\nu)_\theta)\,d\sigma_E(\theta) \quad (62)$$
$$\approx \frac{1}{N}\sum_{i=1}^N W_p^p(\langle\theta_i,\cdot\rangle_\#\mu,\ \langle\theta_i,\cdot\rangle_\#\nu), \quad (63)$$
where (63) is the Monte Carlo approximation and $\theta_1,\dots,\theta_N$ are uniformly selected from $E$. By (Quellmalz et al., 2023, Theorem 3.7), the vertical slice transform (60) (or (61)) is injective. Thus, the above VSW problem defines a metric.

Relation to the Classical Radon Transform. We refer to (5) and (24) for the classical Radon transform of a measure $\mu\in\mathcal{P}(\mathbb{R}^3)$. In addition, we have $\mathcal{R}(\mu)_\theta = \mathcal{V}(\mu)_\theta = \langle\theta,\cdot\rangle_\#\mu$ for $\theta\in E\subset\mathbb{S}^2$; that is, the restricted transformed measures of $\mu$ under $\mathcal{R}$ and $\mathcal{V}$ coincide when $\theta\in E$. Recall that the sliced Wasserstein (SW) distance is defined by
$$SW_p^p(\mu,\nu) := \int_{\mathbb{S}^2} W_p^p(\mathcal{R}(\mu)_\theta, \mathcal{R}(\nu)_\theta)\,d\sigma_{\mathbb{S}^2}(\theta) \quad (64)$$
$$\approx \frac{1}{N}\sum_{i=1}^N W_p^p(\langle\theta_i,\cdot\rangle_\#\mu,\ \langle\theta_i,\cdot\rangle_\#\nu), \quad (65)$$
where (65) is the Monte Carlo approximation of the SW problem and $\theta_1,\dots,\theta_N$ are uniformly sampled from $\mathbb{S}^2$. Comparing (63) and (65), we observe that when $\theta_1,\dots,\theta_N$ are sampled from $E$, (63) and (65) coincide.

Generalization of the Vertical Sliced Wasserstein Distance to $\mathbb{S}^d$. It is natural to extend the vertical slice transform given by (58), (60), (61) to $\mathbb{S}^d$ for $d\ge 2$. Let $E = \{s\in\mathbb{S}^d : s_{d+1} = 0\}$ denote the equator. Choosing $(\theta,t)\in E\times[-1,1]$, the $(d-1)$-dimensional vertical sphere $\mathrm{VS}(\theta,t)$ is defined as $\mathrm{VS}(\theta,t) := \{s\in\mathbb{S}^d : \langle\theta,s\rangle = t\}$. Choosing $f\in L^1(\mathbb{S}^d)$, $\mathcal{V}f$ is defined by
$$\mathcal{V}f(\theta,t) := \frac{1}{|\mathrm{VS}(\theta,t)|_{d-1}}\int_{\mathrm{VS}(\theta,t)} f(s)\,d\mathrm{VS}(\theta,t) = \frac{1}{|\mathrm{VS}(\theta,t)|_{d-1}}\int_{\mathbb{S}^d} f(s)\,\delta(\langle\theta,s\rangle - t)\,ds. \quad (66)$$
It is straightforward to verify that $\mathcal{V}f\in L^1(E\times[-1,1])$. In addition, its adjoint operator $\mathcal{V}^*$ becomes
$$\mathcal{V}^*g(s) = \frac{1}{|E|_{d-1}}\int_E g(\theta,\langle\theta,s\rangle)\,d\theta, \quad g\in L^1(E\times[-1,1]). \quad (67)$$
Similarly, for each $\mu\in\mathcal{M}(\mathbb{S}^d)$, the vertically transformed measure $\mathcal{V}\mu$ can be defined by (60), or equivalently,
$$\int_{E\times[-1,1]} g(\theta,t)\,d\mathcal{V}\mu(\theta,t) = \int_E\int_{[-1,1]} g(\theta,t)\,d\mathcal{V}(\mu)_\theta(t)\,d\sigma_E(\theta), \quad (68)$$
where $\mathcal{V}(\mu)_\theta = \langle\theta,\cdot\rangle_\#\mu$ is the disintegration of $\mathcal{V}\mu$ with respect to $\sigma_E$. Thus, the vertical sliced Wasserstein distance (62), (63) extends to the space $\mathbb{S}^d$. We provide additional results in Appendix I.2.

G.2. Semicircle Transform and Spherical Sliced Wasserstein Distance

We first introduce the semicircle transform and then discuss its relation to the spherical sliced Wasserstein (SSW) distance. The semicircle can be defined by the following three equivalent formulations.

Formulation 1 (Quellmalz et al., 2023): In $\mathbb{S}^2$, we define
$$\Psi(\alpha,\beta,\gamma) := \begin{bmatrix}\cos\alpha & -\sin\alpha & 0\\ \sin\alpha & \cos\alpha & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}\cos\beta & 0 & \sin\beta\\ 0 & 1 & 0\\ -\sin\beta & 0 & \cos\beta\end{bmatrix}\begin{bmatrix}\cos\gamma & -\sin\gamma & 0\\ \sin\gamma & \cos\gamma & 0\\ 0 & 0 & 1\end{bmatrix}.$$
Choosing $s' = \Phi(\alpha',\beta')$ (defined in (55)) and $\xi\in[0,2\pi)$, the semicircle is defined by
$$\mathrm{SC}(s',\xi) = \{s\in\mathbb{S}^2 : \mathrm{azi}(\Psi(\alpha',\beta',0)\,s) = \xi\}\cup\{\pm\Phi(\alpha',\beta')\}. \quad (70)$$

Formulation 2 (Groemer, 1998): In $\mathbb{S}^d$, where $d\ge 2$, given $u,v\in\mathbb{S}^d$ with $u\perp v$, the semicircle is defined by
$$\mathrm{SC}(u,v) := \{s\in\mathbb{S}^d : s\perp u,\ \langle s,v\rangle\ge 0\}. \quad (71)$$

Formulation 3 (Bonet et al., 2023c): In $\mathbb{S}^d$, choose $U\in V_2(d+1) := \{U\in\mathbb{R}^{(d+1)\times 2} : U^TU = I_2\}$, and let $U^\perp$ denote the orthogonal complement of $\mathrm{Range}(U)$. In addition, $U$ determines a great circle:
$$S_U := \mathrm{span}(U)\cap\mathbb{S}^d = \{s\in\mathbb{S}^d : UU^Ts = s\}.$$
The projection from $\mathbb{S}^d$ to $S_U$ admits the following closed form:
$$P_U(s) = \operatorname*{arg\,min}_{s'\in S_U} d_{\mathbb{S}^d}(s,s') = \frac{UU^Ts}{\|U^Ts\|}, \quad s\in\mathbb{S}^d\setminus U^\perp. \quad (72)$$
The great circle $S_U$ can be regarded as a copy of $\mathbb{S}^1$, and we can use $t\in[0,2\pi)$ to represent each point of $S_U$. Thus, the above projection mapping can be rewritten as $s\mapsto\frac{U^Ts}{\|U^Ts\|}$. This formulation induces the semicircle
$$\mathrm{SC}(U,t) = \left\{s\in\mathbb{S}^d\setminus U^\perp : \frac{U^Ts}{\|U^Ts\|} = t\right\}. \quad (73)$$
When $d = 2$, the above three formulations of the semicircle are equivalent. When $d\ge 2$, the second and third formulations are equivalent. For convenience, we select Formulation 3.

Semicircle Transform. Given $f\in L^1(\mathbb{S}^d)$, we now introduce the semicircle transform of $f$. Note that in (Quellmalz et al., 2023; Hielscher et al., 2018) this transform is called the unnormalized semicircle transform; in (Bonet et al., 2023c), this transform is called the spherical Radon transform; and in (Groemer, 1998), it is called the hemispherical transform:
$$\mathcal{SC}(f)(U,t) := \int_{\mathrm{SC}(U,t)} f(s)\,d\mathrm{SC}(U,t). \quad (74)$$
Similar to the Radon transform (5) and the vertical slice transform (60), the above definition can be extended to $\mathcal{M}(\mathbb{S}^d)$. The corresponding Wasserstein problem is the spherical sliced Wasserstein problem (Bonet et al., 2023a):
$$SSW_p^p(\mu,\nu) := \int_{V_2(d+1)} W_p^p(\mathcal{SC}(\mu)_U, \mathcal{SC}(\nu)_U)\,d\sigma_{V_2(d+1)}(U), \quad (75)$$
where $\mathcal{SC}(\mu)_U$ is the disintegration of $\mathcal{SC}(\mu)$ with respect to the uniform measure $\sigma_{V_2(d+1)}$. In addition, $\sigma_{V_2(d+1)}$-a.s., we have $\mathcal{SC}(\mu)_U = \left(\frac{U^T\cdot}{\|U^T\cdot\|}\right)_\#\mu$.

H. Distance Distortion

In $\mathbb{R}^d$ we consider the Euclidean distance, and on $\mathbb{S}^d$ the great-circle distance $d_{\mathbb{S}^d}(\cdot,\cdot)$. While in Euclidean spaces the distance between two points is the length of the straight segment between them, on the sphere we measure the distance between two points as the length of the shortest path that lies on the sphere and connects them. In particular, for two angles $\theta_1,\theta_2\in[0,2\pi)$ on the unit circle,
$$d_{\mathbb{S}^1}(\theta_1,\theta_2) = \min\{|\theta_1-\theta_2|,\ 2\pi-|\theta_1-\theta_2|\}.$$
In general, the distance between $s_1,s_2\in\mathbb{S}^d$ given by the arclength of the shortest path can be expressed as $d_{\mathbb{S}^d}(s_1,s_2) := \arccos(\langle s_1,s_2\rangle)$.

We point out that the stereographic projection can introduce significant distortion. For instance, as $d_{\mathbb{S}^d}(s_1,s_2)\to 0$, we might have $\|\phi(s_1)-\phi(s_2)\|\to\infty$. See Figure 10 for a visualization.

Figure 10. Illustration of the distortion phenomenon in the one-dimensional case ($d = 1$) resulting from applying the stereographic projection with formula (1). As the points $s_1$ (blue) and $s_2$ (red) are brought closer on the sphere (circle), their respective stereographic projections, depicted as crosses, diverge from each other.

In this section, we aim to construct an injective function $h : \mathbb{R}^d\to\mathbb{R}^d$ such that the following holds:
$$\|h(\phi(s_1)) - h(\phi(s_2))\| \approx d_{\mathbb{S}^d}(s_1,s_2).$$
We define $h_1 : \mathbb{R}^d\to\mathbb{R}^d$ by
$$h_1(x) := \arccos(-s_{d+1})\,\frac{x}{\|x\|}, \quad \text{where } s_{d+1} = \frac{\|x\|^2-1}{\|x\|^2+1}. \quad (76)$$

Proposition H.1 Let $s, s_1, s_2\in\mathbb{S}^d\setminus\{s_n\}$, and let $s_0 = [0,\dots,0,-1]^T\in\mathbb{S}^d$ denote the South Pole. Consider the stereographic projection $\phi : \mathbb{S}^d\setminus\{s_n\}\to\mathbb{R}^d$ as in (46) with inverse (47), and also consider the function $h_1 : \mathbb{R}^d\to\mathbb{R}^d$ defined by (76).
(1) $h_1(\phi(s)) = \angle(s,s_0)\,\frac{s[1:d]}{\|s[1:d]\|}$, and thus $\|h_1(\phi(s_1)) - h_1(\phi(s_2))\|\le 2\pi$, where $\angle(s,s_0) = \arccos(\langle s,s_0\rangle)\in[0,\pi]$ denotes the angle between $s$ and $s_0$.
(2) If $s_1, s_2, s_0$ lie on the same great circle, denoted $S$, and $d_S$ denotes the circle distance in $S$, then
$$d_{\mathbb{S}^d}(s_1,s_2) = d_S(s_1,s_2) = \min\{\|h_1(\phi(s_1))-h_1(\phi(s_2))\|,\ 2\pi-\|h_1(\phi(s_1))-h_1(\phi(s_2))\|\}. \quad (77)$$

Proof:
(1) Note that
$$h_1(\phi(s)) = \arccos(-s_{d+1})\,\frac{\phi(s)}{\|\phi(s)\|} = \arccos(\langle s,s_0\rangle)\,\frac{s[1:d]}{\|s[1:d]\|} = \angle(s,s_0)\,\underbrace{\frac{s[1:d]}{\|s[1:d]\|}}_{\text{unit norm}}, \quad s\in\mathbb{S}^d\setminus\{s_n\}.$$
Thus, we have
$$\|h_1(\phi(s_1))-h_1(\phi(s_2))\|\le\angle(s_1,s_0)+\angle(s_2,s_0)\le\pi+\pi = 2\pi.$$
(2) Without loss of generality, we suppose the great circle $S$ (meridian) is of the form
$$S = \left\{s = [\cos\theta, 0,\dots,0,\sin\theta] : \theta\in\left(-\tfrac{3\pi}{2},\tfrac{\pi}{2}\right]\right\}.$$
Thus, $S$ passes through $s_0 = [0,\dots,0,-1] = [\cos(-\frac{\pi}{2}),0,\dots,0,\sin(-\frac{\pi}{2})]$, and we take $s_1 = [\cos\theta_1,0,\dots,0,\sin\theta_1]$ and $s_2 = [\cos\theta_2,0,\dots,0,\sin\theta_2]$ in $S\setminus\{s_n\}$ with $\theta_1,\theta_2\in(-\frac{3\pi}{2},\frac{\pi}{2})$. It follows that
$$h_1(\phi(s_1)) = \angle(s_1,s_0)\,\frac{s_1[1:d]}{\|s_1[1:d]\|} = \left(\theta_1+\tfrac{\pi}{2}\right)[1,0,\dots,0],$$
and similarly $h_1(\phi(s_2)) = (\theta_2+\frac{\pi}{2})[1,0,\dots,0]$. Thus, the third term in (77) becomes
$$\min\{\|h_1(\phi(s_1))-h_1(\phi(s_2))\|,\ 2\pi-\|h_1(\phi(s_1))-h_1(\phi(s_2))\|\} = \min\{|\theta_1-\theta_2|,\ 2\pi-|\theta_1-\theta_2|\}.$$
For the first term in (77), we have
$$d_{\mathbb{S}^d}(s_1,s_2) = \arccos(\cos\theta_1\cos\theta_2+\sin\theta_1\sin\theta_2) = \arccos(\cos(\theta_1-\theta_2)) = \min\{|\theta_1-\theta_2|,\ 2\pi-|\theta_1-\theta_2|\}.$$
For the second term, we have $d_S(s_1,s_2) = \min\{|\theta_1-\theta_2|,\ 2\pi-|\theta_1-\theta_2|\}$, and we are done. $\square$
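The following is a minimal sketch (ours) of the composition $h_1\circ\phi$ and a numerical check of (77) on a meridian through the poles; the two printed values coincide.

```python
import numpy as np

def h1_phi(s):
    """h1 o phi from Proposition H.1 (1): s (shape (n, d+1)) is mapped to
    angle(s, s0) * s[:, :d] / ||s[:, :d]||, with s0 the South Pole."""
    angle = np.arccos(np.clip(-s[:, -1], -1.0, 1.0))   # angle to s0
    head = s[:, :-1]
    return angle[:, None] * head / np.linalg.norm(head, axis=1, keepdims=True)

# Two points on the meridian through the poles in the x-z plane of S^2.
a1, a2 = 0.3, 2.1
s1 = np.array([[np.cos(a1), 0.0, np.sin(a1)]])
s2 = np.array([[np.cos(a2), 0.0, np.sin(a2)]])
d12 = np.arccos(np.clip(np.sum(s1 * s2), -1.0, 1.0))   # great-circle distance
gap = np.linalg.norm(h1_phi(s1) - h1_phi(s2))
print(d12, min(gap, 2.0 * np.pi - gap))                # both equal 1.8
```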
Remark H.2 As a corollary of part (1) in Proposition H.1, we notice that $h_1$ preserves angles with vertex at the origin in the projected space (see the angle $\beta$ in Figure 13). In addition, when $d = 2$, we observe that $h_1\circ\phi$ is similar to the Lambert azimuthal equal-area projection (Bradawl, 2003):
$$LP(s) = \sqrt{\frac{2}{1-s[3]}}\,s[1:2]. \quad (78)$$
The function $LP$ projects points of $\mathbb{S}^2$ into $\mathbb{R}^2$ while preserving areas on $\mathbb{S}^2$, i.e., $|A|_2 = |LP(A)|_2$, whereas $h_1\circ\phi$ aims to preserve distances. In addition, if we write $s = [\cos\alpha\cos\beta,\ \sin\alpha\cos\beta,\ \sin\beta]$, then
$$h_1(\phi(s)) = \angle(s,s_0)\,\frac{s[1:2]}{\|s[1:2]\|} = \left(\beta+\tfrac{\pi}{2}\right)[\cos\alpha,\ \sin\alpha],$$
which is exactly the azimuthal equidistant projection (Snyder, 1997) of $s$ centered at the South Pole $[0,0,-1]$. Thus, $h_1\circ\phi$ and the azimuthal equidistant projection are equivalent.

Conjecture H.3 Let $s_1, s_2\in\mathbb{S}^d\setminus\{s_n\}$, and let $s_0 = [0,\dots,0,-1]^T\in\mathbb{S}^d$ denote the South Pole. Consider the stereographic projection $\phi : \mathbb{S}^d\setminus\{s_n\}\to\mathbb{R}^d$ as in (46) with inverse (47), and also consider the function $h_1 : \mathbb{R}^d\to\mathbb{R}^d$ defined by (76). Let $O(d+1)$ denote the set of all $(d+1)\times(d+1)$ orthonormal matrices. Then
$$d_{\mathbb{S}^d}(s_1,s_2) = \min_{R\in O(d+1)}\min\{\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|,\ 2\pi-\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|\} = \min_{R\in O(d+1)}\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|. \quad (80)$$
Numerically, we have verified that this statement holds (see Figure 12). We leave the theoretical proof for a future study. Here we provide the intermediate results Proposition H.4 and Proposition H.10.

Proposition H.4 Let $s_1, s_2\in\mathbb{S}^d\setminus\{s_n\}$, and let $s_0 = [0,\dots,0,-1]^T\in\mathbb{S}^d$ denote the South Pole. Consider $\phi$ as in (46) with inverse (47), and $h_1$ as in (76). There exists $R^*\in SO(d+1)$ such that
$$d_{\mathbb{S}^d}(s_1,s_2) = \min\{\|h_1(\phi(R^*s_1))-h_1(\phi(R^*s_2))\|,\ 2\pi-\|h_1(\phi(R^*s_1))-h_1(\phi(R^*s_2))\|\}$$
$$\ge \min_{R\in O(d+1)}\min\{\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|,\ 2\pi-\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|\}.$$

Proof: Choose two distinct points $s_1, s_2\in\mathbb{S}^d$. Let $\zeta$ denote the shortest path from $s_1$ to $s_2$. Then $\zeta$ can be parametrized as a function: there exists a differentiable mapping $\zeta : [0,1]\to\mathbb{S}^d$ with $\zeta(0) = s_1$, $\zeta(1) = s_2$, and $\|\zeta'(t)\|$ constant, whose range is exactly the shortest path. By the definition of the great-circle distance $d_{\mathbb{S}^d}$, it is exactly the length of $\zeta$; that is,
$$d_{\mathbb{S}^d}(s_1,s_2) = |\zeta| = \int_0^1 \|\zeta'(t)\|\,dt.$$
Furthermore, $\zeta$ lies on a great circle, denoted $S$.
Let $s_0^* = \zeta(t_0)$ denote the midpoint between $s_1$ and $s_2$ (for the appropriate $t_0\in(0,1)$). By the Gram–Schmidt algorithm, there exists $R_0\in SO(d+1)$ such that $R_0s_0^* = s_0$ (for example, set the $(d+1)$-th row to $-s_0^*$, complete it to an orthonormal set of rows, and adjust a sign so that $\det R_0 = 1$). It holds that $d_{\mathbb{S}^d}(s_1,s_2) = d_{\mathbb{S}^d}(R_0s_1,R_0s_2)$, i.e., $d_{\mathbb{S}^d}$ is invariant under rotations and reflections. Besides, note that $R_0s_1, R_0s_2, R_0s_0^*$ lie on the same great circle. Thus, by using that $d_{\mathbb{S}^d}$ is invariant under orthogonal transformations and by using part (2) of Proposition H.1, we have
$$d_{\mathbb{S}^d}(s_1,s_2) = d_{\mathbb{S}^d}(R_0s_1,R_0s_2) = \min\{\|h_1(\phi(R_0s_1))-h_1(\phi(R_0s_2))\|,\ 2\pi-\|h_1(\phi(R_0s_1))-h_1(\phi(R_0s_2))\|\} = \|h_1(\phi(R_0s_1))-h_1(\phi(R_0s_2))\|,$$
where the last equality holds because $s_0 = R_0s_0^*$ is a point between $R_0s_1$ and $R_0s_2$. In particular,
$$\min_{R\in O(d+1)}\min\{\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|,\ 2\pi-\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|\}\le d_{\mathbb{S}^d}(s_1,s_2).$$
For the other direction, we first recall that $d_{\mathbb{S}^d}$ is invariant under orthogonal transformations: $d_{\mathbb{S}^d}(s_1,s_2) = d_{\mathbb{S}^d}(Rs_1,Rs_2)$ for all $R\in O(d+1)$. In particular, every $R\in O(d+1)$ is a minimizer of $\inf_{R\in O(d+1)} d_{\mathbb{S}^d}(Rs_1,Rs_2)$. By using part (2) of Proposition H.1, for all $R$ such that $Rs_1$, $Rs_2$, and $s_0$ lie on the same great circle, it holds that
$$d_{\mathbb{S}^d}(s_1,s_2) = d_{\mathbb{S}^d}(Rs_1,Rs_2) = \min\{\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|,\ 2\pi-\|h_1(\phi(Rs_1))-h_1(\phi(Rs_2))\|\}. \quad (81)$$
However, we want to minimize over all possible orthogonal matrices. At this point, the proof is more involved, and to prove this part we will need a series of auxiliary lemmas.

Lemma H.5 Given $s_1, s_2\in\mathbb{S}^d$ with $(s_1)_{d+1} = (s_2)_{d+1}$ (i.e., $s_1$ and $s_2$ are at the same latitude), then
$$d_{\mathbb{S}^d}(s_1,s_2)\le\min\{2\pi-\|h_1(\phi(s_1))-h_1(\phi(s_2))\|,\ \|h_1(\phi(s_1))-h_1(\phi(s_2))\|\}.$$

Proof: Suppose $(s_1)_{d+1} = (s_2)_{d+1} = \sin\alpha$ for some $\alpha\in[-\frac{\pi}{2},\frac{\pi}{2})$. Let $\beta = \angle(s_1[1:d],s_2[1:d])\in[0,\pi]$. We have
$$d_{\mathbb{S}^d}(s_1,s_2) = \arccos(\langle s_1[1:d],s_2[1:d]\rangle + (s_1)_{d+1}(s_2)_{d+1}) = \arccos\left(\cos^2(\alpha)\cos(\beta)+\sin^2(\alpha)\right),$$
and, by the law of cosines together with $\|h_1(\phi(s_1))\| = \|h_1(\phi(s_2))\| = \alpha+\frac{\pi}{2}$,
$$\|h_1(\phi(s_1))-h_1(\phi(s_2))\| = \sqrt{\|h_1(\phi(s_1))\|^2+\|h_1(\phi(s_2))\|^2-2\|h_1(\phi(s_1))\|\|h_1(\phi(s_2))\|\cos\beta} = 2\left(\alpha+\tfrac{\pi}{2}\right)\sin\tfrac{\beta}{2}.$$
Define
$$L(\alpha,\beta) = \|h_1(\phi(s_1))-h_1(\phi(s_2))\| - d_{\mathbb{S}^d}(s_1,s_2) = 2\left(\alpha+\tfrac{\pi}{2}\right)\sin\tfrac{\beta}{2} - \arccos\left(\cos^2(\alpha)\cos(\beta)+\sin^2(\alpha)\right). \quad (82)$$
In what follows, we show that $L(\alpha,\beta)\ge 0$ for $\alpha\in[-\frac{\pi}{2},\frac{\pi}{2})$, $\beta\in[0,\pi]$. First, if $\alpha = -\frac{\pi}{2}$ or $\beta = 0$, we have $L = 0$. It remains to consider the case $\alpha\in(-\frac{\pi}{2},\frac{\pi}{2})$, $\beta\in(0,\pi]$. The procedure is the following: we fix $\alpha\in(-\frac{\pi}{2},\frac{\pi}{2})$ and consider the function $L(\alpha,\cdot)$ of the variable $\beta$. We have
$$\frac{d}{d\beta}L = \left(\alpha+\tfrac{\pi}{2}\right)\cos\tfrac{\beta}{2} - \frac{\cos^2(\alpha)\sin(\beta)}{\sqrt{1-(\cos^2(\alpha)\cos(\beta)+\sin^2(\alpha))^2}}. \quad (83)$$
Setting $\frac{d}{d\beta}L = 0$, we obtain
$$\left(\alpha+\tfrac{\pi}{2}\right)^2\cos^2\tfrac{\beta}{2} = \frac{\cos^4(\alpha)\sin^2(\beta)}{1-(\cos^2(\alpha)\cos(\beta)+\sin^2(\alpha))^2}. \quad (84)$$
Let $A = \cos(\beta) = 2\cos^2\frac{\beta}{2}-1 = 1-2\sin^2\frac{\beta}{2}$, $B = \cos^2(\alpha) = \sin^2(\alpha') = 1-\sin^2(\alpha)$, and $C = (\alpha+\frac{\pi}{2})^2 = (\alpha')^2$, where $\alpha' = \alpha+\frac{\pi}{2}\in(0,\pi)$. The above equation (84) becomes
$$\frac{1}{2}C(A+1) = \frac{B^2(1-A^2)}{1-(AB+1-B)^2}.$$
Since $1-(AB+1-B)^2 = B(1-A)(2-B+AB)$, multiplying out and rearranging yields the quadratic
$$(-CB)A^2 + 2(BC-C+B)A + (2C-2B-BC) = 0. \quad (85)$$
Denote by $a = -CB$, $b = 2(BC-C+B)$, $c = 2C-2B-BC$ the coefficients of the quadratic (85) in the variable $A$. We have $\Delta = b^2-4ac = 4(B-C)^2\ge 0$, and thus the roots are real. Consider the following cases:
Case 1: $a = 0$. Then $B = 0$ or $C = 0$, and so $\alpha = -\frac{\pi}{2}$ (recall $\alpha\in[-\frac{\pi}{2},\frac{\pi}{2})$), which was already analyzed.
Case 2: $a\neq 0$. Since $B, C > 0$, we have $a < 0$.
Besides, the two roots $A_1, A_2$ of (85) are given by
$$A_1 = \frac{BC+2(B-C)}{BC} \quad \text{and} \quad A_2 = 1.$$
Since $A = \cos(\beta)\in(-1,1)$ for $\beta\in(0,\pi)$, once we fix $\alpha$, there exists at most one $\beta^*$ such that $\frac{d}{d\beta}L(\alpha,\beta^*) = 0$. In fact, such a $\beta^*$ exists if and only if $-1 < A_1 < 1$. In addition, we have
$$\frac{d}{d\beta}L(\alpha,0) = \lim_{\beta\to 0^+}\frac{d}{d\beta}L = \left(\alpha+\tfrac{\pi}{2}\right)-\cos(\alpha) > 0,$$
since $\cos(\alpha) = \sin(\alpha+\frac{\pi}{2})$ and $\sin(x) < x$ for $x = \alpha+\frac{\pi}{2}\in(0,\pi)$. Moreover, $L(\alpha,0) = 0$ and
$$L(\alpha,\pi) = \begin{cases} 0 & \text{if } \alpha\in(-\frac{\pi}{2},0], \\ 4\alpha & \text{if } \alpha\in[0,\frac{\pi}{2}), \end{cases}$$
so $L(\alpha,\pi)\ge 0$. We have the following:
If $\cos(\beta^*) = A_1\in(-1,1)$ for some $\beta^*\in(0,\pi)$, then $\frac{d}{d\beta}L(\alpha,\beta)\ge 0$ for $\beta\in[0,\beta^*]$; thus $L(\alpha,\cdot)$ is an increasing function on $[0,\beta^*]$. Since $L(\alpha,0) = 0$, we have $L(\alpha,\beta)\ge 0$ for all $\beta\in[0,\beta^*]$. On $(\beta^*,\pi]$, $\frac{dL}{d\beta}\neq 0$, and by the intermediate value theorem we have $\frac{dL}{d\beta} < 0$; thus, $L$ is a decreasing function here. Since $L(\alpha,\pi)\ge 0$, we have $L(\alpha,\beta)\ge 0$ on $(\beta^*,\pi]$.
If $A_1\notin(-1,1)$, there is no $\beta\in(0,\pi)$ such that $\frac{d}{d\beta}L(\alpha,\beta) = 0$. Combining this with the fact $\frac{d}{d\beta}L(\alpha,0) > 0$, we have $\frac{d}{d\beta}L(\alpha,\beta) > 0$ on $(0,\pi)$. Thus, $L$ is an increasing function here, and thus $L(\alpha,\beta)\ge 0$.
Therefore, $L\ge 0$. Through a similar process, one can show that
$$2\pi-\|h_1(\phi(s_1))-h_1(\phi(s_2))\| - d_{\mathbb{S}^d}(s_1,s_2)\ge 0,$$
and thus we complete the proof. $\square$

Lemma H.6 In 3D, the function $h_1\circ\phi : \mathbb{S}^2\setminus\{s_n\}\to\mathbb{R}^2$, where $\phi$ is defined by (46) and $h_1$ by (14) with $d = 2$, has range contained in the open disk of radius $\pi$ centered at the origin. Moreover, $h_1\circ\phi$ maps circles parallel to the Equator to circles in the plane centered at the origin. See Figure 11 for a visualization. This generalizes to dimensions $d > 2$, where we consider the function $h_1\circ\phi : \mathbb{S}^d\setminus\{s_n\}\to\mathbb{R}^d$, and in the preceding statement we replace the word "circle" by "sphere". The Equator in $\mathbb{S}^d$ is the sub-sphere obtained by intersecting with the hyperplane in $\mathbb{R}^{d+1}$ of equation $x_{d+1} = 0$.

Proof: This is a consequence of part (1) in Proposition H.1, that is, $h_1(\phi(s)) = \angle(s,s_0)\frac{s[1:d]}{\|s[1:d]\|}$ for every $s\in\mathbb{S}^d\setminus\{s_n\}$, and the fact that the stereographic projection $\phi$ with formula given by (46) maps circles (resp., sub-spheres of dimension $d-1$ in $\mathbb{R}^{d+1}$) parallel to the Equator to circles (resp., spheres of dimension $d-1$) in $\mathbb{R}^2$ (resp., $\mathbb{R}^d$); see Figure 8. Given a circle (sub-sphere) $S_\alpha$ which is the intersection of the sphere $\mathbb{S}^d$ with the (hyper)plane $x_{d+1} = \sin(\alpha)$ for some $\alpha\in[-\pi/2,\pi/2)$ (and so $-1\le\sin(\alpha) < 1$), i.e.,
$$S_\alpha = \mathbb{S}^d\cap\{x\in\mathbb{R}^{d+1} : x_{d+1} = \sin(\alpha)\}, \quad (86)$$
we have that $\angle(s,s_0) = \angle(s',s_0) = \alpha+\frac{\pi}{2}$ for all $s, s'\in S_\alpha$. Thus, if for example we consider the representative $s = [0,\dots,0,\cos(\alpha),\sin(\alpha)]$ in $S_\alpha$, we have $\|h_1(\phi(s))\| = \angle(s,s_0) = \alpha+\frac{\pi}{2}$. In general,
$$h_1(\phi(s')) = \left(\alpha+\tfrac{\pi}{2}\right)\frac{s'[1:d]}{\|s'[1:d]\|}, \quad \forall s'\in S_\alpha.$$
Since the points of the form $\frac{s'[1:d]}{\|s'[1:d]\|}$ belong to the unit circle (sphere) in $\mathbb{R}^d$, the function $h_1\circ\phi$ maps the circle $S_\alpha$ (which is parallel to the Equator at height $\sin(\alpha)$) to the circle centered at the origin with radius $\alpha+\frac{\pi}{2}$:
$$h_1(\phi(S_\alpha)) = \left\{x\in\mathbb{R}^d : \|x\| = \alpha+\tfrac{\pi}{2}\right\}.$$
In particular, we have the following. The Equator corresponds to $\alpha = 0$, and so $h_1(\phi(\text{Equator})) = \{x\in\mathbb{R}^d : \sqrt{x_1^2+\dots+x_d^2} = \frac{\pi}{2}\}$. The South Pole corresponds to $\alpha = -\pi/2$, i.e., $s_0 = [0,\dots,0,-1] = [0,\dots,0,\cos(-\pi/2),\sin(-\pi/2)]$, and so $h_1(\phi(s_0)) = [0,\dots,0]\in\mathbb{R}^d$.

Figure 11. Depiction of the map $h_1\circ\phi : \mathbb{S}^2\setminus\{s_n\}\to\mathbb{R}^2$. The domain is depicted on the left. Its image, on the right, is inside the open disk of radius $\pi$ centered at the origin. The Equator (black circle on the left) goes to the circle of radius $\pi/2$ (black circle on the right). The South Pole (big black dot on the left) goes to the origin of coordinates of the plane on the right (black dot).
The dotted blue parallel on the left, at negative height (in the Southern Hemisphere), goes to the small blue dotted circle on the right. The dotted red parallel on the left, at positive height (in the Northern Hemisphere), goes to the big red dotted circle on the right. We recall that parallels are circles on the sphere that are parallel to the Equator.

For small numbers $\epsilon > 0$, the circle parallel to the Equator at height $\sin(\pi/2 - \epsilon)$, i.e., $S_{\pi/2 - \epsilon} = \mathbb{S}^d \cap \{x \in \mathbb{R}^{d+1} : x_{d+1} = \sin(\pi/2 - \epsilon)\}$, satisfies
$$h_1(\phi(S_{\pi/2 - \epsilon})) = \left\{x \in \mathbb{R}^d : \sqrt{x_1^2 + \cdots + x_d^2} = \pi - \epsilon\right\}.$$
Thus, as $\epsilon \to 0$, we obtain that the range of $h_1 \circ \phi$ is contained in the open ball/disk in $\mathbb{R}^d$ of radius $\pi$ centered at the origin. Circles at positive height, i.e., $S_\alpha$ as in (86) with $0 \le \alpha < \pi/2$, are sent through $h_1 \circ \phi$ to spheres/circles in $\mathbb{R}^d$ centered at the origin with radius greater than $\pi/2$ and smaller than $\pi$. Circles at negative height, i.e., $S_\alpha$ as in (86) with $-\pi/2 \le \alpha < 0$, are sent through $h_1 \circ \phi$ to spheres/circles in $\mathbb{R}^d$ centered at the origin with radius between $0$ and $\pi/2$.

Corollary H.7. Given two arbitrary points $s_1, s_2 \in \mathbb{S}^d \setminus \{s_n\}$, we have the following bounds: $d_{\mathbb{S}^d}(s_1, s_2) \le \pi$ and $\|h_1(\phi(s_1)) - h_1(\phi(s_2))\| \le 2\pi$.

Proof: In general, $d_{\mathbb{S}^d}(s_1, s_2)$ is the minimum arc length between $s_1$ and $s_2$ taken among all possible paths on the sphere connecting those points, and so it is at most $\pi$. We recall that $d_{\mathbb{S}^d}(s_1, s_2) = \arccos(\langle s_1, s_2\rangle) \in [0, \pi]$. The other inequality is a consequence of Lemma H.6: the largest Euclidean distance between points in a ball of radius $\pi$ (i.e., of diameter $2\pi$) is, in fact, $2\pi$. □

The above results (identity (81), Lemma H.5, and Corollary H.7) suggest that, given two points $s_1, s_2$ on the sphere, we should have $d_{\mathbb{S}^d}(s_1, s_2) \le \|h_1(\phi(s_1)) - h_1(\phi(s_2))\|$. This is shown experimentally in Figure 12 (a numerical sanity check is sketched below). We will need a few more lemmas in order to sketch a theoretical proof of this fact.

Figure 12. Experimental results for the relation $d_{\mathbb{S}^2}(s_1, s_2) \le d_h(s_1, s_2) := \min\{\|h_1(\phi(s_1)) - h_1(\phi(s_2))\|,\ 2\pi - \|h_1(\phi(s_1)) - h_1(\phi(s_2))\|\}$. We sampled 1000 pairs $(s_1, s_2)$ uniformly at random on the unit sphere in $\mathbb{R}^3$. Additionally, we calculate the correlation between $d_{\mathbb{S}^2}(s_1, s_2)$ and $d_h(s_1, s_2)$, resulting in an average correlation coefficient of 0.83.
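For concreteness, the following minimal Python sketch (ours, not from the released codebase) reproduces the experiment behind Figure 12. It assumes the stereographic projection from the north pole, $\phi(s) = s[1{:}d]/(1 - s_{d+1})$, and the characterization of $h_1$ used in Lemma H.6.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s):
    """Stereographic projection from the north pole s_n = (0, ..., 0, 1)."""
    return s[..., :-1] / (1.0 - s[..., -1:])

def h1(x):
    """Distortion-correcting map; consistent with Lemma H.6 (south pole -> origin)."""
    r2 = np.sum(x**2, axis=-1, keepdims=True)
    return np.arccos((1.0 - r2) / (1.0 + r2)) * x / np.sqrt(r2)

# 1000 uniform random pairs on S^2
s = rng.normal(size=(1000, 2, 3))
s /= np.linalg.norm(s, axis=-1, keepdims=True)
s1, s2 = s[:, 0], s[:, 1]

d_sphere = np.arccos(np.clip(np.sum(s1 * s2, axis=-1), -1.0, 1.0))
diff = np.linalg.norm(h1(phi(s1)) - h1(phi(s2)), axis=-1)
d_h = np.minimum(diff, 2 * np.pi - diff)

print("correlation:", np.corrcoef(d_sphere, d_h)[0, 1])
print("fraction with d_sphere > d_h:", np.mean(d_sphere > d_h + 1e-9))
```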
Lemma H.8. Consider $s_1, s_2 \in \mathbb{S}^d$ with $(s_1)_{d+1} \le (s_2)_{d+1}$, and let $s_2'$ be the unique point such that
$$(s_2')_{d+1} = (s_1)_{d+1} \qquad \text{and} \qquad \angle(s_2'[1{:}d], s_2[1{:}d]) = 0. \quad (87)$$
Let $\beta = \angle(s_1[1{:}d], s_2[1{:}d]) \in [0, \pi]$, and let $C_H$ denote the angle between the line segments $[h_1(\phi(s_1)), h_1(\phi(s_2'))]$ and $[h_1(\phi(s_2')), h_1(\phi(s_2))]$. Then,
$$C_H = \frac{\beta + \pi}{2}. \quad (88)$$

Proof: We suggest that the reader look at Figure 13 for visualization purposes. Consider $C_H$, the angle between the line segments $[h_1(\phi(s_1)), h_1(\phi(s_2'))]$ and $[h_1(\phi(s_2)), h_1(\phi(s_2'))]$ in the projected space. The way we have chosen $s_2'$ guarantees that the segment $[0, h_1(\phi(s_2'))]$ is contained in the segment $[0, h_1(\phi(s_2))]$. As an application of Lemma H.6, the segments $[0, h_1(\phi(s_2'))]$ and $[0, h_1(\phi(s_1))]$ have the same length, since $s_2'$ and $s_1$ are at the same latitude. Then, the triangle in the projected space with vertices the origin of coordinates $O = [0, \dots, 0] \in \mathbb{R}^d$, $h_1(\phi(s_1))$, and $h_1(\phi(s_2'))$ is isosceles. The angle of that triangle having vertex at $O$ is
$$\beta = \angle(s_1[1{:}d], s_2[1{:}d]) = \angle(s_1[1{:}d], s_2'[1{:}d]) = \angle(\phi(s_1), \phi(s_2')) = \angle(h_1(\phi(s_1)), h_1(\phi(s_2'))), \quad (89)$$
and the other two angles are equal. Thus, since $C_H$ is supplementary to one of those two equal angles, and since the internal angles of a triangle sum to $\pi$, we have $\pi = \beta + 2(\pi - C_H)$ or, equivalently, $C_H = (\beta + \pi)/2$. □

Lemma H.9. Consider $s_1, s_2 \in \mathbb{S}^d$ with $(s_1)_{d+1} \le (s_2)_{d+1}$, and let $s_2'$ be the unique point satisfying (87). Let $C \in [0, \pi]$ denote the spherical angle between $\mathrm{arc}(s_1, s_2')$ and $\mathrm{arc}(s_2', s_2)$, and let $C_H$ denote the angle between the line segments $[h_1(\phi(s_1)), h_1(\phi(s_2'))]$ and $[h_1(\phi(s_2')), h_1(\phi(s_2))]$. Then,
$$C \le C_H. \quad (90)$$

Proof: For better visualization and an easier parametrization of the points, we work in 3-dimensional space (see Figure 14). Thus, assuming $d = 2$ and using spherical coordinates, let $\alpha \in [-\pi/2, \pi/2)$, $\beta = \angle(s_1[1{:}d], s_2[1{:}d]) \in [0, \pi]$, and $\alpha' \in [\alpha, \pi/2)$, and, without loss of generality, parametrize
$$s_1 = [\cos(\alpha), 0, \sin(\alpha)], \qquad s_2 = [\cos(\alpha')\cos(\beta), \cos(\alpha')\sin(\beta), \sin(\alpha')], \qquad s_2' = [\cos(\alpha)\cos(\beta), \cos(\alpha)\sin(\beta), \sin(\alpha)]. \quad (91)$$
If $\beta = 0$, then $s_2' = s_1$, $C = C_H = 0$, and there is nothing else to prove. If $\beta > 0$, the angle $C$ is characterized by
$$\cos(C) = \frac{\langle n_{12}, n_{22}\rangle}{\|n_{12}\|\,\|n_{22}\|},$$
where $n_{12}$ is a vector normal (perpendicular) to the plane containing the vectors $\vec{Os_1}$ and $\vec{Os_2'}$, and $n_{22}$ is a vector normal to the plane containing the vectors $\vec{Os_2}$ and $\vec{Os_2'}$, with $O = [0, \dots, 0] \in \mathbb{R}^{d+1}$ the origin of coordinates (orientations chosen so that $C$ is the interior spherical angle). Since $\arccos(\cdot)$ is a decreasing function, by (88) it suffices to show that
$$\cos(C) \ge \cos(C_H). \quad (92)$$
By using the parametrizations in (91), we can compute
$$n_{22} = [-\sin(\beta), \cos(\beta), 0], \qquad \|n_{22}\| = 1,$$
$$n_{12} = [-\sin(\alpha)\sin(\beta), -\sin(\alpha)(1 - \cos(\beta)), \cos(\alpha)\sin(\beta)], \qquad \|n_{12}\| = \sqrt{\sin^2(\beta) + \sin^2(\alpha)(1 - \cos(\beta))^2},$$
$$\langle n_{12}, n_{22}\rangle = \sin(\alpha)(1 - \cos(\beta)), \qquad \cos(C) = \frac{\sin(\alpha)(1 - \cos(\beta))}{\sqrt{\sin^2(\beta) + \sin^2(\alpha)(1 - \cos(\beta))^2}}. \quad (93)$$
Notice that in the special case where $\alpha \in [0, \pi/2)$, that is, $(s_1)_{d+1} \ge 0$, so that both $s_1, s_2$ are in the same positive hemisphere (as in Figure 14), we have $\sin(\alpha) \ge 0$, so $\cos(C) \ge 0$, and therefore $0 \le C \le \pi/2$. Thus, in this particular case,
$$C \le \frac{\pi}{2} \le C_H \qquad \text{if } 0 \le \alpha < \frac{\pi}{2}. \quad (94)$$
(See Figure 14 for a visualization of this case.) In general, by using (93) and (88), the inequality (92) reads as
$$\cos(C) = \frac{\sin(\alpha)(1 - \cos(\beta))}{\sqrt{\sin^2(\beta) + \sin^2(\alpha)(1 - \cos(\beta))^2}} \ge \cos\left(\frac{\beta + \pi}{2}\right) = -\sin\left(\frac{\beta}{2}\right).$$
By algebraic manipulations and trigonometric identities, (92) holds if and only if
$$1 + \frac{\sin(\alpha)}{\sqrt{\cos^2(\beta/2) + \sin^2(\alpha)\sin^2(\beta/2)}} \ge 0 \qquad \text{if } 0 < \beta \le \pi,\ -\frac{\pi}{2} \le \alpha < \frac{\pi}{2}. \quad (95)$$
Let us rewrite the inequality in (95) (using that $\sin(\beta/2) \ge 0$ in our range) as
$$\sqrt{\cos^2(\beta/2) + \sin^2(\alpha)\sin^2(\beta/2)} \ge -\sin(\alpha). \quad (96)$$
The parameter $\alpha$ is constrained to the interval $[-\pi/2, \pi/2)$. We have already proved that everything works for $\alpha \ge 0$ (see (94)). Thus, suppose $\alpha < 0$: in this case, $\sin(\alpha) < 0$, so both sides of (96) are positive. Therefore, it is equivalent to show that
$$\sin^2(\alpha) \le \cos^2(\beta/2) + \sin^2(\alpha)\sin^2(\beta/2),$$
or, equivalently, that $\sin^2(\alpha) \le 1$, which is always true. □
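As a quick numerical spot-check (ours, not from the paper's code) of Lemma H.9 under the reconstruction above, the following sketch evaluates $\cos(C)$ from Eq. (93) against $C_H = (\beta + \pi)/2$ from Eq. (88) on random $(\alpha, \beta)$:

```python
import numpy as np

# Spot-check of Lemma H.9 (C <= C_H) in the 3-d parametrization of Eq. (91).
rng = np.random.default_rng(0)
for _ in range(10_000):
    alpha = rng.uniform(-np.pi / 2 + 1e-6, np.pi / 2)
    beta = rng.uniform(1e-6, np.pi)
    cos_C = (np.sin(alpha) * (1 - np.cos(beta))
             / np.sqrt(np.sin(beta) ** 2
                       + np.sin(alpha) ** 2 * (1 - np.cos(beta)) ** 2))
    C = np.arccos(np.clip(cos_C, -1.0, 1.0))   # Eq. (93)
    C_H = (beta + np.pi) / 2                   # Eq. (88)
    assert C <= C_H + 1e-9                     # equality is approached as alpha -> -pi/2
```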
Figure 13. On top, we have the sphere in 3-dimensional space. Three points $s_1, s_2, s_2'$ are considered on the sphere. The point $s_2'$ satisfies (87); that is, $s_2', s_1$ are at the same latitude, and $s_2', s_2$ are on the same meridian (among the points that satisfy these two conditions, we always choose the $s_2'$ which is closer to $s_2$). The corresponding projections $\phi(s_1), \phi(s_2), \phi(s_2')$ are plotted on the 2-dimensional plane. The circle on the sphere parallel to the Equator that passes through $s_1$ and $s_2'$, and its projected circle under $\phi$, are depicted in blue. The circle on the sphere that passes through $s_2$, $s_2'$, and the poles (meridian), and its projection under $\phi$ (which is a line), are depicted in red. Similarly, the circle on the sphere that passes through $s_1$ and the poles (meridian), and its projection under $\phi$ (which is a line), are depicted in green. On the bottom, we sketch the image under $h_1$ of the projected space. The angles $\beta$ given by (89), $C$ (the spherical angle between the arcs $\mathrm{arc}(s_1, s_2')$ and $\mathrm{arc}(s_2', s_2)$), and $C_H$ (defined by the line segments $[\phi(s_1), \phi(s_2')]$ and $[\phi(s_2), \phi(s_2')]$ or, equivalently, by the segments $[h_1(\phi(s_1)), h_1(\phi(s_2'))]$ and $[h_1(\phi(s_2)), h_1(\phi(s_2'))]$) are depicted. When applying the function $h_1$ after $\phi$: it preserves angles with vertex at the origin of the projected space, preserves lines that pass through the origin (red and green lines), and maps circles centered at the origin to circles centered at the origin (blue arc). Roughly speaking, the effect of $h_1$ is to compress the projected plane.

Figure 14. Two arbitrary points $s_1, s_2 \in \mathbb{S}^2 \subset \mathbb{R}^3$ in the positive hemisphere. The point $s_2'$ is plotted such that it has the same latitude as $s_1$ and lies on the same meridian as $s_2$. The angle $C$ is depicted as the angle between the two planes containing the arcs $\mathrm{arc}(s_1, s_2')$ and $\mathrm{arc}(s_2, s_2')$. A spherical triangle connecting the points $s_1, s_2', s_2$ is depicted in black.

Proposition H.10. Let $s_1, s_2 \in \mathbb{S}^d \setminus \{s_n\}$, and let $s_0 = [0, 0, \dots, -1]^T \in \mathbb{S}^d$ denote the South Pole. Consider the stereographic projection $\phi : \mathbb{S}^d \setminus \{s_n\} \to \mathbb{R}^d$ as in (46) with inverse (47), and also consider the function $h_1 : \mathbb{R}^d \to \mathbb{R}^d$ defined by (76). Then
$$d_{\mathbb{S}^d}(s_1, s_2) \le \min_{R \in O(d+1)} \|h_1(\phi(R s_1)) - h_1(\phi(R s_2))\| + \epsilon(s_1, s_2),$$
where $\epsilon(s_1, s_2) \to 0$ as $d_{\mathbb{S}^d}(s_1, s_2) \to 0$.

Proof: Let $s_1, s_2 \in \mathbb{S}^d$. Assume, without loss of generality, that $(s_2)_{d+1} \ge (s_1)_{d+1}$. Define $s_2'$ as in Lemma H.9: that is, such that $(s_2')_{d+1} = (s_1)_{d+1}$ (same latitude) and $s_2', s_2, s_0$ lie on the same great circle (same meridian). Since there are two possible choices for $s_2'$, choose it not only such that $(s_2')_{d+1} = (s_1)_{d+1}$ but also such that $\angle(s_2'[1{:}d], s_2[1{:}d]) = 0$. (This means we always choose the $s_2'$ which is closer to $s_2$.) Let $C$ be the spherical angle between the arcs $\mathrm{arc}(s_1, s_2')$ and $\mathrm{arc}(s_2, s_2')$ given by the smallest portions of the great circles connecting $s_1$ with $s_2'$ and $s_2$ with $s_2'$. We recall that the angle $C$ is measured as the angle between the planes through the origin of coordinates $[0, \dots, 0] \in \mathbb{R}^{d+1}$ containing those two arcs. See Figure 14 for a visualization in $\mathbb{R}^3$. A spherical triangle on the surface of the unit sphere is defined by the great circles connecting three points on the sphere; in our case, consider $s_1, s_2, s_2'$. The spherical law of cosines reads as follows:
$$\cos(d_{\mathbb{S}^d}(s_1, s_2)) = \cos(d_{\mathbb{S}^d}(s_1, s_2'))\cos(d_{\mathbb{S}^d}(s_2, s_2')) + \sin(d_{\mathbb{S}^d}(s_1, s_2'))\sin(d_{\mathbb{S}^d}(s_2, s_2'))\cos(C).$$
As a consequence, by using Taylor expansions,
$$d_{\mathbb{S}^d}(s_1, s_2)^2 = d_{\mathbb{S}^d}(s_1, s_2')^2 + d_{\mathbb{S}^d}(s_2, s_2')^2 - 2\,d_{\mathbb{S}^d}(s_1, s_2')\,d_{\mathbb{S}^d}(s_2, s_2')\cos(C) + O(d_{\mathbb{S}^d}^4(s_1, s_2)) + O(d_{\mathbb{S}^d}^4(s_1, s_2')) + O(d_{\mathbb{S}^d}^4(s_2, s_2')).$$
Thus, for small spherical triangles, i.e., when the lengths $d_{\mathbb{S}^d}(s_1, s_2)$, $d_{\mathbb{S}^d}(s_1, s_2')$, $d_{\mathbb{S}^d}(s_2, s_2')$ are small, we can use the planar approximation
$$d_{\mathbb{S}^d}(s_1, s_2)^2 \approx d_{\mathbb{S}^d}(s_1, s_2')^2 + d_{\mathbb{S}^d}(s_2, s_2')^2 - 2\,d_{\mathbb{S}^d}(s_1, s_2')\,d_{\mathbb{S}^d}(s_2, s_2')\cos(C).$$
Now, as in the above lemmas, consider $C_H$, the angle between the line segments $[h_1(\phi(s_1)), h_1(\phi(s_2'))]$ and $[h_1(\phi(s_2)), h_1(\phi(s_2'))]$ in the projected space.
By Lemma H.5 (and the fact that, along a meridian, $h_1 \circ \phi$ preserves distances by Lemma H.6, so that $d_{\mathbb{S}^d}(s_2, s_2') = \|h_1(\phi(s_2)) - h_1(\phi(s_2'))\|$), and by using $C \le C_H$ together with the facts that $\cos$ is decreasing on $[0, \pi]$ and that $\cos(C_H) \le 0$, we have
$$d_{\mathbb{S}^d}^2(s_1, s_2') + d_{\mathbb{S}^d}^2(s_2, s_2') - 2\,d_{\mathbb{S}^d}(s_1, s_2')\,d_{\mathbb{S}^d}(s_2, s_2')\cos(C)$$
$$\le \|h_1(\phi(s_1)) - h_1(\phi(s_2'))\|^2 + \|h_1(\phi(s_2)) - h_1(\phi(s_2'))\|^2 - 2\,\|h_1(\phi(s_1)) - h_1(\phi(s_2'))\|\,\|h_1(\phi(s_2)) - h_1(\phi(s_2'))\|\cos(C_H) = \|h_1(\phi(s_1)) - h_1(\phi(s_2))\|^2,$$
where in the last step we used the classical law of cosines. Therefore, by using the approximation given by the spherical law of cosines (assuming $d_{\mathbb{S}^d}(s_1, s_2)$ small), we obtain $d_{\mathbb{S}^d}(s_1, s_2) \lesssim \|h_1(\phi(s_1)) - h_1(\phi(s_2))\|$. In particular,
$$d_{\mathbb{S}^d}(s_1, s_2) = d_{\mathbb{S}^d}(R s_1, R s_2) \lesssim \|h_1(\phi(R s_1)) - h_1(\phi(R s_2))\| \qquad \forall R \in O(d+1),$$
and so $d_{\mathbb{S}^d}(s_1, s_2) \lesssim \min_{R \in O(d+1)} \|h_1(\phi(R s_1)) - h_1(\phi(R s_2))\|$. □

H.1. Neural Network-Based Embedding

While any choice of $h$ which maintains injectivity will ensure that S3W$_{H,p}$ remains a valid metric on $\mathcal{P}_p(\mathbb{S}^d)$, the particular choice of $h$ is significant in ensuring that the transportation cost in the embedding space resembles the spherical distance. As discussed in Section 3.3, as an alternative to the proposed analytic function $h_1(\cdot)$ (as given in Eq. (14)), we may consider training a neural network to obtain a nearly-isometric Euclidean embedding and correct for the distance distortion caused by the stereographic projection. Here, we consider the injective function defined by $h_{NN}(x) := [h_1^T(x)/C, \rho^T(x)]^T$, where $\rho : \mathbb{R}^2 \to \mathbb{R}^3$ is a neural network and $C \ge 2\pi$ is a constant. We parameterize $\rho$ using a multi-layer perceptron (MLP) with two hidden layers, each consisting of 128 neurons, and train $\rho$ by minimizing
$$\mathcal{L}(\rho) = \mathbb{E}_{s, s'}\left[\left(\arccos(\langle s, s'\rangle) - \|h(\phi(s)) - h(\phi(s'))\|\right)^2\right], \quad (97)$$
where $s$ and $s'$ are sampled according to the uniform distribution on the sphere $\mathbb{S}^d \subset \mathbb{R}^{d+1}$, i.e., $(s, s') \sim \sigma_{d+1} \otimes \sigma_{d+1}$. As shown in Figure 2, this choice of $h$ can yield a nearly-isometric embedding. (A sketch of this training loop is given below.) We then conduct two different gradient flow experiments to study the distinct behaviors of $h_1(\cdot)$ and $h_{NN}(\cdot)$. The visualizations demonstrating these gradient flows can be found at the GitHub URL above. We observe that the use of $h_{NN}(\cdot)$ provides more geodesic-like paths for the particles as compared to $h_1(\cdot)$, though the final outcomes are comparable. We note that in our experiments, we primarily use only the analytic $h_1$ for its computational efficiency, as opposed to the neural network-based learnable function $h_{NN}$. Specifically, in our experiments, the proposed distances are used in gradient-based optimization frameworks where the distance is computed in each iteration of (stochastic) gradient descent. Utilizing $h_{NN}$ in these optimization applications increases the computational cost of calculating the distance. More precisely, the computational cost would have the following components: 1) the stereographic projection, 2) the evaluation of $h_{NN}$, and 3) the slicing, with the neural network becoming the computational bottleneck in the evaluation of $h_{NN}$.
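The following is a minimal PyTorch sketch (ours, not the paper's exact implementation) of the training loop for $\rho$ in Eq. (97), for $d = 2$; the module shapes follow the description above, and `phi`/`h1` follow the reconstructions used earlier in this appendix.

```python
import torch
import torch.nn as nn

C = 2 * torch.pi  # scaling constant for the h1 component (assumed C >= 2*pi)

def phi(s):
    """Stereographic projection from the north pole."""
    return s[..., :-1] / (1 - s[..., -1:])

def h1(x):
    r2 = (x ** 2).sum(-1, keepdim=True)
    return torch.arccos((1 - r2) / (1 + r2)) * x / r2.sqrt().clamp_min(1e-12)

# rho: R^2 -> R^3, an MLP with two hidden layers of 128 neurons
rho = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 3))

def h_nn(x):
    return torch.cat([h1(x) / C, rho(x)], dim=-1)

opt = torch.optim.Adam(rho.parameters(), lr=1e-3)
for _ in range(5000):
    s = torch.randn(2048, 2, 3)
    s = s / s.norm(dim=-1, keepdim=True)            # (s, s') ~ Unif(S^2)^2
    geo = torch.arccos((s[:, 0] * s[:, 1]).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7))
    emb = (h_nn(phi(s[:, 0])) - h_nn(phi(s[:, 1]))).norm(dim=-1)
    loss = ((geo - emb) ** 2).mean()                # Eq. (97)
    opt.zero_grad(); loss.backward(); opt.step()
```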
I. Numerical Experiments

In this section, we aim to demonstrate our proposed S3W as an efficient metric suitable for various machine learning tasks. In I.2, we show that S3W and its variants provide a high-speed alternative to SSW as an effective loss for gradient-based optimization on the spherical manifold. In I.4, we study the evolution of our proposed distances w.r.t. varying parameters. In I.5, we discuss the computational efficiency of our method via a runtime study. Sections I.6, I.7, I.8, and I.9 provide experiments in practical ML settings. Throughout our experiments, we use
$$h(x) = h_1(x) := \arccos\left(\frac{1 - \|x\|^2}{\|x\|^2 + 1}\right)\frac{x}{\|x\|}.$$

I.1. Additional Algorithmic and Implementation Details

Let $\{(\alpha_i, x_i)\}_{i=1}^M \sim \hat\mu$ and $\{(\beta_j, y_j)\}_{j=1}^N \sim \hat\nu$ be the weighted samples of the empirical measures $\hat\mu$ and $\hat\nu$, respectively, where $\sum_{i=1}^M \alpha_i = 1$ and $\sum_{j=1}^N \beta_j = 1$. We provide below a general formulation for computing S3W$_p$ between $\hat\mu$ and $\hat\nu$. We then define the procedures to compute RI-S3W$_p$ and ARI-S3W$_p$. By abuse of notation, we denote these algorithms as S3W, RI-S3W, and ARI-S3W, respectively.

I.1.1. ALTERNATIVE IMPLEMENTATION TO COMPUTE S3W$_p$

Algorithm 2 Stereographic Spherical Sliced-Wasserstein (S3W)
Input: $\{(\alpha_i, x_i)\}_{i=1}^M \sim \hat\mu$, $\{(\beta_j, y_j)\}_{j=1}^N \sim \hat\nu$, $L$ projections, order $p$, $\epsilon$
Initialize: $h$ (injective map), $\{\theta_l\}_{l=1}^L$ (projection directions), $d = 0$
Calculate $\{u_i = h(\phi_\epsilon(x_i))\}_{i=1}^M$ and $\{v_j = h(\phi_\epsilon(y_j))\}_{j=1}^N$
for $l = 1$ to $L$ do
  Project onto $\theta_l$: $u_i^l = \langle u_i, \theta_l\rangle$ and $v_j^l = \langle v_j, \theta_l\rangle$
  Sort $\{u_i^l\}$, $\{v_j^l\}$ such that $u^l_{\pi_l[i]} \le u^l_{\pi_l[i+1]}$ and $v^l_{\pi'_l[j]} \le v^l_{\pi'_l[j+1]}$
  Compute the (weighted) CDFs $\{\bar u^l_i = F_{u^l}(u^l_{\pi_l[i]})\}_{i=1}^M$ and $\{\bar v^l_j = F_{v^l}(v^l_{\pi'_l[j]})\}_{j=1}^N$
  Merge and sort the $\{\bar u^l_i\}_i$ and $\{\bar v^l_j\}_j$ into $\{z^l_k\}_{k=1}^{M+N}$ with $z^l_k \le z^l_{k+1}$
  For each $z^l_k$, compute $F^{-1}_{u^l}(z^l_k)$ and $F^{-1}_{v^l}(z^l_k)$ by setting $F^{-1}_{u^l}(z^l_k)$ to the value $u^l_i$ such that $F_{u^l}(u^l_i)$ is nearest to $z^l_k$; proceed similarly for $F^{-1}_{v^l}(z^l_k)$
  $d_l = \sum_{k=1}^{M+N} |F^{-1}_{u^l}(z^l_k) - F^{-1}_{v^l}(z^l_k)|^p\,(z^l_k - z^l_{k-1})$
  Update the total distance: $d = d + \frac{1}{L}d_l$
end for
Return S3W$_p(\hat\mu, \hat\nu) \approx d^{1/p}$

I.1.2. IMPLEMENTATION TO COMPUTE RI-S3W$_p$

Algorithm 3 Rotationally Invariant S3W (RI-S3W)
Input: $\{(\alpha_i, x_i)\}_{i=1}^M \sim \hat\mu$, $\{(\beta_j, y_j)\}_{j=1}^N \sim \hat\nu$, $L$ projections, order $p$, $\epsilon$, $N_R$ rotations
Sample $\{R_r\}_{r=1}^{N_R} \sim SO(d+1)$; initialize $d = 0$
for $r = 1$ to $N_R$ do
  Apply the rotation $R_r$ to obtain $X_r = \{(\alpha_i, R_r x_i)\}_{i=1}^M$ and $Y_r = \{(\beta_j, R_r y_j)\}_{j=1}^N$
  Use Algorithm 2 on the inputs $X_r, Y_r$ to compute $d_r$
  Update the total distance: $d = d + \frac{1}{N_R}d_r$
end for
Return RI-S3W$_p(\hat\mu, \hat\nu) \approx d$

I.1.3. IMPLEMENTATION TO COMPUTE ARI-S3W$_p$

Although RI-S3W is highly parallelizable via batch matrix multiplication, generating random rotation matrices can still be computationally expensive. This is especially true for high-dimensional data and for training setups with small batch sizes, where RI-S3W gets called repeatedly. One way to mitigate this issue is to pregenerate a pool of $N_{total}$ rotation matrices, $\{R_k\}_{k=1}^{N_{total}} \subset SO(d+1)$, and subsample from this pool to amortize the generation cost. We introduce the Amortized Rotationally Invariant S3W (ARI-S3W) in Algorithm 4. This process can be made seamless by managing the rotation matrices in a stateful manner; that is, we maintain a RotationManager class with static variables to ensure that once a set of matrices is generated, they can be efficiently reused. We describe the abstract procedure to compute ARI-S3W in Algorithm 4; a compact Python sketch of all three procedures follows it.

Algorithm 4 Amortized Rotationally Invariant S3W (ARI-S3W)
Input: $\{(\alpha_i, x_i)\}_{i=1}^M \sim \hat\mu$, $\{(\beta_j, y_j)\}_{j=1}^N \sim \hat\nu$, $L$ projections, order $p$, $\epsilon$, pool size $N_{total}$, $N_R$ rotations, pregenerated rotations $\mathcal{R} = \{R_k\}_{k=1}^{N_{total}} \subset SO(d+1)$ with $N_{total} \ge N_R$
Initialize $d = 0$; randomly subsample $N_R$ rotations $\tilde{\mathcal{R}} = \{\tilde R_r \in \mathcal{R}\}_{r=1}^{N_R}$
for $r = 1$ to $N_R$ do
  Apply $\tilde R_r$ to obtain $X_r = \{(\alpha_i, \tilde R_r x_i)\}_{i=1}^M$ and $Y_r = \{(\beta_j, \tilde R_r y_j)\}_{j=1}^N$
  Use Algorithm 2 on the inputs $X_r, Y_r$ to compute $d_r$
  Update the total distance: $d = d + \frac{1}{N_R}d_r$
end for
Return ARI-S3W$_p(\hat\mu, \hat\nu) \approx d$
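The following is a self-contained NumPy sketch (ours) of Algorithms 2-4 for the common special case of uniform weights and $M = N$, where the inverse-CDF comparison in Algorithm 2 reduces to comparing sorted projections; the $\epsilon$-cap of Section I.3 is included.

```python
import numpy as np

def cap(s, eps):
    """Epsilon-cap: pull points with x_{d+1} > 1 - eps down to x_{d+1} = 1 - eps."""
    s = s.copy()
    m = s[:, -1] > 1 - eps
    if m.any():
        xy = s[m, :-1]
        xy = xy / np.linalg.norm(xy, axis=1, keepdims=True) * np.sqrt(1 - (1 - eps) ** 2)
        s[m, :-1], s[m, -1] = xy, 1 - eps
    return s

def phi(s):
    """Stereographic projection from the north pole."""
    return s[:, :-1] / (1 - s[:, -1:])

def h1(x):
    r2 = np.sum(x ** 2, axis=1, keepdims=True)
    return np.arccos((1 - r2) / (1 + r2)) * x / np.sqrt(r2)

def s3w(x, y, n_proj=100, p=2, eps=1e-6, rng=None):
    """Sketch of Algorithm 2, assuming uniform weights and M == N."""
    rng = rng or np.random.default_rng()
    u, v = h1(phi(cap(x, eps))), h1(phi(cap(y, eps)))
    theta = rng.normal(size=(n_proj, u.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    pu = np.sort(u @ theta.T, axis=0)        # 1-d optimal transport via sorting
    pv = np.sort(v @ theta.T, axis=0)
    return (np.abs(pu - pv) ** p).mean() ** (1 / p)   # average over the L slices

def random_rotation(k, rng):
    """Random element of SO(k) via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(k, k)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def ri_s3w(x, y, n_rot=5, pool=None, rng=None, **kw):
    """RI-S3W (Algorithm 3); pass a pregenerated `pool` of rotations for ARI-S3W (Algorithm 4)."""
    rng = rng or np.random.default_rng()
    if pool is None:
        rots = [random_rotation(x.shape[1], rng) for _ in range(n_rot)]
    else:
        rots = [pool[i] for i in rng.choice(len(pool), size=n_rot, replace=False)]
    return float(np.mean([s3w(x @ R.T, y @ R.T, rng=rng, **kw) for R in rots]))
```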
I.2. Study: Gradient Flow on the Sphere

I.2.1. BACKGROUND OVERVIEW

The von Mises-Fisher (vMF) distribution is a generalization of the Gaussian from $\mathbb{R}^d$ to $\mathbb{S}^d$. Its parametric form is characterized by the mean direction $\mu \in \mathbb{S}^d$ and the concentration parameter $\kappa > 0$:
$$f(x; \mu, \kappa) = C_d(\kappa)\exp(\kappa\mu^T x), \quad (98)$$
where $C_d(\kappa) = \frac{\kappa^{(d+1)/2 - 1}}{(2\pi)^{(d+1)/2}I_{(d+1)/2 - 1}(\kappa)}$ is the normalization constant for $x \in \mathbb{S}^d \subset \mathbb{R}^{d+1}$. In our experimental setting, we mostly work with $d = 2$, for which the pdf is
$$f(x; \mu, \kappa) = \frac{\kappa}{4\pi\sinh(\kappa)}\exp(\kappa\mu^T x). \quad (99)$$

Gradient descent on the sphere. The sphere $\mathbb{S}^d$ is a Riemannian manifold where Euclidean geometry holds only locally. This local linearization is sufficient to adapt gradient descent to follow geodesic paths, which replace straight lines in Euclidean space. For a function $f : \mathbb{R}^m \to \mathbb{R}$, the classical Euclidean gradient update rule is given by
$$x_{t+1} = x_t - \gamma\nabla_{Euc}f(x_t), \quad (100)$$
where $\nabla_{Euc}f(x_t)$ denotes the Euclidean gradient at $x_t$; from now on, we simply write $\nabla f(x_t)$. On manifolds, where straight lines are generalized to geodesic curves, we replace $\nabla f(x_t)$ with the Riemannian gradient
$$\nabla_{Riem}f(x) = \nabla f(x) - \langle\nabla f(x), x\rangle x,$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product. While this gives the local linearized direction, it does not specify how to move along the manifold itself. To address this, one can use a retraction.⁴ When that retraction is the normalization, the update rule becomes
$$x_{t+1} = \mathrm{Retr}_{x_t}(-\gamma\nabla_{Riem}f(x_t)) = \frac{x_t - \gamma(\nabla f(x_t) - \langle\nabla f(x_t), x_t\rangle x_t)}{\|x_t - \gamma(\nabla f(x_t) - \langle\nabla f(x_t), x_t\rangle x_t)\|}. \quad (101)$$
If we use $\nabla f(x_t)$ instead of $\nabla_{Riem}f(x_t)$, we recover the familiar Projected Gradient Descent. Another option for the retraction is the exponential map (see Eq. (111)), which projects $-\gamma\nabla_{Riem}f(x_t)$ from the tangent space $T_x\mathbb{S}^d$ back onto the manifold along the geodesic. Writing $v_t = -\gamma\nabla_{Riem}f(x_t)$, the gradient update is
$$x_{t+1} = \exp_{x_t}(v_t) = \cos(\|v_t\|)\,x_t + \sin(\|v_t\|)\,\frac{v_t}{\|v_t\|}.$$
This variant is Riemannian Gradient Descent (Absil et al., 2008). For simplicity, we use Projected Gradient Descent in our gradient-based optimization, similar to (Bonet et al., 2023a).

⁴ A mapping that approximates geodesics and projects the updated point in the tangent space back onto the manifold.

I.2.2. EXPERIMENT: LEARNING A MIXTURE OF 12 VON MISES-FISHER DISTRIBUTIONS

Assume we only have access to the target measure via its samples $\{y_j\}_{j} \sim \hat\nu_{m_i}$. Our objective is to iteratively minimize $\mathrm{argmin}_\mu\, d(\hat\mu_i, \hat\nu_{m_i})$. The Projected Gradient Descent algorithm for S3W gives the following update rule (a sketch of this loop is given below):
$$x'_{i,k+1} = x_{i,k} - \gamma\nabla_{x_{i,k}}\mathrm{S3W}_p^p(\hat\mu_k, \hat\nu_{m_i}), \quad (103)$$
$$x_{i,k+1} = \frac{x'_{i,k+1}}{\|x'_{i,k+1}\|_2}, \quad (104)$$
where $\gamma$ denotes the learning rate, $i$ indexes the mini-batches, and $k$ the gradient step.

Implementation. We consider a mixture of 12 vMFs as our target on $\mathbb{S}^2$, with each vMF centered on a vertex of an icosahedron determined by the golden ratio $\varphi = (1 + \sqrt{5})/2$. We assume access to 2400 samples from the mixture (200 per vMF), each with $\kappa = 50$, and set the batch sizes to $|m_i| \in \{200, 2400\}$. We fix the number of projections to 1000 for all distances. We optimize over 500 gradient steps using the Adam optimizer (Kingma & Ba, 2014) (learning rate $\gamma = 0.01$ for full-batch and $0.001$ for mini-batch) for 10 independent runs. We report both qualitative and quantitative results (NLL, W2); the latter comes with mean and standard deviation.
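The following is a minimal PyTorch sketch (ours) of the projected gradient flow of Eqs. (103)-(104), assuming a differentiable `s3w_pp` loss (e.g., a PyTorch port of Algorithm 2 raised to the $p$-th power):

```python
import torch

def gradient_flow(x_src, y_tgt, s3w_pp, steps=500, lr=0.01):
    """Projected gradient descent on S^d, Eqs. (103)-(104).
    x_src: (N, d+1) source particles; y_tgt: (M, d+1) target samples."""
    x = (x_src / x_src.norm(dim=1, keepdim=True)).clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = s3w_pp(x, y_tgt)            # S3W_p^p between empirical measures
        opt.zero_grad()
        loss.backward()
        opt.step()                         # Euclidean step, Eq. (103)
        with torch.no_grad():              # project back onto the sphere, Eq. (104)
            x /= x.norm(dim=1, keepdim=True)
    return x.detach()
```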
Full-batch Results. In addition to the numerical results provided in the main paper (refer to Figure 4), we provide the Mollweide projections in Figure 16 to show that all distances work comparably well in learning the target distribution. In this setting, we also include the evaluation of the vertical sliced Wasserstein (VSW) distance (Quellmalz et al., 2023) in Figure 16 and Table 4 (see Section G.1).

Mini-batch Results. We provide below our mini-batch results for all distances. In addition to the numerical results reported in Figure 4, we also show the Mollweide projections for the learned distributions as well as particle scatter plots at epochs {0, 100, 300, 500}. In Table 4, S3W shows the fastest runtime for both full-batch and mini-batch optimization; it is almost an order of magnitude faster than SSW. RI-S3W(1) adds negligible computational overhead to S3W with the single added rotation but performs almost on par with SSW. RI-S3W(5) and RI-S3W(10) are both high-performing candidates, exceeding SSW albeit with some additional runtime compared to S3W. ARI-S3W(30/1000) shows a significant performance gain while being almost as fast as S3W. We remark that the smaller the batch size, the more frequently the distance function gets called, and the bigger the advantage ARI-S3W has over RI-S3W, since the rotation-generation cost is amortized over a large number of function calls and becomes negligible. Both RI-S3W and ARI-S3W can achieve additional efficiency with a rotation scheduler; in particular, one could start from a few rotations and gradually (e.g., linearly) increase the number of rotations over time for fine-tuning. We provide additional visualization and numerical results for RI-S3W and ARI-S3W with a linear schedule from 1 to 30 rotations over 500 epochs. We observe that ARI-S3W(1-30) outperforms ARI-S3W(30) while being twice as fast.

Figure 15. Mollweide projections for mini-batch projected gradient descent, with panels (c) RI-S3W(1), (d) RI-S3W(5), (e) RI-S3W(10), (f) ARI-S3W(30), (g) RI-S3W(1-30), (h) ARI-S3W(1-30). ARI-S3W has a pool size of 1000. RI-S3W(1-30) and ARI-S3W(1-30) denote RI-S3W and ARI-S3W with a linear $N_R$ schedule from 1 to 30 over 500 epochs.

Figure 16. Mollweide projections for full-batch projected gradient descent, with the same panels and settings as Figure 15.

Table 4. Comparison between different distances as losses for gradient flows.

Mini-batch:
Distance            | Runtime (s)   | NLL               | log(W2)
SSW                 | 46.10 ± 0.28  | -284.61 ± 8.44    | -1.19 ± 0.05
S3W                 | 5.16 ± 0.62   | -205.13 ± 17.56   | -1.17 ± 0.055
RI-S3W (1)          | 5.50 ± 0.96   | -254.54 ± 9.39    | -1.18 ± 0.06
RI-S3W (5)          | 8.62 ± 0.94   | -305.15 ± 11.79   | -1.22 ± 0.05
RI-S3W (10)         | 8.75 ± 0.61   | -320.74 ± 8.34    | -2.80 ± 0.12
ARI-S3W (30/1000)   | 6.75 ± 0.12   | -343.71 ± 2.45    | -3.02 ± 0.17
RI-S3W (1-30)       | 7.67 ± 0.10   | -322.35 ± 7.69    | -2.96 ± 0.20
ARI-S3W (1-30/1000) | 4.38 ± 0.19   | -264.55 ± 13.31   | -2.84 ± 0.15

Full-batch:
SW                  | 0.64 ± 0.1    | -4891.85 ± 3.59   | -4.70 ± 0.22
VSW                 | 0.66 ± 0.01   | -4858.75 ± 6.37   | -2.32 ± 0.08
RI-S3W (1-30)       | 26.90 ± 0.13  | -5003.21 ± 1.88   | -6.68 ± 0.06
ARI-S3W (1-30/1000) | 23.78 ± 0.36  | -5048.43 ± 6.16   | -5.40 ± 0.03

Figure 17. Particle evolution for mini-batch projected gradient descent.

I.3. Study: Stability of Metrics w.r.t. the ϵ-cap

While in theory one only needs to exclude the north pole from the stereographic projection (SP) operation, the image of points near the north pole under SP can be made arbitrarily large in norm. Numerically, this may result in overflows and make the optimization in our ML applications unstable.
Hence, we fix an ϵ-cap for SP, which ensures that the norm of the projected samples is bounded; in particular, any point with $x_{d+1} > 1 - \epsilon$ is first mapped to the circle $x_{d+1} = 1 - \epsilon$ prior to the SP operation. In this section, we provide additional analysis of the effect of this ϵ-cap on the proposed metrics. We generate two vMFs (target at the north pole and source at the south pole) with $\kappa = 50$ and $N = 2048$ samples each. We then compute each of the proposed metrics between the source and target distributions as a function of ϵ. We visualize the results over 100 runs for each metric in Figure 18. We note that it is expected for the behavior of S3W to differ from RI-S3W and ARI-S3W. Importantly, we observe that all of the proposed methods are stable for a wide range of values of ϵ. We observe no notable difference when we vary the number of projections $L \in \{128, 512\}$.

Figure 18. Stability of each metric w.r.t. ϵ. Here, we use $N = 2048$ samples from each distribution and fix $L = 128$ projections.

I.4. Study: Evolution of the S3W Distance

I.4.1. EXPERIMENT: EVOLUTION OF METRICS W.R.T. ITERATION AND RUNTIME

Implementation. In this experiment, we aim to compare the evolution of different distances when used as losses for gradient-based optimization. We randomly initialize a uniform source distribution $\hat\mu \sim \mathrm{Unif}(\mathbb{S}^2)$ of 2400 samples. The objective is to learn a mixture $\hat\nu$ of 12 vMFs, similar to Section 5.1, also consisting of 2400 samples. We use the Adam optimizer for projected gradient descent (see Section I.2) with a learning rate of $\gamma = 0.01$ for all distances. The SSW baseline (Bonet et al., 2023a) has an additional run with learning rate $\gamma = 0.05$.

Results. Figure 4 (in the main paper) shows the comparative performance of SSW, S3W, RI-S3W (1 random rotation), RI-S3W (5 random rotations), and ARI-S3W (30 random rotations, pool size of 1000). We show the evolution of the log(W2) loss w.r.t. the cumulative runtime. We observe that at a learning rate of 0.01, SSW evolves nicely and starts converging around epoch 500, which is similar in performance to RI-S3W(1) and RI-S3W(5), albeit an order of magnitude slower. For a learning rate of 0.05, SSW begins to display oscillating behavior but converges similarly to the 0.01 case. S3W is the fastest but slightly underperforms the other distances. ARI-S3W(30) converges to the lowest loss and twice as fast as SSW for both learning rates.

I.4.2. EXPERIMENT: EVOLUTION OF METRICS W.R.T. κ FOR VARYING DIMENSIONS

Figure 19. Evolution of (a) KL(vMF(µ, κ) ‖ vMF(·, 0)), (b) SSW2, (c) S3W2, (d) RI-S3W2, and (e) ARI-S3W2 between vMF(µ, κ) and vMF(·, 0) w.r.t. κ for varying dimension $d \in \{3, 25, 50, 100\}$. We use $L = 200$ projections for all sliced metrics, $N_R = 100$ rotations for RI-S3W and ARI-S3W, and a pool size of 1000 for ARI-S3W. Each distribution has 500 samples. For each $\kappa \in \{1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250\}$, we average each metric over 10 iterations.

We illustrate in Figure 19 the evolution of the KL divergence, SSW, S3W, RI-S3W, and ARI-S3W between vMF(µ, κ) and vMF(·, 0) w.r.t. κ for varying dimension.
Just as (Bonet et al., 2023a) found that SSW decreases with the dimension, contrary to the KL divergence, we find that S3W, RI-S3W, and ARI-S3W follow a similar trend. Here we use the analytic form of the KL divergence between the von Mises-Fisher distribution and the uniform distribution on $\mathbb{S}^d$, as derived in (Davidson et al., 2018; Xu & Durrett, 2018):
$$\mathrm{KL}\left(\mathrm{vMF}(\mu, \kappa)\,\|\,\mathrm{vMF}(\cdot, 0)\right) = \kappa\frac{I_{(d+1)/2}(\kappa)}{I_{(d+1)/2 - 1}(\kappa)} + \left(\frac{d+1}{2} - 1\right)\log\kappa - \frac{d+1}{2}\log(2\pi) - \log I_{(d+1)/2 - 1}(\kappa) + \frac{d+1}{2}\log\pi + \log 2 - \log\Gamma\left(\frac{d+1}{2}\right),$$
where $I_v(\cdot)$ is the modified Bessel function of the first kind with order $v$, and $\Gamma(\cdot)$ is the gamma function. (A short numerical sketch of this expression is given below.)
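For reference, a minimal sketch of this closed-form expression (ours, using SciPy; the exponentially scaled Bessel function `ive` is used for numerical stability at large κ):

```python
import numpy as np
from scipy.special import ive, gammaln

def kl_vmf_uniform(d, kappa):
    """KL(vMF(mu, kappa) || Unif(S^d)) for the sphere S^d in R^{d+1}."""
    nu = (d + 1) / 2 - 1
    # log I_nu(kappa) via the exponentially scaled Bessel function: ive = iv * exp(-kappa)
    log_i_nu = np.log(ive(nu, kappa)) + kappa
    log_i_nu1 = np.log(ive(nu + 1, kappa)) + kappa
    ratio = np.exp(log_i_nu1 - log_i_nu)          # I_{nu+1}(kappa) / I_nu(kappa)
    return (kappa * ratio
            + nu * np.log(kappa)
            - (d + 1) / 2 * np.log(2 * np.pi)
            - log_i_nu
            + np.log(2) + (d + 1) / 2 * np.log(np.pi) - gammaln((d + 1) / 2))
```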
I.4.3. EXPERIMENT: EVOLUTION OF METRICS BETWEEN ROTATED VON MISES-FISHER DISTRIBUTIONS

Similar to (Bonet et al., 2023a), we also compare the evolution of SSW, S3W, RI-S3W, and ARI-S3W between a fixed vMF distribution and rotations of this distribution along a great circle. That is, we compute each metric between vMF(µ₀, κ₀) and vMF((cos θ, sin θ, 0, . . . ), κ₀) for $\theta \in \{k\pi/6\}_{k=0}^{12}$, where µ₀ = (1, 0, 0, . . . ) and κ₀ = 10. We observe similar behavior across all metrics, with each distance being maximal between vMF(µ₀, κ₀) and vMF(−µ₀, κ₀) (corresponding to θ = π).

Figure 20. Evolution between vMF distributions on $\mathbb{S}^{d-1}$ for $d \in \{3, 25, 50\}$; the panels show SSW, S3W, RI-S3W, and (d) ARI-S3W. Here we use $L = 200$ projections for all metrics, $N_R = 100$ rotations for RI-S3W and ARI-S3W, and a pool size of 1000 for ARI-S3W. We average each metric over 100 iterations and use 500 samples for each distribution.

I.4.4. EXPERIMENT: EVOLUTION OF S3W W.R.T. NUMBER OF SLICES AND NUMBER OF ROTATIONS

We perform several experiments to understand how S3W, RI-S3W, and ARI-S3W evolve w.r.t. the number of slices, the number of random rotations, and the pool size, respectively. In each of the following experiments, we measure the distance between a source uniform distribution vMF(·, 0) and a target von Mises-Fisher distribution vMF(µ, κ), where we sample 500 points from each distribution. Figure 21 demonstrates that beyond $L = 100$ projections, the variance becomes negligible across different dimensions and values of κ. Figure 22 demonstrates that RI-S3W is generally stable beyond $N_R = 10$ rotations across varying dimensions and κ. Finally, Figure 23 demonstrates that the variance is stable beyond a pool size of 20. In all plots, we average the metric over 20 iterations.

Figure 21. Evolution of S3W between the source and target vMFs w.r.t. the number of projections ($d \in \{3, 10, 50, 100, 500, 1000\}$). For the plot on the left, we fix κ = 10 for the target distribution; for the plot on the right, we use d = 3.

Figure 22. Evolution of RI-S3W between the source and target vMFs w.r.t. the number of rotations. We fix $L = 10$ projections. For the plot on the left, we fix κ = 10 for the target distribution; for the plot on the right, we use d = 3.

Figure 23. Evolution of ARI-S3W between the source and target vMFs w.r.t. the pool size. We fix $L = 10$ projections and $N_R = 10$ rotations. For the plot on the left, we fix κ = 10 for the target distribution; for the plot on the right, we use d = 3.

I.4.5. EXPERIMENT: SAMPLE COMPLEXITY OF METRICS

We perform a series of experiments to better understand the sample complexity of S3W, RI-S3W, and ARI-S3W. In applications, since we employ empirical probability measures to approximate their continuous counterparts, we seek to understand how each of the proposed metrics evolves as we vary the number of supports of the empirical distributions. In each of the following experiments, we measure the proposed distances between $N$ samples from two von Mises-Fisher distributions, centered at the north and south poles of $\mathbb{S}^d$ with κ = 10. We compute the distance as a function of the number of samples for $d \in \{3, 10, 50, 100, 500, 1000\}$, repeating each computation over 50 iterations. The resulting averages and standard deviations are visualized in Figure 24.

Figure 24. Sample complexity of (a) S3W2, (b) RI-S3W2, and (c) ARI-S3W2. Distances are computed between two vMF distributions on $\mathbb{S}^d$ for $d \in \{3, 10, 50, 100, 500, 1000\}$. We use $L = 1000$ projections for all metrics, $N_R = 10$ rotations for RI-S3W and ARI-S3W, and a pool size of 100 for ARI-S3W. Each distance is measured over 50 iterations.

I.5. Study: Runtime Analysis

In this section, we provide additional runtime analysis of the proposed method w.r.t. varying parameters. Figure 25 illustrates a comparative analysis of the runtime for calculating S3W and RI-S3W as we vary the number of projections, samples, and rotations, across different dimensions $d$ of the data. In each of these experiments, we measure the distance between a source uniform distribution vMF(·, 0) and a target von Mises-Fisher distribution vMF(µ, 10). The figure indicates a linear or quasi-linear relationship between the runtime and each parameter. Moreover, in Figure 26 we illustrate the time required to generate a pool of rotation matrices in SO(d + 1) as we vary the size of the pool and the dimension $d$. As expected, the runtime is linear in the size of the rotation pool.

Figure 25. Runtime w.r.t. varying number of projections, samples, and rotations, respectively ($d \in \{3, 10, 50, 100, 500, 1000\}$). In 25a and 25c we use $N = 500$ samples of each distribution; in 25b and 25c we fix $L = 100$ projections; and in 25a and 25b we do not use any rotations.

Figure 26. Runtime to generate pools of rotation matrices of varying sizes and dimensions.

I.6. Task: Density Estimation with Normalizing Flows

I.6.1. BACKGROUND OVERVIEW

Normalizing flows offer a framework for constructing complex probability distributions from simple ones through invertible transformations; the model class is restricted so that the likelihood function can be evaluated exactly and samples can be drawn easily. Suppose we have access to the samples $\{x_i \sim \mu\}_{i=1}^n$, where µ is our target distribution. Let $p_Z$ be a base distribution over the latent $z$ (e.g., a Gaussian). We aim to find an invertible transformation $T$ such that the pushforward satisfies $T_\#\mu = p_Z$.
Then, the density $p_\mu$ can be obtained through the change-of-variables formula
$$p_\mu(x) = p_Z(T(x))\,|\det J_T(x)|, \quad (106)$$
where $|\det J_T(x)|$ denotes the absolute value of the determinant of the Jacobian of $T$ at $x$. More concretely, consider the dataset $D_{train} = \{x_1, \dots, x_n\}$ sampled i.i.d., with empirical measure $\hat\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, where $\delta_{x_i}$ is the Dirac delta at $x_i$. The log-likelihood for training the neural network is given by
$$\log p(D_{train}|w) = \sum_{i=1}^n \log p_X(x_i|w) = \sum_{i=1}^n \left\{\log p_Z(T(x_i)) + \log|\det J_T(x_i)|\right\}. \quad (107)$$

RealNVP and Stereographic Projection-based Normalizing Flows with RealNVP: RealNVP (Real-valued Non-Volume Preserving) (Dinh et al., 2016) uses a series of invertible layers to transform the base distribution into a more complex one. Each layer, known as a coupling layer, is designed such that its Jacobian is easy to compute. Specifically, given an input vector $x \in \mathbb{R}^m$, the coupling layer splits it into two parts $x_1 \in \mathbb{R}^{m_1}$ and $x_2 \in \mathbb{R}^{m_2}$ such that $m_1 + m_2 = m$. The transformation applied to $x_2$ is then conditioned on $x_1$ and defined by
$$x_2' = \exp(s(x_1)) \odot x_2 + t(x_1), \quad (108)$$
where $s : \mathbb{R}^{m_1} \to \mathbb{R}^{m_2}$ and $t : \mathbb{R}^{m_1} \to \mathbb{R}^{m_2}$ are scale and translation functions realized through neural networks, and $\odot$ denotes the Hadamard product. The key advantage of RealNVP is the tractability of both its inverse and its Jacobian determinant, namely
$$\log|\det J_T(x)| = \sum_j s(x_1)_j, \quad (109)$$
which makes it a good candidate for high-dimensional density estimation (a minimal coupling-layer sketch is given below). To generalize to $\mathbb{S}^d$, (Gemici et al., 2016) proposes using a transformation $T$ composed of a stereographic projection $\phi$ (see Section 2) followed by a RealNVP flow $f$, denoted as $T = f \circ \phi$. The log-density of the target distribution under this transformation is given by
$$\log p(x) = \log p_Z(z) + \log|\det J_f(z)| - \frac{1}{2}\log\left|\det\left(J_{\phi^{-1}}^T J_{\phi^{-1}}(\phi(x))\right)\right|, \quad (110)$$
where $p_Z$ is the density of a prior on $\mathbb{R}^d$, typically a standard Gaussian, $z = f(\phi(x))$, and $J_f(z)$ is the Jacobian of the RealNVP flow. The log-det term accounts for the change in volume due to the stereographic projection $\phi$ and its inverse $\phi^{-1}$.
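The following is a minimal PyTorch sketch of an affine coupling layer implementing Eqs. (108)-(109); it is a hypothetical illustration (ours), not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """Affine coupling layer: transforms x2 conditioned on x1, Eq. (108)."""
    def __init__(self, m1, m2, hidden=64):
        super().__init__()
        self.m1 = m1
        self.net = nn.Sequential(nn.Linear(m1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * m2))

    def forward(self, x):
        x1, x2 = x[:, :self.m1], x[:, self.m1:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = torch.exp(s) * x2 + t                 # Eq. (108)
        log_det = s.sum(dim=-1)                    # Eq. (109)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.m1], y[:, self.m1:]
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)
```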
Exponential Map and Exponential Map Normalizing Flows: Exponential maps serve as a tool for mapping data residing on curved spaces. In essence, the exponential map sends a vector in the tangent space at a point of a manifold $M$ to the manifold itself. Specifically, for a point $x \in M$ and a tangent vector $v \in T_xM$, the exponential map $\exp_x(v)$ traces a geodesic (i.e., a locally shortest path) on $M$. For the spherical manifold $\mathbb{S}^d$, it is given by
$$\exp_x(v) = x\cos(\|v\|) + \frac{v}{\|v\|}\sin(\|v\|). \quad (111)$$
Inspired by this, (Rezende et al., 2020) introduces exponential map normalizing flows, which ensure that the construction of transformations happens directly on the manifold, avoiding the need for intermediate, non-diffeomorphic mappings to Euclidean spaces (e.g., via the stereographic projection). In this framework, the transformation $T$ is defined by applying the exponential map to the projected gradient of a scalar field $\phi(x)$, that is,
$$T(x) = \exp_x(\mathrm{Proj}_x(\nabla\phi(x))), \quad (112)$$
where $\mathrm{Proj}_x$ projects $\nabla\phi(x)$ onto the tangent space $T_x\mathbb{S}^d$ at $x$. The scalar field $\phi(x)$ is a convex combination of simple components,
$$\phi(x) = \sum_{i=1}^K \frac{\alpha_i}{\beta_i}e^{\beta_i(x^T\mu_i - 1)}, \quad (113)$$
with constraints $\alpha_i \ge 0$, $\sum_{i=1}^K \alpha_i \le 1$, each $\beta_i > 0$, and $\mu_i \in \mathbb{S}^d$. The density update is then computed as
$$p(T(x)) = \frac{\pi(x)}{\sqrt{\det(E(x)^T J_T(x)^T J_T(x) E(x))}}, \quad (114)$$
where $J_T(x)$ is the Jacobian of $T$ at $x$, and $E(x)$ forms an orthonormal basis of $T_x\mathbb{S}^d$.

I.6.2. EXPERIMENT: ESTIMATING DENSITY ON THE EARTH DATASETS

Table 5. Details of the Earth datasets.
       | Earthquake | Flood | Fire
Train  | 4284       | 3412  | 8966
Test   | 1836       | 1463  | 3843
Total  | 6120       | 4875  | 12809

Our main focus is on $\mathbb{S}^2$ for the purpose of demonstration and visualization. Similar to (Bonet et al., 2023a), we use the three datasets introduced by (Mathieu & Nickel, 2020), representing the Earth's surface as a perfect spherical manifold: volcano eruptions (NOAA, 2020), earthquakes (NOAA, 2020), and floods (Brakenridge, 2017). Our objective is to learn a normalizing flow model capable of reporting an exact density at any point on the sphere, so that we can predict the likelihood of future events based on past data.

Figure 27. Ground truth as estimated with KDE (bandwidth 0.1) using the test data (quakes, floods, fires, respectively). Densities are on a scale of 0 to 1.

Specifically, we demonstrate two approaches for training a normalizing flow model on spherical data. The first follows (Gemici et al., 2016), stereographically projecting the data from $\mathbb{S}^2$ onto $\mathbb{R}^2$ and then training a RealNVP model on $\mathbb{R}^2$. The second utilizes spherical exponential maps (Rezende et al., 2020) to train a normalizing flow natively on $\mathbb{S}^2$, using both SW and SSW (Bonet et al., 2023a) as baseline losses and comparing with our proposed S3W.

Implementation. For exponential map normalizing flows, we construct the model using 48 radial blocks, each consisting of 100 components, totaling 24000 trainable parameters, similar to (Bonet et al., 2023a). Each block in the flow applies a transformation defined by a potential function $V$. The model projects the gradient of the potential function onto the sphere's tangent space and then maps it back to the sphere using the spherical exponential map. For stereographic projection-based normalizing flows (Stereo-NF), we first project the manifold $\mathbb{S}^2$ onto $\mathbb{R}^2$ using the stereographic projection. Then, we train a RealNVP model consisting of 10 RealNVP blocks with scaling and translation parameterized by an MLP (10 layers of 25 neurons each, LeakyReLU(0.2) activation). This setup follows (Bonet et al., 2023a) and has 27520 parameters. We train each model for 20000 epochs using full-batch gradient descent via the Adam optimizer. For the exponential map NF, we use a learning rate of 0.1 for SSW and SW, as in the original setup, and 0.05 for our S3W and its variants. For Stereo-NF, we use a learning rate of 0.001.

Results. Figure 33 provides density maps of events (quakes, floods, fires from left to right), with the color scale representing the relative density on a scale from 0 to 1. We note that this normalization scheme does not change the underlying distribution but allows for an interpretable visualization, showing the relative differences in densities on a fixed scale. We observe that our proposed S3W and its variants place the relative density mass more accurately. Table 3 provides the negative log-likelihood values for the different distances. Despite Stereo-NF having a higher parameter count, the increase in model complexity does not necessarily translate to better performance. This shows that estimating density natively on the sphere is superior for spherical data per this metric.

Figure 28. Stereo + RealNVP. Figure 29. Sliced-Wasserstein (SW). Figure 30. Spherical Sliced-Wasserstein (SSW) (Bonet et al., 2023a). Figure 31. Stereographic Spherical Sliced-Wasserstein (S3W).
Figure 32. Rotation Invariant S3W (RI-S3W) with 1 random rotation. Figure 33. Amortized Rotation Invariant S3W (ARI-S3W) with 50 random rotations and a pool size of 1000.

I.7. Task: Self-Supervised Representation Learning on the Sphere

The self-supervised learning (SSL) paradigm derives learning signals from its own training data without explicit labeling. Among various SSL approaches, contrastive learning-based methods have gained significant popularity due to their effectiveness. Previous works have found that constraining feature vectors to the hypersphere (i.e., having unit norm) offers additional benefits (Caron et al., 2020; Wu et al., 2018). In what follows, we provide a brief overview of (spherical) contrastive learning (currently one of the most potent approaches in SSL) and then outline the details of our experiment in Section I.7.2.

I.7.1. BACKGROUND OVERVIEW

Contrastive learning with InfoNCE in the SSL paradigm is fundamentally about distinguishing between similar (positive) and dissimilar (negative) pairs of data points. The objective is to minimize the distance between embeddings of similar data points while maximizing the distance between embeddings of dissimilar points. Let $p_{data}(\cdot)$ be the data distribution over $\mathbb{R}^m$ and $p_{pos}(\cdot, \cdot)$ the distribution of positive pairs over $\mathbb{R}^m \times \mathbb{R}^m$. Assuming symmetry, i.e., $\forall x, y : p_{pos}(x, y) = p_{pos}(y, x)$, and that the marginal distribution matches the data distribution, i.e., $\forall x : \int p_{pos}(x, y)\,dy = p_{data}(x)$, the popular InfoNCE (Oord et al., 2018) objective can be described as
$$\mathcal{L}_{contrastive} = \mathbb{E}_{(x, y)\sim p_{pos}}\left[-\log\frac{\exp(\mathrm{sim}(f(x), f(y))/\tau)}{\sum_{k\in\mathrm{Neg}(x)}\exp(\mathrm{sim}(f(x), f(x_k))/\tau)}\right], \quad (115)$$
where $f(x)$ and $f(y)$ denote the embeddings of a positive pair, $\mathrm{sim}(\cdot, \cdot)$ is a similarity measure between two embeddings (e.g., the dot product), $\tau$ is the temperature, and $\mathrm{Neg}(x)$ is the set of negative samples for $x$.

Spherical Contrastive Loss as Alignment and Uniformity: (Wang & Isola, 2020) provides valuable insights into the mechanics of contrastive learning by decomposing the contrastive loss into two components,
$$\mathcal{L}_{align}(f; \alpha) = \mathbb{E}_{(x,y)\sim p_{pos}}\left[\|f(x) - f(y)\|_2^\alpha\right], \quad \alpha > 0, \quad (116)$$
and
$$\mathcal{L}_{uniform}(f; t) = \log\,\mathbb{E}_{x, y \overset{i.i.d.}{\sim} p_{data}}\left[\exp(-t\|f(x) - f(y)\|_2^2)\right], \quad t > 0. \quad (117)$$
The overall objective is therefore
$$\mathcal{L}_{contrastive}(f; \alpha, \beta) = \alpha\mathcal{L}_{align} + \beta\mathcal{L}_{uniform}, \quad (118)$$
where $\alpha, \beta$ denote the component weights for $\mathcal{L}_{align}$ and $\mathcal{L}_{uniform}$, respectively. The alignment component $\mathcal{L}_{align}$ forces positive spherical views to be similar. The uniformity component $\mathcal{L}_{uniform}$ encourages all spherical views to be spread out (i.e., uniformly distributed) on the hypersphere, which prevents representation collapse.

I.7.2. EXPERIMENT: S3W-BASED UNIFORMITY LOSS

To demonstrate the efficacy of our proposed S3W in SSL, we follow the same setup as described in (Bonet et al., 2023a). Specifically, they propose to replace the Gaussian kernel in $\mathcal{L}_{uniform}$, which operates on pairwise instances, with the distributional distance SSW, and optimize the following objective:
$$\mathcal{L}_{SSW\text{-}SSL} = \underbrace{\frac{1}{n}\sum_{i=1}^n \|z_i^A - z_i^B\|_2^2}_{\text{Alignment loss}} + \underbrace{\frac{\lambda}{2}\left(\mathrm{SSW}_2^2(z^A, \nu) + \mathrm{SSW}_2^2(z^B, \nu)\right)}_{\text{Uniformity loss}}, \quad (119)$$
where $z^A, z^B \in \mathbb{R}^{n\times(d+1)}$ are the spherical representations of two views (i.e., augmentations) of the same images, $\nu = \mathrm{Unif}(\mathbb{S}^d)$ is the uniform distribution on the hypersphere, and $\lambda > 0$ is the regularization coefficient. They achieved comparable performance with (Wang & Isola, 2020; Chen et al., 2020) while benefiting from the subquadratic $O(Ln(d + \log n))$ complexity. We demonstrate the effectiveness of S3W in terms of performance and runtime by modifying Eq. (119) to instead optimize
$$\mathcal{L}_{S3W\text{-}SSL} = \frac{1}{n}\sum_{i=1}^n \|z_i^A - z_i^B\|_2^2 + \frac{\lambda}{2}\left(\mathrm{S3W}_2^2(z^A, \nu) + \mathrm{S3W}_2^2(z^B, \nu)\right). \quad (120)$$
We use analogous objectives for RI-S3W and ARI-S3W. (A sketch of this objective is given below.)
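The following is a minimal PyTorch sketch (ours) of the alignment-plus-sliced-uniformity objective of Eq. (120); `s3w2` stands for any differentiable implementation of the squared S3W distance (e.g., a port of the sketch following Algorithm 4), and the uniform target ν is represented empirically by fresh samples from Unif($\mathbb{S}^d$).

```python
import torch

def ssl_loss(z_a, z_b, s3w2, lam=0.5):
    """Alignment + S3W uniformity loss, Eq. (120).
    z_a, z_b: (n, d+1) unit-norm embeddings of two views of the same images."""
    n, dp1 = z_a.shape
    align = (z_a - z_b).pow(2).sum(dim=1).mean()
    # empirical stand-in for nu = Unif(S^d): normalized Gaussian samples
    nu = torch.randn(n, dp1, device=z_a.device)
    nu = nu / nu.norm(dim=1, keepdim=True)
    uniform = 0.5 * (s3w2(z_a, nu) + s3w2(z_b, nu))
    return align + lam * uniform
```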
Figure 34. Panels: (a) Supervised, (b) SimCLR (Chen et al., 2020), (c) (Wang & Isola, 2020), ..., (h) ARI-S3W. For all sliced-Wasserstein variants, we use 200 projections, and for RI-S3W, we use 100 random rotations.

Implementation. Similar to (Bonet et al., 2023a), we pretrain a ResNet18 encoder (He et al., 2016) on CIFAR-10 (Krizhevsky, 2009), with 1024-dimensional penultimate features projected and ℓ2-normalized to be unit vectors on $\mathbb{S}^d$. The models are pretrained for 200 epochs with mini-batch SGD (momentum 0.9, weight decay 0.001, initial learning rate 0.05). We select a batch size of 512 samples and use the standard random augmentation set consisting of random cropping, horizontal flipping, color jittering, and grayscale transformation, as done in (Bonet et al., 2023a; Wang & Isola, 2020). For evaluation, we fit a linear classifier on the pretrained representations and report the test accuracy. We report the accuracy both when this linear classifier is fit on the encoded feature representations and when it is fit on the projected features on $\mathbb{S}^d$. We train this linear layer for 100 epochs with the Adam optimizer (learning rate 0.001, decayed by a factor of 0.2 at epochs 60 and 80). For comparison, we also test the hypersphere method of (Wang & Isola, 2020) and SimCLR (Chen et al., 2020) to aid in evaluating the results. We also train a fully supervised model by training the encoder and linear classifier jointly with the cross-entropy loss, to serve as a baseline for performance comparison. We use $L = 200$ projections for all sliced distances, $N_R = 5$ rotations for RI-S3W and ARI-S3W, and a pool size of 100 for ARI-S3W. We first perform a series of experiments with $d = 9$. In this setting, we let the regularization coefficient be λ = 0.5 for S3W, RI-S3W, and ARI-S3W; λ = 20.0 for SSW; and λ = 1.0 for SW. These results are found in Table 1 of the main paper. We also perform a series of experiments with $d = 2$ in order to visualize the quality of the learned representations. In this setting, we instead use a regularization coefficient of λ = 0.1 for S3W, RI-S3W, and ARI-S3W. These visualizations for all the considered methods are provided in Figure 34.

I.8. Task: Sliced-Wasserstein Variational Inference on the Sphere

I.8.1. BACKGROUND OVERVIEW

Variational inference turns the Bayesian inference problem into a variational one, benefiting from the rich optimization literature and reduced computational costs. The usual goal is to approximate an unnormalized distribution by minimizing the KL divergence between a tractable density and the true posterior. Formally, let $p(\cdot|x)$ denote the target posterior and $q \in \mathcal{Q}$ the approximate posterior from a family of tractable distributions. The standard objective is to minimize
$$\min_{q\in\mathcal{Q}}\mathrm{KL}(q\,\|\,p(\cdot|x)) = \int q(Z)\log\frac{q(Z)}{p(Z|x)}\,dZ = \mathbb{E}_q[\log q(Z) - \log p(Z|x)]. \quad (121)$$

Sliced-Wasserstein Variational Inference (on the sphere): The non-metric nature of KL, namely its asymmetry and its failure to satisfy the triangle inequality, can result in undesirable behaviors.
As an alternative, (Yi & Liu, 2022) introduces sliced Wasserstein variational inference, which guides the approximate posterior with MCMC samples via the sliced Wasserstein distance, without requiring a tractable family $\mathcal{Q}$; hence, it can be amortized with deep neural networks. In spherical data contexts, (Bonet et al., 2023a) recommends replacing SW with SSW and employs the Geodesic Langevin Algorithm (GLA) (see I.8.1). To perform SWVI, we first select a sampler $q_\theta$ for the sphere. At each iteration $k$, we draw a batch of $N$ samples $\{z_0^i\}_{i=1}^N$ from $q_\theta$. These samples are then propagated using $\ell$ MCMC steps on the sphere to obtain a new set of samples $\{z_\ell^j\}_{j=1}^N$. The updates are constrained to the sphere by employing the exponential map $\exp_{x_k}$ or normalization of the projection onto the tangent space $\mathrm{Proj}_{x_k}$ at each point $x_k \in \mathbb{S}^d$. We then compute a sliced Wasserstein distance (i.e., SSW or S3W) between the empirical measures $\hat\mu_0$ and $\hat\mu_\ell$, which is then used to compute gradients w.r.t. θ.

Unadjusted Langevin Algorithm (ULA) is used to sample from a probability distribution with density proportional to $e^{-V(x)}$, where $V(x)$ is a potential function (i.e., a negative log-density) and $x \in \mathbb{R}^d$. The update rule simply combines a deterministic gradient step with a stochastic noise term:
$$x_{t+1} = x_t - \gamma\nabla V(x_t) + \sqrt{2\gamma}\,Z, \quad Z \sim \mathcal{N}(0, I). \quad (122)$$
Geodesic Langevin Algorithm (GLA) (Wang et al., 2020) adapts ULA to a Riemannian manifold like $\mathbb{S}^d$ by modifying the update rule to respect the manifold structure. Similar to I.2.1, we first replace the Euclidean gradient with the Riemannian gradient:
$$x_{t+1} = \mathrm{Retr}_{x_t}\left(-\gamma\nabla_{Riem}V(x_t) + \sqrt{2\gamma}\,Z\right), \quad Z \sim \mathcal{N}(0, I), \quad (123)$$
where $\mathrm{Retr}_{x_t}$ is a retraction. If we select $\mathrm{Retr}_{x_t} = \exp_{x_t}$ (see Eq. (111)), then we arrive at the Geodesic Langevin Algorithm, whose update rule is simply
$$x_{t+1} = \exp_{x_t}\left(-\gamma\nabla_{Riem}V(x_t) + \sqrt{2\gamma}\,Z\right), \quad Z \sim \mathcal{N}(0, I), \quad (124)$$
where $\nabla_{Riem}V(x_t) = \nabla V(x_t) - \langle\nabla V(x_t), x_t\rangle x_t$. In our experiments, since $\mathbb{S}^d$ has constant curvature and the step size is small, we use normalization as the retraction for simplicity. The update rule is hence given by
$$x_{t+1} = \frac{\tilde x_{t+1}}{\|\tilde x_{t+1}\|}, \quad \text{where } \tilde x_{t+1} = x_t - \gamma\nabla_{Riem}V(x_t) + \sqrt{2\gamma}\,Z, \quad (125)$$
and $Z \sim \mathcal{N}(0, I)$ is the Gaussian noise term. Putting it all together, we arrive at
$$x_{t+1} = \frac{x_t - \gamma(\nabla V(x_t) - \langle\nabla V(x_t), x_t\rangle x_t) + \sqrt{2\gamma}\,Z}{\|x_t - \gamma(\nabla V(x_t) - \langle\nabla V(x_t), x_t\rangle x_t) + \sqrt{2\gamma}\,Z\|}. \quad (126)$$

Power Spherical Distribution (De Cao & Aziz, 2020) provides a parametric alternative to the vMF (refer to H.2). It demonstrates greater stability and has a univariate marginal with a closed-form expression for both its CDF and its inverse. This enables fast and efficient sampling, in contrast to the vMF, which depends on rejection sampling. The PDF of the Power Spherical distribution is defined as
$$p_X(x; \mu, \kappa) \propto (1 + \mu^T x)^\kappa, \quad (127)$$
where $\mu \in \mathbb{S}^d$ and $\kappa > 0$, for all $x \in \mathbb{S}^d$. We sample by drawing $Z \sim \mathrm{Beta}(d/2 + \kappa, d/2)$ and $v \sim \mathrm{Unif}(\mathbb{S}^{d-1})$. These samples are then used to construct $T = 2Z - 1$ and subsequently a vector $Y = [T, v\sqrt{1 - T^2}]^T$. The final sample is obtained by applying a Householder⁵ reflection about the mean direction µ to $Y$. (A sampler sketch is given below.)
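The following is a NumPy sketch (ours) of this sampling procedure; in the $\mathbb{S}^d \subset \mathbb{R}^{d+1}$ convention used here, the marginal Beta parameters are assumed to be $(d/2 + \kappa, d/2)$.

```python
import numpy as np

def sample_power_spherical(mu, kappa, n, rng=None):
    """Sketch of the Power Spherical sampler (De Cao & Aziz, 2020) on S^d in R^{d+1}."""
    rng = rng or np.random.default_rng()
    d = mu.shape[0] - 1
    z = rng.beta(d / 2 + kappa, d / 2, size=n)
    t = 2 * z - 1                                    # marginal along the mean direction
    v = rng.normal(size=(n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)    # uniform on S^{d-1}
    y = np.concatenate([t[:, None], np.sqrt(1 - t**2)[:, None] * v], axis=1)
    e1 = np.zeros(d + 1); e1[0] = 1.0
    u = e1 - mu
    if np.linalg.norm(u) < 1e-12:                    # mu == e1: reflection is the identity
        return y
    u /= np.linalg.norm(u)
    return y - 2 * np.outer(y @ u, u)                # Householder reflection e1 -> mu
```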
Effective Sample Size (ESS) estimates the equivalent number of independent samples that would provide the same amount of information as the correlated samples generated by an MCMC process (Doucet et al., 2001). For a chain of samples, the autocorrelation at lag⁶ $t$ ($t \ge 0$) is given by
$$\rho_t = \int (\theta^{(n)} - \mu)(\theta^{(n+t)} - \mu)\,p(\theta)\,d\theta. \quad (128)$$
The ESS is then calculated from the total number of samples, $N$, adjusted by the sum of autocorrelations over all lags:
$$\mathrm{ESS} = \frac{N}{1 + 2\sum_{t=1}^\infty \rho_t}. \quad (129)$$
Direct calculation of the ESS is often impractical. (Rezende et al., 2020) proposes estimating it using importance sampling weights $w_s$:
$$\mathrm{ESS} \approx \frac{\left(\sum_{s=1}^S w_s\right)^2}{\sum_{s=1}^S w_s^2}, \quad (130)$$
where $w_s = e^{-\beta u(x_s)}/q_\eta(x_s)$. Here, in the context of normalizing flows (see experiments I.8.2, I.8.3), it is reported as a percentage of the sample size. If the ESS can be estimated reliably, then a higher ESS means the flow matches the target better.

I.8.2. EXPERIMENT 1: LEARNING A POWER SPHERICAL DISTRIBUTION

Implementation. Following the setup described in (Bonet et al., 2023a), we set the initial Power Spherical parameters to µ = (1, 1, 1) and κ = 0.1 for the source, and µ = (0, 1, 0) and κ = 10 for the target. We perform 2000 Riemannian Gradient Descent (Absil et al., 2008) steps and then 20 steps of GLA (step size 0.001) with 1000 projections (learning rate 2). We select K = 2000 steps with N = 500 particles. We report runtimes and parameter convergence behaviors for all the distances. Although they all converge similarly, the runtimes differ drastically. For SSW, we report an average of 215.5 seconds over 5 independent runs. For S3W, RI-S3W(1), RI-S3W(5), RI-S3W(10), and ARI-S3W (10 rotations, pool size 1000), the averages are 55.3, 57.3, 63.6, 73.1, and 63.04 seconds, respectively.

⁵ A linear operation $H(v) = v - 2u(u^T v)$ reflecting $v$ across the hyperplane perpendicular to $u$.
⁶ The number of time steps separating sequential data points in a time series, used for measuring temporal correlation.

Figure 35. Learning a Power Spherical distribution (panels include RI-S3W(1), RI-S3W(5), RI-S3W(10)). To obtain the Mollweide projections, we perform kernel density estimation with the Scott adaptive bandwidth.

I.8.3. EXPERIMENT 2: LEARNING A MIXTURE OF 4 VON MISES-FISHER DISTRIBUTIONS

Implementation. We perform the Geodesic Langevin Algorithm (GLA) (see I.8.1) to generate samples that track the target distribution and guide our variational model, an exponential map normalizing flow ($N = 6$ blocks, 5 components each). The objective of the flow is to transform uniform noise on the sphere into a close approximation of the target, which is a mixture of 4 von Mises-Fisher distributions. The parametric target has the mean directions (in spherical coordinates) µ₁ = (1.5, 0.7 + π/2), µ₂ = (1.0, 1.0 + π/2), µ₃ = (5.0, 0.6 + π/2), µ₄ = (4.0, 0.7 + π/2), and uniform concentration parameters κ = 10, equally weighted. The optimization is run for 10000 iterations with 20 GLA steps each, using the Adam optimizer (learning rate of 0.01, matching the baseline).

Results. Figure 36 shows the qualitative results of our experiments. While all models are capable of closely matching the target, it appears that ARI-S3W(30), RI-S3W(10), and RI-S3W(5) outperform RI-S3W(1), which fares better than SSW and S3W.

Figure 36. Learning a mixture of 4 von Mises-Fisher distributions, with panels (a) Target Density, (b) Target Empirical, ..., (e) RI-S3W (1 rotation), (f) RI-S3W (5 rotations), (g) RI-S3W (10 rotations), (h) ARI-S3W (30 rotations). To obtain the Mollweide projections, we perform KDE with the Scott adaptive bandwidth. For ARI-S3W, we use a pool size of 1000.
Figure 37. KL divergence over iterations (left) and effective sample size (ESS) over iterations (right) for SW-VI, S3W-VI, RIS3W-VI (1/5/10), ARIS3W-VI (30/1000 and 50/1000), and SSW-VI, with 500 samples for each distribution. For all sliced-Wasserstein variants, we use 200 projections, and for ARI-S3W, a pool of 1000 random rotations.

To compare the performance of the different distances more rigorously, we use the KL divergence and the effective sample size (ESS, see I.8.1), similar to (Bonet et al., 2023a). We perform 10 runs per distance and also add ARI-S3W(50) for comparison. Figure 37 shows the convergence trends across the different distances. The KL divergence plot shows that most models (with different distances) converge similarly and rapidly to an approximation of the target distribution. S3W performs slightly worse than the rest, while SSW is on par with the other variants of S3W, in contrast to our qualitative finding based on Figure 36. The ESS plot demonstrates sampling efficacy. Again, a similar trend is observed, with S3W slightly underperforming the other distances. We note that greater ESS values signal more independence among the drawn samples. Additionally, we tried a learning rate of 0.001 (not shown) for all distances and observed a slightly different trend: SSW still outperforms S3W and is on par with RI-S3W(1), while underperforming the rest by a small margin.

I.9. Task: Generative Modeling with the Sliced-Wasserstein Autoencoder (SWAE)

Autoencoders (AEs) map an input $x \in \mathbb{R}^d$ back to itself through an intermediate latent space and play a central role in discovering low-dimensional representations useful for downstream tasks. Deterministic AEs focus on point estimates in the latent space, while their probabilistic counterparts extend this concept to probability distributions. A common requirement for both is regularization to avoid trivial solutions. SWAEs (Kolouri et al., 2016b), a type of probabilistic AE, offer a simple yet effective approach to regularizing the latent space. In this section, we show the effectiveness of our proposed S3W method when the SWAE latents are constrained to the unit sphere.

I.9.1. EXPERIMENT: S3W-BASED LATENT REGULARIZATION LOSS

Algorithm 5 S3W-AE
Require: regularization coefficient λ, $L$ projections, encoder $f_\theta$, decoder $g_\eta$, dimension $d$ of the hypersphere
while $f_\theta$ and $g_\eta$ have not converged do
  Sample $x_1, \dots, x_M$ from the training set (i.e., $p_X$)
  Encode the samples: $z_1, \dots, z_M \leftarrow f_\theta(x_1), \dots, f_\theta(x_M)$
  Normalize the encoded samples to $\mathbb{S}^d$: $\hat z_1, \dots, \hat z_M \leftarrow \mathrm{Normalize}(z_1, \dots, z_M)$
  Sample $\tilde z_1, \dots, \tilde z_M$ from $q_Z$ and normalize to $\mathbb{S}^d$
  Use Algorithm 1 or its variants to calculate $\mathcal{L}^{(l)}_{S3W}(\{\tilde z_m\}, \{\hat z_m\})$
  Compute $\hat{\mathcal{L}}_{S3W} = \frac{1}{L}\sum_{l=1}^L \mathcal{L}^{(l)}_{S3W}$
  Update $\theta$ and $\eta$ by descending the gradient of $\frac{1}{M}\sum_{m=1}^M c(x_m, g_\eta(\hat z_m)) + \lambda\hat{\mathcal{L}}_{S3W}$
end while

We perform experiments with both MNIST and CIFAR-10. The details of the architectures used in our experiments can be found in Table 7. In all experiments, we train the networks with the Adam optimizer (learning rate $10^{-3}$) with a batch size of 500 over 100 epochs. The procedure is described in Algorithm 5 (a compact training-step sketch is given below). For all experiments, we use the standard binary cross-entropy (BCE) loss as our reconstruction loss, with a tradeoff parameter λ.
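The following is a minimal PyTorch sketch (ours) of one training step of Algorithm 5, assuming a uniform spherical prior (as in the MNIST experiments), inputs in [0, 1] with a Sigmoid decoder, and a differentiable `s3w` distance (or an RI/ARI variant):

```python
import torch
import torch.nn.functional as F

def swae_step(x, encoder, decoder, s3w, opt, lam=1e-3):
    """One S3W-AE update (sketch of Algorithm 5)."""
    z = encoder(x)
    z_hat = z / z.norm(dim=1, keepdim=True)            # latents on S^d
    prior = torch.randn_like(z_hat)                    # q_Z = Unif(S^d) via
    prior = prior / prior.norm(dim=1, keepdim=True)    # normalized Gaussians
    recon = F.binary_cross_entropy(decoder(z_hat), x)  # BCE reconstruction loss
    loss = recon + lam * s3w(z_hat, prior)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```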
For CIFAR-10, we use λ = 10⁻³ for SW, S3W, RI-S3W, and ARI-S3W, and λ = 10 for SSW. We use L = 100 projections for all sliced methods, N_R = 5 rotations, and a pool size of 100 random rotations. Moreover, we use a vMF mixture with 10 components as the latent prior. These results are given in Table 2. For MNIST, we provide a comprehensive evaluation of the considered methods as we vary the regularization parameter λ. These results can be found in Table 8. In these experiments, we once again use L = 100 projections for all sliced methods and a pool size of 100 for ARI-S3W. Here, we use the uniform distribution Unif(S²) as our latent prior.

Lastly, we repeat a set of experiments for each distance using the MNIST dataset and compare the Fréchet Inception Distance (FID) (Heusel et al., 2017) of each resulting model in Table 6 (a sketch of the underlying FID computation is given after Table 8). We compute the FID using 10000 samples and report the mean and standard deviation over 5 independent runs. Moreover, we visualize samples generated by each model in Figure 39. For these experiments, we again use L = 100 projections for all methods and a pool size of 100 for ARI-S3W, as well as a uniform prior on S². We use λ = 10⁻³ for all models for a fair comparison.

Table 6. FID scores for SWAE models (mean ± std over 5 runs).

Method          FID
SW              74.2301 ± 2.7727
SSW             75.0354 ± 2.8609
S3W             75.0717 ± 2.6026
RI-S3W (1)      73.5115 ± 3.4937
RI-S3W (10)     70.2262 ± 3.4730
ARI-S3W (30)    69.5562 ± 1.5386

Figure 38. Latent space visualization (MNIST); panels: (a) autoencoder (λ = 1), (b) SW (λ = 1000), (c) SSW (λ = 1000), (d) S3W (λ = 1), (e) RI-S3W (1 rotation, λ = 1), (f) RI-S3W (10 rotations, λ = 1), (g) ARI-S3W (10 rotations, pool size of 100, λ = 1), (h) ARI-S3W (30 rotations, pool size of 100, λ = 1).

Figure 39. Samples generated by SWAE models with a uniform prior on S²; panels: (d) RI-S3W (1), (e) RI-S3W (10), (f) ARI-S3W (30).

Table 7. Architecture of the encoder and decoder for the MNIST and CIFAR-10 datasets.

MNIST
  Encoder: x ∈ ℝ^{28×28} → Conv2d(32), ReLU → Conv2d(32), ReLU → Conv2d(64), ReLU → Conv2d(64), ReLU → Conv2d(128), ReLU → Conv2d(128) → Flatten → FC(512), ReLU → FC(3) → ℓ2 normalization → s ∈ S²
  Decoder: s ∈ S² → FC(512) → FC(512), ReLU → Reshape(128 × 2 × 2) → ConvTranspose2d(128), ReLU → ConvTranspose2d(64), ReLU → ConvTranspose2d(64), ReLU → ConvTranspose2d(32), ReLU → ConvTranspose2d(32), ReLU → ConvTranspose2d(1), Sigmoid

CIFAR-10
  Encoder: x ∈ ℝ^{3×32×32} → Conv2d(32), ReLU → Conv2d(32), ReLU → Conv2d(64), ReLU → Conv2d(64), ReLU → Conv2d(128), ReLU → Conv2d(128) → Flatten → FC(512), ReLU → FC(3) → ℓ2 normalization → s ∈ S²
  Decoder: s ∈ S² → FC(512) → FC(2048), ReLU → Reshape(128 × 4 × 4) → ConvTranspose2d(128), ReLU → ConvTranspose2d(64), ReLU → ConvTranspose2d(64), ReLU → ConvTranspose2d(32), ReLU → ConvTranspose2d(32), ReLU → ConvTranspose2d(3), Sigmoid

Table 8. SWAE with different regularizations on the MNIST dataset with a uniform prior for d = 2. We compare Sliced Wasserstein (SW), Spherical Sliced Wasserstein (SSW) (Bonet et al., 2023a), S3W, RI-S3W (1 rotation), RI-S3W (10 rotations), ARI-S3W (10 rotations), and ARI-S3W (30 rotations, pool size of 100). The metrics are log10(W2) and cross entropy (CE). For SSW, we include a wide range of regularization parameters λ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} for a thorough comparison. For the other methods, we include either λ ∈ {0.001, 0.01, 0.1, 1} or λ ∈ {1, 10, 100, 1000}. Each entry is mean ± std.

Method          λ        log10(W2)            CE                   Time (s/ep.)
Supervised AE   1        −1.8438 ± 0.1667     0.1585 ± 0.0018      5.2500 ± 0.0502
SW              1        −1.1294 ± 0.0102     0.1619 ± 0.0608      5.2113 ± 0.1191
SW              10       −1.1827 ± 0.0096     0.1652 ± 0.0631      5.4121 ± 0.9013
SW              100      −1.6033 ± 0.0148     0.1641 ± 0.0631      5.7512 ± 0.3019
SW              1000     −1.7856 ± 0.0296     0.1725 ± 0.0668      6.1400 ± 0.4853
SSW             0.001    −0.7585 ± 0.0020     0.1620 ± 0.0615      16.1380 ± 1.2041
SSW             0.01     −0.8523 ± 0.0097     0.1619 ± 0.0623      14.4800 ± 1.4353
SSW             0.1      −0.8555 ± 0.0119     0.1609 ± 0.0610      12.4125 ± 0.0844
SSW             1        −0.9393 ± 0.0123     0.1681 ± 0.0647      12.1478 ± 1.462
SSW             10       −0.9168 ± 0.0099     0.1603 ± 0.0604      13.3211 ± 0.2393
SSW             100      −1.1430 ± 0.0119     0.1598 ± 0.0608      12.3703 ± 0.2108
SSW             1000     −1.3658 ± 0.0198     0.1624 ± 0.0616      12.9221 ± 0.1441
S3W             0.001    −1.4883 ± 0.0298     0.1649 ± 0.0634      6.0980 ± 0.1298
S3W             0.01     −1.6289 ± 0.0230     0.2013 ± 0.0706      6.1580 ± 0.0264
S3W             0.1      −1.7331 ± 0.0543     0.2262 ± 0.0643      6.1620 ± 0.0117
S3W             1        −1.7446 ± 0.0403     0.2188 ± 0.0669      6.1400 ± 0.0303
RI-S3W (1)      0.001    −1.4061 ± 0.0098     0.1648 ± 0.0624      5.8920 ± 0.0546
RI-S3W (1)      0.01     −1.6955 ± 0.0422     0.1655 ± 0.0422      5.9120 ± 0.0286
RI-S3W (1)      0.1      −1.8754 ± 0.0384     0.2097 ± 0.0721      5.8780 ± 0.0293
RI-S3W (1)      1        −1.8308 ± 0.0618     0.2315 ± 0.0699      5.9100 ± 0.0460
RI-S3W (10)     0.001    −1.6410 ± 0.0205     0.1599 ± 0.0613      5.7860 ± 0.4198
RI-S3W (10)     0.01     −1.6724 ± 0.0256     0.1642 ± 0.0644      5.9725 ± 0.0286
RI-S3W (10)     0.1      −1.7934 ± 0.0401     0.1784 ± 0.0679      5.9900 ± 0.0374
RI-S3W (10)     1        −1.7332 ± 0.0437     0.2178 ± 0.0718      6.0040 ± 0.0162
ARI-S3W (10)    0.001    −1.5523 ± 0.0347     0.1649 ± 0.0638      6.2960 ± 0.0372
ARI-S3W (10)    0.01     −1.5835 ± 0.0311     0.1572 ± 0.0018      6.3120 ± 0.0271
ARI-S3W (10)    0.1      −1.6448 ± 0.0381     0.2141 ± 0.0698      6.3240 ± 0.0242
ARI-S3W (10)    1        −1.7876 ± 0.0586     0.2273 ± 0.0688      6.1980 ± 0.0970
ARI-S3W (30)    0.001    −1.6453 ± 0.0154     0.1952 ± 0.0176      5.9540 ± 0.5100
ARI-S3W (30)    0.01     −1.6290 ± 0.0126     0.1756 ± 0.0647      6.1600 ± 0.0253
ARI-S3W (30)    0.1      −1.7889 ± 0.0458     0.2305 ± 0.0631      6.1625 ± 0.0286
ARI-S3W (30)    1        −1.8890 ± 0.0506     0.2263 ± 0.0637      6.1840 ± 0.0185
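For reference, the Fréchet distance underlying the FID scores in Table 6 can be computed directly from Inception feature statistics (Heusel et al., 2017). Below is a minimal sketch, assuming the feature vectors of real and generated samples have already been extracted; the array shapes and names are illustrative placeholders, not part of our evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_gen):
    """Frechet distance between Gaussian fits of two feature sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2 - 2.0 * covmean)

# Illustrative usage with random placeholder features; real FID uses
# 2048-dimensional Inception-v3 features (small dims here for speed).
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(1000, 64)),
                       rng.normal(size=(1000, 64))))
```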