# SPHERICAL SLICED-WASSERSTEIN

Published as a conference paper at ICLR 2023

Clément Bonet^1, Paul Berg^2, Nicolas Courty^2, François Septier^1, Lucas Drumetz^3, Minh-Tan Pham^2
Université Bretagne Sud, LMBA^1, IRISA^2; IMT Atlantique, Lab-STICC^3
{clement.bonet, paul.berg, francois.septier}@univ-ubs.fr
{nicolas.courty, minh-tan.pham}@irisa.fr; lucas.drumetz@imt-atlantique.fr

ABSTRACT

Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular, the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.

1 INTRODUCTION

Optimal transport (OT) (Villani, 2009) has received a lot of attention in machine learning in the past few years. As it allows to compare distributions with metrics, it has been used for different tasks such as domain adaptation (Courty et al., 2016) or generative models (Arjovsky et al., 2017), to name a few. The most classical distance used in OT is the Wasserstein distance. However, calculating it can be computationally expensive. Hence, several variants were proposed to alleviate the computational burden, such as the entropic regularization (Cuturi, 2013; Scetbon et al., 2021), minibatch OT (Fatras et al., 2020) or the Sliced-Wasserstein distance (SW) for distributions supported on Euclidean spaces (Rabin et al., 2011b).

Although embedded in larger dimensional Euclidean spaces, data generally lie in practice on manifolds (Fefferman et al., 2016). A simple manifold, but with lots of practical applications, is the hypersphere $S^{d-1}$. Several types of data are by essence spherical: a good example is found in directional data (Mardia et al., 2000; Pewsey & García-Portugués, 2021), for which dedicated machine learning solutions are being developed (Sra, 2018), but other applications concern for instance geophysical data (Di Marzio et al., 2014), meteorology (Besombes et al., 2021), cosmology (Perraudin et al., 2019) or extreme value theory for the estimation of spectral measures (Guillou et al., 2015). Remarkably, in a more abstract setting, considering hyperspherical latent representations of data is becoming more and more common (e.g. (Liu et al., 2017; Xu & Durrett, 2018; Davidson et al., 2018)). For example, in the context of variational autoencoders (Kingma & Welling, 2013), using priors on the sphere has been demonstrated to be beneficial (Davidson et al., 2018).
Also, in the context of self-supervised learning (SSL), where one wants to learn discriminative representations in an unsupervised way, the hypersphere is usually considered for the latent representation (Wu et al., 2018; Chen et al., 2020a; Wang & Isola, 2020; Grill et al., 2020; Caron et al., 2020). It is thus of primary importance to develop machine learning tools that accommodate this specific geometry well.

The OT theory on manifolds is well developed (Villani, 2009; Figalli & Villani, 2011; McCann, 2001) and several works have started to use it in practice, with a focus mainly on the approximation of OT maps. For example, Cohen et al. (2021); Rezende & Racanière (2021) approximate the OT map to define normalizing flows on Riemannian manifolds, Hamfeldt & Turnquist (2021a;b); Cui et al. (2019) derive algorithms to approximate the OT map on the sphere, and Alvarez-Melis et al. (2020); Hoyos-Idrobo (2020) learn the transport map on hyperbolic spaces. However, the computational bottleneck of computing the Wasserstein distance on such spaces remains and, as underlined in the conclusion of (Nadjahi, 2021), defining SW distances on manifolds would be of much interest. Notably, Rustamov & Majumdar (2020) proposed a variant of SW, based on the spectral decomposition of the Laplace-Beltrami operator, which generalizes to manifolds given the availability of the eigenvalues and eigenfunctions. However, it is not directly related to the original SW on Euclidean spaces.

Contributions. Therefore, by leveraging properties of the Wasserstein distance on the circle (Rabin et al., 2011a), we define the first, to the best of our knowledge, natural generalization of the original SW discrepancy on a non-trivial manifold, namely the sphere $S^{d-1}$, and hence we make a first step towards defining SW distances on Riemannian manifolds. We make connections with a new spherical Radon transform and analyze some of its properties. We discuss the underlying algorithmic procedure, and notably provide an efficient implementation when computing the discrepancy against a uniform distribution. Then, we show that we can use this discrepancy on different tasks such as sampling, density estimation or generative modeling.

2 BACKGROUND

The aim of this paper is to define a Sliced-Wasserstein discrepancy on the hypersphere $S^{d-1} = \{x \in \mathbb{R}^d,\ \|x\|_2 = 1\}$. Therefore, in this section, we introduce the Wasserstein distance on manifolds and the classical SW distance on $\mathbb{R}^d$.

2.1 WASSERSTEIN DISTANCE

Since we are interested in defining a SW discrepancy on the sphere, we start by introducing the Wasserstein distance on a Riemannian manifold $M$ endowed with the Riemannian distance $d$. We refer to (Villani, 2009; Figalli & Villani, 2011) for more details. Let $p \geq 1$ and $\mu, \nu \in P_p(M) = \{\mu \in P(M),\ \int_M d^p(x, x_0)\, \mathrm{d}\mu(x) < \infty \text{ for some } x_0 \in M\}$. Then, the $p$-Wasserstein distance between $\mu$ and $\nu$ is defined as

$$W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \int_{M \times M} d^p(x, y)\, \mathrm{d}\gamma(x, y), \qquad (1)$$

where $\Pi(\mu,\nu) = \{\gamma \in P(M \times M),\ \forall A \subset M \text{ Borel},\ \gamma(M \times A) = \nu(A) \text{ and } \gamma(A \times M) = \mu(A)\}$ denotes the set of couplings. For discrete probability measures, the Wasserstein distance can be computed using linear programs (Peyré et al., 2019). However, these algorithms have an $O(n^3 \log n)$ complexity w.r.t. the number of samples $n$, which is computationally intensive. Therefore, a whole literature consists of defining alternative discrepancies which are cheaper to compute. On Euclidean spaces, one of them is the Sliced-Wasserstein distance.
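For illustration, the discrete problem (1) on the sphere can be solved exactly with a linear program; the following is a minimal sketch (ours, not code from the paper) using the POT library with the geodesic cost $d(x, y) = \arccos(\langle x, y\rangle)$, whose cubic cost in the number of samples is precisely what motivates sliced alternatives. The function name `wasserstein_sphere` is illustrative.

```python
import numpy as np
import ot  # Python Optimal Transport (Flamary et al., 2021)

def wasserstein_sphere(x, y, p=2):
    """W_p^p between two empirical measures on S^{d-1} with uniform weights.

    x: (n, d) and y: (m, d) arrays of unit-norm samples.
    """
    a = np.full(len(x), 1.0 / len(x))        # weights of mu
    b = np.full(len(y), 1.0 / len(y))        # weights of nu
    inner = np.clip(x @ y.T, -1.0, 1.0)      # <x_i, y_j>, clipped for arccos
    M = np.arccos(inner) ** p                # geodesic cost d^p(x_i, y_j)
    return ot.emd2(a, b, M)                  # exact linear-programming solution

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(200, 3)); y /= np.linalg.norm(y, axis=1, keepdims=True)
print(wasserstein_sphere(x, y, p=2))
```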
2.2 SLICED-WASSERSTEIN DISTANCE

On $M = \mathbb{R}^d$ with $d(x, y) = \|x - y\|_p^p$, a more attractive distance is the Sliced-Wasserstein (SW) distance. This distance relies on the appealing fact that for one-dimensional measures $\mu, \nu \in P(\mathbb{R})$, we have the following closed form (Peyré et al., 2019, Remark 2.30):

$$W_p^p(\mu, \nu) = \int_0^1 \big|F_\mu^{-1}(u) - F_\nu^{-1}(u)\big|^p\, \mathrm{d}u, \qquad (2)$$

where $F_\mu^{-1}$ (resp. $F_\nu^{-1}$) is the quantile function of $\mu$ (resp. $\nu$). From this property, Rabin et al. (2011b); Bonnotte (2013) defined the SW distance as

$$\forall \mu, \nu \in P_p(\mathbb{R}^d),\quad SW_p^p(\mu, \nu) = \int_{S^{d-1}} W_p^p(P^\theta_\#\mu, P^\theta_\#\nu)\, \mathrm{d}\lambda(\theta), \qquad (3)$$

where $P^\theta(x) = \langle x, \theta\rangle$, $\lambda$ is the uniform distribution on $S^{d-1}$ and, for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $P^\theta_\#\mu(A) = \mu\big((P^\theta)^{-1}(A)\big)$.

This distance can be approximated efficiently using a Monte-Carlo approximation (Nadjahi et al., 2019), which amounts to a complexity of $O(Ln(d + \log n))$, where $L$ denotes the number of projections used for the Monte-Carlo approximation and $n$ the number of samples.

SW can also be written through the Radon transform (Bonneel et al., 2015). Let $f \in L^1(\mathbb{R}^d)$; the Radon transform $R: L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times S^{d-1})$ is defined as (Helgason et al., 2011)

$$\forall \theta \in S^{d-1},\ \forall t \in \mathbb{R},\quad Rf(t, \theta) = \int_{\mathbb{R}^d} f(x)\, \mathbf{1}_{\{\langle x, \theta\rangle = t\}}\, \mathrm{d}x. \qquad (4)$$

Its dual $R^*: C_0(\mathbb{R} \times S^{d-1}) \to C_0(\mathbb{R}^d)$ (also known as the back-projection operator), where $C_0$ denotes the set of continuous functions that vanish at infinity, satisfies $\langle Rf, g\rangle_{\mathbb{R} \times S^{d-1}} = \langle f, R^* g\rangle_{\mathbb{R}^d}$ for all $f, g$, and can be defined as (Boman & Lindskog, 2009; Bonneel et al., 2015)

$$\forall g \in C_0(\mathbb{R} \times S^{d-1}),\ \forall x \in \mathbb{R}^d,\quad R^* g(x) = \int_{S^{d-1}} g(\langle x, \theta\rangle, \theta)\, \mathrm{d}\theta. \qquad (5)$$

Therefore, by duality, we can define the Radon transform of a measure $\mu \in \mathcal{M}(\mathbb{R}^d)$ as the measure $R\mu \in \mathcal{M}(\mathbb{R} \times S^{d-1})$ such that for all $g \in C_0(\mathbb{R} \times S^{d-1})$, $\langle R\mu, g\rangle_{\mathbb{R} \times S^{d-1}} = \langle \mu, R^* g\rangle_{\mathbb{R}^d}$. Since $R\mu$ is a measure on the product space $\mathbb{R} \times S^{d-1}$, we can disintegrate it w.r.t. $\lambda$, the uniform measure on $S^{d-1}$ (Ambrosio et al., 2005), as $R\mu = \lambda \otimes K$ with $K$ a probability kernel on $S^{d-1} \times \mathcal{B}(\mathbb{R})$, i.e. for all $\theta \in S^{d-1}$, $K(\theta, \cdot)$ is a probability on $\mathbb{R}$, for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $K(\cdot, A)$ is measurable, and

$$\forall \varphi \in C(\mathbb{R} \times S^{d-1}),\quad \int_{\mathbb{R} \times S^{d-1}} \varphi(t, \theta)\, \mathrm{d}(R\mu)(t, \theta) = \int_{S^{d-1}} \int_{\mathbb{R}} \varphi(t, \theta)\, K(\theta, \mathrm{d}t)\, \mathrm{d}\lambda(\theta), \qquad (6)$$

with $C(\mathbb{R} \times S^{d-1})$ the set of continuous functions on $\mathbb{R} \times S^{d-1}$. By Proposition 6 in (Bonneel et al., 2015), for $\lambda$-almost every $\theta \in S^{d-1}$, $(R\mu)^\theta = P^\theta_\#\mu$, where we denote $K(\theta, \cdot) = (R\mu)^\theta$. Therefore, we have

$$\forall \mu, \nu \in P_p(\mathbb{R}^d),\quad SW_p^p(\mu, \nu) = \int_{S^{d-1}} W_p^p\big((R\mu)^\theta, (R\nu)^\theta\big)\, \mathrm{d}\lambda(\theta). \qquad (7)$$

Variants of SW have been defined in recent works, either by integrating w.r.t. different distributions (Deshpande et al., 2019; Nguyen et al., 2021; 2020), by projecting on $\mathbb{R}$ using different projections (Nguyen & Ho, 2022a;b; Rustamov & Majumdar, 2020) or Radon transforms (Kolouri et al., 2019; Chen et al., 2020b), or by projecting on subspaces of higher dimensions (Paty & Cuturi, 2019; Lin et al., 2020; 2021; Huang et al., 2021).
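Before moving to the sphere, we illustrate how (2) and (3) translate into code. The sketch below (ours, with illustrative function names) estimates $SW_p^p$ by Monte Carlo: sample directions uniformly on $S^{d-1}$, project, and use sorted samples as empirical quantile functions. It assumes both measures have the same number of samples and uniform weights.

```python
import torch

def wasserstein1d(u, v, p=2):
    """W_p^p between two 1D empirical measures with the same number of samples."""
    u_sorted, _ = torch.sort(u, dim=-1)   # empirical quantiles of mu
    v_sorted, _ = torch.sort(v, dim=-1)   # empirical quantiles of nu
    return torch.mean(torch.abs(u_sorted - v_sorted) ** p, dim=-1)

def sliced_wasserstein(x, y, n_projs=200, p=2):
    """Monte-Carlo estimate of SW_p^p(mu, nu) for samples x, y in R^d."""
    d = x.shape[1]
    theta = torch.randn(n_projs, d)
    theta = theta / torch.linalg.norm(theta, dim=1, keepdim=True)  # theta ~ Unif(S^{d-1})
    x_proj = (x @ theta.T).T   # (L, n): coordinates <x_i, theta_l>
    y_proj = (y @ theta.T).T
    return wasserstein1d(x_proj, y_proj, p=p).mean()  # average over the L directions
```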
3 A SLICED-WASSERSTEIN DISCREPANCY ON THE SPHERE

Our goal here is to define a sliced-Wasserstein distance on the sphere $S^{d-1}$. To that aim, we proceed analogously to the classical Euclidean case. We first rely on the nice properties of the Wasserstein distance on the circle (Rabin et al., 2011a) and then propose to project distributions lying on the sphere onto great circles. Hence, circles play the role of the real line for the hypersphere. In this section, we first describe the OT problem on the circle, then we define a sliced-Wasserstein discrepancy on the sphere and discuss some of its properties. Notably, we derive a new spherical Radon transform which is linked to our newly defined spherical SW. We refer to Appendix A for the proofs.

3.1 OPTIMAL TRANSPORT ON THE CIRCLE

On the circle $S^1 = \mathbb{R}/\mathbb{Z}$ equipped with the geodesic distance $d_{S^1}$, an appealing formulation of the Wasserstein distance is available (Delon et al., 2010). First, let us parametrize $S^1$ by $[0, 1[$; the geodesic distance can then be written as (Rabin et al., 2011a), for all $x, y \in [0, 1[$, $d_{S^1}(x, y) = \min(|x - y|, 1 - |x - y|)$. Then, for the cost function $c(x, y) = h(d_{S^1}(x, y))$ with $h: \mathbb{R} \to \mathbb{R}_+$ an increasing convex function, the Wasserstein distance between $\mu \in P(S^1)$ and $\nu \in P(S^1)$ can be written as

$$W_c(\mu, \nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 h\big(|F_\mu^{-1}(t) - (F_\nu - \alpha)^{-1}(t)|\big)\, \mathrm{d}t, \qquad (8)$$

where $F_\mu: [0, 1[\, \to [0, 1]$ denotes the cumulative distribution function (cdf) of $\mu$, $F_\mu^{-1}$ its quantile function and $\alpha$ is a shift parameter. The optimization problem over the shifted cdf $F_\nu - \alpha$ can be seen as looking for the best cut (or origin) of the circle onto the real line, because of the 1-periodicity. Indeed, the proof of this result for discrete distributions in (Rabin et al., 2011a) consists in cutting the circle at the optimal point and wrapping it around the real line, for which the optimal transport map is the increasing rearrangement $F_\nu^{-1} \circ F_\mu$, which can be obtained for discrete distributions by sorting the points (Peyré et al., 2019). Rabin et al. (2011a) showed that the minimization problem is convex and coercive in the shift parameter and Delon et al. (2010) derived a binary search algorithm to find it. For the particular case $h = \mathrm{Id}$, it can further be shown (Werman et al., 1985; Cabrelli & Molter, 1995) that

$$W_1(\mu, \nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 |F_\mu(t) - F_\nu(t) - \alpha|\, \mathrm{d}t. \qquad (9)$$

In this case, we know exactly the minimum, which is attained at the level median (Hundrieser et al., 2021): for $f: [0, 1[\, \to \mathbb{R}$,

$$\mathrm{LevMed}(f) = \min\Big(\operatorname*{argmin}_{\alpha \in \mathbb{R}} \int_0^1 |f(t) - \alpha|\, \mathrm{d}t\Big) = \inf\Big\{t \in \mathbb{R},\ \beta\big(\{x \in [0, 1[,\ f(x) \le t\}\big) \ge \tfrac12\Big\}, \qquad (10)$$

where $\beta$ is the Lebesgue measure. Therefore, we also have

$$W_1(\mu, \nu) = \int_0^1 \big|F_\mu(t) - F_\nu(t) - \mathrm{LevMed}(F_\mu - F_\nu)\big|\, \mathrm{d}t. \qquad (11)$$

Since we know the minimum, we do not need the binary search and we can approximate the integral very efficiently, as we only need to sort the samples to compute the level median and the cdfs.

Another setting of practical interest is the computation of $W_2$, i.e. with $h(x) = x^2$, w.r.t. a uniform distribution $\nu$ on the circle. We derive here the optimal shift $\hat\alpha$ for the Wasserstein distance between an arbitrary distribution $\mu$ on $S^1$ and $\nu$. We also provide a closed form when $\mu$ is a discrete distribution.

Proposition 1. Let $\mu \in P_2(S^1)$ and $\nu = \mathrm{Unif}(S^1)$. Then,

$$W_2^2(\mu, \nu) = \int_0^1 \big|F_\mu^{-1}(t) - t - \hat\alpha\big|^2\, \mathrm{d}t \quad \text{with} \quad \hat\alpha = \int x\, \mathrm{d}\mu(x) - \frac12. \qquad (12)$$

In particular, if $x_1 < \dots < x_n$ and $\mu_n = \frac1n \sum_{i=1}^n \delta_{x_i}$, then

$$W_2^2(\mu_n, \nu) = \frac1n \sum_{i=1}^n x_i^2 - \Big(\frac1n \sum_{i=1}^n x_i\Big)^2 + \frac{1}{n^2} \sum_{i=1}^n (n + 1 - 2i)\, x_i + \frac{1}{12}. \qquad (13)$$

This proposition offers an intuitive interpretation: the optimal cut point between an empirical distribution and the uniform distribution is the antipodal point of the circular mean of the discrete samples. Moreover, a very efficient algorithm can be derived from this property, as it solely requires a sorting operation to compute the order statistics of the samples.
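As an illustration, the closed form (13) takes a few lines; the sketch below (our naming, not the released implementation) computes $W_2^2$ between an empirical measure on the circle, given by coordinates in $[0, 1[$, and $\mathrm{Unif}(S^1)$. A single Dirac mass is at squared distance $1/12$ from the uniform measure, which gives a quick sanity check.

```python
import torch

def w2_circle_vs_uniform(x):
    """Closed form (13): W_2^2 between (1/n) sum_i delta_{x_i} and Unif(S^1).

    x: (n,) tensor of points in [0, 1) parametrizing the circle.
    """
    n = x.shape[0]
    x_sorted, _ = torch.sort(x)                          # order statistics x_(1) < ... < x_(n)
    i = torch.arange(1, n + 1, dtype=x.dtype)
    return (torch.mean(x_sorted ** 2)
            - torch.mean(x_sorted) ** 2
            + torch.sum((n + 1 - 2 * i) * x_sorted) / n ** 2
            + 1.0 / 12.0)

print(w2_circle_vs_uniform(torch.rand(1)))   # a single Dirac: ~1/12 = 0.0833...
```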
3.2 DEFINITION OF SW ON THE SPHERE

On the hypersphere, the counterpart of straight lines are the great circles, which are circles with the same diameter as the sphere and which correspond to the geodesics. Moreover, we can compute the Wasserstein distance on the circle fairly efficiently. Hence, to define a sliced-Wasserstein discrepancy on this manifold, we propose, analogously to the classical SW distance, to project measures on great circles.

The most natural way to project points from $S^{d-1}$ onto a great circle $C$ is to use the geodesic projection (Jung, 2021; Fletcher et al., 2004) defined as

$$\forall x \in S^{d-1},\quad P^C(x) = \operatorname*{argmin}_{y \in C}\ d_{S^{d-1}}(x, y), \qquad (14)$$

where $d_{S^{d-1}}(x, y) = \arccos(\langle x, y\rangle)$ is the geodesic distance. See Figure 1 for an illustration of the geodesic projection on a great circle. Note that the projection is unique for almost every $x$ (see (Bardelli & Mennucci, 2017, Proposition 4.2) and Appendix B.1), and hence the pushforward $P^C_\#\mu$ of $\mu \in P_{p,ac}(S^{d-1})$, where $P_{p,ac}(S^{d-1})$ denotes the set of absolutely continuous measures w.r.t. the Lebesgue measure with moments of order $p$, is well defined.

Great circles can be obtained by intersecting $S^{d-1}$ with a 2-dimensional plane (Jung et al., 2012). Therefore, to average over all great circles, we propose to integrate over the Grassmann manifold $G_{d,2} = \{E \subset \mathbb{R}^d,\ \dim(E) = 2\}$ (Absil et al., 2004; Bendokat et al., 2020) and then to project the distribution onto the intersection of each plane with the hypersphere. Since the Grassmannian is not very practical, we consider the identification using the set of rank-2 projectors:

$$G_{d,2} = \{P \in \mathbb{R}^{d \times d},\ P^T = P,\ P^2 = P,\ \operatorname{Tr}(P) = 2\} = \{UU^T,\ U \in V_{d,2}\}, \qquad (15)$$

where $V_{d,2} = \{U \in \mathbb{R}^{d \times 2},\ U^T U = I_2\}$ is the Stiefel manifold (Bendokat et al., 2020). Finally, we can define the Spherical Sliced-Wasserstein distance (SSW) for $p \ge 1$ between locally absolutely continuous measures w.r.t. the Lebesgue measure (Bardelli & Mennucci, 2017) $\mu, \nu \in P_{p,ac}(S^{d-1})$ as

$$SSW_p^p(\mu, \nu) = \int_{V_{d,2}} W_p^p(P^U_\#\mu, P^U_\#\nu)\, \mathrm{d}\sigma(U), \qquad (16)$$

where $\sigma$ is the uniform distribution over the Stiefel manifold $V_{d,2}$, $P^U$ is the geodesic projection on the great circle generated by $U$, followed by the identification of this great circle with $S^1$, i.e.

$$\forall U \in V_{d,2},\ \forall x \in S^{d-1},\quad P^U(x) = U^T \operatorname*{argmin}_{y \in \operatorname{span}(UU^T) \cap S^{d-1}} d_{S^{d-1}}(x, y) = \operatorname*{argmin}_{z \in S^1} d_{S^{d-1}}(x, Uz), \qquad (17)$$

and the Wasserstein distance is defined with the geodesic distance $d_{S^1}$.

Figure 1: Illustration of the geodesic projections on a great circle (in black). In red, random points sampled on the sphere. In green the projections and in blue the trajectories.

Moreover, we can derive a closed-form expression which will be very useful in practice:

Lemma 1. Let $U \in V_{d,2}$. Then, for a.e. $x \in S^{d-1}$,

$$P^U(x) = \frac{U^T x}{\|U^T x\|_2}. \qquad (18)$$

We notice from this expression of the projection that we recover almost the same formula as Lin et al. (2020), but with an additional $\ell^2$ normalization which projects the data on the circle. As in (Lin et al., 2020), we could project on a higher dimensional subsphere by integrating over $V_{d,k}$ with larger $k$. However, we would lose the computational efficiency provided by the properties of the Wasserstein distance on the circle.
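In code, Lemma 1 makes the projection a matrix product followed by a normalization; the sketch below (ours) maps samples on $S^{d-1}$ to points of $S^1$, expressed in the orthonormal basis given by the columns of $U$.

```python
import torch

def project_on_great_circle(x, U):
    """Geodesic projection of Lemma 1.

    x: (n, d) unit-norm samples, U: (d, 2) with orthonormal columns.
    Returns (n, 2) unit vectors, i.e. points of S^1 in the basis of U.
    """
    z = x @ U                                                  # U^T x_i for each sample
    return z / torch.linalg.norm(z, dim=-1, keepdim=True)      # l2-normalization onto the circle
```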
3.3 A SPHERICAL RADON TRANSFORM

As for the classical SW distance, we can derive a second formulation using a Radon transform. Let $f \in L^1(S^{d-1})$; we define a spherical Radon transform $\tilde R: L^1(S^{d-1}) \to L^1(S^1 \times V_{d,2})$ as

$$\forall z \in S^1,\ \forall U \in V_{d,2},\quad \tilde Rf(z, U) = \int_{S^{d-1}} f(x)\, \mathbf{1}_{\{z = P^U(x)\}}\, \mathrm{d}x. \qquad (19)$$

This is essentially the same formulation as the classical Radon transform (Natterer, 2001; Helgason et al., 2011), where the real-line coordinate $t$ is replaced by the coordinate $z$ on the circle and the projection is the geodesic one, which is well suited to the sphere. This transform is actually new, since we integrate over different sets compared to existing works on spherical Radon transforms. Then, analogously to the classical Radon transform, we can define the back-projection operator $\tilde R^*: C_0(S^1 \times V_{d,2}) \to C_b(S^{d-1})$, $C_b(S^{d-1})$ being the space of continuous bounded functions, for $g \in C_0(S^1 \times V_{d,2})$ as, for a.e. $x \in S^{d-1}$,

$$\tilde R^* g(x) = \int_{V_{d,2}} g\big(P^U(x), U\big)\, \mathrm{d}\sigma(U). \qquad (20)$$

Proposition 2. $\tilde R^*$ is the dual operator of $\tilde R$, i.e. for all $f \in L^1(S^{d-1})$ and $g \in C_0(S^1 \times V_{d,2})$,

$$\langle \tilde Rf, g\rangle_{S^1 \times V_{d,2}} = \langle f, \tilde R^* g\rangle_{S^{d-1}}. \qquad (21)$$

Now that we have a dual operator, we can also define the Radon transform of an absolutely continuous measure $\mu \in \mathcal{M}_{ac}(S^{d-1})$ by duality (Boman & Lindskog, 2009; Bonneel et al., 2015), as the measure $\tilde R\mu$ satisfying

$$\forall g \in C_0(S^1 \times V_{d,2}),\quad \int_{S^1 \times V_{d,2}} g(z, U)\, \mathrm{d}(\tilde R\mu)(z, U) = \int_{S^{d-1}} \tilde R^* g(x)\, \mathrm{d}\mu(x). \qquad (22)$$

Since $\tilde R\mu$ is a measure on the product space $S^1 \times V_{d,2}$, it can be disintegrated (Ambrosio et al., 2005, Theorem 5.3.1) w.r.t. $\sigma$ as $\tilde R\mu = \sigma \otimes K$, where $K$ is a probability kernel on $V_{d,2} \times \mathcal{S}^1$ with $\mathcal{S}^1$ the Borel $\sigma$-field of $S^1$. We will denote, for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde R\mu)^U = K(U, \cdot)$ the conditional probability.

Proposition 3. Let $\mu \in \mathcal{M}_{ac}(S^{d-1})$. Then, for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde R\mu)^U = P^U_\#\mu$.

Finally, we can write SSW (16) using this Radon transform:

$$\forall \mu, \nu \in P_{p,ac}(S^{d-1}),\quad SSW_p^p(\mu, \nu) = \int_{V_{d,2}} W_p^p\big((\tilde R\mu)^U, (\tilde R\nu)^U\big)\, \mathrm{d}\sigma(U). \qquad (23)$$

Note that a natural way to define SW distances can be through already known Radon transforms, using the formulation (23). This is for example what was done in (Kolouri et al., 2019) using generalized Radon transforms (Ehrenpreis, 2003; Homan & Zhou, 2017) to define generalized SW distances, or in (Chen et al., 2020b) with the spatial Radon transform. However, for known spherical Radon transforms (Abouelaz & Daher, 1993; Antipov et al., 2011), such as the Minkowski-Funk transform (Dann, 2010) or more generally the geodesic Radon transform (Rubin, 2002), there is no natural way that we know of to integrate over some product space allowing to define a SW distance using disintegration.

As observed by Kolouri et al. (2019) for the generalized SW distances (GSW), studying the injectivity of the related Radon transforms allows to study the set on which SW is actually a distance. While the classical Radon transform integrates over hyperplanes of $\mathbb{R}^d$, the generalized Radon transform over hypersurfaces (Kolouri et al., 2019) and the Minkowski-Funk transform over "big circles", i.e. the intersection between a hyperplane and $S^{d-1}$ (Rubin, 2003), the set of integration here is a half of a big circle. Hence, $\tilde R$ is related to the hemispherical transform (Rubin, 1999) on $S^{d-2}$. We refer to Appendix A.6 for more details on the links with the hemispherical transform. Using these connections, we can derive the kernel of $\tilde R$ as the set of even measures which are null over all hyperplanes intersected with $S^{d-1}$.

Proposition 4. $\ker(\tilde R) = \{\mu \in \mathcal{M}_{even}(S^{d-1}),\ \forall H \in G_{d,d-1},\ \mu(H \cap S^{d-1}) = 0\}$, where $\mu \in \mathcal{M}_{even}$ if for all $f \in C(S^{d-1})$, $\langle \mu, f\rangle = \langle \mu, f_+\rangle$ with $f_+(x) = \big(f(x) + f(-x)\big)/2$ for all $x$.

We leave for future work checking whether this set is null or not. Hence, we conclude here that SSW is a pseudo-distance, but a distance on the sets of injectivity of $\tilde R$ (Agranovsky & Quinto, 1996).

Proposition 5. Let $p \ge 1$. $SSW_p$ is a pseudo-distance on $P_{p,ac}(S^{d-1})$.

4 IMPLEMENTATION

In practice, we approximate the distributions with empirical approximations and, as for the classical SW distance, we rely on a Monte-Carlo approximation of the integral on $V_{d,2}$. We first need to sample from the uniform distribution $\sigma \in P(V_{d,2})$. This can be done by first constructing $Z \in \mathbb{R}^{d \times 2}$ by drawing each of its components from the standard normal distribution $\mathcal{N}(0, 1)$ and then applying the QR decomposition (Lin et al., 2021).
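The sampler just described is a one-liner in practice; the following sketch (ours) draws matrices that are (up to the sign convention of the QR routine) uniformly distributed on $V_{d,2}$.

```python
import torch

def sample_stiefel(d, n_projs):
    """Draw n_projs orthonormal 2-frames U in V_{d,2} via QR of Gaussian matrices."""
    Z = torch.randn(n_projs, d, 2)       # i.i.d. N(0, 1) entries
    Q, _ = torch.linalg.qr(Z)            # Q^T Q = I_2 for each matrix in the batch
    return Q                             # shape (n_projs, d, 2)
```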
Once we have $(U_\ell)_{\ell=1}^L \sim \sigma$, we project the samples on the circle $S^1$ by applying Lemma 1, and we compute the coordinates on the circle using the atan2 function. Finally, we can compute the Wasserstein distance on the circle by either applying the binary search algorithm of (Delon et al., 2010) or the level median formulation (11) for SSW1. In the particular case in which we want to compute SSW2 between a measure $\mu$ and the uniform measure on the sphere $\nu = \mathrm{Unif}(S^{d-1})$, we can use the appealing fact that the projection of $\nu$ on the circle is uniform, i.e. $P^U_\#\nu = \mathrm{Unif}(S^1)$ (a particular case of Theorem 3.1 in (Jung, 2021), see Appendix B.3). Hence, we can use Proposition 1 to compute $W_2$, which allows a very efficient implementation, either by the closed form (13) or by approximating (12) with the rectangle method. This will be of particular interest for applications in Section 5 such as autoencoders. We sum up the procedure in Algorithm 1.

Algorithm 1 SSW
Input: $(x_i)_{i=1}^n \sim \mu$, $(y_j)_{j=1}^m \sim \nu$, $L$ the number of projections, $p$ the order
for $\ell = 1$ to $L$ do
  Draw a random matrix $Z \in \mathbb{R}^{d \times 2}$ with $Z_{i,j} \sim \mathcal{N}(0, 1)$ for all $i, j$
  $U = \mathrm{QR}(Z) \sim \sigma$
  Project the points on $S^1$: for all $i, j$, $\hat x^\ell_i = \frac{U^T x_i}{\|U^T x_i\|_2}$, $\hat y^\ell_j = \frac{U^T y_j}{\|U^T y_j\|_2}$
  Compute the coordinates on the circle $S^1$: for all $i, j$, $\tilde x^\ell_i = \big(\pi + \mathrm{atan2}(\hat x^\ell_{i,2}, \hat x^\ell_{i,1})\big)/(2\pi)$, $\tilde y^\ell_j = \big(\pi + \mathrm{atan2}(\hat y^\ell_{j,2}, \hat y^\ell_{j,1})\big)/(2\pi)$
  Compute $W_p^p\big(\frac1n \sum_{i=1}^n \delta_{\tilde x^\ell_i}, \frac1m \sum_{j=1}^m \delta_{\tilde y^\ell_j}\big)$ by binary search, or with (11) for $p = 1$
end for
Return $SSW_p^p(\mu, \nu) \approx \frac1L \sum_{\ell=1}^L W_p^p\big(\frac1n \sum_{i=1}^n \delta_{\tilde x^\ell_i}, \frac1m \sum_{j=1}^m \delta_{\tilde y^\ell_j}\big)$

Complexity. Let us denote by $n$ (resp. $m$) the number of samples of $\mu$ (resp. $\nu$), and by $L$ the number of projections. First, we need to compute the QR factorization of $L$ matrices of size $d \times 2$. This can be done in $O(Ld)$ by using e.g. Householder reflections (Golub & Van Loan, 2013, Chapter 5.2) or the Schwarz-Rutishauser algorithm (Gander, 1980). Projecting the points on $S^1$ by Lemma 1 is in $O((n + m)dL)$, since we need to compute $L(n + m)$ products between $U_\ell^T \in \mathbb{R}^{2 \times d}$ and $x \in \mathbb{R}^d$. For the binary search or the particular-case formulas (11) and (13), we first need to sort the points. The binary search also adds a cost of $O\big((n + m)\log(\frac1\epsilon)\big)$ to approximate the solution with precision $\epsilon$ (Delon et al., 2010), and the computation of the level median requires sorting $(n + m)$ points. Hence, for the general $SSW_p$, the complexity is $O\big(L(n + m)(d + \log\frac1\epsilon) + Ln\log n + Lm\log m\big)$, versus $O\big(L(n + m)(d + \log(n + m))\big)$ for $SSW_1$ with the level median, and $O\big(Ln(d + \log n)\big)$ for $SSW_2$ against a uniform distribution, with the particular advantage that we do not need samples of the uniform distribution in this case.
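Putting the pieces together, the following self-contained sketch (ours, not the released implementation) estimates $SSW_2^2$ between an empirical measure on $S^{d-1}$ and $\mathrm{Unif}(S^{d-1})$, i.e. the special case where Algorithm 1 can use the closed form (13) and needs no samples of the uniform measure.

```python
import math
import torch

def ssw2_vs_uniform(x, n_projs=200):
    """Monte-Carlo estimate of SSW_2^2(mu, Unif(S^{d-1})) for samples x: (n, d) on the sphere."""
    n, d = x.shape
    Z = torch.randn(n_projs, d, 2, device=x.device, dtype=x.dtype)
    U, _ = torch.linalg.qr(Z)                               # U_l ~ sigma (QR of a Gaussian matrix)
    z = torch.einsum('ldk,nd->lnk', U, x)                   # U_l^T x_i, shape (L, n, 2)
    z = z / torch.linalg.norm(z, dim=-1, keepdim=True)      # projection on S^1 (Lemma 1)
    t = (math.pi + torch.atan2(z[..., 1], z[..., 0])) / (2 * math.pi)  # coordinates in [0, 1)
    t, _ = torch.sort(t, dim=-1)                            # order statistics per great circle
    i = torch.arange(1, n + 1, dtype=x.dtype, device=x.device)
    w2 = (torch.mean(t ** 2, dim=-1)
          - torch.mean(t, dim=-1) ** 2
          + torch.sum((n + 1 - 2 * i) * t, dim=-1) / n ** 2
          + 1.0 / 12.0)                                     # closed form (13) on each circle
    return w2.mean()                                        # average over the L projections

x = torch.randn(500, 3)
x = x / torch.linalg.norm(x, dim=1, keepdim=True)           # roughly uniform samples on S^2
print(ssw2_vs_uniform(x))                                   # should be small
```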
Runtime Comparison. We perform here some runtime comparisons. Using PyTorch (Paszke et al., 2019), we implemented the binary search algorithm of (Delon et al., 2010) and used it with $\epsilon = 10^{-6}$. We also implemented SSW1 using the level median formula (11) and SSW2 against a uniform measure (12). All experiments are conducted on GPU.

Figure 2: Runtime comparison in log-log scale between W, Sinkhorn with the geodesic distance, SW2, SSW2 with the binary search (BS) and against the uniform distribution (12), and SSW1 with formula (11), between two distributions on $S^2$ (all sliced methods use L = 200 projections). The time includes the calculation of the distance matrices.

On Figure 2, we compare the runtime between two distributions on $S^2$ for SSW, SW, the Wasserstein distance and the entropic approximation computed with the Sinkhorn algorithm (Cuturi, 2013), with the geodesic distance as cost function. The distributions were approximated using $n \in \{10^2, 10^3, 10^4, 5\cdot 10^4, 10^5, 5\cdot 10^5\}$ samples of each distribution, and we report the mean over 20 computations. We use the Python Optimal Transport (POT) library (Flamary et al., 2021) to compute the Wasserstein distance and the entropic approximation. For large enough batches, we observe that SSW is much faster than its Wasserstein counterpart, and it also scales better in terms of memory since it does not require storing the $n \times n$ cost matrix. For small batches, the computation of SSW actually takes longer because of the computation of the QR factorizations and of the projections. For bigger batches, it is bounded by the sorting operation and we recover the quasi-linear slope. Furthermore, as expected, the fastest algorithms are SSW1 with the level median and SSW2 against a uniform, as they have a quasi-linear complexity. We report in Appendix C.2 other runtime experiments, w.r.t. e.g. the number of projections or the dimension. Additionally, we study both theoretically and empirically the projection and sample complexities in Appendices A.9 and C.1. We obtain results similar to those derived by (Nadjahi et al., 2020) for the SW distance. Notably, the sample complexity is independent of the dimension.

Figure 3: Minimization of SSW with respect to a mixture of vMF (panel (b): KDE estimate).

Figure 4: Latent space of SWAE and SSWAE on MNIST for a uniform prior on $S^2$.

5 EXPERIMENTS

Apart from showing that SSW is an effective discrepancy for learning problems defined over the sphere, the objective of this experimental section is to show that it behaves better than using the more immediate SW in the embedding space. We first illustrate the ability to approximate different distributions by minimizing SSW w.r.t. some target distributions on $S^2$ and by performing density estimation experiments on real earth data. Then, we apply SSW to generative modeling tasks using the framework of Sliced-Wasserstein autoencoders, and we show that we obtain results competitive with other Wasserstein autoencoder based methods using a prior on higher dimensional hyperspheres. Complete details about the experimental settings and optimization strategies are given in Appendix C. We also report in Appendices C.5 and C.7 complementary experiments on variational inference on the sphere and on self-supervised learning with a uniformity prior on the embedding hypersphere, which further assess the effectiveness of SSW in a wide range of learning tasks. The code is available online¹.

5.1 SSW AS A LOSS

Gradient flow on toy data. We first verify that we can learn a target distribution $\nu \in P(S^{d-1})$ by minimizing SSW, i.e. we consider the minimization problem $\operatorname{argmin}_\mu SSW_p^p(\mu, \nu)$. We suppose that we have access to the target distribution $\nu$ through samples, i.e. through $\hat\nu_m = \frac1m \sum_{j=1}^m \delta_{y_j}$ where $(y_j)_{j=1}^m$ are i.i.d. samples of $\nu$. We discuss in Appendix C.5 the case where we know the density up to some constant, which can be handled with the sliced-Wasserstein variational inference framework introduced in (Yi & Liu, 2021). We choose as target distribution a mixture of 6 well separated von Mises-Fisher distributions (Mardia, 1975). This is a fairly challenging distribution since there are 6 modes which are not connected. We show on Figure 3 the Mollweide projection of the density, approximated by a kernel density estimator, for a distribution of 500 particles.
To optimize directly over particles, we perform a Riemannian gradient descent on the sphere (Absil et al., 2009).

Table 1: Negative test log likelihood.

| | Earthquake | Flood | Fire |
|---|---|---|---|
| SSW | 0.84 ± 0.07 | 1.26 ± 0.05 | 0.23 ± 0.18 |
| SW | 0.94 ± 0.02 | 1.36 ± 0.04 | 0.54 ± 0.37 |
| Stereo | 1.91 ± 0.1 | 2.00 ± 0.07 | 1.27 ± 0.09 |

Density estimation on earth data. We perform density estimation on datasets first gathered by Mathieu & Nickel (2020), which contain locations of wild fires (EOSDIS, 2020), floods (Brakenridge, 2017) or earthquakes (NOAA, 2022). We use the exponential map normalizing flows introduced in (Rezende et al., 2020) (see Appendix B.4), which are invertible transformations mapping the data to some prior that we need to enforce. Here, we choose as prior a uniform distribution on $S^2$ and we learn the model using SSW. These transformations allow to evaluate exactly the density at any point. More precisely, let $T$ be such a transformation, let $p_Z$ be a prior distribution on $S^2$ and $\mu$ the measure of interest, which we know from samples, i.e. through $\hat\mu_n = \frac1n \sum_{i=1}^n \delta_{x_i}$. Then, we solve the following optimization problem: $\min_T SSW_2^2(T_\#\mu, p_Z)$. Once the model is fitted, the learned density $f_\mu$ can be obtained as

$$\forall x \in S^2,\quad f_\mu(x) = p_Z\big(T(x)\big)\, |\det J_T(x)|, \qquad (24)$$

where we used the change of variable formula. We show on Figure 5 the density learned on the test data. We observe on this figure that the normalizing flows (NFs) put mass where most data points lie, and hence are able to somewhat recover the principal modes of the data. We also compare in Table 1 the negative test log likelihood, averaged over 5 trainings with different splits of the data, between different OT metrics, namely SSW, SW and the stereographic projection model (Gemici et al., 2016), which first projects the data on $\mathbb{R}^2$ and uses a regular NF in the projected space. We observe that SSW allows to better fit the data compared to the other OT based methods, which are less suited to the sphere.

Figure 5: Density estimation of models trained on earth data (panel (b): Earthquake). We plot the density on the test data.

¹https://github.com/clbonet/Spherical_Sliced-Wasserstein

5.2 SSW AUTOENCODERS

Table 2: FID (Lower is better).

| Method / Dataset | MNIST | Fashion | CIFAR10 |
|---|---|---|---|
| SSWAE | 14.91 ± 0.32 | 43.94 ± 0.81 | 98.57 ± 0.35 |
| SWAE | 15.18 ± 0.32 | 44.78 ± 1.07 | 98.5 ± 0.45 |
| WAE-MMD IMQ | 18.12 ± 0.62 | 68.51 ± 2.76 | 100.14 ± 0.67 |
| WAE-MMD RBF | 20.09 ± 1.42 | 70.58 ± 1.75 | 100.27 ± 0.74 |
| SAE | 19.39 ± 0.56 | 56.75 ± 1.7 | 99.34 ± 0.96 |
| Circular GSWAE | 15.01 ± 0.26 | 44.65 ± 1.2 | 98.8 ± 0.68 |

In this section, we use SSW to learn the latent space of autoencoders (AE). We rely on the SWAE framework introduced by Kolouri et al. (2018). Let $f$ be some encoder and $g$ be some decoder, and denote by $p_Z$ a prior distribution; then the loss minimized in SWAE is

$$\mathcal{L}(f, g) = \int c\big(x, g(f(x))\big)\, \mathrm{d}\mu(x) + \lambda\, SW_2^2(f_\#\mu, p_Z), \qquad (25)$$

where $\mu$ is the distribution of the data, for which we have access to samples. One advantage of this framework over more classical VAEs (Kingma & Welling, 2013) is that no reparametrization trick is needed, and therefore there is more freedom in the choice of the prior. In several concomitant works, it was shown that using a prior on the hypersphere can improve the results (Davidson et al., 2018; Xu & Durrett, 2018). Hence, we propose, in the same fashion as (Kolouri et al., 2018; 2019; Patrini et al., 2020), to replace SW by SSW, which we denote SSWAE, and to enforce a prior on the sphere.
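Concretely, the SSWAE objective is (25) with $SW_2^2$ replaced by $SSW_2^2$; a minimal training-loss sketch (ours, not the released code) is given below for a uniform prior on the hypersphere, reusing the `ssw2_vs_uniform` sketch given after Algorithm 1. The names `encoder`, `decoder` and `lambda_reg`, and the choice of a squared-error reconstruction cost, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sswae_loss(encoder, decoder, x, lambda_reg=1.0, n_projs=200):
    """One evaluation of (25) with SSW_2^2 in place of SW_2^2 and p_Z = Unif(S^{d-1})."""
    z = encoder(x)
    z = z / torch.linalg.norm(z, dim=-1, keepdim=True)   # latent codes constrained to the sphere
    x_rec = decoder(z)
    rec = F.mse_loss(x_rec, x)                           # reconstruction cost c(x, g(f(x)))
    reg = ssw2_vs_uniform(z, n_projs=n_projs)            # match f#mu to the uniform prior
    return rec + lambda_reg * reg
```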
In the following, we use the MNIST (LeCun & Cortes, 2010), Fashion-MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky, 2009) datasets, and we put an $\ell^2$ normalization at the output of the encoder. As a prior, we use the uniform distribution on $S^{10}$ for MNIST and Fashion, and on $S^{64}$ for CIFAR10. We compare in Table 2 the Fréchet Inception Distance (FID) (Heusel et al., 2017), computed on 10000 samples and averaged over 5 trainings, obtained with the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018), the classical SWAE (Kolouri et al., 2018), the Sinkhorn Autoencoder (SAE) (Patrini et al., 2020) and circular GSWAE (Kolouri et al., 2019). We observe that we obtain fairly competitive results on the different datasets. We show on Figure 4 the latent space obtained with a uniform prior on $S^2$ on MNIST. We notably observe a better separation between classes for SSWAE.

6 CONCLUSION AND DISCUSSION

In this work, we derive a new sliced-Wasserstein discrepancy on the hypersphere that comes with practical advantages when computing optimal transport distances on hyperspherical data. We notably showed that it is competitive with, and sometimes better than, other metrics defined directly on $\mathbb{R}^d$ on a variety of machine learning tasks, including density estimation and generative modeling. Our work is, up to our knowledge, the first to adapt the classical sliced-Wasserstein framework to non-trivial manifolds. The three main ingredients are: i) a closed form for the Wasserstein distance on the circle, ii) a closed-form solution to the projection onto great circles, and iii) a novel Radon transform on the sphere. An immediate extension of this work would be to consider sliced-Wasserstein discrepancies in hyperbolic spaces, where geodesics are circular arcs as in the Poincaré disk. Beyond the generalization to other, possibly well behaved, manifolds, asymptotic properties as well as statistical and topological aspects need to be examined. While we postulate that results comparable to the Euclidean case might be reached, the fact that the manifold is closed might bring interesting differences and justify further use of this type of discrepancies rather than their Euclidean counterparts.

ACKNOWLEDGMENTS

Clément Bonet thanks Benoît Malézieux for fruitful discussions. This work was performed partly using HPC resources from GENCI-IDRIS (Grant 2022-AD011013514). This research was funded by the project DynaLearn from Labex CominLabs and Région Bretagne ARED DLearnMe, and by the project OTTOPIA ANR-20-CHIA-0030 of the French National Research Agency (ANR).

REFERENCES

Ahmed Abouelaz and Radouan Daher. Sur la transformation de Radon de la sphère S^d. Bulletin de la Société Mathématique de France, 121(3):353-382, 1993. (Cited on p. 6)

P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica, 80(2):199-220, 2004. (Cited on p. 5)

P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009. (Cited on p. 8, 25, 29, 32)

Mark L. Agranovsky and Eric Todd Quinto. Injectivity of the spherical mean operator and related problems. Complex Analysis, Harmonic Analysis and Applications, 347:12, 1996. (Cited on p. 6)

David Alvarez-Melis, Youssef Mroueh, and Tommi Jaakkola. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces.
In International Conference on Artificial Intelligence and Statistics, pp. 1606 1617. PMLR, 2020. (Cited on p. 1) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar e. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005. (Cited on p. 3, 6) Yuri A Antipov, Ricardo Estrada, and Boris Rubin. Inversion formulas for the spherical means in constant curvature spaces. ar Xiv preprint ar Xiv:1107.5992, 2011. (Cited on p. 6) Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. (Cited on p. 1) Eleonora Bardelli and Andrea Carlo Giuseppe Mennucci. Probability measures on infinitedimensional stiefel manifolds. Journal of Geometric Mechanics, 9(3):291, 2017. (Cited on p. 4, 5, 18, 24, 25) Thomas Bendokat, Ralf Zimmermann, and P-A Absil. A grassmann manifold handbook: Basic geometry and computational aspects. ar Xiv preprint ar Xiv:2011.13699, 2020. (Cited on p. 5) Camille Besombes, Olivier Pannekoucke, Corentin Lapeyre, Benjamin Sanderson, and Olivier Thual. Producing realistic climate data with generative adversarial networks. Nonlinear Processes in Geophysics, 28(3):347 370, 2021. (Cited on p. 1) David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017. (Cited on p. 31) Jan Boman and Filip Lindskog. Support theorems for the radon transform and cram er-wold theorems. Journal of theoretical probability, 22(3):683 710, 2009. (Cited on p. 3, 6) Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217 2229, 2013. (Cited on p. 25) Nicolas Bonneel, Julien Rabin, Gabriel Peyr e, and Hanspeter Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22 45, 2015. (Cited on p. 3, 6) Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Paris 11, 2013. (Cited on p. 2) Published as a conference paper at ICLR 2023 Nicolas Boumal. An introduction to optimization on smooth manifolds. To appear with Cambridge University Press, Apr 2022. URL http://www.nicolasboumal.net/book. (Cited on p. 25, 29) G Brakenridge. Global active archive of large flood events. http://floodobservatory.colorado.edu/Archives/index.html, 2017. (Cited on p. 8) Simon Byrne and Mark Girolami. Geodesic monte carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825 845, 2013. (Cited on p. 31) Carlos A Cabrelli and Ursula M Molter. The kantorovich metric for probability measures on the circle. Journal of Computational and Applied Mathematics, 57(3):345 361, 1995. (Cited on p. 4) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 9912 9924. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 70feb62b69f16e0238f741fab228fec2-Paper.pdf. (Cited on p. 1, 36) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. 
In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1597 1607. PMLR, 13 18 Jul 2020a. URL https://proceedings. mlr.press/v119/chen20j.html. (Cited on p. 1, 36, 37, 38) Xiongjie Chen, Yongxin Yang, and Yunpeng Li. Augmented sliced wasserstein distances. ar Xiv preprint ar Xiv:2006.08812, 2020b. (Cited on p. 3, 6) Samuel Cohen, Brandon Amos, and Yaron Lipman. Riemannian convex potential maps. In International Conference on Machine Learning, pp. 2028 2038. PMLR, 2021. (Cited on p. 1, 26) Nicolas Courty, R emi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853 1865, 2016. (Cited on p. 1) Li Cui, Xin Qi, Chengfeng Wen, Na Lei, Xinyuan Li, Min Zhang, and Xianfeng Gu. Spherical optimal transportation. Computer-Aided Design, 115:181 193, 2019. (Cited on p. 1) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. (Cited on p. 1, 7) Susanna Dann. On the minkowski-funk transform. ar Xiv preprint ar Xiv:1003.5565, 2010. (Cited on p. 6) Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical variational auto-encoders. In Amir Globerson and Ricardo Silva (eds.), Proceedings of the Thirty Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pp. 856 865. AUAI Press, 2018. URL http://auai.org/uai2018/ proceedings/papers/309.pdf. (Cited on p. 1, 9, 27) Nicola De Cao and Wilker Aziz. The power spherical distribution. ar Xiv preprint ar Xiv:2006.04437, 2020. (Cited on p. 32) Julie Delon, Julien Salomon, and Andrei Sobolevski. Fast transport optimization for monge costs on the circle. SIAM Journal on Applied Mathematics, 70(7):2239 2258, 2010. (Cited on p. 3, 4, 6, 7) Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. (Cited on p. 3) Published as a conference paper at ICLR 2023 Marco Di Marzio, Agnese Panzera, and Charles C Taylor. Nonparametric regression for spherical data. Journal of the American Statistical Association, 109(506):748 763, 2014. (Cited on p. 1) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. (Cited on p. 27, 30) Arnaud Doucet, Nando De Freitas, Neil James Gordon, et al. Sequential Monte Carlo methods in practice, volume 1. Springer, 2001. (Cited on p. 33) Leon Ehrenpreis. The universality of the Radon transform. OUP Oxford, 2003. (Cited on p. 6) EOSDIS. Land, atmosphere near real-time capability for eos (lance) system operated by nasa s earth science data and information system (esdis). https://earthdata.nasa.gov/earth-observation-data/nearreal-time/firms/active-fire-data, 2020. (Cited on p. 8) Kilian Fatras, Younes Zine, R emi Flamary, Remi Gribonval, and Nicolas Courty. Learning with minibatch wasserstein : asymptotic and gradient properties. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 
2131 2141. PMLR, 26 28 Aug 2020. URL https://proceedings.mlr.press/v108/fatras20a.html. (Cited on p. 1) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016. (Cited on p. 1) Jean Feydy, Thibault S ejourn e, Franc ois-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyr e. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681 2690, 2019. (Cited on p. 35) Alessio Figalli and C edric Villani. Optimal transport and curvature. In Nonlinear PDE s and Applications, pp. 171 217. Springer, 2011. (Cited on p. 1, 2) R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, L eo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html. (Cited on p. 7, 35) P Thomas Fletcher, Conglin Lu, Stephen M Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE transactions on medical imaging, 23(8):995 1005, 2004. (Cited on p. 4) Walter Gander. Algorithms for the qr decomposition. Res. Rep, 80(02):1251 1268, 1980. (Cited on p. 7) Mevlana C Gemici, Danilo Rezende, and Shakir Mohamed. Normalizing flows on riemannian manifolds. ar Xiv preprint ar Xiv:1611.02304, 2016. (Cited on p. 9, 26, 27, 30) Gene H Golub and Charles F Van Loan. Matrix computations. JHU press, 2013. (Cited on p. 7) Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21271 21284. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/ 2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf. (Cited on p. 1) Armelle Guillou, Philippe Naveau, and Alexandre You. A folding methodology for multivariate extremes: estimation of the spectral probability measure and actuarial applications. Scandinavian Actuarial Journal, 2015(7):549 572, 2015. (Cited on p. 1) Published as a conference paper at ICLR 2023 Brittany Froese Hamfeldt and Axel GR Turnquist. A convergence framework for optimal transport on the sphere. ar Xiv preprint ar Xiv:2103.05739, 2021a. (Cited on p. 1) Brittany Froese Hamfeldt and Axel GR Turnquist. A convergent finite difference method for optimal transport on the sphere. ar Xiv preprint ar Xiv:2105.03500, 2021b. (Cited on p. 1) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. (Cited on p. 36) Sigurdur Helgason et al. Integral geometry and Radon transforms. Springer, 2011. (Cited on p. 3, 5) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. (Cited on p. 9) Andrew Homan and Hanming Zhou. Injectivity and stability for a generic class of generalized radon transforms. The Journal of Geometric Analysis, 27(2):1515 1529, 2017. (Cited on p. 6) Andr es Hoyos-Idrobo. Aligning hyperbolic representations: an optimal transport-based approach. ar Xiv preprint ar Xiv:2012.01089, 2020. (Cited on p. 1) Minhui Huang, Shiqian Ma, and Lifeng Lai. A riemannian block coordinate descent method for computing the projection robust wasserstein distance. In International Conference on Machine Learning, pp. 4446 4455. PMLR, 2021. (Cited on p. 3) Shayan Hundrieser, Marcel Klatt, and Axel Munk. The statistics of circular optimal transport. ar Xiv preprint ar Xiv:2103.15426, 2021. (Cited on p. 4) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183 233, 1999. (Cited on p. 31) Sungkyu Jung. Geodesic projection of the von mises fisher distribution for projection pursuit of directional data. Electronic Journal of Statistics, 15(1):984 1033, 2021. (Cited on p. 4, 6, 26) Sungkyu Jung, Ian L Dryden, and James Stephen Marron. Analysis of principal nested spheres. Biometrika, 99(3):551 568, 2012. (Cited on p. 5, 20) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. (Cited on p. 29, 33, 34, 36) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. (Cited on p. 1, 9) Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. Sliced wasserstein autoencoders. In International Conference on Learning Representations, 2018. (Cited on p. 9, 34) Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019. (Cited on p. 3, 6, 9, 19, 34) Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009. (Cited on p. 9, 36) Gerhard Kurz and Uwe D Hanebeck. Stochastic sampling of the hyperspherical von mises fisher distribution without rejection methods. In 2015 Sensor Data Fusion: Trends, Solutions, Applications (SDF), pp. 1 6. IEEE, 2015. (Cited on p. 25) Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical hamiltonian monte carlo for constrained target distributions. In International Conference on Machine Learning, pp. 629 637. PMLR, 2014. (Cited on p. 31) Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann. lecun.com/exdb/mnist/. (Cited on p. 9) Published as a conference paper at ICLR 2023 Mufan Bill Li and Murat A Erdogdu. Riemannian langevin algorithm for solving semidefinite programs. ar Xiv preprint ar Xiv:2010.11176, 2020. (Cited on p. 31) Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. Projection robust wasserstein distance and riemannian optimization. Advances in neural information processing systems, 33: 9383 9397, 2020. (Cited on p. 3, 5) Tianyi Lin, Zeyu Zheng, Elynn Chen, Marco Cuturi, and Michael I Jordan. On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics, pp. 262 270. PMLR, 2021. (Cited on p. 3, 6) Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic mcmc methods. 
Advances in neural information processing systems, 29, 2016. (Cited on p. 31, 32) Jun S Liu and Rong Chen. Blind deconvolution via sequential imputations. Journal of the american statistical association, 90(430):567 576, 1995. (Cited on p. 33) Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. (Cited on p. 1) Kanti V Mardia, Peter E Jupp, and KV Mardia. Directional statistics, volume 2. Wiley Online Library, 2000. (Cited on p. 1, 25) Kantilal Varichand Mardia. Statistics of directional data. Journal of the Royal Statistical Society: Series B (Methodological), 37(3):349 371, 1975. (Cited on p. 8) Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503 2515, 2020. (Cited on p. 8, 27) Robert J Mc Cann. Polar factorization of maps on riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589 608, 2001. (Cited on p. 1) Kimia Nadjahi. Sliced-Wasserstein distance for large-scale machine learning: theory, methodology and extensions. Ph D thesis, Institut polytechnique de Paris, 2021. (Cited on p. 2) Kimia Nadjahi, Alain Durmus, Umut Simsekli, and Roland Badeau. Asymptotic guarantees for learning generative models with the sliced-wasserstein distance. Advances in Neural Information Processing Systems, 32, 2019. (Cited on p. 3) Kimia Nadjahi, Alain Durmus, L ena ıc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020. (Cited on p. 7, 23, 24, 28) Frank Natterer. The mathematics of computerized tomography. SIAM, 2001. (Cited on p. 5) Khai Nguyen and Nhat Ho. Amortized projection optimization for sliced wasserstein generative models. ar Xiv preprint ar Xiv:2203.13417, 2022a. (Cited on p. 3) Khai Nguyen and Nhat Ho. Revisiting sliced wasserstein on images: From vectorization to convolution. ar Xiv preprint ar Xiv:2204.01188, 2022b. (Cited on p. 3) Khai Nguyen, Son Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Improving relational regularized autoencoders with spherical sliced fused gromov wasserstein. ar Xiv preprint ar Xiv:2010.01787, 2020. (Cited on p. 3) Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-wasserstein and applications to generative modeling. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview. net/forum?id=QYj O70ACDK. (Cited on p. 3) NOAA. Ncei/wds global significant earthquake database. https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153, 2022. (Cited on p. 8) Published as a conference paper at ICLR 2023 George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1 64, 2021. (Cited on p. 26) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 
Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. (Cited on p. 7) Giorgio Patrini, Rianne van den Berg, Patrick Forre, Marcello Carioni, Samarth Bhargav, Max Welling, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. In Uncertainty in Artificial Intelligence, pp. 733 743. PMLR, 2020. (Cited on p. 9, 34) Franc ois-Pierre Paty and Marco Cuturi. Subspace robust wasserstein distances. In International conference on machine learning, pp. 5072 5081. PMLR, 2019. (Cited on p. 3) Nathana el Perraudin, Micha el Defferrard, Tomasz Kacprzak, and Raphael Sgier. Deepsphere: Efficient spherical convolutional neural network with healpix sampling for cosmological applications. Astronomy and Computing, 27:130 146, 2019. (Cited on p. 1) Arthur Pewsey and Eduardo Garc ıa-Portugu es. Recent advances in directional statistics. Test, 30(1): 1 58, 2021. (Cited on p. 1) Gabriel Peyr e, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. (Cited on p. 2, 4) Julien Rabin, Julie Delon, and Yann Gousseau. Transportation distances on the circle. Journal of Mathematical Imaging and Vision, 41(1):147 167, 2011a. (Cited on p. 2, 3, 4, 17) Julien Rabin, Gabriel Peyr e, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435 446. Springer, 2011b. (Cited on p. 1, 2) Danilo J Rezende and S ebastien Racani ere. Implicit riemannian concave potential maps. ar Xiv preprint ar Xiv:2110.01288, 2021. (Cited on p. 1, 26) Danilo Jimenez Rezende, George Papamakarios, S ebastien Racaniere, Michael Albergo, Gurtej Kanwar, Phiala Shanahan, and Kyle Cranmer. Normalizing flows on tori and spheres. In International Conference on Machine Learning, pp. 8083 8092. PMLR, 2020. (Cited on p. 8, 26, 30, 32, 33) Boris Rubin. Inversion and characterization of the hemispherical transform. Journal d Analyse Math ematique, 77(1):105 128, 1999. (Cited on p. 6, 21) Boris Rubin. Inversion formulas for the spherical radon transform and the generalized cosine transform. Advances in Applied Mathematics, 29(3):471 497, 2002. (Cited on p. 6) Boris Rubin. Notes on radon transforms in integral geometry. Fractional Calculus and Applied Analysis, 6(1):25 72, 2003. (Cited on p. 6, 19, 20) Raif M Rustamov and Subhabrata Majumdar. Intrinsic sliced wasserstein distances for comparing collections of probability distributions on manifolds and graphs. ar Xiv preprint ar Xiv:2010.15285, 2020. (Cited on p. 2, 3) Meyer Scetbon, Marco Cuturi, and Gabriel Peyr e. Low-rank sinkhorn factorization. In International Conference on Machine Learning, pp. 9344 9354. PMLR, 2021. (Cited on p. 1) Suvrit Sra. Directional statistics in machine learning: a brief review. Applied Directional Statistics: Modern Methods and Case Studies, 225:6, 2018. (Cited on p. 1) Published as a conference paper at ICLR 2023 Ilya O. Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Sch olkopf. Wasserstein autoencoders. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. URL https://openreview.net/forum?id=Hk L7n1-0b. (Cited on p. 
9, 34) Gary Ulrich. Computer generation of distributions on the m-sphere. Journal of the Royal Statistical Society: Series C (Applied Statistics), 33(2):158 163, 1984. (Cited on p. 25) C edric Villani. Optimal transport: old and new, volume 338. Springer, 2009. (Cited on p. 1, 2, 23) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, St efan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antˆonio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2. (Cited on p. 33) Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. ar Xiv preprint ar Xiv:1611.01722, 2016. (Cited on p. 31) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929 9939. PMLR, 2020. (Cited on p. 1, 36, 37, 38) Xiao Wang, Qi Lei, and Ioannis Panageas. Fast convergence of langevin dynamics on manifold: Geodesics meet log-sobolev. Advances in Neural Information Processing Systems, 33:18894 18904, 2020. (Cited on p. 31) Michael Werman, Shmuel Peleg, and Azriel Rosenfeld. A distance metric for multidimensional histograms. Computer Vision, Graphics, and Image Processing, 32(3):328 336, 1985. (Cited on p. 4) Andrew TA Wood. Simulation of the von mises fisher distribution. Communications in statisticssimulation and computation, 23(1):157 164, 1994. (Cited on p. 25) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. (Cited on p. 1, 36) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. (Cited on p. 9) Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 4503 4513. Association for Computational Linguistics, 2018. (Cited on p. 1, 9, 27) Mingxuan Yi and Song Liu. Sliced wasserstein variational inference. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021. (Cited on p. 8, 31) Published as a conference paper at ICLR 2023 A.1 PROOF OF PROPOSITION 1 Optimal α. Let µ P2(S1), ν = Unif(S1). Since ν is the uniform distribution on S1, its cdf is the identity on [0, 1] (where we identified S1 and [0, 1]). We can extend the cdf F on the real line as in (Rabin et al., 2011a) with the convention F(y + 1) = F(y) + 1. Therefore, Fν = Id on R. Moreover, we know that for all x S1, (Fν α) 1(x) = F 1 ν (x + α) = x + α and W 2 2 (µ, ν) = inf α R 0 |F 1 µ (t) (Fν α) 1(t)|2 dt. (26) For all α R, let f(α) = R 1 0 F 1 µ (t) (Fν α) 1(t) 2 dt. 
Then, we have, for all $\alpha \in \mathbb{R}$,
$$f(\alpha) = \int_0^1 \big|F_\mu^{-1}(t) - t - \alpha\big|^2 \,\mathrm{d}t = \int_0^1 \big|F_\mu^{-1}(t) - t\big|^2 \,\mathrm{d}t + \alpha^2 - 2\alpha \int_0^1 \big(F_\mu^{-1}(t) - t\big)\,\mathrm{d}t = \int_0^1 \big|F_\mu^{-1}(t) - t\big|^2 \,\mathrm{d}t + \alpha^2 - 2\alpha \Big(\int_0^1 x \,\mathrm{d}\mu(x) - \tfrac12\Big),$$
where we used that $(F_\mu^{-1})_\# \mathrm{Unif}([0,1]) = \mu$. Hence, $f'(\alpha) = 0 \iff \alpha = \int_0^1 x\,\mathrm{d}\mu(x) - \tfrac12$.

Closed-form for empirical distributions. Let $(x_i)_{i=1}^n \in [0,1[^n$ such that $x_1 < \dots < x_n$, and let $\mu_n = \frac1n \sum_{i=1}^n \delta_{x_i}$ be a discrete distribution. To compute the closed-form of $W_2$ between $\mu_n$ and $\nu = \mathrm{Unif}(S^1)$, we first have that the optimal $\alpha$ is $\hat\alpha_n = \frac1n \sum_{i=1}^n x_i - \frac12$. Moreover, we also have
$$W_2^2(\mu_n,\nu) = \int_0^1 \big|F_{\mu_n}^{-1}(t) - (t + \hat\alpha_n)\big|^2\,\mathrm{d}t = \int_0^1 F_{\mu_n}^{-1}(t)^2\,\mathrm{d}t - 2\int_0^1 t\, F_{\mu_n}^{-1}(t)\,\mathrm{d}t - 2\hat\alpha_n \int_0^1 F_{\mu_n}^{-1}(t)\,\mathrm{d}t + \frac13 + \hat\alpha_n + \hat\alpha_n^2. \quad (28)$$
Then, by noticing that $F_{\mu_n}^{-1}(t) = x_i$ for all $t \in [F(x_i), F(x_{i+1})[$, we have
$$\int_0^1 t\, F_{\mu_n}^{-1}(t)\,\mathrm{d}t = \sum_{i=1}^n \int_{(i-1)/n}^{i/n} t\, x_i \,\mathrm{d}t = \frac{1}{2n^2}\sum_{i=1}^n x_i(2i-1), \quad (29)$$
$$\int_0^1 F_{\mu_n}^{-1}(t)^2\,\mathrm{d}t = \frac1n \sum_{i=1}^n x_i^2, \qquad \int_0^1 F_{\mu_n}^{-1}(t)\,\mathrm{d}t = \frac1n\sum_{i=1}^n x_i, \quad (30)$$
and we also have
$$\hat\alpha_n + \hat\alpha_n^2 = \Big(\frac1n \sum_{i=1}^n x_i\Big)^2 - \frac14.$$
Then, by plugging these results into (28), we obtain
$$W_2^2(\mu_n,\nu) = \frac1n\sum_{i=1}^n x_i^2 - \Big(\frac1n\sum_{i=1}^n x_i\Big)^2 + \frac{1}{n^2}\sum_{i=1}^n (n+1-2i)\,x_i + \frac{1}{12}.$$

A.2 PROOF OF EQUATION 17

Let $U \in V_{d,2}$. Then the great circle generated by $U \in V_{d,2}$ is defined as the intersection between $\mathrm{span}(UU^T)$ and $S^{d-1}$, and we have the following characterization:
$$x \in \mathrm{span}(UU^T) \cap S^{d-1} \iff \exists y \in \mathbb{R}^d,\ x = UU^T y \text{ and } \|x\|_2^2 = 1 \iff \exists y \in \mathbb{R}^d,\ x = UU^T y \text{ and } \|UU^T y\|_2^2 = y^T UU^T y = \|U^T y\|_2^2 = 1 \iff \exists z \in S^1,\ x = Uz.$$
And we deduce that, for all $U \in V_{d,2}$ and $x \in S^{d-1}$,
$$P^U(x) = \operatorname*{argmin}_{z \in S^1}\ d_{S^{d-1}}(x, Uz). \quad (33)$$

A.3 PROOF OF LEMMA 1

Let $U \in V_{d,2}$ and $x \in S^{d-1}$ such that $U^T x \neq 0$. Denote $U = (u_1\ u_2)$, i.e. the 2-plane $E$ is $E = \mathrm{span}(UU^T) = \mathrm{span}(u_1, u_2)$ and $(u_1, u_2)$ is an orthonormal basis of $E$. Then, for all $x \in S^{d-1}$, the projection on $E$ is $p_E(x) = \langle u_1, x\rangle u_1 + \langle u_2, x\rangle u_2 = UU^T x$. Now, let us compute the geodesic distance between $x \in S^{d-1}$ and $\frac{p_E(x)}{\|p_E(x)\|_2} \in E \cap S^{d-1}$:
$$d_{S^{d-1}}\Big(x, \frac{p_E(x)}{\|p_E(x)\|_2}\Big) = \arccos\Big(\Big\langle x, \frac{p_E(x)}{\|p_E(x)\|_2}\Big\rangle\Big) = \arccos\big(\|p_E(x)\|_2\big), \quad (34)$$
using that $x = p_E(x) + p_{E^\perp}(x)$. Let $y \in E \cap S^{d-1}$ be another point on the great circle. By the Cauchy-Schwarz inequality, we have
$$\langle x, y\rangle = \langle p_E(x), y\rangle \le \|p_E(x)\|_2 \|y\|_2 = \|p_E(x)\|_2. \quad (35)$$
Therefore, using that $\arccos$ is decreasing on $(-1,1)$,
$$d_{S^{d-1}}(x,y) = \arccos(\langle x, y\rangle) \ge \arccos\big(\|p_E(x)\|_2\big) = d_{S^{d-1}}\Big(x, \frac{p_E(x)}{\|p_E(x)\|_2}\Big).$$
Moreover, we have equality if and only if $y = \lambda p_E(x)$, and since $y \in S^{d-1}$, $|\lambda| = \frac{1}{\|p_E(x)\|_2}$. Using again that $\arccos$ is decreasing, we deduce that the minimum is attained at $y = \frac{p_E(x)}{\|p_E(x)\|_2} = \frac{UU^T x}{\|UU^T x\|_2}$. Finally, using that $\|UU^T x\|_2 = \sqrt{x^T UU^T UU^T x} = \sqrt{x^T UU^T x} = \|U^T x\|_2$, we deduce that
$$P^U(x) = \frac{U^T x}{\|U^T x\|_2}. \quad (37)$$
Finally, by noticing that the projection is unique if and only if $U^T x \neq 0$, and using (Bardelli & Mennucci, 2017, Proposition 4.2), which states that there is a unique projection for a.e. $x$, we deduce that $\{x \in S^{d-1},\ U^T x = 0\}$ is of null measure and hence, for a.e. $x \in S^{d-1}$, we have the result.

A.4 PROOF OF PROPOSITION 2

Let $f \in L^1(S^{d-1})$ and $g \in C_0(S^1 \times V_{d,2})$. Then, by Fubini's theorem,
$$\langle \tilde{R}f, g\rangle_{S^1\times V_{d,2}} = \int_{V_{d,2}}\int_{S^1} \tilde{R}f(z,U)\, g(z,U)\,\mathrm{d}z\,\mathrm{d}\sigma(U) = \int_{V_{d,2}}\int_{S^1}\int_{S^{d-1}} f(x)\,\mathbb{1}_{\{z = P^U(x)\}}\, g(z,U)\,\mathrm{d}x\,\mathrm{d}z\,\mathrm{d}\sigma(U) = \int_{S^{d-1}} f(x) \int_{V_{d,2}}\int_{S^1} g(z,U)\,\mathbb{1}_{\{z=P^U(x)\}}\,\mathrm{d}z\,\mathrm{d}\sigma(U)\,\mathrm{d}x = \int_{S^{d-1}} f(x)\int_{V_{d,2}} g\big(P^U(x), U\big)\,\mathrm{d}\sigma(U)\,\mathrm{d}x = \int_{S^{d-1}} f(x)\,\tilde{R}^* g(x)\,\mathrm{d}x = \langle f, \tilde{R}^* g\rangle_{S^{d-1}}.$$

A.5 PROOF OF PROPOSITION 3

Let $g \in C_0(S^1\times V_{d,2})$. Then
$$\int_{V_{d,2}}\int_{S^1} g(z,U)\,(\tilde{R}\mu)_U(\mathrm{d}z)\,\mathrm{d}\sigma(U) = \int_{S^1\times V_{d,2}} g(z,U)\,\mathrm{d}(\tilde{R}\mu)(z,U) = \int_{S^{d-1}} \tilde{R}^* g(x)\,\mathrm{d}\mu(x) = \int_{S^{d-1}}\int_{V_{d,2}} g\big(P^U(x), U\big)\,\mathrm{d}\sigma(U)\,\mathrm{d}\mu(x) = \int_{V_{d,2}}\int_{S^{d-1}} g\big(P^U(x), U\big)\,\mathrm{d}\mu(x)\,\mathrm{d}\sigma(U) = \int_{V_{d,2}}\int_{S^1} g(z,U)\,\mathrm{d}(P^U_\#\mu)(z)\,\mathrm{d}\sigma(U).$$
Hence, for $\sigma$-almost every $U \in V_{d,2}$, $(\tilde{R}\mu)_U = P^U_\#\mu$.

A.6 STUDY OF THE SPHERICAL RADON TRANSFORM $\tilde{R}$

In this Section, we first discuss the set of integration of the spherical Radon transform $\tilde{R}$ (19).
We further show that it is related to the hemispherical Radon transform and we derive its kernel. Set of integration. While the classical Radon transform integrates over hyperplanes of Rd and the generalized Radon transform integrates over hypersurfaces (Kolouri et al., 2019), the set of integration of the spherical Radon transform (19) is a half of a big circle , i.e. half of the intersection between a hyperplane and Sd 1 (Rubin, 2003). We illustrate this on S2 in Figure 6. On S2, the intersection between a hyperplane and S2 is a great circle. Published as a conference paper at ICLR 2023 Figure 6: Set of integration of the spherical Radon transform (19). The great circle is in black and the set of integration in blue. The point Uz span(UU T ) Sd 1 is in blue. Proposition 6. Let U Vd,2, z S1. The set of integration of (19) is {x Sd 1, P U(x) = z} = {x F Sd 1, x, Uz > 0}, (40) where F = span(UU T ) span(Uz). Proof. Let U Vd,2, z S1. Denote E = span(UU T ) the 2-plane generating the great circle, and E its orthogonal complementary. Hence, E E = Rd and dim(E ) = d 2. Now, let F = E span(Uz). Since Uz = UU T Uz E, we have that dim(F) = d 1. Hence, F is a hyperplane and F Sd 1 is a big circle (Rubin, 2003), i.e. a (d 2)-dimensional subsphere of Sd 1. Now, for the first inclusion, let x {x Sd 1, P U(x) = z}. First, we show that x F Sd 1. By Lemma 1 and hypothesis, we know that P U(x) = U T x U T x 2 = z. By denoting by p E the projection on E, we have: p E(x) = UU T x = U( U T x 2z) = U T x 2Uz span(Uz). (41) Hence, x = p E(x) + x E = U T x 2Uz + x E F. Moreover, as x, Uz = U T x 2 Uz, Uz = U T x 2 > 0, (42) we deduce that x {F Sd 1, x, Uz > 0}. For the other inclusion, let x {F Sd 1, x, Uz > 0}. Since x F, we have x = x E + λUz, λ R. Hence, using Lemma 1, P U(x) = U T x U T x 2 = λ |λ| z z 2 = sign(λ)z. (43) But, we also have x, Uz = λ Uz 2 2 = λ > 0. Therefore, sign(λ) = 1 and P U(x) = z. Finally, we conclude that {x Sd 1, P U(x) = z } = {x F Sd 1, x, Uz > 0}. Link with Hemispherical transform. Since the intersection between a hyperplane and Sd 1 is isometric to Sd 2 (Jung et al., 2012), we can relate R to the hemispherical transform H (Rubin, 2003) on Sd 2. First, the hemispherical transform of a function f L1(Sd 1) is defined as x Sd 1, Hf(x) = Z Sd 1 f(y)1{ x,y >0}dy. (44) From Proposition 6, we can write the spherical Radon transform (19) as a hemispherical transform on Sd 2. Published as a conference paper at ICLR 2023 Proposition 7. Let f L1(Sd 1), U Vd,2 and z S1, then Rf(z, U) = Z Sd 2 f(x)1{ x, Uz >0}dx = H f( Uz), (45) where for all x Sd 2, f(x) = f(OT Jx) with O the rotation matrix such that for all x F, Ox span(e1, . . . , ed 1) where (e1, . . . , ed) denotes the canonical basis, and J = Id 1 01,d 1 U = JT OU R(d 1) 2. Proof. Let f L1(Sd 1), z S1, U Vd,2, then by Proposition 6, Rf(z, U) = Z Sd 1 F f(x)1{ x,Uz >0}dx. (46) F is a hyperplane. Let O Rd d be the rotation such that for all x F, Ox span(e1, . . . , ed 1) = F where (e1, . . . , ed) is the canonical basis. By applying the change of variable Ox = y, and since O 1 = OT , det O = 1, we obtain Rf(z, U) = Z O(F Sd 1) f(OT y)1{ OT y,Uz >0}dy = Z F Sd 1 f(OT y)1{ y,OUz >0}dy. (47) Now, we have that OU Vd,2 since (OU)T (OU) = I2, and since Uz F, OUz F. For all y F, we have y, ed = yd = 0. Let J = Id 1 01,d 1 Rd (d 1), then for all y F Sd 1, y = J y where y Sd 2 is composed of the d 1 first coordinates of y. Let s define, for all y Sd 2, f( y) = f(OT J y), U = JT OU. 
Then, since F Sd 1 = Sd 2, we can write: Rf(z, U) = Z Sd 2 f( y)1{ y, Uz >0}d y = H f( Uz). (48) Kernel of R. By exploiting the expression using the hemispherical transform in Proposition 7, we can derive its kernel in Appendix A.7. A.7 PROOF OF PROPOSITION 4 First, we recall Lemma 2.3 of (Rubin, 1999) on Sd 2. Lemma 2 (Lemma 2.3 (Rubin, 1999)). ker(H) = {µ Meven(Sd 2), µ(Sd 2) = 0} where Meven is the set of even measures, i.e. measures such that for all f C(Sd 2), µ, f = µ, f where f (x) = f( x) for all x Sd 2. Let µ Mac(Sd 1). First, we notice that the density of Rµ w.r.t. λ σ is, for all z S1, U Vd,2, ( Rµ)(z, U) = Z Sd 1 1{P U(x)=z}dµ(x) = Z F Sd 1 1{ x,Uz >0}dµ(x). (49) Indeed, using Proposition 2, and Proposition 6, we have for all g C0(S1 Vd,2), Rµ, g S1 Vd,2 = µ, R g Sd 1 = Z Sd 1 R g(x)dµ(x) S1 g(z, U)1{z=P U(x)}dzdσ(U)dµ(x) Vd,2 S1 g(z, U) Z Sd 1 1{z=P U(x)}dµ(x) dzdσ(U) Vd,2 S1 g(z, U) Z F Sd 1 1{ x,Uz >0}dµ(x) dzdσ(U). Published as a conference paper at ICLR 2023 Hence, using Proposition 7, we can write ( Rµ)(z, U) = (H µ)( Uz) where µ = JT #O#µ. Now, let µ ker( R), then for all z S1, U Vd,2, Rµ(z, U) = H µ( Uz) = 0 and hence µ ker(H) = { µ Meven(Sd 2), µ(Sd 2) = 0}. First, let s show that µ Meven(Sd 1). Let f C(Sd 1) and U Vd,2, then, by using the same notation as in Propositions 6 and 7, we have µ, f Sd 1 = Z Sd 1 f(x)dµ(x) = Z S1 f(x)1{z=P U(x)} dz dµ(x) Sd 1 f(x)1{z=P U(x)}dµ(x)dz F Sd 1 f(x)1{ x,Uz >0}dµ(x)dz by Prop. 6 Sd 2 f(y)1{ y, Uz >0}d µ(y)dz S1 H µ, f Sd 2 dz S1 µ, H f Sd 2 dz S1 µ, (H f) Sd 2 dz since µ Meven Sd 1 f (x)dµ(x) = µ, f Sd 1, using for the last line all the opposite transformations. Therefore, µ Meven(Sd 1). Now, we need to find on which set the measure is null. We have z S1, U Vd,2, µ(Sd 2) = 0 z S1, U Vd,2, µ(O 1((JT ) 1(Sd 2))) = µ(F Sd 1) = 0. (52) Hence, we deduce that ker( R) = {µ Meven(Sd 1), U Vd,2, z S1, F = span(UU T ) span(Uz), µ(F Sd 1) = 0}. (53) Moreover, we have that U,z FU,z Sd 1 = {H Sd 1 Rd, dim(H) = d 1}. Indeed, on the one hand, let H an hyperplane, x H Sd 1, U Vd,2, and note z = P U(x). Then, x F Sd 1 by Proposition 6 and H Sd 1 U,z FU,z. On the other hand, let U Vd,2, z S1, F is a hyperplane since dim(F) = d 1 and therefore F Sd 1 {H, dim(H) = d 1}. Finally, we deduce that ker( R) = µ Meven(Sd 1), H Gd,d 1, µ(H Sd 1) . (54) A.8 PROOF OF PROPOSITION 5 Let p 1. First, it is straightforward to see that for all µ, ν Pp(Sd 1), SSWp(µ, ν) 0, SSWp(µ, ν) = SSWp(ν, µ), µ = ν = SSWp(µ, ν) = 0 and that we have the triangular Published as a conference paper at ICLR 2023 inequality since µ, ν, α Pp(Sd 1), SSWp(µ, ν) = Z Vd,2 W p p (P U # µ, P U # ν) dσ(U) 1 Wp(P U # µ, P U # α) + Wp(P U # α, P U # ν) p dσ(U) 1 Vd,2 W p p (P U # µ, P U # α) dσ(U) 1 Vd,2 W p p (P U # α, P U # ν) dσ(U) 1 = SSWp(µ, α) + SSWp(α, ν), (55) using the triangular inequality for Wp and the Minkowski inequality. Therefore, it is at least a pseudo-distance. To be a distance, we also need SSWp(µ, ν) = 0 = µ = ν. Suppose that SSWp(µ, ν) = 0. Since, for all U Vd,2, W p p (P U # µ, P U # ν) 0, SSW p p (µ, ν) = 0 implies that for σ-ae U Vd,2, W p p (P U # µ, P U # ν) = 0 and hence P U # µ = P U # ν or ( Rµ)U = ( Rν)U for σ-ae U Vd,2 since Wp is a distance on the circle. Therefore, it is a distance on the sets of injectivity of R. A.9 ADDITIONAL PROPERTIES In this Section, we derive additional properties of SSW. First, we will show that the weak convergence implies the convergence w.r.t SSW. Then, we will show that the sample complexity is independent of the dimension. 
Finally, we will derive the projection complexity of SSW. Convergence Properties. Proposition 8. Let (µk), µ Pp(Sd 1) such that µk k µ, then SSWp(µk, µ) k 0. (56) Proof. Since the Wasserstein distance metrizes the weak convergence (Corollary 6.11 (Villani, 2009)), we have P U # µk k P U # µ (by continuity) W p p (P U # µk, P U # µ) k 0 and hence by the dominated convergence theorem, SSW p p (µk, µ) k 0. Sample Complexity. We show here that the sample complexity is independent of the dimension. Actually, this is a well known properties of sliced-based distances and it was studied first in (Nadjahi et al., 2020). To the best of our knowledge, the sample complexity of the Wasserstein distance on the circle has not been previously derived. We suppose in the next proposition that it is known as we mainly want to show that the sample complexity of SSW does not depend on the dimension. Proposition 9. Let p 1. Suppose that for µ, ν P(S1), with empirical measures ˆµn = 1 n Pn i=1 δxi and ˆνn = 1 n Pn i=1 δyi, where (xi)i µ, (yi)i ν are independent samples, we have E[|W p p (ˆµn, ˆνn) W p p (µ, ν)|] β(p, n). (57) Then, for any µ, ν Pp,ac(Sd 1) with empirical measures ˆµn and ˆνn, we have E[|SSW p p (ˆµn, ˆνn) SSW p p (µ, ν)|] β(p, n). (58) Published as a conference paper at ICLR 2023 Proof. By using the triangle inequality, Fubini-Tonelli, and the hypothesis on the sample complexity of W p p on S1, we obtain: E[|SSW p p (ˆµn, ˆνn) SSW p p (µ, ν)|] = E W p p (P U # ˆµn, P U # ˆνn) W p p (P U # µ, P U # ν) dσ(U) W p p (P U # ˆµn, P U # ˆνn) W p p (P U # µ, P U # ν) dσ(U) Vd,2 E W p p (P U # ˆµn, P U # ˆνn) W p p (P U # µ, P U # ν) dσ(U) Vd,2 β(p, n) dσ(U) = β(p, n). (59) Projection Complexity. We derive in the next proposition the projection complexity, which refers to the convergence rate of the Monte Carlo approximate w.r.t of the number of projections L towards the true integral. Note that we find the typical rate of Monte Carlo estimates, and that it has already been derive for sliced-based distances in (Nadjahi et al., 2020). Proposition 10. Let p 1, µ, ν Pp,ac(Sd 1). Then, the error made with the Monte Carlo estimate of SSWp can be bounded as EU h |\ SSW p p,L(µ, ν) SSW p p (µ, ν)| i2 1 W p p (P U # µ, P U # ν) SSW p p (µ, ν) 2 dσ(U) LVar U W p p (P U # µ, P U # ν) , (60) where \ SSW p p,L(µ, ν) = 1 L PL i=1 W p p (P Ui # µ, P U i # ν) with (Ui)L i=1 σ independent samples. Proof. Let (Ui)L i=1 be iid samples of σ. Then, by first using Jensen inequality and then remembering that EU[W p p (P U # µ, P U # ν)] = SSW p p (µ, ν), we have EU h |\ SSW p p,L(µ, ν) SSW p p (µ, ν)| i2 EU \ SSW p p,L(µ, ν) SSW p p (µ, ν) 2 W p p (P Ui # µ, P Ui # ν) SSW p p (µ, ν) i=1 W p p (P Ui # µ, P Ui # ν) LVar U W p p (P U # µ, P U # ν) W p p (P U # µ, P U # ν) SSW p p (µ, ν) 2 dσ(U). B BACKGROUND ON THE SPHERE B.1 UNIQUENESS OF THE PROJECTION Here, we discuss the uniqueness of the projection P U for almost every x. For that, we recall some results of (Bardelli & Mennucci, 2017). Published as a conference paper at ICLR 2023 Let M be a closed subset of a complete finite-dimensional Riemannian manifold N. Let d be the Riemannian distance on N. Then, the distance from the set M is defined as d M(x) = inf y M d(x, y). (62) The infimum is a minimum since M is closed and N locally compact, but the minimum might not be unique. When it is unique, let s denote the point which attains the minimum as π(x), i.e. d(x, π(x)) = d M(x). Proposition 11 (Proposition 4.2 in (Bardelli & Mennucci, 2017)). 
Let M be a closed set in a complete m-dimensional Riemannian manifold N. Then, for almost every x, there exists a unique point π(x) M that realizes the minimum of the distance from x. From this Proposition, they further deduce that the measure π#γ is well defined on M with γ a locally absolutely continuous measure w.r.t. the Lebesgue measure. In our setting, for all U Vd,2, we want to project a measure µ P(Sd 1) on the great circle span(UU T ) S 1. Hence, we have N = Sd 1 which is a complete finite-dimensional Riemannian manifold and M = span(UU T ) Sd 1 a closed set in N. Therefore, we can apply Proposition 11 and the push-forward measures are well defined for absolutely continuous measures. B.2 OPTIMIZATION ON THE SPHERE Let F : Sd 1 R be some functional on the sphere. Then, we can perform a gradient descent on a Riemannian manifold by following the geodesics, which are the counterpart of straight lines in Rd. Hence, the gradient descent algorithm (Absil et al., 2009; Bonnabel, 2013) reads as k 0, xk+1 = expxk γgradf(x) , (63) where for all x Sd 1, expx : Tx Sd 1 Sd 1 is a map from the tangent space Tx Sd 1 = {v Rd, x, v = 0} to Sd 1 such that for all v Tx Sd 1, expx(v) = γv(1) with γv the unique geodesic starting from x with speed v, i.e. γ(0) = x and γ (0) = v. For Sd 1, the exponential map is known and is x Sd 1, v Tx Sd 1, expx(v) = cos( v 2)x + sin( v 2) v v 2 . (64) Moreover, the Riemannian gradient on Sd 1 is known as (Absil et al., 2009, Eq. 3.37) gradf(x) = Projx( f(x)) = f(x) f(x), x x, (65) Projx denoting the orthogonal projection on Tx Sd 1. For more details, we refer to (Absil et al., 2009; Boumal, 2022). B.3 VON MISES-FISHER DISTRIBUTION The von Mises-Fisher (v MF) distribution is a distribution on Sd 1 characterized by a concentration parameter κ > 0 and a location parameter µ Sd 1 through the density θ Sd 1, fv MF(θ; µ, κ) = κd/2 1 (2π)d/2Id/2 1(κ) exp(κµT θ), (66) where Iν(κ) = 1 2π R π 0 exp(κ cos(θ)) cos(νθ)dθ is the modified Bessel function of the first kind. Several algorithms allow to sample from it, see e.g. (Wood, 1994; Ulrich, 1984) for algorithms using rejection sampling or (Kurz & Hanebeck, 2015) without rejection sampling. For d = 1, the v MF coincides with the von Mises (v M) distribution, which has for density θ [ π, π[, fv M(θ; µ, κ) = 1 I0(κ) exp(κ cos(θ µ)), (67) with µ [0, 2π[ the mean direction and κ > 0 its concentration parameter. We refer to (Mardia et al., 2000, Section 3.5 and Chapter 9) for more details on these distributions. Published as a conference paper at ICLR 2023 In particular, for κ = 0, the v MF (resp. v M) distribution coincides with the uniform distribution on the sphere (resp. the circle). Jung (2021) studied the law of the projection of a v MF on a great circle. In particular, they showed that, while the v MF plays the role of the normal distributions for directional data, the projection actually does not follow a von Mises distribution. More precisely, they showed the following theorem: Theorem 1 (Theorem 3.1 in (Jung, 2021)). Let d 3, X v MF(µ, κ) Sd 1, U Vd,2 and T = P U(X) the projection on the great circle generated by U. Then, the density function of T is t [ π, π[, f(t) = Z 1 0 f R(r)fv M(t; 0, κ cos(δ)r) dr, (68) where δ is the deviation of the great circle (geodesic) from µ and the mixing density is r ]0, 1[, f R(r) = 2 I ν(κ)I0(κ cos(δ)r)r(1 r2)ν 1I ν 1(κ sin(δ) p 1 r2), (69) with ν = (d 2)/2 and I ν(z) = ( z 2) νIν(z) for z > 0, I ν(0) = 1/Γ(ν + 1). Hence, as noticed by Jung (2021), in the particular case κ = 0, i.e. 
X Unif(Sd 1), then 0 f R(r)fv M(t; 0, 0) dr = fv M(t;0,0) 0 f R(r)dr = fv M(t; 0, 0), (70) and hence T Unif(S1). B.4 NORMALIZING FLOWS ON THE SPHERE Normalizing flows (Papamakarios et al., 2021) are invertible transformations. There has been a recent interest in defining such transformations on manifolds, and in particular on the sphere (Rezende et al., 2020; Cohen et al., 2021; Rezende & Racani ere, 2021). Exponential map normalizing flows. Here, we implemented the Exponential map normalizing flows introduced in (Rezende et al., 2020). The transformation T is x Sd 1, z = T(x) = expx Projx( ϕ(x)) , (71) where ϕ(x) = PK i=1 αi βi eβi(x T µi 1), αi 0, P i αi 1, µi Sd 1 and βi > 0 for all i. (αi)i, (βi)i and (µi)i are the learnable parameters. The density of z can be obtained as p Z(z) = p X(x) det E(x)T JT (x)T JT (x)E(x) 1 where Jf is the Jacobian in the embedded space and E(x) it the matrix whose columns form an orthonormal basis of Tx Sd 1. The common way of training normalizing flows is to use either the reverse or forward KL divergence. Here, we use them with a different loss, namely SSW. Stereographic projection. The stereographic projection ρ : Sd 1 Rd 1 maps the sphere Sd 1 to the Euclidean space. A strategy first introduced in (Gemici et al., 2016) is to use it before applying a normalizing flows in the Euclidean space in order to map some prior, and which allows to perform density estimation. More precisely, the stereographic projection is defined as x Sd 1, ρ(x) = x2:d 1 + x1 , (73) and its inverse is u Rd 1, ρ 1(u) = 2 u u 2 2+1 1 2 u 2 2+1 Published as a conference paper at ICLR 2023 Gemici et al. (2016) derived the change of variable formula for this transformation, which comes from the theory of probability between manifolds. If we have a transformation T = f ρ, where f is a normalizing flows on Rd 1, e.g. a Real NVP (Dinh et al., 2016), then the log density of the target distribution can be obtained as log p(x) = log p Z(z) + log | det Jf(z)| 1 2 log | det JT ρ 1Jρ 1(ρ(x))| = log p Z(z) + log | det Jf(z)| d log 2 ρ(x) 2 2 + 1 where we used the formula of (Gemici et al., 2016) for the change of variable formula of ρ, and where p Z is the density of some prior on Rd 1, typically of a standard Gaussian. We refer to (Gemici et al., 2016; Mathieu & Nickel, 2020) for more details about these transformations. C ADDITIONAL EXPERIMENTS C.1 EVOLUTION OF SSW BETWEEN VON MISES-FISHER DISTRIBUTIONS The KL divergence between the von Mises-Fisher distribution and the uniform distribution has been derived analytically in (Davidson et al., 2018; Xu & Durrett, 2018) as KL v MF(µ, κ)||v MF( , 0) = κ Id/2(κ) Id/2 1(κ) + d 2 1 log κ d 2 log(2π) log Id/2 1(κ) 2 log π + log 2 log Γ d We plot on Figure 7 the evolution of KL and SSW w.r.t. κ for different dimensions. We observe a different trend. SSW seems to get lower with the dimension contrary to KL. 0 50 100 150 200 250 d = 3 d = 25 d = 50 d = 100 (a) KL v MF(µ, κ)||v MF( , 0) 0 50 100 150 200 250 d = 3 d = 25 d = 50 d = 100 (b) SW 2 2 v MF(µ, κ)||v MF( , 0) Figure 7: Evolution w.r.t κ between v MK(µ, κ) and v MF( , 0). For SW, we used 100 projections (for memory reasons for d = 100), and computed it for κ {1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250}, 10 times by dimension and κ, and with 500 samples of both distributions. As a sanity check, we compare on Figure 8 the evolution of SSW between v MF distributions where we fix v MF(µ0, 10) and we rotate the first v MF along a great circle. 
More precisely, we plot $SW_2^2\big(\mathrm{vMF}((1,0,0,\dots), 10), \mathrm{vMF}((\cos\theta, \sin\theta, 0, \dots), 10)\big)$ for $\theta \in \{\frac{k\pi}{6}\}_{k \in \{0,\dots,12\}}$. As expected, we obtain a bell shape which is maximal when the second vMF distribution has $-\mu_0$ for location parameter. We observe a similar behavior between SSW2, SSW1 and SW2, with different scales.

Figure 8: Evolution of SW between vMF samples in $S^{d-1}$, for $d \in \{3, 25, 50\}$ (mean over 100 batches).

On Figure 9, we plot the evolution of SSW w.r.t. the number of projections for different dimensions. We observe that for around 100 projections, the variance seems to be low enough.

Figure 9: Influence of the number of projections. We compute $SW_2^2\big(\mathrm{vMF}(\mu, \kappa)\,\|\,\mathrm{vMF}(\cdot, 0)\big)$ 20 times, for n = 500 samples in dimension d = 3.

Nadjahi et al. (2020) proved that, contrary to the Wasserstein distance, the classical sliced-Wasserstein distance has a sample complexity independent of the dimension d. As shown in Proposition 9, we have similar results for SSW. We show it empirically on Figure 10 by plotting SSW and the Wasserstein distance (with geodesic distance) between samples of the uniform distribution on the sphere w.r.t. the number of samples. We observe indeed that the convergence rate of SSW is independent of the dimension.

Figure 10: Spherical Sliced-Wasserstein and Wasserstein distance (with geodesic distance) between samples of the uniform distribution on the sphere, w.r.t. the number of samples. Results are averaged over 20 runs and the shaded area corresponds to the standard deviation.

C.2 RUNTIME COMPARISONS

We study here the evolution of the runtime w.r.t. different parameters. On Figure 11, we plot for several dimensions the runtime to compute SSW2 w.r.t. the number of projections and the number of samples. We observe the linearity w.r.t. the number of projections and the quasi-linearity w.r.t. the number of samples.

Figure 11: Computation time w.r.t. the number of projections or samples, taken for κ = 10 and n = 500 samples for the left figure, κ = 10 and 200 projections for the right figure, and averaged over 20 runs.

C.3 GRADIENT FLOWS

Mixture of vMF distributions. For the experiment in Section 5.1, we use as target distribution a mixture of 6 vMF distributions from which we have access to samples. We refer to Appendix B.3 for background on vMF distributions. The 6 vMF distributions have weights 1/6, concentration parameter κ = 10 and location parameters $\mu_1 = (1, 0, 0)$, $\mu_2 = (0, 1, 0)$, $\mu_3 = (0, 0, 1)$, $\mu_4 = (-1, 0, 0)$, $\mu_5 = (0, -1, 0)$ and $\mu_6 = (0, 0, -1)$. We use two different approximations of the distribution. First, we approximate it using the empirical distribution, i.e. $\hat\mu = \frac1n\sum_{i=1}^n \delta_{x_i}$, and we optimize over the particles $(x_i)_{i=1}^n$. To optimize over particles, we can either use a projected gradient descent,
$$\begin{cases} x^{(k+1)} = x^{(k)} - \gamma \nabla_{x^{(k)}} SSW_2^2(\hat\mu_k, \nu) \\ x^{(k+1)} \leftarrow \dfrac{x^{(k+1)}}{\|x^{(k+1)}\|_2}, \end{cases} \quad (77)$$
or a Riemannian gradient descent on the sphere (Absil et al., 2009) (see Appendix B.2 for more details). Note that the projected gradient descent is a Riemannian gradient descent with retraction (Boumal, 2022).
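As an illustration of this procedure, the following sketch (ours, not the paper's released implementation; the names `ssw2_uniform` and `projected_gradient_step`, as well as the QR-based sampling of U, are illustrative choices) shows one projected gradient step (77) for the special case where the target ν is the uniform distribution on the sphere, so that each projected measure is compared to Unif(S¹) and the per-slice loss is the closed form of Appendix A.1.

```python
# Minimal sketch, assuming PyTorch and a uniform target nu = Unif(S^{d-1}).
import torch

def ssw2_uniform(x, n_projs=200):
    """Monte Carlo estimate of SSW_2^2 between the empirical measure of x
    (n points on S^{d-1}) and Unif(S^{d-1})."""
    n, d = x.shape
    # Draw projections U in the Stiefel manifold V_{d,2} via QR of Gaussian matrices
    U, _ = torch.linalg.qr(torch.randn(n_projs, d, 2, device=x.device, dtype=x.dtype))
    # Projection on each great circle: P^U(x) = U^T x / ||U^T x||_2 (Lemma 1)
    xp = torch.einsum('ldk,nd->lnk', U, x)
    xp = xp / torch.linalg.norm(xp, dim=-1, keepdim=True)
    # Coordinates on the circle, identified with [0, 1)
    t = (torch.atan2(xp[..., 1], xp[..., 0]) + torch.pi) / (2 * torch.pi)
    t, _ = torch.sort(t, dim=-1)
    i = torch.arange(1, n + 1, device=x.device, dtype=x.dtype)
    # Closed form of W_2^2(mu_n, Unif(S^1)) per slice (Appendix A.1), averaged over slices
    w2 = (t.pow(2).mean(-1) - t.mean(-1).pow(2)
          + ((n + 1 - 2 * i) * t).sum(-1) / n ** 2 + 1.0 / 12)
    return w2.mean()

def projected_gradient_step(x, lr=1.0, n_projs=200):
    """One projected gradient step (77): Euclidean gradient step, then renormalization."""
    x = x.detach().requires_grad_(True)
    grad, = torch.autograd.grad(ssw2_uniform(x, n_projs), x)
    with torch.no_grad():
        y = x - lr * grad
        return y / torch.linalg.norm(y, dim=-1, keepdim=True)

# Usage: flow 500 particles on S^2 towards the uniform distribution
x = torch.randn(500, 3)
x = x / torch.linalg.norm(x, dim=-1, keepdim=True)
for _ in range(100):
    x = projected_gradient_step(x)
```

For a non-uniform target such as the mixture of vMF above, each slice instead requires the Wasserstein distance between two empirical distributions on the circle (Rabin et al., 2011a), which no longer reduces to this closed form.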
We can also use neural networks such as a multilayer perceptron (MLP). We used a MLP composed of 5 layers of 100 units with leaky relu activation functions. The output of the MLP is normalized on the sphere using a ℓ2 normalization. We perform a gradient descent using Adam (Kingma & Ba, 2014) as the optimizer with a learning rate of 10 4 for 2000 epochs. We approximate SSW with L = 1000 projections and a batch size of 500. The base distribution is choose as the uniform distribution on the sphere. We report on Figure 12 a comparison of the 2 approximations where the density is estimated with a Gaussian kernel density estimator. (a) Target: Mixture of v MF (b) KDE estimate of the MLP (c) KDE estimate of 500 particles Figure 12: Minimization of SSW with respect to a mixture of v MF. Published as a conference paper at ICLR 2023 v MF distribution. A a simpler experiment, we choose a simple v MF distribution with κ = 10. We report on Figure 13 the evolution of the density approximated using a KDE, and on Figure 14 the evolution of particles. (a) Target distribution (b) Initial samples (c) Approximated distribution at the epoch 100 (d) Approximated distribution at the epoch 1000 Figure 13: Gradient Flows on SW with a v MF target and Mollweide projections. The distributions are approximated using KDE. (a) Initial samples (b) Approximated distribution at the epoch 100 (c) Approximated distribution at the epoch 1000 Figure 14: Gradient Flows on SW with a v MF target and Mollweide projections. C.4 EARTH DATA ESTIMATION Let T be a normalizing flow (NF). For a density estimation task, we have access to a distribution µ through samples (xi)n i=1, i.e. through the empirical measure ˆµn = 1 n Pn i=1 δxi . And the goal is to find an invertible transformation T such that T#µ = p Z, where p Z is a prior distribution for which we know the density. In that case, indeed, the density of µ, denoted as fµ can be obtained as x, fµ(x) = p Z(T(x))| det JT (x)|. (78) For the invertible transform, we propose to use normalizing flows on the sphere (see Appendix B.4). We use two different normalizing flows, exponential map normalizing flows (Rezende et al., 2020) and Real NVP (Dinh et al., 2016) + stereographic projection (Gemici et al., 2016) which we call Stereo in Table 1. To fit T#µ = p Z, we use either SSW, SW on the sphere, or SW on Rd 1 for the stereographic projection based NF. For the exponential map normalizing flow, we compose 48 blocks, each one with 100 components. These transformations have 24000 parameters. For Real NVP, we compose 10 blocks of Real NVPs, with shifting and scaling as multilayer perceptron, composed of 10 layers, 25 hidden units and with leaky relu of parameters 0.2 for the activation function. The number of parameters of these networks are 27520. For the training process, we perform 20000 epochs with full batch size. We use Adam as optimizer with a learning rate of 10 1. For the sterographic NF, we use a learning rate of 10 3. We report in Table 3 details of the datasets. Published as a conference paper at ICLR 2023 Algorithm 2 SWVI (Yi & Liu, 2021) Input: V a potential, K the number of iterations of SWVI, N the batch size, ℓthe number of MCMC steps Initialization: Choose qθ a sampler for k = 1 to K do Sample (z0 i )N i=1 qθ Run ℓMCMC steps starting from (z0 i )N i=1 to get (zℓ j)N j=1 // Denote ˆµ0 = 1 N PN j=1 δz0 j and ˆµℓ= 1 N PN j=1 δzℓ j Compute J = SW 2 2 (ˆµ0, ˆµℓ) Backpropagate through J w.r.t. θ Perform a gradient step end for Table 3: Details of Earth datasets. 
                  Earthquake   Flood    Fire
Train set size         4284    3412     8966
Test set size          1836    1463     3843
Data size              6120    4875    12809

C.5 SLICED-WASSERSTEIN VARIATIONAL INFERENCE

C.5.1 VARIATIONAL INFERENCE

In variational inference (VI) (Jordan et al., 1999; Blei et al., 2017), we have some observed data $(x_i)_{i=1}^n$ and some latent data $(z_i)_{i=1}^n$. The goal of variational inference is to approximate the posterior distribution $p(\cdot|x)$ by some distribution $q \in \mathcal{Q}$, where $\mathcal{Q}$ is a family of probabilities. The usual way of doing that is to minimize the Kullback-Leibler divergence over this family, i.e.
$$\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q \,\|\, p(\cdot|x)\big) = \mathbb{E}_q\Big[\log \frac{q(Z)}{p(Z|x)}\Big].$$
But the KL divergence suffers from some drawbacks, as it is only a divergence (i.e. it does not satisfy the triangular inequality, and it is not symmetric), and it also suffers from underestimating the target distribution (or overestimating it for the reverse KL). Yi & Liu (2021) propose to use an optimal transport distance instead, namely the SW distance, which gives the sliced-Wasserstein variational inference method. Basically, given some unnormalized probability $p(\cdot|x)$ that we want to approximate with some variational distribution $q_\phi$, we can first apply a MCMC algorithm and then learn $q_\phi$ using a gradient descent on SW, with the target being the empirical distribution of the samples given by the MCMC. But running long MCMC chains is time consuming and it might be difficult to diagnose the burn-in period. Therefore, they propose to only run, at each iteration, a small number $t$ of MCMC steps, and then learn the variational distribution by gradient descent. Therefore, the variational distribution is guided at each step by the MCMC samples toward the stationary distribution, which is the target. This is called an amortized sampler (see Problem 1 in (Wang & Liu, 2016)). We sum up the procedure in Algorithm 2. We propose here to substitute SW by SSW in order to perform SSWVI on the sphere. To do that, we first need a MCMC method on the sphere.

C.5.2 MCMC ON THE SPHERE

Several MCMC methods on the sphere have been proposed. For example, Hamiltonian Monte-Carlo (HMC) methods were proposed in (Byrne & Girolami, 2013; Lan et al., 2014; Liu et al., 2016), and Riemannian Langevin algorithms were proposed in (Li & Erdogdu, 2020; Wang et al., 2020). In our experiments, we use the Geodesic Langevin algorithm (GLA) introduced by Wang et al. (2020). This algorithm is a natural generalization of the Unadjusted Langevin Algorithm (ULA) and it simply consists in following the geodesics for the regular ULA step, i.e.
$$\forall k \ge 0, \quad x_{k+1} = \exp_{x_k}\Big(\mathrm{Proj}_{x_k}\big(-\gamma \nabla V(x_k) + \sqrt{2\gamma}\, Z\big)\Big), \quad Z \sim \mathcal{N}(0, I), \quad (80)$$
where, for the sphere,
$$\forall x \in S^{d-1},\ \forall v \in T_x S^{d-1}, \quad \exp_x(v) = x \cos(\|v\|) + \frac{v}{\|v\|}\sin(\|v\|), \quad (81)$$
and $\mathrm{Proj}_x$ is the projection on the tangent space $T_x S^{d-1} = \{v \in \mathbb{R}^d,\ \langle x, v\rangle = 0\}$ (which is the orthogonal space) and is defined as
$$\mathrm{Proj}_x(v) = v - \langle x, v\rangle x. \quad (82)$$
For more details, we refer to (Absil et al., 2009). We use GLA here for simplicity and as a proof of concept. But note that GLA, as ULA, is biased, and therefore the learned distribution will not be the exact true stationary distribution. However, a Metropolis-Hastings step at each iteration could be used to enforce the reversibility w.r.t. the target distribution, or we could use other MCMC methods with more appealing convergence properties (see e.g. Liu et al. (2016)).

C.5.3 APPLICATIONS

Target: Power spherical distribution. First, as a simple example on $S^2$, we use the power spherical distribution introduced by De Cao & Aziz (2020).
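Before detailing this power spherical target, here is a minimal sketch of one GLA step (80)-(82); it assumes a differentiable potential `V` implemented in PyTorch (the vMF-like potential in the usage lines is only an illustrative choice), and it is not meant to reproduce the authors' implementation.

```python
# Minimal GLA sketch on the sphere, assuming a differentiable potential V.
import torch

def gla_step(x, V, step=1e-3):
    """One Geodesic Langevin step: ULA update in the tangent space, then
    exponential map back onto S^{d-1} (equations (80)-(82)).
    x: (n, d) points on the sphere, V: potential mapping (n, d) -> (n,)."""
    x = x.detach().requires_grad_(True)
    grad_V, = torch.autograd.grad(V(x).sum(), x)
    v = -step * grad_V + (2 * step) ** 0.5 * torch.randn_like(x)
    # Projection on the tangent space T_x S^{d-1} (82)
    v = v - (x * v).sum(-1, keepdim=True) * x
    # Exponential map on the sphere (81)
    norm_v = torch.linalg.norm(v, dim=-1, keepdim=True).clamp_min(1e-12)
    return (x * torch.cos(norm_v) + torch.sin(norm_v) * v / norm_v).detach()

# Usage with an assumed vMF-like potential V(x) = -kappa * <mu, x>
mu = torch.tensor([0.0, 1.0, 0.0])
V = lambda x: -10.0 * x @ mu
x = torch.randn(500, 3)
x = x / torch.linalg.norm(x, dim=-1, keepdim=True)
for _ in range(20):
    x = gla_step(x, V, step=1e-3)
```

We now come back to the power spherical target.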
This distribution has the advantage over the v MF distribution to allow for the direct use of the reparameterization trick since it does not require rejection sampling. The pdf is obtained as, x Sd 1, p X(x; µ, κ) (1 + µT x)κ (83) with µ Sd 1 and κ > 0. We can sample from drawing first Z Beta( d 1 2 ), v Unif(Sd 2), then constructing T = 2Z 1 and Y = [T, v T 1 T 2]T . Finally, apply a Householder reflection about µ to Y . All the operations are well differentiable and allow to apply the reparametrization trick. For the algorithm, see Algorithm 1 in (De Cao & Aziz, 2020). Hence, in this case, if we denote gθ the map which takes samples from a uniform distribution on Sd 2 and from a Beta distribution as input and outputs samples of power spherical distribution with parameters θ = (κ, µ), we can use it as the sampler. We test the algorithm with a target being a power spherical distribution of parameter µ = (0, 1, 0) and κ = 10, starting from µ = (1, 1, 1) and κ = 0.1. Performing 2000 optimization steps with a gradient descent (Riemannian gradient descent on µ to stay on the sphere), and 20 steps of the GLA algorithm, we are getting close enough to the true distribution as we can see on Figure 15. For the hyperparameters, we used a step size of 10 3 for GLA, 1000 projections to approximate SSW, a Riemannian gradient descent on the sphere (Absil et al., 2009) to learn the location parameter µ with a learning rate of 2, and a learning of 200 for κ. We performed K = 2000 steps and used N = 500 particles. (a) Target distribution (b) Initial sampler (c) Approximated distribution at the epoch 1000 (d) Approximated distribution at the epoch 2000 Figure 15: SWVI on Power Spherical Distributions with Mollweide projections. Target: mixture of v MFs. In Section 5.1, we perform amortized variational inference with a mixture of v MF distributions as target. For this, we train exponential map normalizing flows (see (Rezende et al., 2020) and Appendix B.4). Moreover, we use the same target as Rezende et al. (2020), Published as a conference paper at ICLR 2023 i.e. the target ν has a density p(x) P4 k=1 e10x T Ts e(µk) with µ1 = (0.7, 1.5), µ2 = ( 1, 1), µ3 = (0.6, 0.5) and µ4 = ( 0.7, 4). These are spherical coordinates which are be converted to euclidean using Ts e(θ, ϕ) = (sin ϕ cos θ, sin ϕ sin θ, cos ϕ). The exponential map normalizing flow is composed of N = 6 blocks with K = 5 components. We run the algorithm for 10000 iterations, with at each iteration 20 steps of GLA with γ = 10 1 as learning rate, and one step of backpropagation through SSW using the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10 3. We report on Figure 16 the Mollweide projection of the learned density. Since we learn to samples from a noise distribution, here the uniform distribution on the sphere, we do not have directly access to the density and we report a kernel density estimate with a Gaussian kernel using the implementation of Scipy (Virtanen et al., 2020). (a) Mixture of v MF Target distribution (b) Density learned with NF Figure 16: SSWVI on mixture of v MF 0 2000 4000 6000 8000 10000 Iterations Figure 17: Comparison of the ESS between SWVI et SSWVI with the mixture target (mean over 10 runs). We also report in Figure 17 the effective sample size (ESS) (Doucet et al., 2001; Liu & Chen, 1995) over the iterations. The ESS is estimated by (Rezende et al., 2020) ESS = Var Unif(e βu(X)) Varq e βu(X) PS s=1 ws 2 PS s=1 w2s , (84) where ws = e βu(xs)/qη(xs). The ESS is reported as a percentage of the sample size. 
Higher ESS indicates that the flow matches the target better (Rezende et al., 2020).

C.6 SLICED-WASSERSTEIN AUTOENCODER

We recall that in the WAE framework, we want to minimize
$$\mathcal{L}(f, g) = \int c\big(x, g(f(x))\big)\,\mathrm{d}\mu(x) + \lambda D(f_\#\mu, p_Z), \quad (85)$$
where $f$ is an encoder, $g$ a decoder, $p_Z$ a prior distribution, $c$ some cost function and $D$ a divergence in the latent space. Several choices of $D$ were proposed. For example, Tolstikhin et al. (2018) proposed to use the MMD, Kolouri et al. (2018) used the SW distance, Patrini et al. (2020) used the Sinkhorn divergence, and Kolouri et al. (2019) used the generalized SW distance. Here, we use $D = SSW_2^2$.

Architecture and procedure. We first detail the hyperparameters and architectures of the neural networks for MNIST and Fashion MNIST. For the encoder $f$ and the decoder $g$, we use the same architecture as Kolouri et al. (2018). For both the encoder and the decoder, we use fully convolutional architectures with 3x3 convolutional filters. More precisely, the architecture of the encoder is

$x \in \mathbb{R}^{28\times 28}$ → Conv2d16 → LeakyReLU0.2 → Conv2d16 → LeakyReLU0.2 → AvgPool2 → Conv2d32 → LeakyReLU0.2 → Conv2d32 → LeakyReLU0.2 → AvgPool2 → Conv2d64 → LeakyReLU0.2 → Conv2d64 → LeakyReLU0.2 → AvgPool2 → Flatten → FC128 → ReLU → FC$d_Z$ → ℓ2 normalization,

where $d_Z$ is the dimension of the latent space (either 11 for $S^{10}$ or 3 for $S^2$). The architecture of the decoder is

$z \in \mathbb{R}^{d_Z}$ → FC128 → FC1024 → ReLU → Reshape(64x4x4) → Upsample2 → Conv64 → LeakyReLU0.2 → Conv64 → LeakyReLU0.2 → Upsample2 → Conv64 → LeakyReLU0.2 → Conv32 → LeakyReLU0.2 → Upsample2 → Conv32 → LeakyReLU0.2 → Conv1 → Sigmoid.

To compare the different autoencoders, we used as reconstruction loss the binary cross entropy, λ = 10, Adam (Kingma & Ba, 2014) as optimizer with a learning rate of $10^{-3}$ and PyTorch's default momentum parameters, for 800 epochs with batches of size n = 500. Moreover, when using SW-type distances, we approximated them with L = 1000 projections. For the experiment on CIFAR10, we use the same architecture as Tolstikhin et al. (2018). More precisely, the architecture of the encoder is

$x \in \mathbb{R}^{3\times 32\times 32}$ → Conv2d128 → BatchNorm → ReLU → Conv2d256 → BatchNorm → ReLU → Conv2d512 → BatchNorm → ReLU → Conv2d1024 → BatchNorm → ReLU → FC$d_z$ → ℓ2 normalization,

where $d_z$ = 65. The architecture of the decoder is

$z \in \mathbb{R}^{d_z}$ → FC4096 → Reshape(1024x2x2) → Conv2dT512 → BatchNorm → ReLU → Conv2dT256 → BatchNorm → ReLU → Conv2dT128 → BatchNorm → ReLU → Conv2dT3 → Sigmoid.

We use here a batch size of n = 128, λ = 0.1, the binary cross entropy as reconstruction loss and Adam as optimizer with a learning rate of $10^{-3}$. We report in Table 2 the FID obtained using 10000 samples, and we report the mean over 5 trainings. For SSW, we used the formulation using the uniform distribution (12). To compute SW, we used the POT library (Flamary et al., 2021). To compute the Sinkhorn divergence, we used the GeomLoss package (Feydy et al., 2019).

Additional experiments. We report on Figure 18 samples obtained with SSW for a uniform prior on $S^{10}$.

Figure 18: Samples generated with Sliced-Wasserstein Autoencoders with a uniform prior on $S^{10}$.

On Figure 19, we add the evolution over epochs of the Wasserstein distance between generated images and samples from the test set.

Figure 19: Comparison of the evolution of the Wasserstein distance $W_2(g_\# f_\# \mu, \nu)$ over epochs between SWAE and SSWAE on MNIST (averaged over 5 trainings). (a) With uniform prior on $S^2$. (b) With prior on $S^{10}$.
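To make the objective (85) with $D = SSW_2^2$ concrete, the following sketch shows how one loss evaluation could look; `encoder`, `decoder` and `ssw2_uniform` (the uniform-target SSW estimator sketched in Appendix C.3 above) are placeholders for the architectures and estimator described in this section, not the exact training code.

```python
# Sketch of one SSWAE loss evaluation, assuming an encoder ending with an
# l2 normalization (latent points on the sphere) and a uniform prior p_Z.
import torch
import torch.nn.functional as F

def sswae_loss(x, encoder, decoder, ssw2_uniform, lam=10.0, n_projs=1000):
    """Objective (85): reconstruction cost plus lambda * SSW_2^2 between the
    encoded batch and the uniform prior on the latent sphere."""
    z = encoder(x)                          # assumed l2-normalized latent codes
    x_rec = decoder(z)
    rec = F.binary_cross_entropy(x_rec, x)  # reconstruction cost c used for MNIST
    reg = ssw2_uniform(z, n_projs=n_projs)  # closed form against Unif(S^{d_Z - 1})
    return rec + lam * reg

# Usage sketch within a training loop (optimizer and networks assumed defined):
# loss = sswae_loss(batch, encoder, decoder, ssw2_uniform, lam=10.0)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```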
C.7 SELF-SUPERVISED LEARNING

Table 4: Linear evaluation on CIFAR10. The features are taken either on the encoder output or directly on the sphere $S^2$.

Method                      | Encoder output | S2
Supervised                  | 82.26          | 81.43
Chen et al. (2020a)         | 66.55          | 59.09
Wang & Isola (2020)         | 60.53          | 55.86
SW-SSL, λ = 1, L = 10       | 62.65          | 57.77
SW-SSL, λ = 1, L = 3        | 62.46          | 57.64
SSW-SSL, λ = 20, L = 10     | 64.89          | 58.91
SSW-SSL, λ = 20, L = 3      | 63.75          | 59.75

We conduct experiments using SSW to prevent collapsing representations in contrastive self-supervised learning (SSL) models. Such contrastive losses on the hypersphere have exhibited great representative capacity (Wu et al., 2018; Chen et al., 2020a; Caron et al., 2020) on unlabelled datasets by learning robust image representations that are invariant to augmentations. As proposed in (Wang & Isola, 2020), the contrastive objective can be decomposed into an alignment loss, which forces positive representations coming from the same image to be similar, and a uniformity loss, which preserves maximal information of the feature distribution and hence avoids collapsing representations. Without the uniformity loss, the representations tend to converge towards a constant representation, which yields the best possible alignment loss but contains no information about the original images. Wang & Isola (2020) propose to enforce uniformity by leveraging the Gaussian potential kernel, which is linked to the uniform distribution on the sphere. This formulation is also related to the denominator of the contrastive loss as specified in Chen et al. (2020a). We propose to replace the Gaussian kernel uniformity loss with SSW, whose complexity is closer to linear w.r.t. the number of batch samples. A simple choice of alignment loss is to minimize the mean squared Euclidean distance between pairs of different augmented versions of the same image. A self-supervised learning network is pre-trained using this alignment loss together with a uniformity term. Our overall self-supervised loss can be defined as
$$\mathcal{L}_{\text{SSW-SSL}} = \underbrace{\frac{1}{n}\sum_{i=1}^n \|z_i^A - z_i^B\|_2^2}_{\text{Alignment loss}} \;+\; \underbrace{\frac{\lambda}{2}\Big(SSW_2^2(z^A, \nu) + SSW_2^2(z^B, \nu)\Big)}_{\text{Uniformity loss}},$$
where $z^A, z^B \in \mathbb{R}^{n\times d}$ are the representations from the network, projected on the hypersphere, of two augmented versions of the same images, $\nu = \mathrm{Unif}(S^{d-1})$ is the uniform distribution on the hypersphere and $\lambda > 0$ is used to balance the two terms. We pretrain a ResNet18 (He et al., 2016) model on the CIFAR10 (Krizhevsky, 2009) data with representations projected onto the sphere $S^2$. This feature dimension allows us to visualize the entire validation set of CIFAR10 and its distribution on the sphere. The visualization of the projections on $S^2$ is visible in Figure 20. We then evaluate the performance of each contrastive objective by fitting, on the training dataset, a linear classifier on top of the output of the layer before the projection on the sphere, as is common for SSL methods. For comparison, we also report the results when the features are taken directly on the sphere. As a baseline, we also train a supervised predictive encoder, by jointly training the linear classifier and the image encoder in a supervised manner using cross entropy. We use a ResNet18 (He et al., 2016) encoder which outputs 1024 features that are then projected onto the sphere $S^2$ using a last fully connected layer followed by an ℓ2 normalization.
We pretrain the model for 200 epochs using minibatch stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 0.001 and an initial learning rate of 0.05. We use a batch size of 512 samples. The images are augmented using a standard set of random augmentations for SSL: random crops, horizontal flipping, color jittering and gray scale transformation, as done in Wang & Isola (2020). For the trade-off parameter λ, we use λ = 20 for SSW and λ = 1 for SW. To evaluate the performance of the representations, we use the common linear evaluation protocol, where a linear classifier is fitted on top of the pre-trained representations and the best validation accuracy is reported. The linear classifiers are trained for 100 epochs using the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 0.001, with a decay of 0.2 at epochs 60 and 80. We compare our method with two other contrastive objectives: Chen et al. (2020a) with the normalized temperature-scaled cross-entropy (NT-Xent) loss, and Wang & Isola (2020), which proposes to decompose the objective into two distinct terms $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$. We recall the respective uniformity loss of each method in Table 5.

(a) SSW-SSL, λ = 20, L = 10 (Ours); (b) SW-SSL, λ = 1, L = 10; (c) Wang & Isola (2020); (d) Chen et al. (2020a); (e) Supervised prediction; (f) Random initialization.
Figure 20: The CIFAR10 validation set on $S^2$ after pre-training.

Table 5: Comparison of contrastive methods and their respective uniformity objective, where $z^A, z^B \in \mathbb{R}^{n\times d}$ are representations from two augmented versions of the same set of images and $\nu = \mathrm{Unif}(S^{d-1})$ is the uniform distribution on the hypersphere.

Method | $\mathcal{L}_{\text{uniform}}(z^A) + \mathcal{L}_{\text{uniform}}(z^B)$ | Complexity
Chen et al. (2020a) | $\frac{1}{2n}\sum_i \log \sum_{j \neq i} \exp(\langle \hat z_i, \hat z_j\rangle / \tau)$, with $\hat z = \mathrm{cat}(z^A, z^B)$ | $O(n^2 d)$
Wang & Isola (2020) | $\sum_{z \in \{z^A, z^B\}} \log\Big(\frac{2}{n(n-1)} \sum_{i > j} \exp(-t \|z_i - z_j\|_2^2)\Big)$ | $O(n^2 d)$
SSW-SSL (Ours) | $\frac{1}{2}\big(SSW_2^2(z^A, \nu) + SSW_2^2(z^B, \nu)\big)$ | $O(Ln(d + \log n))$

As one can see in Table 4, our method achieves performance comparable to these two state-of-the-art approaches, while slightly under-performing compared to (Chen et al., 2020a). We suspect that a finer validation of the balancing parameter λ is needed, especially since the representations in Figure 20a are not completely uniformly distributed on the sphere after pre-training, compared to the other contrastive methods. Nevertheless, these preliminary results show that SSW-SSL is a promising contrastive learning approach without explicit distances between negative samples, especially compared to SW on the sphere. To this end, further works should be devoted to finding a good balance between the alignment and uniformity objectives.
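For completeness, here is a hedged sketch of the SSW-SSL objective above; `ssw2_uniform` again denotes the uniform-target SSW estimator sketched in Appendix C.3, and `backbone`/`augment` in the usage lines are assumed to be defined elsewhere.

```python
# Sketch of the SSW-SSL loss (alignment + SSW uniformity term), assuming
# l2-normalized features zA, zB of two augmented views of the same images.
import torch

def ssw_ssl_loss(zA, zB, ssw2_uniform, lam=20.0, n_projs=10):
    """Alignment between the two views plus
    lambda/2 * (SSW_2^2(z^A, Unif) + SSW_2^2(z^B, Unif)) as uniformity loss."""
    align = (zA - zB).pow(2).sum(-1).mean()
    uniform = 0.5 * (ssw2_uniform(zA, n_projs=n_projs)
                     + ssw2_uniform(zB, n_projs=n_projs))
    return align + lam * uniform

# Usage sketch (backbone and augment assumed defined elsewhere):
# zA = torch.nn.functional.normalize(backbone(augment(x)), dim=-1)
# zB = torch.nn.functional.normalize(backbone(augment(x)), dim=-1)
# loss = ssw_ssl_loss(zA, zB, ssw2_uniform, lam=20.0, n_projs=10)
```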