Generalized Sliced Wasserstein Distances

Soheil Kolouri¹, Kimia Nadjahi², Umut Şimşekli²,³, Roland Badeau², Gustavo K. Rohde⁴
1: HRL Laboratories, LLC., Malibu, CA, USA, 90265
2: LTCI, Télécom Paris, Institut Polytechnique de Paris, France
3: Department of Statistics, University of Oxford, UK
4: University of Virginia, Charlottesville, VA, USA, 22904
skolouri@hrl.com, gustavo@virginia.edu
{kimia.nadjahi, umut.simsekli, roland.badeau}@telecom-paris.fr
Denotes equal contribution.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervised learning. In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. We further show that, similar to the SW distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance. We then provide the conditions under which GSW and max-GSW distances are indeed proper metrics. Finally, we compare the numerical performance of the proposed distances on the generative modeling task of SW flows and report favorable results.

1 Introduction

The Wasserstein distance has its roots in optimal transport (OT) theory [1] and forms a metric between two probability measures. It has attracted abundant attention in data science and machine learning due to its convenient theoretical properties and its applications in many domains [2, 3, 4, 5, 6, 7, 8], especially in implicit generative modeling such as OT-based generative adversarial networks (GANs) and variational auto-encoders [9, 10, 11, 12]. While OT brings new perspectives and principled ways to formalize problems, OT-based methods usually suffer from high computational complexity. The Wasserstein distance is often the computational bottleneck, and evaluating it between multi-dimensional measures turns out to be numerically intractable in general. This computational burden is a major limiting factor in the application of OT distances to large-scale data analysis.

Recently, several numerical methods have been proposed to speed up the evaluation of the Wasserstein distance. For instance, entropic regularization techniques [13, 14, 15] provide a fast approximation to the Wasserstein distance by regularizing the original OT problem with an entropy term. The linear OT approach [16, 17] further simplifies this computation for a given dataset by a linear approximation of pairwise distances with a functional defined on distances to a reference measure. Other notable contributions towards computational methods for OT include multi-scale and sparse approximation approaches [18, 19], and Newton-based schemes for semi-discrete OT [20, 21].

There are some special favorable cases where solving the OT problem is easy and reasonably cheap. In particular, the Wasserstein distance for one-dimensional probability densities has a closed-form formula that can be efficiently approximated.
This nice property motivates the use of the sliced-Wasserstein distance [22], an alternative OT distance obtained by computing infinitely many linear projections of the high-dimensional distribution onto one-dimensional distributions and then averaging the Wasserstein distances between these one-dimensional representations. While having similar theoretical properties [23], the sliced-Wasserstein distance has significantly lower computational requirements than the classical Wasserstein distance. It has therefore recently attracted ample attention and has successfully been applied to a variety of practical tasks [22, 24, 25, 26, 27, 28, 29, 30, 31].

As we will detail in the next sections, the linear projection process used in the sliced-Wasserstein distance is closely related to the Radon transform, which is widely used in tomography [32, 33]. In other words, the sliced-Wasserstein distance is calculated via linear slicing of the probability distributions. However, the linear nature of these projections does not guarantee an efficient evaluation of the sliced-Wasserstein distance: in very high-dimensional settings, the data often lies on a thin manifold and the number of randomly chosen linear projections required to capture the structure of the data distribution grows very quickly [27]. Reducing the number of required projections would thus result in a significant performance improvement in sliced-Wasserstein computations.

Several attempts have recently been made to address the inefficiencies caused by the linear projections. In [34], Rowland et al. combined linear projections with orthogonal coupling in Monte Carlo estimation to increase computational efficiency and estimation quality. In [35], Deshpande et al. extended the sliced-Wasserstein distance to the max-sliced-Wasserstein distance, where they aimed at finding a single linear projection that maximizes the distance in the projected space. In another study [36], Paty and Cuturi extended this idea to projections onto linear subspaces, where they aimed at finding the optimal subspace for the projections by replacing the projections along a vector with projections onto the nullspace of a matrix. While these methods reduce the computational cost induced by the projection operations by choosing a single vector or an orthogonal matrix, they incur an additional cost since they need to solve a non-convex optimization problem over manifolds.

In this paper, we address the aforementioned computational issues of the sliced-Wasserstein distance by taking an alternative route. In particular, we extend the linear slicing to non-linear slicing of probability measures. Our main contributions are summarized as follows:

- Using the theory of the generalized Radon transform [37], we extend the definition of the sliced-Wasserstein distance to an entire class of distances, which we call generalized sliced-Wasserstein (GSW) distances. We prove that replacing the linear projections with non-linear projections can still yield a valid distance metric, and we identify general conditions under which the GSW distance is a proper metric. To the best of our knowledge, this is the first study to generalize SW to non-linear projections.
- Similar to [35], we then show that, instead of using infinitely many projections as required by the GSW distance, we can still define a valid distance metric by using a single projection, as long as that projection gives the maximal distance in the projected space. We aptly call this distance the max-GSW distance.
- As instances of non-linear projections, we first investigate projections with polynomial kernels, which fulfill all the conditions that we identify. However, the memory complexity of such projections grows combinatorially with the dimension of the problem, which restricts their applicability to modern problems. This motivates us to consider a neural-network-based projection scheme, where we observe that fully connected or convolutional networks with leaky ReLU activations fulfill the crucial conditions under which the resulting GSW is a pseudo-metric for probability measures. In addition to its practical advantages, this scheme also brings an interesting perspective on adversarial generative modeling, showing that such algorithms contain an implicit stage for learning projections, albeit with different cost functions than ours.
- Due to their inherent non-linearity, the GSW and max-GSW distances are expected to capture the complex structure of high-dimensional distributions with far fewer projections, which significantly reduces the iteration complexity. We verify this in our experiments, where we illustrate the superior performance of the proposed distances in both synthetic and real-data settings.

2 Background

We review in this section the preliminary concepts and formulations needed to develop our framework, namely the p-Wasserstein distance, the Radon transform, the sliced p-Wasserstein distance, and the maximum sliced p-Wasserstein distance. In what follows, we denote by $\mathcal{P}_p(\Omega)$ the set of Borel probability measures with finite $p$-th moment defined on a given metric space $(\Omega, d)$, and by $\mu \in \mathcal{P}_p(X)$ and $\nu \in \mathcal{P}_p(Y)$ probability measures defined on $X, Y \subseteq \Omega$ with corresponding probability density functions $I_\mu$ and $I_\nu$, i.e., $d\mu(x) = I_\mu(x)dx$ and $d\nu(y) = I_\nu(y)dy$.

Wasserstein Distance. The p-Wasserstein distance, $p \in [1, \infty)$, between $\mu$ and $\nu$ is defined as the solution of the optimal mass transportation problem [1]:

$$W_p(\mu, \nu) = \Big( \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} d^p(x, y)\, d\gamma(x, y) \Big)^{1/p}, \quad (1)$$

where $d^p(\cdot, \cdot)$ is the cost function, and $\Gamma(\mu, \nu)$ is the set of all transportation plans $\gamma$ such that $\gamma(A \times Y) = \mu(A)$ for any Borel $A \subseteq X$ and $\gamma(X \times B) = \nu(B)$ for any Borel $B \subseteq Y$. Due to Brenier's theorem [38], for absolutely continuous probability measures $\mu$ and $\nu$ (with respect to the Lebesgue measure), the p-Wasserstein distance can be equivalently obtained from

$$W_p(\mu, \nu) = \Big( \inf_{f \in MP(\mu,\nu)} \int_X d^p\big(x, f(x)\big)\, d\mu(x) \Big)^{1/p}, \quad (2)$$

where $MP(\mu, \nu) = \{ f : X \to Y \mid f_\#\mu = \nu \}$ and $f_\#\mu$ represents the pushforward of the measure $\mu$, characterized by $\int_A df_\#\mu(y) = \int_{f^{-1}(A)} d\mu(x)$ for any Borel subset $A \subseteq Y$. Note that in most engineering and computer science applications, $\Omega$ is a compact subset of $\mathbb{R}^d$ and $d(x, y) = \|x - y\|$ is the Euclidean distance. By abuse of notation, we will use $W_p(\mu, \nu)$ and $W_p(I_\mu, I_\nu)$ interchangeably.

One-dimensional distributions: The case of one-dimensional continuous probability measures is specifically interesting, as the p-Wasserstein distance has a closed-form solution. More precisely, for one-dimensional probability measures there exists a unique monotonically increasing transport map that pushes one measure onto the other. Let $F_\mu(x) = \mu((-\infty, x]) = \int_{-\infty}^{x} I_\mu(\tau)\, d\tau$ be the cumulative distribution function (CDF) of $I_\mu$, and define $F_\nu$ to be the CDF of $I_\nu$. The optimal transport map is then uniquely defined as $f(x) = F_\nu^{-1}(F_\mu(x))$ and, consequently, the p-Wasserstein distance has an analytical form given as follows:

$$W_p(\mu, \nu) = \Big( \int_X d^p\big(x, F_\nu^{-1}(F_\mu(x))\big)\, d\mu(x) \Big)^{1/p} = \Big( \int_0^1 d^p\big(F_\mu^{-1}(z), F_\nu^{-1}(z)\big)\, dz \Big)^{1/p}, \quad (3)$$

where the second equality in Eq. (3) results from the change of variable $z = F_\mu(x)$. Note that for empirical distributions, Eq. (3) is calculated by simply sorting the samples from the two distributions and calculating the average $d^p(\cdot, \cdot)$ between the sorted samples. This requires only $O(M)$ operations at best and $O(M \log M)$ at worst, where $M$ is the number of samples drawn from each distribution (see [30] for more details).
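For concreteness, the following is a minimal NumPy sketch of the empirical estimator implied by Eq. (3): sort the two sample sets and average the p-th power of the coordinate differences. The function name and the equal-sample-size assumption are ours, for illustration only.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two 1-D sample sets of equal size.

    Sorting realizes the monotone transport map of Eq. (3); the integral over
    the quantile variable z is replaced by an average over the sorted samples.
    """
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)

# Example: two Gaussian sample sets whose means differ by 3; W_2 should be close to 3.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0, 1, 5000), rng.normal(3, 1, 5000)))
```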
The closed-form solution of the p-Wasserstein distance for one-dimensional distributions is an attractive property that gives rise to the sliced-Wasserstein (SW) distance. Next, we review the Radon transform, which enables the definition of the SW distance. We also formulate an alternative OT distance called the maximum sliced-Wasserstein distance.

Radon Transform. The standard Radon transform, denoted by $\mathcal{R}$, maps a function $I \in L^1(\mathbb{R}^d)$, where $L^1(\mathbb{R}^d) = \{ I : \mathbb{R}^d \to \mathbb{R} \mid \int_{\mathbb{R}^d} |I(x)|\, dx < \infty \}$, to the infinite set of its integrals over the hyperplanes of $\mathbb{R}^d$, and is defined as

$$\mathcal{R}I(t, \theta) = \int_{\mathbb{R}^d} I(x)\, \delta(t - \langle x, \theta \rangle)\, dx, \quad (4)$$

for $(t, \theta) \in \mathbb{R} \times \mathbb{S}^{d-1}$, where $\mathbb{S}^{d-1} \subset \mathbb{R}^d$ stands for the $d$-dimensional unit sphere, $\delta(\cdot)$ the one-dimensional Dirac delta function, and $\langle \cdot, \cdot \rangle$ the Euclidean inner product. Note that $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$. Each hyperplane can be written as

$$H(t, \theta) = \{ x \in \mathbb{R}^d \mid \langle x, \theta \rangle = t \}, \quad (5)$$

which can alternatively be interpreted as a level set of the function $g : \mathbb{R}^d \times \mathbb{S}^{d-1} \to \mathbb{R}$ defined by $g(x, \theta) = \langle x, \theta \rangle$. For a fixed $\theta$, the integrals over all hyperplanes orthogonal to $\theta$ define a continuous function $\mathcal{R}I(\cdot, \theta) : \mathbb{R} \to \mathbb{R}$, which is a projection (or a slice) of $I$.

The Radon transform is a linear bijection [39, 33] and its inverse $\mathcal{R}^{-1}$ is defined as

$$I(x) = \mathcal{R}^{-1}\big(\mathcal{R}I(t, \theta)\big) = \int_{\mathbb{S}^{d-1}} \big(\mathcal{R}I(\cdot, \theta) * \eta\big)(\langle x, \theta \rangle)\, d\theta, \quad (6)$$

where $\eta(\cdot)$ is a one-dimensional high-pass filter with corresponding Fourier transform $\mathcal{F}\eta(\omega) = c|\omega|^{d-1}$, which appears due to the Fourier slice theorem [33], and $*$ is the convolution operator. The above definition of the inverse Radon transform is also known as the filtered back-projection method, which is extensively used in image reconstruction in the biomedical imaging community. Intuitively, each one-dimensional projection (or slice) $\mathcal{R}I(\cdot, \theta)$ is first filtered via a high-pass filter and then smeared back into $\mathbb{R}^d$ along $H(\cdot, \theta)$ to approximate $I$. The summation of all smeared approximations then reconstructs $I$. Note that in practice, acquiring an infinite number of projections is not feasible; therefore, the integration in the filtered back-projection formulation is replaced with a finite summation over projections (i.e., a Monte Carlo approximation).
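To make the slicing concrete, here is a small sketch of our own (not from the authors' implementation) of what a linear slice looks like for an empirical measure: the slice $\mathcal{R}I(\cdot, \theta)$ of Eq. (4) is simply the one-dimensional empirical distribution of the projections $\langle x_i, \theta \rangle$; the histogram at the end is only a crude visualization of that slice.

```python
import numpy as np

def linear_slice(samples, theta):
    """Project d-dimensional samples onto a unit vector theta.

    For an empirical measure (1/N) * sum_i delta(x - x_i), the Radon slice
    RI(., theta) is the empirical distribution of these 1-D projections.
    """
    theta = theta / np.linalg.norm(theta)
    return samples @ theta

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=np.eye(2), size=2000)
theta = np.array([1.0, 0.0])
proj = linear_slice(X, theta)                              # 1-D marginal along theta
hist, edges = np.histogram(proj, bins=50, density=True)    # crude density of the slice
print(proj.mean())  # close to 2.0, the mean of the marginal along the first axis
```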
Sliced-Wasserstein and Maximum Sliced-Wasserstein Distances. The idea behind the sliced p-Wasserstein distance is to first obtain a family of one-dimensional representations of a higher-dimensional probability distribution through linear projections (via the Radon transform), and then calculate the distance between two input distributions as a functional of the p-Wasserstein distances between their one-dimensional representations (i.e., the one-dimensional marginals). The sliced p-Wasserstein distance between $I_\mu$ and $I_\nu$ is formally defined as

$$SW_p(I_\mu, I_\nu) = \Big( \int_{\mathbb{S}^{d-1}} W_p^p\big(\mathcal{R}I_\mu(\cdot, \theta), \mathcal{R}I_\nu(\cdot, \theta)\big)\, d\theta \Big)^{1/p}. \quad (7)$$

This is indeed a distance function, as it satisfies positive-definiteness, symmetry, and the triangle inequality [23, 24].

The computation of the SW distance requires an integration over the unit sphere in $\mathbb{R}^d$. In practice, this integration is approximated by a simple Monte Carlo scheme that draws samples $\{\theta_l\}_{l=1}^L$ from the uniform distribution on $\mathbb{S}^{d-1}$ and replaces the integral with a finite-sample average:

$$SW_p(I_\mu, I_\nu) \approx \Big( \frac{1}{L} \sum_{l=1}^{L} W_p^p\big(\mathcal{R}I_\mu(\cdot, \theta_l), \mathcal{R}I_\nu(\cdot, \theta_l)\big) \Big)^{1/p}. \quad (8)$$

In higher dimensions, the random nature of the slices can lead to underestimating the distance between the two probability measures. To further clarify this, let $I_\mu = \mathcal{N}(0, I_d)$ and $I_\nu = \mathcal{N}(x_0, I_d)$, with $x_0 \in \mathbb{R}^d$, be two multivariate Gaussian densities with the identity matrix as the covariance matrix. Their projected representations are one-dimensional Gaussian distributions of the form $\mathcal{R}I_\mu(\cdot, \theta) = \mathcal{N}(0, 1)$ and $\mathcal{R}I_\nu(\cdot, \theta) = \mathcal{N}(\langle \theta, x_0 \rangle, 1)$. It is therefore clear that $W_2(\mathcal{R}I_\mu(\cdot, \theta), \mathcal{R}I_\nu(\cdot, \theta))$ achieves its maximum value when $\theta = x_0 / \|x_0\|_2$ and is zero for $\theta$'s that are orthogonal to $x_0$. On the other hand, vectors picked at random from the unit sphere are increasingly likely to be nearly orthogonal in high dimension. More rigorously, the following inequality holds: $\Pr\big(|\langle \theta, x_0 / \|x_0\|_2 \rangle| < \epsilon\big) > 1 - e^{-d\epsilon^2}$, which implies that for a high dimension $d$, the majority of sampled $\theta$'s would be nearly orthogonal to $x_0$ and, therefore, $W_2(\mathcal{R}I_\mu(\cdot, \theta), \mathcal{R}I_\nu(\cdot, \theta)) \approx 0$ with high probability.
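This concentration effect is easy to verify numerically. The toy check below (our own construction, not from the paper) slices the same pair of shifted isotropic Gaussians along random directions in growing dimension and reports the average sliced 2-Wasserstein distance, which in this Gaussian case is simply $|\langle \theta, x_0 \rangle|$.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    x0 = np.zeros(d)
    x0[0] = 5.0                                   # mean shift between the two Gaussians
    thetas = rng.normal(size=(1000, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    # For isotropic Gaussians, the W_2 distance between the slices along theta is |<theta, x0>|.
    sliced_w2 = np.abs(thetas @ x0)
    print(d, sliced_w2.mean())  # shrinks roughly like 5 * sqrt(2 / (pi * d)) as d grows
```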
To remedy this issue, one can avoid uniform sampling of the unit sphere and instead pick $\theta$'s that contain discriminant information between $I_\mu$ and $I_\nu$. This idea was used, for instance, in [28, 35, 36]. Deshpande et al. [28] first calculate a linear discriminant subspace and then measure the empirical SW distance by setting the $\theta$'s to be the discriminant components of that subspace. A similarly flavored but less heuristic approach is to use the maximum sliced p-Wasserstein (max-SW) distance, an alternative OT metric defined as [35]

$$\text{max-}SW_p(I_\mu, I_\nu) = \max_{\theta \in \mathbb{S}^{d-1}} W_p\big(\mathcal{R}I_\mu(\cdot, \theta), \mathcal{R}I_\nu(\cdot, \theta)\big). \quad (9)$$

Given that $W_p$ is a distance, it is straightforward to show that max-$SW_p$ is also a distance: we will prove in Section 3.2 that the metric axioms also hold for the maximum generalized sliced-Wasserstein distance, which contains the max-SW distance as a special case.

3 Generalized Sliced-Wasserstein Distances

Figure 1: Visualizing the slicing process for classical and generalized Radon transforms for the Half Moons distribution. The slices $\mathcal{G}I(t, \theta)$ follow Equation (10).

We propose in this paper to extend the definition of the sliced-Wasserstein distance to formulate a new optimal transport metric, which we call the generalized sliced-Wasserstein (GSW) distance. The GSW distance is obtained using the same procedure as for the SW distance, except that here the one-dimensional representations are acquired through nonlinear projections. In this section, we first review the generalized Radon transform, which is used to project the high-dimensional distributions, and we then formally define the class of GSW distances. We also extend the concept of the max-SW distance to the class of maximum generalized sliced-Wasserstein (max-GSW) distances.

3.1 Generalized Radon Transform

The generalized Radon transform (GRT) extends the original idea of the classical Radon transform introduced by [32] from integration over hyperplanes of $\mathbb{R}^d$ to integration over hypersurfaces, i.e., $(d-1)$-dimensional manifolds [37, 40, 41, 42, 43, 44]. The GRT has various applications, including thermoacoustic tomography, where the hypersurfaces are spheres, and electrical impedance tomography, which requires integration over hyperbolic surfaces.

To formally define the GRT, we introduce a function $g$ defined on $X \times (\mathbb{R}^n \setminus \{0\})$ with $X \subseteq \mathbb{R}^d$. We say that $g$ is a defining function when it satisfies the four conditions below:

H1. $g$ is a real-valued $C^\infty$ function on $X \times (\mathbb{R}^n \setminus \{0\})$;
H2. $g(x, \theta)$ is homogeneous of degree one in $\theta$, i.e., $\forall \lambda \in \mathbb{R}$, $g(x, \lambda\theta) = \lambda g(x, \theta)$;
H3. $g$ is non-degenerate in the sense that $\forall (x, \theta) \in X \times (\mathbb{R}^n \setminus \{0\})$, $\frac{\partial g}{\partial x}(x, \theta) \neq 0$;
H4. the mixed Hessian of $g$ is strictly positive, i.e., $\det\big( \frac{\partial^2 g}{\partial x_i \partial \theta_j} \big) > 0$.

Then, the GRT of $I \in L^1(\mathbb{R}^d)$ is the integration of $I$ over the hypersurfaces characterized by the level sets of $g$, namely $H_{t,\theta} = \{ x \in X \mid g(x, \theta) = t \}$. Let $g$ be a defining function. The generalized Radon transform of $I$, denoted by $\mathcal{G}I$, is then formally defined as

$$\mathcal{G}I(t, \theta) = \int_{\mathbb{R}^d} I(x)\, \delta\big(t - g(x, \theta)\big)\, dx. \quad (10)$$

Note that the standard Radon transform is a special case of the GRT with $g(x, \theta) = \langle x, \theta \rangle$. Figure 1 illustrates the slicing process for the standard and generalized Radon transforms with the Half Moons dataset as input.

3.2 Generalized Sliced-Wasserstein and Max-Generalized Sliced-Wasserstein Distances

Following the definition of the SW distance in Equation (7), we define the generalized sliced p-Wasserstein distance using the generalized Radon transform as

$$GSW_p(I_\mu, I_\nu) = \Big( \int_{\Omega_\theta} W_p^p\big(\mathcal{G}I_\mu(\cdot, \theta), \mathcal{G}I_\nu(\cdot, \theta)\big)\, d\theta \Big)^{1/p}, \quad (11)$$

where $\Omega_\theta$ is a compact set of feasible parameters for $g(\cdot, \theta)$ (e.g., $\Omega_\theta = \mathbb{S}^{d-1}$ for $g(\cdot, \theta) = \langle \cdot, \theta \rangle$). The GSW distance can also suffer from the projection complexity issue described before; that is why we formulate the maximum generalized sliced p-Wasserstein distance, which generalizes the max-SW distance defined in (9):

$$\text{max-}GSW_p(I_\mu, I_\nu) = \max_{\theta \in \Omega_\theta} W_p\big(\mathcal{G}I_\mu(\cdot, \theta), \mathcal{G}I_\nu(\cdot, \theta)\big). \quad (12)$$

Proposition 1. The generalized sliced p-Wasserstein distance and the maximum generalized sliced p-Wasserstein distance are, indeed, distances over $\mathcal{P}_p(\Omega)$ if and only if the generalized Radon transform is injective.

The proof is given in the supplementary document.

Remark 1. If the chosen generalized Radon transform is not injective, then we can only say that the GSW and max-GSW distances are pseudo-metrics: they still satisfy non-negativity, symmetry, the triangle inequality, and $GSW_p(I_\mu, I_\mu) = 0$ and $\text{max-}GSW_p(I_\mu, I_\mu) = 0$.

Remark 2. Proposition 1 shows that the injectivity of the GRT is sufficient and necessary for GSW to be a metric. In this respect, our result brings a different perspective on the results of [23] by showing that SW is indeed a distance because the standard Radon transform is injective.

3.3 Injectivity of the Generalized Radon Transform

We have shown that the injectivity of the GRT is crucial for the GSW and max-GSW distances to be, indeed, distances between probability measures. Here, we enumerate some of the known defining functions that lead to injective GRTs; the investigation of sufficient and necessary conditions for the injectivity of GRTs is a long-standing topic [37, 44, 45, 41].

The circular defining function, $g(x, \theta) = \|x - r\theta\|_2$ with $r \in \mathbb{R}^+$ and $\Omega_\theta = \mathbb{S}^{d-1}$, was shown to provide an injective GRT [43]. More interestingly, homogeneous polynomials of odd degree also yield an injective GRT [46], i.e., $g(x, \theta) = \sum_{|\alpha| = m} \theta_\alpha x^\alpha$, where we use the multi-index notation $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$, $|\alpha| = \sum_{i=1}^{d} \alpha_i$, and $x^\alpha = \prod_{i=1}^{d} x_i^{\alpha_i}$. Here, the summation iterates over all possible multi-indices $\alpha$ such that $|\alpha| = m$, where $m$ denotes the degree of the polynomial and $\theta_\alpha \in \mathbb{R}$. Denoting by $d_\alpha$ the number of such multi-indices, the parameter set for homogeneous polynomials is set to $\Omega_\theta = \mathbb{S}^{d_\alpha - 1}$. We can observe that choosing $m = 1$ reduces to the linear case $\langle x, \theta \rangle$, since the set of multi-indices with $|\alpha| = 1$ becomes $\{(\alpha_1, \dots, \alpha_d) \mid \alpha_i = 1 \text{ for a single } i \in \{1, \dots, d\}, \text{ and } \alpha_j = 0 \ \forall j \neq i\}$ and contains $d$ elements.
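A sketch of such a homogeneous polynomial defining function is given below, assuming the monomials are enumerated with itertools; the function names and the degree-3 example are ours for illustration, not the authors' implementation.

```python
import itertools
import numpy as np

def polynomial_features(x, m):
    """All monomials x^alpha with |alpha| = m (homogeneous polynomial of degree m)."""
    d = x.shape[-1]
    monomials = []
    for idx in itertools.combinations_with_replacement(range(d), m):
        monomials.append(np.prod(x[..., list(idx)], axis=-1))
    return np.stack(monomials, axis=-1)   # shape (..., d_alpha)

def polynomial_defining_function(x, theta, m=3):
    """g(x, theta) = sum_{|alpha|=m} theta_alpha * x^alpha, with theta on the unit sphere."""
    feats = polynomial_features(x, m)
    theta = theta / np.linalg.norm(theta)
    return feats @ theta

# Example in d = 2 with odd degree m = 3: d_alpha = C(2 + 3 - 1, 3) = 4 coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
d_alpha = len(list(itertools.combinations_with_replacement(range(2), 3)))  # = 4
theta = rng.normal(size=d_alpha)
print(polynomial_defining_function(X, theta, m=3))  # one scalar projection per sample
```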
While polynomial projections form an interesting alternative to linear projections, their memory complexity $d_\alpha$ grows exponentially with the dimension of the data and the degree of the polynomial, which deteriorates their potential in modern machine learning problems. As a remedy, and given the current success of neural networks in various application domains, a natural task in our context is to come up with a neural network that yields a valid GSW or max-GSW when used as the defining function in the GRT. As a neural-network-based defining function, we propose a multi-layer fully connected network with leaky ReLU activations. Under this specific network architecture, one can easily show that the corresponding defining function satisfies H1 to H4 on $(X \setminus \{0\}) \times (\mathbb{R}^n \setminus \{0\})$. On the other hand, it is highly non-trivial to show the injectivity of the associated GRT; therefore, the GSW associated with this particular defining function is a pseudo-metric, as discussed in Remark 1. However, as illustrated later in Section 5, this neural-network-based defining function still performs well in practice, and in particular the non-differentiability of the leaky ReLU function at 0 does not seem to be an issue in practice.

Remark 3. With a neural network as the defining function, minimizing max-GSW between two distributions is analogous to adversarial learning, where the adversary network's goal is to distinguish the two distributions. In the max-GSW case, the adversary network (i.e., the defining function) seeks optimal parameters that maximize the GSW distance between the input distributions.

4 Numerical Implementation

4.1 Generalized Radon Transforms of Empirical PDFs

In most machine learning applications, we do not have access to the distribution $I_\mu$ itself but to a set of samples $\{x_i\}_{i=1}^{N}$ drawn from $I_\mu$, for which the empirical density is $I_\mu(x) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$. The GRT of the empirical density is then given by

$$\mathcal{G}I_\mu(t, \theta) \approx \frac{1}{N} \sum_{i=1}^{N} \delta\big(t - g(x_i, \theta)\big).$$

Moreover, for high-dimensional problems, estimating $I_\mu$ in $\mathbb{R}^d$ requires a large number of samples. However, the projections $\mathcal{G}I_\mu(\cdot, \theta)$ are one-dimensional, and a large number of samples may not be critical to estimate these one-dimensional densities.

4.2 Numerical Implementation of GSW Distances

Let $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{N}$ be samples drawn from $I_\mu$ and $I_\nu$, respectively, and let $g(\cdot, \theta)$ be a defining function. Following the work of [30], the Wasserstein distance between the one-dimensional distributions $\mathcal{G}I_\mu(\cdot, \theta)$ and $\mathcal{G}I_\nu(\cdot, \theta)$ can be calculated by sorting their samples and computing the $L_p$ distance between the sorted samples. In other words, the GSW distance between $I_\mu$ and $I_\nu$ can be approximated from their samples as follows:

$$GSW_p(I_\mu, I_\nu) \approx \Big( \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N} \sum_{n=1}^{N} \big| g(x_{i[n]}, \theta_l) - g(y_{j[n]}, \theta_l) \big|^p \Big)^{1/p},$$

where $i[n]$ and $j[n]$ are the indices that sort $\{g(x_i, \theta_l)\}_{i=1}^{N}$ and $\{g(y_j, \theta_l)\}_{j=1}^{N}$, respectively. The procedure to approximate the GSW distance is summarized in the supplementary document.
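The following is a minimal NumPy sketch of this sample-based GSW estimator with a pluggable defining function; with the linear choice $g(x, \theta) = \langle x, \theta \rangle$ it reduces to the sliced-Wasserstein estimate of Eq. (8). Function and variable names are ours, not from the paper's released code.

```python
import numpy as np

def gsw_distance(x, y, defining_fn, thetas, p=2):
    """Monte Carlo estimate of GSW_p between two equal-size sample sets.

    defining_fn(samples, theta) must return the 1-D generalized projections
    g(x_i, theta); sorting the projections realizes the 1-D optimal transport.
    """
    total = 0.0
    for theta in thetas:
        px = np.sort(defining_fn(x, theta))
        py = np.sort(defining_fn(y, theta))
        total += np.mean(np.abs(px - py) ** p)
    return (total / len(thetas)) ** (1.0 / p)

# Example with L random linear slices (i.e., the SW special case).
rng = np.random.default_rng(0)
d, L = 10, 50
x = rng.normal(size=(1000, d))
y = rng.normal(size=(1000, d)) + 1.0
thetas = rng.normal(size=(L, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
print(gsw_distance(x, y, lambda s, t: s @ t, thetas))
```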
4.3 Numerical Implementation of max-GSW Distances

To compute the max-GSW distance, we perform an EM-like optimization scheme: (a) for a fixed $\theta$, the projections $g(x_i, \theta)$ and $g(y_j, \theta)$ are sorted to compute $W_p$; (b) $\theta$ is updated with a Projected Gradient Descent (PGD) step,

$$\theta \leftarrow \text{Proj}_{\Omega_\theta}\Big( \text{Optim}\Big( \nabla_\theta \frac{1}{N} \sum_{n=1}^{N} \big| g(x_{i[n]}, \theta) - g(y_{j[n]}, \theta) \big|^p,\ \theta \Big) \Big),$$

where $\text{Optim}(\cdot)$ refers to the preferred optimizer, for instance Gradient Descent (GD) or Adam [47], and $\text{Proj}_{\Omega_\theta}(\cdot)$ is the operator projecting $\theta$ onto $\Omega_\theta$. For instance, when $\theta \in \mathbb{S}^{n-1}$, $\text{Proj}_{\Omega_\theta}(\theta) = \theta / \|\theta\|$.

Remark 4. Here, we find the optimal $\theta$ by optimizing the actual $W_p$, as opposed to the heuristic approaches proposed in [28] and [30], where the pseudo-optimal slice is found via perceptrons or penalized linear discriminant analysis [48].

Finally, once convergence to $\theta^*$ is reached, the max-GSW distance is approximated with

$$\text{max-}GSW_p(I_\mu, I_\nu) \approx \Big( \frac{1}{N} \sum_{n=1}^{N} \big| g(x_{i[n]}, \theta^*) - g(y_{j[n]}, \theta^*) \big|^p \Big)^{1/p}.$$

The whole procedure is summarized as pseudocode in the supplementary document.
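Below is a hedged PyTorch sketch of this alternating sort-and-update scheme, assuming Adam as the optimizer and the unit-sphere projection; it is a simplified stand-in for the pseudocode in the supplementary document, not the authors' reference implementation. With the linear defining function, it approximates max-SW.

```python
import torch

def max_gsw(x, y, defining_fn, theta_dim, p=2, n_steps=200, lr=1e-2):
    """Approximate max-GSW_p by projected gradient ascent on theta.

    Each iteration (a) sorts the generalized projections for the current theta
    and (b) takes an Adam step on theta, then re-projects theta onto the unit sphere.
    """
    theta = torch.randn(theta_dim, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        px, _ = torch.sort(defining_fn(x, theta))
        py, _ = torch.sort(defining_fn(y, theta))
        loss = -torch.mean(torch.abs(px - py) ** p)   # ascend on W_p^p of the slice
        loss.backward()
        opt.step()
        with torch.no_grad():                          # Proj onto the unit sphere
            theta /= theta.norm()
    with torch.no_grad():
        px, _ = torch.sort(defining_fn(x, theta))
        py, _ = torch.sort(defining_fn(y, theta))
        return torch.mean(torch.abs(px - py) ** p) ** (1.0 / p)

# Example with the linear defining function (max-SW special case).
torch.manual_seed(0)
x = torch.randn(500, 10)
y = torch.randn(500, 10) + 1.0
print(float(max_gsw(x, y, lambda s, t: s @ t, theta_dim=10)))
```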
5 Experiments

In this section, we conduct experiments on generalized sliced-Wasserstein flows. We also implemented GSW-based auto-encoders, whose results are reported in the supplementary document due to space limitations. We provide the source code to reproduce the experiments of this paper (see https://github.com/kimiandj/gsw).

Our goal is to demonstrate the effect of the choice of the GSW distance in its purest form by considering the following problem: $\min_\mu GSW_p(\mu, \nu)$, where $\nu$ is a target distribution and $\mu$ is the source distribution, initialized to the normal distribution. The optimization is then solved iteratively via the flow

$$\partial_t \mu_t = -\nabla GSW_p(\mu_t, \nu), \qquad \mu_0 = \mathcal{N}(0, 1).$$

Figure 2: Log 2-Wasserstein distance between the source and target distributions as a function of the number of iterations for 4 classical target distributions (curves: Linear, Poly 3, Poly 5, NN D=1, NN D=2, NN D=3).
Figure 3: 2-Wasserstein distance between source and target distributions for the MNIST dataset.

We started by using 4 well-known distributions as the target, namely the 25-Gaussians, 8-Gaussians, Swiss Roll, and Circle distributions. We compare GSW and max-GSW for optimizing the flow with linear projections (i.e., the SW distance), homogeneous polynomials of degree 3 and 5, and neural networks with 1, 2, and 3 hidden layers as defining functions. We used the exact same optimization scheme for all methods, kept only $L = 1$ projection, and calculated the 2-Wasserstein distance between $\mu_t$ and $\nu$ at each iteration of the optimization (by solving a linear program at each step). We repeated each experiment 100 times and report the mean of the 2-Wasserstein distance for all target datasets in Figure 2. We also show a snapshot of $\mu_t$ and $\nu$ at $t = 100$ iterations for all datasets.

We observe that (i) max-GSW outperforms GSW, of course at the cost of an additional optimization, and (ii) while the choice of the defining function $g(\cdot, \theta)$ is data-dependent, the homogeneous polynomials are often among the top performers for all datasets. Specifically, SW is always outperformed by GSW with polynomial projections ("Poly 3" and "Poly 5" in Figure 2, left) and by all the variants of max-GSW. Besides, max-linear-SW is consistently outperformed by max-GSW-NN. The only variant of GSW that is outperformed by SW is GSW with a neural-network-based defining function, which was expected because of the inherent difficulty of approximating the integral in (11) over a very large domain with a simple Monte Carlo average. To circumvent this issue, max-GSW replaces sampling with optimization.

To move to more realistic datasets, we considered GSW flows for the hand-written digit dataset MNIST, where we initialize 100 random images, optimize the flow via max-SW and max-GSW, and measure the 2-Wasserstein distance between $\mu_t$ (the 100 images) and $\nu$ (the training set of MNIST); see the supplementary material for videos. Given the high-dimensional nature of the problem (784 dimensions), we cannot use the homogeneous polynomials due to memory constraints caused by the combinatorial growth of the number of coefficients. Therefore, we chose a 3-layer neural network as our defining function. Figure 3 shows the 2-Wasserstein distance between the source and target distributions as a function of the number of training epochs. We observe that, with the proposed approach, the error decreases significantly faster than with linear projections. We also observe this in the quality of the generated images, where we obtain crisper results.

Figure 4: Flow minimization comparison between max-SW and max-GSW on the CelebA dataset.

Finally, we applied our methodology to a larger dataset, namely CelebA [49]. We performed flow optimization in a 256-dimensional latent space of a pre-trained auto-encoder and compared max-SW with max-GSW using a 3-layer neural network. We then measured the 2-Wasserstein distance between the real and optimized distributions in the 256-dimensional latent space. Figure 4 shows the results of this experiment: max-GSW finds a better solution than max-SW in fewer iterations, and the quality of the generated images is slightly better.

6 Conclusion

We introduced a new family of optimal transport metrics for probability measures that generalizes the sliced-Wasserstein distance: while the latter is based on linear slicing of distributions, we propose to perform nonlinear slicing. We provided theoretical conditions under which the generalized sliced-Wasserstein distance is, indeed, a distance function, and we empirically demonstrated the superior performance of the GSW and max-GSW distances over the classical sliced-Wasserstein distance in various generative modeling applications.

Acknowledgements

This work was partially supported by the United States Air Force and DARPA under Contract No. FA8750-18-C-0103. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA. This work is also partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014) and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.

References

[1] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[2] Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In International Conference on Machine Learning, pages 306–314, 2014.
[3] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
[4] Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.
[5] Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepčev, and Gustavo K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
[6] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.
[7] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
[8] Morgan A. Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi, Gabriel Peyré, and Jean-Luc Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678, 2018.
[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[10] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.
[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[12] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
[13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[14] Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, December 2015.
[15] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.
[16] Wei Wang, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2):254–269, 2013.
[17] Soheil Kolouri, Akif B. Tosun, John A. Ozolek, and Gustavo K. Rohde. A continuous linear optimal transport approach for pattern analysis in image datasets. Pattern Recognition, 51:453–462, 2016.
[18] Adam M. Oberman and Yuanlong Ruan. An efficient linear programming method for optimal transportation. arXiv preprint arXiv:1509.03668, 2015.
[19] Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision, 56(2):238–259, October 2016.
[20] Bruno Lévy. A numerical algorithm for L2 semi-discrete optimal transport in 3D. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1693–1715, 2015.
[21] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport. arXiv preprint arXiv:1603.05579, 2016.
[22] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
[23] Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Université Paris 11, France, 2013.
[24] Soheil Kolouri, Yang Zou, and Gustavo K. Rohde. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2016.
[25] Mathieu Carriere, Marco Cuturi, and Steve Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML 2017 – Thirty-fourth International Conference on Machine Learning, pages 1–10, 2017.
[26] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[27] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.
[28] Ishan Deshpande, Ziyu Zhang, and Alexander Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
[29] Soheil Kolouri, Gustavo K. Rohde, and Heiko Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[30] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.
[31] K. Nadjahi, A. Durmus, U. Şimşekli, and R. Badeau. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems, 2019.
[32] Johann Radon. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Berichte Sächsische Akademie der Wissenschaften, Math.-Phys. Klasse, 69:262, 1917.
[33] Sigurdur Helgason. The Radon transform on Rn. In Integral Geometry and Radon Transforms, pages 1–62. Springer, 2011.
[34] Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of Wasserstein distances. In Artificial Intelligence and Statistics (AISTATS), volume 89, pages 186–195, April 2019.
[35] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander Schwing. Max-sliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[36] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, 2019.
[37] Gregory Beylkin. The inversion problem and applications of the generalized Radon transform. Communications on Pure and Applied Mathematics, 37(5):579–599, 1984.
[38] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
[39] Frank Natterer. The Mathematics of Computerized Tomography, volume 32. SIAM, 1986.
[40] A. S. Denisyuk. Inversion of the generalized Radon transform. Translations of the American Mathematical Society, Series 2, 162:19–32, 1994.
[41] Leon Ehrenpreis. The Universality of the Radon Transform. Oxford University Press, 2003.
[42] Israel M. Gel'fand, Mark Iosifovich Graev, and Z. Ya. Shapiro. Differential forms and integral geometry. Functional Analysis and its Applications, 3(2):101–114, 1969.
[43] Peter Kuchment. Generalized transforms of Radon type and their applications. In Proceedings of Symposia in Applied Mathematics, volume 63, page 67, 2006.
[44] Andrew Homan and Hanming Zhou. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515–1529, 2017.
[45] Gunther Uhlmann. Inside Out: Inverse Problems and Applications, volume 47. Cambridge University Press, 2003.
[46] François Rouvière. Nonlinear Radon and Fourier transforms. https://math.unice.fr/~frou/recherche/Nonlinear%20Radon W.pdf, 2015.
[47] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[48] Wei Wang, Yilin Mo, John A. Ozolek, and Gustavo K. Rohde. Penalized Fisher discriminant analysis and its application to image-based morphometry. Pattern Recognition Letters, 32(15):2128–2135, 2011.
[49] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.