# Markovian Sliced Wasserstein Distances: Beyond Independent Projections

Khai Nguyen (Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, khainb@utexas.edu); Tongzheng Ren (Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, tongzheng@utexas.edu); Nhat Ho (Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, minhnhat@utexas.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract: Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome this issue, the max K sliced Wasserstein (Max-K-SW) distance ($K \geq 1$) seeks the $K$ best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is computationally expensive and might not be effective. To address these problems, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on the projecting directions. We discuss various members of MSW by specifying the Markov structure, including the prior distribution, the transition distribution, and a burning and thinning technique. Moreover, we investigate the theoretical properties of MSW, including topological properties (metricity, weak convergence, and connections to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications, such as gradient flows, color transfer, and deep generative modeling, to demonstrate the favorable performance of MSW. Code for this paper is published at https://github.com/UT-Austin-Data-Science-Group/MSW.

1 Introduction

The sliced Wasserstein (SW) distance [7] is well known as a scalable alternative to the Wasserstein distance [57, 49]. In short, SW takes the average of Wasserstein distances between corresponding pairs of one-dimensional projected measures as the distance between the two original measures. Hence, SW has a low computational complexity compared to the conventional Wasserstein distance thanks to the closed-form solution of optimal transport in one dimension. When the probability measures have at most $n$ supports, the computational complexity of SW is only $\mathcal{O}(n \log n)$. This complexity is much lower than the $\mathcal{O}(n^3 \log n)$ complexity of the Wasserstein distance and the $\mathcal{O}(n^2)$ complexity [1, 34, 35, 33] of the entropic Wasserstein distance [11] (Sinkhorn divergence). Moreover, the memory complexity of SW is $\mathcal{O}(n)$, which is lower than the $\mathcal{O}(n^2)$ of the Wasserstein (Sinkhorn) distance; the reason is that SW does not need to store the cost matrix between supports, which costs $\mathcal{O}(n^2)$. An additional appealing property of SW is that it does not suffer from the curse of dimensionality, namely, its sample complexity is $\mathcal{O}(n^{-1/2})$ [40, 46] compared to $\mathcal{O}(n^{-1/d})$ [19] for the Wasserstein distance ($d$ is the number of dimensions). Due to this scalability, SW has been applied to almost all applications where the Wasserstein distance is used.
For example, we refer to some applications of SW such as generative modeling [60, 15, 27, 42], domain adaptation [30], clustering [28], approximate Bayesian computation [39], gradient flows [37, 5], and variational inference [61]. Moreover, there have been many attempts to improve SW. The generalized sliced Wasserstein (GSW) distance, which uses non-linear projections, is proposed in [26]. The distributional sliced Wasserstein distance is proposed in [44, 45] by replacing the uniform distribution over projecting directions in SW with an estimated distribution that puts high probability on discriminative directions. Spherical sliced Wasserstein, which is defined between distributions supported on the hypersphere, is introduced in [4]. A sliced Wasserstein variant between probability measures over images based on convolution is defined in [43].

Despite these improvements, one common property of previous SW variants is that they use independent projecting directions sampled from a distribution over a space of projecting directions, e.g., the unit hypersphere. Those projecting directions are then used to project the two measures of interest to corresponding pairs of one-dimensional measures. Due to this independence, practitioners have reported that many projections do not have the power to discriminate between the two input probability measures [26, 15]. Moreover, using many projections leads to redundancy and wasted computation on uninformative pairs of projected measures. This problem is known as the projection complexity limitation of SW.

To partially address the issue, the max sliced Wasserstein (Max-SW) distance is introduced in [14]. Max-SW seeks the best projecting direction, namely the one that maximizes the projected Wasserstein distance. Since Max-SW involves a constrained optimization problem, a projected sub-gradient ascent algorithm is used. Because the algorithm is only guaranteed to reach a local maximum [46], the empirical estimation of Max-SW is unstable in practice [42], as the metricity of Max-SW holds only at the global optimum. Another approach is to enforce orthogonality between projecting directions. In particular, K-sliced Wasserstein (K-SW) [50] uses $K > 1$ orthogonal projecting directions. Moreover, to generalize Max-SW and K-SW, the max-K sliced Wasserstein (Max-K-SW) distance ($K > 1$) appears in [12], which finds the best $K$ mutually orthogonal projecting directions via projected sub-gradient ascent. Nevertheless, the orthogonality constraint is computationally expensive and might not reflect the discrepancy between general measures well. Moreover, Max-K-SW also suffers from the non-optimality problem, which leads to losing the metricity property in practice.

To avoid independence and to satisfy the requirement of creating informative projecting directions efficiently, we propose to impose a sequential structure on the projecting directions: we choose a new projecting direction based on the previously chosen directions. For computational efficiency, we consider a first-order Markov structure in this paper, which means that a projecting direction is sampled using only the previous direction. The first projecting direction can follow any distribution on the unit hypersphere used in the literature, e.g., the uniform distribution [7] or the von Mises-Fisher distribution [23, 45], to guarantee metricity.
For the transition distribution of the second and later projecting directions, we propose two families: the orthogonal-based transition and the input-awared transition. For the orthogonal-based transition, we choose the projecting direction uniformly on the unit hypersphere such that it is orthogonal to the previous direction. In contrast to this transition, which does not use information from the two input measures, the input-awared transition uses the sub-gradient, with respect to the previous projecting direction, of the corresponding projected Wasserstein distance between the two measures to design the transition. In particular, a projected sub-gradient update is used to create the new projecting direction. Moreover, we further improve the computational time and memory by introducing a burning and thinning technique to reduce the number of random projecting directions.

Contribution. In summary, our contributions are two-fold:

1. We propose a novel family of distances on the space of probability measures, named Markovian sliced Wasserstein (MSW) distances. MSW considers a first-order Markov structure on random projecting directions. Moreover, we derive three variants of MSW that use two different types of conditional transition distributions: orthogonal-based and input-awared. We investigate the theoretical properties of MSW, including topological properties (metricity, weak convergence, and connections to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Moreover, we introduce a burning and thinning approach to further reduce computational and memory complexity, and we discuss the properties of the resulting distances.

2. We conduct experiments to compare MSW with SW, Max-SW, K-SW, and Max-K-SW in various applications, namely, gradient flows, color transfer, and deep generative models on standard image datasets: CIFAR10 and CelebA. We show that the input-awared MSW can yield better qualitative and quantitative performance while consuming less computation than previous distances in gradient flows and color transfer, and comparable computation in deep generative modeling. Finally, we investigate the role of the hyper-parameters of the distances, e.g., the number of projections, the number of time steps, and so on, in applications.

Organization. We first provide background on the Wasserstein distance, sliced Wasserstein distance, and max sliced Wasserstein distance in Section 2. In Section 3, we propose Markovian sliced Wasserstein distances and derive their theoretical properties. Section 4 contains the comparison of MSW to previous SW variants in gradient flows, color transfer, and deep generative modeling. We then conclude the paper in Section 5. Finally, we defer the proofs of key results and supplementary materials to the Appendices.

Notation. For $p \geq 1$, $\mathcal{P}_p(\mathbb{R}^d)$ is the set of all probability measures on $\mathbb{R}^d$ with finite $p$-th moments. For any $d \geq 2$, $\mathcal{U}(\mathbb{S}^{d-1})$ denotes the uniform measure over the unit hypersphere $\mathbb{S}^{d-1} := \{\theta \in \mathbb{R}^d \mid \|\theta\|_2^2 = 1\}$. For any two sequences $a_n$ and $b_n$, the notation $a_n = \mathcal{O}(b_n)$ means that $a_n \leq C b_n$ for all $n \geq 1$, where $C$ is some universal constant. We denote by $\theta \sharp \mu$ the push-forward measure of $\mu$ through the function $f : \mathbb{R}^d \to \mathbb{R}$ defined as $f(x) = \theta^\top x$.

2 Background

We start by reviewing the background on the Wasserstein distance, sliced Wasserstein distances, their computation techniques, and their limitations.
Wasserstein distance. Given two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$, the Wasserstein distance [57, 48] between $\mu$ and $\nu$ is:
$$\text{W}_p^p(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_p^p \, d\pi(x, y), \quad (1)$$
where $\Pi(\mu, \nu)$ is the set of all couplings whose marginals are $\mu$ and $\nu$, respectively. The computational complexity and memory complexity of the Wasserstein distance are $\mathcal{O}(n^3 \log n)$ and $\mathcal{O}(n^2)$, respectively, when $\mu$ and $\nu$ have at most $n$ supports. When $d = 1$, the Wasserstein distance has a closed form: $\text{W}_p^p(\mu, \nu) = \int_0^1 |F_\mu^{-1}(z) - F_\nu^{-1}(z)|^p \, dz$, where $F_\mu$ and $F_\nu$ are the cumulative distribution functions (CDFs) of $\mu$ and $\nu$, respectively.

Sliced Wasserstein distance. By randomly projecting two high-dimensional measures of interest to corresponding pairs of one-dimensional measures, the sliced Wasserstein (SW) distance can exploit the closed form of the Wasserstein distance in one dimension. The definition of the sliced Wasserstein distance [7] between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{SW}_p^p(\mu, \nu) = \mathbb{E}_{\theta \sim \mathcal{U}(\mathbb{S}^{d-1})}\left[\text{W}_p^p(\theta \sharp \mu, \theta \sharp \nu)\right]. \quad (2)$$
Monte Carlo samples are often used to approximate the intractable expectation unbiasedly:
$$\widehat{\text{SW}}_p^p(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu), \quad (3)$$
where $\theta_1, \ldots, \theta_L$ are drawn randomly from $\mathcal{U}(\mathbb{S}^{d-1})$. When $\mu$ and $\nu$ are discrete measures that have at most $n$ supports in $d$ dimensions, the computational complexity of SW is $\mathcal{O}(Ln \log_2 n + Ldn)$, and the memory complexity for storing the projecting directions and the projected supports is $\mathcal{O}(L(d + n))$. Here, $Ln \log_2 n$ is for sorting the $L$ sets of projected supports and $Ldn$ is for projecting the supports to $L$ sets of scalars.

Max sliced Wasserstein distance. To select the best discriminative projecting direction, the max sliced Wasserstein (Max-SW) distance [14] between $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is introduced as follows:
$$\text{Max-SW}_p(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} \text{W}_p(\theta \sharp \mu, \theta \sharp \nu). \quad (4)$$
Computing Max-SW requires solving a constrained optimization problem. In practice, a projected sub-gradient ascent algorithm with $T > 1$ iterations is often used to obtain a surrogate projecting direction $\hat{\theta}_T$ for the global optimum. Hence, the empirical Max-SW distance is $\widehat{\text{Max-SW}}_p(\mu, \nu) = \text{W}_p(\hat{\theta}_T \sharp \mu, \hat{\theta}_T \sharp \nu)$. The details of the projected sub-gradient ascent algorithm are given in Algorithm 1 in Appendix A.1. The computational complexity of Max-SW is $\mathcal{O}(Tn \log_2 n + Tdn)$ and the memory complexity of Max-SW is $\mathcal{O}(d + n)$. It is worth noting that projected sub-gradient ascent can only yield a local maximum [46]. Therefore, the empirical Max-SW might not be a distance even when $T \to \infty$, since the metricity of Max-SW is only obtained at the global maximum.

K-sliced Wasserstein distance. The authors in [50] propose to estimate the sliced Wasserstein distance based on orthogonal projecting directions. We refer to this distance as the K-sliced Wasserstein distance (K-SW). The definition of K-SW between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{K-SW}_p^p(\mu, \nu) = \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \text{W}_p^p(\theta_i \sharp \mu, \theta_i \sharp \nu)\right],$$
where the expectation is with respect to $(\theta_1, \ldots, \theta_K) \sim \mathcal{U}(\mathcal{V}_K(\mathbb{R}^d))$, with $\mathcal{V}_K(\mathbb{R}^d) = \{(\theta_1, \ldots, \theta_K) \in (\mathbb{S}^{d-1})^K \mid \langle \theta_i, \theta_j \rangle = 0 \;\forall\, i \neq j \leq K\}$ the Stiefel manifold. The expectation can be approximated with Monte Carlo samples $(\theta_{1l}, \ldots, \theta_{Kl})_{l=1}^{L}$ from $\mathcal{U}(\mathcal{V}_K(\mathbb{R}^d))$. In the original paper, $L$ is set to 1. Sampling from the uniform distribution over the Stiefel manifold $\mathcal{U}(\mathcal{V}_K(\mathbb{R}^d))$ requires the Gram-Schmidt orthogonalization process, which has computational complexity $\mathcal{O}(K^2 d)$ (quadratic in $K$). Therefore, the total computational complexity of K-SW is $\mathcal{O}(LKn \log_2 n + LKdn + LK^2 d)$ and the memory complexity of K-SW is $\mathcal{O}(LK(d + n))$.
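As a concrete reference point, here is a minimal NumPy sketch of the one-dimensional closed form and of the SW estimator in Equation (3). It assumes discrete measures with the same number $n$ of uniformly weighted supports, and the function names are ours rather than the authors' released code.

```python
import numpy as np

def one_d_wasserstein_p(x_proj, y_proj, p=2):
    # Closed form in 1D for two empirical measures with n uniformly weighted
    # supports each: sort both sets of projections and match them in order.
    x_sorted = np.sort(x_proj)
    y_sorted = np.sort(y_proj)
    return np.mean(np.abs(x_sorted - y_sorted) ** p)  # this is W_p^p

def sliced_wasserstein(X, Y, L=100, p=2, rng=None):
    # X, Y: (n, d) arrays of supports of two empirical measures.
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    # L i.i.d. projecting directions, uniform on the unit hypersphere S^{d-1}.
    theta = rng.standard_normal((L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project supports to 1D and average the closed-form W_p^p over directions.
    sw_pp = np.mean([one_d_wasserstein_p(X @ th, Y @ th, p) for th in theta])
    return sw_pp ** (1.0 / p)
```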
More details on K-SW, including the Gram-Schmidt process and sampling uniformly from the Stiefel manifold, are given in Appendix A.1.

Max K sliced Wasserstein distance. To generalize both Max-SW and K-SW, the max K sliced Wasserstein distance is introduced in [12]. Its definition between $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{Max-K-SW}_p^p(\mu, \nu) = \max_{(\theta_1, \ldots, \theta_K) \in \mathcal{V}_K(\mathbb{R}^d)} \frac{1}{K} \sum_{i=1}^{K} \text{W}_p^p(\theta_i \sharp \mu, \theta_i \sharp \nu).$$
Similar to Max-SW, a projected sub-gradient ascent algorithm with $T > 1$ iterations is used to approximate Max-K-SW. We refer the reader to Algorithm 4 in Appendix A.1 for greater detail. Since the projection operator onto the Stiefel manifold is the Gram-Schmidt process, the computational complexity of Max-K-SW is $\mathcal{O}(TKn \log_2 n + TKdn + TK^2 d)$. The memory complexity of Max-K-SW is $\mathcal{O}(K(d + n))$. Similar to Max-SW, the metricity of Max-K-SW is only obtained at the global optimum; hence, the empirical estimation might not be stable. Moreover, the orthogonality constraint is also computationally expensive, i.e., quadratic in the number of orthogonal projections $K$.

3 Markovian Sliced Wasserstein Distances

We first define the Markovian sliced Wasserstein (MSW) distance and discuss its theoretical properties, including topological properties, statistical properties, and computational properties, in Section 3.1. In Section 3.2, we discuss some choices in designing the Markov chain, including the prior distribution and the transition distribution. Finally, we discuss the burning and thinning variant of MSW, which can reduce the computational and memory complexity, in Section 3.3.

3.1 Definitions, Topological, Statistical, and Computational Properties

We start with a general definition of the Markovian sliced Wasserstein distance in Definition 1.

Definition 1. For any $p \geq 1$, $T \geq 1$, and dimension $d \geq 1$, the Markovian sliced Wasserstein distance of order $p$ between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{MSW}_{p,T}^p(\mu, \nu) = \mathbb{E}_{\theta_{1:T} \sim \sigma(\theta_{1:T})}\left[\frac{1}{T} \sum_{t=1}^{T} \text{W}_p^p(\theta_t \sharp \mu, \theta_t \sharp \nu)\right],$$
where $T$ is the number of time steps, the expectation is under the projecting distribution $\theta_{1:T} \sim \sigma(\theta_{1:T})$ with $\sigma(\theta_{1:T}) = \sigma(\theta_1, \ldots, \theta_T) = \sigma_1(\theta_1) \prod_{t=2}^{T} \sigma_t(\theta_t \mid \theta_{t-1})$, and $\sigma_1(\theta_1), \sigma_t(\theta_t \mid \theta_{t-1}) \in \mathcal{P}(\mathbb{S}^{d-1})$ for all $t = 1, \ldots, T$.

The first projecting direction $\theta_1$ follows the distribution $\sigma_1(\theta_1)$, which can be any distribution on the unit hypersphere, e.g., the uniform distribution, a von Mises-Fisher distribution, and so on. By designing the transition distribution $\sigma_t(\theta_t \mid \theta_{t-1})$, we can obtain various variants of MSW. It is worth noting that MSW can be rewritten as the average of $T$ expectations of one-dimensional Wasserstein distances, $\text{MSW}_{p,T}^p(\mu, \nu) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\theta_t \sim \sigma_t(\theta_t)}[\text{W}_p^p(\theta_t \sharp \mu, \theta_t \sharp \nu)]$; however, the marginals $\sigma_t(\theta_t) = \int \sigma_1(\theta_1) \prod_{t'=2}^{t} \sigma_{t'}(\theta_{t'} \mid \theta_{t'-1}) \, d\theta_1 \ldots d\theta_{t-1}$ are not the same for $t = 1, \ldots, T$. Moreover, sampling directly from $\sigma_t(\theta_t)$ is intractable; hence, using the definition of MSW in Definition 1 is more reasonable in terms of approximating the expectation with Monte Carlo samples.

Monte Carlo estimation. Similar to SW, we also need Monte Carlo samples to approximate the expectation in Definition 1. We first sample $\theta_{1,1}, \ldots, \theta_{L,1} \sim \sigma_1(\theta_1)$ for $L \geq 1$; then we sample $\theta_{l,t} \sim \sigma_t(\theta_t \mid \theta_{l,t-1})$ for $t = 2, \ldots, T$ and $l = 1, \ldots, L$. After that, we can form an unbiased empirical estimation of MSW as follows:
$$\widehat{\text{MSW}}_{p,T}^p(\mu, \nu) = \frac{1}{LT} \sum_{l=1}^{L} \sum_{t=1}^{T} \text{W}_p^p(\theta_{l,t} \sharp \mu, \theta_{l,t} \sharp \nu).$$
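To make the estimator above concrete, here is a minimal Python sketch of $\widehat{\text{MSW}}_{p,T}^p$ that reuses the `one_d_wasserstein_p` helper from the earlier SW sketch. The callables `sample_prior` and `sample_transition` are placeholders standing in for $\sigma_1$ and $\sigma_t(\cdot \mid \theta_{t-1})$ (concrete transitions are sketched in Section 3.2), and all names are ours rather than the released code.

```python
def markovian_sliced_wasserstein(X, Y, sample_prior, sample_transition,
                                 L=10, T=5, p=2):
    """Monte Carlo estimate of MSW_{p,T}^p between two empirical measures.

    X, Y: (n, d) arrays of supports with uniform weights.
    sample_prior() -> a first direction theta_{l,1} on S^{d-1}.
    sample_transition(theta_prev, X, Y) -> the next direction theta_{l,t};
    it may ignore X, Y (orthogonal-based) or use them (input-awared).
    """
    total = 0.0
    for l in range(L):          # L independent Markov chains of directions
        theta = sample_prior()
        for t in range(T):      # T time steps per chain
            if t > 0:
                theta = sample_transition(theta, X, Y)
            total += one_d_wasserstein_p(X @ theta, Y @ theta, p)
    return total / (L * T)      # estimate of MSW^p; take **(1/p) for the distance
```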
Before going to the specific design of those distributions, we investigate the theoretical properties of MSW, including topological properties, statistical properties, and computational properties.

Topological properties. We first state the following assumption:

A1. In MSW, the prior distribution $\sigma_1(\theta_1)$ is supported on the whole unit hypersphere, or there exists a transition distribution $\sigma_t(\theta_t \mid \theta_{t-1})$ that is supported on the whole unit hypersphere.

The assumption A1 is easy to satisfy, and it holds for all later choices of the prior distribution and transition distribution. We now consider the metricity of the Markovian sliced Wasserstein distance.

Theorem 1 (Metricity). For any $p \geq 1$, $T \geq 1$, and dimension $d \geq 1$, if A1 holds, the Markovian sliced Wasserstein $\text{MSW}_{p,T}(\cdot, \cdot)$ is a valid metric on the space of probability measures $\mathcal{P}_p(\mathbb{R}^d)$; namely, it satisfies (i) non-negativity, (ii) symmetry, (iii) the triangle inequality, and (iv) identity.

The proof of Theorem 1 is in Appendix B.1. Next, we show that convergence in MSW implies the weak convergence of probability measures and that the reverse also holds.

Theorem 2 (Weak Convergence). For any $p \geq 1$, $T \geq 1$, and dimension $d \geq 1$, if A1 holds, the convergence of probability measures in $\mathcal{P}_p(\mathbb{R}^d)$ under the Markovian sliced Wasserstein distance $\text{MSW}_{p,T}(\cdot, \cdot)$ implies weak convergence of probability measures and vice versa.

Theorem 2 means that for any sequence of probability measures $(\mu_k)_{k \in \mathbb{N}}$ and $\mu$ in $\mathcal{P}_p(\mathbb{R}^d)$, $\lim_{k \to +\infty} \text{MSW}_{p,T}(\mu_k, \mu) = 0$ if and only if for any continuous and bounded function $f : \mathbb{R}^d \to \mathbb{R}$, $\lim_{k \to +\infty} \int f \, d\mu_k = \int f \, d\mu$. The proof of Theorem 2 is in Appendix B.2. Next, we discuss the connection of MSW to previous sliced Wasserstein variants.

Proposition 1. For any $p \geq 1$ and dimension $d \geq 1$: (i) for any $T \geq 1$ and $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, $\text{MSW}_{p,T}(\mu, \nu) \leq \text{Max-SW}_p(\mu, \nu) \leq \text{W}_p(\mu, \nu)$; (ii) if $T = 1$ and the prior is $\sigma_1(\theta_1) := \mathcal{U}(\mathbb{S}^{d-1})$, then $\text{MSW}_{p,T}(\mu, \nu) = \text{SW}_p(\mu, \nu)$.

The proof of Proposition 1 is in Appendix B.3.

Statistical properties. We investigate the sample complexity (empirical estimation rate) of MSW.

Proposition 2 (Sample Complexity). Let $X_1, X_2, \ldots, X_n$ be i.i.d. samples from a probability measure $\mu$ supported on a compact set of $\mathbb{R}^d$. We denote the empirical measure $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$. Then, for any $p \geq 1$ and $T \geq 1$, there exists a universal constant $C > 0$ such that
$$\mathbb{E}[\text{MSW}_{p,T}(\mu_n, \mu)] \leq C \sqrt{(d + 1) \log n / n},$$
where the outer expectation is taken with respect to the data $X_1, X_2, \ldots, X_n$.

The proof of Proposition 2 is in Appendix B.4. The proposition suggests that MSW does not suffer from the curse of dimensionality. Next, we investigate MSW's Monte Carlo approximation error.

Proposition 3 (Monte Carlo error). For any $p \geq 1$, $T \geq 1$, dimension $d \geq 1$, and $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, we have:
$$\mathbb{E}\left|\widehat{\text{MSW}}_{p,T}^p(\mu, \nu) - \text{MSW}_{p,T}^p(\mu, \nu)\right| \leq \frac{1}{T \sqrt{L}} \, \mathrm{Var}\left[\sum_{t=1}^{T} \text{W}_p^p(\theta_t \sharp \mu, \theta_t \sharp \nu)\right]^{\frac{1}{2}},$$
where the variance is with respect to $\sigma(\theta_1, \ldots, \theta_T)$.

The proof of Proposition 3 is in Appendix B.5. From the above proposition, we know that increasing the number of projections $L$ reduces the approximation error.

Computational properties. When $\mu$ and $\nu$ are two discrete probability measures in $\mathcal{P}_p(\mathbb{R}^d)$ that have at most $n$ supports, the computational complexity for the Monte Carlo approximation of MSW is $\mathcal{O}(TLn \log_2 n + TLdn)$, where $\mathcal{O}(TLn \log_2 n)$ is for the computation of $TL$ one-dimensional Wasserstein distances and $\mathcal{O}(TLdn)$ is the projecting complexity for $TL$ projections from $d$ dimensions to one dimension. The memory complexity of MSW is $\mathcal{O}(TL(d + n))$ for storing the projecting directions and the projections.
3.2 Specific Choices of Projecting Distributions

Designing the projecting distribution $\sigma(\theta_1, \ldots, \theta_T)$ is the central task in using MSW since it controls the projecting behavior; each choice of $\sigma(\theta_1, \ldots, \theta_T)$ yields a variant of MSW. Since we impose the first-order Markov structure $\sigma(\theta_1, \ldots, \theta_T) = \sigma_1(\theta_1) \prod_{t=2}^{T} \sigma_t(\theta_t \mid \theta_{t-1})$, there are two types of distributions that we need to choose: the prior distribution $\sigma_1(\theta_1)$ and the transition distributions $\sigma_t(\theta_t \mid \theta_{t-1})$ for all $t = 2, \ldots, T$.

Prior distribution. The simplest choice of $\sigma_1(\theta_1)$, when we know nothing about the probability measures that we want to compare, is the uniform distribution over the unit hypersphere, $\mathcal{U}(\mathbb{S}^{d-1})$. Moreover, with this choice the metricity of MSW is guaranteed regardless of the transition distribution. Therefore, the uniform distribution is the choice that we use in our experiments in the paper. It is worth noting that we could also use a distribution that is estimated from the two probability measures of interest [44]; however, this approach costs more computation.

Transition distribution. We discuss some specific choices of the transition distributions $\sigma_t(\theta_t \mid \theta_{t-1})$. Detailed algorithms for computing MSW with specific transitions are given in Appendix A.3.

Orthogonal-based transition. Motivated by the orthogonality constraint in Max-K-SW and K-SW, we can design a transition distribution that gives a projecting direction orthogonal to the previous one. In particular, given a previous projecting direction $\theta_{t-1}$, we want $\theta_t$ such that $\langle \theta_t, \theta_{t-1} \rangle = 0$; namely, we want to sample from the subsphere $\mathbb{S}^{d-1}_{\theta_{t-1}} := \{\theta_t \in \mathbb{S}^{d-1} \mid \langle \theta_t, \theta_{t-1} \rangle = 0\}$. To the best of our knowledge, there is no explicit form of distribution (known pdf) defined on that set. However, we can still sample from the uniform distribution over that set, $\mathcal{U}(\mathbb{S}^{d-1}_{\theta_{t-1}})$, since that distribution can be constructed by pushing the uniform distribution over the whole unit hypersphere $\mathcal{U}(\mathbb{S}^{d-1})$ through the projection operator
$$\text{Proj}_{\theta_{t-1}}(\theta_t) = \text{Proj}_{\mathbb{S}^{d-1}}\left(\theta_t - \frac{\langle \theta_{t-1}, \theta_t \rangle}{\langle \theta_{t-1}, \theta_{t-1} \rangle} \theta_{t-1}\right),$$
where $\text{Proj}_{\mathbb{S}^{d-1}}(\theta) = \frac{\theta}{\|\theta\|_2}$ is the normalizing operator. In greater detail, we first sample $\theta_t' \sim \mathcal{U}(\mathbb{S}^{d-1})$ and then set $\theta_t = \text{Proj}_{\theta_{t-1}}(\theta_t')$. Therefore, we have $\sigma_t(\theta_t \mid \theta_{t-1}) = \mathcal{U}(\mathbb{S}^{d-1}_{\theta_{t-1}}) = \text{Proj}_{\theta_{t-1}} \sharp \, \mathcal{U}(\mathbb{S}^{d-1})$.

Input-awared transition. The orthogonal-based transition does not take into account the information of the two probability measures $\mu$ and $\nu$ that we want to compare. Hence, it could be inefficient at exploring good projecting directions for comparing $\mu$ and $\nu$. Motivated by the projected sub-gradient ascent [9] update for finding the "max" projecting direction, we can design the transition distribution as $\sigma_t(\theta_t \mid \theta_{t-1}) = \delta_{f(\theta_{t-1} \mid \eta, \mu, \nu)}$, where $\delta$ denotes the Dirac delta function and the transition function is
$$f(\theta_{t-1} \mid \eta, \mu, \nu) = \text{Proj}_{\mathbb{S}^{d-1}}\left(\theta_{t-1} + \eta \nabla_{\theta_{t-1}} \text{W}_p(\theta_{t-1} \sharp \mu, \theta_{t-1} \sharp \nu)\right),$$
with $\eta > 0$ the step-size hyperparameter. As this choice is a deterministic transition, it requires the prior distribution to be supported on the whole $\mathbb{S}^{d-1}$ to obtain the metricity of MSW. A choice that guarantees metricity regardless of the prior distribution is the von Mises-Fisher (vMF) distribution [23], namely, $\sigma_t(\theta_t \mid \theta_{t-1}) = \text{vMF}(\theta_t \mid \epsilon = f(\theta_{t-1} \mid \eta, \mu, \nu), \kappa)$. Details about the vMF distribution, including its probability density function, its sampling procedure, and its properties, are given in Appendix A.2. In summary, the vMF distribution has two parameters: the location parameter $\epsilon \in \mathbb{S}^{d-1}$, which is the mean, and the concentration parameter $\kappa \in \mathbb{R}_+$, which plays the role of the precision. Thanks to the interpolation properties of the vMF distribution, $\lim_{\kappa \to 0} \text{vMF}(\theta \mid \epsilon, \kappa) = \mathcal{U}(\mathbb{S}^{d-1})$ and $\lim_{\kappa \to \infty} \text{vMF}(\theta \mid \epsilon, \kappa) = \delta_\epsilon$, the transition distribution can balance between heading to the "max" projecting direction and exploring the space of directions.
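The two transitions above can be plugged into the MSW estimator sketched earlier. Below is a hedged NumPy sketch of both: the orthogonal-based transition follows the push-forward construction above, and the input-awared transition takes a projected sub-gradient step using the gradient of $\text{W}_p^p$ with the one-dimensional matching held fixed (a positive rescaling of the $\text{W}_p$ gradient, so the ascent direction is unchanged). Function names, the `eta` default, and the equal-size, uniform-weight assumption are ours; the vMF variant (viMSW) would replace the deterministic output of the second function with a vMF draw centered at it.

```python
def orthogonal_transition(theta_prev, X=None, Y=None, rng=None):
    # Uniform sample on the subsphere orthogonal to theta_prev: draw an isotropic
    # Gaussian, remove the component along theta_prev (unit-norm), renormalize;
    # in distribution this equals Proj_{theta_{t-1}} # U(S^{d-1}).
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.standard_normal(theta_prev.shape)
    theta -= np.dot(theta, theta_prev) * theta_prev
    return theta / np.linalg.norm(theta)

def input_awared_transition(theta_prev, X, Y, eta=0.1, p=2):
    # One projected sub-gradient ascent step on theta -> W_p^p(theta#mu, theta#nu)
    # for empirical measures with n uniformly weighted supports each.
    x_proj, y_proj = X @ theta_prev, Y @ theta_prev
    ix, iy = np.argsort(x_proj), np.argsort(y_proj)   # optimal 1D matching
    diff_proj = x_proj[ix] - y_proj[iy]               # (n,)
    diff_sup = X[ix] - Y[iy]                          # (n, d)
    weights = p * np.abs(diff_proj) ** (p - 1) * np.sign(diff_proj)
    grad = (weights @ diff_sup) / len(diff_proj)      # sub-gradient w.r.t. theta
    theta = theta_prev + eta * grad
    return theta / np.linalg.norm(theta)              # Proj onto S^{d-1}
```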
Stationarity of $\sigma_T(\theta_T)$. A natural and important question arises: what is the distribution $\sigma_T(\theta_T) = \int \ldots \int \sigma(\theta_1, \ldots, \theta_T) \, d\theta_1 \ldots d\theta_{T-1}$ when $T \to \infty$? The answer depends on the choice of the projecting distribution discussed in Section 3.2. For the orthogonal-based transition with the uniform prior, it is unclear whether a stationary distribution exists. For the deterministic input-awared transition with the uniform prior, we have $\lim_{T \to \infty} \sigma_T(\theta_T) = \sum_{a=1}^{A} \alpha_a \delta_{\theta_a^\star}$ with $\sum_{a=1}^{A} \alpha_a = 1$, where $\theta_a^\star$ ($a = 1, \ldots, A$) are the local maxima of the optimization problem $\max_{\theta \in \mathbb{S}^{d-1}} \text{W}_p(\theta \sharp \mu, \theta \sharp \nu)$ and the unknown weights $\alpha_a$ depend on $\mu$ and $\nu$. This property follows from the fact that projected sub-gradient ascent guarantees convergence to local maxima [46]. For the input-awared vMF transition, it is also unclear whether a stationary distribution exists when the parameter $\kappa < \infty$.

3.3 Burning and Thinning

In the definition of MSW in Definition 1, we take the expectation over the joint distribution of all time steps $\sigma(\theta_{1:T})$, which makes the time and memory complexities of the Monte Carlo approximation linear in $T$. Therefore, we adapt a practical technique from MCMC methods, burning and thinning, to reduce the number of random variables while keeping the dependency structure.

Definition 2. For any $p \geq 1$, $T \geq 1$, dimension $d \geq 1$, number of burned steps $M \geq 0$, and number of thinned steps $N \geq 1$, the burned-thinned Markovian sliced Wasserstein distance of order $p$ between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{MSW}_{p,T,N,M}^p(\mu, \nu) = \mathbb{E}\left[\frac{N}{T - M} \sum_{t=1}^{(T-M)/N} \text{W}_p^p(\theta_t' \sharp \mu, \theta_t' \sharp \nu)\right],$$
where the expectation is under the projecting distribution $\theta_{1:(T-M)/N}' \sim \sigma(\theta_{1:(T-M)/N}')$, with $\sigma(\theta_{1:(T-M)/N}')$ the marginal distribution obtained by integrating out from $\sigma(\theta_1, \ldots, \theta_T) = \sigma_1(\theta_1) \prod_{t=2}^{T} \sigma_t(\theta_t \mid \theta_{t-1})$ the random projecting directions at the burned or thinned time steps, i.e., those $t$ with $t \leq M$ or $t \,\%\, N \neq 0$.

Similar to MSW, the burned-thinned MSW is also a metric on $\mathcal{P}_p(\mathbb{R}^d)$ when there exists a time step $t$ that is neither burned nor thinned and whose direction $\theta_t$ is a random variable supported on the whole $\mathbb{S}^{d-1}$. We discuss more details about the burned-thinned MSW, including its topological and statistical properties, in Appendix A.4. The Monte Carlo estimation of the burned-thinned MSW is given in Equation (10) in Appendix A.4; the approximation is the average of the projected Wasserstein distances over the directions $\theta_{l,t}$ with $t > M$ and $t \,\%\, N = 0$. By reducing the number of random projecting directions, the computational complexity of the burned-thinned MSW improves to $\mathcal{O}\left(\left((T - M)Ln \log_2 n + (T - M)Ldn\right)/N\right)$ for the orthogonal-based transition. For the input-awared transition, the computational complexity is still $\mathcal{O}(TLn \log_2 n + TLdn)$, since every transition requires computing the gradient of the projected Wasserstein distance. However, in all cases, the memory complexity is reduced to $\mathcal{O}((T - M)L(d + n)/N)$.

Burned-thinned MSW is a generalization of Max-SW. The empirical computation of Max-SW [14] with projected sub-gradient ascent and uniform random initialization can be viewed as a special case of burned-thinned MSW with the input-awared transition and the number of burned steps $M = T - 1$. The difference is that Max-SW uses only one local maximum to compute the distance, while the burned-thinned MSW uses $L \geq 1$ maxima (which might not be unique).
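A minimal sketch of the burned-thinned estimator, again reusing the earlier helpers. It assumes the reading of Definition 2 above (a direction contributes only when $t > M$ and $t \,\%\, N = 0$), which is our interpretation of the indexing, and the caller must pick $M$, $N$, and $T$ so that at least one time step survives.

```python
def burned_thinned_msw(X, Y, sample_prior, sample_transition,
                       L=10, T=10, M=2, N=2, p=2):
    # Run the same Markov chains of directions as MSW, but only directions at
    # time steps t > M with t % N == 0 contribute to the average.
    vals = []
    for l in range(L):
        theta = sample_prior()
        for t in range(1, T + 1):
            if t > 1:
                theta = sample_transition(theta, X, Y)
            if t > M and t % N == 0:
                vals.append(one_d_wasserstein_p(X @ theta, Y @ theta, p))
    return float(np.mean(vals))   # estimate of the burned-thinned MSW^p
```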
More discussions. We refer the reader to Appendix A.5 for other related discussions, e.g., "K-SW is an autoregressive decomposition of the projecting distribution", "sequential generalization of Max-K-SW", and related literature.

Figure 1: Gradient flows from the empirical distribution over the colored points to the empirical distribution over the S-shape points, shown for SW (L=30), Max-SW (T=30), iMSW (L=2, T=5), and viMSW (L=2, T=5, κ=50). The Wasserstein-2 distance between the empirical distribution at the current step and the S-shape distribution, and the computational time (in seconds) to reach that step, are reported at the top of each panel.

Figure 2: The source image, the target image, and the transferred images from different distances. The Wasserstein-2 distance between the empirical distribution over the transferred color palette and the empirical distribution over the target color palette, and the computational time (in seconds), are reported at the top of each image: SW (L=45): 41.77 s, W2 = 414.51; Max-SW (T=45): 59.13 s, W2 = 449.42; K-SW (L=15, K=3): 38.86 s, W2 = 411.74; Max-K-SW (K=3, T=15): 53.95 s, W2 = 479.43; oMSW (L=3, T=5): 13.69 s, W2 = 415.06; iMSW (L=3, T=5): 25.91 s, W2 = 16.97; viMSW (L=3, T=5, κ=50): 29.14 s, W2 = 16.48; viMSW (L=3, T=5, κ=100): 29.06 s, W2 = 17.09.

4 Experiments

In this section, we refer to MSW with the orthogonal-based transition as oMSW and to MSW with the input-awared transition as iMSW (using the Dirac delta transition) or viMSW (using the vMF transition). We compare the MSW variants to SW, Max-SW, K-SW, and Max-K-SW in standard applications, e.g., gradient flows, color transfer, and deep generative models. Moreover, we also investigate the role of hyperparameters, e.g., the concentration parameter $\kappa$, the number of projections $L$, the number of time steps $T$, the number of burned steps $M$, and the number of thinned steps $N$, in applications.

4.1 Gradient Flows and Color Transfer

Gradient flows. We follow the same setting as in [17]. A gradient flow models a distribution $\mu(t)$ flowing with time $t$ along the gradient flow of a loss functional $\mu(t) \mapsto \mathcal{D}(\mu(t), \nu)$ that drives it towards a target distribution $\nu$ [53], where $\mathcal{D}$ is a given distance between probability measures. In this setup, we consider $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i}$ as a fixed empirical target distribution and the model distribution $\mu(t) = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i(t)}$. Here, the model distribution is parameterized by a time-varying point cloud $X(t) = (X_i(t))_{i=1}^{n} \in \mathbb{R}^{d \times n}$. Starting from an initial condition at time $t = 0$, we integrate the ordinary differential equation $\dot{X}(t) = -n \nabla_{X(t)}\left[\mathcal{D}\left(\frac{1}{n} \sum_{i=1}^{n} \delta_{X_i(t)}, \nu\right)\right]$. In the experiments, we utilize the Euler scheme with 300 time steps and step size $10^{-3}$ to move the empirical distribution over 1000 colored points ($\mu(0)$) to the distribution over 1000 S-shape points ($\nu$) (see Figure 1). For Max-SW, Max-K-SW, iMSW, and viMSW, we use the learning rate $\eta = 0.1$ for the projecting directions.
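To illustrate the Euler scheme above, here is a hedged PyTorch sketch using the plain $\text{SW}_2^2$ loss as $\mathcal{D}$; swapping in the MSW estimator with one of the transitions from Section 3.2 follows the same pattern. The function names and the autograd-based gradient are ours, not the released implementation.

```python
import torch

def sw2_squared(X, Y, L=30):
    # Differentiable Monte Carlo estimate of SW_2^2 between two (n, d) point clouds.
    theta = torch.randn(L, X.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)   # L directions on S^{d-1}
    x_proj = (X @ theta.T).sort(dim=0).values         # sort projections per direction
    y_proj = (Y @ theta.T).sort(dim=0).values
    return ((x_proj - y_proj) ** 2).mean()

def sw_gradient_flow(X0, Y, steps=300, step_size=1e-3, L=30):
    # Euler discretization of X'(t) = -n * grad_X D(mu_X, nu) with D = SW_2^2,
    # where mu_X is the empirical measure over the rows of X.
    X = X0.clone().requires_grad_(True)
    n = X.shape[0]
    for _ in range(steps):
        loss = sw2_squared(X, Y, L=L)
        grad, = torch.autograd.grad(loss, X)
        with torch.no_grad():
            X -= step_size * n * grad
    return X.detach()
```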
We report the Wasserstein-2 distances between the empirical distribution $\mu(t)$ and the target empirical distribution $\nu$, and the computational time, in Table 1. We also give visualizations of some obtained flows in Figure 1, and refer the reader to Figure 4 in Appendix C.1 for the full visualization of all flows and detailed algorithms.

Table 1: Summary of Wasserstein-2 distances and computational time in seconds (s) of different gradient flows.

| Distance | Wasserstein-2 (↓) | Time in s (↓) |
|---|---|---|
| SW (L=30) | 0.0099 × 10⁻² | 1.55 |
| Max-SW (T=30) | 0.0098 × 10⁻² | 3.48 |
| K-SW (L=15, K=2) | 0.0098 × 10⁻² | 1.71 |
| Max-K-SW (K=2, T=15) | 0.0146 × 10⁻² | 3.35 |
| iMSW (L=2, T=5) (ours) | 0.0064 × 10⁻² | 1.41 |
| viMSW (L=2, T=5, κ=50) (ours) | 0.0043 × 10⁻² | 2.94 |

Table 2: Summary of FID and IS scores from three different runs on CIFAR10 (32x32) and CelebA (64x64).

| Method | CIFAR10 FID (↓) | CIFAR10 IS (↑) | CelebA FID (↓) |
|---|---|---|---|
| SW | 14.21 ± 1.12 | 8.19 ± 0.07 | 8.93 ± 0.23 |
| Max-K-SW | 14.38 ± 0.08 | 8.15 ± 0.02 | 8.94 ± 0.35 |
| K-SW | 15.24 ± 0.02 | 8.15 ± 0.03 | 9.41 ± 0.16 |
| oMSW (ours) | 14.12 ± 0.54 | 8.20 ± 0.05 | 9.68 ± 0.55 |
| iMSW (ours) | 14.12 ± 0.48 | 8.24 ± 0.09 | 8.89 ± 0.23 |
| viMSW (ours) | 13.98 ± 0.59 | 8.12 ± 0.20 | 8.91 ± 0.11 |

We observe that iMSW gives better flows than SW, Max-SW, K-SW, and Max-K-SW; namely, the empirical distribution $\mu(t)$ at $t = 300$ obtained with iMSW is closer to $\nu$ in terms of the Wasserstein distance. More importantly, iMSW consumes less computation than its competitors since it can use a smaller number of projections thanks to more informative projecting directions. Furthermore, viMSW gives better final results than iMSW; however, the trade-off is roughly doubling the computational time due to the sampling step of the vMF distribution. In this setting, K-SW is equivalent to our oMSW with T = K = 2 since the dimension is d = 2. We refer the reader to Appendix C.1 for more discussion.

Studies on hyperparameters. From Table 3 in Appendix C.1, increasing the number of projections $L$ yields better performance for SW, K-SW, and iMSW. Similarly, increasing the number of time steps $T$ also helps Max-SW and iMSW. Moreover, we find that for the same total number of projections, e.g., $L = 5, T = 2$ versus $L = 2, T = 5$, a larger number of time steps $T$ might lead to a better result for iMSW. For burning and thinning, we see that they help to reduce the computation while the performance stays comparable, or even improves when the values of $M$ and $N$ are chosen well. Also, iMSW with the number of burned steps $M = T - 1$ is still better than Max-SW with $T$ time steps. For the concentration parameter $\kappa$ in viMSW, the performance of viMSW is not monotonic in $\kappa$.

Color transfer. We aim to transfer the color palette (RGB) of a source image to the color palette (RGB) of a target image. Therefore, it is natural to build a gradient flow that starts from the empirical distribution over the color palette of the source image and moves towards the empirical distribution over the color palette of the target image. Since the color palette values lie in the set $\{0, \ldots, 255\}^3$, we round the values of the supports of the empirical distribution at the final step of the Euler scheme, which uses 2000 steps and step size $10^{-3}$. Greater detail can be found in Appendix C.2. For Max-SW, Max-K-SW, iMSW, and viMSW, we use the learning rate $\eta = 0.1$ for the projecting directions.
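For illustration, a short hedged sketch of the color-transfer flow built on the `sw_gradient_flow` sketch above; the rounding back to $\{0, \ldots, 255\}^3$ at the final step follows the description above, while the function name and tensor conventions are ours.

```python
def color_transfer(source_rgb, target_rgb, steps=2000, step_size=1e-3, L=30):
    # source_rgb, target_rgb: (n, 3) torch tensors of RGB palette values.
    X = sw_gradient_flow(source_rgb.float(), target_rgb.float(),
                         steps=steps, step_size=step_size, L=L)
    # Round the final supports back to valid color values in {0, ..., 255}^3.
    return X.round().clamp(0, 255)
```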
We show the transferred images, the corresponding Wasserstein-2 distances between the empirical distribution over the transferred color palette and the empirical distribution over the target color palette, and the corresponding computational time in Figure 2. From the figure, iMSW and viMSW give the best transferred images, both quantitatively and qualitatively. Moreover, oMSW is comparable to SW, Max-SW, and K-SW, and is better than Max-K-SW, while consuming much less computation. We refer the reader to Figure 5 in Appendix C.2 for the color palette visualization and to Figure 6 for another choice of source and target images. We also conduct studies on hyperparameters in Appendix C.2, where we observe some phenomena similar to those in the gradient flow experiments.

4.2 Deep Generative Models

We follow the setup of sliced Wasserstein deep generative models in [15]. The full settings of the framework, including neural network architectures, training procedure, and hyperparameters, are given in Appendix C.3. We compare MSW with previous baselines, including SW, Max-SW, K-SW, and Max-K-SW, on benchmark datasets: CIFAR10 (image size 32x32) [29] and CelebA (image size 64x64) [36]. The evaluation metrics are the FID score [21] and the Inception score (IS) [51] (except on CelebA, since the IS score poorly captures the perceptual quality of face images [21]). A notable change in computing Max-SW is that we do not use momentum in the optimization for the max projecting direction, unlike previous works [26, 42], which leads to a better result.

Summary of generative performance. We train generative models with SW ($L \in \{100, 1000, 10000\}$), K-SW ($L \in \{1, 10, 100\}$, $K = 10$), Max-K-SW ($K \in \{1, 10\}$, $T \in \{1, 10, 100, 1000\}$, $\eta \in \{0.01, 0.1\}$; Max-K-SW with $K = 1$ is Max-SW), MSW (all variants, $L \in \{10, 100\}$, $T \in \{10, 100\}$), iMSW and viMSW ($\eta \in \{0.01, 0.1\}$), and viMSW ($\kappa \in \{10, 50\}$). We report the best FID score and the best IS score for each distance in Table 2. In addition, we show how the scores change with respect to the training epochs in Figure 3.

Figure 3: The FID scores and the IS scores over training epochs for SW, Max-SW, K-SW, Max-K-SW, oMSW, iMSW, and viMSW, and some generated images from CelebA (panels: SW, Max-K-SW, iMSW).

Overall, we observe that viMSW and iMSW give the best generative performance in terms of final scores and fast convergence on CIFAR10 and CelebA. The oMSW gives results comparable to the baselines. Since most of the computation in training deep generative models is spent updating the neural networks, the computational time for the different distances is almost the same. Furthermore, we show some generated images on CelebA in Figure 3 and all generated images on CIFAR10 and CelebA in Figure 7 and Figure 8 in Appendix C.3. We visually observe that the qualitative results are consistent with the quantitative results in Table 2.

Studies on hyperparameters. We conduct experiments to understand the behavior of the burning and thinning technique and to compare the roles of $L$ and $T$ in Table 5 in Appendix C.3. Overall, burning (or thinning) sometimes helps to improve the performance of training generative models; there is no clear sign of superiority between burning and thinning. We compare two settings with the same total number of projections (same complexities): $L = 10, T = 100$ and $L = 100, T = 10$. On CIFAR10, the first setting is better, while the reverse holds on CelebA.
5 Conclusion

We have introduced the Markovian sliced Wasserstein (MSW) distance, a novel family of sliced Wasserstein (SW) distances that imposes a first-order Markov structure on projecting directions. We have investigated the theoretical properties of MSW, including topological, statistical, and computational properties. Moreover, we have discussed two types of transition distributions for MSW, namely orthogonal-based and input-awared transitions, which yield three variants of MSW. In addition, we have proposed a burning and thinning technique to improve the computational time and memory of MSW. Finally, we have compared MSW to previous variants of SW in gradient flows, color transfer, and generative modeling to show that MSW distances are both effective and efficient.

Acknowledgements

NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning.

References

[1] J. Altschuler, J. Niles-Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
[2] Y. Bai, B. Schmitzer, M. Thorpe, and S. Kolouri. Sliced optimal partial transport. arXiv preprint arXiv:2212.08049, 2022.
[3] V. I. Bogachev and M. A. S. Ruas. Measure theory, volume 1. Springer, 2007.
[4] C. Bonet, P. Berg, N. Courty, F. Septier, L. Drumetz, and M.-T. Pham. Spherical sliced-Wasserstein. arXiv preprint arXiv:2206.08780, 2022.
[5] C. Bonet, N. Courty, F. Septier, and L. Drumetz. Efficient gradient flows in sliced-Wasserstein space. Transactions on Machine Learning Research, 2022.
[6] N. Bonneel and D. Coeurjolly. SPOT: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13, 2019.
[7] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
[8] N. Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris 11, 2013.
[9] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[10] X. Chen, Y. Yang, and Y. Li. Augmented sliced Wasserstein distances. International Conference on Learning Representations, 2022.
[11] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[12] B. Dai and U. Seljak. Sliced iterative normalizing flows. In International Conference on Machine Learning, pages 2352–2364. PMLR, 2021.
[13] T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak. Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial Intelligence (AUAI), 2018.
[14] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10648–10656, 2019.
[15] I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
[16] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
[17] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A.
Trouve, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681 2690, 2019. [18] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. [19] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162:707 738, 2015. [20] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608 1617. PMLR, 2018. [21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626 6637, 2017. [22] M. Huang, S. Ma, and L. Lai. A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning, pages 4446 4455. PMLR, 2021. [23] P. E. Jupp and K. V. Mardia. Maximum likelihood estimators for the matrix von Mises-Fisher and Bingham distributions. The Annals of Statistics, 7(3):599 606, 1979. [24] O. Kallenberg and O. Kallenberg. Foundations of modern probability, volume 2. Springer, 1997. [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. [26] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, pages 261 272, 2019. [27] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. [28] S. Kolouri, G. K. Rohde, and H. Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3427 3436, 2018. [29] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Master s thesis, Department of Computer Science, University of Toronto, 2009. [30] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10285 10295, 2019. [31] J. Lezama, W. Chen, and Q. Qiu. Run-sort-rerun: Escaping batch size limitations in sliced Wasserstein generative models. In International Conference on Machine Learning, pages 6275 6285. PMLR, 2021. [32] T. Lin, C. Fan, N. Ho, M. Cuturi, and M. Jordan. Projection robust Wasserstein distance and Riemannian optimization. Advances in Neural Information Processing Systems, 33:9383 9397, 2020. [33] T. Lin, N. Ho, X. Chen, M. Cuturi, and M. I. Jordan. Fixed-support Wasserstein barycenters: Computational hardness and fast algorithm. In Neur IPS, pages 5368 5380, 2020. [34] T. Lin, N. Ho, and M. Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pages 3982 3991, 2019. [35] T. Lin, N. Ho, and M. I. Jordan. 
On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research (JMLR), 23:1 42, 2022. [36] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. [37] A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, and F.-R. Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pages 4104 4113. PMLR, 2019. [38] N. Naderializadeh, J. Comer, R. Andrews, H. Hoffmann, and S. Kolouri. Pooling by sliced Wasserstein embedding. Advances in Neural Information Processing Systems, 34, 2021. [39] K. Nadjahi, V. De Bortoli, A. Durmus, R. Badeau, and U. Sim sekli. Approximate Bayesian computation with the sliced-Wasserstein distance. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5470 5474. IEEE, 2020. [40] K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020. [41] K. Nadjahi, A. Durmus, U. Simsekli, and R. Badeau. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems, pages 250 260, 2019. [42] K. Nguyen and N. Ho. Amortized projection optimization for sliced Wasserstein generative models. Advances in Neural Information Processing Systems, 2022. [43] K. Nguyen and N. Ho. Revisiting sliced Wasserstein on images: From vectorization to convolution. Advances in Neural Information Processing Systems, 2022. [44] K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021. [45] K. Nguyen, S. Nguyen, N. Ho, T. Pham, and H. Bui. Improving relational regularized autoencoders with spherical sliced fused Gromov-Wasserstein. In International Conference on Learning Representations, 2021. [46] S. Nietert, R. Sadhu, Z. Goldfeld, and K. Kato. Statistical, robustness, and computational guarantees for sliced wasserstein distances. Advances in Neural Information Processing Systems, 2022. [47] F.-P. Paty and M. Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, pages 5072 5081, 2019. [48] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. [49] G. Peyré and M. Cuturi. Computational optimal transport, 2020. [50] M. Rowland, J. Hron, Y. Tang, K. Choromanski, T. Sarlos, and A. Weller. Orthogonal estimation of Wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 186 195. PMLR, 2019. [51] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016. [52] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018. [53] F. Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. [54] M. Sommerfeld and A. Munk. Inference for empirical wasserstein distances on finite spaces. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):219 238, 2018. [55] S. Sra. Directional statistics in machine learning: a brief review. ar Xiv preprint ar Xiv:1605.00316, 2016. [56] N. M. Temme. Special functions: An introduction to the classical functions of mathematical physics. John Wiley & Sons, 2011. [57] C. Villani. Optimal transport: Old and New. Springer, 2008. [58] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. [59] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019. [60] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool. Sliced Wasserstein generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3713 3722, 2019. [61] M. Yi and S. Liu. Sliced Wasserstein variational inference. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021. Supplement to Markovian Sliced Wasserstein Distances: Beyond Independent Projections" In this supplementary material, we present additional materials in Appendix A. In particular, we provide additional background on sliced Wasserstein variants in Appendix A.1, background on von Mises-Fisher distribution in Appendix A.2, algorithms for computing Markovian sliced Wasserstein distances in Appendix A.3, additional information about burned thinned MSW in Appendix A.4, and discussion on related works in Appendix A.5. We then provide skipped proofs in the main paper in Appendix B. Additional experiments are presented in Appendix C. A Additional Materials A.1 Background on Sliced Wasserstein Variants We review computational aspects of sliced Wasserstein variants. Computation of Max sliced Wasserstein distance. We demonstrate the empirical estimation of Max-SW via projected sub-gradient ascent algorithm in Algorithm 1. The initialization step for ˆθ0 is rarely discussed in previous works. Normally, ˆθ0 is randomly initialized by drawing from the uniform distribution over the unit-hypersphere. Many previous works [26, 44, 45, 42] use Adam update instead of the standard gradient ascent update for Max-SW. In this work, we find out that using the standard gradient ascent update is more stable and effective. Algorithm 1 Max sliced Wasserstein distance Input. Probability measures µ, ν, learning rate η, the order p, and the number of iterations T. Initialize ˆθ0. for t = 1 to T 1 do ˆθt = ˆθt 1 + η ˆθt 1Wp(ˆθt 1 µ, ˆθt 1 ν) ˆθt = ˆθt ||ˆθt||2 end for Return. Wp(ˆθT µ, ˆθT ν) K sliced Wasserstein distance. We first review the Gram Schmidt process in Algorithm 2. With the Gram Schmidt process, the sampling from U(VK(Rd)) can be done by sampling θ1, . . . , θk i.i.d from N(0, Id) then applying the Gram-Schmidt process on them. Therefore, we present the computation of K sliced Wasserstein distance in Algorithm 3. We would like to recall that the original work of K-SW [50] uses only one set of orthogonal projecting directions. Here, we generalize the original work by using L sets of orthogonal projecting directions. Algorithm 2 Gram Schmidt process Input. K vectors θ1, . . . , θK θ1 = θ1 ||θ1||2 for k = 2 to K do for i = 1 to k 1 do θk = θk θi,θk θi,θi θi end for θk = θk ||θk||2 end for Return. θ1, . . . , θK Max K sliced Wasserstein distance. We now present the empirical estimation of Max-K-SW via projected sub-gradient ascent algorithm in Algorithm 4. This algorithm is first discussed in the original paper of Max-K-SW [12]. 
The optimization of Max-K-SW can be solved by using Riemannian optimization since the Stiefel manifold is a Riemannian manifold. However, to the best of our knowledge, Riemannian optimization has not been applied to Max-K-SW. Algorithm 3 K sliced Wasserstein distance Input. Probability measures µ, ν, the dimension d, the order p, the number of projections L, the number of orthogonal projections K. for l = 1 to L do Draw θl1, . . . , θl K i.i.d from N(0, Id). θl1, . . . , θl K = Gram Schmidt(θl1, . . . , θl K) end for Return. 1 LK PL l=1 PK k=1 Wp p(θlk µ, θlk ν) 1 Algorithm 4 Max-K sliced Wasserstein distance Input. Probability measures µ, ν, learning rate η, the dimension d, the order p, the number of iterations T > 1, and the number of orthogonal projections K > 1. Initialize ˆθ01, . . . , ˆθ0K to be orthogonal. for t = 1 to T 1 do for k = 1 to K do ˆθtk = θtk + η ˆθt 1k Wp(ˆθt 1k µ, ˆθt 1k ν) end for ˆθt1, . . . , ˆθt K = Gram-Schmidt(ˆθt1, . . . , ˆθt K) end for Return. 1 K PK k=1 Wp p(ˆθT k µ, ˆθT k ν) 1 A.2 Von Mises-Fisher Distribution We first start with the definition of von Mises-Fisher (v MF) distribution. Algorithm 5 Sampling from v MF distribution Input. location ϵ, concentration κ, dimension d, unit vector e1 = (1, 0, .., 0) Draw v U(Sd 2) d 1 , a (d 1)+2κ+ 4 , m 4ab (1+b) (d 1) log(d 1) repeat Draw ψ Beta 1 ω h(ψ, κ) = 1 (1+b)ψ 1 (1 b)ψ t 2ab 1 (1 b)ψ Draw u U([0, 1]) until (d 1) log(t) t + m log(u) h1 (ω, 1 ω2v ) ϵ e1 ϵ u = ϵ ||ϵ ||2 U = I 2uu Output. Uh1 Definition 3. The von Mises Fisher distribution (v MF)[23] is a probability distribution on the unit hypersphere Sd 1 with the density function be: f(x|ϵ, κ) := Cd(κ) exp(κϵ x), (9) where ϵ Sd 1 is the location vector, κ 0 is the concentration parameter, and Cd(κ) := κd/2 1 (2π)d/2Id/2 1(κ) is the normalization constant. Here, Iv is the modified Bessel function of the first kind at order v [56]. The v MF distribution is a continuous distribution, its mass concentrates around the mean ϵ, and its density decrease when x goes away from ϵ. When κ 0, v MF converges in distribution to U(Sd 1), and when κ , v MF converges in distribution to the Dirac distribution centered at ϵ [55]. Sampling: We review the sampling process in Algorithm 5 [13, 45]. The sampling process of v MF distribution is based on the rejection sampling procedure. It is worth noting that the sampling algorithm is doing reparameterization implicitly. However, we only use the algorithm to obtain random samples without estimating stochastic gradients. A.3 Algorithms for Computing Markovian Sliced Wasserstein Distances We first start with the general computation of MSW in Algorithm 6. For the orthogonal-based transition in o MSW, we use θlt U(Sd 1 θlt 1) by first sampling θ lt U(Sd 1) then set θlt = θlt θ lt,θlt θ lt,θ lt θ lt then normalize θlt = θlt ||θlt||2 . For deterministic input-awared transition, i MSW, we set θlt = θlt 1 + η θlt 1Wp(θlt 1 µ, θlt 1 ν) then normalize θlt = θlt ||θlt||2 . For probabilistic input-awared transition, vi MSW, θlt v MF(θt|ϵ = Proj Sd 1θ lt, κ) with θ lt = θlt 1 + η θlt 1Wp(θlt 1 µ, θlt 1 ν). Algorithm 6 Markovian sliced Wasserstein distance Input. Probability measures µ, ν, the dimension d, the order p, the number of projections L, and the number of timesteps T. for l = 1 to L do Draw θl0 σ(θ0) for t = 1 to T 1 do Draw θlt σt(θt|θlt 1) end for end for Return. 1 LT PL l=1 PT t=1 Wp p(θlt µ, θlt ν) 1 A.4 Burned Thinned Markovian Sliced Wasserstein Distance We continue the discussion on burned thinned MSW in Section 3.3. 
We first start with the Monte Carlo estimation of burned thinned MSW. Monte Carlo Estimation: We samples θ11, . . . , θL1 σ1(θ1) for L 1, then we samples θlt σt(θt|θlt 1) for t = 1, . . . , T and l = 1, . . . , L. We then obtain samples θ lt by filtering out t < M and t%N = 0 from the set {θlt} for l = 1, . . . , L and t = 1, . . . , T. The Monte Carlo approximation of the burned-thinned Markovian sliced Wasserstein distance is: \ MSWp,T,N,M(µ, ν) = t=1 W p p (θ lt µ, θ lt ν) Theoretical properties. We first state the following assumption: A2. Given T > M 0, N 1, the prior distribution σ1(θ1) and the transition distribution σt(θt|θt 1) are chosen such that there exists marginals σt(θt) = R t σ(θ1, . . . , θt)dt with t M and t%N = 0, t = {t = 1, . . . , T|t = t}. The assumption A2 can be easily obtained by using v MF transition, e.g., in probabilistic input-awared transition. From this assumption, we can derive theoretical properties of burned-thinned MSW including topological properties and statistical complexity. Proposition 4. For any p 1, T 1, M 0, N 1, and dimension d 1, if A2 holds, the burned thinned Markovian sliced Wasserstein distance MSWp,T,N,M( , ) is a valid metric on the space of probability measures Pp(Rd), namely, it satisfies the (i) non-negativity, (ii) symmetry, (iii) triangle inequality, and (iv) identity. The proof of Proposition 4 follows directly the proof of Theorem 1 in Appendix B.1. Proposition 5 (Weak Convergence). For any p 1, T 1, M 0, N 1, and dimension d 1, if A2 holds, the convergence of probability measures in Pp(Rd) under the burned thinned Markovian sliced Wasserstein distance MSWp,T,N,M( , ) implies weak convergence of probability measures and vice versa. The proof of Proposition 5 follows directly the proof of Theorem 2 in Appendix B.2. Proposition 6. For any p 1 and dimension d 1, for any T 1, M 0, N 1 and µ, ν Pp(Rd), MSWp,T,N,M(µ, ν) Max-SWp(µ, ν) Wp(µ, ν). The proof of Proposition 6 follows directly the proof of Proposition 1 in Appendix B.3. Proposition 7 (Sample Complexity). Let X1, X2, . . . , Xn be i.i.d. samples from the probability measure µ being supported on compact set of Rd. We denote the empirical measure µn = 1 n Pn i=1 δXi. Then, for any p 1 and T 1, M 0, N 1, there exists a universal constant C > 0 such that E[MSWp,T,N,M(µn, µ)] C p (d + 1) log n/n, where the outer expectation is taken with respect to the data X1, X2, . . . , Xn. The proof of Proposition 7 follows directly the proof of Proposition 2 in Appendix B.4. Proposition 8 (Monte Carlo error). For any p 1, T 1, M 0, N 1, dimension d 1, and µ, ν Pp(Rd), we have: E|[ MSW p p,T,N,M(µ, ν) MSWp p,T,N,M(µ, ν)| LT(T M) V ar t=1 W p p (θ t µ, θ t ν) where the variance is with respect to σ(θ 1, . . . , θ (T M)/N). The proof of Proposition 8 follows directly the proof of Proposition 3 in Appendix B.5. A.5 Discussions on Related Works K-SW is autoregressive decomposition. In MSW, we assume that the joint distribution over projecting directions has the first-order Markov structure: σ(θ1, . . . , θT ) = σ1(θ1) QT t=2 σt(θt|θt 1). However, we can consider the full autoregressive decomposition σ(θ1, . . . , θT ) = σ1(θ1) QT t=2 σt(θt|θ1, . . . , θt 1). Let T = K in K-SW, hence the transition distribution that is used in K-SW is: σt(θt|θ1, . . . , θt 1) = Gram-Schmidtθ1,...,θt 1 U(Sd 1), where Gram-Schmidtθ1,...,θt 1(θt) denotes the Gram-Schmidt process update that applies on θt. Generalization of Max-K-SW. 
Generalization of Max-K-SW. Similar to Max-SW, we can derive a Markovian-based K-sliced Wasserstein distance that generalizes the projected gradient ascent update of Max-K-SW. However, such a distance considers transitions on the Stiefel manifold instead of the unit hypersphere; hence, it would be more computationally expensive. Moreover, orthogonality might not be a good constraint. Therefore, the generalization of Max-K-SW might not have many advantages.

Beyond the projected sub-gradient ascent update. In the input-awared transitions for MSW, we utilize the projected sub-gradient update as the transition function to create a new projecting direction. We could also use other optimization techniques, such as momentum and adaptive step sizes, to construct the transition function. We leave the investigation of this direction to future work.

Applications to other sliced Wasserstein variants. The Markovian approach can be applied to other variants of the sliced Wasserstein distance, e.g., generalized sliced Wasserstein [26], augmented sliced Wasserstein [10], projected robust Wasserstein (PRW) [47, 32, 22] ($k > 1$ dimensional projections), convolution sliced Wasserstein [43], sliced partial optimal transport [6, 2], and so on.

Markovian sliced Wasserstein distances in other applications. We can apply MSW to the setting in [31], which is an implementation technique that utilizes both RAM and GPU memory for training sliced Wasserstein generative models. MSW can also replace the sliced Wasserstein distance in pooling [38]. Similarly, MSW can be used in applications where the sliced Wasserstein distance appears, e.g., clustering [28], Bayesian inference [39, 61], domain adaptation [60], and so on.

B.1 Proof of Theorem 1

(i), (ii): The MSW is an expectation of one-dimensional Wasserstein distances; hence, the non-negativity and symmetry of MSW follow directly from the non-negativity and symmetry of the Wasserstein distance.

(iii) From the definition of MSW in Definition 1, given three probability measures $\mu_1, \mu_2, \mu_3 \in \mathcal{P}_p(\mathbb{R}^d)$, we have:
\[
\begin{aligned}
\text{MSW}_{p,T}(\mu_1, \mu_3) &= \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu_1, \theta_t\sharp\mu_3)\Big]\Big)^{1/p} \\
&\leq \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} \big(W_p(\theta_t\sharp\mu_1, \theta_t\sharp\mu_2) + W_p(\theta_t\sharp\mu_2, \theta_t\sharp\mu_3)\big)^p\Big]\Big)^{1/p} \\
&\leq \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu_1, \theta_t\sharp\mu_2)\Big]\Big)^{1/p} + \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu_2, \theta_t\sharp\mu_3)\Big]\Big)^{1/p} \\
&= \text{MSW}_{p,T}(\mu_1, \mu_2) + \text{MSW}_{p,T}(\mu_2, \mu_3),
\end{aligned}
\]
where the first inequality is due to the triangle inequality of the Wasserstein distance and the second inequality is due to the Minkowski inequality. This completes the proof of the triangle inequality.

(iv) We need to show that $\text{MSW}_{p,T}(\mu, \nu) = 0$ if and only if $\mu = \nu$. First, from the definition of MSW, $\mu = \nu$ directly implies $\text{MSW}_{p,T}(\mu, \nu) = 0$. For the reverse direction, we use the same proof technique as in [8]. If $\text{MSW}_{p,T}(\mu, \nu) = 0$, we have $\int_{(\mathbb{S}^{d-1})^T} \frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu, \theta_t\sharp\nu)\, d\sigma(\theta_{1:T}) = 0$. If A1 holds, namely, the prior distribution $\sigma_1(\theta_1)$ is supported on the whole unit hypersphere or there exists a transition distribution $\sigma_t(\theta_t \,|\, \theta_{t-1})$ that is supported on the whole unit hypersphere, we have $W_p(\theta\sharp\mu, \theta\sharp\nu) = 0$ for $\bar{\sigma}$-a.e. $\theta \in \mathbb{S}^{d-1}$, where $\bar{\sigma}$ denotes the prior or transition distribution that satisfies assumption A1. From the identity property of the Wasserstein distance, we obtain $\theta\sharp\mu = \theta\sharp\nu$ for $\bar{\sigma}$-a.e. $\theta \in \mathbb{S}^{d-1}$. Therefore, for any $t \in \mathbb{R}$ and $\bar{\sigma}$-a.e. $\theta \in \mathbb{S}^{d-1}$ (and hence, by continuity of characteristic functions and the full support of $\bar{\sigma}$, for all $\theta \in \mathbb{S}^{d-1}$), we have:
\[
\mathcal{F}[\mu](t\theta) = \int_{\mathbb{R}^d} e^{-it\langle\theta, x\rangle}\, d\mu(x) = \int_{\mathbb{R}} e^{-itz}\, d(\theta\sharp\mu)(z) = \mathcal{F}[\theta\sharp\mu](t) = \mathcal{F}[\theta\sharp\nu](t) = \int_{\mathbb{R}} e^{-itz}\, d(\theta\sharp\nu)(z) = \int_{\mathbb{R}^d} e^{-it\langle\theta, x\rangle}\, d\nu(x) = \mathcal{F}[\nu](t\theta),
\]
where $\mathcal{F}[\gamma](w) = \int_{\mathbb{R}^d} e^{-i\langle w, x\rangle}\, d\gamma(x)$ denotes the Fourier transform of $\gamma \in \mathcal{P}(\mathbb{R}^d)$. By the injectivity of the Fourier transform, we obtain $\mu = \nu$, which concludes the proof.
B.2 Proof of Theorem 2

Our goal is to show that for any sequence of probability measures $(\mu_k)_{k\in\mathbb{N}}$ and $\mu$ in $\mathcal{P}_p(\mathbb{R}^d)$, $\lim_{k\to+\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$ if and only if for any continuous and bounded function $f: \mathbb{R}^d \to \mathbb{R}$, $\lim_{k\to+\infty}\int f\, d\mu_k = \int f\, d\mu$. The proof follows the techniques in [41]. We first state the following lemma.

Lemma 1. For any $T \geq 1$ and dimension $d \geq 1$, if A1 holds and a sequence of probability measures $(\mu_k)_{k\in\mathbb{N}}$ satisfies $\lim_{k\to+\infty}\text{MSW}_{1,T}(\mu_k, \mu) = 0$ with $\mu$ in $\mathcal{P}(\mathbb{R}^d)$, then there exists an increasing function $\phi: \mathbb{N}\to\mathbb{N}$ such that the subsequence $(\mu_{\phi(k)})_{k\in\mathbb{N}}$ converges weakly to $\mu$.

Proof. We are given that $\lim_{k\to+\infty}\text{MSW}_{1,T}(\mu_k, \mu) = 0$; therefore,
\[
\lim_{k\to\infty}\int_{(\mathbb{S}^{d-1})^T} \frac{1}{T}\sum_{t=1}^{T} W_1(\theta_t\sharp\mu_k, \theta_t\sharp\mu)\, d\sigma(\theta_{1:T}) = 0.
\]
If A1 holds, namely, the prior distribution $\sigma_1(\theta_1)$ is supported on the whole unit hypersphere or there exists a transition distribution $\sigma_t(\theta_t \,|\, \theta_{t-1})$ that is supported on the whole unit hypersphere, we have
\[
\lim_{k\to\infty}\int_{\mathbb{S}^{d-1}} W_1(\theta\sharp\mu_k, \theta\sharp\mu)\, d\bar{\sigma}(\theta) = 0,
\]
where $\bar{\sigma}$ denotes the prior or transition distribution that satisfies assumption A1. From Theorem 2.2.5 in [3], there exists an increasing function $\phi: \mathbb{N}\to\mathbb{N}$ such that $\lim_{k\to\infty} W_1(\theta\sharp\mu_{\phi(k)}, \theta\sharp\mu) = 0$ for $\bar{\sigma}$-a.e. $\theta\in\mathbb{S}^{d-1}$. Since convergence in Wasserstein distance implies weak convergence [58], $(\theta\sharp\mu_{\phi(k)})_{k\in\mathbb{N}}$ converges weakly to $\theta\sharp\mu$ for $\bar{\sigma}$-a.e. $\theta\in\mathbb{S}^{d-1}$.

Let $\Phi_{\mu}(v) = \int_{\mathbb{R}^d} e^{i\langle v, w\rangle}\, d\mu(w)$ be the characteristic function of $\mu\in\mathcal{P}(\mathbb{R}^d)$. Weak convergence implies convergence of characteristic functions (Theorem 4.3 in [24]): $\lim_{k\to\infty}\Phi_{\theta\sharp\mu_{\phi(k)}}(s) = \Phi_{\theta\sharp\mu}(s)$ for all $s\in\mathbb{R}$, for $\bar{\sigma}$-a.e. $\theta\in\mathbb{S}^{d-1}$. Therefore, $\lim_{k\to\infty}\Phi_{\mu_{\phi(k)}}(z) = \Phi_{\mu}(z)$ for almost every $z\in\mathbb{R}^d$.

For any $\gamma > 0$ and a continuous function $f: \mathbb{R}^d\to\mathbb{R}$ with compact support, we denote $f_\gamma(x) = f * g_\gamma(x) = (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d} f(x-z)\exp\big(-\|z\|^2/(2\gamma^2)\big)\, dz$, where $g_\gamma$ is the density function of $\mathcal{N}(0, \gamma^2 I_d)$. We have:
\[
\begin{aligned}
\int_{\mathbb{R}^d} f_\gamma(z)\, d\mu_{\phi(k)}(z) &= \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(w)\, g_\gamma(z-w)\, dw\, d\mu_{\phi(k)}(z) \\
&= \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(w)\, (2\pi\gamma^2)^{-d/2}\exp\big(-\|z-w\|^2/(2\gamma^2)\big)\, dw\, d\mu_{\phi(k)}(z) \\
&= (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(w)\, e^{i\langle z-w, x\rangle}\, g_{1/\gamma}(x)\, dx\, dw\, d\mu_{\phi(k)}(z) \\
&= (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(w)\, e^{-i\langle w, x\rangle}\, g_{1/\gamma}(x)\Big(\int_{\mathbb{R}^d} e^{i\langle z, x\rangle}\, d\mu_{\phi(k)}(z)\Big)\, dx\, dw \\
&= (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(w)\, e^{-i\langle w, x\rangle}\, g_{1/\gamma}(x)\,\Phi_{\mu_{\phi(k)}}(x)\, dx\, dw \\
&= (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d} \mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu_{\phi(k)}}(x)\, dx,
\end{aligned}
\]
where the third equality is due to the fact that $\int_{\mathbb{R}^d} e^{i\langle z-w, x\rangle} g_{1/\gamma}(x)\, dx = \exp\big(-\|z-w\|^2/(2\gamma^2)\big)$, and $\mathcal{F}[f](w) = \int_{\mathbb{R}^d} f(x)e^{-i\langle w, x\rangle}\, dx$ denotes the Fourier transform of the bounded function $f$. Similarly, we have
\[
\int_{\mathbb{R}^d} f_\gamma(z)\, d\mu(z) = (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d} \mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu}(x)\, dx.
\]
Since $f$ is assumed to have compact support, $\mathcal{F}[f]$ exists and is bounded by $\int_{\mathbb{R}^d}|f(w)|\, dw < +\infty$. Hence, for any $k\in\mathbb{N}$ and $x\in\mathbb{R}^d$, we have $|\mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu_{\phi(k)}}(x)| \leq g_{1/\gamma}(x)\int_{\mathbb{R}^d}|f(w)|\, dw$ and $|\mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu}(x)| \leq g_{1/\gamma}(x)\int_{\mathbb{R}^d}|f(w)|\, dw$. Using the previously established convergence $\lim_{k\to\infty}\Phi_{\mu_{\phi(k)}}(z) = \Phi_{\mu}(z)$ and Lebesgue's dominated convergence theorem, we obtain
\[
\lim_{k\to\infty}\int_{\mathbb{R}^d} f_\gamma(z)\, d\mu_{\phi(k)}(z) = \lim_{k\to\infty}(2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d}\mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu_{\phi(k)}}(x)\, dx = (2\pi\gamma^2)^{-d/2}\int_{\mathbb{R}^d}\mathcal{F}[f](x)\, g_{1/\gamma}(x)\,\Phi_{\mu}(x)\, dx = \int_{\mathbb{R}^d} f_\gamma(z)\, d\mu(z).
\]
Moreover, we have:
\[
\lim_{\gamma\to 0}\limsup_{k\to+\infty}\Big|\int_{\mathbb{R}^d} f(z)\, d\mu_{\phi(k)}(z) - \int_{\mathbb{R}^d} f(z)\, d\mu(z)\Big| \leq \lim_{\gamma\to 0}\limsup_{k\to+\infty}\Big(2\sup_{z\in\mathbb{R}^d}|f(z) - f_\gamma(z)| + \Big|\int_{\mathbb{R}^d} f_\gamma(z)\, d\mu_{\phi(k)}(z) - \int_{\mathbb{R}^d} f_\gamma(z)\, d\mu(z)\Big|\Big) = \lim_{\gamma\to 0} 2\sup_{z\in\mathbb{R}^d}|f(z) - f_\gamma(z)| = 0,
\]
which implies that $(\mu_{\phi(k)})_{k\in\mathbb{N}}$ converges weakly to $\mu$.

We now continue the proof of Theorem 2. We first show that if $\lim_{k\to\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$, then $(\mu_k)_{k\in\mathbb{N}}$ converges weakly to $\mu$. Suppose, for the sake of contradiction, that $\lim_{k\to\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$ but $(\mu_k)_{k\in\mathbb{N}}$ does not converge weakly to $\mu$.
Then, denoting by $d_{\mathcal{P}}$ the Lévy-Prokhorov metric, $d_{\mathcal{P}}(\mu_k, \mu)$ does not converge to 0, which implies that there exist $\varepsilon > 0$ and a subsequence $(\mu_{\psi(k)})_{k\in\mathbb{N}}$, with an increasing function $\psi:\mathbb{N}\to\mathbb{N}$, such that $d_{\mathcal{P}}(\mu_{\psi(k)}, \mu) \geq \varepsilon$ for any $k\in\mathbb{N}$. However, we have
\[
\text{MSW}_{p,T}(\mu, \nu) = \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big]\Big)^{1/p} \geq \mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big] \geq \mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_1(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big] = \text{MSW}_{1,T}(\mu, \nu),
\]
by the Hölder inequality, for $\mu, \nu\in\mathcal{P}_p(\mathbb{R}^d)$. Therefore, $\lim_{k\to\infty}\text{MSW}_{1,T}(\mu_{\psi(k)}, \mu) = 0$, which implies by Lemma 1 that there exists a subsequence $(\mu_{\phi(\psi(k))})_{k\in\mathbb{N}}$, with an increasing function $\phi:\mathbb{N}\to\mathbb{N}$, that converges weakly to $\mu$. Hence, $\lim_{k\to\infty} d_{\mathcal{P}}(\mu_{\phi(\psi(k))}, \mu) = 0$, which contradicts our assumption. We conclude that if $\lim_{k\to\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$, then $(\mu_k)_{k\in\mathbb{N}}$ converges weakly to $\mu$.

Now, we show that if $(\mu_k)_{k\in\mathbb{N}}$ converges weakly to $\mu$, then $\lim_{k\to\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$. By the continuous mapping theorem, $(\theta\sharp\mu_k)_{k\in\mathbb{N}}$ converges weakly to $\theta\sharp\mu$ for any $\theta\in\mathbb{S}^{d-1}$. Since weak convergence implies convergence under the Wasserstein distance [58], we obtain $\lim_{k\to\infty} W_p(\theta\sharp\mu_k, \theta\sharp\mu) = 0$. Moreover, the Wasserstein distance is also bounded; hence, by the bounded convergence theorem:
\[
\lim_{k\to\infty}\text{MSW}^p_{p,T}(\mu_k, \mu) = \lim_{k\to\infty}\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu_k, \theta_t\sharp\mu)\Big] = \mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} \lim_{k\to\infty} W_p^p(\theta_t\sharp\mu_k, \theta_t\sharp\mu)\Big] = 0.
\]
By the continuous mapping theorem with the function $x\mapsto x^{1/p}$, we obtain $\lim_{k\to\infty}\text{MSW}_{p,T}(\mu_k, \mu) = 0$, which completes the proof.

B.3 Proof of Proposition 1

(i) We recall the definition of Max-SW: $\text{Max-SW}_p(\mu, \nu) = \max_{\theta\in\mathbb{S}^{d-1}} W_p(\theta\sharp\mu, \theta\sharp\nu)$. Let $\theta^\star = \text{argmax}_{\theta\in\mathbb{S}^{d-1}} W_p(\theta\sharp\mu, \theta\sharp\nu)$. From Definition 1, for any $p\geq 1$, $T\geq 1$, dimension $d\geq 1$, and $\mu, \nu\in\mathcal{P}_p(\mathbb{R}^d)$, we have:
\[
\text{MSW}_{p,T}(\mu, \nu) = \Big(\mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big]\Big)^{1/p} \leq \Big(\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta^\star\sharp\mu, \theta^\star\sharp\nu)\Big)^{1/p} = W_p(\theta^\star\sharp\mu, \theta^\star\sharp\nu) = \text{Max-SW}_p(\mu, \nu).
\]
Furthermore, by applying the Cauchy-Schwarz inequality, we have:
\[
\text{Max-SW}_p^p(\mu, \nu) = \max_{\theta\in\mathbb{S}^{d-1}}\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^d\times\mathbb{R}^d} |\theta^\top x - \theta^\top y|^p\, d\pi(x, y) \leq \max_{\theta\in\mathbb{S}^{d-1}}\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^d\times\mathbb{R}^d} \|\theta\|^p\,\|x - y\|^p\, d\pi(x, y) = \inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^d\times\mathbb{R}^d} \|x - y\|^p\, d\pi(x, y) = W_p^p(\mu, \nu),
\]
where the last equality before $W_p^p(\mu,\nu)$ uses $\|\theta\| = 1$, which completes the proof.

(ii) This result can be obtained directly from the definitions of MSW and SW.

B.4 Proof of Proposition 2

In this proof, we denote by $\Theta\subset\mathbb{R}^d$ the compact set supporting the probability measure $\mu$. From Proposition 1, we have $\mathbb{E}[\text{MSW}_{p,T}(\mu_n, \mu)] \leq \mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)]$. Therefore, the proposition follows as long as we can demonstrate that $\mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)] \leq C\sqrt{(d+1)\log_2 n / n}$, where $C > 0$ is some universal constant and the outer expectation is taken with respect to the data. The proof of this result follows from the proof of Proposition 3 in [43]; we provide it here for completeness. Defining $F_{n,\theta}$ and $F_\theta$ as the cumulative distribution functions of $\theta\sharp\mu_n$ and $\theta\sharp\mu$, the closed-form expression of the Wasserstein distance in one dimension leads to the following inequalities:
\[
\text{Max-SW}_p^p(\mu_n, \mu) = \max_{\theta\in\mathbb{S}^{d-1}}\int_0^1 |F_{n,\theta}^{-1}(u) - F_\theta^{-1}(u)|^p\, du = \max_{\theta\in\mathbb{R}^d:\|\theta\|=1}\int_0^1 |F_{n,\theta}^{-1}(u) - F_\theta^{-1}(u)|^p\, du \leq \text{diam}(\Theta)^p \max_{\theta\in\mathbb{R}^d:\|\theta\|\leq 1}\sup_{x\in\mathbb{R}} |F_{n,\theta}(x) - F_\theta(x)|.
\]
We can check that $\max_{\theta\in\mathbb{R}^d:\|\theta\|\leq 1}\sup_{x\in\mathbb{R}} |F_{n,\theta}(x) - F_\theta(x)| = \sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|$, where $\mathcal{B}$ is the set of half-spaces $\{z\in\mathbb{R}^d : \theta^\top z \leq x\}$ for all $\theta\in\mathbb{R}^d$ with $\|\theta\|\leq 1$ and $x\in\mathbb{R}$. From the VC inequality (Theorem 12.5 in [16]), we have
\[
\mathbb{P}\Big(\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)| > t\Big) \leq 8\, S(\mathcal{B}, n)\, e^{-nt^2/32},
\]
where $S(\mathcal{B}, n)$ is the growth function, which is upper bounded by $(n+1)^{VC(\mathcal{B})}$ due to the Sauer lemma (Proposition 4.18 in [59]). From Example 4.21 in [59], we get $VC(\mathcal{B}) = d + 1$. Setting $8\, S(\mathcal{B}, n)\, e^{-nt^2/32} \leq \delta$, we have $t^2 \geq \frac{32}{n}\log\frac{8 S(\mathcal{B}, n)}{\delta}$. Therefore, with probability at least $1-\delta$, we obtain
\[
\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)| \leq \sqrt{\frac{32}{n}\log\frac{8 S(\mathcal{B}, n)}{\delta}}.
\]
Now we convert this probability statement into an inequality in expectation.
Using the Jensen inequality and the tail-sum formula for the expectation of a non-negative random variable, for any $u \geq 0$ we have:
\[
\begin{aligned}
\mathbb{E}\Big[\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|\Big] &\leq \Big(\mathbb{E}\Big[\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|^2\Big]\Big)^{1/2} = \Big(\int_0^\infty \mathbb{P}\Big(\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|^2 > t\Big)\, dt\Big)^{1/2} \\
&\leq \Big(u + \int_u^\infty \mathbb{P}\Big(\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|^2 > t\Big)\, dt\Big)^{1/2} \leq \Big(u + \int_u^\infty 8\, S(\mathcal{B}, n)\, e^{-nt/32}\, dt\Big)^{1/2} = \Big(u + \frac{256\, S(\mathcal{B}, n)}{n}\, e^{-nu/32}\Big)^{1/2}.
\end{aligned}
\]
Let $f(u) = u + \frac{256\, S(\mathcal{B}, n)}{n}\, e^{-nu/32}$; then $f'(u) = 1 - 8\, S(\mathcal{B}, n)\, e^{-nu/32}$, and hence the minimum is attained at $u^\star = \frac{32\log(8 S(\mathcal{B}, n))}{n}$. Since the inequality holds for any $u$, we have:
\[
\mathbb{E}\Big[\sup_{B\in\mathcal{B}} |\mu_n(B) - \mu(B)|\Big] \leq \sqrt{\frac{32\log(8 S(\mathcal{B}, n)) + 32}{n}} \leq C'\sqrt{\frac{(d+1)\log(n+1)}{n}}
\]
by using the Sauer lemma. Putting the above results together leads to $\mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)] \leq C\sqrt{(d+1)\log_2 n / n}$, where $C > 0$ is some universal constant. As a consequence, we obtain the conclusion of the proposition.

B.5 Proof of Proposition 3

For any $p\geq 1$, $T\geq 1$, dimension $d\geq 1$, and $\mu, \nu\in\mathcal{P}_p(\mathbb{R}^d)$, using the Hölder inequality and the fact that the $L$ sampled direction sequences $\theta_{l,1:T}$ are i.i.d. from $\sigma(\theta_{1:T})$, we have:
\[
\begin{aligned}
\mathbb{E}\big|\widehat{\text{MSW}}{}^p_{p,T}(\mu, \nu) - \text{MSW}^p_{p,T}(\mu, \nu)\big| &\leq \Big(\mathbb{E}\big|\widehat{\text{MSW}}{}^p_{p,T}(\mu, \nu) - \text{MSW}^p_{p,T}(\mu, \nu)\big|^2\Big)^{1/2} \\
&= \Big(\mathbb{E}\Big|\frac{1}{L}\sum_{l=1}^{L}\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_{lt}\sharp\mu, \theta_{lt}\sharp\nu) - \mathbb{E}_{\theta_{1:T}\sim\sigma(\theta_{1:T})}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big]\Big|^2\Big)^{1/2} \\
&= \frac{1}{\sqrt{L}}\,\text{Var}\Big[\frac{1}{T}\sum_{t=1}^{T} W_p^p(\theta_t\sharp\mu, \theta_t\sharp\nu)\Big]^{1/2},
\end{aligned}
\]
which completes the proof.

Algorithm 7 Gradient flow with the Euler scheme
Input: the start distribution $\mu = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$, the target distribution $\nu = \frac{1}{n}\sum_{i=1}^{n}\delta_{Y_i}$, the number of Euler iterations $T$ (abuse of notation), the Euler step size $\eta$ (abuse of notation), a metric $D$.
for $t = 1$ to $T$ do
  $X = X - n\eta\, \nabla_X D(P_X, P_Y)$
end for
Output: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$

[Figure 4: Gradient flows from the empirical distribution over the colored points to the empirical distribution over the S-shaped points produced by different distances. The Wasserstein-2 distance ($\times 10^{-2}$) between the current empirical distribution and the S-shape distribution, together with the computational time (in seconds) to reach that step, is reported at the top of each panel (initial → intermediate → final): SW: 25.3149 (0s) → 0.5913 (1.07s) → 0.0099 (1.55s); Max-SW (T=30): 25.3149 → 0.1091 (2.37s) → 0.0098 (3.48s); K-SW (L=15, K=2): 25.3149 → 0.5846 (1.16s) → 0.0098 (1.71s); Max-K-SW (K=2, T=15): 25.3149 → 0.7388 (2.36s) → 0.0146 (3.35s); DSW (L=2, T=5): 25.3149 → 0.2716 (1.06s) → 0.011 (1.5s); oMSW (L=5, T=2): 25.3149 → 0.5783 (0.59s) → 0.0104 (0.87s); iMSW (L=2, T=5): 25.3149 → 0.0483 (0.99s) → 0.0064 (1.41s); viMSW (L=2, T=5, κ=50): 25.3149 → 0.0512 (2.05s) → 0.0043 (2.94s).]

C Additional Experiments

In this section, we present the details of the experimental frameworks and additional experiments on gradient flows, color transfer, and deep generative modeling that are not in the main paper.

C.1 Gradient Flows

Framework. We have discussed the gradient flow framework in detail in Section 4.1 of the main paper. Here, we summarize the Euler scheme for solving the gradient flow in Algorithm 7; a minimal code sketch of this scheme is given at the end of this subsection.

Visualization of gradient flows. We show the visualization of gradient flows from all distances (Table 1) in Figure 4. Overall, we observe that the quality of the flows is consistent with the quantitative Wasserstein-2 score, which is computed using [18]. From the figures, we see that iMSW and viMSW help the flows converge very fast; namely, the Wasserstein-2 scores at step 200 of iMSW and viMSW are much lower than those of the other distances. With $L = 5$ and $T = 2$, oMSW achieves results comparable to SW, K-SW, and Max-SW while being faster.

Studies on hyper-parameters. We run gradient flows with different values of the hyper-parameters and report the Wasserstein-2 scores and computational time in Table 3.
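Before turning to Table 3, here is a minimal PyTorch sketch of the Euler scheme in Algorithm 7. It uses a plain sliced Wasserstein distance as a stand-in for the metric D and relies on automatic differentiation for the gradient with respect to the particles; the function names and default values are illustrative assumptions, not the paper's released code.

```python
import torch

def sw_distance(X, Y, L=50, p=2):
    # Plain sliced Wasserstein between two equal-size point clouds (a stand-in for D).
    d = X.shape[1]
    thetas = torch.randn(L, d, device=X.device)
    thetas = thetas / thetas.norm(dim=1, keepdim=True)  # random directions on the sphere
    x_proj = X @ thetas.T                                # shape (n, L)
    y_proj = Y @ thetas.T
    x_sorted, _ = torch.sort(x_proj, dim=0)
    y_sorted, _ = torch.sort(y_proj, dim=0)
    return ((x_sorted - y_sorted).abs() ** p).mean() ** (1.0 / p)

def gradient_flow(X0, Y, distance=sw_distance, n_iters=300, eta=1e-3):
    """Euler scheme of Algorithm 7: X <- X - n * eta * grad_X D(P_X, P_Y)."""
    X = X0.clone().requires_grad_(True)
    n = X.shape[0]
    for _ in range(n_iters):
        loss = distance(X, Y)
        grad, = torch.autograd.grad(loss, X)
        with torch.no_grad():
            X -= n * eta * grad
    return X.detach()
```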
Table 3: Summary of Wasserstein-2 scores and computational time in seconds (s) of different distances in the gradient flow application.

| Distance | Wasserstein-2 (↓) | Time (s, ↓) |
|---|---|---|
| SW (L=10) | 0.0113 ×10⁻² | 0.85 |
| SW (L=100) | 0.0096 ×10⁻² | 4.32 |
| Max-SW (T=5) | 0.0231 ×10⁻² | 1.02 |
| Max-SW (T=100) | 0.0083 ×10⁻² | 10.46 |
| K-SW (L=5, K=2) | 0.0104 ×10⁻² | 0.92 |
| K-SW (L=20, K=2) | 0.0096 ×10⁻² | 1.97 |
| Max-K-SW (K=2, T=5) | 0.0152 ×10⁻² | 1.41 |
| Max-K-SW (K=2, T=100) | 0.0083 ×10⁻² | 10.46 |
| iMSW (L=1, T=5) | 0.0109 ×10⁻² | 1.07 |
| iMSW (L=5, T=5) | 0.0055 ×10⁻² | 2.44 |
| iMSW (L=2, T=10) | 0.0052 ×10⁻² | 2.79 |
| iMSW (L=5, T=2) | 0.0071 ×10⁻² | 1.14 |
| iMSW (L=2, T=5, M=4) | 0.0101 ×10⁻² | 1.2 |
| iMSW (L=2, T=5, M=2) | 0.0055 ×10⁻² | 1.25 |
| iMSW (L=2, T=5, M=0, N=2) | 0.0066 ×10⁻² | 1.28 |
| iMSW (L=2, T=5, M=2, N=2) | 0.0072 ×10⁻² | 1.19 |
| viMSW (L=2, T=5, κ=10) | 0.0052 ×10⁻² | 3.12 |
| viMSW (L=2, T=5, κ=100) | 0.0053 ×10⁻² | 2.76 |

[Figure 5: The source image, the target image, and the transferred images produced by different distances. The Wasserstein-2 distance between the empirical distribution over the transferred color palettes and the empirical distribution over the target color palette, together with the computational time (in seconds), is reported at the top of each image, and the color palettes are shown below the corresponding images: SW (L=45) 41.77s, W2 = 414.51; Max-SW (T=45) 59.13s, W2 = 449.42; K-SW (L=15, K=3) 38.86s, W2 = 411.74; Max-K-SW (K=3, T=15) 53.95s, W2 = 479.43; oMSW (L=3, T=5) 13.69s, W2 = 415.06; iMSW (L=3, T=5) 25.91s, W2 = 16.97; viMSW (L=3, T=5, κ=50) 29.14s, W2 = 16.48; viMSW (L=3, T=5, κ=100) 29.06s, W2 = 17.09.]

From the table and Figure 4, we see that SW with $L = 10$ is worse than oMSW, iMSW, and viMSW with $L = 2$, $T = 5$ (10 total projections). Increasing the number of projections to 100 improves SW; however, its Wasserstein-2 score is still higher than those of iMSW and viMSW, while its computational time is larger. Similarly, Max-(K)-SW with $T = 100$ is better than Max-(K)-SW with $T = 5$ and $T = 10$; however, it is still worse than iMSW and viMSW in terms of both computation and performance. For burning and thinning, we see that the technique can improve the computation considerably; more importantly, burning and thinning do not reduce the performance by much. For iMSW, increasing $L$ and $T$ leads to a better flow; for the same total number of projections, e.g., 10, the setting $L = 2$, $T = 5$ is better than $L = 5$, $T = 2$. viMSW usually performs better than iMSW; however, its computation is slower due to the sampling complexity of the vMF distribution. We vary the concentration parameter $\kappa \in \{10, 50, 100\}$ and find that $\kappa = 50$ is the best. This suggests that a good balance between heading toward the "max" projecting direction and exploring the space of projecting directions is the best strategy.

C.2 Color Transfer

Framework. In our experiments, we first compress the color palettes of the source image and the target image to 3,000 colors by using K-Means clustering.

[Figure 6: The source image, the target image, and the transferred images produced by different distances. The Wasserstein-2 distance between the empirical distribution over the transferred color palettes and the empirical distribution over the target color palette, together with the computational time (in seconds), is reported at the top of each image, and the color palettes are shown below the corresponding images: SW (L=45) 40.48s, W2 = 68.09; Max-SW (T=45) 59.98s, W2 = 207.12; K-SW (L=15, K=3) 39.09s, W2 = 67.88; Max-K-SW (K=3, T=15) 54.23s, W2 = 65.52; oMSW (L=3, T=5) 13.78s, W2 = 68.51; iMSW (L=3, T=5) 26.12s, W2 = 22.35; viMSW (L=3, T=5, κ=50) 29.16s, W2 = 22.1; viMSW (L=3, T=5, κ=100) 29.15s, W2 = 22.04.]
Algorithm 8 Color Transfer
Input: source color palette $X \in \{0, \ldots, 255\}^{n\times 3}$, target color palette $Y \in \{0, \ldots, 255\}^{n\times 3}$, the number of Euler iterations $T$ (abuse of notation), the Euler step size $\eta$ (abuse of notation), a metric $D$.
for $t = 1$ to $T$ do
  $X = X - n\eta\, \nabla_X D(P_X, P_Y)$
end for
$X = \text{round}(X, \{0, \ldots, 255\})$
Output: $X$

After compressing the palettes, the color transfer is conducted using Algorithm 8, which is a modified version of the gradient flow algorithm since the color palette contains only non-negative integers in $\{0, \ldots, 255\}$. The flow can be seen as an incomplete transportation map that maps the source color palette to a palette that is close to the target color palette. This is quite similar to the iterative distribution transfer algorithm [8]; however, the construction of the iterative map is different.

Visualization of transferred images. We show the source image, the target image, and the corresponding transferred images from the different distances in Figure 5 and Figure 6. The color palettes are given below the corresponding images, and the Wasserstein-2 distance between the empirical distribution over the transferred color palettes and the empirical distribution over the target color palette, together with the computational time (in seconds), is reported at the top of each figure.

Table 4: Summary of Wasserstein-2 scores and computational time in seconds (s) of different distances in the color transfer application.

| Distance | Wasserstein-2 (↓) | Time (s, ↓) |
|---|---|---|
| SW (L=45) | 414.51 | 41.77 |
| SW (L=15) | 421.5 | 12.96 |
| Max-SW (T=45) | 449.42 | 59.13 |
| Max-SW (T=15) | 450.37 | 19.03 |
| K-SW (L=15, K=3) | 411.74 | 38.86 |
| K-SW (L=5, K=3) | 413.16 | 14.2 |
| Max-K-SW (K=3, T=15) | 479.43 | 53.95 |
| Max-K-SW (K=3, T=5) | 510.43 | 17.46 |
| oMSW (L=3, T=5) | 415.06 | 13.69 |
| oMSW (L=3, T=15) | 414.29 | 38.51 |
| iMSW (L=3, T=5) | 16.97 | 25.91 |
| iMSW (L=3, T=15) | 15.23 | 79.47 |
| iMSW (L=5, T=5) | 21.63 | 39.82 |
| iMSW (L=5, T=3) | 24.02 | 22.27 |
| iMSW (L=3, T=15, M=14) | 26.23 | 48.08 |
| iMSW (L=3, T=15, M=10) | 18.67 | 55.55 |
| iMSW (L=3, T=15, M=0, N=2) | 16.6 | 62.66 |
| iMSW (L=3, T=15, M=10, N=2) | 19.2 | 50.1 |
| viMSW (L=3, T=5, κ=50) | 16.48 | 29.14 |
| viMSW (L=3, T=5, κ=100) | 16.49 | 29.06 |

First, we observe that the qualitative comparison (transferred images and color palettes) is consistent with the Wasserstein-2 scores. iMSW and viMSW produce transferred images that are closer to the target image in terms of color than the other distances; more importantly, iMSW and viMSW are faster than the other distances. Max-SW and Max-K-SW do not perform well in this application; namely, they are slow and give high Wasserstein distances. oMSW is comparable to SW and K-SW while being faster.

Studies on hyper-parameters. In addition to the results in Figure 5, we run color transfer with other settings of the distances in Table 4. From the table, increasing the number of projections $L$ leads to better results for SW and K-SW; however, they are still worse than iMSW and viMSW with a smaller number of projections. Similarly, increasing $T$ helps Max-SW, Max-K-SW, and iMSW perform better. As discussed in the main paper, the burning and thinning technique improves the computation and sometimes enhances the performance.

C.3 Deep Generative Models

Framework. We follow the generative modeling framework from [20, 42]. Here, we state an adapted formulation of the framework.
We are given a data distribution $\mu\in\mathcal{P}(\mathcal{X})$ through its random samples (the data). Our goal is to estimate a parametric distribution $\nu_\phi$ that belongs to a family of distributions indexed by parameters $\phi$ in a parameter space $\Phi$. Deep generative modeling constructs $\nu_\phi$ via a pushforward measure: $\nu_\phi$ is implicitly represented by pushing forward a random noise distribution $\nu_0\in\mathcal{P}(\mathcal{Z})$, e.g., a standard multivariate Gaussian, through a parametric function $G_\phi:\mathcal{Z}\to\mathcal{X}$ (a neural network with weights $\phi$). To estimate $\phi$ (and hence $\nu_\phi$), the expected distance estimator [54, 41] is used:
\[
\text{argmin}_{\phi\in\Phi}\ \mathbb{E}_{(X, Z)\sim\mu^{\otimes m}\otimes\nu_0^{\otimes m}}\big[D(P_X, P_{G_\phi(Z)})\big],
\]
where $m\geq 1$, $D$ can be any distance on the space of probability measures, $\mu^{\otimes m}$ is the product measure, namely, $X = (x_1, \ldots, x_m)\sim\mu^{\otimes m}$ is equivalent to $x_i\sim\mu$ i.i.d. for $i = 1, \ldots, m$, and $P_X = \frac{1}{m}\sum_{i=1}^{m}\delta_{x_i}$. Similarly, $Z = (z_1, \ldots, z_m)$ with $z_i\sim\nu_0$ for $i = 1, \ldots, m$, and $G_\phi(Z)$ is the output of the neural network given the input mini-batch $Z$. By using the Wasserstein distance, the sliced Wasserstein distance, or their variants as the distance $D$, we obtain the corresponding estimators. However, applying those estimators directly to natural image data does not give perceptually good results [20, 15]. The reason is that the Wasserstein distance, the sliced Wasserstein distances, and their variants require a ground metric as input, e.g., $L_2$; however, such ground metrics are not meaningful on raw images. Therefore, previous works propose using a function that maps the original data space $\mathcal{X}$ to a feature space $\mathcal{F}$ where the $L_2$ norm is meaningful [52]. Denoting the feature function $F_\gamma:\mathcal{X}\to\mathcal{F}$, the estimator becomes:
\[
\text{argmin}_{\phi\in\Phi}\ \mathbb{E}_{(X, Z)\sim\mu^{\otimes m}\otimes\nu_0^{\otimes m}}\big[D(P_{F_\gamma(X)}, P_{F_\gamma(G_\phi(Z))})\big].
\]

[Figure 7: Randomly generated images on CIFAR10 from models trained with SW, Max-K-SW, K-SW, oMSW, iMSW, and viMSW.]

The above optimization can be solved by a stochastic gradient descent algorithm with the following stochastic gradient estimator:
\[
\nabla_\phi\, \mathbb{E}_{(X, Z)\sim\mu^{\otimes m}\otimes\nu_0^{\otimes m}}\big[D(P_{F_\gamma(X)}, P_{F_\gamma(G_\phi(Z))})\big] = \mathbb{E}_{(X, Z)\sim\mu^{\otimes m}\otimes\nu_0^{\otimes m}}\big[\nabla_\phi D(P_{F_\gamma(X)}, P_{F_\gamma(G_\phi(Z))})\big] \approx \frac{1}{K}\sum_{k=1}^{K}\nabla_\phi D(P_{F_\gamma(X_k)}, P_{F_\gamma(G_\phi(Z_k))}),
\]
where $X_1, \ldots, X_K$ are drawn i.i.d. from $\mu^{\otimes m}$ and $Z_1, \ldots, Z_K$ are drawn i.i.d. from $\nu_0^{\otimes m}$.

There are several ways to estimate the feature function $F_\gamma$ in practice. In our experiments, we use the following objective [15], maximized with respect to $\gamma$:
\[
\mathbb{E}_{X\sim\mu^{\otimes m}}\big[\min(0, -1 + H(F_\gamma(X)))\big] + \mathbb{E}_{Z\sim\nu_0^{\otimes m}}\big[\min(0, -1 - H(F_\gamma(G_\phi(Z))))\big],
\]
where $H:\mathcal{F}\to\mathbb{R}$. The above optimization problem is also solved by a stochastic gradient algorithm with the following gradient estimator:
\[
\begin{aligned}
&\nabla_\gamma\Big\{\mathbb{E}_{X\sim\mu^{\otimes m}}\big[\min(0, -1 + H(F_\gamma(X)))\big] + \mathbb{E}_{Z\sim\nu_0^{\otimes m}}\big[\min(0, -1 - H(F_\gamma(G_\phi(Z))))\big]\Big\} \\
&\quad = \mathbb{E}_{X\sim\mu^{\otimes m}}\big[\nabla_\gamma\min(0, -1 + H(F_\gamma(X)))\big] + \mathbb{E}_{Z\sim\nu_0^{\otimes m}}\big[\nabla_\gamma\min(0, -1 - H(F_\gamma(G_\phi(Z))))\big] \\
&\quad \approx \frac{1}{K}\sum_{k=1}^{K}\nabla_\gamma\min(0, -1 + H(F_\gamma(X_k))) + \frac{1}{K}\sum_{k=1}^{K}\nabla_\gamma\min(0, -1 - H(F_\gamma(G_\phi(Z_k)))),
\end{aligned}
\]
where $X_1, \ldots, X_K$ are drawn i.i.d. from $\mu^{\otimes m}$ and $Z_1, \ldots, Z_K$ are drawn i.i.d. from $\nu_0^{\otimes m}$.

Settings. We use the following neural networks for $G_\phi$ and $F_\gamma$.

CIFAR10:
$G_\phi$: $z\in\mathbb{R}^{128}$ ($\sim\nu_0 = \mathcal{N}(0, 1)$) → 4×4×256 (Dense, Linear) → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → BN, ReLU, 3×3 Conv, 3, Tanh.
$F_{\gamma_1}$: $x\in[-1, 1]^{32\times 32\times 3}$ → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock 128 → ResBlock 128.
$F_{\gamma_2}$: $x\in\mathbb{R}^{128\times 8\times 8}$ → ReLU → Global sum pooling (128) → 1 (Spectral normalization). $F_\gamma(x) = (F_{\gamma_1}(x), F_{\gamma_2}(F_{\gamma_1}(x)))$ and $H(F_\gamma(x)) = F_{\gamma_2}(F_{\gamma_1}(x))$.

CelebA:
$G_\phi$: $z\in\mathbb{R}^{128}$ ($\sim\nu_0 = \mathcal{N}(0, 1)$) → 4×4×256 (Dense, Linear) → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → ResBlock up 256 → BN, ReLU, 3×3 Conv, 3, Tanh.
$F_{\gamma_1}$: $x\in[-1, 1]^{32\times 32\times 3}$ → ResBlock down 128 → ResBlock down 128 → ResBlock down 128 → ResBlock 128 → ResBlock 128.
$F_{\gamma_2}$: $x\in\mathbb{R}^{128\times 8\times 8}$ → ReLU → Global sum pooling (128) → 1 (Spectral normalization). $F_\gamma(x) = (F_{\gamma_1}(x), F_{\gamma_2}(F_{\gamma_1}(x)))$ and $H(F_\gamma(x)) = F_{\gamma_2}(F_{\gamma_1}(x))$.

For all datasets, the number of training iterations is set to 50,000. We update the generator $G_\phi$ every 5 iterations, while we update the feature function $F_\gamma$ every iteration. The mini-batch size $m$ is set to 128 for all datasets. The learning rate for $G_\phi$ and $F_\gamma$ is 0.0002, and the optimizer is Adam [25] with parameters $(\beta_1, \beta_2) = (0, 0.9)$. We use the order $p = 2$ for all sliced Wasserstein variants. We use 50,000 random samples from the estimated generative models $G_\phi$ to compute the FID scores and the Inception scores. In evaluating FID scores, we use all training samples to compute the statistics of the datasets².

²We evaluate the scores based on the code from https://github.com/GongXinyuu/sngan.pytorch.

Generated images. We show generated images on CIFAR10 and CelebA from generative models trained with the different distances in Figure 7 and Figure 8, respectively. Overall, the images are visually consistent with the quantitative FID scores in Table 2.

Table 5: Summary of FID and IS scores of methods on CIFAR10 (32×32) and CelebA (64×64).

| Method | CIFAR10 FID (↓) | CIFAR10 IS (↑) | CelebA FID (↓) |
|---|---|---|---|
| iMSW (L=100, T=10, M=0, N=1) | 14.61 ± 0.72 | 8.15 ± 0.15 | 9.73 ± 0.33 |
| iMSW (L=100, T=10, M=9, N=1) | 14.16 ± 1.11 | 8.17 ± 0.07 | 9.10 ± 0.34 |
| iMSW (L=100, T=10, M=5, N=1) | 13.93 ± 0.21 | 8.15 ± 0.05 | 9.49 ± 0.52 |
| iMSW (L=100, T=10, M=0, N=2) | 14.33 ± 0.32 | 8.15 ± 0.06 | 8.99 ± 0.64 |
| iMSW (L=10, T=100, M=0, N=1) | 14.26 ± 0.74 | 8.15 ± 0.07 | 8.89 ± 0.23 |
| iMSW (L=10, T=100, M=99, N=1) | 14.50 ± 0.70 | 8.12 ± 0.08 | 9.55 ± 0.35 |
| iMSW (L=10, T=100, M=50, N=1) | 14.41 ± 0.58 | 8.12 ± 0.06 | 9.46 ± 0.73 |
| iMSW (L=10, T=100, M=0, N=2) | 14.65 ± 0.01 | 8.11 ± 0.06 | 9.49 ± 0.39 |

Studies on hyperparameters. We run some additional settings of iMSW to investigate the performance of the burning and thinning technique and to compare the roles of $L$ and $T$ in Table 5. First, we see that burning and thinning help to improve the FID and IS scores on CIFAR10 and CelebA in the settings with $L = 100$, $T = 10$. It is worth noting that the original purpose of burning and thinning is to reduce computational complexity and memory complexity; the side benefit of improving performance requires more investigation, which we leave for future work. In addition, we find that for the same total number of projections (1,000) without burning and thinning, the setting $L = 10$, $T = 100$ is better than $L = 100$, $T = 10$ on CIFAR10; however, the opposite holds on CelebA. Therefore, on different datasets, hyperparameter tuning may be required to find the best setting of the number of projections $L$ and the number of timesteps $T$.

[Figure 8: Randomly generated images on CelebA from models trained with SW, Max-K-SW, K-SW, oMSW, iMSW, and viMSW.]
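To make the expected-distance estimator of this section concrete, here is a minimal PyTorch-style sketch of one generator update, using a plain sliced Wasserstein distance between features of a real mini-batch and a generated mini-batch as a stand-in for the MSW variants. The function names, the feature_net, and the hyper-parameter values are illustrative assumptions and do not correspond to the paper's released implementation.

```python
import torch

def sliced_wasserstein(u, v, n_projections=100, p=2):
    # Sliced Wasserstein distance between two equal-size batches of feature vectors.
    d = u.shape[1]
    thetas = torch.randn(n_projections, d, device=u.device)
    thetas = thetas / thetas.norm(dim=1, keepdim=True)
    u_proj = torch.sort(u @ thetas.T, dim=0)[0]
    v_proj = torch.sort(v @ thetas.T, dim=0)[0]
    return ((u_proj - v_proj).abs() ** p).mean() ** (1.0 / p)

def generator_step(generator, feature_net, optimizer, real_batch, z_dim=128):
    """One update of G_phi: minimize D(P_{F(X)}, P_{F(G(Z))}) on a mini-batch.
    Assumes feature_net maps images to flat feature vectors and that `optimizer`
    holds only the generator parameters."""
    m = real_batch.shape[0]
    z = torch.randn(m, z_dim, device=real_batch.device)
    fake_feats = feature_net(generator(z))
    with torch.no_grad():
        real_feats = feature_net(real_batch)   # features of the real mini-batch
    loss = sliced_wasserstein(real_feats, fake_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```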