# Sliced Wasserstein with Random-Path Projecting Directions

Khai Nguyen¹, Shujian Zhang¹, Tam Le²³, Nhat Ho¹

¹Department of Statistics and Data Sciences, University of Texas at Austin, USA. ²Department of Advanced Data Science, The Institute of Statistical Mathematics (ISM), Japan. ³RIKEN AIP, Japan. Correspondence to: Khai Nguyen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing distribution that provides fast sampling for the Monte Carlo estimation of the expectation. In particular, we introduce the random-path projecting direction (RPD), which is constructed by leveraging the normalized difference between two random vectors following the two input measures. From the RPD, we derive the random-path slicing distribution (RPSD) and two variants of sliced Wasserstein, i.e., the Random-Path Projection Sliced Wasserstein (RPSW) and the Importance Weighted Random-Path Projection Sliced Wasserstein (IWRPSW). We then discuss the topological, statistical, and computational properties of RPSW and IWRPSW. Finally, we showcase the favorable performance of RPSW and IWRPSW in gradient flow and the training of denoising diffusion generative models on images.

1. Introduction

Utilizing the closed-form solution of optimal transport in one dimension (Peyré & Cuturi, 2020), the sliced Wasserstein (SW) distance (Bonneel et al., 2015) is computationally and statistically scalable. In particular, SW has a time complexity of $\mathcal{O}(n \log n)$ and a space complexity of $\mathcal{O}(n)$ when dealing with two probability measures that have at most $n$ supports. Moreover, SW does not suffer from the curse of dimensionality since it has a sample complexity of $\mathcal{O}(n^{-1/2})$ (Nadjahi et al., 2020b; Nietert et al., 2022). As evidence for the effectiveness of the SW distance, many applications have leveraged SW as the key to improving performance, e.g., generative models (Deshpande et al., 2018), domain adaptation (Lee et al., 2019), point-cloud upsampling (Savkin et al., 2022), clustering (Kolouri et al., 2018), gradient flows (Bonet et al., 2022), and so on.

A key component of the SW is the slicing distribution, i.e., the distribution of the projecting direction. The slicing distribution controls the contribution of each projecting direction to the final value of the SW. A statistically desirable slicing distribution should satisfy two properties. The first property is to assign high probability to informative projecting directions. In applications that involve weak convergence of measures, such as generative modeling (Deshpande et al., 2018) and gradient flow (Bonet et al., 2022), an informative direction can be interpreted as a discriminative direction that highlights the difference between the two measures after projection. The second property is to have the maximal projecting direction in the support set, i.e., the projecting direction that leads to the maximum projected distance. This property ensures the identity of indiscernibles of the expected value.
To ensure the second property, a more relaxed criterion can be considered, which involves having a continuous density on the unit hypersphere. Beyond statistical properties, two computational properties are desired for the slicing distribution. Firstly, the slicing distribution should be stable to obtain, meaning that the parameters of the slicing distribution can be computationally identifiable. Secondly, it should be fast and efficient to sample projecting directions from the slicing distribution. This property is essential, as Monte Carlo samples are used to approximate the intractable expectation with respect to the slicing distribution.

The conventional SW (Bonneel et al., 2015) utilizes the uniform distribution over the unit hypersphere as the slicing distribution, which is a fixed, continuous, and easy-to-sample distribution. However, the uniform slicing distribution is not discriminative. Distributional sliced Wasserstein (DSW) (Nguyen et al., 2021) performs optimization to find the best slicing distribution belonging to a chosen family of distributions over the unit hypersphere, aiming to maximize the expected projected distance. When the family is a collection of Dirac measures, DSW reverts to max sliced Wasserstein (Max-SW) (Deshpande et al., 2019). DSW and Max-SW are discriminative, but they are not stable, are computationally expensive, and require a differentiable ground metric. This is because they involve iterative gradient-based optimization, which may not guarantee a global optimum. Energy-based sliced Wasserstein (EBSW) (Nguyen & Ho, 2023) utilizes the energy-based slicing distribution, which has a density proportional to an increasing function of the projected distance. The energy-based slicing distribution is continuous, optimization-free, and discriminative. However, it requires more sophisticated sampling techniques to obtain projecting directions, such as importance sampling and Markov chain Monte Carlo (MCMC), due to its unnormalized density. Moreover, these sampling techniques lead to biased weighted estimates.

To design an effective slicing distribution, the key is to relate the continuous slicing distribution to the two input measures. This connection can be achieved through optimization, as demonstrated in DSW (Nguyen et al., 2021), or by directly defining the slicing distribution based on the two input measures, as seen in EBSW (Nguyen & Ho, 2023). Designing an efficient slicing distribution poses a more challenging task. Iterative procedures, such as optimization (as in DSW) and sequential sampling (as in EBSW with MCMC), should be avoided. The ideal scenario is to have a slicing distribution that allows for i.i.d. sampling, making it parallelizable. Using a proposal distribution, importance sampling in EBSW could yield a fast estimate. However, importance sampling requires a careful choice of the proposal distribution for accurate estimation, especially in high dimensions. Additionally, importance sampling leads to biased estimates in the case of unnormalized densities.

To address the challenge, we propose a novel, effective, and efficient slicing distribution based on a new notion of projecting direction, named the random-path projecting direction (RPD). The idea is to first create a path between the two input measures, then project the two measures along that path and conduct the one-dimensional Wasserstein comparison.
Here, a path is defined as the normalized difference between two random variables distributed according to the two input measures, respectively, with an additional random perturbation to guarantee continuity. Since the path is defined as a random difference, it captures the directions in which the two measures differ. Therefore, using the distribution of the random paths can lead to a discriminative slicing distribution. In addition, sampling a random path is efficient since it only requires the mild assumption that the two input measures are easy to sample from, a condition that usually holds in machine learning applications.

Contribution. In summary, our contributions are three-fold:

1. We propose a new type of projecting direction, named the random-path projecting direction (RPD), which involves a random perturbation of the normalized difference between two random variables distributed under the two input measures. From the RPD, we derive the random-path slicing distribution (RPSD), which is guaranteed to be continuous. Despite having an intractable density, the RPSD is fast and simple to sample random projecting directions from.

2. From the RPSD, we introduce two novel variants of sliced Wasserstein. The first variant is called random-path projection sliced Wasserstein (RPSW), which replaces the uniform distribution in SW with the RPSD. The second variant is importance-weighted random-path projection sliced Wasserstein (IWRPSW). It is defined as an expectation of a weighted average of multiple randomly projected distances with RPDs. We then establish theoretical properties of RPSW and IWRPSW, encompassing topological, statistical, and computational aspects. In greater detail, we delve into the metricity of RPSW and IWRPSW, their connections to other SW variants, their sample complexity, and their computational complexities using Monte Carlo methods.

3. We conduct a comparative analysis of the proposed RPSW and IWRPSW against existing SW variants, including SW, Max-SW, DSW, and EBSW. Our first application involves gradient flow, where distances are utilized to guide a source distribution towards a target distribution. Furthermore, we introduce a new framework for training denoising diffusion models using SW metrics through the augmented generalized mini-batch energy (AGME) distance. In this framework, SW variants serve as the kernel to assess the difference between two random sets. Here, AGME functions as the loss function, aiming to minimize the difference between forward transition distributions and their corresponding reverse transition distributions in denoising diffusion models. In addition to serving as a benchmark for comparing SW variants, our proposed training approach can improve generative quality and sampling speed.

Organization. We begin by reviewing some preliminaries in Section 2. Following that, in Section 3, we define the random-path projecting direction, the random-path slicing distribution, RPSW, and IWRPSW. We then derive their theoretical and computational properties. Section 4 presents experiments comparing RPSW and IWRPSW with other SW variants. In Section 5, we provide concluding remarks. Finally, we defer the proofs of key results and additional materials to the Appendices.

Notations. For any $d \ge 2$, we denote $\mathbb{S}^{d-1} := \{\theta \in \mathbb{R}^d \mid \|\theta\|_2^2 = 1\}$ as the unit hypersphere and $\mathcal{U}(\mathbb{S}^{d-1})$ as its corresponding uniform distribution. We denote $\mathcal{P}(\mathcal{X})$ as the set of all probability measures on the set $\mathcal{X}$. For $p \ge 1$, $\mathcal{P}_p(\mathcal{X})$ is the set of all probability measures on $\mathcal{X}$ that have finite $p$-moments.
For any two sequences $a_n$ and $b_n$, the notation $a_n = \mathcal{O}(b_n)$ means that $a_n \le C b_n$ for all $n \ge 1$, where $C$ is some universal constant.

2. Preliminaries

We first review some essential preliminaries.

One-dimensional Wasserstein. Let $\mu$ and $\nu$ be two one-dimensional measures belonging to $\mathcal{P}_p(\mathbb{R})$; the Wasserstein distance has the closed form
$$W_p^p(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\pi(x, y) = \int_0^1 |F_\mu^{-1}(z) - F_\nu^{-1}(z)|^p \, dz,$$
where $F_\mu$ and $F_\nu$ are the cumulative distribution functions (CDFs) of $\mu$ and $\nu$, respectively.

Sliced Wasserstein. To utilize this closed form, the sliced Wasserstein distance averages all possible projected Wasserstein distances, i.e., the sliced Wasserstein (SW) distance (Bonneel et al., 2015) between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{SW}_p^p(\mu, \nu) = \mathbb{E}_{\theta \sim \mathcal{U}(\mathbb{S}^{d-1})}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big], \quad (1)$$
where $\theta \sharp \mu$ is the push-forward measure of $\mu$ through the function $f : \mathbb{R}^d \to \mathbb{R}$ with $f(x) = \theta^\top x$. However, the expectation in the definition of the SW distance is intractable to compute. Therefore, a Monte Carlo scheme is employed to approximate the value:
$$\widehat{\text{SW}}_p^p(\mu, \nu; L) = \frac{1}{L} \sum_{l=1}^{L} W_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu),$$
where $\theta_1, \ldots, \theta_L \overset{i.i.d.}{\sim} \mathcal{U}(\mathbb{S}^{d-1})$ are referred to as projecting directions. The push-forward measures $\theta_1 \sharp \mu, \ldots, \theta_L \sharp \mu$ are called projections of $\mu$ (similarly for $\nu$). The number of Monte Carlo samples $L$ is often referred to as the number of projections. When $\mu$ and $\nu$ are discrete measures that have at most $n$ supports, the time complexity and memory complexity of SW are $\mathcal{O}(Ln \log n + Ldn)$ and $\mathcal{O}(Ld + Ln)$, respectively.

Distributional sliced Wasserstein. To select a better slicing distribution, Nguyen et al. (2021) introduce the distributional sliced Wasserstein (DSW) distance, which is defined between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ as:
$$\text{DSW}_p^p(\mu, \nu) = \max_{\psi \in \Psi} \mathbb{E}_{\theta \sim \sigma_\psi(\theta)}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big], \quad (2)$$
where $\sigma_\psi(\theta) \in \mathcal{P}(\mathbb{S}^{d-1})$, e.g., an implicit distribution (Nguyen et al., 2021), the von Mises-Fisher (vMF) distribution (Jupp & Mardia, 1979), or the Power Spherical (PS) distribution (De Cao & Aziz, 2020) with an unknown location parameter, i.e., $\sigma_\psi(\theta) := \text{vMF}(\theta \mid \epsilon, \kappa)$ (or its PS counterpart) with $\psi = \epsilon$. After using $T \ge 1$ (projected) stochastic (sub)gradient ascent iterations to obtain an estimate $\hat{\psi}_T$ of the parameter, Monte Carlo samples $\theta_1, \ldots, \theta_L \overset{i.i.d.}{\sim} \sigma_{\hat{\psi}_T}(\theta)$ are used to approximate the value of DSW. The time complexity and space complexity of DSW are $\mathcal{O}(LTn \log n + LTdn)$ and $\mathcal{O}(Ld + Ln)$, without counting the complexities of sampling from $\sigma_{\hat{\psi}_T}(\theta)$.

Max sliced Wasserstein. By letting the concentration parameter $\kappa \to \infty$, the vMF and PS distributions degenerate to the Dirac distribution, and we obtain the max sliced Wasserstein distance (Deshpande et al., 2019). The max sliced Wasserstein (Max-SW) distance between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:
$$\text{Max-SW}_p(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} W_p(\theta \sharp \mu, \theta \sharp \nu). \quad (3)$$
Max-SW is computed by using $T \ge 1$ iterations of (projected) (sub)gradient ascent to obtain an approximation $\hat{\theta}_T$ of the max projecting direction. The approximated value of Max-SW is then set to $W_p(\hat{\theta}_T \sharp \mu, \hat{\theta}_T \sharp \nu)$. It is worth noting that the optimization problem is non-convex (Nietert et al., 2022), which means the global optimum $\theta^\star$ cannot be guaranteed. As a result, the approximation of Max-SW is not a metric even when we let the number of iterations $T \to \infty$. The time complexity and space complexity of Max-SW are $\mathcal{O}(Tn \log n + Tdn)$ and $\mathcal{O}(d + n)$.
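Before moving on to the energy-based variant, a minimal numerical sketch may help make the Monte Carlo estimate $\widehat{\text{SW}}_p$ above concrete. The Python/NumPy sketch below assumes two empirical measures with the same number of uniformly weighted support points, so the one-dimensional Wasserstein distances reduce to comparing sorted projections; it is an illustration under those assumptions, not the paper's reference implementation.

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """Closed-form one-dimensional W_p^p between two empirical measures with the
    same number of uniformly weighted support points: sort and compare quantiles."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p)

def sliced_wasserstein(X, Y, L=100, p=2, rng=None):
    """Monte Carlo estimate of SW_p(mu, nu) with L uniform projecting directions.

    X, Y: (n, d) arrays holding the support points of the two empirical measures.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    theta = rng.standard_normal((L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # uniform on S^{d-1}
    proj_X, proj_Y = X @ theta.T, Y @ theta.T                # (n, L) projections
    values = [wasserstein_1d(proj_X[:, l], proj_Y[:, l], p) for l in range(L)]
    return np.mean(values) ** (1.0 / p)

# Example: two Gaussian point clouds in dimension 10 (illustrative values).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
Y = rng.standard_normal((500, 10)) + 1.0
print(sliced_wasserstein(X, Y, L=200, p=2, rng=rng))
```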
Energy-based sliced Wasserstein. In practice, optimization often costs more computation than sampling from a fixed slicing distribution. To avoid expensive optimization, recent work (Nguyen & Ho, 2023) proposes an optimization-free approach based on the energy-based slicing distribution. The energy-based sliced Wasserstein (EBSW) distance between two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$ is defined as follows:
$$\text{EBSW}_p^p(\mu, \nu; f) = \mathbb{E}_{\theta \sim \sigma_{\mu,\nu}(\theta; f, p)}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big], \quad (4)$$
where $f : [0, \infty) \to (0, \infty)$ is an increasing energy function, e.g., $f(x) = e^x$, and $\sigma_{\mu,\nu}(\theta; f, p) \propto f(W_p^p(\theta \sharp \mu, \theta \sharp \nu))$. EBSW can be approximated by importance sampling with the uniform proposal distribution $\sigma_0 = \mathcal{U}(\mathbb{S}^{d-1})$. For $\theta_1, \ldots, \theta_L \overset{i.i.d.}{\sim} \sigma_0(\theta)$, we have:
$$\widehat{\text{EBSW}}_p^p(\mu, \nu; f, L) = \sum_{l=1}^{L} W_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\, \hat{w}_{\mu,\nu,\sigma_0,f,p}(\theta_l), \quad (5)$$
where $w_{\mu,\nu,\sigma_0,f,p}(\theta) = \frac{f(W_p^p(\theta \sharp \mu,\, \theta \sharp \nu))}{\sigma_0(\theta)}$ is the importance weight function and $\hat{w}_{\mu,\nu,\sigma_0,f,p}(\theta_l) = \frac{w_{\mu,\nu,\sigma_0,f,p}(\theta_l)}{\sum_{l'=1}^{L} w_{\mu,\nu,\sigma_0,f,p}(\theta_{l'})}$ are the normalized importance weights. The time complexity and space complexity of EBSW are also $\mathcal{O}(Ln \log n + Ldn)$ and $\mathcal{O}(Ld + Ln)$, respectively. However, the estimation is biased, i.e., $\mathbb{E}[\widehat{\text{EBSW}}_p^p(\mu, \nu; f, L)] \neq \text{EBSW}_p^p(\mu, \nu; f)$. In addition to importance sampling, Markov chain Monte Carlo (MCMC) can be used to approximate EBSW; however, it is very computationally expensive (Nguyen & Ho, 2023).

3. Random-Path Projection Sliced Wasserstein

We first discuss the random-path projecting direction and the random-path slicing distribution in Section 3.1. We then discuss the Random-Path Projection Sliced Wasserstein variants in Section 3.2.

3.1. Random-Path Projecting Direction

As discussed, discriminative projecting directions, i.e., directions that can highlight the difference between two distributions after projection, are preferred. Although the energy-based slicing distribution in EBSW can do the job, it is not possible to sample directly from the energy-based slicing distribution. To address the issue, we propose a novel slicing distribution that is based on a new notion of projecting direction, referred to as the random path.

Random paths between two measures. We now define the random path between $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}(\mathbb{R}^d)$.

Definition 1 (Random path). For any $p \ge 1$, dimension $d \ge 1$, $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}(\mathbb{R}^d)$, let $X \sim \mu$ and $Y \sim \nu$; the random path (RP) is defined as:
$$Z_{X,Y} = X - Y. \quad (6)$$

The measure of $Z_{X,Y}$ can be written as $\sigma_{\mu-\nu} := \mu * \big((-\mathrm{id}) \sharp \nu\big)$, where $*$ denotes the convolution operator and $(-\mathrm{id}) \sharp \nu$ denotes the push-forward measure of $\nu$ through the function $f(x) = -x$. The density of $\sigma_{\mu-\nu}$ is intractable; however, it is simple to sample from it, i.e., sample $X \sim \mu$, $Y \sim \nu$, and set $Z_{X,Y} = X - Y$; then $Z_{X,Y} \sim \sigma_{\mu-\nu}$. It is worth noting that $\sigma_{\mu-\nu} \neq \sigma_{\nu-\mu}$ in general.

Random-path projecting direction. From the random path, we can create a random projecting direction by normalizing it, i.e., $\frac{Z_{X,Y}}{\|Z_{X,Y}\|_2}$. However, the distribution of the RP, i.e., $\sigma_{\mu-\nu}$, does not always have full support on $\mathbb{R}^d$, e.g., in discrete cases ($\mu$ and $\nu$ are discrete). Therefore, $\frac{Z_{X,Y}}{\|Z_{X,Y}\|_2}$ does not fully support $\mathbb{S}^{d-1}$, which cannot guarantee theoretical properties and can harm practical performance. As a solution, we can add a random perturbation around the normalized random path. As long as the random perturbation has a continuous density, the marginal density of the final projecting direction is continuous. We now define the random-path projecting direction as follows:

Definition 2 (Random-path projecting direction).
For any $\kappa > 0$, dimension $d \ge 1$, $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}(\mathbb{R}^d)$, let $X \sim \mu$ and $Y \sim \nu$; the random-path projecting direction is
$$\theta \mid X, Y, \kappa \sim \sigma_\kappa\big(\cdot\,; \mathcal{P}_{\mathbb{S}^{d-1}}(X - Y)\big), \quad (7)$$
where $\mathcal{P}_{\mathbb{S}^{d-1}}(x) = \frac{x}{\|x\|_2}$, and $\sigma_\kappa$ is a location-scale distribution on $\mathbb{S}^{d-1}$ such as the vMF and PS distributions.

In Definition 2, the reason for choosing the location-scale family is to guarantee that the projecting direction concentrates around the normalized random path. We recall the density of the vMF distribution, $\text{vMF}(\theta; \epsilon, \kappa) \propto \exp(\kappa\, \epsilon^\top \theta)$ (Jupp & Mardia, 1979), and of the PS distribution, $\text{PS}(\theta; \epsilon, \kappa) \propto (1 + \epsilon^\top \theta)^\kappa$ (De Cao & Aziz, 2020).

Remark 1. When $\mu$ and $\nu$ have shared supports, $Z_{X,Y}$ has support at $0$, which makes $\mathcal{P}_{\mathbb{S}^{d-1}}(x)$ undefined. For continuous cases, this is not a problem since the probability of $Z_{X,Y} = 0$ is $0$. For discrete cases, we can solve the issue by simply adding a small constant to $Z_{X,Y}$, i.e., $Z_{X,Y} + c$, and treating it as the random path. We find that this is not a problem in practice since the two distributions of interest are often continuous or rarely have shared supports.

Random-path slicing distribution. From Definition 2, we can obtain the slicing distribution of the RPD by marginalizing out $X, Y$. In particular, we have the random-path slicing distribution (RPSD) defined as follows:

Definition 3. For any $0 < \kappa < \infty$, dimension $d \ge 1$, $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}(\mathbb{R}^d)$, let $X \sim \mu$ and $Y \sim \nu$; the random-path slicing distribution (RPSD) is:
$$\sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa) = \int \sigma_\kappa\big(\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(x - y)\big)\, d\mu(x)\, d\nu(y).$$

Although $\sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$ is intractable, we can sample from it easily by sampling $X \sim \mu$, $Y \sim \nu$, then $\theta \sim \sigma_\kappa(\mathcal{P}_{\mathbb{S}^{d-1}}(X - Y))$. The sampling process can also be parallelized for multiple samples $\theta_1, \ldots, \theta_L$ with $L > 1$.

Remark 2. We can rewrite $\sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa) = \int \sigma_\kappa(\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(x - y))\, d\pi(x, y)$, where $\pi = \mu \otimes \nu$ is the independent coupling between $\mu$ and $\nu$ (efficient for computation). As a natural extension, we could use other couplings between $\mu$ and $\nu$. However, a more complicated coupling could cost more computation for sampling, while the benefit of using such a coupling is not clear.

3.2. Random-Path Projection Sliced Wasserstein

We now discuss the two sliced Wasserstein variants that are based on the random-path slicing distribution.

Definitions. We first define the random-path projection sliced Wasserstein (RPSW) distance.

Definition 4. For any $p \ge 1$, $d \ge 1$, $0 < \kappa < \infty$, and two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$, the random-path projection sliced Wasserstein (RPSW) distance is defined as follows:
$$\text{RPSW}_p^p(\mu, \nu; \sigma_\kappa) = \mathbb{E}_{\theta \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big],$$
where $\sigma_{RP}$ is defined as in Definition 3.

Remark 3. An equivalent definition of RPSW to Definition 4 is:
$$\text{RPSW}_p^p(\mu, \nu; \sigma_\kappa) = \mathbb{E}_{X \sim \mu, Y \sim \nu}\,\mathbb{E}_{\theta \sim \sigma_\kappa(\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(X - Y))}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big].$$

Motivated by the importance sampling estimation of EBSW in Equation (5), we can further adjust the weights of the RPDs $\theta_1, \ldots, \theta_L \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$ by their corresponding projected distances $W_p^p(\theta_1 \sharp \mu, \theta_1 \sharp \nu), \ldots, W_p^p(\theta_L \sharp \mu, \theta_L \sharp \nu)$. We now define the importance weighted random-path projection sliced Wasserstein (IWRPSW) distance.

Definition 5. For any $p \ge 1$, $d \ge 1$, $L \ge 1$, $0 < \kappa < \infty$, an increasing function $f : [0, \infty) \to (0, \infty)$, and two probability measures $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_p(\mathbb{R}^d)$, the importance weighted random-path projection sliced Wasserstein (IWRPSW) is defined as:
$$\text{IWRPSW}_p^p(\mu, \nu; \sigma_\kappa, L, f) = \mathbb{E}_{\theta_1, \ldots, \theta_L \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)}\left[\sum_{l=1}^{L} W_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\, \frac{f\big(W_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\big)}{\sum_{j=1}^{L} f\big(W_p^p(\theta_j \sharp \mu, \theta_j \sharp \nu)\big)}\right],$$
where $\sigma_{RP}$ is defined as in Definition 3.
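Both Definition 4 and Definition 5 only require draws from the RPSD, which, as described above, amounts to sampling a random pair $(X, Y)$ and then a location-scale perturbation around $\mathcal{P}_{\mathbb{S}^{d-1}}(X - Y)$. A minimal sketch of this sampling step is given below; the Power Spherical draw follows Algorithm 1 of De Cao & Aziz (2020), while the assumption that $\mu$ and $\nu$ are empirical measures with uniformly weighted support matrices `X_support` and `Y_support` is ours.

```python
import numpy as np

def sample_power_spherical(loc, kappa, rng):
    """Draw one sample from the Power Spherical distribution PS(loc, kappa),
    following Algorithm 1 of De Cao & Aziz (2020): a Beta-distributed marginal
    along the location axis, a uniform direction on the orthogonal sphere, and a
    Householder reflection mapping e_1 to `loc`."""
    d = loc.shape[0]
    z = rng.beta((d - 1) / 2.0 + kappa, (d - 1) / 2.0)
    t = 2.0 * z - 1.0                                  # component along the location axis
    v = rng.standard_normal(d - 1)
    v /= np.linalg.norm(v)                             # uniform on S^{d-2}
    y = np.concatenate(([t], np.sqrt(max(1.0 - t ** 2, 0.0)) * v))
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - loc
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:                                 # loc is (numerically) e_1
        return y
    u /= norm_u
    return y - 2.0 * u * (u @ y)                       # Householder reflection e_1 -> loc

def sample_rpsd(X_support, Y_support, kappa, L, rng):
    """Sample L random-path projecting directions between two empirical measures
    (uniform weights assumed): draw a support pair, normalize the difference, and
    perturb it with a Power Spherical draw centered at the normalized path."""
    n, d = X_support.shape
    m = Y_support.shape[0]
    thetas = np.empty((L, d))
    for l in range(L):
        x = X_support[rng.integers(n)]
        y = Y_support[rng.integers(m)]
        path = x - y
        loc = path / (np.linalg.norm(path) + 1e-12)    # normalized random path
        thetas[l] = sample_power_spherical(loc, kappa, rng)
    return thetas

# Example: 20 random-path directions between two Gaussian point clouds.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y = rng.standard_normal((200, 3)) + 1.0
thetas = sample_rpsd(X, Y, kappa=50.0, L=20, rng=rng)
```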
Compared to the single random projecting direction in RPSW, IWRPSW utilizes $L$ random projecting directions in its population form.

Topological Properties. We first investigate the metricity of RPSW and IWRPSW.

Theorem 1. For any $p \ge 1$, $L \ge 1$, $f : [0, \infty) \to (0, \infty)$, and $0 < \kappa < \infty$, the random-path projection sliced Wasserstein $\text{RPSW}_p(\cdot, \cdot; \sigma_\kappa)$ and the importance weighted random-path projection sliced Wasserstein $\text{IWRPSW}_p(\cdot, \cdot; \sigma_\kappa, L, f)$ are semi-metrics on the space of probability measures on $\mathbb{R}^d$, namely RPSW and IWRPSW satisfy non-negativity, symmetry, and the identity of indiscernibles. Moreover, RPSW satisfies the quasi-triangle inequality, i.e., $\text{RPSW}_p(\mu_1, \mu_2; \sigma_\kappa) \le \text{RPSW}_p(\mu_1, \mu_3; \sigma_\kappa, \mu_1, \mu_2) + \text{RPSW}_p(\mu_3, \mu_2; \sigma_\kappa, \mu_1, \mu_2)$, where $\text{RPSW}_p^p(\mu_1, \mu_3; \sigma_\kappa, \mu_1, \mu_2) = \mathbb{E}_{\theta \sim \sigma_{RP}(\theta; \mu_1, \mu_2, \sigma_\kappa)}[W_p^p(\theta \sharp \mu_1, \theta \sharp \mu_3)]$ and $\text{RPSW}_p^p(\mu_3, \mu_2; \sigma_\kappa, \mu_1, \mu_2)$ is defined similarly.

The proof of Theorem 1 is given in Appendix A.1. It is worth noting that the triangle inequality for RPSW and IWRPSW is challenging to prove due to the dependency of the slicing distribution on the two input measures. Therefore, it is unknown whether they satisfy the triangle inequality.

Remark 4. Although $\sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$ is not symmetric with respect to $\mu$ and $\nu$, RPSW and IWRPSW are still symmetric since $W_p^p(\theta \sharp \mu, \theta \sharp \nu)$ is symmetric with respect to $\theta$, i.e., $W_p^p(\theta \sharp \mu, \theta \sharp \nu) = W_p^p((-\theta) \sharp \mu, (-\theta) \sharp \nu)$, and we have $\sigma_\kappa(\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(x - y)) = \sigma_\kappa(-\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(y - x))$. We refer to the proof of Theorem 1 for more detail.

We now discuss the connection between RPSW and IWRPSW, and their connections to other SW variants and the Wasserstein distance.

Proposition 1. For any $p \ge 1$, $L \ge 1$, increasing function $f : [0, \infty) \to (0, \infty)$, and $0 < \kappa < \infty$, we have the following relationships:
(i) $\text{RPSW}_p(\mu, \nu; \sigma_\kappa) \le \text{IWRPSW}_p(\mu, \nu; \sigma_\kappa, L, f) \le \text{Max-SW}_p(\mu, \nu) \le W_p(\mu, \nu)$;
(ii) $\text{RPSW}_p(\mu, \nu; \sigma_\kappa) \to \text{SW}_p(\mu, \nu)$ as $\kappa \to 0$;
(iii) $\text{IWRPSW}_p(\mu, \nu; \sigma_\kappa, L, f) \to \text{EBSW}_p(\mu, \nu; f)$ as $L \to \infty$;
(iv) $\text{RPSW}_p(\mu, \nu; \sigma_\kappa) \ge \text{RPSW}_1(\mu, \nu; \sigma_\kappa)$.

The proof of Proposition 1 is given in Appendix A.2.

Statistical Properties. We now discuss the one-sided sample complexity of RPSW and IWRPSW.

Proposition 2. Let $X_1, X_2, \ldots, X_n$ be i.i.d. samples from a probability measure $\mu$ supported on a compact set of $\mathbb{R}^d$. We denote the empirical measure $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$. Then, for any $p > 1$, $L \ge 1$, and $0 < \kappa < \infty$, there exists a universal constant $C > 0$ such that
$$\mathbb{E}[\text{RPSW}_p(\mu_n, \mu; \sigma_\kappa)] \le \mathbb{E}[\text{IWRPSW}_p(\mu_n, \mu; \sigma_\kappa, L, f)] \le C\sqrt{\frac{(d + 1)\log(n + 1)}{n}},$$
where the outer expectation is taken with respect to $X_1, X_2, \ldots, X_n$.

The proof of Proposition 2 is given in Appendix A.3. From the proposition, we see that the approximation rate of the proposed RPSW and IWRPSW between the empirical measure and its population counterpart is of order $n^{-1/2}$ (up to a logarithmic factor). Therefore, the proposed RPSW and IWRPSW, like other SW variants, do not suffer from the curse of dimensionality. Due to the missing proof of the triangle inequality for RPSW and IWRPSW, it is non-trivial to extend the one-sided sample complexity to the two-sided sample complexity as in the conventional SW (Nadjahi et al., 2019).

Computational Properties. We now present how to compute RPSW and IWRPSW in practice.

Monte Carlo estimation. The expectation in RPSW is intractable (Definition 4); hence, Monte Carlo samples are used to estimate RPSW. In particular, we sample $\theta_1, \ldots, \theta_L \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$ as described in Section 3.1, then form the Monte Carlo estimate of RPSW:
$$\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa, L) = \frac{1}{L} \sum_{l=1}^{L} W_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu). \quad (8)$$
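Given directions $\theta_1, \ldots, \theta_L$ drawn from the RPSD (for instance with the sampler sketched in Section 3.1), the estimate in Equation (8) and its importance-weighted counterpart with $H = 1$ reduce to an unweighted versus an $f$-weighted average of projected distances. The sketch below assumes equal-sized, uniformly weighted empirical measures and uses $f = \exp$; the stand-in directions in the usage example are only placeholders for RPSD samples.

```python
import numpy as np

def rpsw_iwrpsw_from_directions(X, Y, thetas, p=2):
    """RPSW estimate of Eq. (8) and the IWRPSW estimate with H = 1, given
    projecting directions thetas (L, d) and equal-sized support matrices X, Y."""
    proj_X = np.sort(X @ thetas.T, axis=0)                   # (n, L) sorted projections
    proj_Y = np.sort(Y @ thetas.T, axis=0)
    dists = np.mean(np.abs(proj_X - proj_Y) ** p, axis=0)    # W_p^p along each direction
    rpsw = np.mean(dists) ** (1.0 / p)
    w = np.exp(dists - dists.max())                          # f = exp, shifted for stability
    iwrpsw = np.sum(dists * w / w.sum()) ** (1.0 / p)
    return rpsw, iwrpsw

# Stand-in usage; in practice thetas would come from the RPSD sampler.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
Y = rng.standard_normal((300, 5)) + 0.5
thetas = rng.standard_normal((50, 5))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
print(rpsw_iwrpsw_from_directions(X, Y, thetas))
```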
We refer the reader to Algorithm 1 and Algorithm 2 in Appendix C for the detailed computation of RPSW and IWRPSW. We then discuss the approximation error of the estimator.

Proposition 3. For any $p \ge 1$, $0 < \kappa < \infty$, dimension $d \ge 1$, and $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, we have:
$$\mathbb{E}\Big|\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa, L) - \text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)\Big| \le \frac{1}{\sqrt{L}}\, \mathrm{Var}_{\theta \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)}\big[W_p^p(\theta \sharp \mu, \theta \sharp \nu)\big]^{\frac{1}{2}}.$$

The proof of Proposition 3 is given in Appendix A.4. From the proposition, we observe that RPSW has the same error rate as SW in terms of $L$, i.e., $L^{-1/2}$. Similarly, for IWRPSW, let $H \ge 1$; we sample $\theta_{11}, \ldots, \theta_{HL} \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$, then form the Monte Carlo estimate of IWRPSW as follows:
$$\widehat{\text{IWRPSW}}_p^p(\mu, \nu; \sigma_\kappa, L, H) = \frac{1}{H} \sum_{h=1}^{H} \sum_{l=1}^{L} W_p^p(\theta_{hl} \sharp \mu, \theta_{hl} \sharp \nu)\, \frac{f\big(W_p^p(\theta_{hl} \sharp \mu, \theta_{hl} \sharp \nu)\big)}{\sum_{j=1}^{L} f\big(W_p^p(\theta_{hj} \sharp \mu, \theta_{hj} \sharp \nu)\big)}.$$
Since IWRPSW has $L$ random projecting directions in its population form, we approximate the expectation by $H$ sets of Monte Carlo samples. To simplify and make $L$ the only parameter for the number of projections, we set $H = 1$ in this paper. In contrast to RPSW, the error rate of IWRPSW with respect to $L$ is non-trivial due to the importance weights.

Unbiasedness. Since $\text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)$ and $\text{IWRPSW}_p^p(\mu, \nu; \sigma_\kappa, L, f)$ can be approximated directly by Monte Carlo samples, the corresponding estimators $\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa, L)$ and $\widehat{\text{IWRPSW}}_p^p(\mu, \nu; \sigma_\kappa, L, H)$ are unbiased.

Computational complexities. When $\mu$ and $\nu$ are discrete measures that have at most $n$ supports, sampling from them only costs $\mathcal{O}(n)$ in time and space. Hence, sampling $L$ random paths between $\mu$ and $\nu$ costs $\mathcal{O}(Ldn)$ in time and memory. After that, sampling from the vMF distribution and the PS distribution costs $\mathcal{O}(Ld)$ (Algorithm 1 in (De Cao & Aziz, 2020)) in time and memory. Adding the complexities of computing one-dimensional Wasserstein distances, the time complexity and space complexity of RPSW are $\mathcal{O}(Ln \log n + Ldn)$ and $\mathcal{O}(Ld + Ln)$, respectively. For IWRPSW, the complexities are multiplied by $H$; however, they stay the same if we set $H = 1$.

Gradient estimation. In applications where RPSW and IWRPSW are used as a risk to estimate some parameters of interest, i.e., $\mu_\phi$ with $\phi \in \Phi$, we might want to estimate the gradient of RPSW and IWRPSW with respect to $\phi$. For RPSW, we have:
$$\nabla_\phi \text{RPSW}_p^p(\mu_\phi, \nu; \sigma_\kappa) = \nabla_\phi\, \mathbb{E}_{X \sim \mu_\phi, Y \sim \nu}\, \mathbb{E}_{\theta \sim \sigma_\kappa(\theta; \mathcal{P}_{\mathbb{S}^{d-1}}(X - Y))}\big[W_p^p(\theta \sharp \mu_\phi, \theta \sharp \nu)\big].$$
Since vMF and PS are reparameterizable (De Cao & Aziz, 2020), we can reparameterize $\sigma_{RP}(\theta; \mu_\phi, \nu, \sigma_\kappa)$ if $\mu_\phi$ is reparameterizable (e.g., $\mu_\phi := f_\phi \sharp \varepsilon$ for a fixed distribution $\varepsilon$, or the other distributions discussed in (Kingma & Welling, 2013)). With reparameterized sampling, we can sample $\theta_{1,\phi}, \ldots, \theta_{L,\phi} \sim \sigma_{RP}(\theta; \mu_\phi, \nu, \sigma_\kappa)$, then form an unbiased gradient estimator as follows:
$$\nabla_\phi \text{RPSW}_p^p(\mu_\phi, \nu; \sigma_\kappa) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi W_p^p(\theta_{l,\phi} \sharp \mu_\phi, \theta_{l,\phi} \sharp \nu).$$
For IWRPSW, we sample $\theta_{11,\phi}, \ldots, \theta_{HL,\phi} \sim \sigma_{RP}(\theta; \mu_\phi, \nu, \sigma_\kappa)$, then form the following estimator:
$$\nabla_\phi \text{IWRPSW}_p^p(\mu_\phi, \nu; \sigma_\kappa, L, H) \approx \frac{1}{H} \sum_{h=1}^{H} \nabla_\phi \sum_{l=1}^{L} W_p^p(\theta_{hl,\phi} \sharp \mu_\phi, \theta_{hl,\phi} \sharp \nu)\, \frac{f\big(W_p^p(\theta_{hl,\phi} \sharp \mu_\phi, \theta_{hl,\phi} \sharp \nu)\big)}{\sum_{j=1}^{L} f\big(W_p^p(\theta_{hj,\phi} \sharp \mu_\phi, \theta_{hj,\phi} \sharp \nu)\big)}.$$

Remark 5. We found in later experiments that removing the dependence on $\phi$ in the slicing distribution, i.e., using a copy $\sigma_{RP}(\theta; \mu_{\phi'}, \nu, \sigma_\kappa)$ with $\phi'$ equal to $\phi$ in value (so that no gradient is propagated through the slicing distribution), leads to a more stable estimator in practice. It is worth noting that using a copy of both $\mu$ and $\nu$ (i.e., $\mu'$ and $\nu'$) in RPSW can even lead to a variant that satisfies the triangle inequality, since the slicing distribution becomes a fixed distribution in this situation. We refer to this gradient estimator as the simplified gradient estimator.
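As a rough PyTorch sketch of the simplified gradient estimator in Remark 5: the random paths and projecting directions are built from detached supports, so gradients only flow through the projected Wasserstein terms. For brevity, the Power Spherical perturbation is replaced here by isotropic Gaussian noise of scale $\kappa^{-1/2}$ followed by re-normalization; this substitution, and the equal-sized uniformly weighted point clouds, are assumptions of the sketch rather than the paper's exact procedure.

```python
import torch

def rpsw_loss(X_phi, Y, kappa=50.0, L=64, p=2):
    """Simplified RPSW_p^p gradient estimator: the slicing distribution is built
    under torch.no_grad(), so only the projected distances carry gradients.
    X_phi and Y are (n, d) point clouds with the same n (uniform weights)."""
    n, d = X_phi.shape
    with torch.no_grad():                                    # detached slicing distribution
        idx_x = torch.randint(n, (L,))
        idx_y = torch.randint(Y.shape[0], (L,))
        paths = X_phi[idx_x] - Y[idx_y]                      # random paths X - Y
        loc = paths / (paths.norm(dim=1, keepdim=True) + 1e-12)
        theta = loc + torch.randn(L, d) / kappa ** 0.5       # Gaussian stand-in for PS noise
        theta = theta / theta.norm(dim=1, keepdim=True)
    proj_X = X_phi @ theta.T                                 # gradients flow through here
    proj_Y = Y @ theta.T
    dists = (proj_X.sort(dim=0).values - proj_Y.sort(dim=0).values).abs().pow(p).mean(dim=0)
    return dists.mean()                                      # Monte Carlo RPSW_p^p, Eq. (8)

# One gradient-flow style step on a learnable point cloud (illustrative values).
Y = torch.randn(256, 2) + 2.0
X = torch.randn(256, 2, requires_grad=True)
loss = rpsw_loss(X, Y)
loss.backward()                                              # gradient w.r.t. the particles
```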
4. Experiments

In this section, we compare the proposed RPSW and IWRPSW to existing SW variants, namely SW, Max-SW, DSW, and EBSW, in gradient flow (Section 4.1) and in training denoising diffusion models (Section 4.3).

4.1. Toy Gradient Flows

The gradient flow models a dynamic distribution $\mu(t)$ flowing with time $t$ along the gradient flow of a loss functional $\mu(t) \mapsto f(\mu(t))$. In this experiment, we set $f(\mu(t)) = D(\mu(t), \nu)$ with $\nu$ the target distribution and $D$ a given SW variant between probability measures. The flow drives the source distribution towards $\nu$ (Feydy et al., 2019; Santambrogio, 2015). In this setup, we consider the discrete setting, i.e., $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i}$ and $\mu(t) = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i(t)}$.

[Figure 1: Results for gradient flows from the empirical distribution over the color points to the empirical distribution over the S-shape points produced by different SW variants (SW, Max-SW with T=10, DSW with L=5 and T=2, EBSW, RPSW, and IWRPSW with L=10). The Wasserstein-2 distance between the empirical distribution at the current step and the S-shape distribution, and the computational time (in seconds) to reach that step, are reported at the top of each panel.]

Setting. To solve the flow, we integrate the ordinary differential equation
$$\dot{X}(t) = -n \nabla_{X(t)} D\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{X_i(t)}, \nu\Big)$$
with the Euler scheme with 300 timesteps and step size $10^{-4}$. We set the total number of projections for SW variants to 10. For DSW, RPSW, and IWRPSW, we set the concentration parameter $\kappa$ of the PS distribution as a dynamic quantity, i.e., $\kappa(t) = (\kappa_0 - 1)\big(\frac{N - t - 1}{N - 1}\big)^{10} + 1$ with $N = 300$ and $\kappa_0 \in \{100, 50\}$. The reason for the dynamic choice of $\kappa$ is to stabilize the convergence of the flow, since when the two distributions are relatively close, the uniform distribution is nearly optimal.

Results. We show both the qualitative visualization and the quantitative comparison (in Wasserstein-2 distance (Flamary et al., 2021)) in Figure 1. From the figure, we observe that RPSW leads to faster and better convergence than SW, and IWRPSW leads to faster and better convergence than EBSW. We also observe that optimization-based variants such as Max-SW and DSW are slower than sampling-based variants like SW, EBSW, RPSW, and IWRPSW. Moreover, the convergence of Max-SW and DSW is not as good as that of EBSW, RPSW, and IWRPSW. Overall, the experiment shows the benefit of the random-path slicing distribution, which improves performance while having a fast computation. In this experiment, we use the simplified gradient estimator for RPSW and IWRPSW. We refer the reader to Figure 4 in Appendix D for the results of the original gradient estimators. Although such estimators can lead to faster convergence, they seem to be worse at preserving the original topology of the source distribution.
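A minimal sketch of the Euler scheme used in this section is given below, with plain SW$_2^2$ as a stand-in for the loss $D$ and a Gaussian blob as a stand-in for the S-shaped target; the particle count and the omission of the dynamic $\kappa$ schedule are illustrative assumptions of the sketch.

```python
import torch

def sw2_loss(X, Y, L=10):
    """Plain SW_2^2 with uniform directions; stands in for any SW variant D."""
    d = X.shape[1]
    theta = torch.randn(L, d)
    theta = theta / theta.norm(dim=1, keepdim=True)
    px = (X @ theta.T).sort(dim=0).values
    py = (Y @ theta.T).sort(dim=0).values
    return (px - py).pow(2).mean()

# Euler scheme for dX/dt = -n * grad_X D(mu(t), nu), as in the setting above
# (assumed stand-in values: n = 200 particles, 300 steps, step size 1e-4).
n, steps, eta = 200, 300, 1e-4
Y = torch.randn(n, 2) * 0.2 + torch.tensor([1.0, 1.0])   # stand-in target
X = torch.randn(n, 2, requires_grad=True)                # source particles
for t in range(steps):
    loss = sw2_loss(X, Y)
    grad, = torch.autograd.grad(loss, X)
    with torch.no_grad():
        X -= eta * n * grad                              # Euler update
print(float(sw2_loss(X, Y)))
```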
4.2. Gradient Flows on Images

Setting. Utilizing the MNIST dataset (LeCun et al., 1998), we select images of digit 1 to construct the source distribution and images of digit 0 to construct the target distribution. Following the previous section, we use the Euler discretization scheme with step size 1 and $N = 5000$ iterations. We compare the SW variants SW, EBSW, RPSW, and IWRPSW with the number of projections $L = 1000$, Max-SW with $T = 1000$, and DSW with $L = 100$, $T = 10$. For DSW, RPSW, and IWRPSW, we set $\kappa(t) = (\kappa_0 - 1)\big(\frac{N - t - 0.001}{N - 1}\big)^{10} + 0.001$ with $N = 5000$ and $\kappa_0 = 1000$. For evaluation, we use the Wasserstein-2 distance.

Results. We report the Wasserstein-2 distance across timesteps and the computational time in Table 1. Moreover, we visualize the flows in Figure 2 and in Figure 5 in Appendix D. We see that RPSW and IWRPSW reduce the Wasserstein-2 distance very fast compared to the other optimization-free variants, i.e., SW and EBSW. Moreover, RPSW and IWRPSW require significantly less computation than DSW and Max-SW. It is worth noting that the quality of the gradient flow could be further improved using a convolution slicing operator as in (Nguyen & Ho, 2022; Du et al., 2023).

Table 1. Wasserstein-2 distance and computational times across timesteps in gradient flow on MNIST. Each cell shows W2 (↓) / time in seconds (↓).

| Distance | Step 100 | Step 500 | Step 1000 | Step 3000 | Step 5000 |
|---|---|---|---|---|---|
| SW | 83.86 / 7.11 | 67.74 / 16.76 | 49.34 / 28.09 | 29.14 / 67.65 | 25.17 / 107.83 |
| Max-SW | 34.74 / 128.88 | 33.79 / 753.47 | 33.42 / 1435.14 | 33.00 / 38145.11 | 32.37 / 6436.89 |
| DSW | 50.26 / 136.14 | 48.22 / 462.57 | 40.71 / 764.98 | 29.96 / 202.79 | 27.49 / 291.20 |
| EBSW | 80.29 / 7.25 | 67.13 / 16.99 | 48.69 / 28.51 | 29.03 / 68.53 | 25.11 / 108.65 |
| RPSW | 46.54 / 7.58 | 28.97 / 18.13 | 27.57 / 30.61 | 23.83 / 74.71 | 21.86 / 118.83 |
| IWRPSW | 44.28 / 7.59 | 28.75 / 18.20 | 27.49 / 30.66 | 23.82 / 77.71 | 21.81 / 120.05 |

[Figure 2: Gradient flows from MNIST digit 1 to MNIST digit 0 (SW and IWRPSW shown at steps 0, 100, 1000, and 5000).]

4.3. Denoising Diffusion Models

The denoising diffusion model (Ho et al., 2020; Sohl-Dickstein et al., 2015) defines a forward process that gradually adds noise to the data $x_0 \sim q(x_0)$. The process has $T > 0$ steps and the noise is Gaussian:
$$q(x_{1:T} \mid x_0) = \prod_{t \ge 1} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\big),$$
with a pre-defined variance schedule $\beta_t$. Training a denoising diffusion model amounts to estimating the parameters $\phi$ of a reverse denoising process defined by:
$$p_\phi(x_{0:T}) = p(x_T) \prod_{t \ge 1} p_\phi(x_{t-1} \mid x_t), \quad p_\phi(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\phi(x_t, t), \sigma_t^2 I\big).$$
The conventional training uses the maximum likelihood approach by maximizing the evidence lower bound of $p_\phi(x_0) = \int p_\phi(x_{0:T})\, dx_{1:T}$, which can be rewritten as minimizing
$$\sum_{t \ge 1} \mathbb{E}_{q(x_t)}\big[\mathrm{KL}\big(q(x_{t-1} \mid x_t)\, \|\, p_\phi(x_{t-1} \mid x_t)\big)\big] + C,$$
where KL denotes the Kullback-Leibler divergence and $C$ is a known constant.

Implicit denoising model. To reduce the number of steps $T$ for faster generation, denoising diffusion GANs (Xiao et al., 2021) propose to use the implicit denoising model $p_\phi(x_{t-1} \mid x_t) = \int p(\epsilon)\, q\big(x_{t-1} \mid x_t, x_0 = G_\phi(x_t, \epsilon, t)\big)\, d\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. After that, Xiao et al. (2021) use adversarial training to estimate the parameters, i.e., minimizing
$$\sum_{t \ge 1} \mathbb{E}_{q(x_t)}\big[D_{adv}\big(q(x_{t-1} \mid x_t)\, \|\, p_\phi(x_{t-1} \mid x_t)\big)\big],$$
where $D_{adv}$ is the GAN objective or the Jensen-Shannon divergence (Goodfellow et al., 2014).

Augmented Generalized Mini-batch Energy distances. In this paper, we replace $D_{adv}$ by the generalized mini-batch energy distance (Salimans et al., 2018). In particular, for two measures $\mu$ and $\nu$ and mini-batch size $m \ge 1$, the generalized mini-batch energy distance with an SW kernel is:
$$\mathrm{GME}^2_D(\mu, \nu) = 2\,\mathbb{E}[D(P_X, P_Y)] - \mathbb{E}[D(P_X, P_{X'})] - \mathbb{E}[D(P_Y, P_{Y'})],$$
where $X, X' \overset{i.i.d.}{\sim} \mu^{\otimes m}$, $Y, Y' \overset{i.i.d.}{\sim} \nu^{\otimes m}$, $P_X = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}$ with $X = (x_1, \ldots, x_m)$, and $D$ is an SW variant.
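A one-sample sketch of $\mathrm{GME}^2_D$ with a plain SW kernel standing in for the SW variants used in the paper is shown below; the mini-batches are drawn once per expectation and all measures are equal-sized uniformly weighted point sets, which are assumptions of the sketch. For AGME, each point $x$ would first be mapped to $(x, g_\gamma(x))$ before computing the same quantity.

```python
import numpy as np

def sw2(X, Y, L=50, rng=None):
    """SW_2 between two equally sized, uniformly weighted point sets (the kernel D)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.standard_normal((L, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    px, py = np.sort(X @ theta.T, axis=0), np.sort(Y @ theta.T, axis=0)
    return np.mean(np.abs(px - py) ** 2) ** 0.5

def gme2(batch_mu, batch_mu_prime, batch_nu, batch_nu_prime, D=sw2):
    """One-sample estimate of GME^2_D(mu, nu) from two independent mini-batches of
    each measure: 2 E[D(P_X, P_Y)] - E[D(P_X, P_X')] - E[D(P_Y, P_Y')]."""
    return (2.0 * D(batch_mu, batch_nu)
            - D(batch_mu, batch_mu_prime)
            - D(batch_nu, batch_nu_prime))

# Minimal usage with m = 64 samples per mini-batch in dimension 8 (illustrative).
rng = np.random.default_rng(0)
mu = lambda m: rng.standard_normal((m, 8))
nu = lambda m: rng.standard_normal((m, 8)) + 0.3
print(gme2(mu(64), mu(64), nu(64), nu(64)))
```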
In the denoising diffusion case, $\mu$ is $q(x_{t-1} \mid x_t)$ and $\nu$ is $p_\phi(x_{t-1} \mid x_t)$ for $t = 1, \ldots, T$. Although the generalized mini-batch energy distance can guarantee statistical convergence, it is known to perform poorly in practical training due to its weak training signal (a lack of non-linearity, which is essential for images) (Salimans et al., 2018). To address the issue, we propose the augmented generalized mini-batch energy distance:
$$\mathrm{AGME}^2_D(\mu, \nu; g) = \mathrm{GME}^2_D(\bar{\mu}, \bar{\nu}),$$
where $\bar{\mu} = f \sharp \mu$ and $\bar{\nu} = f \sharp \nu$ with $f(x) = (x, g(x))$ for a non-linear function $g : \mathbb{R}^d \to \mathbb{R}$. For any $g$, $\mathrm{AGME}^2_D(\mu, \nu; g)$ is still a valid distance since $f$ is injective. In the diffusion model setting, we train $g_\gamma$ (parameterized by $\gamma$) as a time-conditional discriminator as in (Xiao et al., 2021, Equation (5)). We refer the reader to Algorithm 3 in Appendix C for the detailed algorithm, and to Appendix D for more detailed experimental settings.

Results. We follow the setting in (Xiao et al., 2021) for diffusion models on CIFAR-10 (Krizhevsky et al., 2009) with $N = 1800$ epochs. We set $L = 10^4$ for SW, DSW, EBSW, RPSW, and IWRPSW. We set $T \in \{2, 5, 10\}$ for DSW and $T \in \{100, 1000\}$ for Max-SW. It is worth noting that the training time of Max-SW and DSW is more than two times that of SW, EBSW, RPSW, and IWRPSW; hence, we cannot increase the value of $T$ further. We adjust the concentration parameter $\kappa$ of the PS distribution for each epoch $t$ as $\kappa(t) = (\kappa_0 - 1)\big(\frac{N - t - 1}{N - 1}\big)^{10} + 1$ with $N = 1800$ and $\kappa_0 \in \{100, 50\}$. We report the FID scores (Heusel et al., 2017), the number of function evaluations (NFE), and the generation time in Table 2.

Table 2. Results for unconditional generation on CIFAR-10.

| Model | FID (↓) | NFE (↓) | G-Time (s, ↓) |
|---|---|---|---|
| DDPM (Ho et al., 2020) | 3.17 | 1000 | 80.5 |
| Score SDE (VE) (Song et al., 2020b) | 2.20 | 2000 | 423.2 |
| Score SDE (VP) (Song et al., 2020b) | 2.41 | 2000 | 421.5 |
| DDIM (Song et al., 2020a) | 4.67 | 50 | 4.01 |
| Prob-Flow (VP) (Song et al., 2020b) | 3.08 | 140 | 50.9 |
| LSGM (Vahdat et al., 2021) | 2.10 | 147 | 44.5 |
| DDGAN (Xiao et al., 2021) | 3.64 | 4 | 0.21 |
| SW-DD (ours) | 2.90 | 4 | 0.21 |
| Max-SW-DD (ours) | 2.99 | 4 | 0.21 |
| DSW-DD (ours) | 2.88 | 4 | 0.21 |
| EBSW-DD (ours) | 2.87 | 4 | 0.21 |
| RPSW-DD (ours) | 2.82 | 4 | 0.21 |
| IWRPSW-DD (ours) | 2.70 | 4 | 0.21 |

Using an implicit denoising model with only 4 time steps, our SW diffusion models have a much faster generation time compared to existing diffusion models such as DDPM, Score SDE, DDIM, Probability Flow, and LSGM. Compared to DDGAN with the same 4 time steps, our SW diffusion models lead to a lower FID score with the same generation time. Among the SW variants, the proposed RPSW and IWRPSW achieve the lowest scores of 2.82 and 2.70, respectively. We show some randomly generated images from DDGAN, RPSW-DD, and IWRPSW-DD in Figure 3 as qualitative results. Overall, the AGME distance with RPSW and IWRPSW kernels could be a potentially effective loss for improving the training of implicit diffusion models¹.

[Figure 3: Randomly generated images on CIFAR-10 from DDGAN, RPSW-DD, and IWRPSW-DD.]

5. Conclusion

We have presented a new construction of discriminative projecting directions for SW, i.e., the random-path projecting direction. From that, we derive the random-path slicing distribution and two optimization-free variants of SW, named random-path sliced Wasserstein (RPSW) and importance weighted random-path sliced Wasserstein (IWRPSW). We then discuss the topological properties, statistical properties, and computational properties of the proposed RPSW and IWRPSW.
Finally, we show the favorable performance of RPSW and IWRPSW in gradient flow and training denoising diffusion models. In the future, we will extend the random-path directions to probability measures that have supported on the hyper-sphere (Bonet et al., 2023a), and hyperbolic manifolds (Bonet et al., 2023b). In such cases, the random path will not be straight lines but curves. 1Code for this paper is published at https://github. com/khainb/RPSW. Sliced Wasserstein with Random-Path Projecting Directions Impact Statement The paper proposes a new variant of sliced Wasserstein to compare two probability measures. Given the numerous applications in machine learning, such as generative modeling, clustering, classification, domain adaptation, and more, the proposed random-path variants of sliced Wasserstein distance could enhance the performance of downstream applications in both quality and computation, as demonstrated in the paper. Acknowledgements We would like to thank Peter Pao-Huang for his insightful discussion during the course of this project. NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning. TL acknowledges the support of JSPS KAKENHI Grant number 23K11243, and Mitsui Knowledge Industry Co., Ltd. grant. Bai, Y., Schmitzer, B., Thorpe, M., and Kolouri, S. Sliced optimal partial transport. ar Xiv preprint ar Xiv:2212.08049, 2022. Bai, Y., Tran, H., Damelin, S. B., and Kolouri, S. Partial transport for point-cloud registration. ar Xiv preprint ar Xiv:2309.15787, 2023. Bonet, C., Courty, N., Septier, F., and Drumetz, L. Efficient gradient flows in sliced-Wasserstein space. Transactions on Machine Learning Research, 2022. Bonet, C., Berg, P., Courty, N., Septier, F., Drumetz, L., and Pham, M.-T. Spherical sliced-Wasserstein. International Conference on Learning Representations, 2023a. Bonet, C., Chapel, L., Drumetz, L., and Courty, N. Hyperbolic sliced-wasserstein via geodesic and horospherical projections. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 334 370. PMLR, 2023b. Bonet, C., Mal ezieux, B., Rakotomamonjy, A., Drumetz, L., Moreau, T., Kowalski, M., and Courty, N. Slicedwasserstein on symmetric positive definite matrices for m/eeg signals. In International Conference on Machine Learning, pp. 2777 2805. PMLR, 2023c. Bonneel, N. and Coeurjolly, D. Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4): 1 13, 2019. Bonneel, N., Rabin, J., Peyr e, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 1(51):22 45, 2015. Bonnotte, N. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Paris 11, 2013. Cao, Z., Ma, L., Long, M., and Wang, J. Partial adversarial domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135 150, 2018. De Cao, N. and Aziz, W. The power spherical distribution. ar Xiv preprint ar Xiv:2006.04437, 2020. Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3483 3491, 2018. Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. Devroye, L., Gy orfi, L., and Lugosi, G. 
A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013. Du, C., Li, T., Pang, T., Yan, S., and Lin, M. Nonparametric generative modeling with conditional sliced-wasserstein flows. In International Conference on Machine Learning, pp. 8565 8584. PMLR, 2023. Feydy, J., S ejourn e, T., Vialard, F.-X., Amari, S.-i., Trouve, A., and Peyr e, G. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681 2690, 2019. Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http: //jmlr.org/papers/v22/20-451.html. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672 2680, 2014. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626 6637, 2017. Sliced Wasserstein with Random-Path Projecting Directions Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Jupp, P. E. and Mardia, K. V. Maximum likelihood estimators for the matrix von Mises-Fisher and bingham distributions. The Annals of Statistics, 7(3):599 606, 1979. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Kolouri, S., Rohde, G. K., and Hoffmann, H. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427 3436, 2018. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Master s thesis, Department of Computer Science, University of Toronto, 2009. Le, T., Yamada, M., Fukumizu, K., and Cuturi, M. Treesliced variants of Wasserstein distances. Advances in neural information processing systems, 32, 2019. Le, T., Nguyen, T., Phung, D., and Nguyen, V. A. Sobolev transport: A scalable metric for probability measures with graph metrics. In International Conference on Artificial Intelligence and Statistics, pp. 9844 9868, 2022. Le, T., Nguyen, K., Sun, S., Ho, N., and Xie, X. Integrating efficient optimal transport and functional maps for unsupervised shape correspondence learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024a. Le, T. T., Nguyen, K., shanlin sun, Han, K., Ho, N., and Xie, X. Diffeomorphic mesh deformation via efficient optimal transport for cortical surface reconstruction. In The Twelfth International Conference on Learning Representations, 2024b. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10285 10295, 2019. Mc Cann, R. 
J. A convexity principle for interacting gases. Advances in mathematics, 128(1):153 179, 1997. Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems, pp. 250 260, 2019. Nadjahi, K., De Bortoli, V., Durmus, A., Badeau, R., and S ims ekli, U. Approximate Bayesian computation with the sliced-Wasserstein distance. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5470 5474. IEEE, 2020a. Nadjahi, K., Durmus, A., Chizat, L., Kolouri, S., Shahrampour, S., and Simsekli, U. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020b. Nguyen, K. and Ho, N. Revisiting sliced Wasserstein on images: From vectorization to convolution. Advances in Neural Information Processing Systems, 2022. Nguyen, K. and Ho, N. Energy-based sliced Wasserstein distance. Advances in Neural Information Processing Systems, 2023. Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021. Nguyen, K., Ren, T., and Ho, N. Markovian sliced Wasserstein distances: Beyond independent projections. Advances in Neural Information Processing Systems, 2023. Nguyen, K., Bariletto, N., and Ho, N. Quasi-monte carlo for 3d sliced wasserstein. International Conference on Learning Representations, 2024. Nietert, S., Sadhu, R., Goldfeld, Z., and Kato, K. Statistical, robustness, and computational guarantees for sliced Wasserstein distances. Advances in Neural Information Processing Systems, 2022. Paulin, L., Bonneel, N., Coeurjolly, D., Iehl, J.-C., Webanck, A., Desbrun, M., and Ostromoukhov, V. Sliced optimal transport sampling. ACM Transactions on Graphics (TOG), 39(4):99 1, 2020. Peyr e, G. and Cuturi, M. Computational optimal transport, 2020. Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018. Santambrogio, F. Optimal transport for applied mathematicians. Birk auser, NY, 55(58-63):94, 2015. Savkin, A., Wang, Y., Wirkert, S., Navab, N., and Tombari, F. Lidar upsampling with sliced Wasserstein distance. IEEE Robotics and Automation Letters, 8(1):392 399, 2022. Sliced Wasserstein with Random-Path Projecting Directions S ejourn e, T., Bonet, C., Fatras, K., Nadjahi, K., and Courty, N. Unbalanced optimal transport meets sliced Wasserstein. ar Xiv preprint ar Xiv:2306.07176, 2023. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256 2265. PMLR, 2015. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020b. Tanguy, E. Convergence of sgd for training neural networks with sliced Wasserstein losses. ar Xiv preprint ar Xiv:2307.11714, 2023. Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287 11302, 2021. Villani, C. 
Optimal transport: Old and New. Springer, 2008. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint. Cambridge University Press, 2019. Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. In International Conference on Learning Representations, 2021. Yi, M. and Liu, S. Sliced Wasserstein variational inference. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021. Sliced Wasserstein with Random-Path Projecting Directions Supplement to Sliced Wasserstein with Random-Path Projecting Projections First, we present skipped proofs in the main text in Appendix A. We then discuss some related works in B. After that, we present detailed algorithms for computing RPSW and IWRPSW with Monte Carlo estimation and training denoising diffusion models with augmented mini-batch energy distance in Appendix C. Then, we provide some additional experimental results on gradient estimators in Appendix D. Finally, we report the computational infrastructure in Appendix E. A.1. Proof of Theorem 1 Non-negativity. Since the Wasserstein distance is non-negative, we have Wp(θ µ, θ ν) 0 for all θ Sd 1. Therefore, Eθ σRP (θ;µ,ν,σκ)[Wp(θ µ, θ ν)] 0 which leads to RPSWp(µ, ν; σκ) 0. Similarly, we have Eθ1,...,θL σRP (θ;µ,ν,σκ) PL l=1 W p p (θl µ, θl ν) f(W p p (θl µ,θl ν)) PL j=1 f(W p p (θj µ,θj ν)) 0 which implies IWRPSWp(µ, ν; σκ, L) 0 Symmetry. From the definition of RPSW from Remark 3, and fσκ(PSd 1(x y))(θ) as the density function of σκ(PSd 1(x y)) we have: RPSWp p(µ, ν; σκ) = EX µ,Y νEθ σκ(θ;PSd 1(X Y ))[W p p (θ µ, θ ν)] Sd 1 W p p (θ µ, θ ν)fσκ(PSd 1(x y))(θ)dθdµ(x)dν(y) Sd 1 W p p ( θ µ, θ ν)fσκ(PSd 1(x y))( θ)dθdµ(x)dν(y) Sd 1 W p p (θ µ, θ ν)fσκ(PSd 1(y x))(θ)dθdµ(x)dν(y) Sd 1 W p p (θ ν, θ µ)fσκ(PSd 1(y x))(θ)dθdν(y)dµ(x) = EY ν,X µEθ σκ(θ;PSd 1(Y X))[W p p (θ ν, θ µ)] = RPSWp p(ν, µ; σκ), where we use the reflection property of Sd 1, the property: W p p (θ µ, θ ν) = inf π Π(µ,ν) Z |θ (x y)|pdπ(x, y) = inf π Π(µ,ν) Z | θ (x y)|pdπ(x, y) = W p p ( θ µ, θ ν), and fσκ(θ;PSd 1(x y))(θ) = fσκ(θ;PSd 1(y x))( θ) which holds for both the v MF density fσκ(θ;PSd 1(x y)(θ) exp κ (x y) θ and the PS density fσκ(θ;PSd 1(x y)(θ) 1 + (x y) x y 2 θ κ . Similarly, we have IWRPSWp p(µ, ν; σκ, L, f) = IWRPSWp p(ν, µ; σκ, L, f). Identity. We need to show that RPSWp(µ, ν; σk) = 0 if and only if µ = ν. For the forward direction, since Wp(θ µ, θ ν) = 0 when µ = ν, we obtain directly µ = ν implies RPSWp(µ, ν; σk) = 0. For the reverse direction, we use the same proof technique in (Bonnotte, 2013). If RPSWp(µ, ν; σk) = 0, we have R Sd 1 Wp (θ µ, θ ν) dσRP (θ; µ, ν, σk) = 0. Hence, we have Wp(θ µ, θ ν) = 0 for σRP (θ; µ, ν, σk)-almost surely θ Sd 1. Since σRP (θ; µ, ν, σk) is continuous due to the continuity of σk, we have Wp(θ µ, θ ν) = 0 for all θ Sd 1. Therefore, we obtain θ µ = θ ν for σµ,ν(θ; f, p)-a.e θ Sd 1 due to the identity of Wasserstein distance. For any t R and θ Sd 1, we have: F[µ](tθ) = Z Rd e it θ,x dµ(x) = Z R e itzdθ µ(z) = F[θ µ](t) = F[θ ν](t) = Z R e itzdθ ν(z) = Z Rd e it θ,x dν(x) = F[ν](tθ), Sliced Wasserstein with Random-Path Projecting Directions where F[γ](w) = R Rd e i w,x dγ(x) denotes the Fourier transform of γ P(Rd ). The above transformation is also known as the slice projection theorem. By the injectivity of the Fourier transform. For IWRPSW, when µ = ν, we have Wp(θl µ, θl ν) = 0 for any θ1, . . . , θL Sd 1. 
Therefore, we have PL l=1 W p p (θl µ, θl ν) f(W p p (θl µ,θl ν)) PL j=1 W p p (θj µ,θj ν) = 0 for any θ1, . . . , θL Sd 1 which implies IWRPSWp(µ, ν; σκ, L) = E PL l=1 W p p (θl µ, θl ν) f(W p p (θl µ,θl ν)) PL j=1 W p p (θj µ,θj ν) = 0. In the reverse direction, when IWRPSWp(µ, ν; σκ, L) = 0, it means that we have PL l=1 W p p (θl µ, θl ν) f(W p p (θl µ,θl ν)) PL j=1 W p p (θj µ,θj ν) = 0 for any θ1, . . . , θL Sd 1. Since f(W p p (θl µ, θl ν)) > 0 for any θj, it implies that W p p (θl µ, θl ν) = 0 for all θl Sd 1. With similar arguments to the proof of RPSW, we obtain µ = ν which completes the proof. Quasi-Triangle Inequality. Given three probability measures µ1, µ2, µ3 Pp(Rd) we have: RPSWp(µ1, µ2; σκ) = Eθ σRP (θ;µ1,µ2,σκ)[W p p (θ µ1, θ µ2)] 1 Eθ σRP (θ;µ1,µ2,σκ)[(Wp(θ µ1, θ µ3) + Wp(θ µ3, θ µ2))p] 1 Eθ σRP (θ;µ1,µ2,σκ)[W p p (θ µ1, θ µ3)] 1 p + Eθ σRP (θ;µ1,µ2,σκ)[W p p (θ µ3, θ µ2)] 1 = RPSWp(µ1, µ3; σκ, µ1, µ2) + RPSWp(µ3, µ2; σκ, µ1, µ2), where the first inequality is due to the triangle inequality of Wasserstein distance and the second inequality is due to the Minkowski inequality. We complete the proof here. A.2. Proof of Proposition 1 (i) To prove that RPSWp(µ, ν; σκ) IWRPSWp(µ, ν; σκ, L), we introduce the following lemma which had been proved in (Nguyen & Ho, 2023). Here, we provide the proof for completeness. Lemma 1. For any L 1, 0 a1 a2 . . . a L and 0 < b1 b2 . . . b L, we have: i=1 aibi. (10) Proof. We prove Lemma 1 via an induction argument. For L = 1, it is clear that aibi = aibi. Now, we assume that the inequality holds for L i.e., 1 L(PL i=1 ai)(PL i=1 bi) PL i=1 aibi or (PL i=1 ai)(PL i=1 bi) L PL i=1 aibi. Now, we want to show that the inequality holds for L + 1 i.e., (PL+1 i=1 ai)(PL i=1 bi) (L + 1) PL+1 i=1 aibi. First, we have: i=1 bi) = ( i=1 bi) + ( i=1 ai)b L+1 + ( i=1 bi)a L+1 + a L+1b L+1 i=1 aibi + ( i=1 ai)b L+1 + ( i=1 bi)a L+1 + a L+1b L+1. By rearrangement inequality, we have a L+1b L+1 + aibi a L+1bi + b L+1ai for all 1 i L. By taking the sum of these inequalities over i from 1 to L, we obtain: i=1 ai)b L+1 + ( i=1 bi)a L+1 i=1 aibi + La L+1b L+1. Therefore, we have Sliced Wasserstein with Random-Path Projecting Directions i=1 aibi + ( i=1 ai)b L+1 + ( i=1 bi)a L+1 + a L+1b L+1 i=1 aibi + La L+1b L+1 + a L+1b L+1 which completes the proof. From Lemma 1, with ai = W p p (θi µ, θj ν) and bi = f(W p p (θi µ, θj ν)), we have: i=1 W p p (θi µ, θi ν) i=1 W p p (θi µ, θi ν) f(W p p (θi µ, θi ν)) PL j=1 f(W p p (θj µ, θj ν)) . Taking the expectation with respect to θ1, . . . , θL i.i.d σRP (θ; µ, ν, σκ), we obtain RPSWp(µ, ν; σκ) IWRPSWp(µ, ν; σκ, L). Now to show that IWRPSWp(µ, ν; σκ, L) Max-SW(µ, ν), we have θ = argmaxθ Sd 1Wp(θ µ, θ ν) since Sd 1 is compact and the function θ Wp(θ µ, θ ν) is continuous. From the definition of the IWRPSW, for any L 1, σκ P(Sd 1) we have: IWRPSWp(µ, ν; σκ, L) = Eθ1,...,θL σRP (θ;µ,ν,σκ) l=1 W p p (θl µ, θl ν) f(W p p (θl µ, θl ν)) PL j=1 f(W p p (θj µ, θj ν)) Eθ σRP (θ;µ,ν,σk) W p p (θ µ, θ ν) 1 p = W p p (θ µ, θ ν) = Max-SWp(µ, ν). Furthermore, by applying the Cauchy-Schwartz inequality, we have: Max-SWp(µ, ν) = max θ Sd 1 inf π Π(µ,ν) θ x θ y p dπ(x, y) 1 inf π Π(µ,ν) Rd Rd θ p x y pdπ(x, y) 1 = inf π Π(µ,ν) Rd Rd θ p x y pdπ(x, y) 1 = inf π Π(µ,ν) Rd Rd x y pdπ(x, y) 1 inf π Π(µ,ν) Rd Rd |x y|pdπ(x, y) 1 = Wp(µ, ν), we complete the proof. (ii) We first recall the density of The von-Mises Fisher distribution and the Power spherical distribution. 
In particular, we have v MF(θ; ϵ, κ) := κd/2 1 (2π)d/2Id/2 1(κ) exp(κϵ θ) with Is denotes the modified Bessel function of the first kind, and the Sliced Wasserstein with Random-Path Projecting Directions PS distribution PS(θ; ϵ, κ) = 2d+κ 1π(d 1)/2 Γ((d 1)/2+κ) Γ(d+κ 1) 1 (1 + ϵ θ)κ. We have: lim κ 0 κd/2 1 (2π)d/2Id/2 1(κ) exp(κϵ θ) C1, 2d+κ 1π(d 1)/2 Γ((d 1)/2 + κ) 1 (1 + ϵ θ)κ C2 for some constant C1 and C2 which do not depend on θ, hence, the v MF distribution and the PS distribution converge to the uniform distribution when κ 0. Therefore, we have v MF(θ; ϵ, κ) U(Sd 1) and PS(θ; ϵ, κ) U(Sd 1). Therefore, we have σRP (θ; µ, ν, σκ) U(Sd 1). Now, we need to show that W p p (θ µ, θ ν) is bounded and continuous with respective to θ. For the boundedness, we have: Wp p(θ µ, θ ν) = inf π Π(ν,µ) Rd |θ x θ y|pdπ(x, y) inf π Π(ν,µ) Rd x y pdπ(x, y) = Wp p(µ, ν) < . For the continuity, let (θt)t 1 be a sequence on Sd 1 which converges to θ Sd 1 i.e., θt θ 0 as t , and a arbitrary measure µ Pp(Rd). Then we have: Wp(θ µ, θt µ) = inf π Π(µ,µ) Rd |θ t x θ y|pdπ(x, y) 1/p Rd |θ t x θ x|pdµ(x) 1/p Rd x pµ(dx) 1/p θt θ 0 as t , Rd x pµ(dx) 1/p < since µ Pp(Rd), and the second inequality is due to the Cauchy-Schwartz inequality. Using the triangle inequality, we have: |Wp(θt µ, θt ν) Wp(θ µ, θ ν)| |Wp(θt µ, θt ν) Wp(θ µ, θt ν)| + |Wp(θ µ, θt ν) Wp(θ µ, θ ν)| Wp(θ µ, θt µ) + Wp(θ ν, θt ν) 0 as t , hence, Wp(θt µ, θt ν) Wp(θ µ, θ ν) as t which complete the proof of continuity. From the boundedness, continuity, and σRP (θ; µ, ν, σκ) U(Sd 1), we have RPSWp p(µ, ν; σκ) = Eθ σRP (θ;µ,ν,σκ)[Wp(θ µ, θ ν)] Eθ U(Sd 1)[Wp(θ µ, θ ν)] = SW p p (µ, ν). Applying the continuous mapping theorem for x x1/p, we obtain limκ 0 RPSWp(µ, ν; σκ) SWp(µ, ν). (iii) Since we have proved that W p p (θ µ, θ ν) is bounded and continuous with respective to θ, we can show that PL i=1 W p p (θi µ, θi ν) f(W p p (θi µ,θi ν)) PL j=1 f(W p p (θj µ,θj ν)) are bounded and continuous with respect to θ1, . . . , θL. As L , we have PL i=1 W p p (θi µ, θi ν) f(W p p (θi µ,θi ν)) PL j=1 f(W p p (θj µ,θj ν)) Eγ σµ,ν,f (γ)[γ µ, γ ν] = EBSWp p(µ, ν; f). Applying the continuous mapping theorem for x x1/p, we obtain lim L IWRPSWp(µ, ν; σκ, L, f) EBSWp(µ, ν; f). (iv) For any µ and ν, by the Holder inequality, we have: RPSWp(µ, ν; σκ) = Eθ σRP (θ;µ,ν,σκ)[W p p (θ µ, θ ν)] 1 Eθ σRP (θ;µ,ν,σκ)[Wp(θ µ, θ ν)] Eθ σRP (θ;µ,ν,σκ)[W1(θ µ, θ ν)] = RPSW1(µ, ν; σκ), which completes the proof. Sliced Wasserstein with Random-Path Projecting Directions A.3. Proof of Proposition 2 The proof for this result follows from the proof of Proposition 3 in (Nguyen & Ho, 2023). We assume that µ has a compact set of support X Rd. From Proposition 1, we have: E[RPSWp(µn, µ; σκ)] E[IWRPSWp(µn, µ; σκ, L)] E [Max-SWp(µn, µ)] , for any σκ P(Sd 1) and L 1. Therefore, the proposition follows as long as we can demonstrate that E[Max-SWp(µn, µ)] C p (d + 1) log n/n where µn = 1 n Pn i=1 δXi with X1, . . . , Xn i.i.d µ, and C > 0 is some universal constant and the outer expectation is taken with respect to X1, . . . , Xn. Using the closed-form of one-dimensional Wasserstein distance, we have: Max-SWp(µn, µ) = max θ Sd 1 0 |F 1 n,θ(z) F 1 θ (z)|pdz, where Fn,θ and Fθ as the cumulative distributions of θ µn and θ µ. Since Wp((tθ) µ, (tθ) ν) = t Wp(θ µ, θ ν) for t > 0. 
A.3. Proof of Proposition 2

The proof of this result follows the proof of Proposition 3 in (Nguyen & Ho, 2023). We assume that $\mu$ has a compact support $\mathcal{X} \subset \mathbb{R}^d$. From Proposition 1, we have:
$$\mathbb{E}[\text{RPSW}_p(\mu_n, \mu; \sigma_\kappa)] \leq \mathbb{E}[\text{IWRPSW}_p(\mu_n, \mu; \sigma_\kappa, L)] \leq \mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)],$$
for any $\sigma_\kappa \in \mathcal{P}(\mathbb{S}^{d-1})$ and $L \geq 1$. Therefore, the proposition follows as long as we can demonstrate that $\mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)] \leq C \sqrt{(d+1)\log n / n}$, where $\mu_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ with $X_1, \ldots, X_n \overset{i.i.d.}{\sim} \mu$, $C > 0$ is a universal constant, and the outer expectation is taken with respect to $X_1, \ldots, X_n$. Using the closed form of the one-dimensional Wasserstein distance, we have:
$$\text{Max-SW}_p^p(\mu_n, \mu) = \max_{\theta \in \mathbb{S}^{d-1}} \int_0^1 |F_{n,\theta}^{-1}(z) - F_\theta^{-1}(z)|^p dz,$$
where $F_{n,\theta}$ and $F_\theta$ denote the cumulative distribution functions of $\theta \sharp \mu_n$ and $\theta \sharp \mu$, respectively. Since $\text{W}_p((t\theta) \sharp \mu, (t\theta) \sharp \nu) = t\,\text{W}_p(\theta \sharp \mu, \theta \sharp \nu)$ for any $t > 0$, we can rewrite the Max-SW as:
$$\text{Max-SW}_p^p(\mu_n, \mu) = \max_{\theta \in \mathbb{R}^d: \|\theta\| = 1} \int_0^1 |F_{n,\theta}^{-1}(z) - F_\theta^{-1}(z)|^p dz \leq \text{diam}(\mathcal{X})^p \max_{x \in \mathbb{R},\, \theta \in \mathbb{R}^d: \|\theta\| \leq 1} |F_{n,\theta}(x) - F_\theta(x)| = \text{diam}(\mathcal{X})^p \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|,$$
where $\mathcal{A}$ is the set of half-spaces $\{z \in \mathbb{R}^d : \theta^\top z \leq x\}$ for all $\theta \in \mathbb{R}^d$ such that $\|\theta\| \leq 1$. From the VC inequality (Theorem 12.5 in (Devroye et al., 2013)), we have
$$\mathbb{P}\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > t\right) \leq 8 S(\mathcal{A}, n) e^{-nt^2/32},$$
where $S(\mathcal{A}, n)$ is the growth function. From the Sauer Lemma (Proposition 4.18 in (Wainwright, 2019)), the growth function is upper bounded by $(n+1)^{VC(\mathcal{A})}$; moreover, $VC(\mathcal{A}) = d + 1$ from Example 4.21 in (Wainwright, 2019). Setting $8 S(\mathcal{A}, n) e^{-nt^2/32} \leq \delta$, we have $t^2 \geq \frac{32}{n} \log\frac{8 S(\mathcal{A}, n)}{\delta}$. Therefore, with probability at least $1 - \delta$, we obtain
$$\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \leq \sqrt{\frac{32}{n} \log\frac{8 S(\mathcal{A}, n)}{\delta}}.$$
Using the Jensen inequality and the tail-sum formula for the expectation of a non-negative random variable, for any $u > 0$ we have:
\begin{align*}
\mathbb{E}\left[\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|\right] &\leq \left(\mathbb{E}\left[\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|^2\right]\right)^{\frac{1}{2}} = \left(\int_0^\infty \mathbb{P}\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|^2 > t\right) dt\right)^{\frac{1}{2}} \\
&\leq \left(u + \int_u^\infty \mathbb{P}\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|^2 > t\right) dt\right)^{\frac{1}{2}} \leq \left(u + \int_u^\infty 8 S(\mathcal{A}, n) e^{-nt/32} dt\right)^{\frac{1}{2}} = \left(u + \frac{256\, S(\mathcal{A}, n) e^{-nu/32}}{n}\right)^{\frac{1}{2}}.
\end{align*}
Since the inequality holds for any $u > 0$, we search for the $u$ that makes it tightest. Let $g(u) = u + \frac{256\, S(\mathcal{A}, n) e^{-nu/32}}{n}$; then $g'(u) = 1 - 8 S(\mathcal{A}, n) e^{-nu/32}$. Setting $g'(u) = 0$, we obtain the minimizer $u^\star = \frac{32 \log(8 S(\mathcal{A}, n))}{n}$. Plugging $u^\star$ into the bound, we obtain:
$$\mathbb{E}\left[\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)|\right] \leq \sqrt{\frac{32 \log(8 S(\mathcal{A}, n)) + 32}{n}} \leq C_0 \sqrt{\frac{(d+1)\log(n+1)}{n}}$$
for some universal constant $C_0 > 0$, by using the Sauer Lemma, i.e., $S(\mathcal{A}, n) \leq (n+1)^{VC(\mathcal{A})} \leq (n+1)^{d+1}$. Putting the above results together leads to $\mathbb{E}[\text{Max-SW}_p(\mu_n, \mu)] \leq C \sqrt{(d+1)\log n / n}$, where $C > 0$ is a universal constant. As a consequence, we obtain the conclusion of the proposition.

A.4. Proof of Proposition 3

For any $p \geq 1$, $d \geq 1$, $\sigma_\kappa \in \mathcal{P}(\mathbb{S}^{d-1})$, and $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, using Hölder's inequality, we have:
\begin{align*}
\mathbb{E}\left|\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa) - \text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)\right| &\leq \left(\mathbb{E}\left|\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa) - \text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)\right|^2\right)^{\frac{1}{2}} \\
&= \left(\mathbb{E}\left[\left(\frac{1}{L}\sum_{l=1}^{L} \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu) - \mathbb{E}_{\theta \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)}\left[\text{W}_p^p(\theta \sharp \mu, \theta \sharp \nu)\right]\right)^2\right]\right)^{\frac{1}{2}}.
\end{align*}
Since $\mathbb{E}\left[\frac{1}{L}\sum_{l=1}^{L} \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\right] = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}\left[\text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\right] = \text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)$, we have:
$$\mathbb{E}\left|\widehat{\text{RPSW}}_p^p(\mu, \nu; \sigma_\kappa) - \text{RPSW}_p^p(\mu, \nu; \sigma_\kappa)\right| \leq \left(\text{Var}\left[\frac{1}{L}\sum_{l=1}^{L} \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)\right]\right)^{\frac{1}{2}} = \frac{1}{\sqrt{L}}\left(\text{Var}_{\theta \sim \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)}\left[\text{W}_p^p(\theta \sharp \mu, \theta \sharp \nu)\right]\right)^{\frac{1}{2}},$$
since $\theta_1, \ldots, \theta_L \overset{i.i.d.}{\sim} \sigma_{RP}(\theta; \mu, \nu, \sigma_\kappa)$, which completes the proof.

B. Related Works

Dynamic Optimal Transport. The Wasserstein distance can also be interpreted as the length of the shortest curve between two input measures (Villani, 2008), which yields an optimal interpolation path between them (McCann, 1997). In this work, we work with the static sliced Wasserstein distance, where the random path is only used to construct discriminative projecting directions.

Sliced Wasserstein on Manifolds. On manifolds, where a straight line is generalized into a curve, the definition of the random path should be adapted accordingly. For example, on the sphere, the shortest path between two points is an arc of a great circle. Therefore, it is natural to define the random path as a random great circle whose two endpoints are drawn from a coupling between the two measures. This definition can lead to a natural extension of spherical sliced Wasserstein (Bonet et al., 2023a). Moreover, we can create similar definitions on hyperbolic manifolds for hyperbolic sliced Wasserstein (Bonet et al., 2023b) and on the manifold of positive definite matrices (Bonet et al., 2023c).

Dependent Projecting Directions. The Markovian sliced Wasserstein distance is introduced in (Nguyen et al., 2023). It is a variant of SW that utilizes dependent projecting directions, i.e., directions that follow a joint distribution with a Markovian structure.
The contribution of the random-path slicing distribution is orthogonal to this approach, since it can be used as either the prior distribution or the transition distribution in the Markovian process. In this work, we focus on slicing-distribution-selection variants of SW without a dependency structure, which can be computed very fast in parallel.

Properties of sliced Wasserstein losses. In this work, we use the sliced Wasserstein distance as a loss for gradient flows and generative modeling. We refer the reader to (Tanguy, 2023) for a more detailed discussion of its properties, e.g., regularity, the definition of gradients, and so on.

Algorithm 1 Computational algorithm of RPSW
Input: Probability measures $\mu$ and $\nu$, $p \geq 1$, the number of projections $L$, $0 < \kappa < \infty$.
for $l = 1$ to $L$ do
  Sample $X \sim \mu$, $Y \sim \nu$
  Sample $\theta_l \sim \sigma_\kappa\left(\theta; \frac{X - Y}{\|X - Y\|_2}\right)$
  Compute $v_l = \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)$
end for
Compute $\widehat{\text{RPSW}}_p(\mu, \nu; L, \sigma_\kappa) = \left(\frac{1}{L}\sum_{l=1}^{L} v_l\right)^{\frac{1}{p}}$
Return: $\widehat{\text{RPSW}}_p(\mu, \nu; L, \sigma_\kappa)$

Algorithm 2 Computational algorithm of the IWRPSW
Input: Probability measures $\mu$ and $\nu$, $p \geq 1$, the number of projections $L$, $0 < \kappa < \infty$, and the energy function $f$.
for $l = 1$ to $L$ do
  Sample $X \sim \mu$, $Y \sim \nu$
  Sample $\theta_l \sim \sigma_\kappa\left(\theta; \frac{X - Y}{\|X - Y\|_2}\right)$
  Compute $v_l = \text{W}_p^p(\theta_l \sharp \mu, \theta_l \sharp \nu)$
  Compute $w_l = f(v_l)$
end for
Compute $\widehat{\text{IWRPSW}}_p(\mu, \nu; \sigma_\kappa, L, f) = \left(\frac{\sum_{l=1}^{L} v_l w_l}{\sum_{i=1}^{L} w_i}\right)^{\frac{1}{p}}$
Return: $\widehat{\text{IWRPSW}}_p(\mu, \nu; \sigma_\kappa, L, f)$

Unbalanced Sliced Wasserstein. The random-path projecting direction can be applied directly to SW variants with unbalanced transport, e.g., sliced partial optimal transport (Bonneel & Coeurjolly, 2019; Bai et al., 2022) and sliced unbalanced optimal transport (Séjourné et al., 2023).

Other potential applications. Since RPSW and IWRPSW can be used as drop-in replacements for SW, they can be applied in domain adaptation (Lee et al., 2019), point-cloud applications (Bai et al., 2023; Nguyen et al., 2024), 3D shape matching and deformation (Le et al., 2024a;b), Bayesian inference (Yi & Liu, 2021; Nadjahi et al., 2020a), blue-noise sampling (Paulin et al., 2020), and so on.

Other potential future directions. We consider extending the proposed idea of the random-path direction for a line in SW to OT problems with more general structures on the supports, e.g., a tree in tree-Wasserstein (Le et al., 2019) and a graph in Sobolev transport (Le et al., 2022).

C. Algorithms

Computational algorithms of RPSW and IWRPSW. We present the pseudo-code for computing RPSW and IWRPSW with Monte Carlo estimation in Algorithm 1 and Algorithm 2.

Algorithm for training denoising diffusion models with the augmented mini-batch energy distance. The detailed algorithm is given in Algorithm 3.
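To make Algorithms 1 and 2 concrete, below is a minimal NumPy sketch of the two estimators for empirical measures with the same number of uniformly weighted support points (so the one-dimensional Wasserstein distance has a sorting-based closed form). The function `sample_sigma_kappa` is a stand-in of our own: it perturbs the random-path direction with Gaussian noise of scale $1/\kappa$ and renormalizes, rather than using the vMF/PS samplers of the paper, and $f = \exp$ is only an illustrative choice of increasing energy function.

```python
import numpy as np

def wasserstein_1d_pp(x, y, p=2):
    """W_p^p between two 1-D empirical measures with the same number of
    uniformly weighted support points (closed form via sorted samples)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

def sample_sigma_kappa(direction, kappa, rng):
    """Stand-in for sampling theta ~ sigma_kappa(theta; direction): perturb the
    random-path direction with Gaussian noise of scale 1/kappa and renormalize.
    (Not the vMF/PS sampler used in the paper.)"""
    theta = direction + rng.standard_normal(direction.shape) / kappa
    return theta / np.linalg.norm(theta)

def rpsw_iwrpsw(X, Y, L=50, p=2, kappa=50.0, f=np.exp, seed=0):
    """Monte Carlo estimates following Algorithms 1 and 2."""
    rng = np.random.default_rng(seed)
    v, w = np.empty(L), np.empty(L)
    for l in range(L):
        x = X[rng.integers(X.shape[0])]             # X ~ mu (empirical)
        y = Y[rng.integers(Y.shape[0])]             # Y ~ nu (empirical)
        path = (x - y) / np.linalg.norm(x - y)      # random-path direction
        theta = sample_sigma_kappa(path, kappa, rng)
        v[l] = wasserstein_1d_pp(X @ theta, Y @ theta, p)  # W_p^p of projections
        w[l] = f(v[l])                                     # importance weight
    rpsw = np.mean(v) ** (1 / p)
    iwrpsw = (np.sum(v * w) / np.sum(w)) ** (1 / p)
    return rpsw, iwrpsw

# Toy usage: two Gaussian point clouds in R^10 with 200 points each.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Y = rng.standard_normal((200, 10)) + 1.0
print(rpsw_iwrpsw(X, Y))
```

Since the weights $w_l = f(v_l)$ are similarly ordered with the values $v_l$, the two printed estimates always satisfy $\widehat{\text{RPSW}}_p \leq \widehat{\text{IWRPSW}}_p$, mirroring Proposition 1 (i).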
D. Additional Experimental Details

Gradient flows with different gradient estimators. We run the gradient flow experiments with the original gradient estimators of RPSW and IWRPSW in Figure 4. We see that the original gradient estimators lead to faster convergence in terms of the Wasserstein-2 distance. However, the inner topology between points, indicated by the colors, is destroyed, especially for RPSW. The reason for this behavior is the independent coupling used in constructing the random paths. Therefore, we suggest using the simplified gradient estimator and treating the random-path slicing distribution as a separate distribution for projecting-direction selection.

Algorithm 3 Training denoising diffusion models with the augmented mini-batch energy distance
Input: $q(x_{1:T}|x_0) = \prod_{t \geq 1} q(x_t|x_{t-1})$, $p_\phi(x_{0:T}) = p(x_T)\prod_{t \geq 1} p_\phi(x_{t-1}|x_t)$ with $p_\phi(x_{t-1}|x_t) = \int p(\epsilon)\, q(x_{t-1}|x_t, x_0 = G_\phi(x_t, \epsilon, t))\, d\epsilon$, $p(\epsilon) \in \mathcal{P}(\mathbb{R}^z)$, $G_\phi: \mathbb{R}^d \times \mathbb{R}^z \times \mathbb{R} \to \mathbb{R}^d$, $D_\gamma: \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to [0, 1]$, and a metric $D$.
while $\phi$ has not converged and the maximum number of epochs has not been reached do
  Sample a mini-batch $x_{0,1}, \ldots, x_{0,m} \sim q(x_0)$ with $m \geq 2$.
  Sample $t \sim \mathcal{U}(\{1, \ldots, T\})$.
  Sample a mini-batch $x_{t-1,1} \sim q(x_{t-1}|x_{0,1}), \ldots, x_{t-1,m} \sim q(x_{t-1}|x_{0,m})$.
  Sample a mini-batch $x_{t,1} \sim q(x_t|x_{t-1,1}), \ldots, x_{t,m} \sim q(x_t|x_{t-1,m})$.
  Sample a mini-batch of noise $\epsilon_1, \ldots, \epsilon_m \sim p(\epsilon)$.
  Set $y_{0,1}, \ldots, y_{0,m} = G_\phi(x_{t,1}, \epsilon_1, t), \ldots, G_\phi(x_{t,m}, \epsilon_m, t)$.
  Sample a mini-batch $y_{t-1,1} \sim q(x_{t-1}|x_{t,1}, y_{0,1}), \ldots, y_{t-1,m} \sim q(x_{t-1}|x_{t,m}, y_{0,m})$.
  Update $\gamma$ with the gradient $\nabla_\gamma \frac{1}{m}\sum_{i=1}^{m} \log(D_\gamma(x_{t-1,i}, x_{t,i}, t))$.
  Update $\gamma$ with the gradient $\nabla_\gamma \frac{1}{m}\sum_{i=1}^{m} \log(1 - D_\gamma(y_{t-1,i}, x_{t,i}, t))$.
  Set $\tilde{X} = ((x_{t-1,1}, D_\gamma(x_{t-1,1}, x_{t,1}, t)), \ldots, (x_{t-1,m}, D_\gamma(x_{t-1,m}, x_{t,m}, t)))$.
  Set $\tilde{Y} = ((y_{t-1,1}, D_\gamma(y_{t-1,1}, x_{t,1}, t)), \ldots, (y_{t-1,m}, D_\gamma(y_{t-1,m}, x_{t,m}, t)))$.
  Set $\tilde{X}_1, \tilde{X}_2 = \tilde{X}[{:}m/2], \tilde{X}[m/2{:}]$ and $\tilde{Y}_1, \tilde{Y}_2 = \tilde{Y}[{:}m/2], \tilde{Y}[m/2{:}]$.
  Update $\phi$ with the gradient $\nabla_\phi \left(2 D^2(\tilde{X}, \tilde{Y}) - D^2(\tilde{X}_1, \tilde{X}_2) - D^2(\tilde{Y}_1, \tilde{Y}_2)\right)$.
end while
Return: $\phi$

Figure 4. Results for gradient flows from the empirical distribution over the color points to the empirical distribution over the S-shape points, produced by RPSW and IWRPSW with the original gradient estimator (marked with * in the figure). Panel annotations, RPSW (L=10): $\text{W}_2 = 25.3149 \times 10^{-2}$ (0s), $0.1975 \times 10^{-2}$ (1.26s), $0.0072 \times 10^{-2}$ (1.73s); IWRPSW (L=10): $\text{W}_2 = 25.3149 \times 10^{-2}$ (0s), $0.1294 \times 10^{-2}$ (1.45s), $0.0067 \times 10^{-2}$ (1.93s).

Gradient flows on MNIST digits. We present the full visualization of the gradient flows on MNIST in Figure 5.

Neural network architecture for the diffusion model. Our generator follows the U-Net structure in (Xiao et al., 2021), which has 2 ResNet blocks per scale, 128 initial channels, a channel multiplier of (1, 2, 2, 2) per scale, attention blocks at scale 16, a latent dimension of 256, 3 latent mapping layers, and a latent embedding dimension of 512. For the discriminator, the order of the layers is: 1x1 conv2d, 128 ResBlock, 128 ResBlock down, 256 ResBlock down, 512 ResBlock down, 512 minibatch-std layer, global sum pooling, and a fully connected layer to a scalar. For other settings, we refer the reader to (Xiao et al., 2021, Section C).

Hyper-parameters. We set the initial learning rate of the discriminator to $10^{-4}$, the initial learning rate of the generator to $1.6 \times 10^{-4}$, use the Adam optimizer with parameters $(0.5, 0.9)$, set the EMA rate to 0.9999, and use a batch size of 256. For the learning rate schedule, we use cosine learning rate decay.

E. Computational Infrastructure

For the gradient flow experiments, we use an HP Omen 25L desktop. For the diffusion model experiments, we use a single NVIDIA A100 GPU.

Figure 5. Gradient flows from MNIST digit 1 to MNIST digit 0 (panels show SW, Max-SW, DSW, EBSW, RPSW, and IWRPSW at steps 0, 100, 1000, and 5000).
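To make the generator update in Algorithm 3 concrete, the following is a minimal NumPy sketch of our own of the augmented mini-batch energy objective $2D^2(\tilde{X}, \tilde{Y}) - D^2(\tilde{X}_1, \tilde{X}_2) - D^2(\tilde{Y}_1, \tilde{Y}_2)$. The plug-in distance `sliced_w2` is a plain sliced Wasserstein-2 with uniform projecting directions, used only as a self-contained stand-in for the RPSW/IWRPSW estimators, and the random mini-batches stand in for the discriminator-augmented samples $\tilde{X}$ and $\tilde{Y}$.

```python
import numpy as np

def sliced_w2(A, B, L=64, seed=0):
    """Plain sliced Wasserstein-2 between two equally sized mini-batches; a
    self-contained stand-in for the RPSW/IWRPSW estimators sketched earlier."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((L, A.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    pa = np.sort(A @ theta.T, axis=0)   # projections of A, sorted per direction
    pb = np.sort(B @ theta.T, axis=0)   # projections of B, sorted per direction
    return np.sqrt(np.mean((pa - pb) ** 2))

def augmented_energy_loss(X, Y, D=sliced_w2):
    """Generator objective of Algorithm 3:
    2 D^2(X, Y) - D^2(X1, X2) - D^2(Y1, Y2), where X and Y play the role of the
    discriminator-augmented mini-batches and each is split into two halves."""
    m = X.shape[0]
    X1, X2 = X[: m // 2], X[m // 2:]
    Y1, Y2 = Y[: m // 2], Y[m // 2:]
    return 2 * D(X, Y) ** 2 - D(X1, X2) ** 2 - D(Y1, Y2) ** 2

# Toy usage with mini-batches of 16 "augmented" samples in R^8.
rng = np.random.default_rng(2)
X = rng.standard_normal((16, 8))
Y = rng.standard_normal((16, 8)) + 0.5
print(augmented_energy_loss(X, Y))
```

In actual training, $D$ would be a differentiable RPSW/IWRPSW estimator and the objective would be backpropagated through the generated samples in $\tilde{Y}$ to update $\phi$.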