# AUGMENTED SLICED WASSERSTEIN DISTANCES

Published as a conference paper at ICLR 2022

Xiongjie Chen¹, Yongxin Yang²,³, and Yunpeng Li¹
¹University of Surrey, ²University of Edinburgh, ³Huawei Noah's Ark Lab
¹{xiongjie.chen, yunpeng.li}@surrey.ac.uk, ²yongxin.yang@ed.ac.uk

While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through random projection, yet they suffer from low accuracy if the number of projections is not sufficiently large, because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.

## 1 INTRODUCTION

Comparing samples from two probability distributions is a fundamental problem in statistics and machine learning. The optimal transport (OT) theory (Villani, 2008) provides a powerful and flexible theoretical tool to compare degenerate distributions by accounting for the metric in the underlying spaces. The Wasserstein distance, which arises from the optimal transport theory, has become an increasingly popular choice in various machine learning domains ranging from generative models to transfer learning (Gulrajani et al., 2017; Arjovsky et al., 2017; Kolouri et al., 2019b; Cuturi and Doucet, 2014; Courty et al., 2016).

Despite its favorable properties, such as robustness to disjoint supports and numerical stability (Arjovsky et al., 2017), the Wasserstein distance suffers from high computational complexity, especially when the sample size is large. Besides, since the Wasserstein distance is itself the result of an optimization problem, it is non-trivial to integrate it into an end-to-end training pipeline of deep neural networks, unless one can make the solver for the optimization problem differentiable. Recent advances in computational optimal transport methods focus on alternative OT-based metrics that are computationally efficient and solvable via a differentiable optimizer (Peyré and Cuturi, 2019). Entropy regularization is introduced in the Sinkhorn distance (Cuturi, 2013) and its variants (Altschuler et al., 2017; Dessein et al., 2018) to smooth the optimal transport problem; as a result, iterative matrix scaling algorithms can be applied to provide significantly faster solutions with improved sample complexity (Genevay et al., 2019). An alternative approach is to approximate the Wasserstein distance through slicing, i.e. linearly projecting, the distributions to be compared.
The sliced Wasserstein distance (SWD) (Bonneel et al., 2015) is defined as the expected value of Wasserstein distances between one-dimensional random projections of high-dimensional distributions. The SWD shares similar theoretical properties with the Wasserstein distance (Bonnotte, 2013) and is computationally efficient, since the Wasserstein distance in one-dimensional space has a closed-form solution based on sorting. Deshpande et al. (2019) extend the sliced Wasserstein distance to the max-sliced Wasserstein distance (Max-SWD) by finding a single projection direction with the maximal distance between projected samples. The subspace robust Wasserstein distance extends the idea of slicing to projecting distributions onto linear subspaces (Paty and Cuturi, 2019). However, the linear nature of these projections usually leads to low projection efficiency of the resulting metrics in high-dimensional spaces (Deshpande et al., 2019; Kolouri et al., 2019a).

Figure 1: (a) and (b) are visualizations of projections for the ASWD and the SWD between two 2-dimensional Gaussians. (c) and (d) are distance histograms for the ASWD and the SWD between two 100-dimensional Gaussians. Figure 1(a) shows that the injective neural network embedded in the ASWD learns data patterns (in the X-Y plane) and produces well-separated projected values (Z-axis) between distributions in a random projection direction. The high projection efficiency of the ASWD is evident in Figure 1(c), as almost all random projection directions in a 100-dimensional space lead to significant distances between 1-dimensional projections. In contrast, random linear mappings in the SWD often produce closer 1-d projections (Z-axis) (Figure 1(b)); as a result, a large percentage of random projection directions in the 100-d space result in trivially small distances (Figure 1(d)), leading to a low projection efficiency in high-dimensional spaces.

Different variants of the SWD have been proposed to improve its projection efficiency, either by introducing nonlinear projections or by optimizing the distribution of random projections. Specifically, Kolouri et al. (2019a) extend the connection between the sliced Wasserstein distance and the Radon transform (Radon, 1917) to introduce generalized sliced Wasserstein distances (GSWDs) by utilizing generalized Radon transforms (GRTs), which are defined by nonlinear defining functions and lead to nonlinear projections. A variant named the GSWD-NN was proposed in (Kolouri et al., 2019a) to generate nonlinear projections directly with neural network outputs, but it does not fit into the theoretical framework of the GSWD and does not guarantee a valid metric. In contrast, the distributional sliced Wasserstein distance (DSWD) and its nonlinear version, the distributional generalized sliced Wasserstein distance (DGSWD), improve their projection efficiency by finding a distribution of projections that maximizes the expected distances over these projections. The GSWD and the DGSWD exhibit higher projection efficiency than the SWD in experimental evaluations, yet they require the specification of a particular form of defining function from a limited class of candidates. However, the selection of defining functions is usually a task-dependent problem that requires domain knowledge, and the impact of different defining functions on performance is still unclear.
In this paper, we present the augmented sliced Wasserstein distance (ASWD), a distance metric constructed by first mapping samples to hypersurfaces in an augmented space, which enables flexible nonlinear slicing of data distributions for improved projection efficiency (see Figure 1). Our main contributions include: (i) We exploit the capacity of nonlinear projections employed in the ASWD by constructing injective mappings with arbitrary neural networks; (ii) We prove that the ASWD is a valid distance metric; (iii) We provide a mechanism by which the hypersurface onto which high-dimensional distributions are projected can be optimized, and show that the optimization of hypersurfaces can help improve the projection efficiency of slice-based Wasserstein distances. Hence, the ASWD is data-adaptive, i.e. the hypersurfaces can be learned from data. This implies one does not need to manually design a function from a limited class of candidates; (iv) We demonstrate superior performance of the ASWD in numerical experiments for both synthetic and real-world datasets.

The remainder of the paper is organized as follows. Section 2 reviews the necessary background. We present the proposed method and its numerical implementation in Section 3. Related work is discussed in Section 4. Numerical experiment results are presented and discussed in Section 5. We conclude the paper in Section 6.

## 2 BACKGROUND

In this section, we provide a brief review of concepts related to the proposed work, including the Wasserstein distance, the (generalized) Radon transform and the (generalized) sliced Wasserstein distances.

**Wasserstein distance:** Let $P_k(\Omega)$ be a set of Borel probability measures with finite $k$-th moment on a Polish metric space $(\Omega,d)$ (Villani, 2008). Given two probability measures $\mu,\nu \in P_k(\Omega)$, the Wasserstein distance of order $k \in [1,+\infty)$ between $\mu$ and $\nu$ is defined as:

$$W_k(\mu,\nu)=\Big(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega} d(x,y)^k \, d\gamma(x,y)\Big)^{\frac{1}{k}}, \tag{1}$$

where $d(\cdot,\cdot)^k$ is the cost function and $\Gamma(\mu,\nu)$ represents the set of all transportation plans $\gamma$, i.e. joint distributions whose marginals are $\mu$ and $\nu$, respectively.

While the Wasserstein distance is generally intractable for high-dimensional distributions, there are several favorable cases where the optimal transport problem can be efficiently solved. If $\mu$ and $\nu$ are continuous one-dimensional measures defined on a linear space equipped with the $L_k$ norm, the Wasserstein distance between $\mu$ and $\nu$ has a closed-form solution (Peyré and Cuturi, 2019):

$$W_k(\mu,\nu)=\Big(\int_0^1 |F_\mu^{-1}(z)-F_\nu^{-1}(z)|^k \, dz\Big)^{\frac{1}{k}}, \tag{2}$$

where $F_\mu^{-1}$ and $F_\nu^{-1}$ are the inverse cumulative distribution functions (CDFs) of $\mu$ and $\nu$, respectively. In practice, the Wasserstein distance $W_k(\hat\mu,\hat\nu)$ between one-dimensional empirical distributions $\hat\mu=\frac{1}{N}\sum_{n=1}^N \delta_{x_n}$ and $\hat\nu=\frac{1}{N}\sum_{n=1}^N \delta_{y_n}$ can be computed by sorting the one-dimensional samples from the empirical distributions and evaluating the distances between the sorted samples (Kolouri et al., 2019b):

$$W_k(\hat\mu,\hat\nu)=\Big(\frac{1}{N}\sum_{n=1}^N \big|x_{I_x[n]}-y_{I_y[n]}\big|^k\Big)^{\frac{1}{k}}, \tag{3}$$

where $N$ is the number of samples, and $I_x[n]$ and $I_y[n]$ are the indices of sorted samples satisfying $x_{I_x[n]}\leq x_{I_x[n+1]}$ and $y_{I_y[n]}\leq y_{I_y[n+1]}$, respectively.
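To make Equation (3) concrete, the sketch below (our own illustration, not code from the paper) computes the order-$k$ Wasserstein distance between two one-dimensional empirical distributions by sorting; it assumes equal sample sizes and uniform weights.

```python
import numpy as np

def wasserstein_1d(x, y, k=2):
    # Equation (3): sort both sample sets, then average |x_[n] - y_[n]|^k.
    x_sorted = np.sort(x)          # x_{I_x[1]} <= ... <= x_{I_x[N]}
    y_sorted = np.sort(y)          # y_{I_y[1]} <= ... <= y_{I_y[N]}
    return np.mean(np.abs(x_sorted - y_sorted) ** k) ** (1.0 / k)

# Toy check with two 1-D Gaussians; the order-2 distance should be close to 2.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0.0, 1.0, 5000), rng.normal(2.0, 1.0, 5000)))
```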
**Radon transform and generalized Radon transform:** The Radon transform (Radon, 1917) maps a function $f(\cdot)\in L^1(\mathbb{R}^d)$ to the space of functions defined over the space of hyperplanes in $\mathbb{R}^d$. The Radon transform of $f(\cdot)$ is defined by integrals of $f(\cdot)$ along all possible hyperplanes in $\mathbb{R}^d$:

$$\mathcal{R}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\langle x,\theta\rangle)\,dx, \tag{4}$$

where $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d-1}$ represent the parameters of hyperplanes $\{x\in\mathbb{R}^d \,|\, \langle x,\theta\rangle=t\}$, $\delta(\cdot)$ is the one-dimensional Dirac delta function, and $\langle\cdot,\cdot\rangle$ refers to the Euclidean inner product. By replacing the inner product $\langle x,\theta\rangle$ in Equation (4) with $\beta(x,\theta)$, a specific family of functions named defining functions in (Kolouri et al., 2019a), the generalized Radon transform (GRT) (Beylkin, 1984) is defined as integrals of $f(\cdot)$ along hypersurfaces defined by $\{x\in\mathbb{R}^d \,|\, \beta(x,\theta)=t\}$:

$$\mathcal{G}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\beta(x,\theta))\,dx, \tag{5}$$

where $t\in\mathbb{R}$, $\theta\in\Omega_\theta$, and $\Omega_\theta$ is a compact set of all feasible $\theta$, e.g. $\Omega_\theta=\mathbb{S}^{d-1}$ for $\beta(x,\theta)=\langle x,\theta\rangle$. In particular, a function $\beta(x,\theta)$ defined on $X\times(\mathbb{R}^d\setminus\{0\})$ with $X\subseteq\mathbb{R}^d$ is called a defining function of GRTs if it satisfies conditions H.1-H.4 given in (Kolouri et al., 2019a). For probability measures $\mu\in P_k(\mathbb{R}^d)$, the Radon transform and the GRT can be employed as push-forward operators, and the generated push-forward measures $\mathcal{R}_\mu=\mathcal{R}\#\mu$ and $\mathcal{G}_\mu=\mathcal{G}\#\mu$ are defined as follows (Bonneel et al., 2015):

$$\mathcal{R}_\mu(t,\theta)=\int_{\mathbb{R}^d}\delta(t-\langle x,\theta\rangle)\,d\mu, \tag{6}$$

$$\mathcal{G}_\mu(t,\theta)=\int_{\mathbb{R}^d}\delta(t-\beta(x,\theta))\,d\mu. \tag{7}$$

Notably, the Radon transform is a linear bijection (Helgason, 1980), and sufficient conditions for GRTs to be bijective are provided in (Homan and Zhou, 2017).

**Sliced Wasserstein distance and generalized sliced Wasserstein distance:** By applying the Radon transform to $\mu$ and $\nu$ to obtain multiple projections, the sliced Wasserstein distance (SWD) decomposes the high-dimensional Wasserstein distance into multiple one-dimensional Wasserstein distances which can be efficiently evaluated (Bonneel et al., 2015). The $k$-SWD between $\mu$ and $\nu$ is defined by:

$$\text{SWD}_k(\mu,\nu)=\Big(\int_{\mathbb{S}^{d-1}} W_k^k\big(\mathcal{R}_\mu(\cdot,\theta),\mathcal{R}_\nu(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{8}$$

where the Radon transform $\mathcal{R}$ defined by Equation (6) is adopted as the measure push-forward operator. The GSWD generalizes the idea of the SWD by slicing distributions with hypersurfaces rather than hyperplanes (Kolouri et al., 2019a). The GSWD is defined as:

$$\text{GSWD}_k(\mu,\nu)=\Big(\int_{\Omega_\theta} W_k^k\big(\mathcal{G}_\mu(\cdot,\theta),\mathcal{G}_\nu(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{9}$$

where the GRT $\mathcal{G}$ defined by Equation (7) is used as the measure push-forward operator. From Equation (3), with $L$ random projections and $N$ samples, the SWD and GSWD between $\mu$ and $\nu$ can be approximated by:

$$\text{SWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\frac{1}{N}\sum_{n=1}^{N}\big|\langle x_{I_x^l[n]},\theta_l\rangle-\langle y_{I_y^l[n]},\theta_l\rangle\big|^k\Big)^{\frac{1}{k}}, \tag{10}$$

$$\text{GSWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\frac{1}{N}\sum_{n=1}^{N}\big|\beta(x_{I_x^l[n]},\theta_l)-\beta(y_{I_y^l[n]},\theta_l)\big|^k\Big)^{\frac{1}{k}}, \tag{11}$$

where $I_x^l$ and $I_y^l$ are sequences consisting of the indices of sorted samples which satisfy $\langle x_{I_x^l[n]},\theta_l\rangle\leq\langle x_{I_x^l[n+1]},\theta_l\rangle$ and $\langle y_{I_y^l[n]},\theta_l\rangle\leq\langle y_{I_y^l[n+1]},\theta_l\rangle$ in the SWD, and $\beta(x_{I_x^l[n]},\theta_l)\leq\beta(x_{I_x^l[n+1]},\theta_l)$ and $\beta(y_{I_y^l[n]},\theta_l)\leq\beta(y_{I_y^l[n+1]},\theta_l)$ in the GSWD. The approximation error in estimating SWDs using Equation (10) is derived in (Nadjahi et al., 2020). It is proved in (Bonnotte, 2013) that the SWD is a valid distance metric. The GSWD is a valid metric except for its neural network variant (Kolouri et al., 2019a).
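Equations (10) and (11) suggest a direct Monte Carlo estimator. The following sketch is ours (the function name and defaults are illustrative): it estimates the SWD of Equation (10) with $L$ random directions drawn uniformly from the unit sphere; the GSWD estimator of Equation (11) is obtained by replacing the inner products with a defining function $\beta(x,\theta)$.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, k=2, rng=None):
    """Monte Carlo estimate of SWD_k (Equation (10)) for samples X, Y of shape (N, d)."""
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.normal(size=(n_projections, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # theta_l uniform on S^{d-1}
    X_proj = np.sort(X @ theta.T, axis=0)                   # <x_n, theta_l>, sorted per slice
    Y_proj = np.sort(Y @ theta.T, axis=0)
    # Average over both the L slices and the N sorted pairs, then take the k-th root.
    return np.mean(np.abs(X_proj - Y_proj) ** k) ** (1.0 / k)
```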
## 3 AUGMENTED SLICED WASSERSTEIN DISTANCES

In this section, we propose a new distance metric called the augmented sliced Wasserstein distance (ASWD), which embeds flexible nonlinear projections in its construction. We also provide an implementation recipe for the ASWD.

### 3.1 SPATIAL RADON TRANSFORM AND AUGMENTED SLICED WASSERSTEIN DISTANCE

In the definitions of the SWD and GSWD, the Radon transform (Radon, 1917) and the generalized Radon transform (GRT) (Beylkin, 1984) are used as the push-forward operators for projecting distributions to a one-dimensional space. However, it is not straightforward to design defining functions $\beta(x,\theta)$ for the GRT, since one needs to first check whether $\beta(x,\theta)$ satisfies the conditions to be a defining function (Kolouri et al., 2019a; Beylkin, 1984), and then whether the corresponding GRT is bijective (Homan and Zhou, 2017). In practice, the assumption on the transform can be relaxed, as Theorem 1 shows that as long as the transform is injective, the corresponding ASWD is a valid distance metric. To help us define the augmented sliced Wasserstein distance, we first introduce the spatial Radon transform, which includes the Radon transform and the polynomial GRT as special cases (see Remark 5).

**Definition 1.** Given a measurable injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ and a function $f(\cdot)\in L^1(\mathbb{R}^d)$, the spatial Radon transform of $f(\cdot)$ is defined as

$$\mathcal{H}f(t,\theta;g)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\langle g(x),\theta\rangle)\,dx, \tag{12}$$

where $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d_\theta-1}$ are the parameters of hypersurfaces $\{x\in\mathbb{R}^d \,|\, \langle g(x),\theta\rangle=t\}$.

Similar to the Radon transform and the GRT, the spatial Radon transform can also be used to generate the push-forward measure $\mathcal{H}_\mu=\mathcal{H}\#\mu$ for $\mu\in P_k(\mathbb{R}^d)$, as in Equations (6) and (7):

$$\mathcal{H}_\mu(t,\theta;g)=\int_{\mathbb{R}^d}\delta(t-\langle g(x),\theta\rangle)\,d\mu. \tag{13}$$

**Remark 1.** Note that the spatial Radon transform can be interpreted as applying the vanilla Radon transform to $\hat\mu_g$, where $\hat\mu_g$ refers to the push-forward measure $g\#\mu$. That is, given a measurable injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$, the spatial Radon transform defined by Equation (13) can be rewritten as:

$$\mathcal{H}_\mu(t,\theta;g)=\mathbb{E}_{x\sim\mu}\big[\delta(t-\langle g(x),\theta\rangle)\big]=\mathbb{E}_{\hat x\sim\hat\mu_g}\big[\delta(t-\langle \hat x,\theta\rangle)\big]=\int \delta(t-\langle \hat x,\theta\rangle)\,d\hat\mu_g=\mathcal{R}_{\hat\mu_g}(t,\theta). \tag{14}$$

Hence the spatial Radon transform inherits the theoretical properties of the Radon transform and incorporates nonlinear projections through $g(\cdot)$.

In what follows, we use $\mu\equiv\nu$ to denote probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$ that satisfy $\mu(X)=\nu(X)$ for all $X\subseteq\mathbb{R}^d$.

**Lemma 1.** Given an injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ and two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$, $\mathcal{H}_\mu(t,\theta;g)\equiv\mathcal{H}_\nu(t,\theta;g)$ for all $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d_\theta-1}$ if and only if $\mu\equiv\nu$, i.e. the spatial Radon transform is an injection on $P_k(\mathbb{R}^d)$. Moreover, the spatial Radon transform is an injection on $P_k(\mathbb{R}^d)$ if and only if the mapping $g(\cdot)$ is an injection.

See Appendix A for the proof of Lemma 1. We now introduce the augmented sliced Wasserstein distance, which utilizes the spatial Radon transform as the measure push-forward operator:

**Definition 2.** Given two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$ and an injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$, the augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ is defined as:

$$\text{ASWD}_k(\mu,\nu;g)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{H}_\mu(\cdot,\theta;g),\mathcal{H}_\nu(\cdot,\theta;g)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{15}$$

where $\theta\in\mathbb{S}^{d_\theta-1}$, $W_k$ is the $k$-Wasserstein distance defined by Equation (1), and $\mathcal{H}$ refers to the spatial Radon transform defined by Equation (13).

**Remark 2.** Following the connection between the spatial Radon transform and the vanilla Radon transform shown in Equation (14), the ASWD can be rewritten as:

$$\text{ASWD}_k(\mu,\nu;g)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{R}_{\hat\mu_g}(\cdot,\theta),\mathcal{R}_{\hat\nu_g}(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}=\text{SWD}_k(\hat\mu_g,\hat\nu_g), \tag{16}$$

where $\hat\mu_g$ and $\hat\nu_g$ are probability measures on $\mathbb{R}^{d_\theta}$ which satisfy $g(x)\sim\hat\mu_g$ for $x\sim\mu$ and $g(y)\sim\hat\nu_g$ for $y\sim\nu$.

**Theorem 1.** The augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ defined by Equation (15) with a mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ is a metric on $P_k(\mathbb{R}^d)$ if and only if $g(\cdot)$ is injective.

The proof of Theorem 1 is provided in Appendix C. Theorem 1 shows that the ASWD is a metric given a fixed injective mapping $g(\cdot)$.
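Remark 2 makes the ASWD easy to estimate once a mapping $g$ is fixed: push the samples through $g$ and evaluate an ordinary SWD in the augmented space. Below is a minimal sketch of this view (ours; `g` here is an arbitrary fixed injective mapping of the form of Equation (18), not the optimized network described in Section 3.2).

```python
import numpy as np

def aswd_fixed_mapping(X, Y, g, n_projections=100, k=2, rng=None):
    """ASWD_k with a fixed injective mapping g (Definition 2 / Remark 2):
    equivalent to SWD_k between the push-forward measures g#mu and g#nu."""
    rng = rng if rng is not None else np.random.default_rng()
    Xg, Yg = g(X), g(Y)                                      # samples in R^{d_theta}
    theta = rng.normal(size=(n_projections, Xg.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)    # theta on S^{d_theta - 1}
    Xp = np.sort(Xg @ theta.T, axis=0)
    Yp = np.sort(Yg @ theta.T, axis=0)
    return np.mean(np.abs(Xp - Yp) ** k) ** (1.0 / k)

# A fixed (untrained) mapping of the form g(x) = [x, phi(x)], for illustration only.
W = np.random.default_rng(1).normal(size=(2, 2))
g = lambda X: np.concatenate([X, np.maximum(X @ W, 0.0)], axis=1)
```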
In practical applications, the mapping $g(\cdot)$ needs to be optimized to project samples onto discriminating hypersurfaces. We show in Corollary 1.1 that the ASWD between $\mu$ and $\nu$ with the optimized $g(\cdot)$ is also a metric under mild conditions.

**Corollary 1.1.** The augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ between two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$, defined by Equation (15) with the optimal mapping

$$g^*(\cdot)=\arg\max_g\big\{\text{ASWD}_k(\mu,\nu;g)-\mathcal{L}(\mu,\nu,\lambda;g)\big\}, \tag{17}$$

is a metric on $P_k(\mathbb{R}^d)$, where $\mathcal{L}(\mu,\nu,\lambda;g)=\lambda\big(\mathbb{E}_{x\sim\mu}[\|g(x)\|_2^k]^{\frac{1}{k}}+\mathbb{E}_{y\sim\nu}[\|g(y)\|_2^k]^{\frac{1}{k}}\big)$ for $\lambda\in(1,+\infty)$.

The proof of Corollary 1.1 is provided in Appendix D.

**Remark 3.** Corollary 1.1 shows that, given measures $\mu_1,\mu_2,\mu_3\in P_k(\mathbb{R}^d)$, the triangle inequality holds for the ASWD when $g(\cdot)$ is optimized for each pair of measures, as shown in Appendix D. It is worth noting that $\lambda>1$ is a sufficient condition for the ASWD to be a metric; as further discussed in Remark 6, $0<\lambda\leq 1$ can also lead to finite $\|g(x)\|_2$ in various scenarios, resulting in valid metrics. A discussion on the impact of $\lambda$ on the performance of the ASWD in practice can be found in Appendix G.2.

### 3.2 NUMERICAL IMPLEMENTATION

We discuss in this section how to realize the injective mapping $g(\cdot)$ with neural networks, due to their expressiveness, and how to optimize it with gradient-based methods.

**Injective neural networks:** As stated in Lemma 1 and Theorem 1, the injectivity of $g(\cdot)$ is the sufficient and necessary condition for the ASWD to be a valid metric. Thus we need specific architecture designs when implementing $g(\cdot)$ with neural networks. One option is the family of invertible neural networks (Behrmann et al., 2019; Karami et al., 2019), which are both injective and surjective. However, the running cost of those models is usually much higher than that of vanilla neural networks. We propose an alternative approach by concatenating the input $x$ of an arbitrary neural network to its output $\phi_\omega(x)$:

$$g_\omega(x)=[x,\phi_\omega(x)]. \tag{18}$$

It is trivial to show that $g_\omega(x)$ is injective, since different inputs lead to different outputs. Although embarrassingly simple, this idea of concatenating the input and output of neural networks has found success in preserving information with dense blocks in the DenseNet (Huang et al., 2017), where the input of each layer is injective to the output of all preceding layers.

**Optimization objective:** We aim to slice distributions with maximally discriminating hypersurfaces between two distributions while preventing the projected samples from being arbitrarily large, so that the projected samples of the compared distributions are finite and most dissimilar with respect to the ASWD, as shown in Figure 1. Similar ideas have been employed to identify important projection directions (Deshpande et al., 2019; Kolouri et al., 2019a; Paty and Cuturi, 2019) or a discriminative ground metric (Salimans et al., 2018) in optimal transport metrics. For the ASWD, the parameterized injective neural network $g_\omega(\cdot)$ is optimized by maximizing the following objective:

$$L(\mu,\nu;g_\omega,\lambda)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{H}_\mu(\cdot,\theta;g_\omega),\mathcal{H}_\nu(\cdot,\theta;g_\omega)\big)\,d\theta\Big)^{\frac{1}{k}}-\mathcal{L}(\mu,\nu,\lambda;g_\omega), \tag{19}$$

where $\lambda>0$ and the regularization term $\mathcal{L}(\mu,\nu,\lambda;g_\omega)=\lambda\big(\mathbb{E}_{x\sim\mu}[\|g_\omega(x)\|_2^k]^{\frac{1}{k}}+\mathbb{E}_{y\sim\nu}[\|g_\omega(y)\|_2^k]^{\frac{1}{k}}\big)$ penalizes the magnitude of the neural network's output; without it, the projections may become arbitrarily large.
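A minimal PyTorch sketch of this implementation (ours; module and function names, the optimizer, and the hyperparameter defaults are illustrative rather than the reference implementation): the injective mapping of Equation (18), a gradient-ascent loop on the objective of Equation (19), and one final evaluation of the resulting empirical ASWD.

```python
import torch
import torch.nn as nn

class InjectiveMapping(nn.Module):
    """g_omega(x) = [x, phi_omega(x)] (Equation (18)); injective because x is kept in the output."""
    def __init__(self, dim):
        super().__init__()
        # phi_omega: a single fully-connected layer with ReLU, as used in Section 5.
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return torch.cat([x, self.phi(x)], dim=-1)

def sliced_w(Xg, Yg, n_projections=50, k=2):
    # SWD between the augmented samples, i.e. the ASWD of Remark 2.
    theta = torch.randn(n_projections, Xg.shape[1], device=Xg.device)
    theta = theta / theta.norm(dim=1, keepdim=True)
    Xp, _ = torch.sort(Xg @ theta.T, dim=0)
    Yp, _ = torch.sort(Yg @ theta.T, dim=0)
    return (Xp - Yp).abs().pow(k).mean().pow(1.0 / k)

def aswd(X, Y, lam=0.1, k=2, n_iters=10, lr=1e-3):
    g = InjectiveMapping(X.shape[1])
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(n_iters):                       # gradient ascent on Equation (19)
        Xg, Yg = g(X), g(Y)
        reg = lam * (Xg.norm(dim=1).pow(k).mean().pow(1.0 / k)
                     + Yg.norm(dim=1).pow(k).mean().pow(1.0 / k))
        objective = sliced_w(Xg, Yg, k=k) - reg
        opt.zero_grad(); (-objective).backward(); opt.step()
    with torch.no_grad():                          # empirical ASWD with the optimized mapping
        return sliced_w(g(X), g(Y), k=k)
```

This mirrors the two phases of Algorithm 1 in Appendix E: first optimize $\omega$ by gradient ascent, then evaluate the distance with freshly drawn projections.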
**Remark 4.** The regularization coefficient $\lambda$ adjusts the non-linearity introduced in the evaluation of the ASWD by controlling the norm of $\phi_\omega(\cdot)$ in Equation (18). In particular, when $\lambda\to\infty$, the nonlinear term $\phi_\omega(\cdot)$ shrinks to 0. The intrinsic dimension of the augmented space, i.e. the number of non-zero dimensions in the augmented space, is hence explicitly controlled by the flexible choice of $\phi_\omega(\cdot)$ and implicitly regularized by $\mathcal{L}(\mu,\nu,\lambda;g_\omega)$.

By plugging the optimized $g^*_{\omega,\lambda}(\cdot)=\arg\max_{g_\omega} L(\mu,\nu;g_\omega,\lambda)$ into Equation (15), we obtain the empirical version of the ASWD. Pseudocode is provided in Appendix E.

## 4 RELATED WORK

Recent work on slice-based Wasserstein distances has mainly focused on improving their projection efficiency, leading to a reduced number of projections needed to capture the structure of data distributions (Kolouri et al., 2019a; Nguyen et al., 2021). The GSWD proposes using nonlinear projections to achieve this goal, and it has been proved to be a valid distance metric if and only if it adopts injective GRTs, whose feasible defining functions include only the circular functions and a finite number of homogeneous polynomial functions with odd degrees (Ehrenpreis, 2003); one such defining function is sketched at the end of this section. While the GSWD has shown impressive performance in various applications (Kolouri et al., 2019a), its defining function is restricted to the aforementioned limited class of candidates. In addition, the selection of the defining function is usually task-dependent and needs domain knowledge, and the impact of different defining functions on performance is still unclear. To tackle those limitations, (Kolouri et al., 2019a) proposed the GSWD-NN, which directly takes the outputs of a neural network as its projection results without using the standard Radon transform or GRTs. However, this brings three side effects: 1) The number of projections, which equals the number of nodes in the neural network's output layer, is fixed; thus new neural networks are needed if one wants to change the number of projections. 2) There are no random projections involved in the GSWD-NN, as the projection results are determined by the inputs and weights of the neural network. 3) The GSWD-NN is a pseudo-metric, since it uses a vanilla neural network, rather than the Radon transform or GRTs, as its push-forward operator. Therefore, the GSWD-NN does not fit into the theoretical framework of the GSWD and does not inherit its geometric properties.

Another notable variant of the SWD is the distributional sliced Wasserstein distance (DSWD) (Nguyen et al., 2021). By finding a distribution of projections that maximizes the expected distances over these projections, the DSWD can slice distributions from multiple directions while having high projection efficiency. Injective GRTs are also used to extend the DSWD to the distributional generalized sliced Wasserstein distance (DGSWD) (Nguyen et al., 2021). Experimental results show that the DSWD and the DGSWD have superior performance in generative modelling tasks (Nguyen et al., 2021). However, neither the DSWD nor the DGSWD solves the aforementioned problem with the GSWD, i.e. they are still not able to produce nonlinear projections adaptively.

Our contribution differs from previous work in three ways: 1) The ASWD is data-adaptive, i.e. the hypersurfaces onto which high-dimensional distributions are projected can be learned from data. This implies one does not need to specify a defining function from limited choices. 2) Unlike the GSWD-NN, the ASWD takes a novel direction to incorporate neural networks into the framework of slice-based Wasserstein distances while maintaining the properties of sliced Wasserstein distances. 3) Previous work on introducing nonlinear projections into the Radon transform either is restricted to only a few candidate defining functions (GRTs) or breaks the framework of Radon transforms (neural networks in the GSWD-NN); in contrast, the spatial Radon transform provides a novel way of defining nonlinear Radon-type transforms.
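For comparison with the learned mapping in the ASWD, here is a sketch (ours) of one member of that limited class: the homogeneous degree-3 polynomial defining function for 2-D inputs, corresponding to the "polynomial of degree 3" GSWD baseline used in Section 5 (see also Appendix B). Its functional form is fixed in advance rather than learned from data.

```python
import numpy as np

def poly3_defining_function(X, theta):
    """beta(x, theta) = sum over multi-indices |alpha| = 3 of theta_alpha * x^alpha
    for x in R^2, with theta on S^{d_alpha - 1} and d_alpha = 4 degree-3 monomials."""
    x1, x2 = X[:, 0], X[:, 1]
    monomials = np.stack([x1**3, x1**2 * x2, x1 * x2**2, x2**3], axis=1)
    return monomials @ theta            # one nonlinear projection per sample
```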
## 5 EXPERIMENTS

In this section, we describe the experiments that we have conducted to evaluate the performance of the proposed distance metric. The GSWD leads to the best performance in a sliced Wasserstein flow problem reported in (Kolouri et al., 2019a), and the DSWD outperforms the compared methods in the generative modeling task examined in (Nguyen et al., 2021) on the CIFAR10 (Krizhevsky, 2009), CelebA (Liu et al., 2015), and MNIST (LeCun et al., 1998) datasets (Appendix H.2). Hence, we compare the performance of the ASWD with these state-of-the-art distance metrics in the same examples and report the results below.¹ We provide additional experimental results in the appendices, including a sliced Wasserstein autoencoder (SWAE) (Kolouri et al., 2019b) using the ASWD (Appendix I), image color transferring (Appendix J), and sliced Wasserstein barycenters (Appendix K). To examine the robustness of the ASWD, throughout the experiments we adopt the injective network architecture given in Equation (18) and set $\phi_\omega$ to be a single fully-connected layer whose output dimension equals its input dimension, with a ReLU activation function. The order $k$ is set to 2 in all experiments.

¹Code to reproduce the experiment results is available at: https://github.com/xiongjiechen/ASWD.

### 5.1 SLICED WASSERSTEIN FLOWS

We first consider the problem of evolving a source distribution $\mu$ to a target distribution $\nu$ by minimizing slice-based Wasserstein distances between $\mu$ and $\nu$, in the sliced Wasserstein flow task reported in (Kolouri et al., 2019a):

$$\partial_t\mu_t=-\nabla\,\text{SWD}(\mu_t,\nu), \tag{20}$$

where $\mu_t$ refers to the updated source distribution at iteration $t$. The SWD in Equation (20) can be replaced by the other sliced Wasserstein distances to be evaluated. As in (Kolouri et al., 2019a), the 2-Wasserstein distance is used as the metric for evaluating the performance of the different distance metrics in this task. The set of hyperparameter values used in this experiment can be found in Appendix F.1.
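A sketch of how such a flow can be run in practice (ours; `sliced_distance` stands for any differentiable estimator of the compared metrics, e.g. the ASWD sketch from Section 3.2, and the optimizer and learning rate follow Appendix F.1):

```python
import torch

def sliced_wasserstein_flow(target, sliced_distance, n_particles=500,
                            n_iters=2000, lr=2e-3):
    """Discretized flow of Equation (20): the source samples are free parameters
    that descend the chosen sliced distance to the target samples."""
    d = target.shape[1]
    mu_t = torch.randn(n_particles, d, requires_grad=True)   # mu_0 = N(0, I)
    opt = torch.optim.Adam([mu_t], lr=lr)
    history = []
    for _ in range(n_iters):
        loss = sliced_distance(mu_t, target)
        opt.zero_grad(); loss.backward(); opt.step()
        history.append(loss.item())
    return mu_t.detach(), history
```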
Figure 2: The first and third columns show target distributions. The second and fourth columns show log 2-Wasserstein distances between the target distribution and the source distribution. The horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of log 2-Wasserstein distances over 50 runs. A more extensive set of experimental results can be found in Appendix G.1.

Without loss of generality, we initialize $\mu_0$ to be the standard normal distribution $\mathcal{N}(0,I)$. We repeat each experiment 50 times and record the 2-Wasserstein distance between $\mu$ and $\nu$ at every iteration. In Figure 2, we plot the 2-Wasserstein distances between the source and target distributions as a function of the training epochs, with the 8-Gaussian, Knot, Moon, and Swiss roll distributions as the respective target distributions. For clarity, Figure 2 displays the experimental results of the 6 best-performing distance metrics out of the 12 we compared: the ASWD, the DSWD, the SWD, the GSWD-NN 1, which directly generates projections through a one-layer MLP, and the GSWDs with the polynomial-of-degree-3 and circular defining functions. We observe from Figure 2 that the ASWD not only leads to smaller 2-Wasserstein distances, but also converges faster, achieving better results with fewer iterations than the other methods for these four target distributions. A complete set of experimental results with the 12 compared distance metrics and 8 target distributions is included in Appendix G.1. The ASWD outperforms the compared state-of-the-art slice-based Wasserstein distance metrics on 7 out of the 8 target distributions, the exception being the 25-Gaussian. This is achieved with the simple injective network architecture given in Equation (18) and a one-layer fully-connected neural network with equal input and output dimensions throughout the experiments. In addition, an ablation study is conducted to examine the effect of injective neural networks, the regularization coefficient $\lambda$, the choice of the dimensionality $d_\theta$ of the augmented space, and the optimization of hypersurfaces in the ASWD. Details can be found in Appendix G.2.

### 5.2 GENERATIVE MODELING

In this experiment, we use slice-based Wasserstein distances for the generative modeling task described in (Nguyen et al., 2021). The task is to generate images using generative adversarial networks (GANs) (Goodfellow et al., 2014) trained on either the CIFAR10 dataset (64 x 64 resolution) (Krizhevsky, 2009) or the CelebA dataset (64 x 64 resolution) (Liu et al., 2015). Denoting the hidden layer and the output layer of the discriminator by $h_\psi$ and $D_\Psi$, and the generator by $G_\Phi$, we train the GAN models with the following objectives:

$$\min_\Phi \ \text{SWD}\big(h_\psi(p_r),h_\psi(G_\Phi(p_z))\big), \tag{21}$$

$$\max_{\Psi,\psi} \ \mathbb{E}_{x\sim p_r}\big[\log(D_\Psi(h_\psi(x)))\big]+\mathbb{E}_{z\sim p_z}\big[\log(1-D_\Psi(h_\psi(G_\Phi(z))))\big], \tag{22}$$

where $p_z$ is the prior of the latent variable $z$ and $p_r$ is the distribution of the real data. The SWD in Equation (21) is replaced by the ASWD and other variants of the SWD to compare their performance; a sketch of these training objectives is given below.
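The sketch is ours: `sliced_distance` is any of the compared estimators applied to the flattened features, and names such as `G`, `h`, and `D` simply mirror $G_\Phi$, $h_\psi$, and $D_\Psi$ above.

```python
import torch

def generator_loss(x_real, z, G, h, sliced_distance):
    # Equation (21): a sliced distance between discriminator features of real
    # and generated images; only the generator parameters receive gradients here.
    feat_real = h(x_real).flatten(start_dim=1).detach()
    feat_fake = h(G(z)).flatten(start_dim=1)
    return sliced_distance(feat_fake, feat_real)

def discriminator_loss(x_real, z, G, h, D, eps=1e-8):
    # Equation (22) is a maximization; minimizing its negation is equivalent.
    real_term = torch.log(D(h(x_real)) + eps).mean()
    fake_term = torch.log(1.0 - D(h(G(z).detach())) + eps).mean()
    return -(real_term + fake_term)
```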
The GSWD with the polynomial defining function and the DGSWD are not included in this experiment due to their excessively high computational cost in high-dimensional spaces. The Fréchet Inception Distance (FID score) (Heusel et al., 2017) is used to assess the quality of the generated images. More details on the network structures and the parameter setup used in this experiment are available in Appendix F.2. We run 200 and 100 training epochs to train the GAN models on the CIFAR10 and the CelebA dataset, respectively. Each experiment is repeated 10 times and the results are reported in Table 1.

Figure 3: FID scores of generative models trained with different metrics on the CIFAR10 (left) and CelebA (right) datasets with L=1000 projections. The error bar represents the standard deviation of the FID scores at the specified training epoch among 10 simulation runs.

Table 1: FID scores of generative models trained with different distance metrics. Smaller scores indicate better image quality. L is the number of projections; we run each experiment 10 times and report the average values and standard errors of the FID scores for the CIFAR10 and CelebA datasets. The running time per training iteration for one batch containing 512 samples is computed on a computer with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.3 GHz, 16GB of RAM, and an RTX 6000 graphics card with 22GB of memory.

| Dataset | L | SWD (Bonneel et al., 2015) FID | t (s/it) | GSWD (Kolouri et al., 2019a) FID | t (s/it) | DSWD (Nguyen et al., 2021) FID | t (s/it) | ASWD FID | t (s/it) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | 10 | 121.4 ± 7.0 | 0.34 | 108.3 ± 5.6 | 0.41 | 74.2 ± 3.1 | 0.55 | 65.7 ± 3.2 | 0.58 |
| CIFAR10 | 100 | 104.6 ± 5.2 | 0.35 | 105.2 ± 3.2 | 0.74 | 66.5 ± 3.9 | 0.57 | 62.5 ± 1.9 | 0.60 |
| CIFAR10 | 1000 | 102.3 ± 5.3 | 0.36 | 98.2 ± 5.1 | 2.22 | 62.3 ± 5.7 | 1.30 | 59.3 ± 3.2 | 1.38 |
| CelebA | 10 | 94.8 ± 2.5 | 0.35 | 95.1 ± 4.2 | 0.40 | 86.0 ± 1.4 | 0.53 | 81.2 ± 1.3 | 0.59 |
| CelebA | 100 | 88.7 ± 5.7 | 0.36 | 86.7 ± 3.5 | 0.75 | 76.1 ± 3.5 | 0.55 | 73.2 ± 2.6 | 0.61 |
| CelebA | 1000 | 86.5 ± 4.1 | 0.38 | 85.2 ± 6.3 | 2.19 | 71.3 ± 4.7 | 1.28 | 67.4 ± 2.1 | 1.38 |

With the same number of projections and a similar computational cost, the ASWD leads to significantly improved FID scores among all evaluated distance metrics on both datasets, which implies that images generated with the ASWD are of higher quality. Figure 3 plots the FID scores recorded during the training process. The GAN model trained with the ASWD exhibits faster convergence, as it reaches smaller FID scores with fewer epochs. Randomly selected samples of generated images are presented in Appendix H.1.

## 6 CONCLUSION

We proposed a novel variant of the sliced Wasserstein distance, namely the augmented sliced Wasserstein distance (ASWD), which is flexible, has a high projection efficiency, and generalizes well. The ASWD adaptively updates the hypersurfaces used to slice the compared distributions by learning from data. We proved that the ASWD is a valid distance metric and presented its numerical implementation. We reported the empirical performance of the ASWD against state-of-the-art sliced Wasserstein metrics in various numerical experiments. We showed that the ASWD with a simple injective neural network architecture can lead to the smallest distance errors over the majority of datasets in a sliced Wasserstein flow task and superior performance in generative modeling tasks involving GANs and VAEs. We have also evaluated the applications of the ASWD in downstream tasks including color transferring and Wasserstein barycenters. What remains to be explored includes the topological properties of the ASWD. We leave this topic as an interesting future research direction.

## REFERENCES

J. Altschuler, J. Niles-Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1964-1974, Long Beach, California, USA, 2017.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proc. International Conference on Machine Learning (ICML), pages 214-223, Sydney, Australia, 2017.

J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J. Jacobsen. Invertible residual networks. In Proc. International Conference on Machine Learning (ICML), pages 573-582, Long Beach, California, USA, 2019.

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance.
Information and Inference: A Journal of the IMA, 8(4):657 676, 2019. G. Beylkin. The inversion problem and applications of the generalized Radon transform. Communications on Pure and Applied Mathematics, 37(5):579 599, 1984. N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22 45, 2015. N. Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Paris 11, 2013. N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (IPAMI), 39(9):1853 1865, 2016. M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 2292 2300, Lake Tahoe, Nevada, USA, 2013. M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proc. International Conference on Machine Learning (ICML), pages 685 693, Beijing, China, 2014. I. Deshpande, Y. Hu, R. Sun, A. Pyrros, et al. Max-sliced Wasserstein distance and its use for GANs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10648 10656, Long Beach, California, USA, 2019. A. Dessein, N. Papadakis, and J. Rouas. Regularized optimal transport and the rot mover s distance. The Journal of Machine Learning Research (JMLR), 19(1):590 642, 2018. L. Ehrenpreis. The universality of the Radon transform, chapter 5, pages 299 363. Oxford University Press, Oxford, UK, 2003. S. Ferradans, N. Papadakis, G. Peyré, and J. Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853 1882, 2014. A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1574 1583, Okinawa, Japan, 2019. I. Goodfellow, A. J. Pouget, M. Mirza, B. Xu, F. D. Warde, et al. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 2672 2680, Montréal, Canada, 2014. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In Proc. Advances in neural information processing systems (Neur IPS), pages 5767 5777, Long Beach, California, USA, 2017. S. Helgason. The Radon transform, volume 2. Basel, Switzerland: Springer, 1980. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. Advances in neural information processing systems (Neur IPS), pages 6626 6637, Long Beach, California, USA, 2017. Published as a conference paper at ICLR 2022 A. Homan and H. Zhou. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515 1529, 2017. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700 4708, Hawaii, USA, 2017. M. Karami, D. Schuurmans, J. Sohl-Dickstein, L. Dinh, and D. Duckworth. Invertible convolutional flow. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 5636 5646, Vancouver, Canada, 2019. S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In Proc. 
Advances in Neural Information Processing Systems (Neur IPS), pages 261 272, Vancouver, Canada, 2019a. S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced-Wasserstein autoencoders. In Proc. International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019b. A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3730 3738, Las Condes, Chile, 2015. B. Muzellec and M. Cuturi. Subspace detours: Building transport plans that are optimal on subspace projections. Proc. Advances in Neural Information Processing Systems (Neur IPS), 32:6917 6928, 2019. K. Nadjahi et al. Statistical and topological properties of sliced probability divergences. Proc. Advances in Neural Information Processing Systems (Neur IPS), 33, 2020. K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. In Proc. International Conference on Learning Representations (ICLR), Vienna, Austria, 2021. F. Paty and M. Cuturi. Subspace robust Wasserstein distances. In Proc. International Conference on Machine Learning (ICML), pages 5072 5081, Long Beach, California, USA, 2019. G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. J. Radon. Uber die bestimmung von funktionen durch ihre integralwerte laengs gewisser mannigfaltigkeiten. Ber. Verh. Saechs. Akad. Wiss. Leipzig Math. Phys. Kl., 69:262, 1917. D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), pages 1530 1538, Lille, France, 2015. T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In Proc. International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018. C. Villani. Optimal Transport: old and new, volume 338. Berlin, Germany: Springer Science & Business Media, 2008. Published as a conference paper at ICLR 2022 APPENDIX A PROOF OF THE LEMMA 1 We prove that the spatial Radon transform defined with a measurable mapping g( ):Rd Rdθ is an injection on Pk(Rd) if and only if g( ) is injective. In the following contents, we use Pk(Rd) to denote a set of Borel probability measures with finite k-th moment on Rd, and f1 f2 is used to denote functions f1( ):X R and f2( ):X R that satisfy f1(x)=f2(x) for x X, and f1 f2 is used to denote functions f1( ) : X R and f2( ) : X R that satisfy f1(x) = f2(x) for certain x X. In addition, for probability measures µ and ν, we use µ ν to denote µ,ν Pk(Rd) that satisfy µ(X)=ν(X) for X Rd, and µ ν to denote µ,ν Pk(Rd) that satisfy µ(X) =ν(X) for certain X Rd. Proof. By using proof by contradiction, we first prove that if g( ) is injective, the corresponding spatial Radon transform is injective. If the spatial Radon transform defined with an injective mapping g( ) : Rd Rdθ is not injective, there exist µ, ν Pk(Rd), µ ν, such that Hµ(t,θ;g) Hν(t,θ;g) for t R and θ Sdθ 1. From Equation (14), for t R and θ Sdθ 1, the spatial Radon transform of µ can be written as: Hµ(t,θ;g)=Rˆµg(t,θ), (23) Hν(t,θ;g)=Rˆνg(t,θ), (24) where ˆµg and ˆνg refer to the push-forward measures g#µ and g#ν, respectively. 
From the assumption Hµ(t,θ;g) Hν(t,θ;g) and Equations (23) and (24), we know Rˆµg(t,θ) Rˆνg(t,θ) for t R and θ Sdθ 1, which implies ˆµg ˆνg since the Radon transform is injective. Since g( ) is injective, for all measurable X Rd, x X if and only if ˆx=g(x) g(X), which implies P(x X)=P(ˆx g(X)), P(y X)=P(ˆy g(X)). Therefore, Z g(X) dˆµg = Z X dµ, (25) Z g(X) dˆνg = Z Since ˆµg ˆνg, from Equations (25) and (26) we have: R X dν for X Rd. Hence, for X Rd: Z X d(µ ν)=0, (27) which implies µ ν, contradicting with the assumption µ ν. Therefore, if Hµ Hν, µ ν. In addition, from the definition of the spatial Radon transform in Equation (13), it is trivial to show that if µ ν, Hµ(t,θ;g) Hν(t,θ;g). Therefore, Hµ Hν if and only if µ ν, i.e. the spatial Radon transform H defined with an injective mapping g( ):Rd Rdθ is injective. We now prove that if the spatial Radon transform defined with a mapping g( ) : Rd Rdθ is injective, g( ) must be injective. Again, we use proof by contradiction. If g( ) is not injective, there exist x0,y0 Rd such that x0 = y0 and g(x0) = g(y0). For Dirac measures µ1 and ν1 defined by µ1(X)= R X δ(x x0)dx and ν1(Y)= R Yδ(y y0)dy, where X,Y Rd, we know µ1 ν1 as x0 =y0. We define variables x µ1 and y ν1. Then for variables ˆx = g(x) µ2 and ˆy = g(y) ν2, where µ2 and ν2 are push-forward measures g#µ1 and g#ν1, it is trivial to derive for X,Y Rd µ2(g(X))= Z g(X) δ(ˆx g(x0))dˆx, (28) ν2(g(Y))= Z g(Y) δ(ˆy g(y0))dˆy, (29) which implies µ2 ν2 as g(x0)=g(y0). From Equations (23), (24), (28) and (29), for t R and θ Sdθ 1: Hµ1(t,θ;g)=Rµ2(t,θ), =Rν2(t,θ), =Hν1(t,θ;g), (30) Published as a conference paper at ICLR 2022 contradicting with the assumption that the spatial Radon transform is injective. Therefore, if the spatial Radon transform is injective, g( ) must be injective. We conclude that the spatial Radon transform is injective if and only if the mapping g( ) is an injection on Pk(Rd). Published as a conference paper at ICLR 2022 APPENDIX B SPECIAL CASES OF SPATIAL RADON TRANSFORMS Remark 5. The spatial Radon transform degenerates to the vanilla Radon transform when the mapping g( ) is an identity mapping. In addition, the spatial Radon transform is equivalent to the polynomial GRT (Ehrenpreis, 2003) when g(x)=(xα1,...,xαdα ), where α is multi-indices αi =(ηi,1,...,ηi,d) Nd satisfying |αi|=Pd j=1ηi,j =m. m is the degree of the polynomial function, xαi =Qd j=1xηi,j j given an input x=(x1,...,xd) Rd, and dα is the number of all possible multi-indices αi that satisfies |αi|=m. We provide a proof for the claim in Remark 5. Proof. Given a probability measure µ P(Rd), the spatial Radon transform of µ is defined as: Hµ(t,θ;g)= Z Rdδ(t g(x),θ )dµ, (31) where t R and θ Sdθ 1 are the parameters of hypersurfaces in Rd. When the mapping g( ) is an identity mapping, i.e. g(x) = x for x Rd, the spatial Radon transform degenerates to the vanilla Radon transform: Hµ(t,θ;g)= Z Rdδ(t x,θ )dµ =Rµ(t,θ). (32) For GRTs, (Ehrenpreis, 2003) provides a class of injective GRTs named polynomial GRTs by adopting homogeneous polynomial functions with an odd degree m as the defining function: i=1 θixαi)dµ, s.t.|αi|=m, (33) where αi =(ηi,1,...,ηi,d) Nd, |αi|=Pd j=1ηi,j, xαi =Qd j=1xηi,j j for x=(x1,...,xd) Rd, dα is the number of all possible multi-indices αi that satisfies |αi|=m, and θ=(θ1,...,θdα) Sdα 1. 
In spatial Radon transform, for x Rd, when the mapping g( ) is defined as: g(x)=(xα1,...,xαdα ), (34) the spatial Radon transform is equivalent to the polynomial GRT defined in Equation (33): Hµ(t,θ;g)= Z Rdδ(t g(x),θ )dµ i=1 θixαi)dµ. (35) Published as a conference paper at ICLR 2022 APPENDIX C PROOF OF THEOREM 1 We provide a proof that the ASWD defined with a mapping g( ):Rd Rdθ is a metric on Pk(Rd), if and only if g( ) is injective. In what follows, we denote a set of Borel probability measures with finite k-th moment on Rd by Pk(Rd), and use µ,ν Pk(Rd) to refer to two probability measures defined on Rd. Proof. Symmetry: Since the k-Wasserstein distance is a metric thus symmetric (Villani, 2008): Wk Hµ( ,θ;g),Hν( ,θ;g) =Wk Hν( ,θ;g),Hµ( ,θ;g) . (36) ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Sdθ 1W k k Hν( ,θ;g),Hµ( ,θ;g) dθ 1 k =ASWDk(ν,µ;g). Triangle inequality: Given an injective mapping g( ) : Rd Rdθ and probability measures µ1, µ2, µ3 Pk(Rd), since the k-Wasserstein distance satisfies the triangle inequality (Villani, 2008), the following inequality holds: ASWDk(µ1,µ3;g)= Z Sdθ 1W k k Hµ1( ,θ;g),Hµ3( ,θ;g) dθ 1 Sdθ 1 Wk(Hµ1( ,θ;g),Hµ2( ,θ;g)) +Wk(Hµ2( ,θ;g),Hµ3( ,θ;g)) kdθ 1 Sdθ 1W k k Hµ1( ,θ;g),Hµ2( ,θ;g) dθ 1 Sdθ 1W k k Hµ2( ,θ;g),Hµ3( ,θ;g) dθ 1 =ASWDk(µ1,µ2;g)+ASWDk(µ2,µ3;g), where the second inequality is due to the Minkowski inequality in Lk(Sdθ 1). Identity of indiscernibles: Since Wk(µ,µ)=0 for µ Pk(Rd),we have ASWDk(µ,µ;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hµ( ,θ;g) dθ 1 for µ Pk(Rd). Conversely, for µ,ν Pk(Rd), if ASWDk(µ,ν;g) = 0, from the definition of the ASWD: ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Due to the non-negativity of k-th Wasserstein distance as it is a metric on Pk(Rd) and the continuity of Wk( , ) on Pk(Rd) (Villani, 2008), Wk Hµ( ,θ;g),Hν( ,θ;g) =0 holds for θ Sdθ 1 if and only if Hµ( ,θ;g) Hν( ,θ;g). Again, given the spatial Radon transform is injective when g( ) is injective (see the proof in Appendix A), Hµ( ,θ;g) Hν( ,θ;g) implies µ ν if g( ) is injective. In addition, if g( ) is not injective, the spatial Radon transform is not injective (see the proof in Appendix A), then µ, ν Pk(Rd), µ ν such that Hµ( ,θ;g) Hν( ,θ;g), which implies ASWDk(µ,ν;g)=0 for µ ν. Therefore, the ASWD satisfies the identity of indiscernibles if and only if g( ) is injective. Non-negativity: The three axioms of a distance metric, i.e. symmetry, triangle inequality, and identity of indiscernibles imply the non-negativity of the ASWD. Since the Wasserstein distance is non-negative, Published as a conference paper at ICLR 2022 for µ ,ν Pk(Rd), it can also be straightforwardly proved the ASWD between µ and ν is non-negative: ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Sdθ 10kdθ 1 Therefore, the ASWD is a metric on Pk(Rd) if and only if g( ) is injective. Published as a conference paper at ICLR 2022 APPENDIX D PROOF OF COROLLARY 1.1 We first introduce Lemma 2 to support the proof of Corollary 1.1. Lemma 2. For λ (1, + ), the optimal mapping g ( ) defined by Equation (17) satisfies ||g (x)||2 < for x Rd µ and x Rd ν. Proof. Recall that in Equation (16) the ASWD can be rewritten as: ASWDk(µ,ν;g)=SWDk(ˆµg,ˆνg) Sdθ 1W k k Rˆµg( ,θ),Rˆνg( ,θ) dθ 1 where transformed variables ˆx = g(x) ˆµg for x µ and ˆy = g(y) ˆνg for y ν, respectively. 
Combining the equation above with Equation (2): ASWDk(µ,ν;g)= Z Sdθ 1W k k Rˆµg( ,θ),Rˆνg( ,θ) dθ 1 0 |F 1 Pθ#ˆµg(z) F 1 Pθ#ˆνg(z)|kdzdθ 1 |F 1 Rθ(ˆµg)(z)|+|F 1 Rθ(ˆνg)(z)| kdzdθ 1 where # denotes the push forward operator, Pθ :x Rdθ x,θ R, and Rθ(ˆµg)=Pθ#ˆµg, Rθ(ˆνg)= Pθ#ˆνg refer to one-dimensional measures obtained by slicing ˆµg, ˆνg with a unit vector θ, F 1 Rθ(ˆµg) and F 1 Rθ(ˆνg) are inverse cumulative distribution functions (CDFs) of Rθ(ˆµg) and Rθ(ˆνg), respectively. By repeatedly applying the Minkowski s inequality to Equation (41), we obtain the following inequalities: ASWDk(µ,ν;g) Z |F 1 Rθ(ˆµg)(z)|+|F 1 Rθ(ˆνg)(z)| kdzdθ 1 0 |F 1 Rθ(ˆµg)(z)|kdz 1 0 |F 1 Rθ(ˆνg)(z)|kdz 1 0 |F 1 Rθ(ˆµg)(z)|kdzdθ 1 0 |F 1 Rθ(ˆνg)(z)|kdzdθ 1 Let s= ˆx,θ , then z =FRθ(ˆµg)(s), dz =d FRθ(ˆµg)(s): Z 1 0 |F 1 Rθ(ˆµg)(z)|kdz = Z R |s|kd FRθ(ˆµg)(s) Rdθ | ˆx,θ |kdˆµg Rd| g(x),θ |kdµ, (43) where the last two equations are obtained through the definitions of the push-forward operators. Therefore, the following inequalities hold: ASWDk(µ,ν;g) Z 0 |F 1 Rθ(ˆµg)(z)|kdzdθ 1 0 |F 1 Rθ(ˆνg)(z)|kdzdθ 1 Rd| g(x),θ |kdµdθ 1 Rd| g(y),θ |kdνdθ 1 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 . (44) Published as a conference paper at ICLR 2022 Then we obtain the following inequalities for the optimization objective: ASWDk(µ,ν;g) L(µ,ν,λ;g) 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 λ(E 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 ) 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 . (45) When we set λ (1,+ ), if x Rd µ or y Rd ν such that ||g(x)||2 or ||g(y)||2 , the optimization objective approaches negative infinity, implying g( ) is not the optimal mapping. Therefore, by adopting Equation (17) as the optimization objective, the optimal mapping g ( ) satisfies ||g (x)||2 < for x Rd µ and x Rd ν. Remark 6. It is worth noting that λ> 1 in Corollary 1.1 is a sufficient condition for the supremum of the optimization objective to be non-positive and ||g (x)||2 < for x Rd µ and x Rd ν. 0 < λ 1 can also lead to finite ||g(x)||2 in various scenarios. Specifically, the upper bound of the optimization objective given in Equation (44) is obtained by applying: | g(x),θ |=||g(x)||2|cos(α)| ||g(x)||2, (46) where α is the angle between θ and g(x). In high-dimensional spaces, Equation (46) gives a very loose bound since in high-dimensional spaces the majority of sampled θ would be nearly orthogonal to g(x) and cos(α) is nearly zero with high probability (Kolouri et al., 2019a). Empirically we found that across all the experiment results, λ in a candidate set of {0.01,0.05,0.1,0.5,1,10,100} all lead to finite g ( ). We now prove Corollary 1.1, i.e ASWDk(µ,ν;g ) is a metric on Pk(Rd) , where g ( ) is the optimal mapping defined by Equation (17) for λ (1,+ ). Proof. Symmetry: Since the k-Wasserstein distance is a metric thus symmetric (Villani, 2008): Wk Hµ( ,θ;g ),Hν( ,θ;g ) =Wk Hν( ,θ;g ),Hµ( ,θ;g ) . (47) ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Sdθ 1W k k Hν( ,θ;g ),Hµ( ,θ;g ) dθ 1 k =ASWDk(ν,µ;g ). Triangle inequality: From Lemma 2, when λ (1, + ), the optimal mapping g ( ) satisfies ||g (x)||2 < for x Rd µ and x Rd ν, hence ASWDk(ν,µ;g ) is finite due to Equation (16). We then prove that ASWDk(ν,µ;g ) satisfies the triangle inequality. 
Denote by g 1, g 2, and g 3 optimal mappings that result in the supremum of Equation (19) between µ1 and µ2, µ1 and µ3, µ2 and µ3, respectively, since ASWDk(µ1,µ2;g 1), ASWDk(µ1,µ3;g 2), and ASWDk(µ2,µ3;g 3) are finite, the following equations hold: ASWDk(µ1,µ2;g 1) ASWDk(µ1,µ3;g 1)+ASWDk(µ2,µ3;g 1) (48) sup g {ASWDk(µ1,µ3;g)}+sup g {ASWDk(µ2,µ3;g)} (49) =ASWDk(µ1,µ3;g 2)+ASWDk(µ2,µ3;g 3), (50) where the first inequality are from the metric property of the ASWD. Identity of indiscernibles: Since Wk(µ,µ)=0 for µ Pk(Rd),we have ASWDk(µ,µ;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hµ( ,θ;g ) dθ 1 Published as a conference paper at ICLR 2022 for µ Pk(Rd). Conversely, for µ,ν Pk(Rd), if ASWDk(µ,ν;g )=0, from Equation (15): ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Due to the non-negativity of k-th Wasserstein distance as it is a metric on Pk(Rd) and the continuity of Wk( , ) on Pk(Rd) (Villani, 2008), Wk Hµ( ,θ;g ), Hν( ,θ;g ) =0 holds for θ Sdθ 1, which implies Hµ( ,θ;g ) Hν( ,θ;g ) for θ Sdθ 1. Therefore, given the spatial Radon transform is injective when g ( ) is injective, Hµ( ,θ;g ) Hν( ,θ;g ) implies µ ν. Non-negativity: Since the Wasserstein distance is non-negative, for µ ,ν Pk(Rd), the ASWD defined with optimal mappings g( ) between µ and ν is also non-negative: ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Sdθ 10kdθ 1 Therefore, the ASWD defined with the optimal mapping g ( ) is non-negative, symmetric, and satisfies the triangle inequality and the identity of indiscernibles, i.e. the ASWD defined with optimal mappings g ( ) is also a metric. Published as a conference paper at ICLR 2022 APPENDIX E PSEUDOCODE FOR THE EMPIRICAL VERSION OF THE ASWD Algorithm 1 The augmented sliced Wasserstein distance. All of the for loops can be parallelized. Require: Sets of samples {xn Rd}N n=1, {yn Rd}N n=1; Require: Randomly initialized injective neural network gω( ):Rd Rdθ; Require: Number of projections L, hyperparameter λ, learning rate ϵ, number of iterations M; 1: Initialize D=0,Lλ =0,m=1; 2: while ω has not converged and m M do 3: Draw a set of samples {θl}L l=1 from Sdθ 1; 4: for n=1 to N do 5: Compute gω(xn) and gω(yn); 6: Calculate the regularization term Lλ Lλ+λ ( ||gω(xn)||k 2 N ) 1 k +( ||gω(yn)||k 2 N ) 1 k ; 7: end for 8: for l=1 to L do 9: Compute β(xn,θl)= gω(xn),θl , β(yn,θl)= gω(yn),θl for each n; 10: Sort β(xn,θl) and β(yn,θl) in ascending order s.t. β(x Ilx[n],θl) β(x Ilx[n+1],θl) and β(y Ily[n],θl) β(y Ily[n+1],θl); 11: Calculate the ASWD: D D+( 1 L PN n=1|β(x Ilx[n],θl) β(y Ily[n],θl)|k) 1 k ; 12: end for 13: L D Lλ; 14: Update ω by gradient ascent ω ω+ϵ ωL; 15: Reset D=0,Lλ =0, update m m+1; 16: end while 17: Draw a set of samples {θl}L l=1 from Sdθ 1; 18: for n=1 to N do 19: Compute gω(xn) and gω(yn); 20: end for 21: for l=1 to L do 22: Compute β(xn,θl)= gω(xn),θl , β(yn,θl)= gω(yn),θl for each n; 23: Sort β(xn, θl) and β(yn, θl) in ascending order s.t. β(x Ilx[n], θl) β(x Ilx[n+1], θl) and β(y Ily[n],θl) β(y Ily[n+1],θl); 24: Calculate the ASWD: D D+( 1 L PN n=1|β(x Ilx[n],θl) β(y Ily[n],θl)|k) 1 k ; 25: end for 26: Output: Augmented sliced Wasserstein distance D. In Algorithm 1, Equation (17) is used as the optimization objective, where the regularization term L(µ,ν,λ;gω)=λ(E 1 kx µ ||gω(x)||k 2 +E 1 ky ν ||gω(y)||k 2 ) is used. This particular choice of the regularization term facilitates the proofs to Lemma 2 and subsequently to Corollary 1.1 with details in Appendix D. 
In fact, we have also examined other types of regularization terms such as the L2 norm of the output of g( ), and empirically they produce similar numerical results as the current regularization term. Published as a conference paper at ICLR 2022 APPENDIX F EXPERIMENTAL SETUPS F.1 HYPERPARAMETERS IN THE SLICED WASSERSTEIN FLOW EXPERIMENT We randomly generate 500 samples both for target distributions and source distributions. We initialize the source distributions µ0 as standard normal distributions N(0,I), where I is a 2-dimensional identity matrix. We update source distributions using Adam optimizer, and set the learning rate=0.002. For all methods, we set the order k = 2. When testing the ASWD, the number of iterations M in Algorithm 1 is set to 10. In the sliced Wasserstein flow experiment the mapping g ( ) optimized by maximizing Equation (17) was found to be finite for all values of λ in the set of {0.01, 0.05, 0.1, 0.5, 1, 10, 100}, which is a sufficient condition for the ASWD to be a valid metric as shown in the proof of corollary 1.1 provided in Appendix D. In addition, numerical results presented in Appendix G.2 indicate that empirical errors in the experiment are not sensitive to the choice of λ in the candidate set {0.01, 0.05, 0.1, 0.5}. The reported results in the main paper are produced with λ=0.1. F.2 NETWORK ARCHITECTURE IN THE GENERATIVE MODELING EXPERIMENT Denote a convolutional layer whose kernel size is s with C kernels by Conv C(s s), and a fully-connected layer whose input and output layer have s1 and s2 neurons by FC(s1 s2). The network structure used in the generative modeling experiment is configured to be the same as described in (Nguyen et al., 2021): hψ :(64 64 3) Conv64(4 4) Leaky Re LU(0.2) Conv128(4 4) Batch Normalization Leaky Re LU(0.2) Conv256(4 4) Batch Normalization Leaky Re LU(0.2) Conv512(4 4) Batch Normalization Tanh Output (512 4 4) DΨ :Conv1(4 4) Sigmoid Output (1 1 1) GΦ :z R100 Conv Transpose512(4 4) Batch Normalization Re LU Conv Transpose256(4 4) Batch Normalization Re LU Conv Transpose128(4 4) Batch Normalization Re LU Conv Transpose64(4 4) Batch Normalization Conv Transpose3(4 4) Tanh Ouput (64 64 3) φ:FC(8192 8192) Output (8192)-dimensional vector We train the models with the Adam optimizer, and set the batch size to 512. Following the setup in (Nguyen et al., 2021), the learning rate is set to 0.0005 and beta=(0.5, 0.999) for both CIFAR10 dataset and Celeb A dataset. For all methods, we set the order k to 2. For the ASWD, the number of iterations M in Algorithm 1 is set to 5. The hyperparameter λ is set to 1.01 to guarantee that the ASWD being a valid metric and introduce slightly larger regularization of the optimization objective due to the small output values from the feature layer hψ. Published as a conference paper at ICLR 2022 APPENDIX G ADDITIONAL RESULTS IN THE SLICED WASSERSTEIN FLOW EXPERIMENT G.1 FULL EXPERIMENTAL RESULTS ON THE SLICED WASSERSTEIN EXPERIMENT Figure 4 shows the full experimental results on the sliced Wasserstein flow experiment. Figure 4: Full experimental results on the sliced Wasserstein flow example. The first and third columns are target distributions. The second and fourth columns are log 2-Wasserstein distances between the target distributions and the source distributions. The horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of log 2-Wasserstein distances over 50 runs. 
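For reference, a PyTorch sketch of the feature extractor $h_\psi$, the discriminator head $D_\Psi$, and $\phi_\omega$ listed above (ours; the listing specifies only the kernel sizes, so the strides and paddings below are our assumption of standard DCGAN values, and the generator $G_\Phi$ is omitted for brevity):

```python
import torch.nn as nn

# h_psi: (3, 64, 64) -> (512, 4, 4). Strides/paddings assumed (DCGAN-style);
# the listing above specifies only the 4x4 kernels and the layer order.
h_psi = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.BatchNorm2d(512), nn.Tanh(),
)

# D_Psi: (512, 4, 4) -> a single probability per image.
D_Psi = nn.Sequential(nn.Conv2d(512, 1, 4), nn.Sigmoid(), nn.Flatten())

# phi_omega inside g_omega(x) = [x, phi_omega(x)]: one fully-connected layer on
# the flattened 512 * 4 * 4 = 8192-dimensional features.
phi_omega = nn.Sequential(nn.Flatten(), nn.Linear(8192, 8192))
```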
G.2 ABLATION STUDY

Impact of the injectivity and optimization of the mapping. In this ablation study, we compare ASWDs constructed with different mappings to GSWDs with different predefined defining functions, and investigate the effects of the optimization and the injectivity of the mapping $g_\omega(\cdot)$ used in the ASWDs. In what follows, "ASWD-vanilla" denotes ASWDs that employ a randomly initialized neural network $\phi_\omega(\cdot)$ to parameterize the injective mapping $g_\omega(\cdot)=[\cdot,\phi_\omega(\cdot)]$, i.e. the mapping $g_\omega(\cdot)$ is not optimized in the ASWD-vanilla, and the results of the ASWD-vanilla reported in Figure 5 are obtained by slicing with random hypersurfaces. Furthermore, "ASWD-non-injective" refers to ASWDs that do not use the injectivity trick, i.e. the mapping $g_\omega(\cdot)=\phi_\omega(\cdot)$ is not guaranteed to be injective. In addition, "ASWD-vanilla-non-injective" adopts both setups of the "ASWD-vanilla" and the "ASWD-non-injective", resulting in a random non-injective mapping $g_\omega(\cdot)$. The reported results in this ablation study are computed over 50 runs, and the neural network $\phi_\omega(\cdot)$ is reinitialized randomly in each run.

From Figure 5, it can be observed that the ASWD-vanilla shows performance comparable to GSWDs defined with polynomial and circular defining functions, which implies that GSWDs with predefined defining functions are as uninformative as slicing distributions with the random hypersurfaces constructed in the ASWD-vanilla. In GSWDs, the hypersurfaces are predefined and cannot be optimized, since they are determined by the functional forms of the defining functions. On the contrary, we found that the optimization of hypersurfaces in the ASWD framework helps improve the performance of the slice-based Wasserstein distance. As shown in Figure 5, the ASWD and the ASWD-non-injective exhibit significantly better performance than methods that do not optimize their hypersurfaces (ASWD-vanilla, ASWD-vanilla-non-injective, and GSWDs). In terms of the impact of the injectivity of the mapping $g_\omega$, in this experiment the ASWD-vanilla exhibits smaller 2-Wasserstein distances than the ASWD-vanilla-non-injective for all tested distributions, and the ASWD leads to more stable training than the ASWD-non-injective. Therefore, the injectivity of the mapping $g_\omega(\cdot)$ not only guarantees that the ASWD is a valid distance metric, as proved in Section 3, but also leads to better empirical performance in this experimental setup.

Impact of the regularization coefficient. We also evaluated the sensitivity of the performance of the ASWD with respect to the regularization coefficient $\lambda$. The ASWD is evaluated with different values of $\lambda$ and compared with other slice-based Wasserstein metrics in this ablation study. The numerical results presented in Figure 6 indicate that different values of $\lambda$ in the candidate set $\{0.01, 0.05, 0.1, 0.5\}$ lead to similar performance of the ASWD, i.e. the performance of the ASWD is not sensitive to $\lambda$. Additionally, the ASWDs with different values of $\lambda$ in the candidate set outperform the other evaluated slice-based Wasserstein metrics. We have also evaluated the performance of the ASWD when $\lambda$ is set well beyond the candidate set. Specifically, as presented in Figure 7, when $\lambda$ is set to large values, e.g. 10 or 100, the resulting ASWD exhibits degraded performance on par with the SWD. This is consistent with our expectation that excessive regularization eliminates nonlinearity, as discussed in Remark 2, leading to performance similar to that of the SWD.
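The contrast between the injective and non-injective variants above can be made concrete with a short sketch. The snippet below is a hedged illustration of the injective construction $g_\omega(x)=[x,\phi_\omega(x)]$ of Equation (18) and of its non-injective counterpart used in the ASWD-non-injective; the hidden width and output dimensions are illustrative choices, not the architectures used in the experiments.

```python
import torch
import torch.nn as nn


class InjectiveAugment(nn.Module):
    """g_omega(x) = [x, phi_omega(x)]: concatenating the input makes the mapping injective."""
    def __init__(self, d, d_extra):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d_extra))

    def forward(self, x):
        return torch.cat([x, self.phi(x)], dim=-1)   # output dimension d_theta = d + d_extra


class NonInjectiveMap(nn.Module):
    """ASWD-non-injective: the raw network output, with no injectivity guarantee."""
    def __init__(self, d, d_theta):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d_theta))

    def forward(self, x):
        return self.phi(x)
```

Freezing the parameters of either module (i.e. skipping the optimization over ω) corresponds to the ASWD-vanilla and ASWD-vanilla-non-injective baselines, respectively.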
In addition, the effect of the regularization term on the performance of the Max-GSWD-NN was also investigated in this ablation study. The performance of the Max-GSWD-NN and that of the Max-GSWD-NN trained with the regularization term used in the ASWD are compared in Figure 8. From the numerical results presented in Figure 8, the Max-GSWD-NN with regularization performs similarly to the Max-GSWD-NN without regularization, implying that the performance gap between the ASWD and the Max-GSWD-NN is not due to the introduction of the regularization term.

Choice of injective mapping. We report in Figure 9 the performance of the ASWD defined with injective mappings other than that of Equation (18). In particular, we examined two invertible mappings, the planar flow and the radial flow (Rezende and Mohamed, 2015), as alternatives to the injective mapping defined by Equation (18). The numerical results presented in Figure 9 show that the ASWDs defined with the planar flow and the radial flow produce better performance than the GSWD variants in most setups. They exhibit slightly worse performance than the ASWD with the injective mapping defined in Equation (18), possibly due to the additional restrictions that invertibility imposes on the planar and radial flows.

Choice of the dimensionality dθ of the augmented space. To investigate how the dimensionality $d_\theta$ of the augmented space affects the performance of the ASWD, different choices of $d_\theta$ are employed in the ASWD. Specifically, the injective network architecture $g_\omega(x)=[x,\phi_\omega(x)]: \mathbb{R}^d \to \mathbb{R}^{d_\theta}$ given in Equation (18) is adopted, and $\phi_\omega$ is set to a single fully-connected neural network whose output dimension equals 1, 2, 3, or 4 times its input dimension, i.e. $d_\theta \in \{2d, 3d, 4d, 5d\}$, where $d$ is the dimensionality of $x$. The numerical results are presented in Figure 10; the ASWDs produce similar results across the different choices of $d_\theta$. It can also be observed in Figure 10 that the ASWDs with different choices of $d_\theta$ consistently produce better performance than the other evaluated slice-based Wasserstein metrics.

Figure 5: Ablation study on the impact of injective neural networks and the optimization of hypersurfaces on the ASWD. ASWDs with different mappings are compared to GSWDs with different defining functions. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 6: Ablation study on the impact of the regularization coefficient λ. The performance of the ASWDs with different values of λ is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 7: Ablation study on the impact of large values of λ.
The performance of the ASWDs with large values of λ, e.g. 10 and 100, is compared with that of the SWD. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 8: Ablation study on the impact of the regularization term on the performance of the Max-GSWD-NN. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 9: Ablation study on the impact of the choice of injective networks. The performance of the ASWDs with different types of injective networks is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 10: Ablation study on the choice of the dimensionality dθ of the augmented space. The performance of the ASWDs with different choices of dθ is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

APPENDIX H ADDITIONAL RESULTS IN THE GENERATIVE MODELING EXPERIMENT

H.1 SAMPLES OF GENERATED IMAGES OF THE CIFAR10 AND CELEBA DATASETS

Figure 11: Visualized experimental results of the ASWD on the CelebA and CIFAR10 datasets with 10, 100, and 1000 projections. Panels: (a) CelebA (L=10), (b) CelebA (L=100), (c) CelebA (L=1000), (d) CIFAR10 (L=10), (e) CIFAR10 (L=100), (f) CIFAR10 (L=1000). The first row shows randomly selected samples of generated CelebA images; the second row shows randomly selected samples of generated CIFAR10 images.

H.2 EXPERIMENT RESULTS ON THE MNIST DATASET

In the generative modeling experiment on the MNIST dataset, we train a generator by minimizing different slice-based Wasserstein metrics, including the ASWD, the DSWD, the GSWD (circular), and the SWD. Denoting by $G_\Phi$ the generator, the training objective of the experiment can be formulated as (Bernton et al., 2019):

$$\min_\Phi \mathbb{E}_{x\sim p_r,\, z\sim p_z}\big[\mathrm{SWD}(x, G_\Phi(z))\big], \qquad (54)$$

where $p_z$ and $p_r$ are the prior of the latent variable $z$ and the real data distribution, respectively. In other words, the SWD, or another slice-based Wasserstein metric, can be considered as a discriminator in this framework.
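The following is a hedged sketch of the training objective in Equation (54): a batch of generated images is compared to a batch of real images with a slice-based distance, and the generator is updated to minimize it. The SWD computed directly on flattened images stands in for the ASWD/DSWD/GSWD, the toy fully-connected generator replaces GΦ, and the placeholder data tensor replaces the MNIST loader; only the batch size, learning rate, and Adam betas mirror the settings reported above.

```python
import torch
import torch.nn as nn


def sliced_distance(x, y, L=1000, k=2):
    """Monte Carlo SWD between two batches of flattened images of shape (batch, dim)."""
    theta = torch.randn(L, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    px, _ = torch.sort(x @ theta.T, dim=0)
    py, _ = torch.sort(y @ theta.T, dim=0)
    return (px - py).abs().pow(k).mean().pow(1.0 / k)


generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())  # toy G_Phi
opt = torch.optim.Adam(generator.parameters(), lr=0.0005, betas=(0.5, 0.999))
real_batches = torch.rand(100, 512, 784)        # placeholder for batches of real MNIST images

for real in real_batches:                       # real: (512, 784)
    z = torch.randn(real.shape[0], 100)         # z ~ p_z
    fake = generator(z)
    loss = sliced_distance(real, fake)          # Monte Carlo estimate of Equation (54)
    opt.zero_grad()
    loss.backward()
    opt.step()
```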
By replacing the SWD with the ASWD, the DSWD, and the GSWD, we compare the performance of generative models trained with the different metrics. In this experiment, the methods are compared using different numbers of projections, L ∈ {10, 1000}. The 2-Wasserstein distance and the SWD between generated images and real images are used as metrics for evaluating the performance of the different generative models. The experimental results are presented in Figure 12.

Figure 12: Visualized experimental results of different slice-based Wasserstein metrics on the MNIST dataset with 10 and 1000 projections. (a) Comparison between the SWD, the GSWD, the DSWD, and the ASWD using the 2-Wasserstein distance between fake and real images as the evaluation metric. (b) Comparison between the SWD, the GSWD, the DSWD, and the ASWD using the SWD between fake and real images as the evaluation metric.

Quality of generated images and convergence rate: It can be observed from Figure 12 that the ASWD outperforms all the other methods with respect to both the 2-Wasserstein distance and the SWD between generated and real images. In particular, the generative model trained with the ASWD produces smaller 2-Wasserstein distances in fewer iterations, which implies that the generated images are of higher quality and that the ASWD leads to faster convergence of generative models. In addition, the ASWD is able to generate higher-quality images than the SWD and the GSWD with 1000 projections while using as few as 10 projections. In other words, the ASWD has higher projection efficiency than the other slice-based Wasserstein metrics. The ASWD also attains the smallest SWD, as shown in Figure 12. Although the SWD converges slightly faster than the ASWD in terms of the SWD between fake and real images, this is because the training objective and the evaluation metric coincide for the SWD. Randomly selected images generated with different slice-based Wasserstein metrics are presented in Figure 13.

Computation cost of the ASWD: The execution time per mini-batch (512 samples) of the different methods is compared in Figure 14a. We evaluate the SWD by varying the number of projections in the set {10, 1000, 10000} and all the other methods in the set {10, 1000}. We found that although the SWD requires much less computation time than the DSWD and the ASWD, the quality of the generated data is poor even when the number of projections L increases to 10000. The GSWD is also computationally efficient when using 10 projections, but it requires the longest execution time and yields the largest 2-Wasserstein distance among all compared methods when the number of projections increases to 1000. The large difference in the execution time of the GSWD between 10 and 1000 projections arises because the GSWD needs to compute distance matrices of shape N × L, where N and L are the numbers of samples and projections, respectively, which is more computationally expensive than computing inner products when the number of projections L increases. The DSWD requires a computation time similar to that of the ASWD in this example, while the ASWD generates higher-quality images in terms of the 2-Wasserstein distance. Besides, we have also evaluated the effect of the batch size on the computation cost of the different slice-based Wasserstein metrics.
Due to the out-of-memory error caused by the excessively high computation cost of the GSWD in high-dimensional spaces, the GSWD is not included in this comparison. Specifically, the ASWD, the DSWD, and the SWD are compared in this experiment, and the computation times of the evaluated methods with L ∈ {10, 1000} projections and $N \in \{2^{13}, 2^{14}, 2^{15}, 2^{16}\}$ samples are reported in Figure 14b. From the results presented in Figure 14b, it can be observed that, similar to the computational complexity of the DSWD and the SWD, the computational complexity of the ASWD empirically tends to scale as O(N log N).

Figure 13: Randomly selected samples generated with different metrics; 10 and 1000 refer to the number of projections. Panels: ASWD-10, SWD-10, DSWD-10, GSWD-10, ASWD-1000, SWD-1000, DSWD-1000, GSWD-1000.

Figure 14: (a) The execution time of the different methods and the 2-Wasserstein distance between the real images and the fake images generated by their corresponding models. Each dot on the SWD curve corresponds to the performance of the SWD with the number of projections L = 10, 1000, and 10000, in sequence. Each dot on the other curves corresponds to the performance of the other methods with the number of projections L = 10 and 1000, in sequence. (b) Computation cost of calculating the different metrics; the horizontal axis is the number of samples from the compared distributions, and the vertical axis is the average time consumed to calculate the distances. It can be seen that the evaluated metrics tend to scale as O(N log N).

APPENDIX I SLICED WASSERSTEIN AUTOENCODERS

We train an autoencoder using the framework proposed in (Kolouri et al., 2019b), where an encoder and a decoder are jointly trained by minimizing the following objective:

$$\min_{\phi,\psi}\ \mathrm{BCE}(\psi(\phi(x)),x) + L_1(\psi(\phi(x)),x) + \mathrm{SWD}(p_z,\phi(x)), \qquad (55)$$

where $\phi$ is the encoder, $\psi$ is the decoder, $p_z$ is the prior distribution of the latent variable, $\mathrm{BCE}(\cdot,\cdot)$ is the binary cross entropy loss between reconstructed images and real images, and $L_1(\cdot,\cdot)$ is the L1 loss between reconstructed images and real images. We train this model using different slice-based Wasserstein metrics, including the ASWD, the DSWD, the SWD, and the GSWD. Here we use the ring distribution shown in Figure 16 as the prior distribution. We report the binary cross entropy loss at test time and the 2-Wasserstein distance between the prior and the encoded latent variable φ(x) in Figure 15. The slice-based Wasserstein metric used as the third term in Equation (55) is also recorded at each iteration and presented in Figure 15, in order to analyze the factors that cause the differences in the performance of models trained with different slice-based Wasserstein metrics. It can be observed from the first two columns of Figure 15 that, while the models trained with the ASWD, the SWD, and the DSWD lead to similar 2-Wasserstein distances between the encoded latent distribution and the prior distribution, which implies that they have similar coverage of the prior distribution as shown in Figure 16, the model trained with the ASWD converges slightly faster to a smaller binary cross entropy loss than the others. As shown in Figures 15(b) and 15(c), since the obtained GSWD values are trivially small compared with the other metrics, models trained with the GSWD effectively focus only on the reconstruction loss and therefore produce reconstructed images of higher quality in terms of the binary cross entropy.
However, although the models trained with the GSWD (polynomial) and the GSWD (circular) lead to smaller binary cross entropy losses than the ASWD, their latent distributions present very different structures from the specified prior distribution, which can be problematic in applications where the support of the latent distribution is required to lie within a particular range. Some MNIST images randomly generated by SWAEs trained with different metrics are given in Figure 17.

Figure 15: Convergence behavior of SWAEs trained with different slice-based Wasserstein metrics. (a) The 2-Wasserstein distance between the prior distribution pz and the distribution of the encoded features φ(x). (b) The binary cross entropy loss between the reconstructions and the real data. (c) The slice-based Wasserstein metric used to train the model.

Figure 16: Comparison of the encoded latent spaces generated by different slice-based Wasserstein metrics. Panels: (a) ASWD latent space, (b) DSWD latent space, (c) SWD latent space, (d) GSWD (poly 3) latent space, (e) GSWD (circular) latent space, (f) prior distribution.

Figure 17: MNIST images randomly generated by SWAEs trained with different metrics. Panels: (a) ASWD samples, (b) DSWD samples, (c) SWD samples, (d) GSWD (poly 3) samples, (e) GSWD (circular) samples, (f) MNIST samples.

APPENDIX J COLOR TRANSFERRING

Color transferring can be formulated as an optimal transport problem (Bonneel et al., 2015; Radon, 1917). In this task, the color palette of a source image is transformed to match that of a target image, while the content of the source image is kept unchanged. To achieve this, optimal transport can be used to find an alignment between image pixels by computing the optimal mapping between color palettes. In this experiment, instead of solving for the optimal mapping in the original space, we first project the distributions onto one-dimensional spaces and average the alignments between the one-dimensional samples as an approximation of the optimal mapping in the original space. After obtaining this approximation, we replace the pixels of the source image with the averaged corresponding pixels of the target image. To reduce the computational cost, we adopt the approach proposed in (Muzellec and Cuturi, 2019): the K-means algorithm is used to cluster the pixels of both the source and the target images, and color transfer is then performed on the quantized images, whose pixels consist of the centers of 3000 clusters, rather than on the original source and target images. We present the results of color transferring in Figure 18. It can be observed that the ASWD and the DSWD produce sharper images than the SWD and the GSWD (polynomial); we conjecture that this is because the ASWD and the DSWD generate better alignments of pixels. The GSWD (circular) tends to generate high-contrast images, but sometimes produces images with low color diversity, as in Figure 18(a). The Max-SWD has the highest contrast among all methods, but this is because it uses only a single projection to obtain the transport mapping, so there is no need to average different pixels from the target image. A disadvantage of the Max-SWD is that the transferred images it generates are not smooth and do not look realistic. The ASWD generates smoother and more realistic images than the SWD and the Max-SWD, even when the number of projections is as small as 10. Transferred images obtained by transferring colors from the source to the target using standard optimal transport maps (Ferradans et al., 2014) are also presented in Figure 18 for reference.
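A hedged sketch of the projection-and-average step described above is given below: for each random direction, the sorted 1-d projections define the one-dimensional optimal transport alignment, and each source pixel is replaced by the average of its matched target pixels over the projections. The function name and defaults are illustrative, the source and target are assumed to have the same number of pixels (e.g. the 3000 K-means centers), and the learned nonlinear projections of the ASWD/DSWD are not reproduced here; plain random linear directions are used instead.

```python
import numpy as np


def sliced_color_transfer(source_pixels, target_pixels, n_projections=50):
    """source_pixels, target_pixels: (N, 3) RGB arrays with the same N (e.g. K-means centers)."""
    x = source_pixels.astype(np.float64)
    y = target_pixels.astype(np.float64)
    matched = np.zeros_like(x)
    for _ in range(n_projections):
        theta = np.random.randn(3)
        theta /= np.linalg.norm(theta)          # random direction in RGB space
        ix = np.argsort(x @ theta)              # ranks of the projected source pixels
        iy = np.argsort(y @ theta)              # ranks of the projected target pixels
        # The 1-d optimal transport plan matches pixels of equal rank; accumulate,
        # for every source pixel, the target pixel matched under this projection.
        matched[ix] += y[iy]
    return matched / n_projections              # averaged corresponding target pixels
```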
Figure 18: The top rows show the source and target images; the lower rows show transferred images obtained using different methods with different numbers of projections. Source and target images are from (Bonneel et al., 2015) and https://github.com/chia56028/Color-Transfer-between-Images.

APPENDIX K SLICED WASSERSTEIN BARYCENTER

Sliced Wasserstein distances can also be applied to barycenter computation and shape interpolation (Bonneel et al., 2015). Here we compare barycenters produced by different slice-based Wasserstein metrics, including the GSWD (circular and polynomial), the ASWD, the SWD, and the DSWD. Specifically, we compute barycenters of different shapes consisting of point clouds, as shown in Figure 19. Each object in Figure 19 corresponds to a barycenter with a specific set of weights. The Wasserstein barycenters, which provide geometrically meaningful barycenters at the expense of a significantly higher computational cost, are also presented in Figure 19 for reference. Formally, a sliced Wasserstein barycenter of objects $\mu=\{\mu_1,\mu_2,\dots,\mu_N\} \subset P_k(\mathbb{R}^d)$ assigned weights $w=[w_1,w_2,\dots,w_N] \in \mathbb{R}^N$ is defined as:

$$\mathrm{Bar}(\mu,w) = \underset{\bar{\mu}\in P_k(\mathbb{R}^d)}{\arg\min}\ \sum_{i=1}^{N} w_i\,\mathrm{SWD}(\bar{\mu},\mu_i). \qquad (56)$$

In this experiment, we set N = 3 and compute barycenters corresponding to different weights. The results are given in Figure 19. From Figure 19, it can be observed that the ASWD produces barycenters similar to those of the SWD, which are sharper than those of the DSWD and more meaningful than those of the GSWD (polynomial). The flexibility of the injective neural network g(·) and its optimization in the ASWD can potentially be combined with task-specific requirements to generate calibrated barycenters; we leave this as a future research direction.

Figure 19: Sliced Wasserstein barycenters generated by the ASWD, the GSWD, the DSWD, and the SWD, together with the Wasserstein barycenter. Panels: (a) ASWD barycenters, (b) GSWD (circular) barycenters, (c) GSWD (polynomial) barycenters, (d) DSWD barycenters, (e) SWD barycenters, (f) Wasserstein barycenter.
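To illustrate how Equation (56) can be approached for point clouds, the following is a hedged sketch in which the barycenter's support points are optimized by gradient descent on the weighted sum of sliced Wasserstein distances to the input shapes; the plain SWD stands in for any of the slice-based metrics compared above, and all hyperparameters and function names are illustrative assumptions rather than the settings used to produce Figure 19.

```python
import torch


def swd(x, y, L=50, k=2):
    """Monte Carlo sliced Wasserstein distance between equally sized point clouds of shape (N, d)."""
    theta = torch.randn(L, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    px, _ = torch.sort(x @ theta.T, dim=0)
    py, _ = torch.sort(y @ theta.T, dim=0)
    return (px - py).abs().pow(k).mean().pow(1.0 / k)


def sliced_barycenter(shapes, weights, n_iters=1000, lr=0.01):
    """shapes: list of (N, 2) point clouds with equal N; weights: list of floats summing to 1."""
    n_points = shapes[0].shape[0]
    bary = torch.randn(n_points, 2, requires_grad=True)     # initial barycenter support points
    opt = torch.optim.Adam([bary], lr=lr)
    for _ in range(n_iters):
        loss = sum(w * swd(bary, s) for w, s in zip(weights, shapes))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return bary.detach()
```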