# AUGMENTED SLICED WASSERSTEIN DISTANCES

Published as a conference paper at ICLR 2022

Xiongjie Chen¹, Yongxin Yang²,³, and Yunpeng Li¹
¹University of Surrey, ²University of Edinburgh, ³Huawei Noah's Ark Lab
¹{xiongjie.chen, yunpeng.li}@surrey.ac.uk, ²yongxin.yang@ed.ac.uk

While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through random projection, yet they suffer from low accuracy if the number of projections is not sufficiently large, because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.

## 1 INTRODUCTION

Comparing samples from two probability distributions is a fundamental problem in statistics and machine learning. The optimal transport (OT) theory (Villani, 2008) provides a powerful and flexible theoretical tool to compare degenerate distributions by accounting for the metric in the underlying spaces. The Wasserstein distance, which arises from the optimal transport theory, has become an increasingly popular choice in various machine learning domains ranging from generative models to transfer learning (Gulrajani et al., 2017; Arjovsky et al., 2017; Kolouri et al., 2019b; Cuturi and Doucet, 2014; Courty et al., 2016).

Despite its favorable properties, such as robustness to disjoint supports and numerical stability (Arjovsky et al., 2017), the Wasserstein distance suffers from high computational complexity, especially when the sample size is large. Besides, since the Wasserstein distance is itself the result of an optimization problem, it is non-trivial to integrate it into an end-to-end training pipeline of deep neural networks, unless one can make the solver for the optimization problem differentiable. Recent advances in computational optimal transport methods focus on alternative OT-based metrics that are computationally efficient and solvable via a differentiable optimizer (Peyré and Cuturi, 2019). Entropy regularization is introduced in the Sinkhorn distance (Cuturi, 2013) and its variants (Altschuler et al., 2017; Dessein et al., 2018) to smooth the optimal transport problem; as a result, iterative matrix scaling algorithms can be applied to provide significantly faster solutions with improved sample complexity (Genevay et al., 2019). An alternative approach is to approximate the Wasserstein distance through slicing, i.e. linearly projecting, the distributions to be compared.
The sliced Wasserstein distance (SWD) (Bonneel et al., 2015) is defined as the expected value of Wasserstein distances between one-dimensional random projections of high-dimensional distributions. The SWD shares similar theoretical properties with the Wasserstein distance (Bonnotte, 2013) and is computationally efficient, since the Wasserstein distance in one-dimensional space has a closed-form solution based on sorting. Deshpande et al. (2019) extend the sliced Wasserstein distance to the max-sliced Wasserstein distance (Max-SWD) by finding a single projection direction with the maximal distance between projected samples. The subspace robust Wasserstein distance extends the idea of slicing to projecting distributions onto linear subspaces (Paty and Cuturi, 2019). However, the linear nature of these projections usually leads to low projection efficiency of the resulting metrics in high-dimensional spaces (Deshpande et al., 2019; Kolouri et al., 2019a).

Figure 1: (a) and (b) are visualizations of projections for the ASWD and the SWD between two 2-dimensional Gaussians. (c) and (d) are distance histograms for the ASWD and the SWD between two 100-dimensional Gaussians. Figure 1(a) shows that the injective neural network embedded in the ASWD learns data patterns (in the X-Y plane) and produces well-separated projected values (Z-axis) between distributions in a random projection direction. The high projection efficiency of the ASWD is evident in Figure 1(c), as almost all random projection directions in a 100-dimensional space lead to significant distances between 1-dimensional projections. In contrast, random linear mappings in the SWD often produce closer 1-d projections (Z-axis) (Figure 1(b)); as a result, a large percentage of random projection directions in the 100-d space result in trivially small distances (Figure 1(d)), leading to a low projection efficiency in high-dimensional spaces.

Different variants of the SWD have been proposed to improve its projection efficiency, either by introducing nonlinear projections or by optimizing the distribution of random projections. Specifically, Kolouri et al. (2019a) extend the connection between the sliced Wasserstein distance and the Radon transform (Radon, 1917) to introduce generalized sliced Wasserstein distances (GSWDs) by utilizing generalized Radon transforms (GRTs), which are defined by nonlinear defining functions and lead to nonlinear projections. A variant named the GSWD-NN was proposed in (Kolouri et al., 2019a) to generate nonlinear projections directly with neural network outputs, but it does not fit into the theoretical framework of the GSWD and does not guarantee a valid metric. In contrast, the distributional sliced Wasserstein distance (DSWD) and its nonlinear version, the distributional generalized sliced Wasserstein distance (DGSWD), improve their projection efficiency by finding a distribution of projections that maximizes the expected distances over these projections. The GSWD and the DGSWD exhibit higher projection efficiency than the SWD in experimental evaluations, yet they require the specification of a particular form of defining function from a limited class of candidates. However, the selection of defining functions is usually a task-dependent problem that requires domain knowledge, and the impact of different defining functions on performance is still unclear.
In this paper, we present the augmented sliced Wasserstein distance (ASWD), a distance metric constructed by first mapping samples to hypersurfaces in an augmented space, which enables flexible nonlinear slicing of data distributions for improved projection efficiency (see Figure 1). Our main contributions include: (i) We exploit the capacity of nonlinear projections employed in the ASWD by constructing injective mappings with arbitrary neural networks; (ii) We prove that the ASWD is a valid distance metric; (iii) We provide a mechanism by which the hypersurface onto which high-dimensional distributions are projected can be optimized, and show that the optimization of hypersurfaces can help improve the projection efficiency of slice-based Wasserstein distances. Hence, the ASWD is data-adaptive, i.e. the hypersurfaces can be learned from data. This implies one does not need to manually design a function from a limited class of candidates; (iv) We demonstrate superior performance of the ASWD in numerical experiments for both synthetic and real-world datasets.

The remainder of the paper is organized as follows. Section 2 reviews the necessary background. We present the proposed method and its numerical implementation in Section 3. Related work is discussed in Section 4. Numerical experiment results are presented and discussed in Section 5. We conclude the paper in Section 6.

## 2 BACKGROUND

In this section, we provide a brief review of concepts related to the proposed work, including the Wasserstein distance, the (generalized) Radon transform and the (generalized) sliced Wasserstein distances.

**Wasserstein distance:** Let $P_k(\Omega)$ be a set of Borel probability measures with finite $k$-th moment on a Polish metric space $(\Omega,d)$ (Villani, 2008). Given two probability measures $\mu,\nu \in P_k(\Omega)$, the Wasserstein distance of order $k \in [1,+\infty)$ between $\mu$ and $\nu$ is defined as:

$$W_k(\mu,\nu)=\Big(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega} d(x,y)^k \, d\gamma(x,y)\Big)^{\frac{1}{k}}, \tag{1}$$

where $d(\cdot,\cdot)^k$ is the cost function and $\Gamma(\mu,\nu)$ represents the set of all transportation plans $\gamma$, i.e. joint distributions whose marginals are $\mu$ and $\nu$, respectively.

While the Wasserstein distance is generally intractable for high-dimensional distributions, there are several favorable cases where the optimal transport problem can be efficiently solved. If $\mu$ and $\nu$ are continuous one-dimensional measures defined on a linear space equipped with the $L_k$ norm, the Wasserstein distance between $\mu$ and $\nu$ has a closed-form solution (Peyré and Cuturi, 2019):

$$W_k(\mu,\nu)=\Big(\int_0^1 |F_\mu^{-1}(z)-F_\nu^{-1}(z)|^k \, dz\Big)^{\frac{1}{k}}, \tag{2}$$

where $F_\mu^{-1}$ and $F_\nu^{-1}$ are the inverse cumulative distribution functions (CDFs) of $\mu$ and $\nu$, respectively. In practice, the Wasserstein distance $W_k(\hat\mu,\hat\nu)$ between one-dimensional empirical distributions $\hat\mu=\frac{1}{N}\sum_{n=1}^N \delta_{x_n}$ and $\hat\nu=\frac{1}{N}\sum_{n=1}^N \delta_{y_n}$ can be computed by sorting the one-dimensional samples from the empirical distributions and evaluating the distances between the sorted samples (Kolouri et al., 2019b):

$$W_k(\hat\mu,\hat\nu)=\Big(\frac{1}{N}\sum_{n=1}^N \big|x_{I_x[n]}-y_{I_y[n]}\big|^k\Big)^{\frac{1}{k}}, \tag{3}$$

where $N$ is the number of samples, and $I_x[n]$ and $I_y[n]$ are the indices of sorted samples satisfying $x_{I_x[n]}\leq x_{I_x[n+1]}$ and $y_{I_y[n]}\leq y_{I_y[n+1]}$, respectively.
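To make Equation (3) concrete, the sketch below (our own illustration, not code from the paper) computes the order-$k$ Wasserstein distance between two one-dimensional empirical distributions by sorting; it assumes equal sample sizes and uniform weights.

```python
import numpy as np

def wasserstein_1d(x, y, k=2):
    # Equation (3): sort both sample sets, then average |x_[n] - y_[n]|^k.
    x_sorted = np.sort(x)          # x_{I_x[1]} <= ... <= x_{I_x[N]}
    y_sorted = np.sort(y)          # y_{I_y[1]} <= ... <= y_{I_y[N]}
    return np.mean(np.abs(x_sorted - y_sorted) ** k) ** (1.0 / k)

# Toy check with two 1-D Gaussians; the order-2 distance should be close to 2.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0.0, 1.0, 5000), rng.normal(2.0, 1.0, 5000)))
```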
**Radon transform and generalized Radon transform:** The Radon transform (Radon, 1917) maps a function $f(\cdot)\in L^1(\mathbb{R}^d)$ to the space of functions defined over the space of hyperplanes in $\mathbb{R}^d$. The Radon transform of $f(\cdot)$ is defined by integrals of $f(\cdot)$ along all possible hyperplanes in $\mathbb{R}^d$:

$$\mathcal{R}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\langle x,\theta\rangle)\,dx, \tag{4}$$

where $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d-1}$ represent the parameters of hyperplanes $\{x\in\mathbb{R}^d \,|\, \langle x,\theta\rangle=t\}$, $\delta(\cdot)$ is the one-dimensional Dirac delta function, and $\langle\cdot,\cdot\rangle$ refers to the Euclidean inner product. By replacing the inner product $\langle x,\theta\rangle$ in Equation (4) with $\beta(x,\theta)$, a specific family of functions named defining functions in (Kolouri et al., 2019a), the generalized Radon transform (GRT) (Beylkin, 1984) is defined as integrals of $f(\cdot)$ along hypersurfaces defined by $\{x\in\mathbb{R}^d \,|\, \beta(x,\theta)=t\}$:

$$\mathcal{G}f(t,\theta)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\beta(x,\theta))\,dx, \tag{5}$$

where $t\in\mathbb{R}$, $\theta\in\Omega_\theta$, and $\Omega_\theta$ is a compact set of all feasible $\theta$, e.g. $\Omega_\theta=\mathbb{S}^{d-1}$ for $\beta(x,\theta)=\langle x,\theta\rangle$. In particular, a function $\beta(x,\theta)$ defined on $X\times(\mathbb{R}^d\setminus\{0\})$ with $X\subseteq\mathbb{R}^d$ is called a defining function of GRTs if it satisfies conditions H.1-H.4 given in (Kolouri et al., 2019a). For probability measures $\mu\in P_k(\mathbb{R}^d)$, the Radon transform and the GRT can be employed as push-forward operators, and the generated push-forward measures $\mathcal{R}_\mu=\mathcal{R}\#\mu$ and $\mathcal{G}_\mu=\mathcal{G}\#\mu$ are defined as follows (Bonneel et al., 2015):

$$\mathcal{R}_\mu(t,\theta)=\int_{\mathbb{R}^d}\delta(t-\langle x,\theta\rangle)\,d\mu, \tag{6}$$

$$\mathcal{G}_\mu(t,\theta)=\int_{\mathbb{R}^d}\delta(t-\beta(x,\theta))\,d\mu. \tag{7}$$

Notably, the Radon transform is a linear bijection (Helgason, 1980), and sufficient conditions for GRTs to be bijective are provided in (Homan and Zhou, 2017).

**Sliced Wasserstein distance and generalized sliced Wasserstein distance:** By applying the Radon transform to $\mu$ and $\nu$ to obtain multiple projections, the sliced Wasserstein distance (SWD) decomposes the high-dimensional Wasserstein distance into multiple one-dimensional Wasserstein distances which can be efficiently evaluated (Bonneel et al., 2015). The $k$-SWD between $\mu$ and $\nu$ is defined by:

$$\text{SWD}_k(\mu,\nu)=\Big(\int_{\mathbb{S}^{d-1}} W_k^k\big(\mathcal{R}_\mu(\cdot,\theta),\mathcal{R}_\nu(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{8}$$

where the Radon transform $\mathcal{R}$ defined by Equation (6) is adopted as the measure push-forward operator. The GSWD generalizes the idea of the SWD by slicing distributions with hypersurfaces rather than hyperplanes (Kolouri et al., 2019a). The GSWD is defined as:

$$\text{GSWD}_k(\mu,\nu)=\Big(\int_{\Omega_\theta} W_k^k\big(\mathcal{G}_\mu(\cdot,\theta),\mathcal{G}_\nu(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{9}$$

where the GRT $\mathcal{G}$ defined by Equation (7) is used as the measure push-forward operator. From Equation (3), with $L$ random projections and $N$ samples, the SWD and GSWD between $\mu$ and $\nu$ can be approximated by:

$$\text{SWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\frac{1}{N}\sum_{n=1}^{N}\big|\langle x_{I_x^l[n]},\theta_l\rangle-\langle y_{I_y^l[n]},\theta_l\rangle\big|^k\Big)^{\frac{1}{k}}, \tag{10}$$

$$\text{GSWD}_k(\mu,\nu)\approx\Big(\frac{1}{L}\sum_{l=1}^{L}\frac{1}{N}\sum_{n=1}^{N}\big|\beta(x_{I_x^l[n]},\theta_l)-\beta(y_{I_y^l[n]},\theta_l)\big|^k\Big)^{\frac{1}{k}}, \tag{11}$$

where $I_x^l$ and $I_y^l$ are sequences consisting of the indices of sorted samples which satisfy $\langle x_{I_x^l[n]},\theta_l\rangle\leq\langle x_{I_x^l[n+1]},\theta_l\rangle$ and $\langle y_{I_y^l[n]},\theta_l\rangle\leq\langle y_{I_y^l[n+1]},\theta_l\rangle$ in the SWD, and $\beta(x_{I_x^l[n]},\theta_l)\leq\beta(x_{I_x^l[n+1]},\theta_l)$ and $\beta(y_{I_y^l[n]},\theta_l)\leq\beta(y_{I_y^l[n+1]},\theta_l)$ in the GSWD. The approximation error in estimating SWDs using Equation (10) is derived in (Nadjahi et al., 2020). It is proved in (Bonnotte, 2013) that the SWD is a valid distance metric. The GSWD is a valid metric except for its neural network variant (Kolouri et al., 2019a).
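Equations (10) and (11) suggest a direct Monte Carlo estimator. The following sketch is ours (the function name and defaults are illustrative): it estimates the SWD of Equation (10) with $L$ random directions drawn uniformly from the unit sphere; the GSWD estimator of Equation (11) is obtained by replacing the inner products with a defining function $\beta(x,\theta)$.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, k=2, rng=None):
    """Monte Carlo estimate of SWD_k (Equation (10)) for samples X, Y of shape (N, d)."""
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.normal(size=(n_projections, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # theta_l uniform on S^{d-1}
    X_proj = np.sort(X @ theta.T, axis=0)                   # <x_n, theta_l>, sorted per slice
    Y_proj = np.sort(Y @ theta.T, axis=0)
    # Average over both the L slices and the N sorted pairs, then take the k-th root.
    return np.mean(np.abs(X_proj - Y_proj) ** k) ** (1.0 / k)
```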
## 3 AUGMENTED SLICED WASSERSTEIN DISTANCES

In this section, we propose a new distance metric called the augmented sliced Wasserstein distance (ASWD), which embeds flexible nonlinear projections in its construction. We also provide an implementation recipe for the ASWD.

### 3.1 SPATIAL RADON TRANSFORM AND AUGMENTED SLICED WASSERSTEIN DISTANCE

In the definitions of the SWD and GSWD, the Radon transform (Radon, 1917) and the generalized Radon transform (GRT) (Beylkin, 1984) are used as the push-forward operators for projecting distributions to a one-dimensional space. However, it is not straightforward to design defining functions $\beta(x,\theta)$ for the GRT, since one needs to first check whether $\beta(x,\theta)$ satisfies the conditions to be a defining function (Kolouri et al., 2019a; Beylkin, 1984), and then whether the corresponding GRT is bijective (Homan and Zhou, 2017). In practice, the assumption on the transform can be relaxed, as Theorem 1 shows that as long as the transform is injective, the corresponding ASWD is a valid distance metric. To help us define the augmented sliced Wasserstein distance, we first introduce the spatial Radon transform, which includes the Radon transform and the polynomial GRT as special cases (see Remark 5).

**Definition 1.** Given a measurable injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ and a function $f(\cdot)\in L^1(\mathbb{R}^d)$, the spatial Radon transform of $f(\cdot)$ is defined as

$$\mathcal{H}f(t,\theta;g)=\int_{\mathbb{R}^d} f(x)\,\delta(t-\langle g(x),\theta\rangle)\,dx, \tag{12}$$

where $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d_\theta-1}$ are the parameters of hypersurfaces $\{x\in\mathbb{R}^d \,|\, \langle g(x),\theta\rangle=t\}$.

Similar to the Radon transform and the GRT, the spatial Radon transform can also be used to generate the push-forward measure $\mathcal{H}_\mu=\mathcal{H}\#\mu$ for $\mu\in P_k(\mathbb{R}^d)$, as in Equations (6) and (7):

$$\mathcal{H}_\mu(t,\theta;g)=\int_{\mathbb{R}^d}\delta(t-\langle g(x),\theta\rangle)\,d\mu. \tag{13}$$

**Remark 1.** Note that the spatial Radon transform can be interpreted as applying the vanilla Radon transform to $\hat\mu_g$, where $\hat\mu_g$ refers to the push-forward measure $g\#\mu$. That is, given a measurable injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$, the spatial Radon transform defined by Equation (13) can be rewritten as:

$$\mathcal{H}_\mu(t,\theta;g)=\mathbb{E}_{x\sim\mu}\big[\delta(t-\langle g(x),\theta\rangle)\big]=\mathbb{E}_{\hat x\sim\hat\mu_g}\big[\delta(t-\langle \hat x,\theta\rangle)\big]=\int \delta(t-\langle \hat x,\theta\rangle)\,d\hat\mu_g=\mathcal{R}_{\hat\mu_g}(t,\theta). \tag{14}$$

Hence the spatial Radon transform inherits the theoretical properties of the Radon transform and incorporates nonlinear projections through $g(\cdot)$.

In what follows, we use $\mu\equiv\nu$ to denote probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$ that satisfy $\mu(X)=\nu(X)$ for all $X\subseteq\mathbb{R}^d$.

**Lemma 1.** Given an injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ and two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$, $\mathcal{H}_\mu(t,\theta;g)\equiv\mathcal{H}_\nu(t,\theta;g)$ for all $t\in\mathbb{R}$ and $\theta\in\mathbb{S}^{d_\theta-1}$ if and only if $\mu\equiv\nu$, i.e. the spatial Radon transform is an injection on $P_k(\mathbb{R}^d)$. Moreover, the spatial Radon transform is an injection on $P_k(\mathbb{R}^d)$ if and only if the mapping $g(\cdot)$ is an injection.

See Appendix A for the proof of Lemma 1. We now introduce the augmented sliced Wasserstein distance, which utilizes the spatial Radon transform as the measure push-forward operator:

**Definition 2.** Given two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$ and an injective mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$, the augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ is defined as:

$$\text{ASWD}_k(\mu,\nu;g)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{H}_\mu(\cdot,\theta;g),\mathcal{H}_\nu(\cdot,\theta;g)\big)\,d\theta\Big)^{\frac{1}{k}}, \tag{15}$$

where $\theta\in\mathbb{S}^{d_\theta-1}$, $W_k$ is the $k$-Wasserstein distance defined by Equation (1), and $\mathcal{H}$ refers to the spatial Radon transform defined by Equation (13).

**Remark 2.** Following the connection between the spatial Radon transform and the vanilla Radon transform shown in Equation (14), the ASWD can be rewritten as:

$$\text{ASWD}_k(\mu,\nu;g)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{R}_{\hat\mu_g}(\cdot,\theta),\mathcal{R}_{\hat\nu_g}(\cdot,\theta)\big)\,d\theta\Big)^{\frac{1}{k}}=\text{SWD}_k(\hat\mu_g,\hat\nu_g), \tag{16}$$

where $\hat\mu_g$ and $\hat\nu_g$ are probability measures on $\mathbb{R}^{d_\theta}$ which satisfy $g(x)\sim\hat\mu_g$ for $x\sim\mu$ and $g(y)\sim\hat\nu_g$ for $y\sim\nu$.

**Theorem 1.** The augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ defined by Equation (15) with a mapping $g(\cdot):\mathbb{R}^d\to\mathbb{R}^{d_\theta}$ is a metric on $P_k(\mathbb{R}^d)$ if and only if $g(\cdot)$ is injective.

The proof of Theorem 1 is provided in Appendix C. Theorem 1 shows that the ASWD is a metric given a fixed injective mapping $g(\cdot)$.
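Remark 2 makes the ASWD easy to estimate once a mapping $g$ is fixed: push the samples through $g$ and evaluate an ordinary SWD in the augmented space. Below is a minimal sketch of this view (ours; `g` here is an arbitrary fixed injective mapping of the form of Equation (18), not the optimized network described in Section 3.2).

```python
import numpy as np

def aswd_fixed_mapping(X, Y, g, n_projections=100, k=2, rng=None):
    """ASWD_k with a fixed injective mapping g (Definition 2 / Remark 2):
    equivalent to SWD_k between the push-forward measures g#mu and g#nu."""
    rng = rng if rng is not None else np.random.default_rng()
    Xg, Yg = g(X), g(Y)                                      # samples in R^{d_theta}
    theta = rng.normal(size=(n_projections, Xg.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)    # theta on S^{d_theta - 1}
    Xp = np.sort(Xg @ theta.T, axis=0)
    Yp = np.sort(Yg @ theta.T, axis=0)
    return np.mean(np.abs(Xp - Yp) ** k) ** (1.0 / k)

# A fixed (untrained) mapping of the form g(x) = [x, phi(x)], for illustration only.
W = np.random.default_rng(1).normal(size=(2, 2))
g = lambda X: np.concatenate([X, np.maximum(X @ W, 0.0)], axis=1)
```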
In practical applications, the mapping $g(\cdot)$ needs to be optimized to project samples onto discriminating hypersurfaces. We show in Corollary 1.1 that the ASWD between $\mu$ and $\nu$ with the optimized $g(\cdot)$ is also a metric under mild conditions.

**Corollary 1.1.** The augmented sliced Wasserstein distance (ASWD) of order $k\in[1,+\infty)$ between two probability measures $\mu,\nu\in P_k(\mathbb{R}^d)$, defined by Equation (15) with the optimal mapping

$$g^*(\cdot)=\arg\max_g\big\{\text{ASWD}_k(\mu,\nu;g)-\mathcal{L}(\mu,\nu,\lambda;g)\big\}, \tag{17}$$

is a metric on $P_k(\mathbb{R}^d)$, where $\mathcal{L}(\mu,\nu,\lambda;g)=\lambda\big(\mathbb{E}_{x\sim\mu}[\|g(x)\|_2^k]^{\frac{1}{k}}+\mathbb{E}_{y\sim\nu}[\|g(y)\|_2^k]^{\frac{1}{k}}\big)$ for $\lambda\in(1,+\infty)$.

The proof of Corollary 1.1 is provided in Appendix D.

**Remark 3.** Corollary 1.1 shows that, given measures $\mu_1,\mu_2,\mu_3\in P_k(\mathbb{R}^d)$, the triangle inequality holds for the ASWD when $g(\cdot)$ is optimized for each pair of measures, as shown in Appendix D. It is worth noting that $\lambda>1$ is a sufficient condition for the ASWD to be a metric; as further discussed in Remark 6, $0<\lambda\leq 1$ can also lead to finite $\|g(x)\|_2$ in various scenarios, resulting in valid metrics. A discussion on the impact of $\lambda$ on the performance of the ASWD in practice can be found in Appendix G.2.

### 3.2 NUMERICAL IMPLEMENTATION

We discuss in this section how to realize the injective mapping $g(\cdot)$ with neural networks, due to their expressiveness, and how to optimize it with gradient-based methods.

**Injective neural networks:** As stated in Lemma 1 and Theorem 1, the injectivity of $g(\cdot)$ is the sufficient and necessary condition for the ASWD to be a valid metric. Thus we need specific architecture designs when implementing $g(\cdot)$ with neural networks. One option is the family of invertible neural networks (Behrmann et al., 2019; Karami et al., 2019), which are both injective and surjective. However, the running cost of those models is usually much higher than that of vanilla neural networks. We propose an alternative approach by concatenating the input $x$ of an arbitrary neural network to its output $\phi_\omega(x)$:

$$g_\omega(x)=[x,\phi_\omega(x)]. \tag{18}$$

It is trivial to show that $g_\omega(x)$ is injective, since different inputs lead to different outputs. Although embarrassingly simple, this idea of concatenating the input and output of neural networks has found success in preserving information with dense blocks in the DenseNet (Huang et al., 2017), where the input of each layer is injective to the output of all preceding layers.

**Optimization objective:** We aim to slice distributions with maximally discriminating hypersurfaces between two distributions while preventing the projected samples from being arbitrarily large, so that the projected samples of the compared distributions are finite and most dissimilar with respect to the ASWD, as shown in Figure 1. Similar ideas have been employed to identify important projection directions (Deshpande et al., 2019; Kolouri et al., 2019a; Paty and Cuturi, 2019) or a discriminative ground metric (Salimans et al., 2018) in optimal transport metrics. For the ASWD, the parameterized injective neural network $g_\omega(\cdot)$ is optimized by maximizing the following objective:

$$L(\mu,\nu;g_\omega,\lambda)=\Big(\int_{\mathbb{S}^{d_\theta-1}} W_k^k\big(\mathcal{H}_\mu(\cdot,\theta;g_\omega),\mathcal{H}_\nu(\cdot,\theta;g_\omega)\big)\,d\theta\Big)^{\frac{1}{k}}-\mathcal{L}(\mu,\nu,\lambda;g_\omega), \tag{19}$$

where $\lambda>0$ and the regularization term $\mathcal{L}(\mu,\nu,\lambda;g_\omega)=\lambda\big(\mathbb{E}_{x\sim\mu}[\|g_\omega(x)\|_2^k]^{\frac{1}{k}}+\mathbb{E}_{y\sim\nu}[\|g_\omega(y)\|_2^k]^{\frac{1}{k}}\big)$ penalizes the magnitude of the neural network's output; without it, the projections may become arbitrarily large.
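A minimal PyTorch sketch of this implementation (ours; module and function names, the optimizer, and the hyperparameter defaults are illustrative rather than the reference implementation): the injective mapping of Equation (18), a gradient-ascent loop on the objective of Equation (19), and one final evaluation of the resulting empirical ASWD.

```python
import torch
import torch.nn as nn

class InjectiveMapping(nn.Module):
    """g_omega(x) = [x, phi_omega(x)] (Equation (18)); injective because x is kept in the output."""
    def __init__(self, dim):
        super().__init__()
        # phi_omega: a single fully-connected layer with ReLU, as used in Section 5.
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return torch.cat([x, self.phi(x)], dim=-1)

def sliced_w(Xg, Yg, n_projections=50, k=2):
    # SWD between the augmented samples, i.e. the ASWD of Remark 2.
    theta = torch.randn(n_projections, Xg.shape[1], device=Xg.device)
    theta = theta / theta.norm(dim=1, keepdim=True)
    Xp, _ = torch.sort(Xg @ theta.T, dim=0)
    Yp, _ = torch.sort(Yg @ theta.T, dim=0)
    return (Xp - Yp).abs().pow(k).mean().pow(1.0 / k)

def aswd(X, Y, lam=0.1, k=2, n_iters=10, lr=1e-3):
    g = InjectiveMapping(X.shape[1])
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(n_iters):                       # gradient ascent on Equation (19)
        Xg, Yg = g(X), g(Y)
        reg = lam * (Xg.norm(dim=1).pow(k).mean().pow(1.0 / k)
                     + Yg.norm(dim=1).pow(k).mean().pow(1.0 / k))
        objective = sliced_w(Xg, Yg, k=k) - reg
        opt.zero_grad(); (-objective).backward(); opt.step()
    with torch.no_grad():                          # empirical ASWD with the optimized mapping
        return sliced_w(g(X), g(Y), k=k)
```

This mirrors the two phases of Algorithm 1 in Appendix E: first optimize $\omega$ by gradient ascent, then evaluate the distance with freshly drawn projections.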
**Remark 4.** The regularization coefficient $\lambda$ adjusts the non-linearity introduced in the evaluation of the ASWD by controlling the norm of $\phi_\omega(\cdot)$ in Equation (18). In particular, when $\lambda\to\infty$, the nonlinear term $\phi_\omega(\cdot)$ shrinks to 0. The intrinsic dimension of the augmented space, i.e. the number of non-zero dimensions in the augmented space, is hence explicitly controlled by the flexible choice of $\phi_\omega(\cdot)$ and implicitly regularized by $\mathcal{L}(\mu,\nu,\lambda;g_\omega)$.

By plugging the optimized $g^*_{\omega,\lambda}(\cdot)=\arg\max_{g_\omega} L(\mu,\nu;g_\omega,\lambda)$ into Equation (15), we obtain the empirical version of the ASWD. Pseudocode is provided in Appendix E.

## 4 RELATED WORK

Recent work on slice-based Wasserstein distances has mainly focused on improving their projection efficiency, leading to a reduced number of projections needed to capture the structure of data distributions (Kolouri et al., 2019a; Nguyen et al., 2021). The GSWD proposes using nonlinear projections to achieve this goal, and it has been proved to be a valid distance metric if and only if it adopts injective GRTs, whose feasible defining functions include only the circular functions and a finite number of homogeneous polynomial functions with odd degrees (Ehrenpreis, 2003); one such defining function is sketched at the end of this section. While the GSWD has shown impressive performance in various applications (Kolouri et al., 2019a), its defining function is restricted to the aforementioned limited class of candidates. In addition, the selection of the defining function is usually task-dependent and needs domain knowledge, and the impact of different defining functions on performance is still unclear. To tackle those limitations, (Kolouri et al., 2019a) proposed the GSWD-NN, which directly takes the outputs of a neural network as its projection results without using the standard Radon transform or GRTs. However, this brings three side effects: 1) The number of projections, which equals the number of nodes in the neural network's output layer, is fixed; thus new neural networks are needed if one wants to change the number of projections. 2) There are no random projections involved in the GSWD-NN, as the projection results are determined by the inputs and weights of the neural network. 3) The GSWD-NN is a pseudo-metric, since it uses a vanilla neural network, rather than the Radon transform or GRTs, as its push-forward operator. Therefore, the GSWD-NN does not fit into the theoretical framework of the GSWD and does not inherit its geometric properties.

Another notable variant of the SWD is the distributional sliced Wasserstein distance (DSWD) (Nguyen et al., 2021). By finding a distribution of projections that maximizes the expected distances over these projections, the DSWD can slice distributions from multiple directions while having high projection efficiency. Injective GRTs are also used to extend the DSWD to the distributional generalized sliced Wasserstein distance (DGSWD) (Nguyen et al., 2021). Experimental results show that the DSWD and the DGSWD have superior performance in generative modelling tasks (Nguyen et al., 2021). However, neither the DSWD nor the DGSWD solves the aforementioned problem with the GSWD, i.e. they are still not able to produce nonlinear projections adaptively.

Our contribution differs from previous work in three ways: 1) The ASWD is data-adaptive, i.e. the hypersurfaces onto which high-dimensional distributions are projected can be learned from data. This implies one does not need to specify a defining function from limited choices. 2) Unlike the GSWD-NN, the ASWD takes a novel direction to incorporate neural networks into the framework of slice-based Wasserstein distances while maintaining the properties of sliced Wasserstein distances. 3) Previous work on introducing nonlinear projections into the Radon transform either is restricted to only a few candidate defining functions (GRTs) or breaks the framework of Radon transforms (neural networks in the GSWD-NN); in contrast, the spatial Radon transform provides a novel way of defining nonlinear Radon-type transforms.
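For comparison with the learned mapping in the ASWD, here is a sketch (ours) of one member of that limited class: the homogeneous degree-3 polynomial defining function for 2-D inputs, corresponding to the "polynomial of degree 3" GSWD baseline used in Section 5 (see also Appendix B). Its functional form is fixed in advance rather than learned from data.

```python
import numpy as np

def poly3_defining_function(X, theta):
    """beta(x, theta) = sum over multi-indices |alpha| = 3 of theta_alpha * x^alpha
    for x in R^2, with theta on S^{d_alpha - 1} and d_alpha = 4 degree-3 monomials."""
    x1, x2 = X[:, 0], X[:, 1]
    monomials = np.stack([x1**3, x1**2 * x2, x1 * x2**2, x2**3], axis=1)
    return monomials @ theta            # one nonlinear projection per sample
```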
## 5 EXPERIMENTS

In this section, we describe the experiments that we have conducted to evaluate the performance of the proposed distance metric. The GSWD leads to the best performance in a sliced Wasserstein flow problem reported in (Kolouri et al., 2019a), and the DSWD outperforms the compared methods in the generative modeling task examined in (Nguyen et al., 2021) on the CIFAR10 (Krizhevsky, 2009), CelebA (Liu et al., 2015), and MNIST (LeCun et al., 1998) datasets (Appendix H.2). Hence, we compare the performance of the ASWD with these state-of-the-art distance metrics in the same examples and report the results below.¹ We provide additional experimental results in the appendices, including a sliced Wasserstein autoencoder (SWAE) (Kolouri et al., 2019b) using the ASWD (Appendix I), image color transferring (Appendix J), and sliced Wasserstein barycenters (Appendix K). To examine the robustness of the ASWD, throughout the experiments we adopt the injective network architecture given in Equation (18) and set $\phi_\omega$ to be a single fully-connected layer whose output dimension equals its input dimension, with a ReLU activation function. The order $k$ is set to 2 in all experiments.

¹Code to reproduce the experiment results is available at: https://github.com/xiongjiechen/ASWD.

### 5.1 SLICED WASSERSTEIN FLOWS

We first consider the problem of evolving a source distribution $\mu$ to a target distribution $\nu$ by minimizing slice-based Wasserstein distances between $\mu$ and $\nu$, in the sliced Wasserstein flow task reported in (Kolouri et al., 2019a):

$$\partial_t\mu_t=-\nabla\,\text{SWD}(\mu_t,\nu), \tag{20}$$

where $\mu_t$ refers to the updated source distribution at iteration $t$. The SWD in Equation (20) can be replaced by the other sliced Wasserstein distances to be evaluated. As in (Kolouri et al., 2019a), the 2-Wasserstein distance is used as the metric for evaluating the performance of the different distance metrics in this task. The set of hyperparameter values used in this experiment can be found in Appendix F.1.
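A sketch of how such a flow can be run in practice (ours; `sliced_distance` stands for any differentiable estimator of the compared metrics, e.g. the ASWD sketch from Section 3.2, and the optimizer and learning rate follow Appendix F.1):

```python
import torch

def sliced_wasserstein_flow(target, sliced_distance, n_particles=500,
                            n_iters=2000, lr=2e-3):
    """Discretized flow of Equation (20): the source samples are free parameters
    that descend the chosen sliced distance to the target samples."""
    d = target.shape[1]
    mu_t = torch.randn(n_particles, d, requires_grad=True)   # mu_0 = N(0, I)
    opt = torch.optim.Adam([mu_t], lr=lr)
    history = []
    for _ in range(n_iters):
        loss = sliced_distance(mu_t, target)
        opt.zero_grad(); loss.backward(); opt.step()
        history.append(loss.item())
    return mu_t.detach(), history
```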
Figure 2: The first and third columns show target distributions. The second and fourth columns show log 2-Wasserstein distances between the target distribution and the source distribution. The horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of log 2-Wasserstein distances over 50 runs. A more extensive set of experimental results can be found in Appendix G.1.

Without loss of generality, we initialize $\mu_0$ to be the standard normal distribution $\mathcal{N}(0,I)$. We repeat each experiment 50 times and record the 2-Wasserstein distance between $\mu$ and $\nu$ at every iteration. In Figure 2, we plot the 2-Wasserstein distances between the source and target distributions as a function of the training epochs, with the 8-Gaussian, Knot, Moon, and Swiss roll distributions as the respective target distributions. For clarity, Figure 2 displays the experimental results of the 6 best-performing distance metrics out of the 12 we compared: the ASWD, the DSWD, the SWD, the GSWD-NN 1, which directly generates projections through a one-layer MLP, and the GSWDs with the polynomial-of-degree-3 and circular defining functions. We observe from Figure 2 that the ASWD not only leads to smaller 2-Wasserstein distances, but also converges faster, achieving better results with fewer iterations than the other methods for these four target distributions. A complete set of experimental results with the 12 compared distance metrics and 8 target distributions is included in Appendix G.1. The ASWD outperforms the compared state-of-the-art slice-based Wasserstein distance metrics on 7 out of the 8 target distributions, the exception being the 25-Gaussian. This is achieved with the simple injective network architecture given in Equation (18) and a one-layer fully-connected neural network with equal input and output dimensions throughout the experiments. In addition, an ablation study is conducted to examine the effect of injective neural networks, the regularization coefficient $\lambda$, the choice of the dimensionality $d_\theta$ of the augmented space, and the optimization of hypersurfaces in the ASWD. Details can be found in Appendix G.2.

### 5.2 GENERATIVE MODELING

In this experiment, we use slice-based Wasserstein distances for the generative modeling task described in (Nguyen et al., 2021). The task is to generate images using generative adversarial networks (GANs) (Goodfellow et al., 2014) trained on either the CIFAR10 dataset (64 x 64 resolution) (Krizhevsky, 2009) or the CelebA dataset (64 x 64 resolution) (Liu et al., 2015). Denoting the hidden layer and the output layer of the discriminator by $h_\psi$ and $D_\Psi$, and the generator by $G_\Phi$, we train the GAN models with the following objectives:

$$\min_\Phi \ \text{SWD}\big(h_\psi(p_r),h_\psi(G_\Phi(p_z))\big), \tag{21}$$

$$\max_{\Psi,\psi} \ \mathbb{E}_{x\sim p_r}\big[\log(D_\Psi(h_\psi(x)))\big]+\mathbb{E}_{z\sim p_z}\big[\log(1-D_\Psi(h_\psi(G_\Phi(z))))\big], \tag{22}$$

where $p_z$ is the prior of the latent variable $z$ and $p_r$ is the distribution of the real data. The SWD in Equation (21) is replaced by the ASWD and other variants of the SWD to compare their performance; a sketch of these training objectives is given below.
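The sketch is ours: `sliced_distance` is any of the compared estimators applied to the flattened features, and names such as `G`, `h`, and `D` simply mirror $G_\Phi$, $h_\psi$, and $D_\Psi$ above.

```python
import torch

def generator_loss(x_real, z, G, h, sliced_distance):
    # Equation (21): a sliced distance between discriminator features of real
    # and generated images; only the generator parameters receive gradients here.
    feat_real = h(x_real).flatten(start_dim=1).detach()
    feat_fake = h(G(z)).flatten(start_dim=1)
    return sliced_distance(feat_fake, feat_real)

def discriminator_loss(x_real, z, G, h, D, eps=1e-8):
    # Equation (22) is a maximization; minimizing its negation is equivalent.
    real_term = torch.log(D(h(x_real)) + eps).mean()
    fake_term = torch.log(1.0 - D(h(G(z).detach())) + eps).mean()
    return -(real_term + fake_term)
```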
The GSWD with the polynomial defining function and the DGSWD are not included in this experiment due to their excessively high computational cost in high-dimensional spaces. The Fréchet Inception Distance (FID score) (Heusel et al., 2017) is used to assess the quality of the generated images. More details on the network structures and the parameter setup used in this experiment are available in Appendix F.2. We run 200 and 100 training epochs to train the GAN models on the CIFAR10 and the CelebA dataset, respectively. Each experiment is repeated 10 times and the results are reported in Table 1.

Figure 3: FID scores of generative models trained with different metrics on the CIFAR10 (left) and CelebA (right) datasets with L=1000 projections. The error bar represents the standard deviation of the FID scores at the specified training epoch among 10 simulation runs.

Table 1: FID scores of generative models trained with different distance metrics. Smaller scores indicate better image quality. L is the number of projections; we run each experiment 10 times and report the average values and standard errors of the FID scores for the CIFAR10 and CelebA datasets. The running time per training iteration for one batch containing 512 samples is computed on a computer with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.3 GHz, 16GB of RAM, and an RTX 6000 graphics card with 22GB of memory.

| Dataset | L | SWD (Bonneel et al., 2015) FID | t (s/it) | GSWD (Kolouri et al., 2019a) FID | t (s/it) | DSWD (Nguyen et al., 2021) FID | t (s/it) | ASWD FID | t (s/it) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | 10 | 121.4 ± 7.0 | 0.34 | 108.3 ± 5.6 | 0.41 | 74.2 ± 3.1 | 0.55 | 65.7 ± 3.2 | 0.58 |
| CIFAR10 | 100 | 104.6 ± 5.2 | 0.35 | 105.2 ± 3.2 | 0.74 | 66.5 ± 3.9 | 0.57 | 62.5 ± 1.9 | 0.60 |
| CIFAR10 | 1000 | 102.3 ± 5.3 | 0.36 | 98.2 ± 5.1 | 2.22 | 62.3 ± 5.7 | 1.30 | 59.3 ± 3.2 | 1.38 |
| CelebA | 10 | 94.8 ± 2.5 | 0.35 | 95.1 ± 4.2 | 0.40 | 86.0 ± 1.4 | 0.53 | 81.2 ± 1.3 | 0.59 |
| CelebA | 100 | 88.7 ± 5.7 | 0.36 | 86.7 ± 3.5 | 0.75 | 76.1 ± 3.5 | 0.55 | 73.2 ± 2.6 | 0.61 |
| CelebA | 1000 | 86.5 ± 4.1 | 0.38 | 85.2 ± 6.3 | 2.19 | 71.3 ± 4.7 | 1.28 | 67.4 ± 2.1 | 1.38 |

With the same number of projections and a similar computational cost, the ASWD leads to significantly improved FID scores among all evaluated distance metrics on both datasets, which implies that images generated with the ASWD are of higher quality. Figure 3 plots the FID scores recorded during the training process. The GAN model trained with the ASWD exhibits faster convergence, as it reaches smaller FID scores with fewer epochs. Randomly selected samples of generated images are presented in Appendix H.1.

## 6 CONCLUSION

We proposed a novel variant of the sliced Wasserstein distance, namely the augmented sliced Wasserstein distance (ASWD), which is flexible, has a high projection efficiency, and generalizes well. The ASWD adaptively updates the hypersurfaces used to slice the compared distributions by learning from data. We proved that the ASWD is a valid distance metric and presented its numerical implementation. We reported the empirical performance of the ASWD against state-of-the-art sliced Wasserstein metrics in various numerical experiments. We showed that the ASWD with a simple injective neural network architecture can lead to the smallest distance errors over the majority of datasets in a sliced Wasserstein flow task and superior performance in generative modeling tasks involving GANs and VAEs. We have also evaluated the applications of the ASWD in downstream tasks including color transferring and Wasserstein barycenters. What remains to be explored includes the topological properties of the ASWD. We leave this topic as an interesting future research direction.

## REFERENCES

J. Altschuler, J. Niles-Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1964-1974, Long Beach, California, USA, 2017.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proc. International Conference on Machine Learning (ICML), pages 214-223, Sydney, Australia, 2017.

J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J. Jacobsen. Invertible residual networks. In Proc. International Conference on Machine Learning (ICML), pages 573-582, Long Beach, California, USA, 2019.

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance.
Information and Inference: A Journal of the IMA, 8(4):657 676, 2019. G. Beylkin. The inversion problem and applications of the generalized Radon transform. Communications on Pure and Applied Mathematics, 37(5):579 599, 1984. N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22 45, 2015. N. Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Paris 11, 2013. N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (IPAMI), 39(9):1853 1865, 2016. M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 2292 2300, Lake Tahoe, Nevada, USA, 2013. M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proc. International Conference on Machine Learning (ICML), pages 685 693, Beijing, China, 2014. I. Deshpande, Y. Hu, R. Sun, A. Pyrros, et al. Max-sliced Wasserstein distance and its use for GANs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10648 10656, Long Beach, California, USA, 2019. A. Dessein, N. Papadakis, and J. Rouas. Regularized optimal transport and the rot mover s distance. The Journal of Machine Learning Research (JMLR), 19(1):590 642, 2018. L. Ehrenpreis. The universality of the Radon transform, chapter 5, pages 299 363. Oxford University Press, Oxford, UK, 2003. S. Ferradans, N. Papadakis, G. Peyré, and J. Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853 1882, 2014. A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1574 1583, Okinawa, Japan, 2019. I. Goodfellow, A. J. Pouget, M. Mirza, B. Xu, F. D. Warde, et al. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 2672 2680, Montréal, Canada, 2014. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In Proc. Advances in neural information processing systems (Neur IPS), pages 5767 5777, Long Beach, California, USA, 2017. S. Helgason. The Radon transform, volume 2. Basel, Switzerland: Springer, 1980. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. Advances in neural information processing systems (Neur IPS), pages 6626 6637, Long Beach, California, USA, 2017. Published as a conference paper at ICLR 2022 A. Homan and H. Zhou. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515 1529, 2017. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700 4708, Hawaii, USA, 2017. M. Karami, D. Schuurmans, J. Sohl-Dickstein, L. Dinh, and D. Duckworth. Invertible convolutional flow. In Proc. Advances in Neural Information Processing Systems (Neur IPS), pages 5636 5646, Vancouver, Canada, 2019. S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In Proc. 
Advances in Neural Information Processing Systems (Neur IPS), pages 261 272, Vancouver, Canada, 2019a. S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced-Wasserstein autoencoders. In Proc. International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019b. A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3730 3738, Las Condes, Chile, 2015. B. Muzellec and M. Cuturi. Subspace detours: Building transport plans that are optimal on subspace projections. Proc. Advances in Neural Information Processing Systems (Neur IPS), 32:6917 6928, 2019. K. Nadjahi et al. Statistical and topological properties of sliced probability divergences. Proc. Advances in Neural Information Processing Systems (Neur IPS), 33, 2020. K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. In Proc. International Conference on Learning Representations (ICLR), Vienna, Austria, 2021. F. Paty and M. Cuturi. Subspace robust Wasserstein distances. In Proc. International Conference on Machine Learning (ICML), pages 5072 5081, Long Beach, California, USA, 2019. G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. J. Radon. Uber die bestimmung von funktionen durch ihre integralwerte laengs gewisser mannigfaltigkeiten. Ber. Verh. Saechs. Akad. Wiss. Leipzig Math. Phys. Kl., 69:262, 1917. D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), pages 1530 1538, Lille, France, 2015. T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In Proc. International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018. C. Villani. Optimal Transport: old and new, volume 338. Berlin, Germany: Springer Science & Business Media, 2008. Published as a conference paper at ICLR 2022 APPENDIX A PROOF OF THE LEMMA 1 We prove that the spatial Radon transform defined with a measurable mapping g( ):Rd Rdθ is an injection on Pk(Rd) if and only if g( ) is injective. In the following contents, we use Pk(Rd) to denote a set of Borel probability measures with finite k-th moment on Rd, and f1 f2 is used to denote functions f1( ):X R and f2( ):X R that satisfy f1(x)=f2(x) for x X, and f1 f2 is used to denote functions f1( ) : X R and f2( ) : X R that satisfy f1(x) = f2(x) for certain x X. In addition, for probability measures µ and ν, we use µ ν to denote µ,ν Pk(Rd) that satisfy µ(X)=ν(X) for X Rd, and µ ν to denote µ,ν Pk(Rd) that satisfy µ(X) =ν(X) for certain X Rd. Proof. By using proof by contradiction, we first prove that if g( ) is injective, the corresponding spatial Radon transform is injective. If the spatial Radon transform defined with an injective mapping g( ) : Rd Rdθ is not injective, there exist µ, ν Pk(Rd), µ ν, such that Hµ(t,θ;g) Hν(t,θ;g) for t R and θ Sdθ 1. From Equation (14), for t R and θ Sdθ 1, the spatial Radon transform of µ can be written as: Hµ(t,θ;g)=Rˆµg(t,θ), (23) Hν(t,θ;g)=Rˆνg(t,θ), (24) where ˆµg and ˆνg refer to the push-forward measures g#µ and g#ν, respectively. 
From the assumption Hµ(t,θ;g) Hν(t,θ;g) and Equations (23) and (24), we know Rˆµg(t,θ) Rˆνg(t,θ) for t R and θ Sdθ 1, which implies ˆµg ˆνg since the Radon transform is injective. Since g( ) is injective, for all measurable X Rd, x X if and only if ˆx=g(x) g(X), which implies P(x X)=P(ˆx g(X)), P(y X)=P(ˆy g(X)). Therefore, Z g(X) dˆµg = Z X dµ, (25) Z g(X) dˆνg = Z Since ˆµg ˆνg, from Equations (25) and (26) we have: R X dν for X Rd. Hence, for X Rd: Z X d(µ ν)=0, (27) which implies µ ν, contradicting with the assumption µ ν. Therefore, if Hµ Hν, µ ν. In addition, from the definition of the spatial Radon transform in Equation (13), it is trivial to show that if µ ν, Hµ(t,θ;g) Hν(t,θ;g). Therefore, Hµ Hν if and only if µ ν, i.e. the spatial Radon transform H defined with an injective mapping g( ):Rd Rdθ is injective. We now prove that if the spatial Radon transform defined with a mapping g( ) : Rd Rdθ is injective, g( ) must be injective. Again, we use proof by contradiction. If g( ) is not injective, there exist x0,y0 Rd such that x0 = y0 and g(x0) = g(y0). For Dirac measures µ1 and ν1 defined by µ1(X)= R X δ(x x0)dx and ν1(Y)= R Yδ(y y0)dy, where X,Y Rd, we know µ1 ν1 as x0 =y0. We define variables x µ1 and y ν1. Then for variables ˆx = g(x) µ2 and ˆy = g(y) ν2, where µ2 and ν2 are push-forward measures g#µ1 and g#ν1, it is trivial to derive for X,Y Rd µ2(g(X))= Z g(X) δ(ˆx g(x0))dˆx, (28) ν2(g(Y))= Z g(Y) δ(ˆy g(y0))dˆy, (29) which implies µ2 ν2 as g(x0)=g(y0). From Equations (23), (24), (28) and (29), for t R and θ Sdθ 1: Hµ1(t,θ;g)=Rµ2(t,θ), =Rν2(t,θ), =Hν1(t,θ;g), (30) Published as a conference paper at ICLR 2022 contradicting with the assumption that the spatial Radon transform is injective. Therefore, if the spatial Radon transform is injective, g( ) must be injective. We conclude that the spatial Radon transform is injective if and only if the mapping g( ) is an injection on Pk(Rd). Published as a conference paper at ICLR 2022 APPENDIX B SPECIAL CASES OF SPATIAL RADON TRANSFORMS Remark 5. The spatial Radon transform degenerates to the vanilla Radon transform when the mapping g( ) is an identity mapping. In addition, the spatial Radon transform is equivalent to the polynomial GRT (Ehrenpreis, 2003) when g(x)=(xα1,...,xαdα ), where α is multi-indices αi =(ηi,1,...,ηi,d) Nd satisfying |αi|=Pd j=1ηi,j =m. m is the degree of the polynomial function, xαi =Qd j=1xηi,j j given an input x=(x1,...,xd) Rd, and dα is the number of all possible multi-indices αi that satisfies |αi|=m. We provide a proof for the claim in Remark 5. Proof. Given a probability measure µ P(Rd), the spatial Radon transform of µ is defined as: Hµ(t,θ;g)= Z Rdδ(t g(x),θ )dµ, (31) where t R and θ Sdθ 1 are the parameters of hypersurfaces in Rd. When the mapping g( ) is an identity mapping, i.e. g(x) = x for x Rd, the spatial Radon transform degenerates to the vanilla Radon transform: Hµ(t,θ;g)= Z Rdδ(t x,θ )dµ =Rµ(t,θ). (32) For GRTs, (Ehrenpreis, 2003) provides a class of injective GRTs named polynomial GRTs by adopting homogeneous polynomial functions with an odd degree m as the defining function: i=1 θixαi)dµ, s.t.|αi|=m, (33) where αi =(ηi,1,...,ηi,d) Nd, |αi|=Pd j=1ηi,j, xαi =Qd j=1xηi,j j for x=(x1,...,xd) Rd, dα is the number of all possible multi-indices αi that satisfies |αi|=m, and θ=(θ1,...,θdα) Sdα 1. 
In spatial Radon transform, for x Rd, when the mapping g( ) is defined as: g(x)=(xα1,...,xαdα ), (34) the spatial Radon transform is equivalent to the polynomial GRT defined in Equation (33): Hµ(t,θ;g)= Z Rdδ(t g(x),θ )dµ i=1 θixαi)dµ. (35) Published as a conference paper at ICLR 2022 APPENDIX C PROOF OF THEOREM 1 We provide a proof that the ASWD defined with a mapping g( ):Rd Rdθ is a metric on Pk(Rd), if and only if g( ) is injective. In what follows, we denote a set of Borel probability measures with finite k-th moment on Rd by Pk(Rd), and use µ,ν Pk(Rd) to refer to two probability measures defined on Rd. Proof. Symmetry: Since the k-Wasserstein distance is a metric thus symmetric (Villani, 2008): Wk Hµ( ,θ;g),Hν( ,θ;g) =Wk Hν( ,θ;g),Hµ( ,θ;g) . (36) ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Sdθ 1W k k Hν( ,θ;g),Hµ( ,θ;g) dθ 1 k =ASWDk(ν,µ;g). Triangle inequality: Given an injective mapping g( ) : Rd Rdθ and probability measures µ1, µ2, µ3 Pk(Rd), since the k-Wasserstein distance satisfies the triangle inequality (Villani, 2008), the following inequality holds: ASWDk(µ1,µ3;g)= Z Sdθ 1W k k Hµ1( ,θ;g),Hµ3( ,θ;g) dθ 1 Sdθ 1 Wk(Hµ1( ,θ;g),Hµ2( ,θ;g)) +Wk(Hµ2( ,θ;g),Hµ3( ,θ;g)) kdθ 1 Sdθ 1W k k Hµ1( ,θ;g),Hµ2( ,θ;g) dθ 1 Sdθ 1W k k Hµ2( ,θ;g),Hµ3( ,θ;g) dθ 1 =ASWDk(µ1,µ2;g)+ASWDk(µ2,µ3;g), where the second inequality is due to the Minkowski inequality in Lk(Sdθ 1). Identity of indiscernibles: Since Wk(µ,µ)=0 for µ Pk(Rd),we have ASWDk(µ,µ;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hµ( ,θ;g) dθ 1 for µ Pk(Rd). Conversely, for µ,ν Pk(Rd), if ASWDk(µ,ν;g) = 0, from the definition of the ASWD: ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Due to the non-negativity of k-th Wasserstein distance as it is a metric on Pk(Rd) and the continuity of Wk( , ) on Pk(Rd) (Villani, 2008), Wk Hµ( ,θ;g),Hν( ,θ;g) =0 holds for θ Sdθ 1 if and only if Hµ( ,θ;g) Hν( ,θ;g). Again, given the spatial Radon transform is injective when g( ) is injective (see the proof in Appendix A), Hµ( ,θ;g) Hν( ,θ;g) implies µ ν if g( ) is injective. In addition, if g( ) is not injective, the spatial Radon transform is not injective (see the proof in Appendix A), then µ, ν Pk(Rd), µ ν such that Hµ( ,θ;g) Hν( ,θ;g), which implies ASWDk(µ,ν;g)=0 for µ ν. Therefore, the ASWD satisfies the identity of indiscernibles if and only if g( ) is injective. Non-negativity: The three axioms of a distance metric, i.e. symmetry, triangle inequality, and identity of indiscernibles imply the non-negativity of the ASWD. Since the Wasserstein distance is non-negative, Published as a conference paper at ICLR 2022 for µ ,ν Pk(Rd), it can also be straightforwardly proved the ASWD between µ and ν is non-negative: ASWDk(µ,ν;g)= Z Sdθ 1W k k Hµ( ,θ;g),Hν( ,θ;g) dθ 1 Sdθ 10kdθ 1 Therefore, the ASWD is a metric on Pk(Rd) if and only if g( ) is injective. Published as a conference paper at ICLR 2022 APPENDIX D PROOF OF COROLLARY 1.1 We first introduce Lemma 2 to support the proof of Corollary 1.1. Lemma 2. For λ (1, + ), the optimal mapping g ( ) defined by Equation (17) satisfies ||g (x)||2 < for x Rd µ and x Rd ν. Proof. Recall that in Equation (16) the ASWD can be rewritten as: ASWDk(µ,ν;g)=SWDk(ˆµg,ˆνg) Sdθ 1W k k Rˆµg( ,θ),Rˆνg( ,θ) dθ 1 where transformed variables ˆx = g(x) ˆµg for x µ and ˆy = g(y) ˆνg for y ν, respectively. 
Combining the equation above with Equation (2): ASWDk(µ,ν;g)= Z Sdθ 1W k k Rˆµg( ,θ),Rˆνg( ,θ) dθ 1 0 |F 1 Pθ#ˆµg(z) F 1 Pθ#ˆνg(z)|kdzdθ 1 |F 1 Rθ(ˆµg)(z)|+|F 1 Rθ(ˆνg)(z)| kdzdθ 1 where # denotes the push forward operator, Pθ :x Rdθ x,θ R, and Rθ(ˆµg)=Pθ#ˆµg, Rθ(ˆνg)= Pθ#ˆνg refer to one-dimensional measures obtained by slicing ˆµg, ˆνg with a unit vector θ, F 1 Rθ(ˆµg) and F 1 Rθ(ˆνg) are inverse cumulative distribution functions (CDFs) of Rθ(ˆµg) and Rθ(ˆνg), respectively. By repeatedly applying the Minkowski s inequality to Equation (41), we obtain the following inequalities: ASWDk(µ,ν;g) Z |F 1 Rθ(ˆµg)(z)|+|F 1 Rθ(ˆνg)(z)| kdzdθ 1 0 |F 1 Rθ(ˆµg)(z)|kdz 1 0 |F 1 Rθ(ˆνg)(z)|kdz 1 0 |F 1 Rθ(ˆµg)(z)|kdzdθ 1 0 |F 1 Rθ(ˆνg)(z)|kdzdθ 1 Let s= ˆx,θ , then z =FRθ(ˆµg)(s), dz =d FRθ(ˆµg)(s): Z 1 0 |F 1 Rθ(ˆµg)(z)|kdz = Z R |s|kd FRθ(ˆµg)(s) Rdθ | ˆx,θ |kdˆµg Rd| g(x),θ |kdµ, (43) where the last two equations are obtained through the definitions of the push-forward operators. Therefore, the following inequalities hold: ASWDk(µ,ν;g) Z 0 |F 1 Rθ(ˆµg)(z)|kdzdθ 1 0 |F 1 Rθ(ˆνg)(z)|kdzdθ 1 Rd| g(x),θ |kdµdθ 1 Rd| g(y),θ |kdνdθ 1 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 . (44) Published as a conference paper at ICLR 2022 Then we obtain the following inequalities for the optimization objective: ASWDk(µ,ν;g) L(µ,ν,λ;g) 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 λ(E 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 ) 1 kx µ ||g(x)||k 2 +E 1 ky ν ||g(y)||k 2 . (45) When we set λ (1,+ ), if x Rd µ or y Rd ν such that ||g(x)||2 or ||g(y)||2 , the optimization objective approaches negative infinity, implying g( ) is not the optimal mapping. Therefore, by adopting Equation (17) as the optimization objective, the optimal mapping g ( ) satisfies ||g (x)||2 < for x Rd µ and x Rd ν. Remark 6. It is worth noting that λ> 1 in Corollary 1.1 is a sufficient condition for the supremum of the optimization objective to be non-positive and ||g (x)||2 < for x Rd µ and x Rd ν. 0 < λ 1 can also lead to finite ||g(x)||2 in various scenarios. Specifically, the upper bound of the optimization objective given in Equation (44) is obtained by applying: | g(x),θ |=||g(x)||2|cos(α)| ||g(x)||2, (46) where α is the angle between θ and g(x). In high-dimensional spaces, Equation (46) gives a very loose bound since in high-dimensional spaces the majority of sampled θ would be nearly orthogonal to g(x) and cos(α) is nearly zero with high probability (Kolouri et al., 2019a). Empirically we found that across all the experiment results, λ in a candidate set of {0.01,0.05,0.1,0.5,1,10,100} all lead to finite g ( ). We now prove Corollary 1.1, i.e ASWDk(µ,ν;g ) is a metric on Pk(Rd) , where g ( ) is the optimal mapping defined by Equation (17) for λ (1,+ ). Proof. Symmetry: Since the k-Wasserstein distance is a metric thus symmetric (Villani, 2008): Wk Hµ( ,θ;g ),Hν( ,θ;g ) =Wk Hν( ,θ;g ),Hµ( ,θ;g ) . (47) ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Sdθ 1W k k Hν( ,θ;g ),Hµ( ,θ;g ) dθ 1 k =ASWDk(ν,µ;g ). Triangle inequality: From Lemma 2, when λ (1, + ), the optimal mapping g ( ) satisfies ||g (x)||2 < for x Rd µ and x Rd ν, hence ASWDk(ν,µ;g ) is finite due to Equation (16). We then prove that ASWDk(ν,µ;g ) satisfies the triangle inequality. 
Denote by g 1, g 2, and g 3 optimal mappings that result in the supremum of Equation (19) between µ1 and µ2, µ1 and µ3, µ2 and µ3, respectively, since ASWDk(µ1,µ2;g 1), ASWDk(µ1,µ3;g 2), and ASWDk(µ2,µ3;g 3) are finite, the following equations hold: ASWDk(µ1,µ2;g 1) ASWDk(µ1,µ3;g 1)+ASWDk(µ2,µ3;g 1) (48) sup g {ASWDk(µ1,µ3;g)}+sup g {ASWDk(µ2,µ3;g)} (49) =ASWDk(µ1,µ3;g 2)+ASWDk(µ2,µ3;g 3), (50) where the first inequality are from the metric property of the ASWD. Identity of indiscernibles: Since Wk(µ,µ)=0 for µ Pk(Rd),we have ASWDk(µ,µ;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hµ( ,θ;g ) dθ 1 Published as a conference paper at ICLR 2022 for µ Pk(Rd). Conversely, for µ,ν Pk(Rd), if ASWDk(µ,ν;g )=0, from Equation (15): ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Due to the non-negativity of k-th Wasserstein distance as it is a metric on Pk(Rd) and the continuity of Wk( , ) on Pk(Rd) (Villani, 2008), Wk Hµ( ,θ;g ), Hν( ,θ;g ) =0 holds for θ Sdθ 1, which implies Hµ( ,θ;g ) Hν( ,θ;g ) for θ Sdθ 1. Therefore, given the spatial Radon transform is injective when g ( ) is injective, Hµ( ,θ;g ) Hν( ,θ;g ) implies µ ν. Non-negativity: Since the Wasserstein distance is non-negative, for µ ,ν Pk(Rd), the ASWD defined with optimal mappings g( ) between µ and ν is also non-negative: ASWDk(µ,ν;g )= Z Sdθ 1W k k Hµ( ,θ;g ),Hν( ,θ;g ) dθ 1 Sdθ 10kdθ 1 Therefore, the ASWD defined with the optimal mapping g ( ) is non-negative, symmetric, and satisfies the triangle inequality and the identity of indiscernibles, i.e. the ASWD defined with optimal mappings g ( ) is also a metric. Published as a conference paper at ICLR 2022 APPENDIX E PSEUDOCODE FOR THE EMPIRICAL VERSION OF THE ASWD Algorithm 1 The augmented sliced Wasserstein distance. All of the for loops can be parallelized. Require: Sets of samples {xn Rd}N n=1, {yn Rd}N n=1; Require: Randomly initialized injective neural network gω( ):Rd Rdθ; Require: Number of projections L, hyperparameter λ, learning rate ϵ, number of iterations M; 1: Initialize D=0,Lλ =0,m=1; 2: while ω has not converged and m M do 3: Draw a set of samples {θl}L l=1 from Sdθ 1; 4: for n=1 to N do 5: Compute gω(xn) and gω(yn); 6: Calculate the regularization term Lλ Lλ+λ ( ||gω(xn)||k 2 N ) 1 k +( ||gω(yn)||k 2 N ) 1 k ; 7: end for 8: for l=1 to L do 9: Compute β(xn,θl)= gω(xn),θl , β(yn,θl)= gω(yn),θl for each n; 10: Sort β(xn,θl) and β(yn,θl) in ascending order s.t. β(x Ilx[n],θl) β(x Ilx[n+1],θl) and β(y Ily[n],θl) β(y Ily[n+1],θl); 11: Calculate the ASWD: D D+( 1 L PN n=1|β(x Ilx[n],θl) β(y Ily[n],θl)|k) 1 k ; 12: end for 13: L D Lλ; 14: Update ω by gradient ascent ω ω+ϵ ωL; 15: Reset D=0,Lλ =0, update m m+1; 16: end while 17: Draw a set of samples {θl}L l=1 from Sdθ 1; 18: for n=1 to N do 19: Compute gω(xn) and gω(yn); 20: end for 21: for l=1 to L do 22: Compute β(xn,θl)= gω(xn),θl , β(yn,θl)= gω(yn),θl for each n; 23: Sort β(xn, θl) and β(yn, θl) in ascending order s.t. β(x Ilx[n], θl) β(x Ilx[n+1], θl) and β(y Ily[n],θl) β(y Ily[n+1],θl); 24: Calculate the ASWD: D D+( 1 L PN n=1|β(x Ilx[n],θl) β(y Ily[n],θl)|k) 1 k ; 25: end for 26: Output: Augmented sliced Wasserstein distance D. In Algorithm 1, Equation (17) is used as the optimization objective, where the regularization term L(µ,ν,λ;gω)=λ(E 1 kx µ ||gω(x)||k 2 +E 1 ky ν ||gω(y)||k 2 ) is used. This particular choice of the regularization term facilitates the proofs to Lemma 2 and subsequently to Corollary 1.1 with details in Appendix D. 
In fact, we have also examined other types of regularization terms such as the L2 norm of the output of g( ), and empirically they produce similar numerical results as the current regularization term. Published as a conference paper at ICLR 2022 APPENDIX F EXPERIMENTAL SETUPS F.1 HYPERPARAMETERS IN THE SLICED WASSERSTEIN FLOW EXPERIMENT We randomly generate 500 samples both for target distributions and source distributions. We initialize the source distributions µ0 as standard normal distributions N(0,I), where I is a 2-dimensional identity matrix. We update source distributions using Adam optimizer, and set the learning rate=0.002. For all methods, we set the order k = 2. When testing the ASWD, the number of iterations M in Algorithm 1 is set to 10. In the sliced Wasserstein flow experiment the mapping g ( ) optimized by maximizing Equation (17) was found to be finite for all values of λ in the set of {0.01, 0.05, 0.1, 0.5, 1, 10, 100}, which is a sufficient condition for the ASWD to be a valid metric as shown in the proof of corollary 1.1 provided in Appendix D. In addition, numerical results presented in Appendix G.2 indicate that empirical errors in the experiment are not sensitive to the choice of λ in the candidate set {0.01, 0.05, 0.1, 0.5}. The reported results in the main paper are produced with λ=0.1. F.2 NETWORK ARCHITECTURE IN THE GENERATIVE MODELING EXPERIMENT Denote a convolutional layer whose kernel size is s with C kernels by Conv C(s s), and a fully-connected layer whose input and output layer have s1 and s2 neurons by FC(s1 s2). The network structure used in the generative modeling experiment is configured to be the same as described in (Nguyen et al., 2021): hψ :(64 64 3) Conv64(4 4) Leaky Re LU(0.2) Conv128(4 4) Batch Normalization Leaky Re LU(0.2) Conv256(4 4) Batch Normalization Leaky Re LU(0.2) Conv512(4 4) Batch Normalization Tanh Output (512 4 4) DΨ :Conv1(4 4) Sigmoid Output (1 1 1) GΦ :z R100 Conv Transpose512(4 4) Batch Normalization Re LU Conv Transpose256(4 4) Batch Normalization Re LU Conv Transpose128(4 4) Batch Normalization Re LU Conv Transpose64(4 4) Batch Normalization Conv Transpose3(4 4) Tanh Ouput (64 64 3) φ:FC(8192 8192) Output (8192)-dimensional vector We train the models with the Adam optimizer, and set the batch size to 512. Following the setup in (Nguyen et al., 2021), the learning rate is set to 0.0005 and beta=(0.5, 0.999) for both CIFAR10 dataset and Celeb A dataset. For all methods, we set the order k to 2. For the ASWD, the number of iterations M in Algorithm 1 is set to 5. The hyperparameter λ is set to 1.01 to guarantee that the ASWD being a valid metric and introduce slightly larger regularization of the optimization objective due to the small output values from the feature layer hψ. Published as a conference paper at ICLR 2022 APPENDIX G ADDITIONAL RESULTS IN THE SLICED WASSERSTEIN FLOW EXPERIMENT G.1 FULL EXPERIMENTAL RESULTS ON THE SLICED WASSERSTEIN EXPERIMENT Figure 4 shows the full experimental results on the sliced Wasserstein flow experiment. Figure 4: Full experimental results on the sliced Wasserstein flow example. The first and third columns are target distributions. The second and fourth columns are log 2-Wasserstein distances between the target distributions and the source distributions. The horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of log 2-Wasserstein distances over 50 runs. 
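For reference, a PyTorch sketch of the feature extractor $h_\psi$, the discriminator head $D_\Psi$, and $\phi_\omega$ listed above (ours; the listing specifies only the kernel sizes, so the strides and paddings below are our assumption of standard DCGAN values, and the generator $G_\Phi$ is omitted for brevity):

```python
import torch.nn as nn

# h_psi: (3, 64, 64) -> (512, 4, 4). Strides/paddings assumed (DCGAN-style);
# the listing above specifies only the 4x4 kernels and the layer order.
h_psi = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.BatchNorm2d(512), nn.Tanh(),
)

# D_Psi: (512, 4, 4) -> a single probability per image.
D_Psi = nn.Sequential(nn.Conv2d(512, 1, 4), nn.Sigmoid(), nn.Flatten())

# phi_omega inside g_omega(x) = [x, phi_omega(x)]: one fully-connected layer on
# the flattened 512 * 4 * 4 = 8192-dimensional features.
phi_omega = nn.Sequential(nn.Flatten(), nn.Linear(8192, 8192))
```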
G.2 ABLATION STUDY

Impact of the injectivity and optimization of the mapping. In this ablation study, we compare ASWDs constructed with different mappings to GSWDs with different predefined defining functions, and investigate the effects of the optimization and the injectivity of the mapping $g_\omega(\cdot)$ used in the ASWDs. In what follows, "ASWD-vanilla" denotes ASWDs that employ a randomly initialized neural network $\phi_\omega(\cdot)$ to parameterize the injective mapping $g_\omega(\cdot)=[\cdot,\phi_\omega(\cdot)]$, i.e. the mapping $g_\omega(\cdot)$ is not optimized in the ASWD-vanilla, and the results of the ASWD-vanilla reported in Figure 5 are obtained by slicing with random hypersurfaces. Furthermore, "ASWD-non-injective" refers to ASWDs that do not use the injectivity trick, i.e. the mapping $g_\omega(\cdot)=\phi_\omega(\cdot)$ is not guaranteed to be injective. In addition, "ASWD-vanilla-non-injective" adopts both setups of the "ASWD-vanilla" and the "ASWD-non-injective", resulting in a random non-injective mapping $g_\omega(\cdot)$. The reported results in this ablation study are computed over 50 runs, and the neural network $\phi_\omega(\cdot)$ is reinitialized randomly in each run.

From Figure 5, it can be observed that the ASWD-vanilla shows performance comparable to GSWDs defined with polynomial and circular defining functions, which implies that GSWDs with predefined defining functions are as uninformative as slicing distributions with the random hypersurfaces constructed in the ASWD-vanilla. In GSWDs, the hypersurfaces are predefined and cannot be optimized, since they are determined by the functional forms of the defining functions. On the contrary, we found that the optimization of hypersurfaces in the ASWD framework helps improve the performance of the slice-based Wasserstein distance. As shown in Figure 5, the ASWD and the ASWD-non-injective exhibit significantly better performance than methods that do not optimize their hypersurfaces (ASWD-vanilla, ASWD-vanilla-non-injective, and GSWDs). In terms of the impact of the injectivity of the mapping $g_\omega$, in this experiment the ASWD-vanilla exhibits smaller 2-Wasserstein distances than the ASWD-vanilla-non-injective for all tested distributions, and the ASWD leads to more stable training than the ASWD-non-injective. Therefore, the injectivity of the mapping $g_\omega(\cdot)$ not only guarantees that the ASWD is a valid distance metric, as proved in Section 3, but also leads to better empirical performance in this experimental setup.

Impact of the regularization coefficient. We also evaluated the sensitivity of the performance of the ASWD with respect to the regularization coefficient $\lambda$. The ASWD is evaluated with different values of $\lambda$ and compared with other slice-based Wasserstein metrics in this ablation study. The numerical results presented in Figure 6 indicate that different values of $\lambda$ in the candidate set $\{0.01, 0.05, 0.1, 0.5\}$ lead to similar performance of the ASWD, i.e. the performance of the ASWD is not sensitive to $\lambda$. Additionally, the ASWDs with different values of $\lambda$ in the candidate set outperform the other evaluated slice-based Wasserstein metrics. We have also evaluated the performance of the ASWD when $\lambda$ is set well beyond the candidate set. Specifically, as presented in Figure 7, when $\lambda$ is set to large values, e.g. 10 or 100, the resulting ASWD exhibits degraded performance on par with the SWD. This is consistent with our expectation that excessive regularization eliminates nonlinearity, as discussed in Remark 2, leading to performance similar to that of the SWD.
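The contrast between the injective and non-injective variants above can be made concrete with a short sketch. The snippet below is a hedged illustration of the injective construction $g_\omega(x)=[x,\phi_\omega(x)]$ of Equation (18) and of its non-injective counterpart used in the ASWD-non-injective; the hidden width and output dimensions are illustrative choices, not the architectures used in the experiments.

```python
import torch
import torch.nn as nn


class InjectiveAugment(nn.Module):
    """g_omega(x) = [x, phi_omega(x)]: concatenating the input makes the mapping injective."""
    def __init__(self, d, d_extra):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d_extra))

    def forward(self, x):
        return torch.cat([x, self.phi(x)], dim=-1)   # output dimension d_theta = d + d_extra


class NonInjectiveMap(nn.Module):
    """ASWD-non-injective: the raw network output, with no injectivity guarantee."""
    def __init__(self, d, d_theta):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d_theta))

    def forward(self, x):
        return self.phi(x)
```

Freezing the parameters of either module (i.e. skipping the optimization over ω) corresponds to the ASWD-vanilla and ASWD-vanilla-non-injective baselines, respectively.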
In addition, the effect of the regularization term on the performance of the Max-GSWD-NN was also investigated in this ablation study. The performance of the Max-GSWD-NN and that of the Max-GSWD-NN trained with the regularization term used in the ASWD are compared in Figure 8. From the numerical results presented in Figure 8, the Max-GSWD-NN with regularization performs similarly to the Max-GSWD-NN without regularization, implying that the performance gap between the ASWD and the Max-GSWD-NN is not due to the introduction of the regularization term.

Choice of injective mapping. We report in Figure 9 the performance of the ASWD defined with injective mappings other than that of Equation (18). In particular, we examined two invertible mappings, the planar flow and the radial flow (Rezende and Mohamed, 2015), as alternatives to the injective mapping defined by Equation (18). The numerical results presented in Figure 9 show that the ASWDs defined with the planar flow and the radial flow produce better performance than the GSWD variants in most setups. They exhibit slightly worse performance than the ASWD with the injective mapping defined in Equation (18), possibly due to the additional restrictions that invertibility imposes on the planar and radial flows.

Choice of the dimensionality dθ of the augmented space. To investigate how the dimensionality $d_\theta$ of the augmented space affects the performance of the ASWD, different choices of $d_\theta$ are employed in the ASWD. Specifically, the injective network architecture $g_\omega(x)=[x,\phi_\omega(x)]: \mathbb{R}^d \to \mathbb{R}^{d_\theta}$ given in Equation (18) is adopted, and $\phi_\omega$ is set to a single fully-connected neural network whose output dimension equals 1, 2, 3, or 4 times its input dimension, i.e. $d_\theta \in \{2d, 3d, 4d, 5d\}$, where $d$ is the dimensionality of $x$. The numerical results are presented in Figure 10; the ASWDs produce similar results across the different choices of $d_\theta$. It can also be observed in Figure 10 that the ASWDs with different choices of $d_\theta$ consistently produce better performance than the other evaluated slice-based Wasserstein metrics.

Figure 5: Ablation study on the impact of injective neural networks and the optimization of hypersurfaces on the ASWD. ASWDs with different mappings are compared to GSWDs with different defining functions. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 6: Ablation study on the impact of the regularization coefficient λ. The performance of the ASWDs with different values of λ is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 7: Ablation study on the impact of large values of λ.
The performance of the ASWDs with large values of λ, e.g. 10 and 100, is compared with that of the SWD. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 8: Ablation study on the impact of the regularization term on the performance of the Max-GSWD-NN. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 9: Ablation study on the impact of the choice of injective networks. The performance of the ASWDs with different types of injective networks is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

Figure 10: Ablation study on the choice of the dimensionality dθ of the augmented space. The performance of the ASWDs with different choices of dθ is compared with that of other slice-based Wasserstein metrics. The first and third columns show the target distributions. The second and fourth columns plot log 2-Wasserstein distances between the target distributions and the source distributions. In the second and fourth columns, the horizontal axis shows the number of training iterations. Solid lines and shaded areas represent the average values and 95% confidence intervals of the log 2-Wasserstein distances over 50 runs.

APPENDIX H ADDITIONAL RESULTS IN THE GENERATIVE MODELING EXPERIMENT

H.1 SAMPLES OF GENERATED IMAGES OF THE CIFAR10 AND CELEBA DATASETS

Figure 11: Visualized experimental results of the ASWD on the CelebA and CIFAR10 datasets with 10, 100, and 1000 projections. Panels: (a) CelebA (L=10), (b) CelebA (L=100), (c) CelebA (L=1000), (d) CIFAR10 (L=10), (e) CIFAR10 (L=100), (f) CIFAR10 (L=1000). The first row shows randomly selected samples of generated CelebA images; the second row shows randomly selected samples of generated CIFAR10 images.

H.2 EXPERIMENT RESULTS ON THE MNIST DATASET

In the generative modeling experiment on the MNIST dataset, we train a generator by minimizing different slice-based Wasserstein metrics, including the ASWD, the DSWD, the GSWD (circular), and the SWD. Denoting by $G_\Phi$ the generator, the training objective of the experiment can be formulated as (Bernton et al., 2019):

$$\min_\Phi \mathbb{E}_{x\sim p_r,\, z\sim p_z}\big[\mathrm{SWD}(x, G_\Phi(z))\big], \qquad (54)$$

where $p_z$ and $p_r$ are the prior of the latent variable $z$ and the real data distribution, respectively. In other words, the SWD, or another slice-based Wasserstein metric, can be considered as a discriminator in this framework.
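The following is a hedged sketch of the training objective in Equation (54): a batch of generated images is compared to a batch of real images with a slice-based distance, and the generator is updated to minimize it. The SWD computed directly on flattened images stands in for the ASWD/DSWD/GSWD, the toy fully-connected generator replaces GΦ, and the placeholder data tensor replaces the MNIST loader; only the batch size, learning rate, and Adam betas mirror the settings reported above.

```python
import torch
import torch.nn as nn


def sliced_distance(x, y, L=1000, k=2):
    """Monte Carlo SWD between two batches of flattened images of shape (batch, dim)."""
    theta = torch.randn(L, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    px, _ = torch.sort(x @ theta.T, dim=0)
    py, _ = torch.sort(y @ theta.T, dim=0)
    return (px - py).abs().pow(k).mean().pow(1.0 / k)


generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())  # toy G_Phi
opt = torch.optim.Adam(generator.parameters(), lr=0.0005, betas=(0.5, 0.999))
real_batches = torch.rand(100, 512, 784)        # placeholder for batches of real MNIST images

for real in real_batches:                       # real: (512, 784)
    z = torch.randn(real.shape[0], 100)         # z ~ p_z
    fake = generator(z)
    loss = sliced_distance(real, fake)          # Monte Carlo estimate of Equation (54)
    opt.zero_grad()
    loss.backward()
    opt.step()
```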
By replacing the SWD with the ASWD, the DSWD, and the GSWD, we compare the performance of generative models trained with the different metrics. In this experiment, the methods are compared using different numbers of projections, L ∈ {10, 1000}. The 2-Wasserstein distance and the SWD between generated images and real images are used as metrics for evaluating the performance of the different generative models. The experimental results are presented in Figure 12.

Figure 12: Visualized experimental results of different slice-based Wasserstein metrics on the MNIST dataset with 10 and 1000 projections. (a) Comparison between the SWD, the GSWD, the DSWD, and the ASWD using the 2-Wasserstein distance between fake and real images as the evaluation metric. (b) Comparison between the SWD, the GSWD, the DSWD, and the ASWD using the SWD between fake and real images as the evaluation metric.

Quality of generated images and convergence rate: It can be observed from Figure 12 that the ASWD outperforms all the other methods with respect to both the 2-Wasserstein distance and the SWD between generated and real images. In particular, the generative model trained with the ASWD produces smaller 2-Wasserstein distances in fewer iterations, which implies that the generated images are of higher quality and that the ASWD leads to faster convergence of generative models. In addition, the ASWD is able to generate higher-quality images than the SWD and the GSWD with 1000 projections while using as few as 10 projections. In other words, the ASWD has higher projection efficiency than the other slice-based Wasserstein metrics. The ASWD also attains the smallest SWD, as shown in Figure 12. Although the SWD converges slightly faster than the ASWD in terms of the SWD between fake and real images, this is because the training objective and the evaluation metric coincide for the SWD. Randomly selected images generated with different slice-based Wasserstein metrics are presented in Figure 13.

Computation cost of the ASWD: The execution time per mini-batch (512 samples) of the different methods is compared in Figure 14a. We evaluate the SWD by varying the number of projections in the set {10, 1000, 10000} and all the other methods in the set {10, 1000}. We found that although the SWD requires much less computation time than the DSWD and the ASWD, the quality of the generated data is poor even when the number of projections L increases to 10000. The GSWD is also computationally efficient when using 10 projections, but it requires the longest execution time and yields the largest 2-Wasserstein distance among all compared methods when the number of projections increases to 1000. The large difference in the execution time of the GSWD between 10 and 1000 projections arises because the GSWD needs to compute distance matrices of shape N × L, where N and L are the numbers of samples and projections, respectively, which is more computationally expensive than computing inner products when the number of projections L increases. The DSWD requires a computation time similar to that of the ASWD in this example, while the ASWD generates higher-quality images in terms of the 2-Wasserstein distance. Besides, we have also evaluated the effect of the batch size on the computation cost of the different slice-based Wasserstein metrics.
Due to the out-of-memory error caused by the excessively high computation cost of the GSWD in high-dimensional spaces, the GSWD is not included in this comparison. Specifically, the ASWD, the DSWD, and the SWD are compared in this experiment, and the computation times of the evaluated methods with L ∈ {10, 1000} projections and $N \in \{2^{13}, 2^{14}, 2^{15}, 2^{16}\}$ samples are reported in Figure 14b. From the results presented in Figure 14b, it can be observed that, similar to the computational complexity of the DSWD and the SWD, the computational complexity of the ASWD empirically tends to scale as O(N log N).

Figure 13: Randomly selected samples generated with different metrics; 10 and 1000 refer to the number of projections. Panels: ASWD-10, SWD-10, DSWD-10, GSWD-10, ASWD-1000, SWD-1000, DSWD-1000, GSWD-1000.

Figure 14: (a) The execution time of the different methods and the 2-Wasserstein distance between the real images and the fake images generated by their corresponding models. Each dot on the SWD curve corresponds to the performance of the SWD with the number of projections L = 10, 1000, and 10000, in sequence. Each dot on the other curves corresponds to the performance of the other methods with the number of projections L = 10 and 1000, in sequence. (b) Computation cost of calculating the different metrics; the horizontal axis is the number of samples from the compared distributions, and the vertical axis is the average time consumed to calculate the distances. It can be seen that the evaluated metrics tend to scale as O(N log N).

APPENDIX I SLICED WASSERSTEIN AUTOENCODERS

We train an autoencoder using the framework proposed in (Kolouri et al., 2019b), where an encoder and a decoder are jointly trained by minimizing the following objective:

$$\min_{\phi,\psi}\ \mathrm{BCE}(\psi(\phi(x)),x) + L_1(\psi(\phi(x)),x) + \mathrm{SWD}(p_z,\phi(x)), \qquad (55)$$

where $\phi$ is the encoder, $\psi$ is the decoder, $p_z$ is the prior distribution of the latent variable, $\mathrm{BCE}(\cdot,\cdot)$ is the binary cross entropy loss between reconstructed images and real images, and $L_1(\cdot,\cdot)$ is the L1 loss between reconstructed images and real images. We train this model using different slice-based Wasserstein metrics, including the ASWD, the DSWD, the SWD, and the GSWD. Here we use the ring distribution shown in Figure 16 as the prior distribution. We report the binary cross entropy loss at test time and the 2-Wasserstein distance between the prior and the encoded latent variable φ(x) in Figure 15. The slice-based Wasserstein metric used as the third term in Equation (55) is also recorded at each iteration and presented in Figure 15, in order to analyze the factors that cause the differences in the performance of models trained with different slice-based Wasserstein metrics. It can be observed from the first two columns of Figure 15 that, while the models trained with the ASWD, the SWD, and the DSWD lead to similar 2-Wasserstein distances between the encoded latent distribution and the prior distribution, which implies that they have similar coverage of the prior distribution as shown in Figure 16, the model trained with the ASWD converges slightly faster to a smaller binary cross entropy loss than the others. As shown in Figures 15(b) and 15(c), since the obtained GSWD values are trivially small compared with the other metrics, models trained with the GSWD effectively focus only on the reconstruction loss and therefore produce reconstructed images of higher quality in terms of the binary cross entropy.
However, although the models trained with the GSWD (polynomial) and the GSWD (circular) lead to smaller binary cross entropy losses than the ASWD, their latent distributions present very different structures from the specified prior distribution, which can be problematic in applications where the support of the latent distribution is required to lie within a particular range. Some MNIST images randomly generated by SWAEs trained with different metrics are given in Figure 17.

Figure 15: Convergence behavior of SWAEs trained with different slice-based Wasserstein metrics. (a) The 2-Wasserstein distance between the prior distribution pz and the distribution of the encoded features φ(x). (b) The binary cross entropy loss between the reconstructions and the real data. (c) The slice-based Wasserstein metric used to train the model.

Figure 16: Comparison of the encoded latent spaces generated by different slice-based Wasserstein metrics. Panels: (a) ASWD latent space, (b) DSWD latent space, (c) SWD latent space, (d) GSWD (poly 3) latent space, (e) GSWD (circular) latent space, (f) prior distribution.

Figure 17: MNIST images randomly generated by SWAEs trained with different metrics. Panels: (a) ASWD samples, (b) DSWD samples, (c) SWD samples, (d) GSWD (poly 3) samples, (e) GSWD (circular) samples, (f) MNIST samples.

APPENDIX J COLOR TRANSFERRING

Color transferring can be formulated as an optimal transport problem (Bonneel et al., 2015; Radon, 1917). In this task, the color palette of a source image is transformed to match that of a target image, while the content of the source image is kept unchanged. To achieve this, optimal transport can be used to find an alignment between image pixels by computing the optimal mapping between color palettes. In this experiment, instead of solving for the optimal mapping in the original space, we first project the distributions onto one-dimensional spaces and average the alignments between the one-dimensional samples as an approximation of the optimal mapping in the original space. After obtaining this approximation, we replace the pixels of the source image with the averaged corresponding pixels of the target image. To reduce the computational cost, we adopt the approach proposed in (Muzellec and Cuturi, 2019): the K-means algorithm is used to cluster the pixels of both the source and the target images, and color transfer is then performed on the quantized images, whose pixels consist of the centers of 3000 clusters, rather than on the original source and target images. We present the results of color transferring in Figure 18. It can be observed that the ASWD and the DSWD produce sharper images than the SWD and the GSWD (polynomial); we conjecture that this is because the ASWD and the DSWD generate better alignments of pixels. The GSWD (circular) tends to generate high-contrast images, but sometimes produces images with low color diversity, as in Figure 18(a). The Max-SWD has the highest contrast among all methods, but this is because it uses only a single projection to obtain the transport mapping, so there is no need to average different pixels from the target image. A disadvantage of the Max-SWD is that the transferred images it generates are not smooth and do not look realistic. The ASWD generates smoother and more realistic images than the SWD and the Max-SWD, even when the number of projections is as small as 10. Transferred images obtained by transferring colors from the source to the target using standard optimal transport maps (Ferradans et al., 2014) are also presented in Figure 18 for reference.
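A hedged sketch of the projection-and-average step described above is given below: for each random direction, the sorted 1-d projections define the one-dimensional optimal transport alignment, and each source pixel is replaced by the average of its matched target pixels over the projections. The function name and defaults are illustrative, the source and target are assumed to have the same number of pixels (e.g. the 3000 K-means centers), and the learned nonlinear projections of the ASWD/DSWD are not reproduced here; plain random linear directions are used instead.

```python
import numpy as np


def sliced_color_transfer(source_pixels, target_pixels, n_projections=50):
    """source_pixels, target_pixels: (N, 3) RGB arrays with the same N (e.g. K-means centers)."""
    x = source_pixels.astype(np.float64)
    y = target_pixels.astype(np.float64)
    matched = np.zeros_like(x)
    for _ in range(n_projections):
        theta = np.random.randn(3)
        theta /= np.linalg.norm(theta)          # random direction in RGB space
        ix = np.argsort(x @ theta)              # ranks of the projected source pixels
        iy = np.argsort(y @ theta)              # ranks of the projected target pixels
        # The 1-d optimal transport plan matches pixels of equal rank; accumulate,
        # for every source pixel, the target pixel matched under this projection.
        matched[ix] += y[iy]
    return matched / n_projections              # averaged corresponding target pixels
```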
Figure 18: The top rows show the source and target images; the lower rows show transferred images obtained using different methods with different numbers of projections. Source and target images are from (Bonneel et al., 2015) and https://github.com/chia56028/Color-Transfer-between-Images.

APPENDIX K SLICED WASSERSTEIN BARYCENTER

Sliced Wasserstein distances can also be applied to barycenter computation and shape interpolation (Bonneel et al., 2015). Here we compare barycenters produced by different slice-based Wasserstein metrics, including the GSWD (circular and polynomial), the ASWD, the SWD, and the DSWD. Specifically, we compute barycenters of different shapes consisting of point clouds, as shown in Figure 19. Each object in Figure 19 corresponds to a barycenter with a specific set of weights. The Wasserstein barycenters, which provide geometrically meaningful barycenters at the expense of a significantly higher computational cost, are also presented in Figure 19 for reference. Formally, a sliced Wasserstein barycenter of objects $\mu=\{\mu_1,\mu_2,\dots,\mu_N\} \subset P_k(\mathbb{R}^d)$ assigned weights $w=[w_1,w_2,\dots,w_N] \in \mathbb{R}^N$ is defined as:

$$\mathrm{Bar}(\mu,w) = \underset{\bar{\mu}\in P_k(\mathbb{R}^d)}{\arg\min}\ \sum_{i=1}^{N} w_i\,\mathrm{SWD}(\bar{\mu},\mu_i). \qquad (56)$$

In this experiment, we set N = 3 and compute barycenters corresponding to different weights. The results are given in Figure 19. From Figure 19, it can be observed that the ASWD produces barycenters similar to those of the SWD, which are sharper than those of the DSWD and more meaningful than those of the GSWD (polynomial). The flexibility of the injective neural network g(·) and its optimization in the ASWD can potentially be combined with task-specific requirements to generate calibrated barycenters; we leave this as a future research direction.

Figure 19: Sliced Wasserstein barycenters generated by the ASWD, the GSWD, the DSWD, and the SWD, together with the Wasserstein barycenter. Panels: (a) ASWD barycenters, (b) GSWD (circular) barycenters, (c) GSWD (polynomial) barycenters, (d) DSWD barycenters, (e) SWD barycenters, (f) Wasserstein barycenter.
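To illustrate how Equation (56) can be approached for point clouds, the following is a hedged sketch in which the barycenter's support points are optimized by gradient descent on the weighted sum of sliced Wasserstein distances to the input shapes; the plain SWD stands in for any of the slice-based metrics compared above, and all hyperparameters and function names are illustrative assumptions rather than the settings used to produce Figure 19.

```python
import torch


def swd(x, y, L=50, k=2):
    """Monte Carlo sliced Wasserstein distance between equally sized point clouds of shape (N, d)."""
    theta = torch.randn(L, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    px, _ = torch.sort(x @ theta.T, dim=0)
    py, _ = torch.sort(y @ theta.T, dim=0)
    return (px - py).abs().pow(k).mean().pow(1.0 / k)


def sliced_barycenter(shapes, weights, n_iters=1000, lr=0.01):
    """shapes: list of (N, 2) point clouds with equal N; weights: list of floats summing to 1."""
    n_points = shapes[0].shape[0]
    bary = torch.randn(n_points, 2, requires_grad=True)     # initial barycenter support points
    opt = torch.optim.Adam([bary], lr=lr)
    for _ in range(n_iters):
        loss = sum(w * swd(bary, s) for w, s in zip(weights, shapes))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return bary.detach()
```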