Published as a conference paper at ICLR 2024

SLICED WASSERSTEIN ESTIMATION WITH CONTROL VARIATES

Khai Nguyen & Nhat Ho
Department of Statistics and Data Sciences
The University of Texas at Austin
Austin, TX 78712, USA
{khainb,minhnhat}@utexas.edu

ABSTRACT

The sliced Wasserstein (SW) distance between two probability measures is defined as the expectation of the Wasserstein distance between one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite its various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of the projected one-dimensional measures, and then utilize the closed form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can reduce the variance considerably when comparing measures over images and point clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.

1 INTRODUCTION

Recent machine learning applications have recognized the Wasserstein distance (Villani, 2008b; Peyré & Cuturi, 2019) as a universally effective tool. In particular, various applications achieve notable performance by using the Wasserstein distance as a component, for example, (hierarchical) clustering (Ho et al., 2017), domain adaptation (Courty et al., 2016; Damodaran et al., 2018), generative modeling (Arjovsky et al., 2017; Tolstikhin et al., 2018), and so on. Despite its appealing performance, the Wasserstein distance has two major weaknesses. The first weakness is its high computational complexity. As discussed in (Peyré & Cuturi, 2019), the time complexity of computing the Wasserstein distance between two discrete measures that have at most n supports is O(n^3 log n). Additionally, the space complexity required for the pairwise transportation cost matrix is O(n^2). As a result, computing the Wasserstein distance between large-scale discrete distributions is expensive. The second weakness is that the Wasserstein distance suffers from the curse of dimensionality. More specifically, the sample complexity of the Wasserstein distance is O(n^{-1/d}). Therefore, using the Wasserstein distance in high-dimensional statistical inference may not be stable. The Wasserstein distance has a well-known alternative called the sliced Wasserstein (SW) distance, which uses the one-dimensional Wasserstein distance as its base metric.
The one-dimensional Wasserstein distance has a closed-form solution, which can be computed in O(n log n) time and O(n) space in the discrete case. Here, "supports" refers to the atoms of the discrete probability measures being compared. To apply this closed-form solution in a high-dimensional setting, random projections are employed: the two measures are transformed into pairs of one-dimensional measures by a projecting function, using projecting directions that belong to the unit hypersphere. The closed form of the one-dimensional Wasserstein distance is then applied to each pair of one-dimensional measures to obtain projected distances, and the SW distance is computed by aggregating these values. The most commonly used method of aggregation is averaging, resulting in an expectation of the projected one-dimensional Wasserstein distance with respect to the uniform distribution over projecting directions. Since the space of projecting directions is the unit hypersphere, the SW distance does not suffer from the curse of dimensionality, and its sample complexity is O(n^{-1/2}). Due to its scalability, the SW distance has been successfully applied in various applications such as point-cloud reconstruction (Nguyen et al., 2023; Savkin et al., 2022), generative models (Deshpande et al., 2018), domain adaptation (Lee et al., 2019), shape matching (Le et al., 2024b), clustering (Kolouri et al., 2018), variational inference (Yi & Liu, 2021), and many other tasks.

Code for the paper is published at https://github.com/khainb/CV-SW.

The SW distance is typically estimated using Monte Carlo integration, as the intractable expectation requires approximation in practice. The number of Monte Carlo samples drawn from the uniform distribution over the unit hypersphere is referred to as the number of projections. Increasing the number of projections improves the accuracy and stability of the estimation. However, to the best of our knowledge, there is no prior work on improving the Monte Carlo approximation scheme of the SW distance, especially in terms of variance reduction. In this work, we propose a novel approach that connects the SW distance literature with the variance reduction literature by introducing the first control variates for the SW distance; control variates are a standard technique for reducing the variance of Monte Carlo estimates.

Contribution. In summary, our contributions are two-fold:

1. To address the issue of variance in estimating the SW distance with Monte Carlo samples, we propose a novel approach based on control variates. Specifically, we first identify two Gaussian distributions that closely approximate the two projected one-dimensional measures by minimizing the Kullback-Leibler divergence. Next, we construct control variates using the closed-form solution of the Wasserstein distance between two Gaussian distributions. We propose two control variates, an upper bound and a lower bound of the Wasserstein distance between the fitted Gaussian distributions. We show that the computation of these control variates is only linear in the number of supports and the number of dimensions when dealing with discrete probability measures. This means that the proposed control variate estimators have the same computational complexity as the conventional estimator of the SW distance. Overall, our proposed approach provides a practical and efficient method for estimating the SW distance with reduced variance.
2. We empirically show that the proposed control variate estimators yield considerably smaller variances than the conventional estimator of the SW distance, for a finite number of projections, when comparing empirical distributions over images and point clouds. Moreover, we illustrate that the computation for the control variates is negligible compared to the computation of the one-dimensional Wasserstein distances. Finally, we further demonstrate the favorable performance of the control variate approach in gradient flows between two point clouds and in learning deep generative models on the CIFAR10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015) datasets. From our experiments, we observe that using the proposed control variate estimators considerably improves performance, while their computation time is approximately the same as that of the conventional estimator.

Organization. The remainder of the paper is organized as follows. First, we review the background on the Wasserstein distance, including its special cases, such as the one-dimensional case and the Gaussian case, as well as the sliced Wasserstein distance and its Monte Carlo estimation scheme, in Section 2. Next, we discuss the construction of the control variates and the resulting estimators of the SW distance in Section 3. We then present experimental results in Section 4 to demonstrate the benefits of the proposed control variate estimators. Finally, we conclude in Section 5 and provide proofs of key results, related works, and additional materials in the Appendices.

Notations. For any d ≥ 2, we denote S^{d-1} := {θ ∈ R^d : ||θ||_2^2 = 1} as the unit hypersphere and U(S^{d-1}) as its corresponding uniform distribution. We denote P(S^{d-1}) as the set of all probability measures on S^{d-1}. For p ≥ 1, P_p(R^d) is the set of all probability measures on R^d that have finite p-th moments. For any two sequences a_n and b_n, the notation a_n = O(b_n) means that a_n ≤ C b_n for all n ≥ 1, where C is some universal constant. We denote θ♯µ as the push-forward measure of µ through the function f : R^d → R, where f(x) = θ^⊤x. We denote P_X as the empirical measure (1/m) Σ_{i=1}^m δ_{x_i}, where X := (x_1, ..., x_m) ∈ R^{dm} is a vector. We denote Tr(A) as the trace of the matrix A, and A^{1/2} as the square root of the matrix A, i.e., A^{1/2} = B such that BB = A. Given a probability measure µ, we denote F_µ as the cumulative distribution function (CDF) of µ.

2 BACKGROUND

We now review some essential materials for the paper.

Wasserstein distance. Given p ≥ 1 and two probability measures µ ∈ P_p(R^d) and ν ∈ P_p(R^d), the Wasserstein distance (Villani, 2008a; Peyré & Cuturi, 2019) between µ and ν is:

W_p^p(µ, ν) = inf_{π ∈ Π(µ, ν)} ∫_{R^d × R^d} ||x - y||_p^p dπ(x, y),

where Π(µ, ν) is the set of all couplings whose marginals are µ and ν, respectively. The computational complexity and memory complexity of the Wasserstein distance are O(n^3 log n) and O(n^2), respectively, when µ and ν have at most n supports. However, there are some special cases where the Wasserstein distance can be computed efficiently.

Special cases. When p = 2 and the two measures are Gaussian, i.e., µ := N(m_1, Σ_1) and ν := N(m_2, Σ_2), the Wasserstein distance between µ and ν has the following closed form:

W_2^2(µ, ν) = ||m_1 - m_2||_2^2 + Tr(Σ_1 + Σ_2 - 2(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).   (1)

When d = 1, the Wasserstein distance between µ ∈ P_p(R) and ν ∈ P_p(R) can also be computed in closed form:

W_p^p(µ, ν) = ∫_0^1 |F_µ^{-1}(z) - F_ν^{-1}(z)|^p dz.

When µ and ν are one-dimensional discrete probability measures that have at most n supports, the computational complexity and memory complexity in this case are only O(n log n) and O(n).
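To make the one-dimensional closed form concrete, the following minimal NumPy sketch (ours, not from the paper's released code) computes W_p^p between two one-dimensional empirical measures with uniform weights and the same number of supports; in that case the quantile integral reduces to matching sorted samples.

```python
import numpy as np

def one_d_wasserstein_pp(x, y, p=2):
    """W_p^p between two 1D empirical measures with uniform weights.

    Assumes x and y are 1D arrays with the same number of supports, so the
    quantile-matching integral reduces to comparing sorted samples.
    """
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p)
```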
Sliced Wasserstein distance. Using random projections, the sliced Wasserstein (SW) distance can exploit the closed-form benefit of the Wasserstein distance in one dimension. The sliced Wasserstein distance (Bonneel et al., 2015) between two probability measures µ ∈ P_p(R^d) and ν ∈ P_p(R^d) is defined as:

SW_p^p(µ, ν) = E_{θ ~ U(S^{d-1})}[W_p^p(θ♯µ, θ♯ν)],   (2)

where U(S^{d-1}) is the uniform distribution over the unit hypersphere.

Monte Carlo estimation. The expectation in the SW distance is often intractable; hence, Monte Carlo samples are used to estimate it:

\widehat{SW}_p^p(µ, ν; L) = (1/L) Σ_{l=1}^L W_p^p(θ_l♯µ, θ_l♯ν),   (3)

where the projecting directions θ_1, ..., θ_L are drawn i.i.d. from U(S^{d-1}). When µ and ν are empirical measures that have at most n supports in d dimensions, the time complexity of the SW distance is O(Ln log n + Ldn). Here, Ln log n is for sorting L sets of projected supports and Ldn is for projecting the supports onto L sets of scalars. Similarly, the space complexity for storing the projecting directions and the projected supports is O(Ld + Ln). We refer to Algorithm 1 in Appendix D for the detailed algorithm for the SW distance.

Variance and error. By simple manipulation, the variance of the Monte Carlo approximation of the SW distance is:

Var[\widehat{SW}_p^p(µ, ν; L)] = (1/L) Var[W_p^p(θ♯µ, θ♯ν)].

The error of the Monte Carlo approximation (Nadjahi et al., 2020b) satisfies:

E[|\widehat{SW}_p^p(µ, ν; L) - SW_p^p(µ, ν)|] ≤ (1/√L) Var[W_p^p(θ♯µ, θ♯ν)]^{1/2}.   (4)

Here, we can see that Var[W_p^p(θ♯µ, θ♯ν)] plays an important role in controlling the approximation error. Therefore, a natural question arises: "Can we construct a function Z(θ; µ, ν) such that E[Z(θ; µ, ν)] = SW_p^p(µ, ν) while Var[Z(θ; µ, ν)] ≤ Var[W_p^p(θ♯µ, θ♯ν)]?". If we can design such a Z(θ; µ, ν), we will have

E[|(1/L) Σ_{l=1}^L Z(θ_l; µ, ν) - SW_p^p(µ, ν)|] ≤ (1/√L) Var[Z(θ; µ, ν)]^{1/2} ≤ (1/√L) Var[W_p^p(θ♯µ, θ♯ν)]^{1/2},

which implies that the estimator (1/L) Σ_{l=1}^L Z(θ_l; µ, ν) is better than \widehat{SW}_p^p(µ, ν; L).
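For reference, here is a minimal NumPy sketch of the conventional Monte Carlo estimator in equation 3 for two empirical measures with the same number of supports and uniform weights (our illustration, not the authors' implementation).

```python
import numpy as np

def sw_pp_monte_carlo(X, Y, L=100, p=2, rng=None):
    """Conventional Monte Carlo estimator of SW_p^p (equation 3).

    X: (n, d) supports of mu, Y: (n, d) supports of nu, uniform weights assumed.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Sample L projecting directions uniformly on the unit hypersphere.
    theta = rng.standard_normal((L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project supports to one dimension and sort each projection: shapes (n, L).
    X_proj = np.sort(X @ theta.T, axis=0)
    Y_proj = np.sort(Y @ theta.T, axis=0)
    # Average the one-dimensional W_p^p values over the L projections.
    return np.mean(np.abs(X_proj - Y_proj) ** p)
```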
3 CONTROL VARIATE SLICED WASSERSTEIN ESTIMATORS

We first adapt notations from the control variates literature to the SW case in Section 3.1. After that, we discuss how to construct control variates and their computational properties in Section 3.2.

3.1 CONTROL VARIATE FOR SLICED WASSERSTEIN DISTANCE

Short notations. We are approximating the SW distance, which is an expectation with respect to the random variable θ. For convenience, let us denote W_p^p(θ♯µ, θ♯ν) as W(θ; µ, ν); we have:

SW_p^p(µ, ν) = E[W(θ; µ, ν)],  and  \widehat{SW}_p^p(µ, ν; L) = (1/L) Σ_{l=1}^L W(θ_l; µ, ν),   (5)

where θ_1, ..., θ_L are i.i.d. samples from σ_0(θ) := U(S^{d-1}).

Control variate. A control variate (Owen, 2013) is a random variable C(θ) whose expectation is tractable, i.e., E[C(θ)] = B. From the control variate, we consider the following variable:

W(θ; µ, ν) - γ(C(θ) - B),   (6)

where γ ∈ R. It is easy to check that E[W(θ; µ, ν) - γ(C(θ) - B)] = E[W(θ; µ, ν)] = SW_p^p(µ, ν) since E[C(θ)] = B. Therefore, the Monte Carlo estimation of E[W(θ; µ, ν) - γ(C(θ) - B)], i.e., (1/L) Σ_{l=1}^L (W(θ_l; µ, ν) - γ(C(θ_l) - B)), is an unbiased estimation of E[W(θ; µ, ν)]. Now, we consider the variance of the variable W(θ; µ, ν) - γ(C(θ) - B):

Var[W(θ; µ, ν) - γ(C(θ) - B)] = Var[W(θ; µ, ν)] - 2γ Cov[W(θ; µ, ν), C(θ)] + γ^2 Var[C(θ)].

The above variance attains its minimum, with respect to γ, at

γ* = Cov[W(θ; µ, ν), C(θ)] / Var[C(θ)].

Using the optimal γ*, the minimum variance is:

Var[W(θ; µ, ν)] (1 - Cov[W(θ; µ, ν), C(θ)]^2 / (Var[W(θ; µ, ν)] Var[C(θ)])).   (7)

Therefore, the variance of W(θ; µ, ν) - γ*(C(θ) - B) is lower than that of W(θ; µ, ν) whenever the control variate C(θ) is correlated with W(θ; µ, ν).

Definition 1. Given a control variate C(θ), the corresponding controlled projected one-dimensional Wasserstein distance is:

Z(θ; µ, ν, C(θ)) = W(θ; µ, ν) - (Cov[W(θ; µ, ν), C(θ)] / Var[C(θ)]) (C(θ) - B).   (8)

In practice, computing γ* might be intractable; hence, we can estimate γ* using Monte Carlo samples.

Control variate estimator of the SW distance. We can now form the new estimator of SW_p^p(µ, ν).

Definition 2. Given a control variate C(θ) with E[C(θ)] = B and the number of projections L ≥ 1, the control variate sliced Wasserstein estimator is:

\widehat{CV-SW}_p^p(µ, ν; L, C(θ)) = \widehat{SW}_p^p(µ, ν; L) - \hat{γ}_L (1/L) Σ_{l=1}^L (C(θ_l) - B),   (9)

where θ_1, ..., θ_L ~ σ_0(θ), \widehat{SW}_p^p(µ, ν; L) = (1/L) Σ_{l=1}^L W(θ_l; µ, ν), and the estimated optimal coefficient is \hat{γ}_L = \widehat{Cov}[W(θ; µ, ν), C(θ); L] / \widehat{Var}[C(θ); L], with \widehat{Cov}[W(θ; µ, ν), C(θ); L] = (1/L) Σ_{l=1}^L (W(θ_l; µ, ν) - \widehat{SW}_p^p(µ, ν; L))(C(θ_l) - B) and \widehat{Var}[C(θ); L] = (1/L) Σ_{l=1}^L (C(θ_l) - B)^2.

Remark 1. In Definition 2, the optimal coefficient \hat{γ}_L is estimated by reusing the Monte Carlo samples θ_1, ..., θ_L. It is possible to estimate \hat{γ}_L using new samples θ'_1, ..., θ'_L to make it independent of \widehat{SW}_p^p(µ, ν; L). Nevertheless, this resampling approach doubles the computational complexity. Therefore, the estimation in Definition 2 is preferable in practice (Owen, 2013). It is worth noting that the estimator is only asymptotically unbiased; however, this is negligible, since unbiasedness is always lost anyway after taking the p-th root to compute the value of the SW distance with finite L.

In this section, we have not yet specified C(θ). In the next section, we will focus on constructing a computationally efficient C(θ) that is also effective by conditioning it on the two probability measures µ and ν. Specifically, we define C(θ) := C(θ; µ, ν), which involves conditioning on these two probability measures µ and ν.
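The correction in Definition 2 is a standard control-variate recipe. The following minimal NumPy sketch (ours) assumes that the array `w` holds the per-projection values W(θ_l; µ, ν), `c` holds C(θ_l), and `b` holds the exact expectation B; these array names are our own.

```python
import numpy as np

def cv_sw_pp(w, c, b):
    """Control variate correction of Definition 2.

    w: array of W(theta_l; mu, nu) values, c: array of C(theta_l) values,
    b: the exact expectation B = E[C(theta)].
    """
    sw_hat = np.mean(w)
    centered_c = c - b
    gamma_hat = np.mean((w - sw_hat) * centered_c) / np.mean(centered_c ** 2)
    return sw_hat - gamma_hat * np.mean(centered_c)
```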
3.2 CONSTRUCTING CONTROL VARIATES

We recall that we are interested in the random variable W(θ; µ, ν) = W_p^p(θ♯µ, θ♯ν). It is desirable to have a control variate C(θ) such that C(θ) ≈ W(θ; µ, ν) (Owen, 2013). Therefore, it is natural to construct C(θ) := C(θ; µ, ν) based on the two measures µ and ν. Moreover, since we know the closed-form solution of the Wasserstein distance between two Gaussians, control variates of the form C(θ; µ, ν) = W_2^2(N(m_1(θ; µ), σ_1^2(θ; µ)), N(m_2(θ; ν), σ_2^2(θ; ν))) could be tractable.

Gaussian approximation. Constructing the control variate, i.e., specifying N(m_1(θ; µ), σ_1^2(θ; µ)) and N(m_2(θ; ν), σ_2^2(θ; ν)), now requires finding Gaussian approximations of the two projected measures θ♯µ and θ♯ν. When the two measures are discrete, namely µ = Σ_{i=1}^n α_i δ_{x_i} (Σ_{i=1}^n α_i = 1) and ν = Σ_{i=1}^m β_i δ_{y_i} (Σ_{i=1}^m β_i = 1), we perform the following optimization:

m_1(θ; µ), σ_1^2(θ; µ) = argmax_{m_1, σ_1^2} Σ_{i=1}^n α_i log[ (1/√(2πσ_1^2)) exp(-(θ^⊤x_i - m_1)^2 / (2σ_1^2)) ],   (10)

m_2(θ; ν), σ_2^2(θ; ν) = argmax_{m_2, σ_2^2} Σ_{i=1}^m β_i log[ (1/√(2πσ_2^2)) exp(-(θ^⊤y_i - m_2)^2 / (2σ_2^2)) ].   (11)

Proposition 1. Let µ and ν be two discrete probability measures, i.e., µ = Σ_{i=1}^n α_i δ_{x_i} (Σ_{i=1}^n α_i = 1) and ν = Σ_{i=1}^m β_i δ_{y_i} (Σ_{i=1}^m β_i = 1); then

m_1(θ; µ) = Σ_{i=1}^n α_i θ^⊤x_i,  σ_1^2(θ; µ) = Σ_{i=1}^n α_i (θ^⊤x_i - m_1(θ; µ))^2,
m_2(θ; ν) = Σ_{i=1}^m β_i θ^⊤y_i,  σ_2^2(θ; ν) = Σ_{i=1}^m β_i (θ^⊤y_i - m_2(θ; ν))^2

are the solutions of the problems in equations 10-11.

We refer to Appendix C.1 for the detailed proof of Proposition 1. It is worth noting that the closed form in Proposition 1 can also be seen as the solution obtained from moment matching. When µ and ν are continuous, we can solve the optimization by stochastic gradient descent, sampling from µ and ν respectively. In addition, we can use the corresponding empirical measures of µ and ν, i.e., µ_n = (1/n) Σ_{i=1}^n δ_{X_i} and ν_n = (1/n) Σ_{i=1}^n δ_{Y_i}, as proxies (where X_1, ..., X_n are i.i.d. samples from µ and Y_1, ..., Y_n are i.i.d. samples from ν). Beyond optimization, we can also utilize the Laplace approximation to obtain two approximating Gaussians for θ♯µ and θ♯ν when the push-forward densities are tractable. We refer the reader to Appendix A for more detail.

Constructing control variates. From the closed form of the Wasserstein-2 distance between two Gaussians in equation 1, we have:

W_2^2(N(m_1(θ; µ), σ_1^2(θ; µ)), N(m_2(θ; ν), σ_2^2(θ; ν))) = (m_1(θ; µ) - m_2(θ; ν))^2 + σ_1(θ; µ)^2 + σ_2(θ; ν)^2 - 2σ_1(θ; µ)σ_2(θ; ν).

To calculate E[W_2^2(N(m_1(θ; µ), σ_1^2(θ; µ)), N(m_2(θ; ν), σ_2^2(θ; ν)))], as required for a control variate, we could calculate the expectation of each term, i.e., E[(m_1(θ; µ) - m_2(θ; ν))^2], E[σ_1^2(θ; µ)], E[σ_2^2(θ; ν)], and E[σ_1(θ; µ)σ_2(θ; ν)].

Proposition 2. Given two discrete measures µ and ν, with m_1(θ; µ), m_2(θ; ν), σ_1^2(θ; µ), and σ_2^2(θ; ν) as in Proposition 1, we obtain:

E[(m_1(θ; µ) - m_2(θ; ν))^2] = (1/d) ||Σ_{i=1}^n α_i x_i - Σ_{i=1}^m β_i y_i||_2^2,
E[σ_1^2(θ; µ)] = (1/d) Σ_{i=1}^n α_i ||x_i - Σ_{i'=1}^n α_{i'} x_{i'}||_2^2,
E[σ_2^2(θ; ν)] = (1/d) Σ_{i=1}^m β_i ||y_i - Σ_{i'=1}^m β_{i'} y_{i'}||_2^2.

The proof of Proposition 2 is given in Appendix C.2. Unfortunately, we cannot compute the expectation of the last term, E[σ_1(θ; µ)σ_2(θ; ν)]. However, we can construct control variates using a lower bound and an upper bound of W_2^2(N(m_1(θ; µ), σ_1^2(θ; µ)), N(m_2(θ; ν), σ_2^2(θ; ν))).

Definition 3. Given the two Gaussian approximations of θ♯µ and θ♯ν, i.e., N(m_1(θ; µ), σ_1^2(θ; µ)) and N(m_2(θ; ν), σ_2^2(θ; ν)), we define the following two control variates:

C_low(θ; µ, ν) = (m_1(θ; µ) - m_2(θ; ν))^2,   (14)
C_up(θ; µ, ν) = (m_1(θ; µ) - m_2(θ; ν))^2 + σ_1^2(θ; µ) + σ_2^2(θ; ν).   (15)

We now discuss some properties of the proposed control variates.

Proposition 3. We have the following relationship:

C_low(θ; µ, ν) ≤ W_2^2(N(m_1(θ; µ), σ_1^2(θ; µ)), N(m_2(θ; ν), σ_2^2(θ; ν))) ≤ C_up(θ; µ, ν).   (16)

The proof of Proposition 3 follows directly from the construction of the control variates. For completeness, we provide the proof in Appendix C.3.

Proposition 4. Let µ = Σ_{i=1}^n α_i δ_{x_i} and ν = Σ_{i=1}^m β_i δ_{y_i}. Using the control variate C_low(θ; µ, ν) is equivalent to using the following control variate:

W_2^2(θ♯N(m_1(µ), σ_1^2(µ)I), θ♯N(m_2(ν), σ_2^2(ν)I)),   (17)

where m_1(µ), σ_1^2(µ) = argmax_{m_1, σ_1^2} Σ_{i=1}^n α_i log[ 1/√((2π)^d |σ_1^2 I|) exp(-(1/2)(x_i - m_1)^⊤(σ_1^2 I)^{-1}(x_i - m_1)) ] and m_2(ν), σ_2^2(ν) = argmax_{m_2, σ_2^2} Σ_{i=1}^m β_i log[ 1/√((2π)^d |σ_2^2 I|) exp(-(1/2)(y_i - m_2)^⊤(σ_2^2 I)^{-1}(y_i - m_2)) ].

The proof of Proposition 4 is given in Appendix C.4. The proposition means that using the control variate C_low(θ; µ, ν) is equivalent to using the Wasserstein-2 distance between two projected location-scale multivariate Gaussians, where these location-scale multivariate Gaussians are approximations of the two original probability measures µ and ν in the original space.
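To make Proposition 2 and Definition 3 concrete, the sketch below (ours; it assumes uniform weights α_i = 1/n and β_i = 1/m for brevity) computes C_low(θ_l) and C_up(θ_l) for a batch of projecting directions, together with their exact expectations B_low and B_up.

```python
import numpy as np

def control_variates(X, Y, theta):
    """C_low, C_up (Definition 3) and their expectations (Proposition 2).

    X: (n, d), Y: (m, d) supports with uniform weights; theta: (L, d) directions.
    """
    d = X.shape[1]
    x_proj, y_proj = X @ theta.T, Y @ theta.T            # (n, L), (m, L)
    m1, m2 = x_proj.mean(axis=0), y_proj.mean(axis=0)    # fitted means per theta
    s1, s2 = x_proj.var(axis=0), y_proj.var(axis=0)      # fitted variances per theta
    c_low = (m1 - m2) ** 2
    c_up = c_low + s1 + s2
    # Exact expectations over theta ~ U(S^{d-1}), using E[theta theta^T] = I/d.
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    b_low = np.sum((x_bar - y_bar) ** 2) / d
    b_up = b_low + np.mean(np.sum((X - x_bar) ** 2, axis=1)) / d \
                 + np.mean(np.sum((Y - y_bar) ** 2, axis=1)) / d
    return c_low, c_up, b_low, b_up
```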
Control variate estimators of the SW distance. From the two control variates in Definition 3, we define the corresponding controlled projected one-dimensional Wasserstein distances and control variate sliced Wasserstein estimators.

Definition 4. Given the two control variates C_low(θ; µ, ν) and C_up(θ; µ, ν) in Definition 3, the corresponding controlled projected one-dimensional Wasserstein distances are defined as:

Z_low(θ; µ, ν) = Z(θ; µ, ν, C_low(θ; µ, ν)),  Z_up(θ; µ, ν) = Z(θ; µ, ν, C_up(θ; µ, ν)),   (18)

where Z(θ; µ, ν, C(θ)) is defined in Definition 1.

Definition 5. Given Z_low(θ; µ, ν), Z_up(θ; µ, ν), and \widehat{CV-SW}_p^p(µ, ν; L, C(θ)) as defined in Definition 4 and Definition 2, the lower-bound and upper-bound control variate sliced Wasserstein estimators are:

\widehat{LCV-SW}_p^p(µ, ν; L) = (1/L) Σ_{l=1}^L Z_low(θ_l; µ, ν) = \widehat{CV-SW}_p^p(µ, ν; L, C_low(θ; µ, ν)),   (19)
\widehat{UCV-SW}_p^p(µ, ν; L) = (1/L) Σ_{l=1}^L Z_up(θ_l; µ, ν) = \widehat{CV-SW}_p^p(µ, ν; L, C_up(θ; µ, ν)).   (20)

We refer to Algorithms 2-3 in Appendix D for the detailed algorithms of the control variate estimators.

Computational complexities. When dealing with two discrete probability measures µ and ν that have at most n supports, projecting the supports of the two measures to L sets of projected supports costs O(Ldn) in time. After that, fitting the one-dimensional Gaussians also costs O(Ldn). From Proposition 2 and Definition 3, computing the control variate C_low(θ; µ, ν) and its expected value B_low(µ, ν) costs O(dn). Similarly, computing the control variate C_up(θ; µ, ν) and its expected value B_up(µ, ν) also costs O(dn). Overall, the time complexity of \widehat{LCV-SW}_p^p(µ, ν; L) and \widehat{UCV-SW}_p^p(µ, ν; L) is the same as that of the conventional SW estimator, which is O(Ln log n + Ldn). Similarly, their space complexity is also O(Ld + Ln), as for the conventional estimator.

Gaussian approximation of the sliced Wasserstein distance. In a recent work (Nadjahi et al., 2021), the closed form of the Wasserstein distance between Gaussian distributions is utilized to approximate the value of the SW distance. The key idea is based on the concentration of random projections in high dimension, i.e., they are approximately Gaussian. However, the resulting deterministic approximation is not very accurate, since it is only based on the first two moments of the two original measures. As a result, that approximation can only work well when the two measures have weak dimensional dependence. In contrast, we use the closed form of the Wasserstein distance between Gaussian distributions to design control variates, which reduce the variance of the Monte Carlo estimator. Our estimators are still stochastic, converge to the true value of the SW distance as the number of projections increases, and work well with probability measures that have arbitrary structures and dimensions.
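Putting the pieces together, here is a minimal NumPy sketch (ours, not the released CV-SW code) of the lower-bound estimator \widehat{LCV-SW}_2^2 for two uniform-weight point sets of equal size; it mirrors Algorithm 2 in Appendix D, and the upper-bound variant only changes the control variate and its expectation.

```python
import numpy as np

def lcv_sw_pp(X, Y, L=100, p=2, rng=None):
    """Lower-bound control variate estimator of SW_p^p (Definition 5)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    theta = rng.standard_normal((L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Per-projection one-dimensional Wasserstein distances W(theta_l; mu, nu).
    x_proj, y_proj = np.sort(X @ theta.T, axis=0), np.sort(Y @ theta.T, axis=0)
    w = np.mean(np.abs(x_proj - y_proj) ** p, axis=0)    # shape (L,)
    # Lower-bound control variate and its exact expectation (Proposition 2).
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    c = (theta @ (x_bar - y_bar)) ** 2                   # C_low(theta_l)
    b = np.sum((x_bar - y_bar) ** 2) / d                 # B_low
    # Control variate correction (Definition 2).
    sw_hat = np.mean(w)
    gamma_hat = np.mean((w - sw_hat) * (c - b)) / np.mean((c - b) ** 2)
    return sw_hat - gamma_hat * np.mean(c - b)
```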
Figure 1: The empirical errors of the conventional estimator (SW) and the control variate estimators (LCV-SW, UCV-SW) when comparing empirical distributions over MNIST images and point clouds.

Table 1: Summary of Wasserstein-2 distances (Flamary et al., 2021) (multiplied by 10^4) and computational time in seconds (s) to reach step 8000 for different sliced Wasserstein estimators in gradient flows, from three different runs.

| Distances | Step 3000 (W2) | Step 4000 (W2) | Step 5000 (W2) | Step 6000 (W2) | Step 8000 (W2) | Time (s) |
|---|---|---|---|---|---|---|
| SW L=10 | 305.5196 ± 0.4921 | 137.8767 ± 0.3631 | 36.2454 ± 0.1387 | 0.1164 ± 0.0022 | 2.1e-5 ± 1.0e-5 | 26.13 ± 0.04 |
| LCV-SW L=10 | 303.1001 ± 0.1787 | 136.0129 ± 0.0923 | 35.0929 ± 0.1459 | 0.0538 ± 0.0047 | 1.5e-5 ± 0.2e-5 | 27.22 ± 0.24 |
| UCV-SW L=10 | 303.1017 ± 0.1783 | 136.0139 ± 0.0921 | 35.0916 ± 0.1444 | 0.0535 ± 0.0065 | 1.4e-5 ± 0.2e-5 | 29.48 ± 0.23 |
| SW L=100 | 300.7326 ± 0.2375 | 134.1498 ± 0.3146 | 33.9253 ± 0.1349 | 0.0183 ± 0.0011 | 1.6e-5 ± 0.2e-5 | 220.20 ± 0.21 |
| LCV-SW L=100 | 300.3924 ± 0.0053 | 133.6243 ± 0.0065 | 33.5102 ± 0.0031 | 0.0134 ± 8.7e-5 | 1.4e-5 ± 0.1e-5 | 221.75 ± 0.35 |
| UCV-SW L=100 | 300.3927 ± 0.0050 | 133.6242 ± 0.0065 | 33.5088 ± 0.0028 | 0.0136 ± 0.0003 | 1.4e-5 ± 0.2e-5 | 233.71 ± 0.06 |

Beyond the uniform slicing distribution and related works. In the SW distance, the projecting direction follows the uniform distribution over the unit hypersphere, σ_0(θ) = U(S^{d-1}). In some cases, changing the slicing distribution might benefit downstream applications. For example, the distributional sliced Wasserstein distance (Nguyen et al., 2021) can be written as DSW_p^p(µ, ν) = E_{θ ~ σ(θ)}[W_p^p(θ♯µ, θ♯ν)], where σ(θ) is a distribution over the unit hypersphere. We can directly apply the proposed control variates to the DSW as long as we can calculate E_{θ ~ σ(θ)}[θθ^⊤]. If this is not the case, we can still use the proposed control variates by approximating the DSW via importance sampling. In particular, we can rewrite DSW_p^p(µ, ν) = E_{θ ~ σ_0(θ)}[W_p^p(θ♯µ, θ♯ν) σ(θ)/σ_0(θ)] and then use the control variates. We discuss other related works, including how to apply the control variates to other variants of the SW distance, in Appendix E.

Gradient estimators. In some cases, we might be interested in obtaining the gradient ∇_φ SW_p^p(µ, ν_φ) = ∇_φ E[W_p^p(θ♯µ, θ♯ν_φ)], e.g., in statistical inference. Since the expectation in the SW distance does not depend on φ, we can use the Leibniz rule to exchange differentiation and expectation, namely, ∇_φ E[W_p^p(θ♯µ, θ♯ν_φ)] = E[∇_φ W_p^p(θ♯µ, θ♯ν_φ)]. After that, we could use ∇_φ C(θ; µ, ν_φ) (either ∇_φ C_low(θ; µ, ν_φ) or ∇_φ C_up(θ; µ, ν_φ)) as the control variate. However, estimating the optimal γ_i* = Cov[∂_{φ_i} W_p^p(θ♯µ, θ♯ν_φ), ∂_{φ_i} C(θ; µ, ν_φ)] / Var[∂_{φ_i} C(θ; µ, ν_φ)] (for i = 1, ..., d, where d here denotes the number of dimensions of the parameters φ) is computationally expensive and depends significantly on the specific setting of ν_φ. Therefore, we utilize a more computationally efficient estimator of the gradient, namely E[∇_φ Z(θ; µ, ν_φ)], where Z(θ; µ, ν_φ) is Z_low(θ; µ, ν_φ) or Z_up(θ; µ, ν_φ) in Definition 4. After that, Monte Carlo samples are used to approximate the expectations to obtain a stochastic gradient.
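As an illustration of this gradient estimator, the PyTorch sketch below (ours; it assumes equal-size, uniform-weight point sets and treats the estimated coefficient \hat{γ}_L as a constant by detaching it, a choice the paper does not prescribe) backpropagates through the controlled quantity Z_low.

```python
import torch

def lcv_sw_pp_torch(X, Y, L=100, p=2):
    """Differentiable LCV-SW_p^p; gradients flow to X and Y via autograd."""
    n, d = X.shape
    theta = torch.randn(L, d, device=X.device, dtype=X.dtype)
    theta = theta / theta.norm(dim=1, keepdim=True)
    x_proj = torch.sort(X @ theta.T, dim=0).values
    y_proj = torch.sort(Y @ theta.T, dim=0).values
    w = torch.mean(torch.abs(x_proj - y_proj) ** p, dim=0)        # W(theta_l)
    diff_mean = X.mean(dim=0) - Y.mean(dim=0)
    c = (theta @ diff_mean) ** 2                                   # C_low(theta_l)
    b = torch.sum(diff_mean ** 2) / d                              # B_low
    sw_hat = w.mean()
    # Treat the estimated coefficient as a constant when differentiating.
    gamma_hat = (((w - sw_hat) * (c - b)).mean() / ((c - b) ** 2).mean()).detach()
    return sw_hat - gamma_hat * (c - b).mean()
```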
4 EXPERIMENTS

Our experiments aim to show that using control variates can benefit applications of the sliced Wasserstein distance. In particular, we show that the proposed control variate estimators have lower variance in practice when comparing probability measures over images and point clouds in Section 4.1. After that, we show that the proposed control variate estimators can help a gradient flow converge faster from a source probability measure to a target probability measure while retaining approximately the same computation in Section 4.2. Finally, we show that the proposed control variate estimators can also improve the training of deep generative models on standard benchmark image datasets such as CIFAR10 and CelebA in Section 4.3. Due to the space constraint, additional experiments and detailed experimental settings are deferred to Appendix F.

4.1 COMPARING EMPIRICAL PROBABILITY MEASURES OVER IMAGES AND POINT-CLOUDS

Settings. We first test how well the proposed control variates can reduce the variance in practice. To do this, we use the conventional Monte Carlo estimator of SW_2^2 with a very large L value of 100000 and treat it as the population value. Next, we vary L in the set {2, 5, 10, 50, 100, 500, 1000, 5000, 7000, 10000} and compute the corresponding estimates of the SW distance with both the conventional estimator and the control variate estimators. Finally, we calculate the absolute difference between the estimates and the estimated population value as the estimated error. We apply this process to compare the empirical distribution over images of digit 0 with the empirical distribution over images of digit 1 in MNIST (LeCun et al., 1998) (784 dimensions, about 6000 supports). Similarly, we compare the empirical distributions over two randomly chosen point clouds in the ShapeNet Core-55 dataset (Chang et al., 2015) (3 dimensions, 2048 supports). We also estimate the variance of the estimators via Monte Carlo samples with L = 100000 in Table 3 in Appendix F.1.

Results. We show the estimated errors and the computational times, averaged over 5 different runs, in Figure 1. From the figure, we observe that the control variate estimators reduce the error considerably in both cases. On MNIST, the lower bound control variate estimator (LCV-SW) gives a lower error than the upper bound control variate estimator (UCV-SW) while being slightly faster; the computational time of the LCV-SW is nearly identical to that of the conventional estimator (SW). In the lower-dimensional case, i.e., the point-cloud case, the two control variates give the same quality of error reduction, and the computational time of both control variate estimators is the same as that of the conventional estimator. From Table 3 in Appendix F.1, we observe that the variance of the control variate estimators is considerably lower than that of the conventional estimator, especially for the LCV-SW. We refer to Appendix F.1 for experiments on different pairs of digits and point clouds, which show the same phenomenon.

4.2 POINT CLOUD GRADIENT FLOWS

Settings. We model a distribution µ(t) flowing with time t along the sliced Wasserstein gradient flow µ̇(t) = -∇ SW_p(µ(t), ν), which drives it towards a target distribution ν (Santambrogio, 2015). Here, we set ν = (1/n) Σ_{i=1}^n δ_{Y_i} as a fixed empirical target distribution and the model distribution µ(t) = (1/n) Σ_{i=1}^n δ_{X_i(t)}, where the time-varying point cloud is X(t) = (X_i(t))_{i=1}^n ∈ R^{d×n}. Starting at time t = 0, we integrate the ordinary differential equation Ẋ(t) = -n ∇_{X(t)} SW_p((1/n) Σ_{i=1}^n δ_{X_i(t)}, ν) over iterations. We choose µ(0) and ν as two random point-cloud shapes from the ShapeNet Core-55 dataset (Chang et al., 2015), which were also used in Section 4.1. We then estimate the gradient with different estimators of the SW distance in place of SW_p, i.e., the conventional estimator \widehat{SW}_p and the proposed control variate estimators \widehat{LCV-SW}_p and \widehat{UCV-SW}_p. Finally, we use p = 2 and the Euler scheme with 8000 iterations and step size 0.01 to solve the flows.
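A minimal sketch of this Euler scheme (ours), reusing the differentiable `lcv_sw_pp_torch` function sketched above; for simplicity it follows the gradient of the estimated SW_2^2 rather than SW_2, and the target point cloud Y stays fixed while X is updated.

```python
import torch

def sw_gradient_flow(X0, Y, n_steps=8000, step_size=0.01, L=10):
    """Euler integration of the (estimated) sliced Wasserstein gradient flow."""
    X = X0.detach().clone()
    n = X.shape[0]
    for _ in range(n_steps):
        X.requires_grad_(True)
        loss = lcv_sw_pp_torch(X, Y, L=L, p=2)   # differentiable LCV-SW_2^2
        grad = torch.autograd.grad(loss, X)[0]
        X = (X - step_size * n * grad).detach()  # X(t+1) = X(t) - eta * n * grad
    return X
```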
Figure 2: Point-cloud gradient flows for L = 10 from SW, LCV-SW, and UCV-SW, respectively.

Results. We use the Wasserstein-2 distance as a neutral metric to evaluate how close the model distribution µ(t) is to the target distribution ν. We show the results for the conventional Monte Carlo estimator (denoted SW), the control variate C_low estimator (denoted LCV-SW), and the control variate C_up estimator (denoted UCV-SW), with the number of projections L = 10, 100, in Table 1. In the table, we also report the time in seconds that each estimator needs to reach the last step. In addition, we visualize the flows for L = 100 in Figure 2. Moreover, we observe the same phenomenon on a different pair of point clouds in Table 4 and Figure 6 in Appendix F.2.

4.3 DEEP GENERATIVE MODELING

Settings. We conduct deep generative modeling on CIFAR10 (image size 32x32) (Krizhevsky et al., 2009) and non-cropped CelebA (image size 64x64) (Liu et al., 2015) with the SNGAN architecture (Miyato et al., 2018). For the framework, we follow the sliced Wasserstein generator (Deshpande et al., 2018; Nguyen & Ho, 2023), which is described in detail in Appendix F.3. The main evaluation metrics are the FID score (Heusel et al., 2017) and the Inception score (IS) (Salimans et al., 2016). On CelebA, we do not report the IS since it poorly captures the perceptual quality of face images (Heusel et al., 2017). The detailed settings for architectures, hyper-parameters, and the evaluation of FID and IS are provided in Appendix F.3.

Figure 3: Randomly generated images on CIFAR10 and CelebA from models trained with SW (L=10), LCV-SW (L=10), and UCV-SW (L=10).

Table 2: Summary of FID and IS scores from three different runs on CIFAR10 (32x32) and CelebA (64x64).

| Method | CIFAR10 (32x32) FID (↓) | CIFAR10 (32x32) IS (↑) | CelebA (64x64) FID (↓) |
|---|---|---|---|
| SW L=10 | 21.72 ± 0.37 | 7.54 ± 0.06 | 11.14 ± 1.12 |
| LCV-SW L=10 | 18.67 ± 0.39 | 7.74 ± 0.04 | 10.17 ± 0.73 |
| UCV-SW L=10 | 21.71 ± 0.36 | 7.57 ± 0.01 | 11.00 ± 0.04 |
| SW L=1000 | 14.07 ± 0.97 | 8.07 ± 0.06 | 10.05 ± 0.40 |
| LCV-SW L=1000 | 13.58 ± 0.12 | 8.23 ± 0.02 | 9.78 ± 0.47 |
| UCV-SW L=1000 | 13.21 ± 0.49 | 8.21 ± 0.09 | 9.51 ± 0.30 |

Results. We train generators with the conventional estimator (SW) and the proposed control variate estimators (LCV-SW and UCV-SW) with the number of projections L = 10, 1000. We report FID and IS scores in Table 2. From the table, we observe that the LCV-SW gives the best generative models on both CIFAR10 and CelebA when L = 10. In particular, the LCV-SW reduces the FID score by about 14% on CIFAR10 and by about 10% on CelebA, and it increases the IS score by about 2.6%. The UCV-SW enhances the performance slightly compared with the SW when L = 10. It is worth noting that the computational times of the estimators are approximately the same, since the main computation in this application is for training the neural networks; hence, the benefit of reducing the variance with control variates is clearly visible here. Increasing L to 1000, although the gap becomes smaller, the control variate estimators still give better scores than the conventional estimator. In greater detail, the LCV-SW improves the FID score by about 3.5% on CIFAR10, the IS score by about 2% on CIFAR10, and the FID score by about 2.7% on CelebA compared to the SW. In this setting of L, the UCV-SW gives the best FID scores on both CIFAR10 and CelebA. Concretely, it decreases the FID score by about 6.1% on CIFAR10 and about 5.4% on CelebA.
From the result, the UCV-SW seems to need more projections than the LCV-SW to approximate well the optimal control variate coefficient (γ ) since the corresponding control variate of the UCV-SW has two more random terms. For the qualitative result, we show randomly generated images from trained models in Figure 3 for L = 10 and in Figure 7 for L = 1000 in Appendix F.3. Overall, we obverse that generated images are visually consistent to the FID scores and IS scores in Table 2. 5 CONCLUSION We have presented a method for reducing the variance of Monte Carlo estimation of the sliced Wasserstein (SW) distance using control variates. Moreover, we propose two novel control variates that serve as lower and upper bounds for the Wasserstein distance between two Gaussian approximations of two measures. By using the closed-form solution of the Wasserstein distance between two Gaussians and the closed-form of Gaussian fitting of discrete measures via Kullback Leibler divergence, we demonstrate that the proposed control variates can be computed in linear time. Consequently, the control variate estimators have the same computational complexity as the conventional estimator of SW distance. On the experimental side, we demonstrate that the proposed control variate estimators have smaller variances than the conventional estimator when comparing probability measures over images and point clouds. Finally, we show that using the proposed control variate estimators can lead to improved performance of point-cloud SW gradient flows and generative models. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENTS We would like to thank Peter Mueller for his insightful discussion during the course of this project. NH acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214 223, 2017. Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh Tan Pham. Spherical sliced-wasserstein. In The Eleventh International Conference on Learning Representations, 2023. Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 1(51):22 45, 2015. Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. ar Xiv preprint ar Xiv:1512.03012, 2015. Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853 1865, 2016. Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 447 463, 2018. Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3483 3491, 2018. Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. 
Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/20-451.html. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626 6637, 2017. Nhat Ho, Xuan Long Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501 1509, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Soheil Kolouri, Gustavo K Rohde, and Heiko Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427 3436, 2018. Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, pp. 261 272, 2019. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Master s thesis, Department of Computer Science, University of Toronto, 2009. Published as a conference paper at ICLR 2024 Thanh Tung Le, Khai Nguyen, shanlin sun, Kun Han, Nhat Ho, and Xiaohui Xie. Diffeomorphic mesh deformation via efficient optimal transport for cortical surface reconstruction. In The Twelfth International Conference on Learning Representations, 2024a. Tung Le, Khai Nguyen, Shanlin Sun, Nhat Ho, and Xiaohui Xie. Integrating efficient optimal transport and functional maps for unsupervised shape correspondence learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024b. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10285 10295, 2019. Tao Li, Cheng Meng, Jun Yu, and Hongteng Xu. Hilbert curve projection distance for distribution comparison. ar Xiv preprint ar Xiv:2205.15059, 2022. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Kimia Nadjahi, Valentin De Bortoli, Alain Durmus, Roland Badeau, and Umut Sim sekli. Approximate Bayesian computation with the sliced-Wasserstein distance. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5470 5474. IEEE, 2020a. Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020b. Kimia Nadjahi, Alain Durmus, Pierre E Jacob, Roland Badeau, and Umut Simsekli. 
Fast approximation of the sliced-Wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34, 2021. Khai Nguyen and Nhat Ho. Energy-based sliced Wasserstein distance. Advances in Neural Information Processing Systems, 2023. Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021. Khai Nguyen, Dang Nguyen, and Nhat Ho. Self-attention amortized distributional projection optimization for sliced wasserstein point-cloud reconstruction. Proceedings of the 40th International Conference on Machine Learning, 2023. Khai Nguyen, Nicola Bariletto, and Nhat Ho. Quasi-monte carlo for 3d sliced Wasserstein. In The Twelfth International Conference on Learning Representations, 2024a. Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced Wasserstein with random-path projecting directions. ar Xiv preprint ar Xiv:2401.15889, 2024b. Art B Owen. Monte carlo theory, methods and examples. 2013. Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016. Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. Published as a conference paper at ICLR 2024 Artem Savkin, Yida Wang, Sebastian Wirkert, Nassir Navab, and Federico Tombari. Lidar upsampling with sliced wasserstein distance. IEEE Robotics and Automation Letters, 8(1):392 399, 2022. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. In International Conference on Learning Representations, 2018. Cédric Villani. Optimal transport: Old and New. Springer, 2008a. Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008b. Jie Wang, Rui Gao, and Yao Xie. Two-sample test with kernel projected wasserstein distance. In International Conference on Artificial Intelligence and Statistics, pp. 8022 8055. PMLR, 2022. Mingxuan Yi and Song Liu. Sliced Wasserstein variational inference. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021. Published as a conference paper at ICLR 2024 Supplement to Sliced Wasserstein Estimation with Control Variates" We first provide the skipped proofs in Appendix C. Additionally, we present the detailed algorithms for the conventional estimator of the sliced Wasserstein distance and the control variate estimators in Appendix D. Sliced Wasserstein variants are discussed in Appendix E, and we also discuss how to use control variates in those variants in the same appendix. In Appendix F, we provide additional experiments and their detailed experimental settings. Finally, we report the computational infrastructure in Appendix G. A DISCUSSION Laplace Approximation. We are interested in approximating a continuous distribution µ by an Gaussian N(m, σ2). We assume that we know the pdf of µ which is doubly differentiable and referred to as f(x). We first find m such that f (m) = 0. After, we use the second-order Taylor expansion of log f(x) around m: log f(x) log f(m) 1 2σ2 (x m)2, where the first order does not appear since f (m) = 0 and σ2 = 1 dx2 log f(x) x=m . Sampling-based approximation. 
We are interested in approximating a continuous distribution µ by an Gaussian N(m, σ2). Here, we assume that we do not know the pdf of µ, however, we can sample from X1, . . . , Xn i.i.d µ. Therefore, we can approximate µ by N(m, σ2) with m = 1 n Pn i=1 Xi and σ2 = 1 n Pn i=1(Xi m)2 which is equivalent to doing maximum likelihood estimate, moment matching, and doing optimization in Equation 10 with the proxy empirical measure µn = 1 n Pn i=1 δXi. B NON-PARAMETRIC HYPOTHESIS TESTING We follow the non-parametric two-sample test in (Wang et al., 2022). We use the permutation test for two cases. The first case is with two measures as Gaussians with diagonal covariance matrix and the second case is with two measures as Gaussians with full covariance matrix. We use the same hyper-parameters setting as in Section 5.1 in (Wang et al., 2022) and use SW, LCV-SW, and UCV-SW with L = 100 as the testing distance. We obtain the result in Figure 4. 22 23 24 25 26 27 28 Dimension D Average Power Power for Covariance Shifted (diagonal) Gaussian 22 23 24 25 26 27 28 Dimension D Average Power Power for Covariance Shifted (non-diagonal) Gaussian 22 23 24 25 26 27 28 Dimension D Average Type-I Error Type-I Error for Gaussian SW LCV-SW UCV-SW Figure 4: Testing results on Gaussian distributions across different choices of dimension D. Left: power for Gaussian distributions, where the shifted covariance matrix is still diagonal; Middle: power for Gaussian distributions, where the shifted covariance matrix is non-diagonal; Right: Type-I error. C.1 PROOF OF PROPOSITION 1 We have µ and ν are two empirical measures i.e., µ = Pn i=1 αiδxi (Pn i=1 αi = 1) and ν = Pm i=1 βiδyi (Pm i=1 βi = 1), we have their projected measures θ µ = Pn i=1 αiδθ xi and θ ν = Published as a conference paper at ICLR 2024 Pn i=1 βiδθ yi. Now, we have: argmaxm1,σ2 1 2πσ2 1 exp 1 2σ2 1 (θ xi m1)2 !# = argmaxm1,σ2 1 2σ2 1 (θ xi m1)2 αi 2 log(σ2 1) # = argmaxm1,σ2 1f(m1, σ2 1). Taking the derivatives and setting them to 0, we have: d dm1 f(m1, σ2 1) = αi σ2 1 (m1 θ xi) = 1 d dσ2 1 f(m1, σ2 1) = 2σ4 1 (θ xi m1)2 αi αi σ2 1 (θ xi m1)2 Hence, we obtain: m1 = Pn i=1 αiθ xi Pn i=1 αi = i=1 αiθ xi, σ2 1 = Pn i=1 αi(θ xi m1)2 Pn i=1 αi = i=1 αi(θ xi m1)2. With similar derivation, we obtain the result for m2, σ2 2 and complete the proof. C.2 PROOF OF PROPOSITION 2 We first prove the following lemma. Lemma 1. Let θ is a uniformely distributed vector on the unit-hypersphere i.e., θ U(Sd 1), we have E[θθ ] = 1 Proof. Since we can obtain a uniformly distributed vector θ on Sd 1 by normalizing a unit Gaussian distributed vector i.e., θ = z ||z||2 with z N(0, Id). Therefore, we can rewrite: E[θθ ] = E z ||z||2 Since the expectation of a random matrix is a matrix of expectation, we now need to calculate E[θiθj] or E h zi ||z||2 zj ||z||2 i for all i, j [[d]]. Let denote θ = (θ1, . . . , θd) and z = (z1, . . . , zd), we have θ = (θ1, . . . , θi, . . . , θd) (for any i [[d]] := {1, 2, . . . , d} ) also follow U(Sd 1) since θ = z ||z ||2 with z = (z1, . . . , zi, . . . , zd) N(0, Id) (linear mapping of a Gaussian is a Gaussian). Therefore, we have E[θiθj] = E[θiθj] which means E[θiθj] = 0 for all j [[d]] = i. Now, we calculate the variance E[θ2 i ] for all i [[d]]. We have i=1 E[θ2 i ] = i=1 E z2 i ||z||2 2 " z2 i Pd j=1 z2 j Moreover, θ1, . . . , θd are exchangeable with a similar argument with the symmetry and linear mapping property of Gaussian distribution. Therefore, they are indentically distributed i.e., E[θ2 i ] = E[θ2 j] = 1/d for all i, j [[d]]. 
We complete the proof here. Published as a conference paper at ICLR 2024 We now go back to the proof of Proposition 2. From Proposition 1, we have: E[(m1(θ; µ) m2(θ; ν))2] = E where we use the result of Eθ U(Sd 1)[θθ ] = 1 d from Lemma 1. Next, we have: E σ2 1(θ; µ) = E i =1 αi θ xi Similarly, we have E σ2 2(θ; ν) = 1 d Pm i=1 βi yi Pm i =1 βi yi 2 . C.3 PROOF OF PROPOSITION 3 We recall that: Clow(θ; µ, ν) = (m1(θ; µ) m2(θ; ν))2, Cup(θ; µ, ν) = (m1(θ; µ) m2(θ; ν))2 + σ1(θ; µ)2 + σ2(θ; ν)2, W2 2(N(m1(θ; µ), σ2 1(θ; µ)), N(m2(θ; ν), σ2 2(θ; ν))) = (m1(θ; µ) m2(θ; ν))2 + (σ1(θ; µ) σ2(θ; ν))2. Since 0 (σ1(θ; µ) σ2(θ; ν))2 σ1(θ; µ)2+σ2(θ; ν)2, we obtain the result by adding (m1(θ; µ) m2(θ; ν))2 to the inequalities. C.4 PROOF OF PROPOSITION 4 We have µ and ν are two empirical measures i.e., µ = Pn i=1 αiδxi (Pn i=1 αi = 1) and ν = Pm i=1 βiδyi (Pm i=1 βi = 1). Now, we have: argmaxm1,σ2 1 (2π)d|σ2 1I| exp 1 2(xi m1) |σ2 1I| 1(xi m1) ! = argmaxm1,σ2 1 1 2σ2d 1 (xi m1) (xi m1) d 2 log(σ2 1) = argmaxm1,σ2 1f(m1, σ2 1) Taking the derivatives and setting them to 0, we have: m1f(m1, σ2 1) = αi σ2d 1 (m1 xi) = 1 σ2d 1 i=1 αim1 αixi d dσ2 1 f(m1, σ2 1) = d 2σ2d+2 1 (xi m1) (xi m1) d 2σ2 1 αi xi m1 2 2 σ2d 1 αi Published as a conference paper at ICLR 2024 Hence, we obtain: m1 = Pn i=1 αixi Pn i=1 αi = σ2 1 = Pn i=1 αi xi m1 2 2 Pn i=1 αi i=1 αi xi m1 2 2 Similarly, we obtain m2 = Pm i=1 βiyi and σ2 2 = Pm i=1 βi yi m2 2 2 1 d . Using the linearity of Gaussian distributions, we have θ N(m1, σ2 1I) = N(θ m1, σ2 1θ Iθ) = N(θ m1, σ2 1) (θ θ = 1). Similarly, we obtain θ N(m2, σ2 2I) = N(θ m2, σ2 2). Therefore, we have: W2 2(θ N(m1, σ2 1I), θ N(m2, σ2 2I)) = (θ m1 θ m2)2 + (σ1 σ2)2. Calculating the expectation, we have: E W2 2(θ N(m1, σ2 1I), θ N(m2, σ2 2I)) = (m1 m2) E[θθ ](m1 m2) + (σ1 σ2)2 d m1 m2 2 2 + (σ1 σ2)2. Let denote C(θ) = W2 2(θ N(m1, σ2 1I), θ N(m2, σ2 2I)), we have: C(θ) E[C(θ)] = (θ m1 θ m2)2 1 d m1 m2 2 2 i=1 αixi θ m X 2 = Clow(θ; µ, ν) E[Clow(θ; µ, ν)]. Similarly, we obtain: Var [C(θ)] = E (C(θ) E[C(θ)])2 = E (Clow(θ; µ, ν) E[Clow(θ; µ, ν)])2 = Var[Clow(θ; µ, ν)], Cov [C(θ), W(θ; µ, ν)] = E [(C(θ) E[C(θ)])(W(θ; µ, ν) E[W(θ; µ, ν)])] = E [(Clow(θ; µ, ν) E[Clow(θ; µ, ν)])(W(θ; µ, ν) E[W(θ; µ, ν)])] = Cov [Clow(θ; µ, ν), W(θ; µ, ν)] . Therefore, it is sufficient to claim that using the Clow control variate is equivalent to use W2 2(θ N(m1, σ2 1I), θ N(m2, σ2 2I)) as the control variate. D ALGORITHMS We present the algorithm for computing the conventional Monte Carlo estimator of the sliced Wasserstein distance between two discrete measures in Algorithm 1. Similarly, we provide the algorithms for the lower bound control variate estimator and the upper bound control variate estimator in Algorithm 2 and in Algorithm 3 respectively. E RELATED WORKS Sliced Wasserstein variants with different projecting functions. The conventional sliced Wasserstein is based on a linear projecting function i.e., the inner product to project measures to one dimension. In some special cases of probability measures, some other projecting functions might be preferred e.g., generalized sliced Wasserstein (Kolouri et al., 2019) distance with circular, polynomial projecting function, spherical sliced Wasserstein (Bonet et al., 2023) with geodesic spherical projecting function, and so on. Despite having different projecting functions, all mentioned sliced Published as a conference paper at ICLR 2024 Algorithm 1 The conventional estimator of sliced Wasserstein distance. 
Input: Probability measures µ = Pn i=1 αiδxi and ν = Pm i=1 βiδyi, p 1, and the number of projections L. Set d SW p p(µ, ν; L) = 0 for l = 1 to L do Sample θl U(Sd 1) Compute d SW p p(µ, ν; L) = d SW p p(µ, ν; L) + 1 L R 1 0 |F 1 θl µ(z) F 1 θl ν(z)|pdz end for Return: d SW p p(µ, ν; L) Algorithm 2 The lower bound control variate estimator of sliced Wasserstein distance. Input: Probability measures µ = Pn i=1 αiδxi and ν = Pm i=1 βiδyi, p 1, and the number of projections L. Set d SW p p(µ, ν; L) = 0 Compute x = Pn i=1 αixi, and y = Pm i=1 βiyi for l = 1 to L do Sample θl U(Sd 1) Compute wl = R 1 0 |F 1 θl µ(z) F 1 θl ν(z)|pdz Compute cl = (θ x θ y)2 end for Compute d SW p p(µ, ν; L) = 1 L PL l=1 wl Compute b = 1 Compute γ = 1 L PL l=1(wl c SW p p(µ,ν;L))(cl b) 1 L PL l=1(cl b)2 Compute \ LCV-SW p p(µ, ν; L) = d SW p p(µ, ν; L) γ 1 L PL l=1(cl b) Return: \ LCV-SW p p(µ, ν; L) Algorithm 3 The upper bound control variate estimator of sliced Wasserstein distance. Input: Probability measures µ = Pn i=1 αiδxi and ν = Pm i=1 βiδyi, p 1, and the number of projections L. Set d SW p p(µ, ν; L) = 0 Compute x = Pn i=1 αixi, and y = Pm i=1 βiyi for l = 1 to L do Sample θl U(Sd 1) Compute wl = R 1 0 |F 1 θl µ(z) F 1 θl ν(z)|pdz Compute cl = (θ x θ y)2 + Pn i=1 αi(θ xi θ x)2 + Pm i=1 βi(θ yi θ y)2 end for Compute d SW p p(µ, ν; L) = 1 L PL l=1 wl Compute b = 1 d x y 2 2 + 1 d Pn i=1 αi (xi x) 2 2 + 1 d Pm i=1 βi (yi y) 2 2 Compute γ = 1 L PL l=1(wl c SW p p(µ,ν;L))(cl b) 1 L PL l=1(cl b)2 Compute \ UCV-SW p p(µ, ν; L) = d SW p p(µ, ν; L) γ 1 L PL l=1(cl b) Return: \ UCV-SW p p(µ, ν; L) Wasserstein variants utilize random projecting directions that follow the uniform distribution over the unit hypersphere and are estimated using the Monte Carlo integration scheme. Therefore, we could directly adapt the proposed control variates in these variants. Applications of the control variate estimators. Since the proposed control variates are used to estimate the SW distance, they can be applied to all applications where sliced Wasserstein exists. We would like to mention some other applications such as domain adaptation (Lee et al., 2019), Published as a conference paper at ICLR 2024 0 2000 4000 6000 8000 10000 L Absolute Error SW LCV-SW UCV-SW 0 2000 4000 6000 8000 10000 L SW LCV-SW UCV-SW 0 2000 4000 6000 8000 10000 L Absolute Error 3D Point-Cloud SW LCV-SW UCV-SW 0 2000 4000 6000 8000 10000 L 3D Point-Cloud SW LCV-SW UCV-SW Figure 5: The empirical errors of the conventional estimator (SW) and the control variate estimators (LCV-SW, UCV-SW) when comparing empirical distributions over MNIST images and point-clouds. Table 3: Estimated variances of the conventional estimator s and the control variate estimators. Estimator MNIST 0-1 MNIST 1-7 Point-cloud-1 Point-cloud-2 SW 4700.87 59.12 1205.62 14.17 12.78 0.025 12.79 0.026 LCV-SW 0.045 0.008 0.0017 0.001 0.0025 0.0000 0.0021 0.0000 UCV-SW 0.061 0.016 0.0018 0.0001 0.0025 0.0000 0.0021 0.0000 Table 4: Summary of Wasserstein-2 scores (multiplied by 104) from 3 different runs, computational time in second (s) to reach step 500 of different sliced Wasserstein variants in gradient flows. 
| Distances | Step 3000 (W2) | Step 4000 (W2) | Step 5000 (W2) | Step 6000 (W2) | Step 8000 (W2) | Time (s) |
|---|---|---|---|---|---|---|
| SW L=10 | 305.3907 ± 0.4919 | 137.7762 ± 0.3630 | 36.1807 ± 0.1383 | 0.1054 ± 0.0022 | 2.3e-5 ± 1.0e-5 | 26.30 ± 0.03 |
| LCV-SW L=10 | 302.9718 ± 0.1788 | 135.9132 ± 0.0922 | 35.0292 ± 0.1457 | 0.0452 ± 0.0045 | 1.7e-5 ± 0.3e-5 | 27.65 ± 0.01 |
| UCV-SW L=10 | 302.9717 ± 0.1788 | 135.9132 ± 0.0922 | 35.0295 ± 0.1458 | 0.0446 ± 0.0038 | 2.0e-5 ± 0.4e-5 | 29.56 ± 0.01 |
| SW L=100 | 300.6303 ± 0.2375 | 134.0492 ± 0.3146 | 33.8608 ± 0.1348 | 0.0121 ± 0.0010 | 1.6e-5 ± 0.2e-5 | 222.06 ± 1.34 |
| LCV-SW L=100 | 300.2362 ± 0.0054 | 133.5238 ± 0.0065 | 33.4460 ± 0.0030 | 0.0084 ± 5.8e-5 | 1.4e-5 ± 0.1e-5 | 223.79 ± 0.82 |
| UCV-SW L=100 | 300.2631 ± 0.0054 | 133.5237 ± 0.0065 | 33.4459 ± 0.0030 | 0.0083 ± 8.5e-5 | 1.6e-5 ± 0.1e-5 | 235.29 ± 1.65 |

approximate Bayesian computation (Nadjahi et al., 2020a), color transfer (Li et al., 2022), point-cloud reconstruction (Nguyen et al., 2024a), mesh deformation (Le et al., 2024a), diffusion models (Nguyen et al., 2024b), and many other tasks.

F ADDITIONAL EXPERIMENTS

In this section, we provide some additional experiments for the applications in the main text. In particular, we report the empirical variances of the conventional estimator and the control variate estimators on different images of digits and point clouds in Appendix F.1. We provide a point-cloud gradient flow between two new point clouds in Appendix F.2. We then provide the detailed training setup of the deep generative models and additional generated images in Appendix F.3.

F.1 COMPARING EMPIRICAL PROBABILITY MEASURES OVER IMAGES AND POINT-CLOUDS

Settings. We follow the same settings as in the main text. However, for MNIST, we compare probability measures over digit 1 and digit 7. For point clouds, we compare two point clouds that are different from those in the main text. We report the estimated variances of W(θ; µ, ν), Z_low(θ; µ, ν), and Z_up(θ; µ, ν), defined in Definition 4, in Table 3, using a large number of Monte Carlo samples, e.g., 100000. In the table, point-cloud-1 denotes the pair in Figure 2 and point-cloud-2 denotes the pair in Figure 6.

Results. We show the estimated errors and the corresponding computational times in Figure 5. From the figure, we observe the same phenomenon as in the main text. In particular, the control variate estimators reduce the errors considerably in both cases while having approximately the same computational time.

Figure 6: Point-cloud gradient flows for L = 10 from SW, LCV-SW, and UCV-SW.

Figure 7: Randomly generated images on CIFAR10 and CelebA from models trained with SW (L=1000), LCV-SW (L=1000), and UCV-SW (L=1000).

F.2 POINT-CLOUD GRADIENT FLOWS

Settings. We follow the same settings as in the main text. However, we use different point clouds, which are also used in Appendix F.1.

Results. We show the quantitative results in Table 4 and the corresponding qualitative results in Figure 6. Overall, we observe the same phenomenon as in the main text. In particular, the control variate estimators, i.e., LCV-SW and UCV-SW, help drive the flows to converge faster to the target point cloud than the conventional estimator, i.e., SW. It is worth noting that the computational time of the control variate estimators is only slightly higher than that of the conventional estimator for both settings of the number of projections, L = 10, 100.

F.3 DEEP GENERATIVE MODELING

Setting. We denote µ as our data distribution.
We then design the model distribution νϕ as a push forward probability measure that is created by pushing a unit multivariate Gaussian (ϵ) through a Published as a conference paper at ICLR 2024 neural network Gϕ that maps from the realization of the noise to the data space. We use a second neural network Tβ that maps from data space to a single scalar. We denote Tβ1 is the sub neural network of Tβ that maps from the data space to a feature space (output of the last Resnet block), and Tβ2 that maps from the feature space (image of Tβ1) to a single scalar. More precisely, Tβ = Tβ2 Tβ1. We use the following neural networks for Gϕ and Tβ: Gϕ: z R128( ϵ : N(0, 1)) 4 4 256(Dense, Linear) Res Block up 256 Res Block up 256 Res Block up 256 BN, Re LU, 3 3 conv, 3 Tanh . Tβ1: x [ 1, 1]32 32 3 Res Block down 128 Res Block down 128 Res Block down 128 Res Block 128 Res Block 128. Tβ2: x R128 8 8 Re LU Global sum pooling(128) 1(Spectral normalization). Tβ(x) = Tβ2(Tβ1(x)). Gϕ: z R128( ϵ : N(0, 1)) 4 4 256(Dense, Linear) Res Block up 256 Res Block up 256 Res Block up 256 Res Block up 256 BN, Re LU, 3 3 conv, 3 Tanh . Tβ1: x [ 1, 1]32 32 3 Res Block down 128 Res Block down 128 Res Block down 128 Res Block 128 Res Block 128. Tβ2: x R128 8 8 Re LU Global sum pooling(128) 1(Spectral normalization). Tβ(x) = Tβ2(Tβ1(x)). We use the following bi-optimization problem to train our neural networks: min β1,β2 (Ex µ[min(0, 1 + Tβ(x))] + Ez ϵ[min(0, 1 Tβ(Gϕ(z)))]) , min ϕ EX µ m,Z ϵ m[S( Tβ1,β2 PX, Tβ1,β2 Gϕ PZ)], where the function Tβ1,β2 = [Tβ1(x), Tβ2(Tβ1(x))] which is the concatenation vector of Tβ1(x) and Tβ2(Tβ1(x)), S is an estimator of the sliced Wasserstein distance. The number of training iterations is set to 100000 on CIFAR10 and 50000 in Celeb A. We update the generator Gϕ every 5 iterations and we update the feature function Tβ every iteration. The mini-batch size m is set to 128 in all datasets. We use the Adam (Kingma & Ba, 2014) optimizer with parameters (β1, β2) = (0, 0.9) for both Gϕ and Tβ with the learning rate 0.0002. We use 50000 random samples from estimated generative models Gϕ for computing the FID scores and the Inception scores. In evaluating FID scores, we use all training samples for computing statistics of datasets. Results. In addition to the result in the main text, we provide generated images from the conventional estimator (SW) and the control variate estimator (LCV-SW and UCV-SW) with the number of projections L = 1000 in Figure 7. Overall, we see that by increasing the number of projections, the generated images are visually improved for all estimators. This result is consistent with the FID scores and the IS scores in Table 2. G COMPUTATIONAL INFRASTRUCTURE For comparing empirical probability measures over images and point-cloud application, and the point-cloud gradient flows application, we use a Macbook Pro M1 for conducting experiments. For deep generative modeling, experiments are run on a single NVIDIA V100 GPU.