Distributional Convergence of the Sliced Wasserstein Process

Jiaqi Xi¹ and Jonathan Niles-Weed¹,²

¹Courant Institute of Mathematical Sciences, New York University, NY 10012
²Center for Data Science, New York University, NY 10011

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Motivated by the statistical and computational challenges of computing Wasserstein distances in high-dimensional contexts, machine learning researchers have defined modified Wasserstein distances based on computing distances between one-dimensional projections of the measures. Different choices of how to aggregate these projected distances (averaging, random sampling, maximizing) give rise to different distances, requiring different statistical analyses. We define the Sliced Wasserstein Process, a stochastic process defined by the empirical Wasserstein distance between projections of empirical probability measures to all one-dimensional subspaces, and prove a uniform distributional limit theorem for this process. As a result, we obtain a unified framework in which to prove sample complexity and distributional limit results for all Wasserstein distances based on one-dimensional projections. We illustrate these results on a number of examples where no distributional limits were previously known.

1 Introduction

The Wasserstein distances have become useful tools in machine learning and data science, with applications in transfer learning [6, 33], generative modeling [4, 19], statistics [8, 20], and various scientific domains [36, 44]. Despite the popularity of these distances, they suffer from serious drawbacks in high dimensions. From a statistical standpoint, estimating the Wasserstein distances from data suffers from the curse of dimensionality, with convergence rates degrading sharply as the dimension increases [16, 29, 37, 43]. From a computational standpoint, despite recent algorithmic advances [1, 7], the best algorithms for approximately computing general Wasserstein distances between distributions supported on $n$ points in $\mathbb{R}^d$ for $d \geq 2$ have running times scaling quadratically in $n$, which is prohibitive on very large data sets.

These deficiencies have motivated the development of modifications of the Wasserstein distances which reduce the high-dimensional case to a series of one-dimensional problems. Given two compactly supported probability distributions $P$ and $Q$ in $\mathbb{R}^d$, we write $P_u$ and $Q_u$ for the projections of $P$ and $Q$ onto the one-dimensional subspace spanned by $u$, for any $u \in S^{d-1}$. Explicitly, if $X \sim P$, we let $P_u$ denote the law of $u^\top X$. The measures $P_u$ and $Q_u$ are probability distributions on $\mathbb{R}$ obtained by collapsing $P$ and $Q$ to the one-dimensional slice in the direction of $u$. Crucially, no matter how large $d$ is, the Wasserstein distance $W_p^p(P_u, Q_u)$ between the one-dimensional measures is always easy to work with: it can be estimated from data at a rate that is independent of the dimension, and if $P_u$ and $Q_u$ are supported on $n$ points, then $W_p^p(P_u, Q_u)$ can be computed in nearly linear time by a simple sorting procedure.

This observation has given rise to a number of different proposals for defining a distance between $P$ and $Q$ by aggregating the one-dimensional distances, the most prominent of which is the sliced Wasserstein distance [3, 32]:
$$SW_p^p(P, Q) := \int_{S^{d-1}} W_p^p(P_u, Q_u) \, d\sigma(u)\,, \qquad (1)$$
where $\sigma$ denotes the uniform measure on $S^{d-1}$.
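To make the sorting-based computation concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a Monte Carlo estimate of $SW_p^p$ from two equal-size samples: project onto random directions, sort the projected points, and average the $p$-th powers of the gaps. The function name and the number of projections are illustrative choices.

```python
import numpy as np

def sliced_wasserstein_pp(X, Y, p=2, n_proj=500, rng=None):
    """Monte Carlo estimate of SW_p^p between two empirical measures.

    X, Y: (n, d) arrays of samples with the same number of points n.
    Each direction u is drawn uniformly on the unit sphere; the
    one-dimensional W_p^p between the projected samples is computed by
    sorting, since the optimal 1D coupling matches order statistics.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    assert Y.shape == (n, d), "this sketch assumes equal sample sizes"
    # random directions, uniform on S^{d-1}
    U = rng.normal(size=(n_proj, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # project, sort along the sample axis, and average over points and directions
    XU = np.sort(X @ U.T, axis=0)   # shape (n, n_proj)
    YU = np.sort(Y @ U.T, axis=0)
    return np.mean(np.abs(XU - YU) ** p)
```

Replacing the final average over directions with a maximum or another aggregation yields the variants discussed next.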
Other options include:

Discrete sliced Wasserstein distance: $\widehat{SW}_p^p(P, Q) := \frac{1}{L}\sum_{i=1}^{L} W_p^p(P_{u_i}, Q_{u_i})$, where $u_1, \dots, u_L \in S^{d-1}$ are randomly sampled directions [3].

Max-sliced Wasserstein distance: $MSW_p^p(P, Q) := \max_{u \in S^{d-1}} W_p^p(P_u, Q_u)$ [15, 29, 31].

Distributional sliced Wasserstein distance: $DSW_p^p(P, Q) := \sup_{\tau \in \mathcal{P}_C} \int_{S^{d-1}} W_p^p(P_u, Q_u)\, d\tau(u)$, where $\mathcal{P}_C$ is a subset of probability measures on $S^{d-1}$ [28].

Though the details of these techniques differ, they can be put on a common footing: if we view the function $W : u \mapsto W_p^p(P_u, Q_u)$ as a bounded function on $S^{d-1}$, then each of these sliced distances takes the form $F(W)$ for some functional $F : \ell^\infty(S^{d-1}) \to \mathbb{R}$.

Since these distances are all based on one-dimensional projections, it is natural to conjecture that they enjoy improved statistical performance. This conjecture has been verified in certain special cases [25, 27, 29]. However, the analysis of these distances has largely been conducted separately, with different arguments tailored to each distance. This raises the following fundamental question: is there a unified approach to the analysis of these distances, which provides statistical guarantees for all of them simultaneously?

In this work, we develop such a unified approach. In addition to generalizing prior work, our techniques allow us to prove new distributional convergence results for the sliced Wasserstein distance and its many variants. These results make it possible to construct asymptotically valid confidence intervals for variants of the sliced Wasserstein distance and to guarantee the validity of the bootstrap. Prior to our work, such results were only known for the standard sliced Wasserstein distance (1) [25] or for the sliced and max-sliced Wasserstein distances between discrete distributions [30].¹

Obtaining distributional limits for empirical Wasserstein distances is an active area of research. In the one-dimensional case, fundamental contributions were made by [10, 11, 12], and further progress has been made in the case where one or both of the measures are discrete [14, 38, 40]. Multi-dimensional limits were recently obtained by [9, 13], but these are not centered at the population-level quantities, making them of limited utility for inference. However, when the distributions are very smooth, there exist estimators with distributional limits with good centering [26]. In this work, we draw on techniques recently proposed in [22] to obtain central limit theorems by exploiting duality.

We consider compactly supported probability measures $P$ and $Q$ in $\mathbb{R}^d$ with connected supports, and the Wasserstein distances $W_p^p$ for $p > 1$. To analyze the empirical behavior of the sliced Wasserstein distance and its variants, we define a stochastic process
$$G_n(u) := \sqrt{n}\,\big(W_p^p(P_{n,u}, Q_{n,u}) - W_p^p(P_u, Q_u)\big)\,, \quad u \in S^{d-1}\,, \qquad (2)$$
where $P_n$ and $Q_n$ are empirical measures consisting of $n$ i.i.d. samples from $P$ and $Q$, and $P_{n,u}$, $Q_{n,u}$ denote their projections in the direction of $u$. We may view $G_n$ as a random element of $\ell^\infty(S^{d-1})$, which records the deviation of the Wasserstein distance from its population counterpart along every direction simultaneously. We call $G_n$ the Sliced Wasserstein Process. Our main result shows that
$$G_n \rightsquigarrow G \quad \text{in } \ell^\infty(S^{d-1})\,, \qquad (3)$$
where $G$ is a tight Gaussian process on $S^{d-1}$. That is, the collection of random variables $\sqrt{n}\,(W_p^p(P_{n,u}, Q_{n,u}) - W_p^p(P_u, Q_u))$ indexed by elements of $S^{d-1}$ enjoys a uniform central limit theorem.
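As a concrete (hypothetical) illustration of the object in (2), the sketch below evaluates the process on a finite grid of directions, assuming the population values $W_p^p(P_u, Q_u)$ are available, for instance in closed form as in the simulation examples of Section 4; the helper `w_pp_1d` is the sorting-based one-dimensional distance from the earlier sketch, and all names are illustrative.

```python
import numpy as np

def w_pp_1d(x, y, p=2):
    """One-dimensional W_p^p between two equal-size empirical measures."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

def sliced_wasserstein_process(X, Y, directions, pop_w_pp, p=2):
    """Evaluate the sliced Wasserstein process G_n on a grid of directions.

    X, Y: (n, d) samples from P and Q; directions: (k, d) unit vectors;
    pop_w_pp: callable u -> W_p^p(P_u, Q_u), the population value
    (assumed known or pre-computed for this illustration).
    """
    n = X.shape[0]
    G_n = np.empty(len(directions))
    for i, u in enumerate(directions):
        emp = w_pp_1d(X @ u, Y @ u, p)        # empirical W_p^p along u
        G_n[i] = np.sqrt(n) * (emp - pop_w_pp(u))
    return G_n
```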
As is well known in the statistics literature [41], uniform central limit theorems of this type give rise to distributional limits for any sufficiently regular functional on $\ell^\infty(S^{d-1})$ via the functional delta method; in particular, we directly obtain distributional limit theorems for the sliced Wasserstein distance and its many variants as a special case. Our results likewise give techniques for proving the consistency of the bootstrap for any of the mentioned functionals, as a consequence of general results for uniform central limit theorems.

¹Concurrently and independently of our work, [21] proved distributional limits for the sliced and max-sliced Wasserstein distances, but not for other variants, as a byproduct of general results on distributional limits for Wasserstein distances.

2 Main Result

Throughout, $P$ and $Q$ denote two probability distributions in $\mathbb{R}^d$ with compact supports, contained in a closed ball $B(0, R)$ around the origin. We fix $p > 1$ and consider the Wasserstein distance of order $p$:
$$W_p^p(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int \|x - y\|^p \, d\pi(x, y)\,, \qquad (4)$$
where the infimum is taken over all couplings between $P$ and $Q$. It is well known (see [42]) that this problem possesses a dual formulation:
$$W_p^p(P, Q) = \sup_{f} \int f \, dP + \int f^c \, dQ\,, \qquad (5)$$
where $f^c$ denotes the $c$-transform: $f^c(y) = \inf_{x \in B(0,R)} \|x - y\|^p - f(x)$. It can be shown (e.g., [24, Lemmas 1 and 5]) that the supremum in this dual formulation is achieved, and that without loss of generality we may assume that $f$ satisfies $f(0) = 0$ and $\|f\|_{\mathrm{Lip}} \leq pR^{p-1}$. We denote the class of such functions by $\mathcal{C}$, and call any maximizer a Kantorovich potential.

In order to obtain Gaussian limits, we adopt the following assumption:

(CC) For all $u \in S^{d-1}$, the support of $P_u$ or $Q_u$ is an interval.

For $p > 1$, assumption (CC) guarantees that the supremum in (5) is achieved at a unique Kantorovich potential in $\mathcal{C}$ [35, Proposition 7.18]. In the absence of this uniqueness, Gaussian limits fail to hold for the optimal transport problem, even for discrete measures [38]. We now state our main result.

Theorem 2.1. Suppose that $P$ and $Q$ are two probability distributions in $\mathbb{R}^d$ whose supports are contained in the closed ball $B(0, R)$ for some $R > 0$. Assume that $P$ and $Q$ satisfy (CC). Let $P_n$ and $Q_m$ denote empirical measures consisting of $n$ and $m$ i.i.d. samples from $P$ and $Q$, respectively. If $n/(n + m) \to \lambda \in (0, 1)$ as $n, m \to \infty$, then
$$\sqrt{\frac{nm}{n + m}}\,\Big(W_p^p(P_{n,\cdot}, Q_{m,\cdot}) - W_p^p(P_\cdot, Q_\cdot)\Big) \rightsquigarrow G \quad \text{in } \ell^\infty(S^{d-1})\,, \qquad (6)$$
where $G$ is a tight zero-mean Gaussian process on $S^{d-1}$ with covariance function
$$\begin{aligned}
\mathbb{E}\,G(u)G(v) ={}& (1 - \lambda) \int f_u(u^\top x) f_v(v^\top x)\, dP(x) + \lambda \int f_u^c(u^\top y) f_v^c(v^\top y)\, dQ(y) \\
& - (1 - \lambda) \int f_u(u^\top x)\, dP(x) \int f_v(v^\top x)\, dP(x) - \lambda \int f_u^c(u^\top y)\, dQ(y) \int f_v^c(v^\top y)\, dQ(y)\,, \qquad (7)
\end{aligned}$$
where $f_u, f_v \in \mathcal{C}$ are the unique Kantorovich potentials for $(P_u, Q_u)$ and $(P_v, Q_v)$, respectively.

Theorem 2.1 formally includes the one-sample case as well, by taking $\lambda \in \{0, 1\}$.

Remark 2.2. The assumption of compact support guarantees that the set of Kantorovich potentials corresponding to $P_u$ and $Q_u$ for any $u \in S^{d-1}$ is uniformly Lipschitz, and is therefore a subset of a Donsker class. If the supports of $P$ and $Q$ were unbounded, then in order to deduce the Donsker property we would need additional assumptions on the one-dimensional projections of $P$ and $Q$ as well as on the cost function (see, e.g., [22, Theorem 5.2]), and these do not hold for $p$-Wasserstein distances ($p > 1$) in general.

Remark 2.3. In the proof, assumption (CC) is only used to guarantee that for each $u \in S^{d-1}$ there exists a unique Kantorovich potential achieving the supremum in the dual formulation of $W_p^p(P_u, Q_u)$.
It is therefore possible to replace (CC) by any weaker assumption known to guarantee uniqueness [39, 45], but we adopt (CC) because it is the simplest such assumption we are aware of. In particular, Theorem 2.1 holds for $p = 1$ under the additional assumption that, for each $u \in S^{d-1}$, the Kantorovich potential for $W_1(P_u, Q_u)$ is unique.

As alluded to above, Theorem 2.1 gives rise to a wealth of statistical theorems as easy corollaries. To describe these implications, we return to the abstract setting described above: denote by $W : S^{d-1} \to \mathbb{R}$ the function $W(u) = W_p^p(P_u, Q_u)$, and consider any functional $F : \ell^\infty(S^{d-1}) \to \mathbb{R}$. Then the distances we consider take the form $F(W)$. By different choices of $F$, we obtain the sliced Wasserstein distance, the max-sliced Wasserstein distance, and the other variants described above. Theorem 2.1 will allow us to compare $F(W)$ to its empirical counterpart $F(W_{nm})$, where $W_{nm}(u) = W_p^p(P_{n,u}, Q_{m,u})$.

We recall the definition of directional Hadamard differentiability [34]: we say that $F$ is directionally Hadamard differentiable at $\Phi$ if for all sequences $h_n \downarrow 0$ and $\Psi_n \to \Psi$ in $\ell^\infty(S^{d-1})$, the limit
$$\lim_{n \to \infty} \frac{F(\Phi + h_n \Psi_n) - F(\Phi)}{h_n} =: F'_\Phi(\Psi)$$
exists. We verify in Section 3 the directional Hadamard differentiability of several examples. Under this assumption, we have the following.

Corollary 2.4. Assume $F$ is directionally Hadamard differentiable. Under the assumptions of Theorem 2.1,
$$\sqrt{\frac{nm}{n + m}}\,\big(F(W_{nm}) - F(W)\big) \rightsquigarrow F'_W(G)\,.$$

Proof. See [34].

We also obtain a consistency result for the bootstrap, which we state for simplicity in the case $n = m$.

Corollary 2.5. Assume that $F$ is directionally Hadamard differentiable, and adopt the assumptions of Theorem 2.1. Let $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ and $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}$, and for $k \leq n$ denote by $P^*$ and $Q^*$ bootstrap empirical measures consisting of $k$ i.i.d. draws from $P_n$ and $Q_n$, respectively, and set $W^*(u) = W_p^p(P^*_u, Q^*_u)$. If $k \to \infty$ and $k/n \to 0$, then
$$\sup_{h \in \mathrm{BL}(1)} \Big| \mathbb{E}\big[h\big(\sqrt{k}\,(F(W^*) - F(W_n))\big) \,\big|\, X_1, \dots, X_n, Y_1, \dots, Y_n\big] - \mathbb{E}\big[h\big(\sqrt{n}\,(F(W_n) - F(W))\big)\big] \Big| \xrightarrow{p} 0\,,$$
where $\mathrm{BL}(1)$ is the set of functions with bounded Lipschitz norm at most 1 and $W_n(u) = W_p^p(P_{n,u}, Q_{n,u})$.

Proof. See [17].

Finally, when the functional $F : \ell^\infty(S^{d-1}) \to \mathbb{R}$ has a linear Hadamard derivative, the resulting statistic will again be asymptotically Gaussian. The following uniform convergence result shows that we can consistently estimate the covariance function of $G$ from data, which can be used to obtain asymptotic confidence intervals in this setting.

Theorem 2.6. Under the same assumptions as Theorem 2.1, there exists an estimator $\{\hat{\Sigma}_{u,v}\}_{u,v \in S^{d-1}}$ of the covariance function $\{\Sigma_{u,v}\}_{u,v \in S^{d-1}}$ of the limiting process $G$ in the sense that
$$\mathbb{E}_{P,Q} \sup_{u,v \in S^{d-1}} \big|\hat{\Sigma}_{u,v} - \Sigma_{u,v}\big| \to 0 \quad \text{as } n \to \infty\,. \qquad (8)$$

Remark 2.7. Theorem 2.6 can be used to obtain asymptotic confidence intervals via Slutsky's theorem. For instance, if the functional $F$ has a linear derivative $F'_W$ of the form $F'_W(\Phi) = \int \Phi(u)\, d\tau(u)$ for a Borel measure $\tau$, then Theorem 2.6 implies that $\hat{\sigma}^2 := \int\!\!\int \hat{\Sigma}_{u,v}\, d\tau(u)\, d\tau(v)$ converges in probability to $\mathrm{var}(F'_W(G))$, and therefore that $F(W_{nm}) \pm z_{\delta/2}\,\hat{\sigma}\,\sqrt{\frac{n+m}{nm}}$ is an asymptotic $(1 - \delta)$ confidence interval for $F(W)$.
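As a concrete illustration of the resampling scheme in Corollary 2.5, the following sketch generates the bootstrap replications $\sqrt{k}\,(F(W^*) - F(W_n))$ for a generic functional $F$, with a finite grid of directions standing in for $S^{d-1}$. It is illustrative only, not the paper's code; all names and defaults are assumptions.

```python
import numpy as np

def bootstrap_distribution(X, Y, directions, F, k, n_reps=500, p=2, rng=None):
    """Replicate sqrt(k) * (F(W*) - F(W_n)) in the spirit of Corollary 2.5.

    X, Y: (n, d) samples; directions: (m, d) grid of unit vectors;
    F: functional mapping the vector of per-direction W_p^p values to a scalar
       (e.g., np.mean for the sliced distance, np.max for the max-sliced one);
    k: bootstrap sample size (the theory requires k -> infinity, k/n -> 0).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    def w_vec(A, B):
        # per-direction one-dimensional W_p^p, computed by sorting
        return np.array([np.mean(np.abs(np.sort(A @ u) - np.sort(B @ u)) ** p)
                         for u in directions])

    F_n = F(w_vec(X, Y))
    reps = np.empty(n_reps)
    for b in range(n_reps):
        Xs = X[rng.integers(0, n, size=k)]   # k i.i.d. draws from P_n
        Ys = Y[rng.integers(0, n, size=k)]   # k i.i.d. draws from Q_n
        reps[b] = np.sqrt(k) * (F(w_vec(Xs, Ys)) - F_n)
    return reps
```

The empirical distribution of `reps` then serves as a plug-in approximation to the law of $\sqrt{n}\,(F(W_n) - F(W))$, which is how the bootstrap densities in Section 4 are produced.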
3 Applications

In this section, we focus on three of the variants discussed above, namely the sliced, max-sliced, and distributional sliced Wasserstein distances, and show how the main results obtained in the previous section can be used to obtain accurate asymptotic inference for these quantities. For notational simplicity, we focus on the case where $n = m$, and rescale the resulting Gaussian process by a factor of $\sqrt{2}$.

3.1 Sliced Wasserstein Distance

Asymptotic and finite-sample inference for the sliced Wasserstein (SW) distance has already been thoroughly studied by [25]. We show that we can recover some of their results from our techniques. Their main focus was on a robustification of the SW distance, the trimmed SW distance, defined as
$$SW_{p,\delta}^p(P, Q) := \int_{S^{d-1}} \int_{\delta}^{1-\delta} \big|F_u^{-1}(t) - G_u^{-1}(t)\big|^p \, dt \, d\sigma(u)\,,$$
where $\sigma$ denotes the uniform probability measure on $S^{d-1}$ and $F_u^{-1}$, $G_u^{-1}$ are the (pseudo-)inverses of the CDFs of $P_u$ and $Q_u$, respectively. When $\delta = 0$, $SW_{p,\delta}$ reduces to the original sliced Wasserstein distance $SW_p$. The trimmed SW distance $SW_{p,\delta}$ discards the mass of $P_\theta$ and $Q_\theta$ above the $1 - \delta$ quantile and below the $\delta$ quantile (hence the term "trimmed"), and is therefore more robust in the face of outliers. This robustification is necessary when $P$ and $Q$ are no longer assumed to have compact supports. [25] derive Gaussian limits and bootstrap consistency for this functional.

To see how their results for the standard sliced Wasserstein distance (i.e., $\delta = 0$) can be derived, under our stricter assumptions, from Theorem 2.1, we denote by $F : \ell^\infty(S^{d-1}) \to \mathbb{R}$ the integration functional:
$$F(\Phi) = \int \Phi(u)\, d\sigma(u)\,.$$
The dominated convergence theorem immediately implies that $F$ is Hadamard differentiable, with derivative
$$F'_\Phi(\Psi) = \int \Psi(u)\, d\sigma(u)\,.$$
We obtain the following.

Theorem 3.1. Suppose that two compactly supported probability distributions $P$ and $Q$ in $\mathbb{R}^d$ satisfy (CC). Then
$$\sqrt{n}\,\big(SW_p^p(P_n, Q_n) - SW_p^p(P, Q)\big) \xrightarrow{d} S := \int_{S^{d-1}} G(\theta)\, d\sigma(\theta)\,. \qquad (9)$$
The random variable $S$ is Gaussian, and by integrating (7) it can be shown that its limiting variance agrees with the expression in [25].
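For concreteness, the trimmed distance above can be estimated from samples by substituting empirical quantiles for $F_u^{-1}$ and $G_u^{-1}$; the following sketch is illustrative only and is not the estimator analyzed in [25]. The quantile grid size and number of projections are arbitrary choices.

```python
import numpy as np

def trimmed_sw_pp(X, Y, p=2, delta=0.05, n_proj=500, n_grid=200, rng=None):
    """Monte Carlo estimate of the trimmed sliced distance SW_{p,delta}^p.

    The inner integral over t in (delta, 1 - delta) is approximated on a
    uniform grid, with empirical quantiles standing in for F_u^{-1}, G_u^{-1}.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    t = np.linspace(delta, 1.0 - delta, n_grid)      # trimmed quantile levels
    U = rng.normal(size=(n_proj, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    vals = np.empty(n_proj)
    for i, u in enumerate(U):
        qx = np.quantile(X @ u, t)                   # empirical F_u^{-1}(t)
        qy = np.quantile(Y @ u, t)                   # empirical G_u^{-1}(t)
        # approximates the integral of |F_u^{-1} - G_u^{-1}|^p over (delta, 1-delta)
        vals[i] = (1.0 - 2.0 * delta) * np.mean(np.abs(qx - qy) ** p)
    return np.mean(vals)
```

Setting `delta=0` recovers (a Monte Carlo approximation of) the untrimmed sliced distance considered in Theorem 3.1.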
3.2 Max-sliced Wasserstein Distance

Unlike integration, the supremum functional is not smooth and does not possess a linear Hadamard derivative. Write $MSW_p^p(P, Q) = \sup_{u \in S^{d-1}} W_p^p(P_u, Q_u)$, and note that $MSW_p^p(P, Q) = \omega(W)$, where $\omega : \ell^\infty(S^{d-1}) \to \mathbb{R}$ is the supremum functional. It is shown in Theorem 2.1 of [5] that $\omega$ is Hadamard directionally differentiable with derivative
$$\omega'_f(g) = \lim_{\epsilon \downarrow 0} \sup_{x \in A_\epsilon(f)} g(x)\,, \quad \text{where } A_\epsilon(f) := \{x : f(x) \geq \sup f - \epsilon\}\,.$$
Moreover, if $f$ and $g$ are continuous on $S^{d-1}$ with respect to the standard Euclidean distance, then
$$\lim_{\epsilon \downarrow 0} \sup_{x \in A_\epsilon(f)} g(x) = \sup_{x \in A_0(f)} g(x)\,, \quad \text{where } A_0(f) = \{x : f(x) = \sup f\}$$
(see Corollary 2.3 of [5]). Applying the functional delta method to $\omega$ and the uniform weak convergence (6), and noting that the limiting process $G$ has continuous sample paths almost surely, the limiting distribution of the empirical max-sliced distance can be written as
$$\omega'_W(G) = \sup_{u \in A_0(W)} G(u)\,. \qquad (10)$$

In the spiked transport model (STM), this expression can be further simplified. The STM was introduced by [29] to formalize the situation where two distributions differ only in a low-dimensional subspace of $\mathbb{R}^d$. We describe the special case of a one-dimensional spike here. Fix some $v \in S^{d-1}$ and let $X, Y \in L := \mathrm{span}(v)$ be two random variables with different laws. Let $Z$ be another random variable, independent of $(X, Y)$ and supported on the orthogonal complement $L^\perp$ of $L$. Then we define two distributions in $\mathbb{R}^d$ by $P := \mathrm{law}(X + Z)$ and $Q := \mathrm{law}(Y + Z)$. It is shown there that $MSW_p(P, Q) = W_p(\mathrm{law}(X), \mathrm{law}(Y)) = W_p(P, Q)$. In addition, in this model the set $A_\epsilon(W)$ shrinks to the singleton $\{v\}$ as $\epsilon$ goes down to 0. In fact, the Hadamard derivative reduces to the random variable $G(v)$, i.e., the marginal of the limiting Gaussian process $G$ along $v$. We summarize the result in the following theorem.

Theorem 3.2. Suppose that two compactly supported probability distributions $P$ and $Q$ in $\mathbb{R}^d$ fit the spiked transport model with spike $v \in S^{d-1}$, and that $P$ and $Q$ also satisfy (CC). Then
$$\sqrt{n}\,\big(MSW_p^p(P_n, Q_n) - MSW_p^p(P, Q)\big) \xrightarrow{d} G(v)\,. \qquad (11)$$

Remark 3.3. Note that it is not necessary for $P$ and $Q$ to satisfy the spiked transport model in order to deduce a CLT for the max-sliced Wasserstein distance. Even if the set $\{u \in S^{d-1} : W_p^p(P_u, Q_u) = MSW_p^p(P, Q)\}$ is not a singleton, the same proof still works, but the limiting distribution is the supremum of the Gaussian process $G$ over this set, which is not necessarily Gaussian. This example shows that certain functionals give rise to non-Gaussian limits, even though the limit in (6) is Gaussian. We give an example of this behavior in the supplementary material.

3.3 Distributional Sliced Wasserstein Distance

Proposed in [28], the distributional sliced Wasserstein distance is a generalization of the sliced Wasserstein distance. Formally, given two probability measures $P$ and $Q$ on $\mathbb{R}^d$ with finite $p$-th moments, where $p > 1$, and a subset $\mathcal{P}_C$ of probability distributions on $S^{d-1}$ such that $\mathbb{E}_{\theta, \theta' \sim \tau}|\theta^\top \theta'| \leq C$ for all $\tau \in \mathcal{P}_C$, for some constant $C > 0$, the distributional sliced Wasserstein distance of order $p$ between $P$ and $Q$ is defined by
$$DSW_p(P, Q; C) := \Big(\sup_{\tau \in \mathcal{P}_C} \int_{S^{d-1}} W_p^p(P_\theta, Q_\theta)\, d\tau(\theta)\Big)^{1/p}\,.$$
We may therefore write $DSW_p^p(P, Q; C) = \omega_C(F_C(W))$, where $\omega_C : \ell^\infty(\mathcal{P}_C) \to \mathbb{R}$ is the supremum functional and $F_C : \ell^\infty(S^{d-1}) \to \ell^\infty(\mathcal{P}_C)$ is defined by $F_C(\Phi)(\tau) := \int_{S^{d-1}} \Phi(u)\, d\tau(u)$. The function $F_C$ is trivially Hadamard differentiable, following the same argument as for the standard sliced Wasserstein distance given above. The supremum functional on $\ell^\infty(\mathcal{P}_C)$ is also Hadamard directionally differentiable by Theorem 2.1 of [5]. Since the composition of Hadamard directionally differentiable functions is still Hadamard directionally differentiable (see, e.g., [2, Proposition 2.47]), under assumption (CC) we may conclude that $\omega_C \circ F_C$ is Hadamard directionally differentiable, and hence
$$\sqrt{n}\,\big(DSW_p^p(P_n, Q_n; C) - DSW_p^p(P, Q; C)\big) \xrightarrow{d} \lim_{\epsilon \downarrow 0} \sup_{\tau \in A_\epsilon(F_C(W))} \int_{S^{d-1}} G(\theta)\, d\tau(\theta)\,.$$

4 Simulation Studies

We illustrate our distributional limit results in Monte Carlo simulations. Specifically, we investigate the speed of convergence of the sliced Wasserstein distance and the max-sliced Wasserstein distance. We also investigate the convergence speed of the amplitude, which provides an example of a functional not covered in prior work. Finally, we illustrate the accuracy of the approximation given by the re-scaled bootstrap. All simulations were performed in Python. The Wasserstein distances as well as the sliced Wasserstein distances were calculated using the Python package POT [18], and the max-sliced Wasserstein distances were approximated by the Riemannian optimization method proposed in [23].
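As a crude stand-in for the Riemannian optimization of [23] (and not the method used in our experiments), the maximizing direction of the max-sliced distance can also be approximated by random search over the sphere; a minimal sketch, with the candidate count chosen purely for illustration:

```python
import numpy as np

def max_sliced_wpp_random_search(X, Y, p=2, n_candidates=5000, rng=None):
    """Crude approximation of MSW_p^p by random search over directions.

    Draws candidate directions uniformly on the sphere and returns the largest
    per-direction W_p^p (computed by sorting, assuming equal sample sizes),
    together with the approximately maximizing direction.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    U = rng.normal(size=(n_candidates, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    vals = np.mean(
        np.abs(np.sort(X @ U.T, axis=0) - np.sort(Y @ U.T, axis=0)) ** p,
        axis=0,
    )
    best = np.argmax(vals)
    return vals[best], U[best]
```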
4.1 Sliced Wasserstein Distance

We present an example that concerns two different distributions with connected projections along all directions. Consider a simple model of transport where the source and target distributions $P$ and $Q$ are uniform on the unit sphere $S^2$ and on the unit sphere $S^2 + (1, 1, 1)$ centered at $(1, 1, 1)$, respectively. We first give an explicit representation of the theoretical limit of the example given in Section 3.1. Fix any point $\theta \in S^2$; the projections of $P$ and $Q$ along $\theta$ are uniform over $(-1, 1)$ and $(-1 + a_\theta, 1 + a_\theta)$, respectively, where $a_\theta := \theta_1 + \theta_2 + \theta_3$. Then the unique Kantorovich potential that achieves the 2-Wasserstein distance between $P_\theta$ and $Q_\theta$ is $\varphi_\theta^0(x) = 2 a_\theta x$. Hence, we have
$$\sqrt{n}\,\big(W_2^2(P_{n,\cdot}, Q_{n,\cdot}) - W_2^2(P_\cdot, Q_\cdot)\big) \rightsquigarrow G\,,$$
where $G$ is the mean-zero Gaussian process indexed by $S^2$ with covariance function $\mathbb{E}\,G(u)G(v) = \frac{8}{3}\, a_u a_v \langle u, v\rangle$. It follows from Theorem 3.1 that the limiting distribution of the empirical sliced 2-Wasserstein distance is the centered Gaussian $S$ with variance $\int_{S^2}\int_{S^2} \frac{8}{3}\, a_u a_v \langle u, v\rangle\, d\sigma(u)\, d\sigma(v) \approx 0.832$.

We sample i.i.d. observations $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_n \sim Q$ with sample sizes $n = 50, 100, 500$. This process is repeated 500 times. We then compare the finite-sample distributions of the empirical sliced Wasserstein distance with the theoretical limit given in Section 3.1. We display the results using kernel density estimators in Figure 1, along with the corresponding Q-Q plots. We see that the finite-sample empirical distribution gets closer to the limiting Gaussian distribution in (9) as the sample size $n$ increases.

In addition, we simulate the re-scaled plug-in bootstrap approximations by sampling $n = 1000$ observations from $P$ and $Q$. Fixing the empirical value $SW_2^2(P_n, Q_n)$, we generate $B = 500$ replications of $\sqrt{l}\,\big(SW_2^2(\hat{P}^*_l, \hat{Q}^*_l) - SW_2^2(P_n, Q_n)\big)$. The distributions of the replications with various replacement numbers $l$, compared with the finite-sample empirical distribution and the theoretical limit, are shown in Figure 2. We observe that the naive bootstrap ($l = n$) better approximates the finite-sample distribution than the versions with fewer replacements ($l = n^{1/2}, n^{3/4}$). This is consistent with observations for inference on finite spaces [30].

Figure 1: Top: Comparison of the finite sample density (pale turquoise) and the limit distribution of the empirical sliced distance (pink). Bottom: The corresponding Q-Q plots, where the red solid line indicates a perfect fit.

4.2 Max-sliced Wasserstein Distance

We present an example that simulates the behavior of the max-sliced Wasserstein distance for $p = 2$. We take $P$ to be the uniform distribution on the unit sphere $S^2$ and $Q$ to be uniform on the surface of the ellipsoid $x^2/a^2 + y^2 + z^2 = 1$, where $a = 8.5$. We sample i.i.d. observations of size $n = 50, 100, 500$, and this process is repeated 2000 times.

Figure 2: Bootstrap for the empirical sliced distance. Illustration of the re-scaled plug-in bootstrap approximation ($n = 1000$) with replacement numbers $l \in \{n, n^{3/4}, n^{1/2}\}$. Finite bootstrap densities (pale green) are compared to the corresponding finite sample density (pale turquoise) and the limit distribution (pink).

Figure 3: Comparison of the finite sample density (pale turquoise) and the limit distribution of the empirical max-sliced Wasserstein distance (pink).

The estimation plotted in the top part of Figure 3 indicates that the finite-sample distributions approximate the limiting Gaussian distribution derived in Theorem 3.2 very well, even when the sample size is small. In terms of the re-scaled bootstrap, the accuracy of the bootstrap approximation is good for all replacement numbers in this case, which is again consistent with the observations made when the underlying distributions are supported on finite sets [30]. See Figure 4 for the simulation.

4.3 Amplitude

In this section, we give an example of a new functional, the amplitude, to which our theory applies. For $f \in \ell^\infty(S^{d-1})$, we write $\mathrm{amp}(f) := \sup f - \inf f$. When $Q$ is chosen to be a radially symmetric reference distribution, e.g., uniform on the sphere, the quantity
$$\mathrm{amp}\big(W_2^2(P_\cdot, Q_\cdot)\big) = \sup_{u \in S^{d-1}} W_2^2(P_u, Q_u) - \inf_{u \in S^{d-1}} W_2^2(P_u, Q_u)$$
is a natural measure of the radial homogeneity of $P$: if $\mathrm{amp}(W_2^2(P_\cdot, Q_\cdot))$ is small, then $P$ differs from $Q$ by similar amounts in each direction.
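On a finite grid of directions, the empirical amplitude can be read off directly from the per-direction distances; a minimal illustrative sketch (not the experiment code), again assuming equal sample sizes:

```python
import numpy as np

def amplitude_wpp(X, Y, directions, p=2):
    """amp(W) = max_u W_p^p(P_u, Q_u) - min_u W_p^p(P_u, Q_u) over a direction grid."""
    vals = np.array([np.mean(np.abs(np.sort(X @ u) - np.sort(Y @ u)) ** p)
                     for u in directions])
    return vals.max() - vals.min()
```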
The amplitude functional defined on $\ell^\infty(S^2)$ is Hadamard directionally differentiable [5].

Figure 4: Bootstrap for the empirical max-sliced Wasserstein distance. Illustration of the re-scaled plug-in bootstrap approximation ($n = 1000$) with replacement numbers $l \in \{n, n^{3/4}, n^{1/2}\}$. Finite bootstrap densities (pale green) are compared to the corresponding finite sample density (pale turquoise) and the limit distribution (pink).

Figure 5: Comparison of the finite sample density (pale turquoise) and the limit distribution of the empirical PRW (pink). Left: $P \sim U(\{x^2/4 + 4y^2 + z^2 = 1\})$. Right: $P \sim U(\{x^2/4 + y^2/4 + z^2/16 = 1\})$. In both cases, $Q \sim U(S^2)$.

Consider $P$ uniform over the surface of the ellipsoid $\{x^2/4 + 4y^2 + z^2 = 1\}$ and $Q$ uniform on $S^2$. Applying Corollary 2.4 to $P$, $Q$, and $\mathrm{amp}$, we obtain
$$\sqrt{n}\,\big(\mathrm{amp}(W_2^2(P_{n,\cdot}, Q_{n,\cdot})) - 5/4\big) \xrightarrow{d} G((1, 0, 0))\,.$$
We simulate the density of the amplitude of the empirical Wasserstein distances of the one-dimensional projections. The finite-sample density generated from $n = 600$ samples and the theoretical limit are given in Figure 5. Next, let $P$ be uniform on $\{x^2/4 + y^2/4 + z^2/16 = 1\}$ and keep $Q$ unchanged. Then
$$\sqrt{n}\,\big(\mathrm{amp}(W_2^2(P_{n,\cdot}, Q_{n,\cdot})) - 8/3\big) \xrightarrow{d} G((0, 0, 1)) - G((0, 1, 0))\,.$$
We generate $n = 600$ samples from $P$ and $Q$, and the result is also shown in Figure 5. Both finite-sample densities indeed converge to the theoretical Gaussian limits as the sample size increases.

5 Conclusion

This paper defines the Sliced Wasserstein Process, a stochastic process indexed by elements of the unit sphere $S^{d-1}$ in $\mathbb{R}^d$, and shows that under regularity assumptions on $P$ and $Q$, this process converges to a tight Gaussian process in $\ell^\infty(S^{d-1})$. This convergence result, which can be viewed as a uniform central limit theorem for the empirical Wasserstein distance along all directions simultaneously, immediately implies distributional convergence and bootstrap consistency results for the sliced Wasserstein distance and its variants, thereby unifying and streamlining existing proofs in the literature and providing distributional limits for variants of the sliced Wasserstein distance for which no such results were previously known. An important question left open by our work is whether a similar result holds under weaker assumptions on $P$ and $Q$. We conjecture that the compact support assumption can be lifted, though doing so would likely require relatively stringent tail conditions. Avoiding assumption (CC) is more subtle, as some assumption on the uniqueness of potentials is required to obtain Gaussian limits. Finally, we anticipate that our techniques can also be applied to entropically regularized variants of the Wasserstein distance, where empirical process theory arguments have also been central in proving both sample complexity and distributional limit results. We leave this extension to future work.

References

[1] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30, pages 1961-1971, 2017. URL http://papers.nips.cc/paper/6792-near-linear-time-approximation-algorithms-for-optimal-transport-via-sinkhorn-iteration.

[2] J. F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer, New York, NY, 1st edition, 2000.

[3] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22-45, 2015. doi: 10.1007/s10851-014-0506-3.
[4] O. Bousquet, S. Gelly, I. Tolstikhin, C.-J. Simon-Gabriel, and B. Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

[5] J. Cárcamo, L.-A. Rodríguez, and A. Cuevas. Directional differentiability for supremum-type functionals: Statistical applications. Bernoulli, 2020.

[6] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853-1865, 2017.

[7] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292-2300, 2013. URL http://papers.nips.cc/paper/4927-sinkhorn-distances-lightspeed-computation-of-optimal-transport.

[8] N. Deb, P. Ghosal, and B. Sen. Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections. Advances in Neural Information Processing Systems, 34, 2021.

[9] E. del Barrio and J.-M. Loubes. Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability, 47(2):926-951, 2019.

[10] E. del Barrio, E. Giné, and C. Matrán. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Annals of Probability, pages 1009-1071, 1999.

[11] E. del Barrio, E. Giné, and F. Utzet. Asymptotics for L2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli, 11(1):131-189, 2005.

[12] E. del Barrio, P. Gordaliza, and J.-M. Loubes. A central limit theorem for Lp transportation cost on the real line with application to fairness assessment in machine learning. Information and Inference: A Journal of the IMA, 8(4):817-849, 2019.

[13] E. del Barrio, A. González-Sanz, and J.-M. Loubes. Central limit theorems for general transportation costs. arXiv preprint arXiv:2102.06379, 2021.

[14] E. del Barrio, A. González-Sanz, and J.-M. Loubes. Central limit theorems for semidiscrete Wasserstein distances. arXiv preprint arXiv:2202.06380, 2022.

[15] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10648-10656, 2019.

[16] R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. Annals of Mathematical Statistics, 40:40-50, 1969.

[17] L. Dümbgen. On nondifferentiable functions and the bootstrap. Probability Theory and Related Fields, 95(1):125-140, 1993.

[18] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1-8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

[19] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS 2018), pages 1608-1617, 2018. URL http://proceedings.mlr.press/v84/genevay18a.html.
[20] P. Ghosal and B. Sen. Multivariate ranks and quantiles using optimal transport: Consistency, rates, and nonparametric testing. arXiv preprint arXiv:1905.05340, 2019.

[21] Z. Goldfeld, K. Kato, G. Rioux, and R. Sadhu. Statistical inference with regularized optimal transport. arXiv preprint arXiv:2205.04283, 2022.

[22] S. Hundrieser, M. Klatt, T. Staudt, and A. Munk. A unifying approach to distributional limits for empirical optimal transport. arXiv preprint arXiv:2202.12790, 2022.

[23] T. Lin, C. Fan, N. Ho, M. Cuturi, and M. Jordan. Projection robust Wasserstein distance and Riemannian optimization. Advances in Neural Information Processing Systems, 33:9383-9397, 2020.

[24] T. Manole and J. Niles-Weed. Sharp convergence rates for empirical optimal transport with smooth costs. arXiv preprint arXiv:2106.13181, 2021.

[25] T. Manole, S. Balakrishnan, and L. Wasserman. Minimax confidence intervals for the sliced Wasserstein distance. arXiv preprint, 2019.

[26] T. Manole, S. Balakrishnan, J. Niles-Weed, and L. Wasserman. Plugin estimation of smooth optimal transport maps. arXiv preprint arXiv:2107.12364, 2021.

[27] K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802-20812, 2020.

[28] K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generative modeling. arXiv preprint arXiv:2002.07367, 2020.

[29] J. Niles-Weed and P. Rigollet. Estimation of Wasserstein distances in the spiked transport model. arXiv preprint arXiv:1909.07513, 2019.

[30] R. Okano and M. Imaizumi. Inference for projection-based Wasserstein distances on finite spaces. arXiv preprint arXiv:2202.05495, 2022.

[31] F.-P. Paty and M. Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, pages 5072-5081. PMLR, 2019.

[32] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435-446, 2011.

[33] I. Redko, A. Habrard, and M. Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737-753. Springer, 2017.

[34] W. Römisch. Delta Method, Infinite Dimensional. John Wiley & Sons, Ltd, 2006.

[35] F. Santambrogio. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and their Applications. Birkhäuser/Springer, Cham, 2015. doi: 10.1007/978-3-319-20828-2.

[36] G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, J. Gould, S. Liu, S. Lin, P. Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928-943, 2019.

[37] S. Singh and B. Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.

[38] M. Sommerfeld and A. Munk. Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):219-238, 2018.

[39] T. Staudt, S. Hundrieser, and A. Munk. On the uniqueness of Kantorovich potentials. 2022.
[40] C. Tameling, M. Sommerfeld, and A. Munk. Empirical optimal transport on countable metric spaces: Distributional limits and statistical applications. The Annals of Applied Probability, 29(5):2744-2781, 2019.

[41] A. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.

[42] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. URL https://books.google.com/books?id=hV8o5R7_5tkC.

[43] J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620-2648, 2019. doi: 10.3150/18-BEJ1065.

[44] K. D. Yang, K. Damodaran, S. Venkatachalapathy, A. C. Soylemezoglu, G. Shivashankar, and C. Uhler. Predicting cell lineages using autoencoders and optimal transport. PLoS Computational Biology, 16(4):e1007828, 2020.

[45] Y. Yang, L. Nurbekyan, E. Negrini, R. Martin, and M. Pasha. Optimal transport for parameter identification of chaotic dynamics via invariant measures. arXiv preprint arXiv:2104.15138, 2021.

6 Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See the discussion at the beginning of Section 2.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See the statements of the theorems along with the discussions and remarks before or after the results.
   (b) Did you include complete proofs of all theoretical results? [Yes] See the appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have included the code and instructions in the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 4.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 4 and the related reference.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See the reference.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]