# The Wasserstein Transform

Facundo Mémoli*¹² Zane Smith*³ Zhengchao Wan*¹

*Equal contribution. ¹Department of Mathematics, The Ohio State University, Ohio, USA. ²Department of Computer Science and Engineering, University of Minnesota, Minnesota, USA. ³Department of Computer Science and Engineering, The Ohio State University, Ohio, USA. Correspondence to: Facundo Mémoli, Zane Smith, Zhengchao Wan.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We introduce the Wasserstein transform, a method for enhancing and denoising datasets defined on general metric spaces. The construction draws inspiration from Optimal Transportation ideas. We establish the stability of our method under data perturbation and, when the dataset is assumed to be Euclidean, we also exhibit a precise connection between the Wasserstein transform and the mean shift family of algorithms. We then use this connection to prove that mean shift also inherits stability under perturbations. We study the performance of the Wasserstein transform method on different datasets as a preprocessing step prior to clustering and classification tasks.

1. Introduction

Optimal transport (OT) is concerned with finding cost-efficient ways of deforming a given source probability distribution into a target distribution (Villani, 2003; 2008; Santambrogio, 2015). In recent years, ideas from OT have found applications in machine learning and data analysis in general. Applications range from image equalization (Delon, 2004), shape interpolation (Solomon et al., 2015), image/shape (Solomon et al., 2016; Rubner et al., 1998) and document classification (Kusner et al., 2015; Rolet et al., 2016), and semi-supervised learning (Solomon et al., 2014), to population analysis of Gaussian processes (Mallasto & Feragen, 2017) and domain adaptation (Courty et al., 2017).

In line with previous applications of OT, we represent datasets as probability measures on an ambient metric space. We introduce the so-called Wasserstein transform (WT), which takes this input dataset and alters its interpoint distance information in order to both enhance features, such as clusters, present in the data, and to denoise the data. As a theoretical contribution, we prove the stability of our construction to perturbations in the input data (i.e. changes in the input probability measure). We also interpret our proposed feature-enhancing method as both a generalization and a strengthening of Mean Shift (MS) (Cheng, 1995; Fukunaga & Hostetler, 1975) which can operate on general metric spaces. Although mean shift has been generalized to data living on Riemannian manifolds (Subbarao & Meer, 2009; Shamir et al., 2006), our interpretation departs from the ones in those papers in that we do not attempt to estimate a local mean or median of the data; instead, we use the local density of points to directly and iteratively adjust the distance function on the metric space. We do this without appealing to any intermediate embedding into a Euclidean space. As a further contribution, through this connection between the WT and MS, we are able to prove that MS is stable to data perturbations. We are not aware of any extant results in the literature that address this type of stability for MS methods.
Our experiments show that the Wasserstein transform is effective both in denoising and in resolving the well-known chaining effect that affects linkage-based clustering methods. Furthermore, we compared the performance of our method with mean shift on the MNIST dataset (LeCun et al., 1998) and on Grassmannian manifold data (Cetingul & Vidal, 2009).

2. Optimal Transport Concepts

Given a compact metric space (X, d_X), one of the fundamental concepts of OT (Villani, 2003) is the so-called Wasserstein distance on the set P(X) of all probability measures on X. The ℓ1-Wasserstein distance d_{W,1}(α, β) between probability measures α, β ∈ P(X) is obtained by solving the following linear optimization problem:

$$ d_{W,1}(\alpha, \beta) := \inf_{\mu \in \Pi(\alpha, \beta)} \int_{X \times X} d_X(x, x')\, d\mu(x, x'), $$

where Π(α, β) is the set of all couplings between the probability measures α and β: namely, μ ∈ Π(α, β) is a probability measure on X × X whose marginals are α and β, respectively.

Figure 1. In this illustration α is the empirical probability measure associated to the point cloud X shown in the figure, and d_X is the Euclidean distance. With the truncation kernel, the Wasserstein transform Wε will calculate the dissimilarity (via d_{W,1}) of the ε-neighborhoods (shown as light red disks) corresponding to all pairs of points to produce a new distance d^(ε)_α on X. For example, for the pair of leftmost points, A and B, their respective ε-neighborhoods are not only similar, but the distance between these regions is also small, so d^(ε)_α(A, B) will be small too. Something similar is true for the pair C and D. In contrast, despite the fact that the points B and C are very close to each other, their ε-neighborhoods are structurally different: the neighborhood of B is essentially 2-dimensional whereas that of C is 1-dimensional. This will result in d^(ε)_α(B, C) being large. Similarly, since the ε-neighborhood of E is 0-dimensional and that of G is 1-dimensional, despite these points being very close to each other, d^(ε)_α(E, G) will be large. Finally, d^(ε)_α(E, F) will equal the ground distance between E and F since their respective neighborhoods consist of a single point (cf. Remark 2.1).

Remark 2.1 (Wasserstein distance between Dirac measures). A simple but important remark (Villani, 2003) is that for points x, x' ∈ X, if one considers the Dirac measures δ_x and δ_{x'} supported at those points (which are probability measures), then the Wasserstein distance between these Dirac measures equals the ground distance: d_{W,1}(δ_x, δ_{x'}) = d_X(x, x').

Remark 2.2 (A lower bound on Euclidean spaces). It is known (Rubner et al., 1998) that in Euclidean space R^d, ‖mean(α) − mean(β)‖ ≤ d_{W,1}(α, β) for any α, β ∈ P(R^d). In words, in Euclidean spaces, the Wasserstein distance between two probability measures is bounded below by the Euclidean distance between their respective means, which is compatible with the fact that α and β can certainly have the same means but still be quite different as measures. In Section 3.4, this simple fact will help elucidate a relationship between MS and WT on Euclidean spaces.
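To make the definition above concrete, the following is a minimal illustrative sketch (not code from the paper) of computing d_{W,1} between two discrete measures on a finite metric space by solving the Kantorovich linear program directly with SciPy; the function name and the tiny two-point example are ours.

```python
# Illustrative sketch: l1-Wasserstein distance between discrete measures a, b
# supported on a finite metric space with ground-cost matrix D, solved as the
# Kantorovich linear program (couplings = nonnegative matrices with the
# prescribed marginals).  Not the implementation used in the paper.
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(a, b, D):
    """a, b: probability vectors of lengths n and m; D: n-by-m cost matrix."""
    n, m = D.shape
    c = D.ravel()                                  # objective <D, mu>, mu flattened row-major
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                             # row marginals: sum_j mu_ij = a_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                             # column marginals: sum_i mu_ij = b_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Remark 2.1 in action: for two Dirac measures the value is the ground distance.
D = np.array([[0.0, 3.0], [3.0, 0.0]])
print(wasserstein_1(np.array([1.0, 0.0]), np.array([0.0, 1.0]), D))   # 3.0
```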
3. The Wasserstein Transform

Given a compact metric space (X, d_X), we introduce the subset P_f(X) of P(X) consisting of those probability measures on X with full support: the support supp(α) of a probability measure α is the largest closed subset such that every open neighborhood of a point in supp(α) has positive measure. Given an ambient metric space X = (X, d_X), we interpret a given probability measure α ∈ P_f(X) as the data. For example, given a point cloud X = {x_1, ..., x_n} ⊂ R^d, one could choose α to be the empirical measure (1/n) Σ_{i=1}^n δ_{x_i}. The ambient space distance between data points (in this case the Euclidean distance) is not always directly useful; by absorbing information about the spatial density of data points, the Wasserstein transform introduced below produces a new metric on the data points which can be used in applications to reveal and concentrate interesting features present, but not apparent, in the initial presentation of the data.

The essential idea behind the Wasserstein transform is to first capture local information of the data and then induce a new distance function between pairs of points based on the dissimilarity between their respective neighborhoods. Localization operators are gadgets that capture these neighborhoods.

3.1. Localization Operators

One can always regard a point in a metric space as a Dirac measure supported at that point. More generally, a point in a metric space can be replaced by any reasonable probability measure which includes information about the neighborhood of the point; this leads to the notion of localization operators for probability measures.

Definition 1. Let (X, d_X) be a metric space, referred to as the ambient metric space. A localization operator L is a map from P_f(X) to Markov kernels over X, i.e., given α ∈ P_f(X), L produces L(α) = (X, m^L_α(·)), where for every x ∈ X, m^L_α(x) is a probability measure on X. We refer to m^L_α(x) as the localized measure at x.

The following are two simple extreme examples. (a) Given α ∈ P_f(X), let m^L_α(x) ≡ α for all x ∈ X, which assigns to all points in X the probability measure α. This is a trivial example in that it does not localize the measure α at all. (b) For any α ∈ P_f(X), let m^L_α(x) = δ_x for all x ∈ X. This is a legitimate localization operator, but it does not retain any information from α. We will see some useful choices of localization operators in the next couple of sections.

3.2. The Wasserstein Transform

After specifying a localization operator L and given α ∈ P_f(X), one associates to each point x ∈ X a probability measure m^L_α(x), and then obtains a new metric space by considering the Wasserstein distance between each pair of these localized measures.

Definition 2 (The Wasserstein transform). Let (X, d_X) be a given ambient metric space and let α ∈ P_f(X). Given a localization operator L, the Wasserstein transform W_L applied to α gives the distance function d^L_α on X defined by

$$ d^L_\alpha(x, x') := d_{W,1}\big(m^L_\alpha(x),\, m^L_\alpha(x')\big), \qquad x, x' \in X. $$

By W_L(α) we will denote the (pseudo) metric space (X, d^L_α). Although in this paper we consider only the ℓ1-Wasserstein transform, it is possible to formulate a similar transform using the notion of ℓp-Wasserstein distance.

Figure 2. Iterating the Wasserstein transform: iteratively change the metric based on the density of points as indicated by α.

Remark 3.1 (Iterating the Wasserstein transform). The Wasserstein transform can be iterated any desired number of times with the purpose of successively enhancing features and/or reducing noise. See Figure 2. After applying the Wasserstein transform once to α ∈ P_f(X), the ambient metric space (X, d_X) is transformed into (X, d^L_α). Then we can apply the Wasserstein transform again to α on the ambient space (X, d^L_α), and so on. This fact is useful in applications such as clustering; see Section 5.
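As an illustration of Definition 2 on a finite metric space (a sketch of ours, not the authors' implementation), the localized measure at each point can be represented as a probability vector over the n data points, and the transformed distance matrix computed pairwise; the sketch assumes the POT package's `ot.emd2` for the individual transport problems.

```python
# Sketch of the Wasserstein transform W_L on a finite metric space: `localize`
# maps a point index i to the probability vector of m_alpha^L(x_i) over the n
# points, and the transform returns the matrix of pairwise W_1 distances
# between these localized measures.  Assumes the POT package: pip install pot.
import numpy as np
import ot

def wasserstein_transform(D, localize):
    """D: n-by-n ground distance matrix; localize(i): length-n probability vector."""
    n = D.shape[0]
    m = [localize(i) for i in range(n)]
    D_new = np.zeros_like(D)
    for i in range(n):
        for j in range(i + 1, n):
            D_new[i, j] = D_new[j, i] = ot.emd2(m[i], m[j], D)
    return D_new

# Example (b) of Section 3.1: the Dirac localization m_alpha^L(x) = delta_x
# leaves the metric unchanged, in line with Remark 2.1.
pts = np.random.default_rng(0).random((5, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
dirac = lambda i: np.eye(D.shape[0])[i]
assert np.allclose(wasserstein_transform(D, dirac), D, atol=1e-8)
```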
3.3. Local Truncations

We now concentrate on a particular type of localization operator which we call local truncation. Given α ∈ P_f(X) and a scale parameter ε > 0, consider for each x ∈ X the probability measure

$$ m^{(\varepsilon)}_\alpha(x) := \frac{\alpha|_{B_\varepsilon(x)}}{\alpha(B_\varepsilon(x))}, $$

arising from restricting α to the closed ball B_ε(x) and then renormalizing to obtain a new probability measure. In other words, for each set A ⊆ X, the measure of that set is

$$ m^{(\varepsilon)}_\alpha(x)(A) = \frac{\alpha(B_\varepsilon(x) \cap A)}{\alpha(B_\varepsilon(x))}. $$

When X is finite, X = {x_1, ..., x_n}, and α is its empirical measure, this formula becomes

$$ m^{(\varepsilon)}_\alpha(x)(A) = \frac{\#\{i \mid x_i \in A \text{ and } d_X(x_i, x) \le \varepsilon\}}{\#\{i \mid d_X(x_i, x) \le \varepsilon\}}. $$

We denote the resulting Wasserstein transform by Wε and, in this case, for each α, the new metric produced by Wε(α) will be denoted by d^(ε)_α. See Figure 1 for an intuitive explanation.

Remark 3.2 (Behavior across scales). Notice that as ε → ∞ one has m^(ε)_α(x) = α for any x ∈ X, whereas for ε → 0, m^(ε)_α(x) → δ_x. In words, ε acts as a localization parameter: for small ε the renormalized measures absorb local information, whereas for large values the renormalized measures for different points become indistinguishable. Thus we have the following for any x, x' ∈ X: (1) as ε → 0 one has d^(ε)_α(x, x') → d_X(x, x'); and (2) as ε → ∞ one has d^(ε)_α(x, x') → 0.

Figure 3. After applying one iteration of the Wasserstein transform, both the distance between A, C and the distance between C, E should remain almost the same since these are all critical points of f. According to the formula in Remark 3.3, since f' has negative sign at B and B lies to the right of A, B will be pushed towards A, while D will be pushed away from A since f'(D) > 0 and it lies to the right of A. Similarly, both D and F are pushed towards E.

Interpretation of Wε(α) on the real line. Using the fact that the Wasserstein distance on R admits a closed-form expression (Villani, 2003), we are able to prove the following Taylor expansion.

Remark 3.3 (Taylor expansion for d^(ε)_α(x, x')). When X is a subset of the real line and the probability measure α has a density f, we have the following asymptotic formula for d^(ε)_α(x, x') as ε → 0: for x > x' and f(x), f(x') > 0,

$$ d^{(\varepsilon)}_\alpha(x, x') = x - x' + \frac{1}{3}\left(\frac{f'(x)}{f(x)} - \frac{f'(x')}{f(x')}\right)\varepsilon^2 + O(\varepsilon^3). $$

The interpretation is that after one iteration of the Wasserstein transform Wε of α, pairs of points x and x' in very dense areas (reflected by large values of f(x) and f(x')) will be at roughly the same distance they were before applying the Wasserstein transform. However, if one of the points, say x', is in a sparse area (i.e. f(x') is small), then the Wasserstein transform will push it away from x. It is also interesting what happens when x and x' are both critical points of f: in that case the distance does not change (up to order ε²). See Figure 3 for an illustration. See the supplementary document for a proof of this Taylor expansion.
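For a finite point cloud carrying its empirical measure, the local-truncation operator and the resulting Wε admit a direct implementation; the sketch below (ours, again relying on `ot.emd2` from the POT package) also checks the two limiting regimes of Remark 3.2.

```python
# Sketch of W_eps with local truncations on a finite point cloud: the localized
# measure at x_i is uniform on the points within distance eps of x_i, and
# d_alpha^(eps) is the matrix of pairwise W_1 distances between these measures.
# Illustrative only; assumes the POT package.
import numpy as np
import ot

def local_truncation_wt(D, eps):
    """D: n-by-n distance matrix of the point cloud; returns d_alpha^(eps)."""
    W = (D <= eps).astype(float)              # indicator of the closed eps-balls
    W /= W.sum(axis=1, keepdims=True)         # renormalize: rows are m_alpha^(eps)(x_i)
    n = D.shape[0]
    out = np.zeros_like(D)
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = ot.emd2(W[i], W[j], D)
    return out

# Remark 3.2: eps -> 0 recovers the ground metric, large eps collapses it.
pts = np.random.default_rng(0).random((30, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(np.abs(local_truncation_wt(D, 1e-9) - D).max())   # ~ 0
print(local_truncation_wt(D, 10.0).max())               # ~ 0
```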
3.4. The Wasserstein Transform as a Generalization of Mean Shift to Any Metric Space

Mean Shift (Cheng, 1995; Fukunaga & Hostetler, 1975) is a clustering method for Euclidean data which operates by iteratively updating each data point until convergence, according to a rule that moves points towards the mean/barycenter of their neighbors. More specifically, given a point cloud X = {x_1, ..., x_n} in R^d, a kernel function K : R_+ → R_+, and a scale parameter ε > 0, in the k-th iteration the i-th point is shifted as follows: x_i(0) = x_i and, for k ≥ 0,

$$ x_i(k+1) = \frac{\sum_{j=1}^{n} K\!\left(\frac{\|x_j(k) - x_i(k)\|}{\varepsilon}\right) x_j(k)}{\sum_{j=1}^{n} K\!\left(\frac{\|x_j(k) - x_i(k)\|}{\varepsilon}\right)}. \qquad (1) $$

The kernels of choice are the Gaussian kernel K(t) = e^{−t²/2}, the Epanechnikov kernel K(t) = max{1 − t, 0}, or the truncation kernel K(t), which equals 1 if t ∈ [0, 1] and is zero otherwise.

To see how the Mean Shift method can be embedded in the framework of the Wasserstein transform, let us first introduce a new type of localization operator. We assume that the ambient space X is a convex compact subset of R^d endowed with the Euclidean distance. Given any localization operator L, define a new localization operator L_ms as follows: for α ∈ P_f(R^d) and x ∈ X,

$$ m^{L_{ms}}_\alpha(x) := \delta_{\mathrm{mean}(m^L_\alpha(x))}. $$

In words, at a fixed point x, L_ms applied to a measure α first localizes α via L to obtain m^L_α(x), and then further localizes this measure by only retaining information about its mean. The fact that we can actually compute the mean (or barycenter) of a probability measure (and that this mean remains in X) is enabled by the assumption that the ambient space is a (convex) subset of Euclidean space. Since, by Remark 2.1, the Wasserstein distance between Dirac measures equals the ground distance between their support points, considering the Wasserstein transform W_{L_ms}(α) arising from L_ms we have, for all x, x' ∈ X,

$$ d^{L_{ms}}_\alpha(x, x') = \big\| \mathrm{mean}\big(m^L_\alpha(x)\big) - \mathrm{mean}\big(m^L_\alpha(x')\big) \big\|. $$

The connection with mean shift. Now, given any kernel function K and ε > 0 as in the case of mean shift, one obtains an associated kernel-based localization operator L_{K,ε} such that for any x ∈ X and A ⊆ X,

$$ m^{L_{K,\varepsilon}}_\alpha(x)(A) := \frac{\int_A K\!\left(\frac{d_X(x,y)}{\varepsilon}\right) d\alpha(y)}{\int_X K\!\left(\frac{d_X(x,y)}{\varepsilon}\right) d\alpha(y)}. $$

Now, if for a point cloud X = {x_1, ..., x_n} in R^d we consider α to be the empirical measure induced by X, that is, α = (1/n) Σ_{i=1}^n δ_{x_i}, then, for the localization operator L_{K,ε} defined above, we obtain, for x ∈ X, the following formula, which agrees with the result (1) of applying one iteration of mean shift to the points in X:

$$ \mathrm{mean}\big(m^{L_{K,\varepsilon}}_\alpha(x)\big) = \frac{\sum_{i=1}^{n} K\!\left(\frac{\|x - x_i\|}{\varepsilon}\right) x_i}{\sum_{i=1}^{n} K\!\left(\frac{\|x - x_i\|}{\varepsilon}\right)}. $$

That the metric space W_{L_{K,ε}}(α) contains the same information as the collection of mean shift points above now follows from the classical distance geometry fact that any finite set in R^d can be reconstructed up to rigid transformations from its interpoint distance matrix (Blumenthal, 1953).

Remark 3.4 (The Wasserstein transform as a strengthening of mean shift). Note that in general, because of Remark 2.2, whenever X ⊆ R^d is convex and compact and α ∈ P(R^d), then for all x, x' ∈ X,

$$ \big\| \mathrm{mean}\big(m^{(\varepsilon)}_\alpha(x)\big) - \mathrm{mean}\big(m^{(\varepsilon)}_\alpha(x')\big) \big\| \le d^{(\varepsilon)}_\alpha(x, x'), $$

which indicates that the mean shift procedure provides a lower bound for the result of applying the Wasserstein transform to a dataset represented by α. In other words, in a certain sense the Wasserstein transform retains, via d^(ε)_α, more information about the dataset than mean shift.
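The identity between one mean-shift step with the truncation kernel and the mean of the truncated localized measure, together with the lower bound of Remark 3.4, can be checked numerically; the sketch below is an illustration (with arbitrary data and `ot.emd2` from the POT package), not code from the paper.

```python
# Mean shift with the truncation kernel vs. the local-truncation localization:
# one mean-shift step at x_i equals mean(m_alpha^(eps)(x_i)), and the distance
# between such means lower-bounds d_alpha^(eps) (Remarks 2.2 and 3.4).
import numpy as np
import ot

def mean_shift_step_truncation(pts, eps):
    """One mean-shift iteration with the truncation kernel K = 1_{[0,1]}."""
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    K = (D <= eps).astype(float)              # K(||x_j - x_i|| / eps)
    return (K @ pts) / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
pts = rng.random((40, 2))
eps = 0.3
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
means = mean_shift_step_truncation(pts, eps)  # = mean(m_alpha^(eps)(x_i)) for every i

# Remark 3.4 on one pair of points: ||mean_0 - mean_1|| <= d_alpha^(eps)(x_0, x_1).
W = (D <= eps).astype(float)
W /= W.sum(axis=1, keepdims=True)
d01 = ot.emd2(W[0], W[1], D)
assert np.linalg.norm(means[0] - means[1]) <= d01 + 1e-9
```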
4. Stability Under Perturbations of α

The goal of this section is to establish the stability of the Wasserstein transform Wε(α) under perturbations of the probability measure α representing the dataset. As a byproduct of this, we will also obtain a novel stability result for mean shift. We are not aware of extant related stability results for MS in the literature. As before, we fix a compact metric space (X, d_X) (the ambient space). Probability measures on X are required to satisfy a mild doubling-type condition.

Definition 3. Given Λ > 0, we say that a Borel measure α on X satisfies the Λ-doubling condition if for all x ∈ supp(α) and all r_1 ≥ r_2 > 0 one has

$$ \frac{\alpha(B_{r_1}(x))}{\alpha(B_{r_2}(x))} \le \left(\frac{r_1}{r_2}\right)^{\Lambda}. $$

Remark 4.1. Suppose α ∈ P_f(X) and diam(X) < D. If α satisfies the Λ-doubling condition, then we have α(B_r(x)) ≥ ψ_{Λ,D}(r) for all x ∈ X and r > 0, where ψ_{Λ,D}(r) := min{1, (r/D)^Λ}.

Proof. Take r_1 = D and r_2 = r in Definition 3. When r ≤ D, we have α(B_D(x))/α(B_r(x)) ≤ (D/r)^Λ. Notice that X = B_D(x), hence α(B_D(x)) = α(X) = 1. Therefore α(B_r(x)) ≥ (r/D)^Λ. When r > D, obviously we have α(B_r(x)) = α(X) = 1.

Setup and assumptions. We assume that the diameter of X satisfies diam(X) := max_{x,x' ∈ X} d_X(x, x') < D for some D > 0. Additionally, we assume that two (fully supported) probability measures α and β in P_f(X) are given and satisfy the doubling condition for some Λ > 0. Also, since our results below are for local truncations, we fix a scale parameter ε > 0. Define

$$ \Phi_{\Lambda,D,\varepsilon}(\eta) := \frac{\eta}{\psi_{\Lambda,D}(\varepsilon)} + \left[\left(1 + \frac{\eta}{\varepsilon}\right)^{\Lambda} - 1\right] \quad \text{for } \eta \ge 0. $$

Remark 4.2. Notice that Φ_{Λ,D,ε}(η) is an increasing function of η and, furthermore, that lim_{η→0} Φ_{Λ,D,ε}(η) = 0.

Then we have the following stability result for the localization (via local truncations) of two different probability measures on the same ambient space. The stability is expressed in terms of the Wasserstein distance itself.

Theorem 4.3 (Stability of local truncations).

$$ \sup_{x \in X} d_{W,1}\big(m^{(\varepsilon)}_\alpha(x),\, m^{(\varepsilon)}_\beta(x)\big) \le (1 + 2\varepsilon)\, \Phi_{\Lambda,D,\varepsilon}\Big(\sqrt{d_{W,1}(\alpha, \beta)}\Big). $$

Remark 4.4. We have also established a stability theorem for the WT with Lipschitz-kernel-based localization operators in which no condition such as the Λ-doubling condition is required. Assume X is a compact metric space. If K : R_+ → R_+ is any C-Lipschitz kernel, then there exist constants N > 0 and M > 0, depending only on K and X, such that for all probability measures α and β on X, and all x ∈ X:

$$ d_{W,1}\big(m^{K}_\alpha(x),\, m^{K}_\beta(x)\big) \le \frac{2C\,\mathrm{diam}(X) + M}{N}\, d_{W,1}(\alpha, \beta). $$

Above, m^K_α(x)(A) = ∫_A K(d(x,y)) dα(y) / ∫_X K(d(x,y)) dα(y) for A ⊆ X. See (Mémoli et al., 2018) for more details.

By Remark 4.2, Theorem 4.3 indicates that if α and β are similar in terms of the Wasserstein distance, then for every point x ∈ X the localized measures m^(ε)_α(x) and m^(ε)_β(x) will also be similar. As a consequence of Theorem 4.3 we obtain the following two theorems.

Theorem 4.5 (Stability of d^(ε)_α).

$$ \sup_{x, x' \in X} \big| d^{(\varepsilon)}_\alpha(x, x') - d^{(\varepsilon)}_\beta(x, x') \big| \le 2(1 + 2\varepsilon)\, \Phi_{\Lambda,D,\varepsilon}\Big(\sqrt{d_{W,1}(\alpha, \beta)}\Big). $$

Proof. By the triangle inequality for the Wasserstein distance we have, for any x, x' ∈ X,

$$ \begin{aligned} \big| d^{(\varepsilon)}_\alpha(x, x') - d^{(\varepsilon)}_\beta(x, x') \big| &= \big| d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\alpha(x')\big) - d_{W,1}\big(m^{(\varepsilon)}_\beta(x), m^{(\varepsilon)}_\beta(x')\big) \big| \\ &\le \big| d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\alpha(x')\big) - d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x')\big) \big| \\ &\quad + \big| d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x')\big) - d_{W,1}\big(m^{(\varepsilon)}_\beta(x), m^{(\varepsilon)}_\beta(x')\big) \big| \\ &\le d_{W,1}\big(m^{(\varepsilon)}_\alpha(x'), m^{(\varepsilon)}_\beta(x')\big) + d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x)\big). \end{aligned} $$

Therefore, by taking the supremum on both sides and invoking Theorem 4.3, we obtain the claim.

Theorem 4.6 (Stability of mean shift for local truncations). Assume that (X, d_X) is a subspace of R^n with the Euclidean distance. Then, for mean shift arising from local ε-truncations, we have for all x ∈ X:

$$ \big\| \mathrm{mean}\big(m^{(\varepsilon)}_\alpha(x)\big) - \mathrm{mean}\big(m^{(\varepsilon)}_\beta(x)\big) \big\| \le (1 + 2\varepsilon)\, \Phi_{\Lambda,D,\varepsilon}\Big(\sqrt{d_{W,1}(\alpha, \beta)}\Big). $$

Proof. By Remark 2.2 and Theorem 4.3 we have, for all x ∈ X,

$$ \big\| \mathrm{mean}\big(m^{(\varepsilon)}_\alpha(x)\big) - \mathrm{mean}\big(m^{(\varepsilon)}_\beta(x)\big) \big\| \le d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x)\big) \le (1 + 2\varepsilon)\, \Phi_{\Lambda,D,\varepsilon}\Big(\sqrt{d_{W,1}(\alpha, \beta)}\Big). $$
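The stability statements can be probed numerically on the real line, where W_1 between empirical measures has a closed form (exposed in SciPy as `scipy.stats.wasserstein_distance`); the construction below is a sanity-check sketch with arbitrary parameters, not an experiment from the paper.

```python
# Numerical illustration of stability: a small perturbation of the data measure
# alpha (small in d_{W,1}) changes the truncated localized measures, and hence
# d_alpha^(eps), only slightly.  Sample size, noise level and eps are arbitrary.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 300)              # samples defining alpha
y = x + rng.normal(0.0, 0.005, x.shape)     # slightly perturbed samples defining beta
eps = 0.1

def d_eps(samples, c1, c2, eps):
    """W_1 between the eps-truncated, renormalized empirical measures around c1 and c2."""
    B1 = samples[np.abs(samples - c1) <= eps]
    B2 = samples[np.abs(samples - c2) <= eps]
    return wasserstein_distance(B1, B2)

c1, c2 = 0.25, 0.75
print(abs(d_eps(x, c1, c2, eps) - d_eps(y, c1, c2, eps)))   # small
print(wasserstein_distance(x, y))                           # d_{W,1}(alpha, beta), also small
```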
4.1. The Proof of Theorem 4.3

Proof of Theorem 4.3. To bound the Wasserstein distance d_{W,1}(m^(ε)_α(x), m^(ε)_β(x)) between the localized measures associated to α and β, it is more convenient to first analyze the Prokhorov distance (Gibbs & Su, 2002), and then convert the result into a Wasserstein distance bound via the lemma below.

Recall that the Prokhorov distance d_P(α, β) between probability measures α and β equals

$$ d_P(\alpha, \beta) = \inf\{\delta > 0 : \alpha(A) \le \beta(A^\delta) + \delta \ \text{ for all } A \subseteq X\}. $$

Here A^δ is the δ-fattening of A: the set of points in X which are at distance less than δ from a point in A. Though seemingly asymmetric, d_P is actually symmetric (Gibbs & Su, 2002).

Lemma 4.7 (Theorem 2 of (Gibbs & Su, 2002)). Given a metric space (X, d_X) with bounded diameter and α, β ∈ P_f(X), we have the following relation between the Wasserstein and Prokhorov distances:

$$ d_P(\alpha, \beta)^2 \le d_{W,1}(\alpha, \beta) \le \big(1 + \mathrm{diam}(X)\big)\, d_P(\alpha, \beta). $$

Remark 4.8. If α and β are not fully supported, then by restricting the metric d_X to S = supp(α) ∪ supp(β) ⊆ X, the rightmost inequality above can be improved to d_{W,1}(α, β) ≤ (1 + diam(S)) d_P(α, β).

Claim 1. For any x ∈ X, we have

$$ d_P\big(m^{(\varepsilon)}_\alpha(x),\, m^{(\varepsilon)}_\beta(x)\big) \le \Phi_{\Lambda,D,\varepsilon}\big(d_P(\alpha, \beta)\big). $$

Proof of Claim 1. Suppose d_P(α, β) < η for some η > 0. Fix x ∈ X and assume WLOG that β(B_ε(x)) ≤ α(B_ε(x)). Then invoke the expression

$$ d_P\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x)\big) = \inf\{\delta > 0 : m^{(\varepsilon)}_\alpha(x)(A) \le m^{(\varepsilon)}_\beta(x)(A^\delta) + \delta \ \text{ for all } A \subseteq X\}. $$

For any A ⊆ X we have the following inclusions:

$$ \begin{aligned} (A \cap B_\varepsilon(x))^\eta &\subseteq A^\eta \cap (B_\varepsilon(x))^\eta \subseteq A^\eta \cap B_{\varepsilon+\eta}(x) \\ &= \big[A^\eta \cap B_\varepsilon(x)\big] \cup \big[A^\eta \cap \big(B_{\varepsilon+\eta}(x) \setminus B_\varepsilon(x)\big)\big] \\ &\subseteq \big[A^\eta \cap B_\varepsilon(x)\big] \cup \big[B_{\varepsilon+\eta}(x) \setminus B_\varepsilon(x)\big]. \end{aligned} $$

Then, by monotonicity of measures and the fact that d_P(α, β) < η, we have

$$ \begin{aligned} m^{(\varepsilon)}_\alpha(x)(A) &= \frac{\alpha(A \cap B_\varepsilon(x))}{\alpha(B_\varepsilon(x))} \le \frac{\beta\big((A \cap B_\varepsilon(x))^\eta\big) + \eta}{\alpha(B_\varepsilon(x))} \\ &\le \frac{\beta\big(A^\eta \cap B_\varepsilon(x)\big) + \beta\big(B_{\varepsilon+\eta}(x) \setminus B_\varepsilon(x)\big) + \eta}{\beta(B_\varepsilon(x))} \\ &= m^{(\varepsilon)}_\beta(x)(A^\eta) + \frac{\beta\big(B_{\varepsilon+\eta}(x)\big)}{\beta(B_\varepsilon(x))} - 1 + \frac{\eta}{\beta(B_\varepsilon(x))} \\ &\le m^{(\varepsilon)}_\beta(x)(A^\eta) + \left[\left(1 + \frac{\eta}{\varepsilon}\right)^{\Lambda} - 1\right] + \frac{\eta}{\psi_{\Lambda,D}(\varepsilon)} \\ &= m^{(\varepsilon)}_\beta(x)(A^\eta) + \xi, \end{aligned} $$

where ξ := Φ_{Λ,D,ε}(η) and the last inequality follows from the Λ-doubling condition for β together with Remark 4.1. Note that since (1 + η/ε)^Λ − 1 ≥ 0 and ψ_{Λ,D}(ε) ≤ 1, we have ξ ≥ η. Thus, from the inequality above, and since A^η ⊆ A^ξ,

$$ m^{(\varepsilon)}_\alpha(x)(A) \le m^{(\varepsilon)}_\beta(x)(A^\eta) + \xi \le m^{(\varepsilon)}_\beta(x)(A^\xi) + \xi. $$

Therefore d_P(m^(ε)_α(x), m^(ε)_β(x)) ≤ ξ = Φ_{Λ,D,ε}(η). Then, by letting η ↓ d_P(α, β), we obtain d_P(m^(ε)_α(x), m^(ε)_β(x)) ≤ Φ_{Λ,D,ε}(d_P(α, β)), where the right-hand side is independent of x, so the proof is done.

We now finish the proof of Theorem 4.3. Since supp(m^(ε)_α(x)) and supp(m^(ε)_β(x)) are both contained in B_ε(x) and diam(B_ε(x)) ≤ 2ε, we have from Remark 4.8 that

$$ d_{W,1}\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x)\big) \le (1 + 2\varepsilon)\, d_P\big(m^{(\varepsilon)}_\alpha(x), m^{(\varepsilon)}_\beta(x)\big). $$

From this inequality, by Claim 1 above we in turn obtain d_{W,1}(m^(ε)_α(x), m^(ε)_β(x)) ≤ (1 + 2ε) Φ_{Λ,D,ε}(d_P(α, β)). Finally, since Φ_{Λ,D,ε}(η) is an increasing function of η, by Lemma 4.7 we obtain the statement of the theorem.

5. Implementation and Experiments

In the case of the WT arising from local truncations, Wε, for each pair of points x, x' ∈ X the computation of d^(ε)_α(x, x') = d_{W,1}(m^(ε)_α(x), m^(ε)_α(x')) only requires knowledge of the rectangular chunk of d_X consisting of those points in B_ε(x) × B_ε(x') and, as such, the size of each instance of d_{W,1} can be controlled by choosing ε to be sufficiently small. The solution of the associated Kantorovich optimal transport problem was carried out via entropic regularization (Cuturi, 2013; Genevay et al., 2016; Peyré et al., 2017) using the Sinkhorn code from (Peyré, 2017). The computation of the matrix (d^(ε)_α(x, x'))_{x,x' ∈ X} is an eminently parallelizable task. In our implementation we ran this on a 24-core server via Matlab's parallel computing toolbox. In all of our experiments we used the implementation sinkhorn_log from (Peyré, 2017) with options.niter = 2, epsilon = 0.05, and options.tau = 0.
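The paper uses Peyré's Matlab `sinkhorn_log` implementation; the following is an independent, minimal log-domain Sinkhorn sketch in Python (with parameter names and defaults of our choosing) showing how each entry of d^(ε)_α can be approximated from the cost block indexed by the two ε-balls.

```python
# Entropic approximation of the OT problems inside W_eps: log-domain Sinkhorn
# on the block of the ground distance matrix indexed by B_eps(x_i) x B_eps(x_j).
# A sketch, not the code used in the paper.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log_cost(a, b, C, reg=0.05, n_iter=200):
    """Approximate <C, P> for the entropic-OT plan P with marginals a, b."""
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros_like(a), np.zeros_like(b)     # dual potentials
    for _ in range(n_iter):
        f = reg * (log_a - logsumexp((g[None, :] - C) / reg, axis=1))
        g = reg * (log_b - logsumexp((f[:, None] - C) / reg, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / reg)
    return float(np.sum(P * C))

def wt_sinkhorn(D, eps, reg=0.05):
    """d_alpha^(eps) via Sinkhorn, using only the rectangular chunk of D per pair."""
    n = D.shape[0]
    balls = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    out = np.zeros_like(D)
    for i in range(n):
        for j in range(i + 1, n):
            I, J = balls[i], balls[j]
            a = np.full(len(I), 1.0 / len(I))
            b = np.full(len(J), 1.0 / len(J))
            out[i, j] = out[j, i] = sinkhorn_log_cost(a, b, D[np.ix_(I, J)], reg)
    return out

pts = np.random.default_rng(0).random((40, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
D_eps = wt_sinkhorn(D, eps=0.3)
```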
Ameliorating the chaining effect. In this application we considered the case of clustering two well-defined disk-shaped blobs (each containing 100 points) connected by a thin trail consisting of 30 points. This is a standard scenario in which single linkage hierarchical clustering fails to detect the existence of two clusters due to the so-called chaining effect. However, successive applications of the Wasserstein transform Wε (corresponding to local truncations) consistently improve the quality of the dendrograms. See Figure 4 for results, and see Figure 5 for a study of the effects of increasing ε and the number of iterations on this dataset. As already suggested by the interpretation in Figure 1, ε-neighborhoods of points in the interior of the dumbbell are essentially two-dimensional, whereas ε-neighborhoods of points on the chain are one-dimensional; this means that their Wasserstein distance will be quite large, thus having the desired effect of separating the clusters in the sense of the distance d^(ε)_α.

Figure 4. (Chaining effect and WT.) Top left: a dumbbell shape consisting of two disk-shaped blobs, each with 100 points, separated by a thin chain of 30 points in the plane with the Euclidean distance. The diameter of the initial shape was approximately 4. From left to right: 0, 1, 2, 3, and 4 iterations of Wε for ε = 0.3. The top row shows MDS (multi-dimensional scaling) plots of the successive metric spaces thus obtained (color is by class: first blob, chain, and second blob), the middle row shows their distance matrices (ordered so that first we see the points in one blob, then the points on the connecting chain, and then the points of the second blob), and the third row shows the corresponding single linkage dendrograms. Notice how the MDS plots/distance matrices/dendrograms at iteration 5 exhibit clearly defined clusters.
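A small re-creation of this experiment (a sketch with a reduced point count and made-up blob parameters, not the authors' Matlab code) can be assembled from the pieces above: build a dumbbell, iterate the local-truncation Wε via the POT package, and run single linkage on the transformed metric with SciPy.

```python
# Chaining-effect sketch: two blobs joined by a thin chain, a few iterations of
# W_eps (local truncations, exact OT via ot.emd2), then single-linkage
# clustering on the transformed distance matrix.  Sizes are smaller than the
# paper's 100 + 30 + 100 points to keep the O(n^2) OT calls fast.
import numpy as np
import ot
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
blob1 = rng.normal([0.0, 0.0], 0.3, (50, 2))
blob2 = rng.normal([4.0, 0.0], 0.3, (50, 2))
chain = np.column_stack([np.linspace(0.7, 3.3, 15), np.zeros(15)])
pts = np.vstack([blob1, chain, blob2])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

def wt_step(D, eps):
    W = (D <= eps).astype(float)
    W /= W.sum(axis=1, keepdims=True)
    out = np.zeros_like(D)
    n = D.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = ot.emd2(W[i], W[j], D)
    return out

for _ in range(3):                     # iterate W_eps (cf. Remark 3.1)
    D = wt_step(D, eps=0.3)

# Single linkage on the transformed metric; the chaining effect is reduced (cf. Figure 4).
Z = linkage(squareform(D, checks=False), method="single")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])         # cluster sizes
```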
Table 1. (Classification results over the MNIST database.) Entries show the number of incorrectly classified images over 5K test images (with 5K training images). The best performance of d_T (tangent distance) is for k = 1 and k = 3. In both cases WT provides a better performance than d_T. There is no major difference between the performance of the exact calculation of WT and the one via Sinkhorn; however, these take place for different values of knn. Notice that in this dataset the best performances of MS and WT are similar (although they happen for different values of knn).

| knn | 1 | 2 | 3 | 4 | 5 | 10 | 20 | 50 | 100 | 500 |
|---|---|---|---|---|---|---|---|---|---|---|
| WT-Exact-1 | 121 | 144 | 121 | 122 | 116 | 135 | 181 | 258 | 345 | 1231 |
| WT-Sinkhorn-1 | 125 | 145 | 118 | 121 | 114 | 142 | 183 | 259 | 342 | 1182 |
| MS-1 | 117 | 131 | 128 | 133 | 133 | 157 | 199 | 285 | 371 | 1223 |
| d_T | 133 | 166 | 130 | 141 | 145 | 176 | 219 | 324 | 435 | 1198 |

Table 2. (Classification results over the Grassmann manifold dataset: error rates, for noise width N = 1.0.) The best overall performance (0.9753) was obtained by 3 iterations of WT with ε = 0.9. For all ε except 1.2, WT has better performance than directly using the manifold distance (which is the baseline). The best performance (1.8148) of MS corresponds to the Gaussian kernel for ε = 0.8. The best performance of WT for that value of ε (1.0039) is the one corresponding to the 3rd iteration. For ε = 0.9 the best performance of MS (1.9155) is worse than the baseline 1.7903, whereas the performance of WT is still better (0.9751). Interestingly, whereas in some cases successive iterations of WT improve its performance, with these parameter choices, in no case did further iterations of MS improve performance. For ε = 1.0 and 1.1 the only method that performs better than the baseline is one iteration of WT. For ε = 1.2 all methods performed below the baseline. For all values of ε the performance of MS with the truncation kernel was competitive neither with that of MS with the Gaussian kernel nor with that of WT. See Figure 8 regarding the choice of the range of ε.

|  | dist | trunc1 | trunc2 | trunc3 | gauss1 | gauss2 | gauss3 | wass1 | wass2 | wass3 |
|---|---|---|---|---|---|---|---|---|---|---|
| ε = 0.8 | 1.7903 | 13.1189 | 22.7249 | 27.1865 | 1.8148 | 8.3543 | 31.5804 | 1.4912 | 1.1150 | 1.0039 |
| ε = 0.9 | 1.7903 | 15.7174 | 28.9120 | 51.0221 | 1.9155 | 13.2684 | 44.0966 | 1.2031 | 1.3344 | 0.9753 |
| ε = 1.0 | 1.7903 | 19.3057 | 43.2942 | 58.2056 | 2.1883 | 15.7550 | 54.4154 | 1.2328 | 2.1466 | 8.7605 |
| ε = 1.1 | 1.7903 | 25.8416 | 58.0784 | 75.3460 | 2.3203 | 23.3293 | 65.4943 | 1.7193 | 7.3658 | 75.2454 |
| ε = 1.2 | 1.7903 | 28.2444 | 68.2164 | 71.4105 | 2.4537 | 33.3808 | 74.4915 | 2.5633 | 65.2158 | 75.2193 |

Figure 5. (Chaining effect: varying ε and number of iterations of WT.) In this figure we computed 14 different iterations of the dumbbell dataset for ε = 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4. Notice how the distance matrices corresponding to the lower right corner show a very well defined block structure, indicative of the presence of two large clusters (the blobs) and a smaller one (the points originally corresponding to the chain).

Denoising of a circle. In this example we study a dataset consisting of 800 points uniformly spaced on a circle with radius 1 centered at the origin in R². This circle is heavily corrupted by 1200 outliers chosen uniformly at random in the square [−1, 1] × [−1, 1]. This type of preprocessing may help in applications where one wishes to detect voids/loops in data which may signal periodicity (Perea, 2016; Emrani et al., 2014). We compare the performance of Wε with MS (with the same kernel and the same parameter ε) through 6 successive iterations. See Figure 6.

Figure 6. (Denoising of a circle: several iterations of mean shift vs. Wε.) The top row shows the result of applying mean shift with the truncation kernel; the bottom row shows 2D MDS plots of the results obtained from applying the local truncation Wasserstein transform Wε. In each case ε was chosen to be 0.3 relative to the diameter at each iteration. The first column shows the initial dataset, which is the same for both cases. From left to right we show an increasing number of iterations. Notice how Wε is able to better resolve the shape of the circle; in particular, it is better at displacing interior points towards the high density area around the circle. This feature indicates that Wε can be useful as a preprocessing step for persistent homology calculations (Perea, 2016) or before applying nonlinear dimensionality reduction or manifold learning techniques to a dataset.
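The noisy-circle setup can be re-created as follows (a sketch of the mean-shift baseline of Figure 6, not the paper's code; the fixed ε below is a simplification, since the figure rescales ε relative to the diameter at each iteration, and the Wε iterations would be applied to the same data via the routines sketched earlier).

```python
# Noisy circle: 800 points on the unit circle plus 1200 uniform outliers in
# [-1, 1]^2, followed by several mean-shift iterations with the truncation
# kernel (the MS baseline of Figure 6).  The fixed eps is a simplification.
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0.0, 2.0 * np.pi, 800, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
outliers = rng.uniform(-1.0, 1.0, (1200, 2))
pts = np.vstack([circle, outliers])

def mean_shift_iteration(pts, eps):
    """Move each point to the mean of its eps-neighborhood (truncation kernel)."""
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    K = (D <= eps).astype(float)
    return (K @ pts) / K.sum(axis=1, keepdims=True)

for _ in range(6):                     # 6 successive iterations, as in Figure 6
    pts = mean_shift_iteration(pts, eps=0.3)
```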
Experiment on the MNIST dataset. In this example we performed knn classification on 5K test images (using 5K training images) from the MNIST dataset. We used both deskewing and the tangent distance d_T as explained in (LeCun et al., 1998). We tested 3 different methods via one iteration of each method: WT-Exact-1 means the exact computation of the WT (via a standard linear programming algorithm for solving each OT problem (Rubner, 1998)), WT-Sinkhorn-1 means computation via the Sinkhorn algorithm, whereas MS-1 refers to mean shift (with the truncation kernel). In each of the three cases the ε parameter (defining the neighborhood) was chosen to be the same and equal to 0.075 of the maximal value of the tangent distance matrix corresponding to the 10K points under consideration. The classification was done for different values of knn. In each case, we partitioned the data into the same 5K training points and 5K test points. See Table 1 for the classification results, which indicate that the best performance for MS is comparable to the best performance of WT (for both the exact computation and Sinkhorn). We now present results on a dataset with a more intricate underlying geometric structure.

Figure 7. (Grassmann manifold experiment: distance matrices.) This figure shows the distance matrices (for one realization of the 400 random points in G_{10,6}) corresponding to the rows and columns from Table 2. From this figure, it is evident that WT can help accentuate clustering by moving apart points in different clusters and by concentrating similar points. It is apparent that when the parameter ε is large, all three methods perform poorly; furthermore, their performance degrades with further iterations (columns 4, 7, and 10). Figure 8 provides one possible explanation.

Figure 8. (Grassmann manifold experiment: choice of ε.) The figure shows the histogram of the manifold distance matrix on G_{10,6} corresponding to one realization of the 400 random matrices. Whereas ε = 0.8 seems to be small enough so that any ball B_ε(x) around a point x of the dataset captures only local information, values of ε ≥ 1.2 would induce balls B_ε(x) each of which covers a very large portion of the dataset, thus failing to be local. This helps explain the poor performance of both MS (truncation and Gaussian) and WT when ε = 1.0, 1.1 and 1.2 in Table 2.

Experiment on Grassmann manifold data. We tested WT on the synthetic dataset employed in (Cetingul & Vidal, 2009). In our experiments we generated 400 matrices from the Grassmann manifold G_{10,6} (Absil et al., 2004; Hüper et al., 2010) as follows: we first generated 4 randomly selected (well separated) loci. We then randomly perturbed each locus 100 times. This was done following Section 4.1 of (Cetingul & Vidal, 2009) by corrupting the angles determining each locus by uniform random noise of width N = 1 and mean zero. We then randomly split the set into 100 test matrices and 300 train matrices and estimated the error on a 3nn classification task. The error is averaged over 10000 random selections of the test and train sets. The manifold G_{10,6} comes equipped with a certain distance which we simply refer to as the manifold distance (see the supplementary document). For each point, its 3nns are determined by four different metrics: the manifold distance, the manifold distance after MS with the truncation kernel, the manifold distance after MS with the Gaussian kernel, and the WT of the manifold distance. Both the WT and MS with the truncation kernel require a parameter ε determining the truncation width. The Gaussian kernel requires a standard deviation parameter which we set to 2/3 of the ε-value used for Wε. We then repeated the above for the parameter ε ranging over {0.8, 0.9, 1.0, 1.1, 1.2}. See Table 2 and Figures 7 and 8. For details about the computational techniques and the mathematics of the Grassmannian data, see the supplementary document.
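For readers who want to reproduce the geometric ingredients, the sketch below samples points of G_{10,6} as orthonormal bases and computes a geodesic distance from principal angles via SciPy; whether this particular choice coincides with the manifold distance of the supplementary document is an assumption, and the resulting matrix is what Wε or MS would then operate on.

```python
# Points of the Grassmann manifold G_{10,6} as orthonormal 10x6 bases, and a
# geodesic distance given by the l2 norm of the principal angles between the
# spanned subspaces.  This is one standard choice of Grassmann distance and
# may differ from the manifold distance used in the paper's supplement.
import numpy as np
from scipy.linalg import qr, subspace_angles

def random_grassmann_point(n, k, rng):
    """A random k-dimensional subspace of R^n, represented by an orthonormal basis."""
    Q, _ = qr(rng.normal(size=(n, k)), mode="economic")
    return Q

def grassmann_distance(A, B):
    """Arc-length distance: square root of the sum of squared principal angles."""
    return float(np.linalg.norm(subspace_angles(A, B)))

rng = np.random.default_rng(0)
points = [random_grassmann_point(10, 6, rng) for _ in range(5)]
D = np.array([[grassmann_distance(A, B) for B in points] for A in points])
print(np.round(D, 3))      # a distance matrix on which W_eps or MS can act
```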
6. Conclusions

We introduced the Wasserstein transform as a method that takes a dataset represented by a distance matrix and a probability measure and iteratively alters the distance matrix with the goal of enhancing features and/or removing noise. We established the stability of our method under data perturbations, and exhibited a connection with the MS algorithm which permits establishing the stability of MS as well. We validated the ability of WT to reduce the chaining effect and to denoise data on synthetic datasets. We applied WT to classification tasks on MNIST and Grassmann manifold datasets and showed that WT performed at least as well as MS, despite being applicable in wider settings. For future work, it seems of interest to investigate the theoretical behaviour of the iterated WT, e.g., its connection with Ricci/gradient flows. It also seems of interest to study the experimental performance of versions of the WT based on ℓp-Wasserstein distances for p > 1 and/or other localization operators.

Acknowledgements

We acknowledge funding from these sources: NSF AF 1526513, NSF DMS 1723003, NSF DMS 1547357, and NSF CCF 1740761.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica, 80(2):199–220, 2004.

Blumenthal, L. M. Theory and applications of distance geometry. Oxford: Clarendon Press, 1953.

Cetingul, H. E. and Vidal, R. Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds. In CVPR 2009, pp. 1896–1902. IEEE, 2009.

Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pp. 2292–2300, 2013.

Delon, J. Midway image equalization. Journal of Mathematical Imaging and Vision, 21(2):119–134, 2004.

Emrani, S., Gentimis, T., and Krim, H. Persistent homology of delay embeddings and its application to wheeze detection. IEEE Signal Processing Letters, 21(4):459–463, 2014.

Fukunaga, K. and Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In NIPS, pp. 3440–3448, 2016.

Gibbs, A. L. and Su, F. E. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.

Hüper, K., Helmke, U., and Herzberg, S. On the computation of means on Grassmann manifolds. In Proc. Int. Symp. MTNS, volume 19, pp. 2439–2441, 2010.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In ICML, pp. 957–966, 2015.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Mallasto, A. and Feragen, A. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In NIPS, pp. 5660–5670, 2017.

Mémoli, F., Smith, Z., and Wan, Z. The Wasserstein transform. arXiv preprint arXiv:1810.07793, 2018.

Perea, J. A. Persistent homology of toroidal sliding window embeddings. In ICASSP 2016, pp. 6435–6439. IEEE, 2016.
Peyré, G. Sinkhorn code. https://github.com/gpeyre/2017-MCOM-unbalanced-ot, 2017.

Peyré, G., Cuturi, M., et al. Computational optimal transport. Technical report, 2017.

Rolet, A., Cuturi, M., and Peyré, G. Fast dictionary learning with a smoothed Wasserstein loss. In Artificial Intelligence and Statistics, pp. 630–638, 2016.

Rubner, Y. Code for the earth mover's distance. http://robotics.stanford.edu/~rubner/emd/default.htm, 1998.

Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pp. 59–66. IEEE, 1998.

Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser, NY, pp. 99–102, 2015.

Shamir, A., Shapira, L., and Cohen-Or, D. Mesh analysis using geodesic mean-shift. The Visual Computer, 22(2):99–108, 2006.

Solomon, J., Rustamov, R., Guibas, L., and Butscher, A. Wasserstein propagation for semi-supervised learning. In ICML, pp. 306–314, 2014.

Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM TOG, 34(4):66, 2015.

Solomon, J., Peyré, G., Kim, V. G., and Sra, S. Entropic metric alignment for correspondence problems. ACM TOG, 35(4):72, 2016.

Subbarao, R. and Meer, P. Nonlinear mean shift over Riemannian manifolds. IJCV, 84(1):1, 2009.

Villani, C. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.

Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.