LEARNING EMBEDDINGS INTO ENTROPIC WASSERSTEIN SPACES

Charlie Frogner, MIT CSAIL and MIT-IBM Watson AI Lab, frogner@mit.edu
Farzaneh Mirzazadeh, MIT-IBM Watson AI Lab and IBM Research, farzaneh@ibm.com
Justin Solomon, MIT CSAIL and MIT-IBM Watson AI Lab, jsolomon@mit.edu

ABSTRACT

Euclidean embeddings of data are fundamentally limited in their ability to capture latent semantic structures, which need not conform to Euclidean spatial assumptions. Here we consider an alternative, which embeds data as discrete probability distributions in a Wasserstein space, endowed with an optimal transport metric. Wasserstein spaces are much larger and more flexible than Euclidean spaces, in that they can successfully embed a wider variety of metric structures. We exploit this flexibility by learning an embedding that captures semantic information in the Wasserstein distance between embedded distributions. We examine empirically the representational capacity of our learned Wasserstein embeddings, showing that they can embed a wide variety of metric structures with smaller distortion than an equivalent Euclidean embedding. We also investigate an application to word embedding, demonstrating a unique advantage of Wasserstein embeddings: we can visualize the high-dimensional embedding directly, since it is a probability distribution on a low-dimensional space. This obviates the need for dimensionality reduction techniques like t-SNE for visualization.

1 INTRODUCTION

Learned embeddings form the basis for many state-of-the-art learning systems. Word embeddings like word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2017), and ELMo (Peters et al., 2018) are ubiquitous in natural language processing, where they are used for tasks like machine translation (Neubig et al., 2018); graph embeddings (Nickel et al., 2016) like node2vec (Grover & Leskovec, 2016) are used to represent knowledge graphs; and pre-trained image models (Simon et al., 2016) appear in many computer vision pipelines. An effective embedding should capture the semantic structure of the data with high fidelity, in a way that is amenable to downstream tasks. This makes the choice of a target space for the embedding important, since different spaces can represent different types of semantic structure. The most common choice is to embed data into Euclidean space, where distances and angles between vectors encode their levels of association (Mikolov et al., 2013; Weston et al., 2011; Kiros et al., 2014; Mirzazadeh et al., 2014). Euclidean spaces, however, are limited in their ability to represent complex relationships between inputs, since they make restrictive assumptions about neighborhood sizes and connectivity. This drawback has been documented recently for tree-structured data, for example, where spaces of negative curvature are required due to exponential scaling of neighborhood sizes (Nickel & Kiela, 2017; 2018).

In this paper, we embed input data as probability distributions in a Wasserstein space. Wasserstein spaces endow probability distributions with an optimal transport metric, which measures the distance traveled in transporting the mass in one distribution to match another. Recent theory has shown that Wasserstein spaces are quite flexible, more so than Euclidean spaces, allowing a variety of other metric spaces to be embedded within them while preserving their original distance metrics.
As such, they make attractive targets for embeddings in machine learning, where this flexibility might capture complex relationships between objects when other embeddings fail to do so. Unlike prior work on Wasserstein embeddings, which has focused on embedding into Gaussian distributions (Muzellec & Cuturi, 2018; Zhu et al., 2018), we embed input data as discrete distributions supported at a fixed number of points. In doing so, we attempt to access the full flexibility of Wasserstein spaces to represent a wide variety of structures.

Optimal transport metrics and their gradients are costly to compute, requiring the solution of a linear program. For efficiency, we use an approximation to the Wasserstein distance called the Sinkhorn divergence (Cuturi, 2013), in which the underlying transport problem is regularized to make it more tractable. While less well-characterized theoretically with respect to embedding capacity, the Sinkhorn divergence is computed efficiently by a fixed-point iteration. Moreover, recent work has shown that it is suitable for gradient-based optimization via automatic differentiation (Genevay et al., 2018b). To our knowledge, our work is the first to explore embedding properties of the Sinkhorn divergence.

We empirically investigate two settings for Wasserstein embeddings. First, we demonstrate their representational capacity by embedding a variety of complex networks, for which Wasserstein embeddings achieve higher fidelity than both Euclidean and hyperbolic embeddings. Second, we compute Wasserstein word embeddings, which show retrieval performance comparable to existing methods. One major benefit of our embedding is that the distributions can be visualized directly, unlike most embeddings, which require a dimensionality reduction step such as t-SNE before visualization. We demonstrate the power of this approach by visualizing the learned word embeddings.

2 PRELIMINARIES

2.1 OPTIMAL TRANSPORT AND WASSERSTEIN DISTANCE

The p-Wasserstein distance between probability distributions $\mu$ and $\nu$ over a metric space $X$ is
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times X} d(x_1, x_2)^p \, d\pi(x_1, x_2) \right)^{1/p}, \qquad (1)$$
where the infimum is taken over transport plans $\pi$ that distribute the mass in $\mu$ to match that in $\nu$, with the $p$-th power of the ground metric $d(x_1, x_2)$ on $X$ giving the cost of moving a unit of mass from support point $x_1 \in X$ underlying distribution $\mu$ to support point $x_2 \in X$ underlying $\nu$. The Wasserstein distance is the cost of the optimal transport plan matching $\mu$ and $\nu$ (Villani, 2003).

In this paper, we are concerned with discrete distributions supported on finite sets of points in $\mathbb{R}^n$:
$$\mu = \sum_{i=1}^{M} u_i \delta_{x^{(i)}} \quad \text{and} \quad \nu = \sum_{i=1}^{N} v_i \delta_{y^{(i)}}. \qquad (2)$$
Here, $u$ and $v$ are vectors of nonnegative weights summing to 1, and $\{x^{(i)}\}_{i=1}^{M}, \{y^{(i)}\}_{i=1}^{N} \subset \mathbb{R}^n$ are the support points. In this case, the transport plan $\pi$ matching $\mu$ and $\nu$ in Equation 1 becomes discrete as well, supported on the product of the two support sets. Define $D \in \mathbb{R}_+^{M \times N}$ to be the matrix of pairwise ground metric distances, with $D_{ij} = d(x^{(i)}, y^{(j)})$. Then, for discrete distributions, Equation 1 is equivalent to solving the following:
$$W_p(\mu, \nu)^p = \min_{T \geq 0} \operatorname{tr}(D^p T^\top) \quad \text{subject to} \quad T\mathbf{1} = u, \; T^\top\mathbf{1} = v, \qquad (3)$$
with $T_{ij}$ giving the mass transported between $x^{(i)}$ and $y^{(j)}$. The power $D^p$ is taken elementwise.

2.2 SINKHORN DIVERGENCE

Equation 3 is a linear program that can be challenging to solve in practice. To improve efficiency, recent learning algorithms use an entropic regularizer proposed by Cuturi (2013).
The resulting Sinkhorn divergence solves a modified version of Equation 3:
$$W_p^\lambda(\mu, \nu)^p = \min_{T \geq 0} \operatorname{tr}(D^p T^\top) + \lambda \operatorname{tr}\left( T \left( \log(T) - \mathbf{1}\mathbf{1}^\top \right)^\top \right) \quad \text{s.t.} \quad T\mathbf{1} = u, \; T^\top\mathbf{1} = v, \qquad (4)$$
where $\log(\cdot)$ is applied elementwise and $\lambda \geq 0$ is the regularization parameter. For $\lambda > 0$, the optimal solution takes the form $T = \Delta(r) \exp(-D^p/\lambda) \Delta(c)$, where $\Delta(r)$ and $\Delta(c)$ are diagonal matrices with diagonal elements $r$ and $c$, respectively. Rather than optimizing over matrices $T$, one can optimize for $r$ and $c$, reducing the size of the problem to $M + N$. This can be solved via matrix balancing, starting from an initial matrix $K := \exp(-D^p/\lambda)$ and alternately projecting onto the marginal constraints until convergence:
$$r \leftarrow u \,./\, Kc, \qquad c \leftarrow v \,./\, K^\top r. \qquad (5)$$
Here, $./$ denotes elementwise division for vectors. Beyond simplicity of implementation, Equation 5 has an additional advantage for machine learning: the steps of this algorithm are differentiable. With this observation in mind, Genevay et al. (2018b) incorporate entropic transport into learning pipelines by applying automatic differentiation (backpropagation) to a fixed number of Sinkhorn iterations. A minimal code sketch of this fixed-point iteration is given at the end of §2.

2.3 WHAT CAN WE EMBED IN THEORY?

Given two metric spaces $A$ and $B$, an embedding of $A$ into $B$ is a map $\phi : A \to B$ that approximately preserves distances, in the sense that the distortion is small:
$$L \, d_A(u, v) \leq d_B(\phi(u), \phi(v)) \leq C L \, d_A(u, v), \quad \forall u, v \in A, \qquad (6)$$
for some uniform constants $L > 0$ and $C \geq 1$. The distortion of the embedding $\phi$ is the smallest $C$ such that Equation 6 holds.

One can characterize how large a space is (its representational capacity) by the spaces that embed into it with low distortion. In practical terms, this capacity determines the types of data (and relationships between them) that can be well-represented in the embedding space. $\mathbb{R}^n$ with the Euclidean metric, for example, embeds into the $L^1$ metric with low distortion, while the reverse is not true (Deza & Laurent, 2009). We do not expect Manhattan-structured data to be well-represented in Euclidean space, no matter how clever the mapping.

Wasserstein spaces are very large: many spaces can embed into Wasserstein spaces with low distortion, even when the converse is not true. $W_p(A)$, for $A$ an arbitrary metric space, embeds any product space $A^n$, for example (Kloeckner, 2010), via discrete distributions supported at $n$ points. Even more generally, certain Wasserstein spaces are universal, in the sense that they can embed arbitrary metrics on finite spaces. $W_1(\ell^1)$ is one such space (Bourgain, 1986), and it is still an open problem to determine whether $W_1(\mathbb{R}^k)$ is universal for any $k < +\infty$. Recently it has been shown that every finite metric space embeds the $1/p$ power of its metric into $W_p(\mathbb{R}^3)$, $p > 1$, with vanishing distortion (Andoni et al., 2015). A hopeful interpretation suggests that $W_1(\mathbb{R}^3)$ may be a plausible target space for arbitrary metrics on symbolic data, with a finite set of symbols; we are unaware of similar universality results for $L^p$ or hyperbolic spaces, for example.

The reverse direction, embedding Wasserstein spaces into others, is well-studied in the case of discrete distributions. Theoretical results in this domain are motivated by interest in efficient algorithms for approximating Wasserstein distances by embedding into spaces with easily-computed metrics. In this direction, low-distortion embeddings are difficult to find. $W_2(\mathbb{R}^3)$, for example, is known not to embed into $L^1$ (Andoni et al., 2016). Some positive results exist, nevertheless.
For a Euclidean ground metric, for example, the 1-Wasserstein distance can be approximated in a wavelet domain (Shirdhonkar & Jacobs, 2008) or by high-dimensional embedding into $L^1$ (Indyk & Thaper, 2003).

In §4, we empirically investigate the embedding capacity of Wasserstein spaces, by attempting to learn low-distortion embeddings for a variety of input spaces. For efficiency, we replace the Wasserstein distance by its entropically-regularized counterpart, the Sinkhorn divergence (§2.2). The embedding capacity of Sinkhorn divergences is previously unstudied, to our knowledge, except in the weak sense that the approximation error with respect to the Wasserstein distance vanishes as the regularizer is taken to zero (Carlier et al., 2017; Genevay et al., 2018a).

2.4 RELATED WORK

While learned vector space embeddings have a long history, there is a recent trend in the representation learning community towards more complex target spaces, such as spaces of probability distributions (Vilnis & McCallum, 2015; Athiwaratkun & Wilson, 2018), Euclidean norm balls (Mirzazadeh et al., 2015; Mirzazadeh, 2017), Poincaré balls (Nickel & Kiela, 2017), and Riemannian manifolds (Nickel & Kiela, 2018). From a modeling perspective, these more complex structures assist in representing uncertainty about the objects being embedded (Vilnis & McCallum, 2015; Bojchevski & Günnemann, 2018) as well as complex relations such as inclusion, exclusion, hierarchy, and ordering (Mirzazadeh et al., 2015; Vendrov et al., 2015; Athiwaratkun & Wilson, 2018). In the same vein, our work takes probability distributions in a Wasserstein space as our embedding targets.

The distance or discrepancy measure between target structures is a major defining factor for a representation learning model. $L^p$ distances as well as angle-based discrepancies are fairly common (Mikolov et al., 2013), as is the KL divergence (Kullback & Leibler, 1951), when embedding into probability distributions. For distributions, however, the KL divergence and $L^p$ distances are problematic, in the sense that they ignore the geometry of the domain of the distributions being compared. For distributions with disjoint support, for example, these divergences do not depend on the separation between the supports. Optimal transport distances (Villani, 2008; Cuturi, 2013; Peyré & Cuturi, 2017; Solomon, 2018), on the other hand, explicitly account for the geometry of the domain. Hence, models based on optimal transport are gaining popularity in machine learning; see (Rubner et al., 1998; Courty et al., 2014; Frogner et al., 2015; Kusner et al., 2015; Arjovsky et al., 2017; Genevay et al., 2018b; Claici et al., 2018; Singh et al., 2018) for some examples.

Learned embeddings into Wasserstein spaces are relatively unexplored. Recent research proposes embedding into Gaussian distributions (Muzellec & Cuturi, 2018; Zhu et al., 2018). Restricting to parametric distributions enables closed-form expressions for transport distances, but the resulting representation space may lose expressiveness. We note that Courty et al. (2018) study embedding in the opposite direction, from Wasserstein into Euclidean space. In contrast, we learn to embed into the space of discrete probability distributions endowed with the Wasserstein distance. Discrete distributions are dense in $W_2$ (Kloeckner, 2012; Brancolini et al., 2009).
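To make the fixed-point iteration of §2.2 concrete, the following is a minimal, self-contained sketch of the entropy-regularized transport cost between two discrete point clouds, with the loop unrolled so that gradients reach the support points by automatic differentiation, in the spirit of Genevay et al. (2018b). It is written in PyTorch for illustration only; the function name, default regularization, and iteration count are our assumptions, not the authors' reference implementation, and very small λ would additionally require log-domain stabilization, which is omitted here.

```python
import torch

def sinkhorn_divergence(x, y, u=None, v=None, p=1, lam=0.5, n_iters=50):
    """Entropy-regularized p-Wasserstein cost between two discrete distributions.

    x: (M, k) support points of mu; y: (N, k) support points of nu.
    u, v: nonnegative weight vectors summing to 1 (uniform if None).
    Returns tr(D^p T^T), the transport cost under the regularized plan T.
    """
    M, N = x.shape[0], y.shape[0]
    u = torch.full((M,), 1.0 / M) if u is None else u
    v = torch.full((N,), 1.0 / N) if v is None else v

    D_p = torch.cdist(x, y).pow(p)    # elementwise p-th power of the ground metric, D^p
    K = torch.exp(-D_p / lam)         # Gibbs kernel K = exp(-D^p / lambda)

    r, c = torch.ones(M) / M, torch.ones(N) / N
    for _ in range(n_iters):          # alternate marginal projections (Equation 5)
        r = u / (K @ c)
        c = v / (K.t() @ r)

    T = r[:, None] * K * c[None, :]   # regularized plan T = diag(r) K diag(c)
    return (T * D_p).sum()            # regularized transport cost

# Usage: the cost is differentiable in the support points of mu.
x = torch.randn(16, 3, requires_grad=True)
y = torch.randn(16, 3)
cost = sinkhorn_divergence(x, y)
cost.backward()                       # gradients w.r.t. x via the unrolled iterations
```

Because the iterations are ordinary differentiable tensor operations, this cost can be dropped directly into the learning objectives of §3 and §4.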
3 LEARNING WASSERSTEIN EMBEDDINGS

3.1 THE LEARNING PROBLEM

We consider the task of recovering a pairwise distance or similarity relationship that may be only partially observed. We are given a collection of objects $C$ (these can be words, symbols, images, or any other data) as well as samples $(u^{(i)}, v^{(i)}, r(u^{(i)}, v^{(i)}))$ of a target relationship $r : C \times C \to \mathbb{R}$ that tells us the degree to which pairs of objects are related. Our objective is to find a map $\phi : C \to W_p(X)$ such that the relationship $r(u, v)$ can be recovered from the Wasserstein distance between $\phi(u)$ and $\phi(v)$, for any $u, v \in C$. Examples include:

1. METRIC EMBEDDING: $r$ is a distance metric, and we want $W_p(\phi(u), \phi(v)) \approx r(u, v)$ for all $u, v \in C$.
2. GRAPH EMBEDDING: $C$ contains the vertices of a graph and $r : C \times C \to \{0, 1\}$ is the adjacency relation; we would like the neighborhood of each $\phi(u)$ in $W_p$ to coincide with graph adjacency.
3. WORD EMBEDDING: $C$ contains individual words and $r$ is a semantic similarity between words. We want distances in $W_p$ to predict this semantic similarity.

Although the details of each task require some adjustment to the learning architecture, our basic representation and training procedure detailed below applies to all three examples.

3.2 OPTIMIZATION

Given a set of training samples $S = \{(u^{(i)}, v^{(i)}, r^{(i)})\}_{i=1}^{N} \subset C \times C \times \mathbb{R}$, we want to learn a map $\phi : C \to W_p(X)$. We must address two issues.

First, we must define the range of our map $\phi$. The whole of $W_p(X)$ is infinite-dimensional, and for a tractable problem we need a finite-dimensional output. We restrict ourselves to discrete distributions with an a priori fixed number of support points $M$, reducing optimal transport to the linear program in Equation 3. Such a distribution is parameterized by the locations of its support points $\{x^{(j)}\}_{j=1}^{M}$, forming a point cloud in the ground metric space $X$. For simplicity, we restrict to uniform weights, although it is possible to optimize simultaneously over weights and locations. As noted in (Brancolini et al., 2009; Kloeckner, 2012; Claici et al., 2018), however, when constructing a discrete $M$-point approximation to a fixed target distribution, allowing non-uniform weights does not improve the asymptotic approximation error.[1]

The second issue is that, as noted in §2.2, exact computation of $W_p$ in general is costly, requiring the solution of a linear program. As in (Genevay et al., 2018b), we replace $W_p$ with the Sinkhorn divergence $W_p^\lambda$, which is solvable by the fixed-point iteration in Equation 5. Learning then takes the form of empirical loss minimization:
$$\hat{\phi} = \arg\min_{\phi \in \mathcal{H}} \sum_{i=1}^{N} \mathcal{L}\left( W_p^\lambda\big(\phi(u^{(i)}), \phi(v^{(i)})\big), \, r^{(i)} \right), \qquad (7)$$
over a hypothesis space of maps $\mathcal{H}$. The loss $\mathcal{L}$ is problem-specific and scores the similarity between the regularized Wasserstein distance $W_p^\lambda$ and the target relationship $r$ at $(u^{(i)}, v^{(i)})$. As mentioned in §2.2, gradients are available from automatic differentiation of the Sinkhorn procedure, and hence with a suitable loss function the learning objective in Equation 7 can be optimized by standard gradient-based methods. In our experiments, we use the Adam optimizer (Kingma & Ba, 2014).

4 EMPIRICAL STUDY

4.1 REPRESENTATIONAL CAPACITY: EMBEDDING COMPLEX NETWORKS

We first demonstrate the representational power of learned Wasserstein embeddings. As discussed in §2.3, theory suggests that Wasserstein spaces are quite flexible, in that they can embed a wide variety of metrics with low distortion. We show that this is true in practice as well.
To generate a variety of metrics to embed, we take networks with various patterns of connectivity and compute the shortest-path distances between vertices. The collection of vertices for each network serves as the input space $C$ for our embedding, and our goal is to learn a map $\phi : C \to W_p(\mathbb{R}^k)$ such that the 1-Wasserstein distance $W_1(\phi(u), \phi(v))$ matches as closely as possible the shortest-path distance between vertices $u$ and $v$, for all pairs of vertices. We learn a minimum-distortion embedding: given a fully-observed distance metric $d_C : C \times C \to \mathbb{R}$ in the input space, we minimize the mean relative distortion over vertex pairs:
$$\hat{\phi} = \arg\min_{\phi} \sum_{v_i \neq v_j} \frac{\left| W_1^\lambda(\phi(v_i), \phi(v_j)) - d_C(v_i, v_j) \right|}{d_C(v_i, v_j)}. \qquad (8)$$
$\phi$ is parameterized as in §3.2, directly specifying the support points of the output distribution. A minimal sketch of this training loop appears after Figure 1.

We examine the performance of Wasserstein embedding using both random networks and real networks. The random networks in particular allow us to systematically test robustness of the Wasserstein embedding to particular properties of the metric we are attempting to embed. Note that these experiments do not explore generalization performance: we are purely concerned with the representational capacity of the learned Wasserstein embeddings.

For random networks, we use three standard generative models: Barabási-Albert (Albert & Barabási, 2002), Watts-Strogatz (Watts & Strogatz, 1998), and the stochastic block model (Holland et al., 1983). Random scale-free networks are generated from the Barabási-Albert model, and possess the property that distances are on average much shorter than in a Euclidean spatial graph, scaling like the logarithm of the number of vertices. Random small-world networks are generated from the Watts-Strogatz model; in addition to log-scaling of the average path length, the vertices of Watts-Strogatz graphs cluster into distinct neighborhoods. Random community-structured networks are generated from the stochastic block model, which places vertices within densely-connected communities, with sparse connections between the different communities. We additionally generate random trees by choosing a random number of children[2] for each node, progressing in breadth-first order until a specified total number of nodes is reached. In all cases, we generate networks with 128 vertices.

[1] In both the non-uniform and uniform cases, the order of convergence in $W_p$ of the nearest weighted point cloud to the target measure, as we add more points, is $O(M^{-1/d})$, for a $d$-dimensional ground metric space. This assumes the underlying measure is absolutely continuous and compactly-supported.
[2] Each non-leaf node has a number of children drawn uniformly from {2, 3, 4}.

[Figure 1: Random networks: learned Wasserstein embeddings achieve lower distortion than Euclidean/hyperbolic embeddings; hyperbolic embeddings outperform specifically on random trees. Panels: (a) random scale-free networks, (b) random small-world networks, (c) random community-structured networks, (d) random trees. Each panel plots log10 distortion against total embedding dimension for Euclidean, hyperbolic, and Wasserstein (R2, R3, R4) embeddings.]
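To illustrate how the pieces of §3.2 and Equation 8 fit together, here is a minimal sketch of the minimum-distortion training loop: one M-point cloud per vertex, an entropy-regularized transport cost between clouds (mirroring the sketch at the end of §2), and Adam on the relative-distortion loss. The graph construction via networkx, the batch sampling, and all hyperparameter values are illustrative assumptions rather than the paper's experimental configuration.

```python
import numpy as np
import torch
import networkx as nx

def sinkhorn_cost(x, y, lam=0.5, n_iters=30):
    # Entropy-regularized 1-Wasserstein cost between uniform point clouds (Eqs. 4-5).
    D = torch.cdist(x, y)
    K = torch.exp(-D / lam)
    u = torch.full((x.shape[0],), 1.0 / x.shape[0])
    v = torch.full((y.shape[0],), 1.0 / y.shape[0])
    r, c = torch.ones_like(u) / len(u), torch.ones_like(v) / len(v)
    for _ in range(n_iters):
        r = u / (K @ c)
        c = v / (K.t() @ r)
    T = r[:, None] * K * c[None, :]
    return (T * D).sum()

# Target metric: shortest-path distances on a small random scale-free graph.
n, M, k = 32, 8, 3                                      # vertices, points per cloud, ground dim
G = nx.barabasi_albert_graph(n, 2, seed=0)
d_C = torch.tensor(np.asarray(nx.floyd_warshall_numpy(G)), dtype=torch.float32)

clouds = torch.nn.Parameter(0.1 * torch.randn(n, M, k))  # one point cloud per vertex
opt = torch.optim.Adam([clouds], lr=1e-2)

for step in range(200):
    i, j = torch.randint(n, (16,)), torch.randint(n, (16,))
    keep = i != j                                         # skip zero-distance diagonal pairs
    i, j = i[keep], j[keep]
    W = torch.stack([sinkhorn_cost(clouds[a], clouds[b]) for a, b in zip(i, j)])
    loss = (torch.abs(W - d_C[i, j]) / d_C[i, j]).mean()  # mean relative distortion (Eq. 8)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The per-pair Sinkhorn calls would be batched in practice; the explicit loop above trades efficiency for clarity.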
We compare against two baselines, trained using the same distortion criterion and optimization method: Euclidean embeddings and hyperbolic embeddings. We expect Euclidean embeddings to perform poorly on all of the chosen graph types, since they are limited to spatial relationships with zero curvature. Hyperbolic embeddings model tree-structured metrics, capturing the exponential scaling of graph neighborhoods; they have been suggested for a variety of other graph families as well (Zhao et al., 2011).

Figure 1 shows the result of embedding random networks.[3] As the total embedding dimension increases, the distortion decreases for all methods. Importantly, Wasserstein embeddings achieve lower distortion than Euclidean and hyperbolic embeddings, establishing their flexibility under the varying conditions represented by the different network models. In some cases, the Wasserstein distortion continues to decrease long after the other embeddings have saturated their capacity. As expected, hyperbolic space significantly outperforms both Euclidean and Wasserstein embeddings specifically on tree-structured metrics.

We test $\mathbb{R}^2$, $\mathbb{R}^3$, and $\mathbb{R}^4$ as ground metric spaces. For all of the random networks we examined, the performance between $\mathbb{R}^3$ and $\mathbb{R}^4$ is nearly indistinguishable. This observation is consistent with theoretical results (§2.3) suggesting that $\mathbb{R}^3$ is sufficient to embed a wide variety of metrics.

We also examine fragments of real networks: an arXiv co-authorship network, an Amazon product co-purchasing network, and a Google web graph (Leskovec & Krevl, 2014). For each graph fragment, we choose uniformly at random a starting vertex and then extract the subgraph on 128 vertices taken in breadth-first order from that starting vertex. Distortion results are shown in Figure 2. Again the Wasserstein embeddings achieve lower distortion than Euclidean or hyperbolic embeddings.

[3] The solid line is the median over 20 randomly-generated inputs, while the shaded region is the middle 95%.

[Figure 2: Real networks: learned Wasserstein embeddings achieve lower distortion than Euclidean and hyperbolic embeddings of real network fragments. Panels: (a) arXiv co-authorship, (b) Amazon product co-purchases, (c) Google web graph. Each panel plots log10 distortion against total embedding dimension for Euclidean, hyperbolic, and Wasserstein (R2, R3, R4) embeddings.]

Table 1: Change in the 5 nearest neighbors when increasing the dimensionality of each point cloud, with the total length of the representation held fixed.
- $W_1^\lambda(\mathbb{R}^2)$: one → f, two, i, after, four; united → series, professional, team, east, central; algebra → skin, specified, equation, hilbert, reducing
- $W_1^\lambda(\mathbb{R}^3)$: one → two, three, s, four, after; united → kingdom, australia, official, justice, officially; algebra → binary, distributions, reviews, ear, combination
- $W_1^\lambda(\mathbb{R}^4)$: one → six, eight, zero, two, three; united → army, union, era, treaty, federal; algebra → tables, transform, equations, infinite, differential

4.2 WORD2CLOUD: WASSERSTEIN WORD EMBEDDINGS

In this section, we embed words as point clouds. In a sentence $s = (x_0, \ldots, x_n)$, a word $x_i$ is associated with word $x_j$ if $x_j$ is in the context of $x_i$, which is a symmetric window around $x_i$. This association is encoded as a label $r$: $r_{x_i, x_j} = 1$ if and only if $|i - j| \leq l$, where $l$ is the window size. A minimal sketch of this pair construction appears below.
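As a concrete illustration of the pair construction just described, the following sketch generates (target, context, label) triples from a token stream, attaching label 1 to words within the symmetric window and label 0 to negatively-sampled words, in the spirit of the negative sampling the paper borrows from Mikolov et al. (2013). The function name, the uniform negative-sampling scheme, and the default window size are assumptions for illustration, not the authors' data pipeline.

```python
import random

def context_pairs(tokens, window=2, vocab=None, n_negative=1, seed=0):
    """Yield (target, other, label) triples for the contrastive loss of Equation 9."""
    rng = random.Random(seed)
    vocab = vocab if vocab is not None else sorted(set(tokens))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield target, tokens[j], 1                # in-context pair: r = 1
            for _ in range(n_negative):
                yield target, rng.choice(vocab), 0    # negative sample: r = 0

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = list(context_pairs(sentence, window=2))
print(pairs[:4])
```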
For word embedding, we use a contrastive loss function (Hadsell et al., 2006):
$$\hat{\phi} = \arg\min_{\phi} \sum_{x_i, x_j \in s} r_{x_i,x_j} \, W_1^\lambda\big(\phi(x_i), \phi(x_j)\big)^2 + (1 - r_{x_i,x_j}) \left[ m - W_1^\lambda\big(\phi(x_i), \phi(x_j)\big) \right]_+, \qquad (9)$$
which tries to embed words $x_i, x_j$ near each other in terms of 1-Wasserstein distance (here $W_1^\lambda$) if they co-occur in the context; otherwise, it prefers moving them at least distance $m$ away from one another. This approach is similar to that suggested by Mikolov et al. (2013), up to the loss and distance functions. We use a Siamese architecture (Bromley et al., 1993) for our model, with negative sampling (as in Mikolov et al. (2013)) for selecting words outside the context. The network architecture in each branch consists of a linear layer with 64 nodes followed by our point cloud embedding layer. The two branches of the Siamese network connect via the Wasserstein distance, computed as in §2.2.

The training dataset is Text8[4], a corpus of 17M tokens from Wikipedia that is commonly used as a language modeling benchmark. We choose a vocabulary of 8000 words, a context window size of l = 2 (i.e., 2 words on each side), λ = 0.05, 3 epochs, a negative sampling rate of 1 per positive, and Adam (Kingma & Ba, 2014) for optimization.

We first study the effect of the dimensionality of the point cloud on the quality of the semantic neighborhood captured by the embedding. We fix the total number of output parameters, i.e., the product of the number of support points and the dimension of the support space, to 63 or 64 parameters. Table 1 shows the 5 nearest neighbors in the embedding space. Notably, increasing the dimensionality directly improves the quality of the learned representation. Interestingly, it is more effective to use a budget of 64 parameters in a 16-point, 4-dimensional cloud than in a 32-point, 2-dimensional cloud.

Next we evaluate these models on a number of benchmark retrieval tasks from (Faruqui & Dyer, 2014), which score a method by the correlation of its output similarity scores with human similarity judgments, for various pairs of words. Results are shown in Table 2. The results of our method, which uses the Sinkhorn distance to compute the point cloud (dis)similarities, appear in the middle block of Table 2. Again, we mainly see gradual improvement with increasing dimensionality of the point clouds.

[4] From http://mattmahoney.net/dc/text8.zip

| Task Name | # Pairs | $W_1^\lambda(\mathbb{R}^2)$ | $W_1^\lambda(\mathbb{R}^3)$ | $W_1^\lambda(\mathbb{R}^4)$ | R | M | S | G | W |
|---|---|---|---|---|---|---|---|---|---|
| RG-65 | 65 | 0.32 | 0.67 | 0.81 | 0.27 | -0.02 | 0.50 | 0.66 | 0.54 |
| WS-353 | 353 | 0.15 | 0.27 | 0.33 | 0.24 | 0.10 | 0.49 | 0.62 | 0.64 |
| WS-353-S | 203 | 0.23 | 0.40 | 0.44 | 0.36 | 0.15 | 0.61 | 0.70 | 0.70 |
| WS-353-R | 252 | 0.05 | 0.19 | 0.21 | 0.18 | 0.09 | 0.40 | 0.56 | 0.61 |
| MC-30 | 30 | 0.04 | 0.45 | 0.54 | 0.47 | -0.14 | 0.57 | 0.66 | 0.63 |
| Rare-Word | 2034 | 0.06 | 0.22 | 0.10 | 0.29 | 0.11 | 0.39 | 0.06 | 0.39 |
| MEN | 3000 | 0.25 | 0.28 | 0.26 | 0.24 | 0.09 | 0.57 | 0.31 | 0.65 |
| MTurk-287 | 287 | 0.40 | 0.38 | 0.49 | 0.33 | 0.09 | 0.59 | 0.36 | 0.67 |
| MTurk-771 | 771 | 0.11 | 0.23 | 0.25 | 0.26 | 0.10 | 0.50 | 0.32 | 0.57 |
| SimLex-999 | 999 | 0.09 | 0.05 | 0.07 | 0.23 | 0.01 | 0.27 | 0.10 | 0.31 |
| Verb-143 | 144 | 0.03 | 0.03 | 0.16 | 0.29 | 0.06 | 0.36 | 0.44 | 0.27 |

Table 2: Performance on a number of similarity benchmarks as the dimensionality of the point clouds increases, given a fixed total number of parameters. The middle block shows the performance of the proposed models, each trained on a 17M-token corpus; the right block shows the performance of the baselines, whose training corpus sizes, where known, are 63M, 631M, 900M, and 100B tokens.
The right block in Table 2 shows the baselines: respectively, RNN (80D; Kombrink et al., 2011), Metaoptimize (50D; Turian et al., 2010), SENNA (50D; Collobert, 2011), Global Context (50D; Huang et al., 2012), and word2vec (80D; Mikolov et al., 2013). In the right block, as in (Faruqui & Dyer, 2014), the cosine similarity is used for the point embeddings. The reported performance measure is the correlation with ground-truth rankings, computed as in (Faruqui & Dyer, 2014). Note that there are many ways to improve the performance: increasing the vocabulary, window size, number of epochs, or negative sampling rate; using larger texts; and accelerating training. We defer this tuning to future work focused specifically on NLP.

4.2.1 DIRECT, INTERPRETABLE VISUALIZATION OF HIGH-DIMENSIONAL EMBEDDINGS

Wasserstein embeddings over low-dimensional ground metric spaces have a unique property: we can directly visualize the embedding, which is a point cloud in the low-dimensional ground space. This is not true for most existing embedding methods, which rely on dimensionality reduction techniques such as t-SNE for visualization. Whereas dimensionality reduction only approximately captures proximity of points in the embedding space, with Wasserstein embeddings we can display the exact embedding of each input, by visualizing the point cloud.

We demonstrate this property by visualizing the learned word representations. Importantly, each point cloud is strongly clustered, which leads to apparent, distinct modes in its density. We therefore use kernel density estimation to visualize the densities. In Figure 3a, we visualize three distinct words, thresholding each density at a low value and showing its upper level set to reveal the modes. These level sets are overlaid, with each color in the figure corresponding to a distinct embedded word. The density for each word is depicted by the opacity of the color within each level set.

It is easy to visualize multiple sets of words in aggregate, by assigning all words in a set a single color. This immediately reveals how well-separated the sets are, as shown in Figure 3b: as expected, military and political terms overlap, while names of sports are more distant. Examining the embeddings in more detail, we can dissect relationships (and confusion) between different sets of words. We observe that each word tends to concentrate its mass in two or more distinct regions. This multimodal shape allows for multifaceted relationships between words, since a word can partially overlap with many distinct groups of words simultaneously. Figure 3c shows the embedding for a word that has multiple distinct meanings (kind), alongside synonyms for both senses of the word (nice, friendly, type). We see that kind has two primary modes, which overlap separately with friendly and type. nice is included to show a failure of the embedding to capture the full semantics: Figure 3d shows that the network has learned that nice is a city in France, ignoring its interpretation as an adjective. This demonstrates the potential of this visualization for debugging, helping identify and attribute an error.

[Figure 3: Directly visualizing high-dimensional word embeddings. Panels: (a) densities of three embedded words (soccer, politics, art); (b) class separation (sports vs. political vs. military terms); (c) a word with multiple meanings: kind (with friendly, type, nice); (d) explaining a failed association: nice (with paris, france).]
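The visualization above can be reproduced in outline with a kernel density estimate per point cloud, thresholded to its upper level set and overlaid with per-word colors. The sketch below uses synthetic 2-D clouds in place of learned embeddings; the SciPy/Matplotlib calls, bandwidth, and threshold are illustrative assumptions rather than the figure's actual rendering code.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_cloud_density(points, color, ax, threshold=0.05, grid=200):
    """Overlay the upper level set {density >= threshold} of a KDE fit to `points`."""
    kde = gaussian_kde(points.T)                      # 2-D kernel density estimate
    xs = np.linspace(-3, 3, grid)
    X, Y = np.meshgrid(xs, xs)
    Z = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(grid, grid)
    Z = np.where(Z >= threshold, Z, np.nan)           # keep only the upper level set
    ax.contourf(X, Y, Z, levels=8, colors=color, alpha=0.4)

# Synthetic multimodal "embeddings": each word is a small 2-D point cloud.
rng = np.random.default_rng(0)
cloud_a = np.concatenate([rng.normal([-1, 1], 0.2, (16, 2)),
                          rng.normal([1, -1], 0.2, (16, 2))])   # a bimodal cloud
cloud_b = rng.normal([1, 1], 0.3, (32, 2))                      # a unimodal cloud

fig, ax = plt.subplots()
plot_cloud_density(cloud_a, "tab:blue", ax)
plot_cloud_density(cloud_b, "tab:orange", ax)
plt.show()
```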
5 DISCUSSION AND CONCLUSION

Several characteristics determine the value and effectiveness of an embedding space for representation learning. The space must be large enough to embed a variety of metrics, while admitting a mathematical description compatible with learning algorithms; additional features, including direct interpretability, make it easier to understand, analyze, and potentially debug the output of a representation learning procedure. Based on their theoretical properties, Wasserstein spaces are strong candidates for representing complex semantic structures, when the capacity of Euclidean space does not suffice. Empirically, entropy-regularized Wasserstein distances are effective for embedding a wide variety of semantic structures, while enabling direct visualization of the embedding.

Our work suggests several directions for additional research. Beyond simple extensions like weighting the points in the point cloud, one observation is that we can lift nearly any representation space $X$ to distributions over that space $W(X)$, represented as point clouds; in this paper we focused on the case $X = \mathbb{R}^n$. Since $X$ embeds within $W(X)$ using $\delta$-functions, this might be viewed as a general lifting procedure that increases the capacity of a representation. We can also consider other tasks, such as co-embedding of different modalities into the same transport space. Additionally, our empirical results suggest that theoretical study of the embedding capacity of Sinkhorn divergences may be profitable. Finally, following recent work on computing geodesics in Wasserstein space (Seguy & Cuturi, 2015), it may be interesting to invert the learned mappings and use them for interpolation.

ACKNOWLEDGEMENTS

J. Solomon acknowledges the support of Army Research Office grant W911NF-12-R-0011 ("Smooth Modeling of Flows on Graphs"), of National Science Foundation grant IIS-1838071 ("BIGDATA:F: Statistical and Computational Optimal Transport for Geometric Data Analysis"), from an Amazon Research Award, from the MIT-IBM Watson AI Laboratory, and from the Skoltech-MIT Next Generation Program.

REFERENCES

Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of Wasserstein spaces. arXiv:1509.08677, 2015.
Alexandr Andoni, Assaf Naor, and Ofer Neiman. Impossibility of sketching of the 3d transportation metric with quadratic cost. In LIPIcs-Leibniz International Proceedings in Informatics, 2016.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Ben Athiwaratkun and Andrew Gordon Wilson. On modeling hierarchical data via probabilistic order embeddings. In ICLR, 2018.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. TACL, 5:135-146, 2017.
Aleksandar Bojchevski and Stephan Günnemann. Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. In ICLR, 2018.
Jean Bourgain. The metrical interpretation of superreflexivity in Banach spaces. Israel Journal of Mathematics, 56(2):222-230, 1986.
Alessio Brancolini, Giuseppe Buttazzo, Filippo Santambrogio, and Eugene Stepanov. Long-term planning versus short-term planning in the asymptotical location problem. ESAIM: Control, Optimisation and Calculus of Variations, 15(3):509-524, 2009.
Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a Siamese time delay neural network. In NIPS, 1993.
Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385-1418, 2017.
Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic Wasserstein barycenters. In ICML, 2018.
Ronan Collobert. Deep learning for efficient discriminative parsing. In AISTATS, 2011.
Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In ECML-PKDD, 2014.
Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. Learning Wasserstein embeddings. In ICLR, 2018.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
Michel Marie Deza and Monique Laurent. Geometry of Cuts and Metrics, volume 15. Springer, 2009.
Manaal Faruqui and Chris Dyer. Community evaluation and exchange of word vectors at wordvectors.org. In Association for Computational Linguistics: System Demonstrations, 2014.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso A. Poggio. Learning with a Wasserstein loss. In NIPS, 2015.
Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. arXiv:1810.02733, 2018a.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In AISTATS, 2018b.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.
Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word representations via global context and multiple word prototypes. In ACL, 2012.
Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In ICCV Workshop on Statistical and Computational Theories of Vision, 2003.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
Ryan Kiros, Richard S. Zemel, and Ruslan Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In NIPS, 2014.
Benoit Kloeckner. A geometric study of Wasserstein spaces: Euclidean spaces. Annali della Scuola Normale Superiore di Pisa-Classe di Scienze-Serie V, 9(2):297, 2010.
Benoit Kloeckner. Approximation by finitely supported measures. ESAIM: Control, Optimisation and Calculus of Variations, 18(2):343-359, 2012.
Stefan Kombrink, Tomas Mikolov, Martin Karafiát, and Lukáš Burget. Recurrent neural network based language modeling in meeting recognition. In INTERSPEECH, 2011.
Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
Farzaneh Mirzazadeh. Solving Association Problems with Convex Co-embedding. PhD thesis, University of Alberta, 2017. Chapter 5.
Farzaneh Mirzazadeh, Yuhong Guo, and Dale Schuurmans. Convex co-embedding. In AAAI, 2014.
Farzaneh Mirzazadeh, Siamak Ravanbakhsh, Nan Ding, and Dale Schuurmans. Embedding inference for structured multilabel prediction. In NIPS, 2015.
Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In NIPS, 2018.
Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Singh Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. XNMT: the extensible neural machine translation toolkit. In AMTA (1), pp. 185-192, 2018.
Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NIPS, 2017.
Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, 2018.
Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proc. IEEE, 104(1):11-33, 2016.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
Gabriel Peyré and Marco Cuturi. Computational optimal transport. Technical report, 2017.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. In NIPS, 2015.
Sameer Shirdhonkar and David W Jacobs. Approximate Earth Mover's distance in linear time. In CVPR, 2008.
Marcel Simon, Erik Rodner, and Joachim Denzler. ImageNet pre-trained models with batch normalization. arXiv:1612.01452, 2016.
Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and Martin Jaggi. Context mover's distance barycenters: Optimal transport of contexts for building representations. arXiv:1808.09663, 2018.
Justin Solomon. Optimal transport on discrete domains. AMS Short Course on Discrete Differential Geometry, 2018.
Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL, 2010.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In ICLR, 2015.
Cédric Villani. Topics in Optimal Transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Soc., 2003.
Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In ICLR, 2015.
Duncan J Watts and Steven H Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440, 1998.
Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: scaling up to large vocabulary image annotation. In IJCAI, 2011.
Xiaohan Zhao, Alessandra Sala, Haitao Zheng, and Ben Y Zhao. Fast and scalable analysis of massive social graphs. arXiv:1107.5114, 2011.
Dingyuan Zhu, Peng Cui, Daixin Wang, and Wenwu Zhu.
Deep variational network embedding in Wasserstein space. In KDD, 2018.