# Gromov-Wasserstein Averaging of Kernel and Distance Matrices

Gabriel Peyré (GABRIEL.PEYRE@CEREMADE.DAUPHINE.FR), CNRS and Univ. Paris-Dauphine, Pl. du M. De Lattre De Tassigny, 75775 Paris 16, FRANCE

Marco Cuturi (MCUTURI@I.KYOTO-U.AC.JP), Kyoto University, 36-1 Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, JAPAN

Justin Solomon (JSOLOMON@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

This paper presents a new technique for computing the barycenter of a set of distance or kernel matrices. These matrices, which define the interrelationships between points sampled from individual domains, are not required to have the same size or to be in row-by-row correspondence. We compare these matrices using the softassign criterion, which measures the minimum distortion induced by a probabilistic map from the rows of one similarity matrix to the rows of another; this criterion amounts to a regularized version of the Gromov-Wasserstein (GW) distance between metric-measure spaces. The barycenter is then defined as a Fréchet mean of the input matrices with respect to this criterion, minimizing a weighted sum of softassign values. We provide a fast iterative algorithm for the resulting nonconvex optimization problem, built upon state-of-the-art tools for regularized optimal transportation. We demonstrate its application to the computation of shape barycenters and to the prediction of energy levels from molecular configurations in quantum chemistry.

1. Introduction

Many classes of input data encountered in machine learning are best expressed using either pairwise distance matrices or kernels. For instance, the atoms making up proteins and other molecules have no canonical orientation and hence are only known up to rigid motion; so, the geometry of a molecule might be better described by the pairwise distances between atoms rather than by their absolute positions. Kernel and covariance matrices from different datasets exhibit similar structure, since they define data points by their roles relative to the rest of a dataset rather than in any absolute sense.

A major difficulty with this representation is that in many applications these matrices are not registered or aligned, meaning that there is no explicit correspondence between their rows and columns. This is the case for shapes that undergo pose variation, articulation or deformation. Even worse, in some cases, similarities are not defined over the same ground space. For example, different molecules may have varying numbers of atoms. This inconsistency yields matrices of varying dimensions.

In this paper, we propose a theoretical and computational framework for summarizing collections of unaligned distance or kernel matrices. Building upon the notion of Gromov-Wasserstein (GW) distances between metric-measure spaces (Mémoli, 2007), we provide machinery for registering matrices of varying sizes while building interpolants and barycenters inheriting structure from the inputs. We also derive a fast regularized approximation scheme providing a practical and efficient way to recover these barycenters in practice.

1.1. Previous Work

Countless techniques leverage matrices of distances or kernels to describe a dataset.
While distances and kernels have opposite orderings (a large distance indicates small similarity), unless it is necessary to distinguish them we will use the term similarity matrix to refer to any matrix containing pairwise relationships.

Aligning similarity matrices. Comparison of similarity matrices is challenging due to a lack of alignment. Mathematically, they only define the intrinsic structure of a dataset, which remains unchanged, e.g., under rotation, translation, and isometric motion. More concretely, there is unlikely to be a canonical ordering of the rows or columns of similarity matrices, and removing or adding a few rows/columns may not significantly affect the structure. Automatic computation of a map aligning two similarity matrices is itself a challenging problem; our work is unusual in that we simultaneously compute such an alignment while solving an optimization problem over such matrices.

Matching rows of similarity matrices usually becomes a quadratic assignment problem (Loiola et al., 2007), which is NP-hard. For the specific case of aligning similarity matrices derived from pairwise distances in a metric space (e.g., geodesic distances along a surface), a well-known instance of this matching problem is computation of the Gromov-Hausdorff distance (Gromov, 2001), which was applied to matching in (Mémoli, 2007). To relax the matching problem to a continuous but still non-convex optimization problem, the seminal work of Mémoli considers matching between metric-measure spaces, incorporating a probability distribution on the ground space that can account for some form of uncertainty (Mémoli, 2011). The geodesic properties of this space of metric spaces are considered in (Sturm, 2012). The recent work (Hendrikson, 2016) showcases the application of the corresponding Gromov-Wasserstein distance to clustering of biological and social networks. It is possible to further relax this formulation to a quadratic (Aflalo et al., 2015) or semidefinite (Kezurer et al., 2015) convex program, although this relaxation can fail if the input similarity matrices exhibit symmetries. Rather than mixing the possible symmetric maps, we embrace the nonconvexity of the original problem and optimize quadratic matching objectives directly. After regularization, our matching subroutine resembles softassign quadratic assignment (Rangarajan et al., 1999; Gold & Rangarajan, 1996). We extend their iterations to general loss functions and to more advanced machine learning problems than nearest-neighbor search, in particular barycenter computation. We additionally leverage the fact that GW can compare arbitrary matrices, not just pairwise distances, making our machinery applicable to a broader range of tasks. The mapping component of our algorithm can be considered a generalized version of the method recently proposed in (Solomon et al., 2016).

Averaging similarity matrices. The design of an average or barycenter of similarity matrices is informed by two primary factors: a measure of discrepancy between similarity matrices and a definition of a mean. We will use GW distances to define discrepancies, so our remaining task is to define what it means to compute a barycenter. Even when similarity matrices are the same size and registered, it may not make sense to average them arithmetically. One natural way to define a mean or barycenter of inputs $(C_s)_s$ is to minimize the sum $\sum_s \lambda_s\, d^2(C, C_s)$ over all possible $C$, where $d$ is some notion of distance.
This is known as the Karcher or Fréchet mean, which generalizes the notion of an average to a wide class of metric spaces (Nielsen & Bhatia, 2012). Fréchet barycenters in Wasserstein space were proposed in (Agueh & Carlier, 2011). Several algorithms for their computation have been proposed, including (Cuturi & Doucet, 2014; Benamou et al., 2015); these assume a fixed ground distance metric. Our method, however, finds more commonality with algorithms that compute Fréchet means in the space of covariance or metric matrices. (Dryden et al., 2009) considers several simple options for covariance matrices, mostly with fixed alignment; their Procrustes mean provides one rudimentary strategy for alignment. (Sra, 2011) defines a metric on registered positive definite kernel matrices that can be used for computing Fréchet means. Barycenter computation is a building block for many learning methods. For instance, k-means clustering alternates between distance and barycenter computation. The nearest-centroid classifier provides a similar approach to classification (Manning et al., 2008).

Optimal transport (OT). OT (Villani, 2003) is a way to compare probability distributions (histograms in the finite-dimensional case) defined over either the same ground space or multiple pre-registered ground spaces. The means of comparison is a linear program optimizing for a matching that moves the mass from one distribution to the other with minimal cost. OT has been applied to machine learning problems including comparison of descriptors (Cuturi, 2013), k-means (Cuturi & Doucet, 2014), semi-supervised learning (Solomon et al., 2014), domain adaptation (Courty et al., 2014), designing loss functions (Frogner et al., 2015), low-rank approximation (Seguy & Cuturi, 2015), and dictionary learning (Rolet et al., 2016). As highlighted above, GW-style matching extends OT to the case when the ground spaces are not pre-registered, yielding a non-convex quadratic program to compute the transport. The resulting transportation matrix can be understood as a soft registration from one domain to the other. Our algorithm solves an entropically-regularized version of this quadratic program by extending recent advances in the computation of OT (Cuturi, 2013; Benamou et al., 2015).

1.2. Contributions

Our first contribution is the definition of a new discrepancy between similarity matrices. It extends the Gromov-Wasserstein distance between metric-measure spaces to arbitrary matrices, using a generic loss function to compare pairwise similarities together with entropic regularization. It can be defined over different ground measured spaces (i.e., each point is equipped with a measure), which are not required to be registered a priori. Entropic regularization enables the design of a fast iterative algorithm to compute a stationary point of the non-convex energy defining the discrepancy. Our second contribution is a new notion of the barycenter of a set of unregistered similarity matrices, defined as a Fréchet mean with respect to the GW discrepancy. We propose a block coordinate relaxation algorithm to compute a stationary point of the objective function defining our barycenter. We showcase applications of our method to the computation of barycenters between shapes.
We also exemplify how the GW discrepancy can be used to predict energy levels in quantum chemistry, where molecules are naturally represented using their Coulomb interaction matrices, a perfect fit for our unregistered dissimilarity matrix formalism. The code to reproduce the results of this paper is available online at https://github.com/gpeyre/2016-ICML-gromov-wasserstein.

1.3. Notation

The simplex of histograms with $N$ bins is $\Sigma_N \stackrel{\text{def.}}{=} \{ p \in \mathbb{R}_+^N \,;\, \sum_i p_i = 1 \}$. The entropy of $T \in \mathbb{R}_+^{N \times N}$ is defined as $H(T) \stackrel{\text{def.}}{=} -\sum_{i,j=1}^N T_{i,j} (\log(T_{i,j}) - 1)$. The set of couplings between histograms $p \in \Sigma_{N_1}$ and $q \in \Sigma_{N_2}$ is

$$\mathcal{C}_{p,q} \stackrel{\text{def.}}{=} \{ T \in (\mathbb{R}_+)^{N_1 \times N_2} \,;\, T \mathbb{1}_{N_2} = p,\ T^\top \mathbb{1}_{N_1} = q \}.$$

Here, $\mathbb{1}_N \stackrel{\text{def.}}{=} (1, \dots, 1)^\top \in \mathbb{R}^N$. For any tensor $\mathcal{L} = (\mathcal{L}_{i,j,k,\ell})_{i,j,k,\ell}$ and matrix $T = (T_{i,j})_{i,j}$, we define the tensor-matrix multiplication as

$$(\mathcal{L} \otimes T)_{i,j} \stackrel{\text{def.}}{=} \sum_{k,\ell} \mathcal{L}_{i,j,k,\ell}\, T_{k,\ell}. \quad (1)$$

2. Gromov-Wasserstein Discrepancy

2.1. Entropic Optimal Transport

Optimal transport distances are useful to compare two histograms $(p, q) \in \Sigma_{N_1} \times \Sigma_{N_2}$ defined on the same metric space, or at least on spaces that have previously been registered. Given some cost matrix $c \in \mathbb{R}_+^{N_1 \times N_2}$, where $c_{i,j}$ represents the transportation cost between the positions indexed by $i$ and $j$, we define the solution of entropically-regularized optimal transport between these two histograms as

$$T(c, p, q) \stackrel{\text{def.}}{=} \operatorname*{argmin}_{T \in \mathcal{C}_{p,q}} \langle c, T \rangle - \varepsilon H(T), \quad (2)$$

which is a strictly convex optimization problem. As shown in (Cuturi, 2013), the solution reads $T(c, p, q) = \operatorname{diag}(a) K \operatorname{diag}(b)$, where $K \stackrel{\text{def.}}{=} e^{-c/\varepsilon} \in \mathbb{R}_+^{N_1 \times N_2}$ is the so-called Gibbs kernel associated to $c$, and $(a, b) \in \mathbb{R}_+^{N_1} \times \mathbb{R}_+^{N_2}$ can be computed using the Sinkhorn iterations

$$a \leftarrow p \oslash K b \quad \text{and} \quad b \leftarrow q \oslash K^\top a, \quad (3)$$

where $\oslash$ denotes component-wise division.

2.2. Gromov-Wasserstein Discrepancy

Following the pioneering work of Mémoli, we consider input data expressed as metric-measure spaces (Mémoli, 2011). This corresponds to pairs of the form $(C, p) \in \mathbb{R}^{N \times N} \times \Sigma_N$, where $N$ is an arbitrary integer (the number of elements in the underlying space). Here, $C$ is a matrix representing either similarities or distances between these elements, and $p$ is a histogram, which can account either for some uncertainty or for the relative importance of these elements. In case no prior information is known about a space, one can set $p = \frac{1}{N}\mathbb{1}_N$, the uniform distribution. In our setting, since we target a wide range of machine-learning problems, we do not restrict the matrices $C$ to be distance matrices, i.e., they are not necessarily positive and do not necessarily satisfy the triangle inequality.

We define the Gromov-Wasserstein discrepancy between two measured similarity matrices $(C, p) \in \mathbb{R}^{N_1 \times N_1} \times \Sigma_{N_1}$ and $(\bar C, q) \in \mathbb{R}^{N_2 \times N_2} \times \Sigma_{N_2}$ as follows:

$$\mathrm{GW}(C, \bar C, p, q) \stackrel{\text{def.}}{=} \min_{T \in \mathcal{C}_{p,q}} \mathcal{E}_{C,\bar C}(T), \quad (4)$$

where

$$\mathcal{E}_{C,\bar C}(T) \stackrel{\text{def.}}{=} \sum_{i,j,k,\ell} L(C_{i,k}, \bar C_{j,\ell})\, T_{i,j}\, T_{k,\ell}.$$

The matrix $T$ is a coupling between the two spaces on which the similarity matrices $C$ and $\bar C$ are defined. Here $L$ is a loss function accounting for the misfit between the similarity matrices. Typical choices of loss include the quadratic loss $L(a, b) = L_2(a, b) \stackrel{\text{def.}}{=} \frac{1}{2}|a - b|^2$ and the Kullback-Leibler divergence $L(a, b) = \mathrm{KL}(a|b) \stackrel{\text{def.}}{=} a \log(a/b) - a + b$ (which is not symmetric).

This definition (4) of GW slightly extends the one considered by (Mémoli, 2011), since we consider an arbitrary loss $L$ (rather than just the squared $L_2$ loss). In the case $L = L_2$, (Mémoli, 2011) proves that $\mathrm{GW}^{1/2}$ defines a distance on the space of metric-measure spaces quotiented by measure-preserving isometries.
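Before turning to the computation of (4), it is useful to make the Sinkhorn solver $T(c, p, q)$ of (2)-(3) concrete, since it is reused below as a projection step. The following is a minimal NumPy sketch, not the authors' released code; the function name `sinkhorn`, the tolerance, and the iteration cap are our own illustrative choices.

```python
import numpy as np

def sinkhorn(c, p, q, eps, n_iter=1000, tol=1e-9):
    """Entropically regularized OT (2): returns T(c, p, q) = diag(a) K diag(b)."""
    K = np.exp(-c / eps)                 # Gibbs kernel associated to the cost c
    a = np.ones_like(p)
    b = np.ones_like(q)
    for _ in range(n_iter):
        a_prev = a
        a = p / (K @ b)                  # a <- p ./ (K b), iteration (3)
        b = q / (K.T @ a)                # b <- q ./ (K^T a)
        if np.max(np.abs(a - a_prev)) < tol:
            break
    return a[:, None] * K * b[None, :]   # diag(a) K diag(b)

# Tiny usage example on random data
rng = np.random.default_rng(0)
c = rng.random((5, 7))
p = np.full(5, 1 / 5)
q = np.full(7, 1 / 7)
T = sinkhorn(c, p, q, eps=1e-1)
print(T.sum(axis=1) - p, T.sum(axis=0) - q)  # marginal violations, close to 0
```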
Introducing the 4-way tensor $\mathcal{L}(C, \bar C) \stackrel{\text{def.}}{=} (L(C_{i,k}, \bar C_{j,\ell}))_{i,j,k,\ell}$, we notice that $\mathcal{E}_{C,\bar C}(T) = \langle \mathcal{L}(C, \bar C) \otimes T,\, T \rangle$, where $\otimes$ denotes the tensor-matrix multiplication (1). The following proposition shows how to compute $\mathcal{L}(C, \bar C) \otimes T$ efficiently for a general class of loss functions.

Proposition 1. If the loss can be written as

$$L(a, b) = f_1(a) + f_2(b) - h_1(a)\, h_2(b) \quad (5)$$

for functions $(f_1, f_2, h_1, h_2)$, then, for any $T \in \mathcal{C}_{p,q}$,

$$\mathcal{L}(C, \bar C) \otimes T = c_{C,\bar C} - h_1(C)\, T\, h_2(\bar C)^\top, \quad (6)$$

where $c_{C,\bar C} \stackrel{\text{def.}}{=} f_1(C)\, p\, \mathbb{1}_{N_2}^\top + \mathbb{1}_{N_1}\, q^\top f_2(\bar C)^\top$ is independent of $T$.

Proof. Under hypothesis (5), formula (1) shows that one has the decomposition $\mathcal{L}(C, \bar C) \otimes T = A + B - D$, where

$$A_{i,j} = \sum_k f_1(C_{i,k}) \sum_\ell T_{k,\ell} = (f_1(C)(T \mathbb{1}_{N_2}))_i, \qquad B_{i,j} = \sum_\ell f_2(\bar C_{j,\ell}) \sum_k T_{k,\ell} = (f_2(\bar C)(T^\top \mathbb{1}_{N_1}))_j,$$

$$D_{i,j} = \sum_k h_1(C_{i,k}) \sum_\ell h_2(\bar C_{j,\ell})\, T_{k,\ell} = (h_1(C)\, T\, h_2(\bar C)^\top)_{i,j}.$$

Since $T \in \mathcal{C}_{p,q}$, one has $T \mathbb{1}_{N_2} = p$ and $T^\top \mathbb{1}_{N_1} = q$, which gives (6).

Remark 1 (Computational complexity). Formula (6) shows that for this class of losses, one can compute $\mathcal{L}(C, \bar C) \otimes T$ in $O(N_1^2 N_2 + N_1 N_2^2)$ operations, using only matrix/matrix multiplications, instead of the $O(N_1^2 N_2^2)$ complexity of the naïve implementation of formula (1).

Remark 2 (Special cases). The square loss $L = L_2$ satisfies (5) for $f_1(a) = a^2$, $f_2(b) = b^2$, $h_1(a) = a$ and $h_2(b) = 2b$. The KL loss $L = \mathrm{KL}$ satisfies (5) for $f_1(a) = a \log(a) - a$, $f_2(b) = b$, $h_1(a) = a$ and $h_2(b) = \log(b)$.

2.3. Entropic Gromov-Wasserstein Discrepancy

We consider the following entropic approximation of the initial GW formulation (4):

$$\mathrm{GW}_\varepsilon(C, \bar C, p, q) \stackrel{\text{def.}}{=} \min_{T \in \mathcal{C}_{p,q}} \mathcal{E}_{C,\bar C}(T) - \varepsilon H(T). \quad (7)$$

This is a non-convex optimization problem. We propose to use projected gradient descent, where both the gradient step and the projection are computed with respect to the KL metric. The iterations of this algorithm read

$$T \leftarrow \mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_{p,q}}\!\Big( T \odot e^{-\tau (\nabla \mathcal{E}_{C,\bar C}(T) - \varepsilon \nabla H(T))} \Big), \quad (8)$$

where $\tau > 0$ is a small enough step size, and the KL projector of any matrix $K$ is $\mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_{p,q}}(K) \stackrel{\text{def.}}{=} \operatorname*{argmin}_{T' \in \mathcal{C}_{p,q}} \mathrm{KL}(T'|K)$.

Proposition 2. In the special case $\tau = 1/\varepsilon$, iteration (8) reads

$$T \leftarrow T\big( \mathcal{L}(C, \bar C) \otimes T,\ p,\ q \big). \quad (9)$$

Proof. As shown in (Benamou et al., 2015), the projection is nothing else than the solution of the regularized transport problem (2), hence $\mathrm{Proj}^{\mathrm{KL}}_{\mathcal{C}_{p,q}}(K) = T(-\varepsilon \log(K), p, q)$. One also has $\nabla \mathcal{E}_{C,\bar C}(T) - \varepsilon \nabla H(T) = \mathcal{L}(C, \bar C) \otimes T + \varepsilon \log(T)$. Re-arranging the terms in (8), one obtains, in the special case $\tau \varepsilon = 1$, the desired formula.

Iteration (9) defines a surprisingly simple algorithm, in which each update of $T$ involves a Sinkhorn projection.

Remark 3 (Convergence). Using (Boţ et al., 2015, Theorem 12), iterations (8) are guaranteed to converge provided that $\tau$ is chosen small enough, $\tau < \tau_{\max}$. Unfortunately, in general one does not have $\tau_{\max} > 1/\varepsilon$, so that the step size $\tau = 1/\varepsilon$ advocated by Proposition 2 is not covered by the theory. However, we found that using $\tau = 1/\varepsilon$ always leads to a converging sequence of couplings $T$, and that this choice works well in practice. Note that when $L = L_2$, this recovers the softassign quadratic assignment algorithms defined in (Rangarajan et al., 1999; Gold & Rangarajan, 1996). The convergence proof of (Rangarajan et al., 1999), however, only ensures convergence of the functional values (not of the iterates) and also only applies when the function being minimized in (7) is convex, which is not the case for arbitrary matrices $(C, \bar C)$.
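As an illustration, here is a minimal NumPy sketch of iteration (9) for the square loss $L = L_2$, computing $\mathcal{L}(C, \bar C) \otimes T$ through (6) with the decomposition of Remark 2 and reusing the `sinkhorn` routine sketched above. The function names, the product-measure initialization, and the fixed number of outer iterations are our own choices, not part of the paper.

```python
import numpy as np

def tensor_product_L2(C, C_bar, T):
    """L(C, C_bar) ⊗ T via (6) for the square loss: f1(a)=a^2, f2(b)=b^2, h1(a)=a, h2(b)=2b."""
    p = T.sum(axis=1)                    # marginals of the current coupling
    q = T.sum(axis=0)
    const = (C ** 2) @ p[:, None] + q[None, :] @ (C_bar ** 2).T   # c_{C,C_bar}, independent of T
    return const - C @ T @ (2 * C_bar).T                          # minus h1(C) T h2(C_bar)^T

def entropic_gw(C, C_bar, p, q, eps, n_outer=100):
    """Fixed-point iteration (9): each update of T is a Sinkhorn projection."""
    T = np.outer(p, q)                   # feasible initialization in C_{p,q}
    for _ in range(n_outer):
        cost = tensor_product_L2(C, C_bar, T)
        T = sinkhorn(cost, p, q, eps)    # T <- T(L(C, C_bar) ⊗ T, p, q)
    return T
```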
3. Gromov-Wasserstein Barycenters

3.1. Gromov-Wasserstein Barycenters

We define Gromov-Wasserstein barycenters of measured similarity matrices $(C_s)_{s=1}^S$, where $C_s \in \mathbb{R}^{N_s \times N_s}$, using a Fréchet mean formulation:

$$\min_{C \in \mathbb{R}^{N \times N}} \sum_s \lambda_s\, \mathrm{GW}_\varepsilon(C, C_s, p, p_s). \quad (10)$$

In this section, for simplicity, we assume that the base histograms $(p_s)_s$ and the histogram $p$ associated to the barycenter are known and fixed. Note in particular that the size $(N, N)$ of the targeted barycenter matrix must be fixed by the user. It is straightforward to extend the exposition, as well as the optimization algorithm, to the setting where $p$ is unknown and included as an optimization variable.

This barycenter problem can be reformulated by re-introducing couplings:

$$\min_{C, (T_s)_s} \sum_s \lambda_s \big( \mathcal{E}_{C, C_s}(T_s) - \varepsilon H(T_s) \big) \quad (11)$$

subject to the constraints that, for all $s$, $T_s \in \mathcal{C}_{p,p_s} \subset \mathbb{R}_+^{N \times N_s}$. Note that if $L$ is convex with respect to its first variable, this problem is convex with respect to $C$ but not with respect to $(T_s)_s$ (it is a quadratic, but not necessarily positive, problem with respect to the $(T_s)_s$). We propose to minimize (11) using a block coordinate relaxation, i.e., by iteratively minimizing with respect to the couplings $(T_s)_s$ and with respect to the metric $C$.

Minimization with respect to $(T_s)_s$. The optimization problem (11) over $(T_s)_s$ alone decouples as $S$ independent $\mathrm{GW}_\varepsilon$ optimizations

$$\forall s, \quad \min_{T_s \in \mathcal{C}_{p,p_s}} \mathcal{E}_{C, C_s}(T_s) - \varepsilon H(T_s).$$

A stationary point of each of these problems can be computed using the optimization algorithm detailed in Section 2.3.

Minimization with respect to $C$. For given couplings $(T_s)_s$, the minimization with respect to $C$ reads

$$\min_C\ \sum_s \lambda_s\, \langle \mathcal{L}(C, C_s) \otimes T_s,\ T_s \rangle. \quad (12)$$

The following proposition shows, for a large class of losses, how to compute the global minimizer in closed form.

Proposition 3. If $L$ satisfies (5) and $f_1'/h_1'$ is invertible, then the solution to (12) reads

$$C = \left( \frac{f_1'}{h_1'} \right)^{-1}\!\left( \frac{\sum_s \lambda_s\, T_s\, h_2(C_s)\, T_s^\top}{p\, p^\top} \right), \quad (13)$$

where the division and the function application are entry-wise and we assume the normalization $\sum_s \lambda_s = 1$.

Proof. Using relation (6), the functional to be minimized reads

$$\sum_s \lambda_s\, \big\langle f_1(C)\, p\, \mathbb{1}_{N_s}^\top + \mathbb{1}_N\, p_s^\top f_2(C_s)^\top - h_1(C)\, T_s\, h_2(C_s)^\top,\ T_s \big\rangle.$$

The first-order optimality condition for this optimization problem thus reads $f_1'(C) \odot (p\, p^\top) = h_1'(C) \odot \sum_s \lambda_s\, T_s\, h_2(C_s)\, T_s^\top$, which gives the desired formula.

The intuition underlying formula (13) is clear: each $T_s\, h_2(C_s)\, T_s^\top$ is a realigned matrix, where $T_s$ acts as a fuzzy permutation (an optimal transportation coupling) of both the rows and the columns of the distance matrix $C_s$. These realigned matrices are then averaged, where the precise notion of averaging depends on the loss $L$.

Barycenters of PSD kernels using the $L_2$ loss. For the square loss $L = L_2$, the update (13) becomes

$$C \leftarrow \frac{1}{p\, p^\top} \sum_s \lambda_s\, T_s\, C_s\, T_s^\top, \quad (14)$$

with entry-wise division. This formula highlights an important property of the method:

Proposition 4. For $L = L_2$, if the $(C_s)_s$ are positive semidefinite (PSD) matrices, the iterates $C$ produced by the algorithm are also PSD.

Proof. Formula (14) shows that the update of $C$ corresponds to a linear averaging of the matrices $(\operatorname{diag}(1/p)\, T_s\, C_s\, T_s^\top\, \operatorname{diag}(1/p))_s$, which are all PSD since the matrices $(C_s)_s$ are.

This proposition implies, since the PSD cone is a closed set, that the output of our barycenter algorithm at convergence is a PSD kernel.

Barycenters of infinitely divisible kernels using the KL loss. For the KL loss $L = \mathrm{KL}$, the update (13) is similar, but with a geometric mean in place of an arithmetic mean:

$$C \leftarrow \exp\!\left( \frac{1}{p\, p^\top} \sum_s \lambda_s\, T_s\, \log(C_s)\, T_s^\top \right). \quad (15)$$

The following proposition shows that this loss is particularly attractive for averaging infinitely divisible kernels. Recall that $C \in \mathbb{R}^{N \times N}$ is infinitely divisible if and only if $U = \log(C)$ is a conditionally positive semidefinite kernel, i.e., for all $x \in \mathbb{R}^N$ such that $\langle x, \mathbb{1}_N \rangle = 0$, one has $\langle U x, x \rangle \geq 0$.

Proposition 5. For $L = \mathrm{KL}$, if the $(C_s)_s$ are infinitely divisible kernels, the iterates $C$ produced by the algorithm are also infinitely divisible.

Proof. According to formula (15), one has to show that $U \stackrel{\text{def.}}{=} \sum_s \lambda_s\, \operatorname{diag}(1/p)\, T_s\, U_s\, T_s^\top\, \operatorname{diag}(1/p)$ is conditionally PSD provided that the matrices $U_s \stackrel{\text{def.}}{=} \log(C_s)$ are. This is indeed the case because, for any $x$ such that $\langle x, \mathbb{1}_N \rangle = 0$, one has $\langle U x, x \rangle = \sum_s \lambda_s\, \langle U_s x_s, x_s \rangle$ where $x_s \stackrel{\text{def.}}{=} T_s^\top (x \oslash p)$, and since $\langle x_s, \mathbb{1}_{N_s} \rangle = \langle x \oslash p,\ T_s \mathbb{1}_{N_s} \rangle = \langle x \oslash p,\ p \rangle = \langle x, \mathbb{1}_N \rangle = 0$, one has $\langle U_s x_s, x_s \rangle \geq 0$, so that $\langle U x, x \rangle \geq 0$, which proves the result.

This proposition implies, since the cone of infinitely divisible kernels is a closed set, that the output of our barycenter algorithm at convergence is infinitely divisible. An important example of infinitely divisible kernels is given by kernels of the form $C = e^{-D^2/\sigma^2}$, where $D_{i,j} = \|x_i - x_j\|$ is the Euclidean distance matrix between points $(x_i)_i$, and $\sigma > 0$ is a bandwidth parameter that controls the locality of the resulting kernel. One verifies that $\mathrm{KL}(e^{-D^2/\sigma^2}\,|\,e^{-\bar D^2/\sigma^2}) = \frac{1}{2\sigma^4} \|D^2 - \bar D^2\|^2 + o(1/\sigma^4)$, so that in the regime of a large bandwidth $\sigma \to +\infty$, using the KL loss in conjunction with this class of kernels becomes equivalent to using the $L_2$ loss between squared Euclidean distance matrices. A similar result in fact holds for any smooth Bregman divergence loss.
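For concreteness, the following is a minimal NumPy sketch of the closed-form update (14) for the square loss; the function name `update_barycenter_C` is ours, and the snippet assumes couplings `Ts[s]` of shape (N, N_s), as in the text.

```python
import numpy as np

def update_barycenter_C(Ts, Cs, lambdas, p):
    """Closed-form minimization over C for L = L2, formula (14):
    C = (sum_s lambda_s T_s C_s T_s^T) / (p p^T), with entry-wise division."""
    N = len(p)
    acc = np.zeros((N, N))
    for T_s, C_s, lam in zip(Ts, Cs, lambdas):
        acc += lam * T_s @ C_s @ T_s.T   # realign C_s through the coupling T_s
    return acc / np.outer(p, p)
```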
Pseudocode. Algorithm 1 details the steps of the optimization technique. Note that it makes use of three nested loops: (i) block coordinate descent on $(T_s)_s$ and $C$, (ii) projected gradient descent on each $T_s$, and (iii) Sinkhorn iterations to compute the projection.

Algorithm 1: Computation of GW_ε barycenters.
  Input: (C_s, p_s)_s, p
  Initialize C.
  repeat  // minimize over (T_s)_s
    for s = 1 to S do
      Initialize T_s.
      repeat
        // compute c_s = L(C, C_s) ⊗ T_s using (6)
        c_s ← f_1(C) p 1_{N_s}ᵀ + 1_N p_sᵀ f_2(C_s)ᵀ − h_1(C) T_s h_2(C_s)ᵀ
        // Sinkhorn iterations (3) to compute T(c_s, p, p_s)
        Initialize a ← 1_N, b ← 1_{N_s}, set K ← e^{−c_s/ε}.
        repeat
          a ← p ⊘ K b,   b ← p_s ⊘ Kᵀ a
        until convergence
        Update T_s ← diag(a) K diag(b).
      until convergence
    end for
    // minimize over C using (13)
    C ← (f_1'/h_1')^{−1}( (Σ_s λ_s T_s h_2(C_s) T_sᵀ) / (p pᵀ) )
  until convergence

Figure 1. Barycenters of point clouds from MNIST digits; sample input data clouds are shown on the left, and an MDS embedding of the barycenter distance matrix is shown on the right.

Figure 2. Barycenter example for shape data from (Thakoor et al., 2007).

4. Experiments

4.1. Point Clouds

Embedded barycenters. Figure 1 provides an example illustrating the behavior of our GW barycenter approximation. In this experiment, we extract 500 point clouds of handwritten digits from the MNIST dataset (Le Cun et al., 1998), rotated arbitrarily in the plane. We represent each digit as a symmetric Euclidean distance matrix and optimize for a $500 \times 500$ barycenter using Algorithm 1 (uniform weights, $\varepsilon = 10^{-3}$); notice that most of the input point clouds consist of fewer than 500 points. We then visualize the barycenter matrix as a point cloud in the plane using multidimensional scaling (MDS). Each digit is colored by transferring RGB values from the barycenter point cloud using the computed map $T$.

This experiment illustrates a few properties of our algorithm. Most prominently, note that the input digits are rotated arbitrarily, which, unlike for tools based on classical optimal transportation, does not affect the computation of the barycenter. Also, to avoid bias in this experiment we initialize $C$ as a random matrix, and yet our nonconvex optimization algorithm still reaches a meaningful barycenter.
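To illustrate how the pieces fit together for an experiment of this kind, the following is a purely illustrative NumPy sketch of the block coordinate scheme of Algorithm 1 for the square loss, reusing the `sinkhorn`, `entropic_gw`, and `update_barycenter_C` functions sketched earlier; `gw_barycenter`, the helper `distance_matrix`, the iteration counts, and the toy data are our own assumptions, not the authors' implementation.

```python
import numpy as np

def gw_barycenter(Cs, ps, lambdas, p, eps, n_outer=20, seed=0):
    """GW_eps barycenter (10) for L = L2: alternate the entropic GW solver (9)
    for each coupling T_s and the closed-form update (14) for C."""
    rng = np.random.default_rng(seed)
    N = len(p)
    C = rng.random((N, N))
    C = 0.5 * (C + C.T)                              # random symmetric initialization
    Ts = None
    for _ in range(n_outer):
        Ts = [entropic_gw(C, C_s, p, p_s, eps)       # minimize over each T_s (Section 2.3)
              for C_s, p_s in zip(Cs, ps)]
        C = update_barycenter_C(Ts, Cs, lambdas, p)  # minimize over C via (14)
    return C, Ts

def distance_matrix(X):
    """Symmetric Euclidean distance matrix of a 2D point cloud."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Toy example: barycenter of two random point clouds of different sizes,
# each represented only by its (unaligned) pairwise distance matrix.
rng = np.random.default_rng(1)
clouds = [rng.random((30, 2)), rng.random((40, 2))]
Cs = [distance_matrix(X) for X in clouds]
ps = [np.full(len(X), 1 / len(X)) for X in clouds]
p = np.full(35, 1 / 35)                              # barycenter size chosen by the user
C_bar, _ = gw_barycenter(Cs, ps, lambdas=[0.5, 0.5], p=p, eps=0.05)
```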
Figure 2 illustrates a similar experiment on the 2D shape data from (Thakoor et al., 2007). This second experiment illustrates some additional properties of our barycenters. Interestingly, even though we do not impose curve topology on the barycenter, the final MDS embedding has one-dimensional structure. The barycenter is also resilient to noise in the computed map from individual examples, e.g., the bones in the first row.

Regularization. The figure below illustrates the dependence of our barycenter on the regularizer $\varepsilon$: it shows the barycenter from a test in Figure 2 for increasing values $\varepsilon \in \{1.5 \times 10^{-3},\ 4 \times 10^{-3},\ 8 \times 10^{-3},\ 1.6 \times 10^{-2},\ 3.2 \times 10^{-2}\}$. When $\varepsilon$ is small, the embedded barycenter most closely resembles the input data. As $\varepsilon$ increases, the barycenters become smoother, with the advantage of higher resilience to noise and fewer iterations needed for optimization.

Clustering. The figure on the right illustrates the incorporation of our barycenter technique into a larger machine learning pipeline. It shows the results of a k-means unsupervised clustering of 250 handwritten characters. In this example, we construct a dataset with 50 point clouds from each of the digits 0 to 4 from the example in Figure 1. We then cluster the point clouds by representing them as pairwise distance matrices and applying the k-means algorithm ($k = 5$) with k-means++ initialization (Arthur & Vassilvitskii, 2007). We underscore that each handwritten character is represented using an unaligned point cloud with a varying number of points. Each row in the figure corresponds to a different character, and each column corresponds to a different cluster center (MDS embedding shown above the table). Color is proportional to the number of data points assigned to each cluster. Even without supervision, this rudimentary technique for clustering finds meaningful clusters of handwritten digits from the dataset. All the 1 digits are clustered correctly. Digits 2 and 4 are mixed considerably in the computed clustering, reflecting the wide variety of handwriting styles for these two digits.

Figure 3. Progressive interpolation $(\bar x_t)_{t=0}^{1}$ between two shapes $(x_0, x_1)$ using GW barycenters.

Shape interpolation. Figure 3 shows an application of the computation of GW barycenters between $S = 2$ input matrices to perform shape interpolation. The input matrices $(C_1, C_2) \in (\mathbb{R}^{N \times N})^2$ are Euclidean distance matrices of two input planar point clouds $(x_{s,i})_i$ for $s \in \{1, 2\}$, uniformly sampled inside each shape, so that we use the uniform histograms $p_1 = p_2 = \frac{1}{N}\mathbb{1}_N$. This means that $C_{s,i,j} \stackrel{\text{def.}}{=} \|x_{s,i} - x_{s,j}\|$ for $s \in \{1, 2\}$ and $i, j = 1, \dots, N$. The interpolation is achieved by first computing a barycenter matrix $\bar C_t$ minimizing (10) with our optimization algorithm, for weights $\lambda = (t, 1 - t)$, $t \in [0, 1]$, using the Euclidean loss $L = L_2$. A barycentric point cloud $\bar x_t = (\bar x_{t,i})_{i=1}^N$ is then reconstructed using SMACOF multidimensional scaling (MDS), which amounts to computing a local minimizer of $\sum_{i,j} \big| \bar C_{t,i,j} - \|\bar x_{t,i} - \bar x_{t,j}\| \big|^2$ with respect to $(\bar x_{t,i})_i$. This local minimizer $\bar x_t$ is computed by a gradient descent scheme, initialized with $x_1$.
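As an aside, this SMACOF-based reconstruction step can be approximated with standard library tools. Below is a small sketch using scikit-learn's stress-minimizing MDS on a precomputed barycenter matrix; it uses the library's default random initialization rather than the paper's initialization with $x_1$, and the function name is ours.

```python
import numpy as np
from sklearn.manifold import MDS

def embed_barycenter(C_t, n_components=2, seed=0):
    """Recover a point cloud whose pairwise distances approximate the barycenter
    matrix C_t, via stress minimization (SMACOF) on precomputed dissimilarities."""
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(np.asarray(C_t))

# e.g. x_t = embed_barycenter(C_bar) for a barycenter matrix computed as above
```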
Figure 4. Upper part: comparison between interpolations using the $L_2$ loss and the KL loss. Lower part: comparison between interpolations using pairwise Euclidean and inner geodesic distances.

$L_2$ vs. KL losses. Figure 4, upper part, shows a comparison of the same interpolations as those computed in Figure 3, using two different losses. Row #1 shows barycenters of the distance matrices $C_{s,i,j} = \|x_{s,i} - x_{s,j}\|$, for $s \in \{1, 2\}$, using the squared Euclidean loss $L = L_2$ as in the previous paragraph. The display is obtained by applying SMACOF MDS to the barycenter matrix $\bar C_t$. Row #2 shows barycenters of the infinitely divisible kernels $C_{s,i,j} = \exp(-\|x_{s,i} - x_{s,j}\|^2/\sigma^2)$, for $s \in \{1, 2\}$, using the Kullback-Leibler loss $L = \mathrm{KL}$ and a bandwidth $\sigma = 1$ (for point clouds in the unit square). The display is then obtained by applying SMACOF MDS to $\sqrt{-\sigma \log(\bar C_t)}$. The KL interpolation produces a less regular interpolation because it imposes only local constraints, at the scale of the bandwidth.

Euclidean vs. geodesic distance matrices. Euclidean pairwise distances, though simple to compute and manipulate, fail to capture the intrinsic geometry of shapes, and thus lead to physically unrealistic shape interpolations that ignore, in particular, articulation. Figure 4, lower part, shows how this issue can be in large part alleviated by replacing the extrinsic Euclidean distance $C_{s,i,j} = \|x_{s,i} - x_{s,j}\|$ for $s \in \{1, 2\}$ by the inner geodesic distance (i.e., the length of the shortest path joining $x_{s,i}$ to $x_{s,j}$ while staying inside the shape). Note that, because the corresponding distance is no longer Euclidean, SMACOF MDS only produces an approximate embedding, the so-called bending invariant (Elad & Kimmel, 2003) used to perform isometry-invariant shape recognition.

4.2. Quantum Chemistry

To demonstrate the interest of the Gromov-Wasserstein discrepancy, we consider its application to a regression problem on molecules. Several recent works (Rupp et al., 2012; Hansen et al., 2013) have proposed to predict atomization energies of molecules using descriptors of labeled molecules, rather than estimating them through expensive numerical simulations. (Rupp et al., 2012) proposed in particular to represent each molecule in the qm7 dataset of 7165 organic molecules through its so-called Coulomb matrix, which we describe next.

Coulomb matrices. For a molecule $s$ with $N_s$ atoms, each described by a relative location $r_i \in \mathbb{R}^3$ in space and a nuclear charge $Z_i$, Rupp et al. propose to form the $N_s \times N_s$ matrix $C_s$ with off-diagonal terms $Z_i Z_j / \|r_i - r_j\|$ for $1 \leq i \neq j \leq N_s$, and diagonal terms $\frac{1}{2} Z_i^{2.4}$. Although this plays no major role in our analysis, we found that these 7165 Coulomb matrices are all infinitely divisible positive definite kernel matrices. Rupp et al. argue that the Coulomb matrix is a convenient descriptor for molecules in the sense that it is, by construction, invariant with respect to translations and rotations. It has, however, an important weakness: $C_s$ depends on an arbitrary ordering of the $N_s$ atoms in $s$. As we explain next, this was addressed by Rupp et al. by creating randomly permuted copies of these matrices.
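To make the descriptor concrete, here is a small NumPy sketch of the Coulomb matrix construction described above; the function name and the toy inputs are ours, with `R` holding the atom locations and `Z` the nuclear charges of one molecule.

```python
import numpy as np

def coulomb_matrix(R, Z):
    """Coulomb matrix: off-diagonal Z_i Z_j / ||r_i - r_j||, diagonal 0.5 * Z_i^2.4."""
    Z = np.asarray(Z, dtype=float)
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):
        C = np.outer(Z, Z) / D               # off-diagonal terms (diagonal overwritten below)
    np.fill_diagonal(C, 0.5 * Z ** 2.4)      # diagonal terms
    return C

# e.g. a methane-like toy molecule: 5 random positions, charges (6, 1, 1, 1, 1)
R = np.random.default_rng(0).random((5, 3))
print(coulomb_matrix(R, [6, 1, 1, 1, 1]).shape)
```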
Previous work. Because their approach requires manipulating matrices of the same size, Rupp et al. pad all $N_s \times N_s$ Coulomb matrices $C_s$ with zeros to obtain $23 \times 23$ matrices (23 is the maximal number of atoms in a molecule in the qm7 database). To cope with the ordering problem, they also propose, as a representation for each molecule $s$, to form several randomly permuted copies of $C_s$ (between 8 and 1000 in (Hansen et al., 2013)). (Hansen et al., 2013, Table 3) provide out-of-sample mean absolute error (MAE) predictions for several techniques using 5-fold cross-validation. We report some of them in Table 1.

| Algorithm | MAE | RMSE |
| --- | --- | --- |
| k-nearest neighbors | 71.54 | 95.97 |
| Linear regression | 20.72 | 27.22 |
| Gaussian kernel ridge regression | 8.57 | 12.26 |
| Laplacian kernel ridge regression (8) | 3.07 | 4.84 |
| Multilayer neural network (1000) | 3.51 | 5.96 |
| GW 3-nearest neighbors | 10.83 | 29.27 |

Table 1. Mean absolute and root mean squared errors for atomization energy prediction in the qm7 database of 7165 molecules. All results are quoted from (Hansen et al., 2013, Table 3), except ours. Numbers in parentheses indicate the number of random copies used by the algorithm.

Although the GW distance is not directly competitive performance-wise, this result shows that acceptable performance on this task can be recovered using simple metric tools alone.

Our approach. We propose to use the original $N_s \times N_s$ Coulomb matrices $C_s$ directly as inputs for our GW distances. Namely, we compute a discrepancy between two molecules $s, s'$ that is exactly $\mathrm{GW}(C_s, C_{s'}, \mathbb{1}_{N_s}/N_s, \mathbb{1}_{N_{s'}}/N_{s'})$, using $L = L_2$ and a regularization strength $\varepsilon$ tuned heuristically on a subset of 500 points of the database. Using a 3-nearest-neighbor regression approach, we obtain an MAE of 10.83 which, although not directly competitive with the best approach recorded in (Hansen et al., 2013), is remarkable in the sense that it is obtained with an extremely simple regressor. Our result thus contradicts the observation made by Hansen et al., p. 3414, that "This [poor performance of k-nn] indicates that there are meaningful linear relations between physical quantities in the system and that it is insufficient to simply look up the most similar molecules (as k-nearest neighbors does). The k-nearest neighbors approach fails to create a smooth mapping to the energies."

5. Conclusion

The Gromov-Wasserstein discrepancy is relevant to countless problems in machine learning for which the primary role of a given data point is defined relative to other members of the dataset. After introducing an entropic regularizer to the problem, we find in particular that GW is not only theoretically attractive but also a practical tool, with an efficient and easily implemented non-convex optimization scheme. Our approach highlights the fact that using unregistered similarity matrices as features in machine learning problems is fruitful in many contexts. GW provides a natural framework for the manipulation of these matrices, as we demonstrate on classification, clustering, and regression tasks for shapes and molecules. We expect this elegant and efficient solution to extend to many other problem instances endowed with natural geometric structure.

Acknowledgements

The work of G. Peyré has been supported by the European Research Council (ERC project SIGMA-Vision). J. Solomon acknowledges the support of the NSF Mathematical Sciences Postdoctoral Research Fellowship (award number 1502435).

References

Aflalo, Yonathan, Bronstein, Alexander, and Kimmel, Ron. On convex relaxation of graph isomorphism. Proc. National Academy of Sci., 112(10):2942-2947, 2015.

Agueh, Martial and Carlier, Guillaume. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.

Arthur, David and Vassilvitskii, Sergei. K-means++: The advantages of careful seeding. In Proc. SODA, pp. 1027-1035, 2007.

Benamou, Jean-David, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, and Peyré, Gabriel. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comp., 37(2):A1111-A1138, 2015.

Boţ, Radu Ioan, Csetnek, Ernö Robert, and László, Szilárd Csaba. An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. EURO J. Comp. Optim., pp. 1-23, 2015.

Courty, Nicolas, Flamary, Rémi, and Tuia, Devis. Domain adaptation with regularized optimal transport.
In Machine Learning and Knowledge Discovery in Databases, pp. 274-289, 2014.

Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transportation. In Proc. NIPS, volume 26, pp. 2292-2300, 2013.

Cuturi, Marco and Doucet, Arnaud. Fast computation of Wasserstein barycenters. In Proc. ICML, volume 32, 2014.

Dryden, Ian L., Koloydenko, Alexey, and Zhou, Diwei. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics, pp. 1102-1123, 2009.

Elad, Asi and Kimmel, Ron. On bending invariant signatures for surfaces. IEEE Tr. on PAMI, 25(10):1285-1295, 2003.

Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, volume 28, pp. 2044-2052, 2015.

Gold, Steven and Rangarajan, Anand. A graduated assignment algorithm for graph matching. PAMI, 18(4):377-388, April 1996.

Gromov, Mikhail. Metric Structures for Riemannian and Non-Riemannian Spaces. Progress in Math. Birkhäuser, 2001.

Hansen, Katja, Montavon, Grégoire, Biegler, Franziska, Fazli, Siamac, Rupp, Matthias, Scheffler, Matthias, Von Lilienfeld, O. Anatole, Tkatchenko, Alexandre, and Müller, Klaus-Robert. Assessment and validation of machine learning methods for predicting molecular atomization energies. Journal of Chemical Theory and Computation, 9(8):3404-3419, 2013.

Hendrikson, Reigo. Using Gromov-Wasserstein distance to explore sets of networks. Master's thesis, University of Tartu, 2016.

Kezurer, Itay, Kovalsky, Shahar Z., Basri, Ronen, and Lipman, Yaron. Tight relaxation of quadratic matching. CGF, 2015.

Le Cun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Loiola, Eliane Maria, de Abreu, Nair Maria Maia, Boaventura-Netto, Paulo Oswaldo, Hahn, Peter, and Querido, Tania. A survey for the quadratic assignment problem. European J. Operational Research, 176(2):657-690, 2007.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.

Mémoli, Facundo. On the use of Gromov-Hausdorff distances for shape comparison. In Symposium on Point Based Graphics, pp. 81-90, 2007.

Mémoli, Facundo. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Comp. Math., 11(4):417-487, 2011.

Nielsen, Frank and Bhatia, Rajendra. Matrix Information Geometry. Springer, 2012.

Rangarajan, Anand, Yuille, Alan, and Mjolsness, Eric. Convergence properties of the softassign quadratic assignment algorithm. Neural Comput., 11(6):1455-1474, August 1999.

Rolet, Antoine, Cuturi, Marco, and Peyré, Gabriel. Fast dictionary learning with a smoothed Wasserstein loss. In Proc. AISTATS'16, 2016.

Rupp, Matthias, Tkatchenko, Alexandre, Müller, Klaus-Robert, and Von Lilienfeld, O. Anatole. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301, 2012.

Seguy, Vivien and Cuturi, Marco. Principal geodesic analysis for probability measures under the optimal transport metric. In Advances in Neural Information Processing Systems, pp. 3294-3302, 2015.

Solomon, Justin, Rustamov, Raif, Guibas, Leonidas, and Butscher, Adrian. Earth mover's distances on discrete surfaces. ACM Trans. Graph., 33(4):67:1-67:12, July 2014.

Solomon, Justin, Peyré, Gabriel, Kim, Vladimir, and Sra, Suvrit.
Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (TOG), 35(4), 2016.

Sra, Suvrit. Positive definite matrices and the S-divergence. arXiv preprint arXiv:1110.1773, 2011.

Sturm, Karl-Theodor. The space of spaces: Curvature bounds and gradient flows on the space of metric measure spaces. Preprint 1208.0434, arXiv, 2012.

Thakoor, Ninad, Gao, Jean, and Jung, Sungyong. Hidden Markov model-based weighted likelihood discriminant for 2-D shape classification. Trans. Image Proc., 16(11):2707-2719, 2007.

Villani, Cédric. Topics in Optimal Transportation. Graduate Studies in Math. AMS, 2003.