Optimal Tensor Transport

Tanguy Kerdoncuff¹, Rémi Emonet¹, Michaël Perrot², Marc Sebban¹
¹ Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France
² Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
{tanguy.kerdoncuff, remi.emonet, marc.sebban}@univ-st-etienne.fr, michael.perrot@inria.fr

Optimal Transport (OT) has become a popular tool in machine learning to align finite datasets typically lying in the same vector space. To expand the range of possible applications, Co-Optimal Transport (Co-OT) jointly estimates two distinct transport plans, one for the rows (points) and one for the columns (features), to match two data matrices that might use different features. On the other hand, Gromov-Wasserstein (GW) looks for a single transport plan from two pairwise intra-domain distance matrices. Both Co-OT and GW can be seen as specific extensions of OT to more complex data. In this paper, we propose a unified framework, called Optimal Tensor Transport (OTT), which takes the form of a generic formulation that encompasses OT, GW and Co-OT and can handle tensors of any order by learning possibly multiple transport plans. We derive theoretical results for the resulting new distance and present an efficient way of computing it. We further illustrate the interest of such a formulation in Domain Adaptation and Comparison-based Clustering.

Introduction

Comparing two probability measures in the form of empirical distributions is at the core of many machine learning tasks. Optimal Transport (OT) (Villani 2008; Peyré, Cuturi et al. 2019) is a popular tool that allows such comparisons for datasets typically lying in a common vector space. Given two point clouds and a metric to evaluate the transportation cost between two samples, the goal of OT is to learn the transport plan that minimizes the alignment cost between the two sets, resulting in the so-called Wasserstein distance. OT has been shown to be of great interest when dealing with machine learning tasks. For example, unsupervised Domain Adaptation (DA) aims at benefiting from labeled data of a source domain to classify examples drawn from a different but related target domain. The DA theory prompts us to reduce the shift between the source and the target distributions, a task that can be addressed by aligning the two datasets using OT (Courty et al. 2016, 2017; Shen et al. 2018; Damodaran et al. 2018). OT has also been successfully used in generative adversarial networks (GANs) (Goodfellow et al. 2014) to minimize the divergence between training data and samples drawn from a generative model, leading to the WGAN (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017).

In order to expand the range of possible applications, different variants have been proposed in the OT literature to tackle more complex settings. While the standard OT scenario assumes that the two datasets lie in the same feature space, Gromov-Wasserstein (GW) (Memoli 2007; Mémoli 2011; Peyré, Cuturi, and Solomon 2016) extends the framework to incomparable spaces by allowing the alignment of two distributions when only the within-dataset pairwise distances are available. This approach is particularly well suited to deal with graphs described by their adjacency matrices (Xu et al.
2019; Xu, Luo, and Carin 2019; Chowdhury and Mémoli 2019). The GW discrepancy has been used efficiently in various applications such as heterogeneous DA (Yan et al. 2018), word translation (Alvarez-Melis and Jaakkola 2018) or GANs (Vayer et al. 2019; Bunne et al. 2019). Recently, Co-Optimal Transport (Co-OT) (Redko et al. 2020) extended the OT theory to datasets lying in different vector spaces. The idea is to jointly learn two transport plans: the first one aligns the examples as in standard OT, while the second one aligns the most similar features. This has been shown to be of particular interest in heterogeneous DA and co-clustering.

Motivation and Contribution. While GW and Co-OT already cover a wide range of problems, we claim that many other scenarios are not covered by these two extensions. Let us suppose that both the source and target distributions are represented by a collection of graphs of the same size (in terms of nodes) but of different structure (in terms of edges). This is typically the case of two graphs evolving over time. In this case, the goal of OT would be to jointly align both the two collections of graphs and the nodes. It turns out that GW would only be able to handle the special case where there exists a known one-to-one correspondence between the graphs of the two collections. Another application is inspired from comparison-based learning. Let us consider a source and a target distribution represented by a set of users who watched movies, each user providing a list of triplet comparisons of the form "movie $x_i$ is closer to $x_j$ than to $x_k$". In this case, neither GW nor Co-OT is able to align the two distributions because of the nature of this triplet-based representation. A last example comes from computer vision, where one may want to align two collections of images while preserving some inner structural information in rows and columns. It is worth noting that these three applications share a common characteristic: they can be represented in the form of third-order tensors. To solve OT tasks on such complex structures, it is necessary to design a framework generalizing the OT theory. This is the main contribution of this paper. We propose Optimal Tensor Transport (OTT), a new OT formulation that can handle datasets represented as tensors of any order by potentially learning multiple transport plans. The underlying idea is to jointly match the different dimensions of each tensor with respect to their weights. Figure 1 illustrates this on a transportation problem between images from the MNIST (LeCun et al. 1998) and the USPS (Friedman et al. 2001) datasets. Three transport plans are optimized in this scenario: $T^1$ is used to match the points (on the left), while $T^2$ and $T^3$ preserve the structure by respectively mapping the pixel rows and pixel columns jointly (figure on the right). OTT effectively matches digits of the same class while only using supervision from the MNIST dataset.

Figure 1: (Left) Transport plan $T^1$ between 400 images (only digits 0 and 1) of the MNIST and USPS datasets; (Right) (top left) an example from MNIST and (bottom right) an example from USPS with a 90° right rotation; (top right) the OT plan $T^2$ between the rows of MNIST and USPS; (bottom left) the OT plan $T^3$ between the columns of MNIST and USPS; the arrows explain how to match the pixels between the two datasets using $T^2$ and $T^3$ obtained with OTT.
Note that the pixel-level transport plans are both close to the identity, meaning that the structure of the images is automatically retrieved. We further illustrate this behaviour by extending this experiment in the supplementary material. From a theoretical perspective, we show that OTT encompasses both Co-OT and GW as well as standard OT. We also show that OTT can be seen as a distance between tensors of any order and thus it can be used to compute tensor barycenters. From an algorithmic point of view, we propose an efficient optimization scheme based on a stochastic mirror descent that allows a drastic reduction of the computational complexity. The rest of this paper is organized as follows. We first recall some preliminary knowledge on OT, Co-OT and GW, then introduce our Optimal Tensor Transport (OTT) setting and propose an efficient algorithm for solving it. We then derive theoretical properties before presenting experimental results on DA and Comparison-based Clustering tasks.

Preliminary Knowledge

In this section, we recall the standard OT (Villani 2008; Peyré, Cuturi et al. 2019), the GW (Memoli 2007; Peyré, Cuturi, and Solomon 2016), and the Co-OT (Redko et al. 2020) formulations. Let $p$ and $q$ be two histograms of respective dimensions $I$ and $K$. The set of coupling transport plans is defined as
$$U_{pq} = \{T \in \mathbb{R}_+^{I \times K} \mid T\mathbf{1}_K = p,\; T^\top \mathbf{1}_I = q\}$$
where $\mathbf{1}_R$ is a vector of ones of dimension $R$. The goal in discrete OT is to learn one (in standard OT and GW) or two (in Co-OT) transport plans. Note that, for the sake of clarity, we only consider the discrete case here. Nevertheless, all the formulations presented in this section, as well as OTT, can be straightforwardly extended to the continuous case by replacing the sums by integrals over the compared distributions. In this case, the transport plans take the form of joint continuous measures. To prepare for our generalization, we unify the formulations below. In particular, we introduce subscripts and superscripts that are usually not used in the standard formulations. We denote the $(R-1)$-simplex $\Delta_R = \{(x_r)_{r \in \llbracket 1,R \rrbracket} \in \mathbb{R}_+^R \mid \sum_{r=1}^R x_r = 1\}$.

Optimal Transport (Villani 2008). Let $X$ and $Y$ be two datasets defined over the same feature space $\mathcal{X}$ (e.g. $\mathcal{X} = \mathbb{R}^F$), with respectively $I_1 \in \mathbb{N}$ and $K_1 \in \mathbb{N}$ points with weights $p^1 \in \Delta_{I_1}$ and $q^1 \in \Delta_{K_1}$. The optimal transport plan between $X$ and $Y$ is obtained by solving:
$$\min_{T^1 \in U_{p^1 q^1}} \sum_{i_1=1}^{I_1} \sum_{k_1=1}^{K_1} L(X_{i_1}, Y_{k_1})\, T^1_{i_1 k_1} \tag{1}$$
where $X_{i_1}$ is example $i_1$ in dataset $X$. Here, $L$ is a loss function which measures the cost of aligning two examples $X_i$ and $Y_k$. An extension of OT which is conceptually different from what is covered in this article is multi-marginal OT (Carlier 2003; Moameni 2014; Pass 2015; Friedland 2020), which aligns $R \geq 3$ datasets simultaneously: $L$ becomes a function of $R$ parameters and $T^1$ an $R$-order tensor.

Co-Optimal Transport (Redko et al. 2020). Co-Optimal Transport also aims at transporting points from two datasets $X$ and $Y$. However, contrary to standard OT, these datasets may have different feature spaces $\mathcal{X} = \mathbb{R}^{I_2}$ and $\mathcal{Y} = \mathbb{R}^{K_2}$ of respective dimensions $I_2$ and $K_2$, equipped with weights $p^2 \in \Delta_{I_2}$ and $q^2 \in \Delta_{K_2}$. The goal is to jointly match the points with a first transport plan $T^1$ and the features with a second one $T^2$. The Co-OT formulation is as follows:
$$\min_{T^1 \in U_{p^1 q^1},\, T^2 \in U_{p^2 q^2}} \sum_{i_1,i_2=1}^{I_1,I_2} \; \sum_{k_1,k_2=1}^{K_1,K_2} L(X_{i_1 i_2}, Y_{k_1 k_2})\, T^1_{i_1 k_1} T^2_{i_2 k_2} \tag{2}$$
where $X_{i_1 i_2}$ is the value of feature $i_2$ for example $i_1$.
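For readers who prefer code to indices, the following NumPy sketch evaluates the Co-OT objective of Equation (2) by brute force for given transport plans, using the squared difference as $L$. It is only an illustration under our own assumptions (toy sizes, uniform product plans, variable names), not the reference implementation.

```python
import numpy as np

def coot_objective(X, Y, T1, T2):
    """Evaluate the Co-OT cost of Eq. (2) with L(x, y) = (x - y)^2.

    X: (I1, I2) source data matrix, Y: (K1, K2) target data matrix,
    T1: (I1, K1) transport plan over points, T2: (I2, K2) plan over features.
    """
    # sum_{i1,i2,k1,k2} (X[i1,i2] - Y[k1,k2])^2 * T1[i1,k1] * T2[i2,k2]
    diff2 = (X[:, :, None, None] - Y[None, None, :, :]) ** 2  # (I1, I2, K1, K2)
    return np.einsum('ijkl,ik,jl->', diff2, T1, T2)

# Toy example with the product of uniform marginals as transport plans.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))           # 5 points, 4 features
Y = rng.normal(size=(6, 3))           # 6 points, 3 features
T1 = np.full((5, 6), 1 / (5 * 6))
T2 = np.full((4, 3), 1 / (4 * 3))
print(coot_objective(X, Y, T1, T2))
```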
Gromov-Wasserstein (Memoli 2007). Instead of having features describing the examples, let us consider that we only have access to within-dataset pairwise similarities or dissimilarities, that is, $X$ and $Y$ are now square matrices of dimensions $I_1 \times I_1$ and $K_1 \times K_1$. It means that the two datasets may have different feature spaces, as in Co-OT, but since these feature spaces are implicit, it is sufficient to learn a single transport plan $T^1$. The GW formulation is:
$$\min_{T^1 \in U_{p^1 q^1}} \sum_{i_1,i_2=1}^{I_1} \; \sum_{k_1,k_2=1}^{K_1} L(X_{i_1 i_2}, Y_{k_1 k_2})\, T^1_{i_1 k_1} T^1_{i_2 k_2} \tag{3}$$
where $X_{i_1 i_2}$ is the (dis)similarity between examples $i_1$ and $i_2$.

It is typical, for both Co-OT and GW, to use a comparison/loss function $L$ (often the squared difference) that operates on two numbers. In Co-OT, $L$ compares the value of a feature from one point in $X$ with one feature from a point in $Y$. In GW, it compares an entry of the pairwise matrix $X$ to one in $Y$. Both formulations can be extended by allowing $L$ to compare more complex entries such as $F$-dimensional vectors in $\mathbb{R}^F$. As illustrated in the top row of Figure 2, corresponding to the formulations of Equations (1), (2), and (3), although OT, Co-OT and GW solve different problems, they still share common principles. Below, we propose a new generalized OT formulation that encompasses all of them.

Figure 2: Various formulations of OTT with, each time, the two datasets and the different transport plans, in compact and full representations (best viewed in color): a) OTT1 (OT, with F=5 features), b) OTT12 (Co-OT), c) OTT11 (GW), d) OTT111 (triplets), e) OTT123 (tri-Co-OT), f) OTT112 (GW collections). The subscripts of OTT correspond to the indices of the transport plans used in each dimension.

Optimal Tensor Transport (OTT)

Given the notational complexity involved in our generic formulation, let us first explain the intuition behind the subscripts associated with OTT, as illustrated in Figure 2. Both Co-OT and GW work on matrices (that is, tensors of order $D = 2$) and thus will be represented with 2 digits. Since Co-OT uses $A = 2$ different transport plans $T^1$ and $T^2$, computing Co-OT boils down to solving OTT12 as defined below. On the other hand, GW uses the same plan $T^1$ for both dimensions, thus corresponding to OTT11. Note that dimensions that share a transport plan must have the same sizes; thus, GW (OTT11) deals with square matrices. Generalizing, when working with tensors of order $D$, a given OT extension considers $A \leq D$ transport plans and associates a transport plan (index) to each dimension. This is done by specifying an affectation function $f : \llbracket 1,D \rrbracket \to \llbracket 1,A \rrbracket$ or, equivalently, a $D$-tuple of transport plan indices, that is $f \in \llbracket 1,A \rrbracket^D$. For instance, Co-OT uses $f = (1, 2)$, which corresponds to the subscript in OTT12.

For a given $f \in \llbracket 1,A \rrbracket^D$, we can now detail our OTT$_f$ formulation (denoted OTT when no ambiguity arises), which defines a distance between two datasets $X$ and $Y$ represented as order $D+1$ tensors of respective sizes $(I_{f(1)}, \dots, I_{f(D)}, F)$ and $(K_{f(1)}, \dots, K_{f(D)}, F)$. The first $D$ dimensions will be matched between the two datasets using the transport plans, while the last dimension ($F$) is the feature dimension used to compare two points with the loss $L$. To simplify the rest of the paper, we will suppose that $F = 1$, as done in Co-OT and GW above. The OTT distance between $X$ and $Y$ relies on finding a list of optimal transport plans $(T^a)_{a \in \llbracket 1,A \rrbracket}$ under constraints on the marginals defined respectively by the weight vectors $(p^a)_{a \in \llbracket 1,A \rrbracket}$ and $(q^a)_{a \in \llbracket 1,A \rrbracket}$.
OTT is defined as:
$$\mathrm{OTT}_f\big(X, Y, (p^a)_a, (q^a)_a\big) = \min_{\forall a,\; T^a \in U_{p^a q^a}} E_f\big(X, Y, (T^a)_a\big)$$
where
$$E_f\big(X, Y, (T^a)_{a \in \llbracket 1,A \rrbracket}\big) = \sum_{i_1,\dots,i_D=1}^{I_{f(1)},\dots,I_{f(D)}} \; \sum_{k_1,\dots,k_D=1}^{K_{f(1)},\dots,K_{f(D)}} L(X_{i_1 \dots i_D}, Y_{k_1 \dots k_D}) \prod_{d=1}^{D} T^{f(d)}_{i_d k_d} \tag{4}$$
and $X_{i_1 \dots i_D}$ is the entry at position $i_1 \dots i_D$ in the tensor $X$. From this general formulation and looking at Equations (1), (2) and (3) with the support of Figure 2, one can check that OT corresponds to OTT1 (with $F$ possibly $> 1$), Co-OT is equivalent to OTT12 and GW corresponds to OTT11. Our OTT formulation makes it possible to handle new forms of datasets, as illustrated in the second row of Figure 2. In the experiments, we will specifically consider two versions of OTT, each with order 3 tensors: (i) OTT111 corresponds to datasets of triplets (like GW but with triplets instead of pairs); (ii) OTT112 works with datasets that are collections of adjacency matrices. Figure 1 gives an illustration of a third kind of dataset, where OTT123 has been applied to collections of images, like Co-OT but with three dimensions.

It is worth mentioning that the question we tackle here is reminiscent of another problem in the literature: D-regular hypergraph matching (Berge 1984). Such a problem is indeed equivalent to OTT in the particular case where all the transport plans are identical, but it uses either a different formulation or different constraints on the matching. Zass and Shashua (2008) propose to find a soft matching between D-regular hypergraphs, with uniform inequality constraints, using a Kullback-Leibler objective function. Duchenne et al. (2011) also match hypergraphs, with a formulation similar to OTT but using only row constraints on the matching matrix. Finally, Peyré et al. (2016) and Ning and Georgiou (2014) propose to represent examples as PSD matrices and to align those matrices using a single transport plan where each entry is also a PSD matrix instead of a real value.

Algorithm to Solve OTT

In this section, we detail how to efficiently solve the main optimization problem behind Equation (4). The most used method for solving GW is called EGW (Peyré, Cuturi, and Solomon 2016). It can be seen as a Mirror Descent scheme (Beck and Teboulle 2003) with the Kullback-Leibler divergence on a regularized version of GW: $\min_{T \in U_{p^1 q^1}} E(T) + \epsilon\, \mathrm{KL}(T, p^1 q^{1\top})$. The idea of the Mirror Descent algorithm is to interpret the usual gradient descent, at a point $x$, as the minimization of the sum of a linearization of the desired function $h$, namely $\langle \nabla_x h, \cdot \rangle$, plus a regularization term $\epsilon\, \|\cdot - x\|_2^2$. Instead of using the Euclidean distance, Peyré, Cuturi, and Solomon (2016) use the KL divergence. Thus, at a point $T^1$, Peyré, Cuturi, and Solomon (2016) show that the minimization becomes equivalent to the entropy regularized OT problem (Cuturi 2013):
$$\min_{T \in U_{p^1 q^1}} \langle \nabla_{T^1} E, T \rangle + \epsilon\, \mathrm{KL}(T, p^1 q^{1\top}). \tag{5}$$
Xu et al. (2019), based on the work of Xie et al. (2020), change the uniform prior $p^1 q^{1\top}$ in Equation (5) to the previous transport plan $T^1$. In fact, this is equivalent to applying a Mirror Descent algorithm on the original GW problem (see Equation (3)) instead of the regularized one (see the supplementary material for more details).
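The proximal step of Equation (5), with the prior replaced by the previous plan as in Xu et al. (2019), is an entropic OT problem that Sinkhorn iterations can solve. Below is a minimal NumPy sketch of that step, assuming a fixed number of iterations and no numerical stabilization; it is not the authors' implementation.

```python
import numpy as np

def kl_mirror_step(G, T_prev, p, q, eps, n_iter=200):
    """One mirror-descent step: argmin_{T in U_pq} <G, T> + eps * KL(T, T_prev).

    Solved by Sinkhorn scaling of the kernel K = T_prev * exp(-G / eps).
    G: gradient estimate (I, K); T_prev: previous plan (I, K);
    p, q: marginal weight vectors of sizes I and K.
    """
    K = T_prev * np.exp(-G / eps)
    u = np.ones_like(p, dtype=float)
    v = np.ones_like(q, dtype=float)
    for _ in range(n_iter):          # alternate marginal corrections
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]
```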
Thus, to solve the OTT problem, we use a Mirror Descent algorithm with the KL divergence. When the goal is to find multiple transport plans, we propose to use an alternating approach, similar to the Co-OT solver, where each transport plan is optimized in turn while the others remain fixed. In summary, we combine the ideas of existing solvers for Co-OT and GW and apply an alternating Mirror Descent algorithm with the KL divergence for OTT, the main bottleneck being the computation of the gradient of $E$. The pseudo-code of our approach is presented in Algorithm 1.

Algorithm 1: OTT
Require: datasets $X$, $Y$, weights $(p^a)_{a \in \llbracket 1,A \rrbracket}$, $(q^a)_{a \in \llbracket 1,A \rrbracket}$, loss function $L$, number of samples $M$, regularization $\epsilon$
1: $\forall a \in \llbracket 1,A \rrbracket$, $T^a = p^a q^{a\top}$
2: for $s = 0$ to $S-1$ do
3:   for $a = 1$ to $A$ do
4:     $\widehat{\nabla_{T^a} E}$ = average of $M$ gradient samples using Equation (7)
5:     $T^a = \arg\min_{T \in U_{p^a q^a}} \langle \widehat{\nabla_{T^a} E}, T \rangle + \epsilon\, \mathrm{KL}(T, T^a)$
6:   end for
7: end for

The main steps are the following.

Step 1: We initialize $(T^a)_{a \in \llbracket 1,A \rrbracket}$ with the product of the marginals.

Step 2: We compute the gradient of $E$. For the sake of clarity, we assume that the aligned tensors are "cubic", that is, all their dimensions are of the same size $N$. In this case, the overall gradient with respect to $T^a$ is an $N \times N$ matrix:
$$\nabla_{T^a} E = \sum_{d' \mid f(d')=a} \;\; \sum_{\substack{i_1,\dots,i_{d'-1},\, i_{d'+1},\dots,i_D \\ k_1,\dots,k_{d'-1},\, k_{d'+1},\dots,k_D}} L\big(X_{i_1 \dots i_{d'-1},\,\cdot\,, i_{d'+1} \dots i_D},\; Y_{k_1 \dots k_{d'-1},\,\cdot\,, k_{d'+1} \dots k_D}\big) \prod_{\substack{d=1 \\ d \neq d'}}^{D} T^{f(d)}_{i_d k_d} \tag{6}$$
where the dot "$\cdot$" denotes the free index pair of dimension $d'$, so that each summand is an $N \times N$ matrix. Note that computing the overall gradient exactly would be too expensive. Indeed, a naive approach leads to $O(N^{2D})$ operations, which is prohibitively high. To simplify the computation, a first idea would be to generalize the approach used for GW by Peyré, Cuturi, and Solomon (2016) to our problem. This would reduce the complexity to $O(N^{D+1})$ for a particular class of functions $L$, notably the square loss. We provide a proof of this approach in the supplementary material. Nevertheless, this remains too expensive as soon as $D = 3$. Thus, instead, we propose to use a stochastic Mirror Descent (Zhou et al. 2017; Zhang and He 2018; Hanzely and Richtárik 2021). This idea was used for the GW problem by Kerdoncuff, Emonet, and Sebban (2021) and we generalize it to our OTT problem. The main idea is to notice that the gradient of $E$ with respect to $T^a$ can be seen as a sum of expectations over matrices of size $N \times N$, denoted $(C^{d'})_{d' \mid f(d')=a}$, where the loss matrix
$$C^{d'} = L\big(X_{i_1 \dots i_{d'-1},\,\cdot\,, i_{d'+1} \dots i_D},\; Y_{k_1 \dots k_{d'-1},\,\cdot\,, k_{d'+1} \dots k_D}\big)$$
is drawn with probability $\prod_{d=1, d \neq d'}^{D} T^{f(d)}_{i_d k_d}$; these probabilities indeed sum to one since $\forall a \in \llbracket 1,A \rrbracket$, $\sum_{i,k=1}^{I_a, K_a} T^a_{ik} = 1$. The gradient can then be reformulated as:
$$\nabla_{T^a} E = \sum_{d' \mid f(d')=a} \mathbb{E}\big[C^{d'}\big]. \tag{7}$$
It means that one may obtain an unbiased estimate of the gradient in $O(MN^2)$ operations, where $M$ is the number of samples used to estimate the expectations.

Step 3: The last step (line 5 of Algorithm 1) requires solving a regularized OT problem, which can be done efficiently using a Sinkhorn solver (Xu et al. 2019; Cuturi 2013).

We refer the interested reader to Peyré, Cuturi, and Solomon (2016) and Xu et al. (2019) for an analysis of the efficiency of the Mirror Descent algorithm, and to Kerdoncuff, Emonet, and Sebban (2021) for investigations on the precision of the stochastic approximation of the gradient. We also provide in the supplementary material an experiment specific to our new formulation that shows how well the gradient is approximated for an increasing order $D$ of the tensors.
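As an illustration of Equations (6) and (7), the following NumPy sketch estimates the gradient for the special case OTT111 (cubic tensors, a single plan, squared loss) by sampling index pairs from the transport plan. Variable names and sampling details are our own simplifying assumptions; see Kerdoncuff, Emonet, and Sebban (2021) and the released code for the actual implementation.

```python
import numpy as np

def sampled_gradient_ott111(X, Y, T, M, rng):
    """Unbiased estimate of the gradient of E w.r.t. the single plan T for
    OTT111 (D = 3 cubic tensors, squared loss), following Eq. (7).

    X: (I, I, I) source tensor, Y: (K, K, K) target tensor, T: (I, K) plan.
    For each dimension d', the index pairs of the two other dimensions are
    drawn with probability T[i, k] (T sums to one), and the resulting
    (I, K) loss slices C^{d'} are averaged over M samples.
    """
    I, K = T.shape
    probs = T.ravel() / T.sum()            # categorical distribution over pairs (i, k)
    grad = np.zeros((I, K))
    for d_prime in range(3):
        for _ in range(M):
            # two independent pairs for the dimensions other than d_prime
            s = rng.choice(I * K, size=2, p=probs)
            i1, k1 = np.unravel_index(s[0], (I, K))
            i2, k2 = np.unravel_index(s[1], (I, K))
            # index tuples with a free slot at position d_prime
            xi = [i1, i2]
            yk = [k1, k2]
            xi.insert(d_prime, slice(None))
            yk.insert(d_prime, slice(None))
            # C^{d'}: (I, K) matrix of squared losses with the free slot varying
            C = (X[tuple(xi)][:, None] - Y[tuple(yk)][None, :]) ** 2
            grad += C / M
    return grad
```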
Theoretical Results

In this section, we derive two main theoretical results. Theorem 1 shows that, as long as the cost function is a proper distance, OTT is a distance between D-order tensors. Thus, we can naturally define an OTT barycenter between tensors. Theorem 2 states that the optimal barycenter can be found in closed form for particular loss functions.

Theorem 1. OTT is a distance between weighted tensors $(X, (p^a)_{a \in \llbracket 1,A \rrbracket})$ and $(Y, (q^a)_{a \in \llbracket 1,A \rrbracket})$ represented in canonical form (Definition 1 in the supplementary material), for any affectation function $f$, as long as $L$ is a proper distance.

The proof is provided in the supplementary material. This result notably extends the distance proof of Co-OT (Redko et al. 2020) to matrices of different sizes and to non-uniform weights. Even though their comparison with OTT is out of the scope of this paper, note that other distances exist between higher-order tensors (De Lathauwer, De Moor, and Vandewalle 2000; Liu, Liu, and Chan 2010; Lai et al. 2013). Since OTT is a distance, we can define an OTT barycenter between several tensors with any affectation function $f$.

Definition 1 (OTT barycenter). Assume that we are given $B \geq 1$ weighted tensors of sizes $((K^b_a)_{a \in \llbracket 1,A \rrbracket})_{b \in \llbracket 1,B \rrbracket}$, denoted $\big(X^b \in \mathbb{R}^{K^b_{f(1)} \times \dots \times K^b_{f(D)}},\, (q^{a,b} \in \Delta_{K^b_a})_{a \in \llbracket 1,A \rrbracket}\big)$. Let $\lambda \in \Delta_B$ be the weights quantifying the importance of each tensor. For fixed sizes $(I_a)_{a \in \llbracket 1,A \rrbracket}$ and weights $(p^a \in \Delta_{I_a})_{a \in \llbracket 1,A \rrbracket}$, the OTT barycenter is defined as
$$\min_{X \in \mathbb{R}^{I_{f(1)} \times \dots \times I_{f(D)}}} \sum_{b=1}^{B} \lambda_b\, \mathrm{OTT}\big(X, X^b, (p^a)_a, (q^{a,b})_a\big). \tag{8}$$

Note that the barycenter could also be defined in a similar manner with the marginals $(p^a)_{a \in \llbracket 1,A \rrbracket}$ not fixed. To solve Problem (8), we propose to alternately minimize the objective function w.r.t. $X$ and $(T^{a,b})_{a \in \llbracket 1,A \rrbracket}$, the transport plans between $X$ and $X^b$. The latter can be found independently for each $b \in \llbracket 1,B \rrbracket$ using Algorithm 1. Interestingly, $X$ can be obtained in closed form for particular loss functions, which generalizes, in particular to Co-OT, a known result for OT and GW (Peyré, Cuturi, and Solomon 2016). This is summarized in the next theorem.

Theorem 2. Assume that the loss $L$ is continuous and can be written as $L(x, y) = f_1(x) + f_2(y) - h_1(x) h_2(y)$ with four functions $(f_1, f_2, h_1, h_2)$ such that $f'_1/h'_1$ is invertible. Further assume that $L(x, y) \to +\infty$ when $|x| \to +\infty$. For fixed $((T^{a,b})_{a \in \llbracket 1,A \rrbracket})_{b \in \llbracket 1,B \rrbracket}$ and for all $(i_d \in \llbracket 1, I_{f(d)} \rrbracket)_{d \in \llbracket 1,D \rrbracket}$, the optimal solution $X^*_{i_1,\dots,i_D}$ of Problem (8) is equal to
$$\left(\frac{f'_1}{h'_1}\right)^{-1}\!\left( \sum_{b=1}^{B} \lambda_b \sum_{k_1,\dots,k_D=1}^{K^b_{f(1)},\dots,K^b_{f(D)}} h_2\big(X^b_{k_1 \dots k_D}\big) \prod_{d=1}^{D} \frac{T^{f(d),b}_{i_d k_d}}{p^{f(d)}_{i_d}} \right).$$
In particular, when $L$ is the squared Euclidean distance,
$$X^*_{i_1,\dots,i_D} = \sum_{b=1}^{B} \lambda_b \sum_{k_1,\dots,k_D=1}^{K^b_{f(1)},\dots,K^b_{f(D)}} X^b_{k_1 \dots k_D} \prod_{d=1}^{D} \frac{T^{f(d),b}_{i_d k_d}}{p^{f(d)}_{i_d}}.$$

Note that to obtain a barycenter using loss functions that are not covered by Theorem 2, for example the absolute loss, one can resort to a gradient-based optimization scheme. In the next section, we present experiments focused on 3D-tensor alignments. Nevertheless, it is worth noticing that our theoretical results and Algorithm 1 hold for any tensor order and thus might be used with $D = 4$, for example in comparison-based learning tasks (Ghoshdastidar, Perrot, and von Luxburg 2019) or to match hypergraphs (Berge 1984).
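To make the closed form of Theorem 2 concrete, here is a minimal NumPy sketch of the barycenter update specialized to OTT111 with the squared Euclidean loss; tensor shapes and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ott111_barycenter_update(Xs, Ts, lambdas, p):
    """Closed-form barycenter update of Theorem 2 for OTT111 with the
    squared Euclidean loss.

    Xs: list of B cubic tensors X^b of shape (K_b, K_b, K_b);
    Ts: list of B transport plans T^b of shape (I, K_b);
    lambdas: simplex weights of length B; p: barycenter marginal of size I.
    Returns the (I, I, I) barycenter tensor.
    """
    I = len(p)
    X_bar = np.zeros((I, I, I))
    for X_b, T_b, lam in zip(Xs, Ts, lambdas):
        S = T_b / p[:, None]          # T^b_{ik} / p_i
        # contract each of the three modes of X^b with S
        X_bar += lam * np.einsum('abc,ia,jb,kc->ijk', X_b, S, S, S)
    return X_bar
```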
Experiments

In this section, we illustrate the interest of OTT on two different tasks. The code to reproduce all the experiments is available online: https://github.com/Hv0nnus/Optimal Tensor Transport. First, following the success of OT in Domain Adaptation (Courty et al. 2016), we propose to predict the genres of recent movies based on labeled older movies by relying only on user preferences. We advantageously use a 3D-tensor formulation to take into account the particularity of each user. In a second experiment, we use the OTT barycenter in a Comparison-Based Clustering task.

Domain Adaptation (DA)

We consider a DA task on the Movielens dataset (Harper and Konstan 2015). The goal is to adapt a model learned on old movies (source) to predict the genres of new movies (target).

Datasets. We build two 3-order tensors $X^s$ (source) and $X^t$ (target) based on the ratings of the users. The entry $(i, j, k)$ in $X^s$ (and similarly for $X^t$) is $1$ if user $i$ preferred movie $j$ over movie $k$, $-1$ if movie $k$ is preferred over movie $j$, and $0$ if user $i$ cannot choose. As the users did not rate every movie, we use the 0.33 quantile of their personal ratings as a default rating. For both the new and the old movies, we identify 4 different groups of movies: Thriller/Crime/Drama (T), Fantasy/Sci-Fi (F), War/Western (W), and Children's/Animation (C). We then create 6 pairwise binary classification datasets of 200 movies each by selecting 2 classes among the four aforementioned ones. We assume that we have access to all the labels for the old movies (source) but only to a single random label per class for the new movies (target). The goal is to learn a model that is as accurate as possible on the target. Since many movies have a small number of ratings and many users only rated a few movies, we focus on the 100 users with the highest number of ratings and the 200 most rated films for those users.

Baselines. Even though OTT122 is, to the best of our knowledge, the first algorithm that allows direct DA on such tensor-based datasets, we still propose various baselines by reducing the $X^s$ and $X^t$ tensors into matrices by averaging along one dimension. Rdm is a first naive baseline that simply outputs random labels. For the next three baselines, we average over the user dimension. Then, SVM applies an SVM (Cortes and Vapnik 1995) classifier only on the target domain, using the columns of the matrix as features. S-GWL (Xu, Luo, and Carin 2019) interprets the obtained matrices as adjacency matrices of graphs and matches the nodes of the two graphs. GW (Peyré, Cuturi, and Solomon 2016) solves the GW problem directly on the obtained matrices. The last baseline, Co-OT, uses a matrix obtained by averaging over one movie dimension, which leads to a (users, movies) matrix; the two axes are then mapped jointly between the new and old movies. For all the methods that provide a transport plan $T$ between the movies, the class of a target movie $y^t_j$ is predicted via label propagation (Redko et al. 2019a) of the source labels $y^s$, that is $y^t_k = \sum_j y^s_j\, T_{jk}$. The stochastic methods are run 10 times and the mean and standard deviation are reported.

Table 1: Accuracy (mean ± standard deviation for the stochastic method) on the 6 DA tasks with the hyperparameters found using the proposed unsupervised method. To evaluate the best possible performance reachable by each method, AVGbest displays the accuracy with the best hyperparameters chosen using the ground truth of the target domain.

Datasets | SVM  | S-GWL | GW   | Co-OT | OTT
T,F      | 62.5 | 63.0  | 62.0 | 72.0  | 80.8 ± 1
T,C      | 69.0 | 77.0  | 78.0 | 83.0  | 97.0 ± 0
T,W      | 32.5 | 61.0  | 63.0 | 65.5  | 71.3 ± 5
F,C      | 74.5 | 72.0  | 74.0 | 74.0  | 70.2 ± 4
F,W      | 53.0 | 53.0  | 60.5 | 47.0  | 67.9 ± 2
C,W      | 60.0 | 57.0  | 52.0 | 67.5  | 76.8 ± 6
AVG      | 58.6 | 63.8  | 64.9 | 68.2  | 77.3 ± 3
AVGbest  | 58.6 | 66.3  | 71.0 | 70.7  | 78.9 ± 3
Time (s) | 0.1  | 94    | 673  | 4     | 5940
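As an illustration of the data construction and of the label-propagation rule described above, the following NumPy sketch builds a (users, movies, movies) preference tensor from a rating matrix and propagates one-hot source labels through a movie-level transport plan. The rating-matrix format and the one-hot convention are our own assumptions.

```python
import numpy as np

def preference_tensor(R, quantile=0.33):
    """Build the (users, movies, movies) preference tensor described above.

    R: (n_users, n_movies) rating matrix with np.nan for unrated movies.
    Missing ratings are first replaced by each user's `quantile` rating,
    then entry (i, j, k) is +1 if user i rates movie j above movie k,
    -1 for the opposite preference, and 0 in case of a tie.
    """
    R = R.copy()
    for i in range(R.shape[0]):
        default = np.nanquantile(R[i], quantile)
        R[i] = np.where(np.isnan(R[i]), default, R[i])
    return np.sign(R[:, :, None] - R[:, None, :])

def propagate_labels(Ys, T):
    """Label propagation of one-hot source labels Ys (n_source, n_classes)
    through a movie-level transport plan T (n_source, n_target)."""
    return np.argmax(Ys.T @ T, axis=0)   # predicted class per target movie
```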
Experimental setup and hyperparameter tuning. As the initialization is key to avoid local minima, we take advantage of both the labels and our stochastic algorithm by sampling only the labelled points in the source and target for the first gradient estimation. The squared Euclidean loss is used for $L$ and we estimate the gradient of OTT using $M = 1000$ samples. $S$ is set to 1000 iterations in Algorithm 1. For each method that uses the OT Sinkhorn solver, notably OTT, we replace it with the semi-supervised algorithm OTDA proposed by Courty et al. (2016), which adds an $\ell_p$-$\ell_1$ regularization to take advantage of the available source labels. In DA, tuning the hyperparameters is often key as there are not enough labeled target movies. As the goal of DA is to reduce the divergence between the two datasets (Ben-David et al. 2007; Redko et al. 2019b), we can use the distance between the source and the target as a criterion to choose the hyperparameters of each method. To compute the OTT distance, we resort to the sampling scheme already used to approximate the GW distance in Kerdoncuff, Emonet, and Sebban (2021). The Kullback-Leibler regularization parameter $\epsilon$ of the Sinkhorn method (Cuturi 2013) is selected in the range $[10^{-5}, 10^{2}]$ and the class regularization $\eta$ of OTDA (Courty et al. 2016) in $[10^{-4}, 10^{1}]$. The hyperparameter selection is limited to 24 hours for each method and dataset.

Results. The accuracy of each method is reported in Table 1. OTT achieves better performances than the other baselines on 5 out of 6 datasets. This result was expected as OTT is the only method which takes full advantage of the 3D structure of the data. Interestingly, OTT still behaves better than the baselines even when one uses the ground truth over the target domain to tune their hyperparameters (which would be cheating), as shown in the line AVGbest of Table 1. We now analyze the impact of the different hyperparameters on the accuracy. We report the results on each dataset in the supplementary material and only consider the global average in Figure 3.

Figure 3: Target accuracy averaged over all the datasets. The shaded area represents the standard deviation of the stochastic methods. (Left) Target accuracy for various values of $\epsilon$; the black symbols correspond to the value of $\epsilon$ associated with the lowest distance on average for each method. (Middle) Target accuracy for an increasing target supervision. (Right) Target accuracy for an increasing number of known similar users who rated both old and new movies.

The leftmost plot of Figure 3 displays the accuracy for increasing values of the KL regularization parameter $\epsilon$. The black markers correspond to the lowest achieved distance for each method. It is worth noting that this usually corresponds to a reasonable accuracy, which supports our hyperparameter tuning procedure. We notice a similar behaviour for the $\eta$ parameter of OTDA, as reported in the supplementary material. In Figure 3 (middle), we report the target accuracy with respect to the number of available target labels. We can notice that OTT is always better, even in the completely unsupervised scenario. Lastly, in the experiments reported in Table 1, we never use the fact that the users comparing the movies are the same for both old and new movies. Here, we study the impact of making this information available. To this end, we fix the transport plan for an increasing number of users. Figure 3 (right) shows that this information greatly improves the target accuracy of the methods that can handle it, especially OTT. Interestingly, as indicated by the black marker, the smallest distance is achieved with the highest number of known pairings, which corresponds to the highest number of constraints on the users' transport plan. This supports the key assumption of this experiment: a good matching between users leads to a better matching of similar movies. This also highlights a limit of a mirror descent-based solver, as it struggles to find the global minimum without this additional information.
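The unsupervised hyperparameter selection described in the setup paragraph above (keeping the value of $\epsilon$ with the smallest estimated OTT distance between source and target) can be sketched as a simple grid search; `solve_ott` and `estimate_ott_distance` below are hypothetical helpers standing for Algorithm 1 and for the sampled evaluation of Equation (4).

```python
import numpy as np

# Hypothetical helpers are passed in as arguments so the sketch stays generic.
def select_epsilon(Xs, Xt, eps_grid, solve_ott, estimate_ott_distance):
    """Unsupervised model selection: keep the eps giving the smallest
    estimated OTT distance between the source and target tensors."""
    best = (None, np.inf, None)            # (eps, distance, plans)
    for eps in eps_grid:
        plans = solve_ott(Xs, Xt, eps=eps)
        dist = estimate_ott_distance(Xs, Xt, plans)
        if dist < best[1]:
            best = (eps, dist, plans)
    return best

eps_grid = np.logspace(-5, 2, num=8)       # the paper's range [1e-5, 1e2]
```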
Comparison-Based Clustering

In this second series of experiments, we show that OTT barycenters can be used competitively to address an unbalanced comparison-based clustering task. Comparison-based learning deals with the problem of learning from examples when neither an explicit representation nor a pairwise distance matrix is available (Vikram and Dasgupta 2016; Ukkonen 2017; Emamjomeh-Zadeh and Kempe 2018; Perrot, Esser, and Ghoshdastidar 2020). Instead, it is assumed that only triplet comparisons of the form "example $x_i$ is closer to $x_j$ than to $x_k$" are available. This field stems from the fact that relative judgments are usually easier than absolute ones for human observers (Shepard 1962; Young 1987; Stewart, Brown, and Chater 2005). For example, triplet-based queries are easier to answer than exact distance estimations. Given a set of examples and a given number of triplet comparisons, a dataset can be represented as a third-order tensor where the entry $(i, j, k)$ contains $1$ if example $x_i$ is closer to $x_j$ than to $x_k$ and $-1$ otherwise. In comparison-based clustering, the goal is to identify relevant groups of examples, using only the information contained in the aforementioned tensor. As the three dimensions of the cubic tensor correspond to the same points, we use the same transport plan for all the dimensions, that is, OTT111.

Setting. To show the interest of our method for clustering unbalanced triplet datasets, we take inspiration from the experimental setup of Perrot, Esser, and Ghoshdastidar (2020). For a given dataset, we find the OTT111 barycenter ($B = 1$) of size $(I_1, I_1, I_1)$, where $I_1$ is the number of clusters that we are looking for. The intuition is that similar examples should be sent by the transport plan to the same point in the barycenter, since the latter summarizes the initial points.

Table 2: ARI (in percentage, mean ± standard deviation) for unbalanced comparison-based clustering on MNIST. Each line corresponds to the average over 10 combinations of classes, each run 10 times.

nb. points per class | AddS3   | AddS3s  | t-STE   | t-STEs  | OTT
200,20,20            | 43 ± 12 | 80 ± 17 | 56 ± 7  | 91 ± 4  | 91 ± 4
30,3,1               | 28 ± 9  | 82 ± 17 | 49 ± 14 | 91 ± 14 | 89 ± 14
30,3,3               | 37 ± 13 | 78 ± 23 | 52 ± 9  | 93 ± 19 | 87 ± 19
300,30,10            | 28 ± 4  | 83 ± 7  | 48 ± 10 | 89 ± 4  | 89 ± 4
AVG                  | 34 ± 9  | 81 ± 16 | 51 ± 10 | 91 ± 8  | 89 ± 10

Datasets. We consider 3-class unbalanced subsamples of the MNIST dataset (LeCun et al. 1998). For a given number of examples per class (for example, 200,20,20), we consider 10 random draws for the 3 classes and, for each of these, we further consider 10 random draws for the actual images. Given $N$ points in each unbalanced dataset, we randomly select $N \log(N)^3$ triplets of the form $d(x_i, x_j) > d(x_i, x_k)$, as suggested by Perrot, Esser, and Ghoshdastidar (2020). The distance between two digits is the Euclidean distance after a UMAP projection in 2 dimensions. To simulate a real dataset, some noise is added by randomly flipping $d(x_i, x_j) > d(x_i, x_k)$ to $d(x_i, x_j) < d(x_i, x_k)$ with probability 0.1 for each selected triplet.
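A possible way to generate such a noisy triplet tensor from low-dimensional embeddings, under our own conventions (in particular the choice to also fill the symmetric entry), is sketched below.

```python
import numpy as np

def triplet_tensor(Z, n_triplets, flip_prob=0.1, rng=None):
    """Build an (N, N, N) comparison tensor from embeddings Z.

    For each sampled triplet (i, j, k) of distinct indices, entry (i, j, k)
    is +1 if d(z_i, z_j) < d(z_i, z_k) and -1 otherwise; the answer is
    flipped with probability `flip_prob` to simulate noise. Unsampled
    entries stay at 0. Filling the symmetric entry (i, k, j) is our own
    convention, not stated in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(Z)
    X = np.zeros((N, N, N))
    for _ in range(n_triplets):
        i, j, k = rng.choice(N, size=3, replace=False)
        closer = np.linalg.norm(Z[i] - Z[j]) < np.linalg.norm(Z[i] - Z[k])
        sign = 1.0 if closer else -1.0
        if rng.random() < flip_prob:
            sign = -sign
        X[i, j, k] = sign
        X[i, k, j] = -sign
    return X

# e.g. with N points, as in the paper: X = triplet_tensor(Z, int(N * np.log(N) ** 3))
```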
Baselines. We use two main triplet clustering baselines: (i) t-STE (Van Der Maaten and Weinberger 2012), which projects the triplets into a vector space and then applies k-means (Lloyd 1982), and (ii) AddS3 (Perrot, Esser, and Ghoshdastidar 2020), which estimates a pairwise similarity matrix, also followed by k-means. Moreover, as the OT formulation requires the marginals as a prior, we assume that the proportions of the clusters are known. To stay fair, we propose two variants (AddS3s, t-STEs) of the previous baselines where we replace the k-means step by an OT barycenter step which takes the marginal information into account.

Hyperparameters. We use default hyperparameters, reported in the supplementary material, for t-STE, AddS3, and OTT, with the KL regularization parameter set to $\epsilon = 0.1$. To ensure convergence, we also set the number of samples to $M = 100$ and the number of iterations to $S = 500$ between each of the 20 barycenter updates. Finally, to take advantage of the closed form derived in Theorem 2, we use the squared Euclidean loss for OTT.

Results. The Adjusted Rand Index (ARI) (Hubert and Arabie 1985) between the predicted clusters and the ground truth is displayed in Table 2. Overall, OTT performs better than AddS3s on average over all datasets, while being slightly worse than t-STEs. Furthermore, for both AddS3 and t-STE, using the unbalancedness information improves the performances. The closeness between our approach and t-STEs is further investigated in the supplementary material, where we show a theoretical connection between t-STE and the OTT barycenter. The choice of the unbalanced setting is motivated by the fact that the two other baselines do not take this information into account during their first step, while OTT directly uses it in its unique step.
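Reading the clusters off the learned plan, as described in the Setting paragraph above, amounts to assigning each example to the barycenter point receiving most of its mass; a minimal sketch with hypothetical variable names is:

```python
import numpy as np

def clusters_from_plan(T):
    """Assign each example to the barycenter point receiving most of its mass.

    T: (N, n_clusters) transport plan between the N examples and the
    (n_clusters, n_clusters, n_clusters) OTT111 barycenter.
    """
    return np.argmax(T, axis=1)

# Example evaluation with the ARI used in Table 2 (scikit-learn assumed):
# from sklearn.metrics import adjusted_rand_score
# ari = adjusted_rand_score(y_true, clusters_from_plan(T))
```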
Conclusion

We presented OTT, a new OT formulation that can be used to align high-dimensional tensors using potentially several transport plans. OTT generalizes various existing OT problems, such as GW and Co-OT, by defining a new tensor distance. We proposed an efficient algorithm to solve the underlying problem and demonstrated the competitiveness of OTT in DA and Comparison-based Clustering. While our new approach unlocks new applications, this comes at a cost: despite having access to a solver that drastically reduces the computational complexity of the formulation, it still does not scale well to large datasets with high-order tensors. Finally, we leave for future work a natural extension, Fused-OTT, inspired by Vayer et al. (2020), that would combine several OTT problems together. This approach could allow us to align datasets that are independently represented by multiple tensors of potentially different orders.

Acknowledgements

This paper is part of the TADALoT Project funded by the region Auvergne-Rhône-Alpes (France) with the Pack Ambition Recherche (2017, 17 011047 01).

References

Alvarez-Melis, D.; and Jaakkola, T. 2018. Gromov-Wasserstein Alignment of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70.
Beck, A.; and Teboulle, M. 2003. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters.
Ben-David, S.; Blitzer, J.; Crammer, K.; Pereira, F.; et al. 2007. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems.
Berge, C. 1984. Hypergraphs: Combinatorics of Finite Sets. Elsevier.
Bunne, C.; Alvarez-Melis, D.; Krause, A.; and Jegelka, S. 2019. Learning Generative Models across Incomparable Spaces. In International Conference on Machine Learning.
Carlier, G. 2003. On a class of multidimensional optimal transportation problems. Journal of Convex Analysis.
Chowdhury, S.; and Mémoli, F. 2019. The Gromov-Wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA.
Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Machine Learning.
Courty, N.; Flamary, R.; Habrard, A.; and Rakotomamonjy, A. 2017. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems.
Courty, N.; Flamary, R.; Tuia, D.; and Rakotomamonjy, A. 2016. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9): 1853-1865.
Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems.
Damodaran, B. B.; Kellenberger, B.; Flamary, R.; Tuia, D.; and Courty, N. 2018. DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. In European Conference on Computer Vision. Springer.
De Lathauwer, L.; De Moor, B.; and Vandewalle, J. 2000. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications.
Duchenne, O.; Bach, F.; Kweon, I.-S.; and Ponce, J. 2011. A tensor-based algorithm for high-order graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Emamjomeh-Zadeh, E.; and Kempe, D. 2018. Adaptive Hierarchical Clustering Using Ordinal Queries. In Symposium on Discrete Algorithms.
Friedland, S. 2020. Tensor optimal transport, distance between sets of measures and tensor scaling. arXiv preprint arXiv:2005.00945.
Friedman, J.; Hastie, T.; Tibshirani, R.; et al. 2001. The Elements of Statistical Learning. Springer Series in Statistics, New York.
Ghoshdastidar, D.; Perrot, M.; and von Luxburg, U. 2019. Foundations of Comparison-Based Hierarchical Clustering. In Advances in Neural Information Processing Systems.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems.
Hanzely, F.; and Richtárik, P. 2021. Fastest rates for stochastic mirror descent methods. Computational Optimization and Applications.
Harper, F. M.; and Konstan, J. A. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems.
Hubert, L.; and Arabie, P. 1985. Comparing partitions. Journal of Classification.
Kerdoncuff, T.; Emonet, R.; and Sebban, M. 2021. Sampled Gromov Wasserstein. Machine Learning.
Lai, Z.; Xu, Y.; Yang, J.; Tang, J.; and Zhang, D. 2013. Sparse tensor discriminant analysis. IEEE Transactions on Image Processing.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
Liu, Y.; Liu, Y.; and Chan, K. C. 2010. Tensor distance based multilinear locality-preserved maximum information embedding. IEEE Transactions on Neural Networks.
Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory.
Memoli, F. 2007. On the use of Gromov-Hausdorff distances for shape comparison. In Eurographics Symposium on Point Based Graphics. The Eurographics Association.
Mémoli, F. 2011. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics.
Moameni, A. 2014. Multi-marginal Monge-Kantorovich transport problems: A characterization of solutions. Comptes Rendus Mathematique.
Ning, L.; and Georgiou, T. T. 2014. Metrics for matrix-valued measures via test functions. In 53rd IEEE Conference on Decision and Control, 2642-2647. IEEE.
Pass, B. 2015. Multi-marginal optimal transport: theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis.
Perrot, M.; Esser, P. M.; and Ghoshdastidar, D. 2020. Near-optimal comparison based clustering. arXiv preprint arXiv:2010.03918.
Peyré, G.; Chizat, L.; Vialard, F.-X.; and Solomon, J. 2016. Quantum optimal transport for tensor field processing. arXiv preprint arXiv:1612.08731.
Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning.
Peyré, G.; Cuturi, M.; et al. 2019. Computational optimal transport. Foundations and Trends in Machine Learning.
Redko, I.; Courty, N.; Flamary, R.; and Tuia, D. 2019a. Optimal transport for multi-source domain adaptation under target shift. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR.
Redko, I.; Morvant, E.; Habrard, A.; Sebban, M.; and Bennani, Y. 2019b. Advances in Domain Adaptation Theory. Elsevier.
Redko, I.; Vayer, T.; Flamary, R.; and Courty, N. 2020. CO-Optimal Transport. In Advances in Neural Information Processing Systems.
Shen, J.; Qu, Y.; Zhang, W.; and Yu, Y. 2018. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence.
Shepard, R. N. 1962. The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika.
Stewart, N.; Brown, G. D. A.; and Chater, N. 2005. Absolute identification by relative judgment. Psychological Review.
Ukkonen, A. 2017. Crowdsourced correlation clustering with relative distance comparisons. arXiv preprint arXiv:1709.08459.
Van Der Maaten, L.; and Weinberger, K. 2012. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing. IEEE.
Vayer, T.; Chapel, L.; Flamary, R.; Tavenard, R.; and Courty, N. 2020. Fused Gromov-Wasserstein distance for structured objects. Algorithms.
Vayer, T.; Flamary, R.; Tavenard, R.; Chapel, L.; and Courty, N. 2019. Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems.
Vikram, S.; and Dasgupta, S. 2016. Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning.
Villani, C. 2008. Optimal Transport: Old and New. Springer Science & Business Media.
Xie, Y.; Wang, X.; Wang, R.; and Zha, H. 2020. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence.
Xu, H.; Luo, D.; and Carin, L. 2019. Scalable Gromov-Wasserstein learning for graph partitioning and matching. In Advances in Neural Information Processing Systems.
Xu, H.; Luo, D.; Zha, H.; and Duke, L. C. 2019. Gromov-Wasserstein Learning for Graph Matching and Node Embedding. In International Conference on Machine Learning.
Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-Supervised Optimal Transport for Heterogeneous Domain Adaptation. In International Joint Conference on Artificial Intelligence.
Young, F. W. 1987. Multidimensional Scaling: History, Theory, and Applications. Lawrence Erlbaum Associates.
Zass, R.; and Shashua, A. 2008. Probabilistic graph and hypergraph matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Zhang, S.; and He, N. 2018. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781.
Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Boyd, S.; and Glynn, P. W. 2017. Stochastic mirror descent in variationally coherent optimization problems. Advances in Neural Information Processing Systems.