# Unbalanced Low-rank Optimal Transport Solvers

Meyer Scetbon (Microsoft Research) t-mscetbon@microsoft.com · Michal Klein (Apple) michalk@apple.com · Giovanni Palla (Helmholtz Center Munich) giovanni.palla@helmholtz-muenchen.de · Marco Cuturi (Apple) cuturi@apple.com

Two salient limitations have long hindered the relevance of optimal transport (OT) methods to machine learning. First, the $O(n^3)$ computational cost of standard sample-based solvers (when used on batches of $n$ samples) is prohibitive. Second, the mass conservation constraint makes OT solvers too rigid in practice: because they must match all points from both measures, their output can be heavily influenced by outliers. A flurry of recent works has addressed these computational and modeling limitations, but has resulted in two separate strains of methods: while the computational outlook was much improved by entropic regularization, more recent $O(n)$ linear-time low-rank solvers hold the promise to scale up OT further. In terms of modeling flexibility, the rigidity of mass conservation has been eased for entropic regularized OT, thanks to unbalanced variants of OT that can penalize couplings whose marginals deviate from those specified by the source and target distributions. The goal of this paper is to merge these two strains, low-rank and unbalanced, to achieve the promise of solvers that are both scalable and versatile. We propose custom algorithms to implement these extensions for the linear OT problem and its fused-Gromov-Wasserstein generalization, and demonstrate their practical relevance to challenging spatial transcriptomics matching problems. These algorithms are implemented in the ott-jax toolbox [Cuturi et al., 2022].

## 1 Introduction

Recent machine learning (ML) works have witnessed a flurry of activity around optimal transport (OT) methods. The OT toolbox provides convenient, intuitive and versatile ways to quantify the difference between two probability measures, either by computing a distance (the Wasserstein and Gromov-Wasserstein distances) or, in more elaborate scenarios, a push-forward map that can transform one measure into the other [Peyré and Cuturi, 2019]. Recent examples include, e.g., single-cell omics [Bunne et al., 2021, 2022, Demetci et al., 2020, Nitzan et al., 2019, Cang et al., 2023, Klein et al., 2023], attention mechanisms [Tay et al., 2020, Sander et al., 2022], self-supervised learning [Caron et al., 2020, Oquab et al., 2023], and learning on graphs [Vincent-Cuaz et al., 2023].

On the challenges of using OT. Despite their long history in ML [Rubner et al., 2000], OT methods have long suffered from various limitations that arise from their statistical, computational, and modelling aspects. The statistical argument is commonly referred to as the curse of dimensionality of OT estimators: the Wasserstein distance between two probability densities, and its associated optimal Monge map, is poorly approximated using samples as the dimension $d$ of the observations grows [Dudley et al., 1966, Boissard and Le Gouic, 2014]. On the computational side, computing OT between a pair of $n$ samples involves solving a (generalized) matching problem, with a price of $O(n^3)$ and above [Kuhn, 1955, Ahuja et al., 1993].
Finally, the original model for OT rests on a mass conservation constraint: all observations from either sample must be accounted for, including outliers that are prevalent in machine learning datasets. Combined, these weaknesses have long hindered the use of OT, until a more recent generation of solvers addressed these three crucial issues.

The Entropic Success Story. The winning approach, so far, to carry out that agenda has been entropic regularization [Cuturi, 2013]. The computational virtues of the Sinkhorn algorithm when solving OT [Altschuler et al., 2017, Peyré et al., 2016, Solomon et al., 2016, Le et al., 2021] come with statistical efficiency [Genevay et al., 2019, Mena and Niles-Weed, 2019, Chizat et al., 2020], and can also be seamlessly combined with unbalanced formulations that penalize rather than constrain mass conservation, both for the linear [Frogner et al., 2015, Chizat et al., 2018, Séjourné et al., 2022, Fatras et al., 2021, Pham et al., 2020] and quadratic [Séjourné et al., 2021] problems. These developments have all been implemented in popular OT packages [Feydy et al., 2019, Flamary et al., 2021, Cuturi et al., 2022].

The Low-Rank Alternative. A recent strain of solvers relies instead on low-rank (LR) properties of cost and coupling matrices [Forrow et al., 2018, Scetbon and Cuturi, 2020, Scetbon et al., 2021]. Much like entropic solvers, these LR solvers have a better statistical outlook [Scetbon and Cuturi, 2022] and extend to GW problems [Scetbon et al., 2022]. In stark contrast to entropic solvers, however, LR solvers benefit from a linear complexity $O(nrd)$ w.r.t. the sample size $n$ (using rank $r$ and cost dimension $d$) that can scale to ambitious tasks where entropic solvers fail [Klein et al., 2023].

The Need for Unbalanced Low-Rank Solvers. LR solvers do suffer, however, from a major practical limitation: their inability to handle unbalanced problems. Yet, unbalancedness is a crucial ingredient for OT to be practically relevant. This is exemplified by the fact that unbalancedness played a crucial role in the seminal reference [Schiebinger et al., 2019], where it is used to model cell birth and death.

Our Contributions. We propose in this work to lift this last limitation for LR solvers to:
- incorporate unbalanced regularizers to define a LR linear solver (§3.1);
- provide accelerated algorithms, inspired by some of the recent corrections proposed by [Séjourné et al., 2022], to isolate translation terms that appear in dual subroutines (§3.2);
- carry over and adapt these approaches to the GW (§3.3) and fused-GW (§3.4) problems;
- carry out an exhaustive hyperparameter selection procedure within large-scale OT tasks (spatial transcriptomics, brain imaging), and demonstrate the benefits of our approach (§4).

## 2 Reminders on Low-Rank Transport and Unbalanced Transport

We consider two metric spaces $(\mathcal{X}, d_\mathcal{X})$ and $(\mathcal{Y}, d_\mathcal{Y})$, as well as a cost function $c : \mathcal{X} \times \mathcal{Y} \to [0, +\infty)$. The simplex $\Delta_n^+$ holds all positive $n$-vectors summing to 1. For $n, m \geq 1$, $a \in \Delta_n^+$, and $b \in \Delta_m^+$, given points $x_1, \dots, x_n \in \mathcal{X}$ and $y_1, \dots, y_m \in \mathcal{Y}$, we define two discrete probability measures $\mu$ and $\nu$ as $\mu := \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu := \sum_{j=1}^m b_j \delta_{y_j}$, where $\delta_z$ is the Dirac mass at $z$.
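To fix ideas, the small JAX sketch below builds such a pair of discrete measures and the cost matrices used throughout the paper, here with squared-Euclidean costs ($q = 2$). This is only an illustrative setup under these assumptions; all variable names are ours, not part of ott-jax.

```python
import jax
import jax.numpy as jnp

def sq_euclidean_cost(X, Y):
    # Pairwise squared Euclidean distances between the rows of X and Y.
    return jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
n, m, d = 500, 400, 30
X = jax.random.normal(key1, (n, d))              # support of mu
Y = 1.2 + 0.5 * jax.random.normal(key2, (m, d))  # support of nu
a = jnp.ones(n) / n                              # uniform weights in the simplex
b = jnp.ones(m) / m
A = sq_euclidean_cost(X, X)   # intra-source geometry (quadratic / GW term)
B = sq_euclidean_cost(Y, Y)   # intra-target geometry
C = sq_euclidean_cost(X, Y)   # cross cost (linear / fused term)
```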
Cost matrices. For $q \geq 1$, consider first two square pairwise cost matrices, each encoding the geometry of points within $\mu$ and $\nu$, and a rectangular matrix that encodes costs across their supports:
$$A := [d_\mathcal{X}^q(x_i, x_{i'})]_{1\le i,i'\le n}, \qquad B := [d_\mathcal{Y}^q(y_j, y_{j'})]_{1\le j,j'\le m}, \qquad C := [c(x_i, y_j)]_{1\le i\le n,\,1\le j\le m}.$$

The Kantorovich Formulation of OT is defined as the following parameterized linear program:
$$\mathrm{OT}(\mu, \nu) := \min_{P \in \Pi_{a,b}} \langle C, P \rangle, \quad \text{where } \Pi_{a,b} := \{P \in \mathbb{R}^{n\times m}_+ : P\mathbf{1}_m = a,\ P^T\mathbf{1}_n = b\}. \quad (1)$$

The Low-Rank Formulation of OT is best understood as a variant of (1) that rests on a low-rank property for the cost matrix $C$, and low-rank constraints for couplings $P$. More precisely, Scetbon et al. [2021] propose to constrain the set of admissible couplings to those, within $\Pi_{a,b}$, that have a non-negative rank of at most $r \geq 1$. That set can be equivalently reparameterized as
$$\Pi_{a,b}(r) = \{P \in \mathbb{R}^{n\times m}_+ : P = Q\,\mathrm{diag}(1/g)R^T,\ Q \in \Pi_{a,g},\ R \in \Pi_{b,g},\ g \in \Delta_r^+\}.$$
The low-rank optimal transport (LOT) problem simply uses that restriction in (1) to define:
$$\mathrm{LOT}_r(\mu,\nu) := \min_{P \in \Pi_{a,b}(r)} \langle C, P\rangle = \min_{Q \in \Pi_{a,g},\, R\in\Pi_{b,g},\, g\in\Delta_r^+} \langle C, Q\,\mathrm{diag}(1/g)R^T\rangle. \quad (2)$$
Scetbon et al. [2021] propose and prove the convergence of a mirror-descent scheme to solve (2), and obtain linear time and memory complexities with respect to the number of samples, where each iteration in that descent scales as $(n+m)rd$, with $d$ the rank of $C$.

The Unbalanced Formulation of OT starts from (1) as well, but proposes to do without $\Pi_{a,b}$ and its marginal constraints [Frogner et al., 2015, Chizat et al., 2018], relying instead on two regularizers:
$$\mathrm{UOT}(\mu,\nu) := \min_{P \in \mathbb{R}^{n\times m}_+} \langle C, P\rangle + \tau_1\,\mathrm{KL}(P\mathbf{1}_m|a) + \tau_2\,\mathrm{KL}(P^T\mathbf{1}_n|b), \quad (3)$$
where $\tau_1, \tau_2 > 0$ and $\mathrm{KL}(p|q) := \sum_i p_i\log(p_i/q_i) + q_i - p_i$. This formulation is solved using entropic regularization, with modified Sinkhorn updates [Frogner et al., 2015]. Proposing an efficient algorithm able to merge (2) with (3) is the first goal of this paper.

Gromov-Wasserstein (GW) Considerations. The GW problem [Mémoli, 2011] is a generalization of (1) where the energy $\mathcal{Q}_{A,B}$ is a quadratic function of $P$ defined through inner cost matrices $A$, $B$:
$$\mathcal{Q}_{A,B}(P) := \sum_{i,i',j,j'} (A_{ii'} - B_{jj'})^2 P_{ij}P_{i'j'} = \mathbf{1}_m^T P^T A^{\odot 2} P \mathbf{1}_m + \mathbf{1}_n^T P B^{\odot 2} P^T \mathbf{1}_n - 2\langle APB, P\rangle, \quad (4)$$
where $\odot$ is the Hadamard product and $A^{\odot 2} := A \odot A$. To minimize (4), the default approach rests on entropic regularization [Solomon et al., 2016, Peyré et al., 2016] and variants [Sato et al., 2020, Blumberg et al., 2020, Xu et al., 2019, Li et al., 2023]. Scetbon et al. [2022] adapted the low-rank framework to minimize $\mathcal{Q}_{A,B}$ over low-rank matrices $P$, achieving a linear-time complexity when $A$ and $B$ are themselves low-rank. Independently, [Séjourné et al., 2021] proposed an unbalanced generalization that also applies to GW and which can be implemented practically using entropic regularization. Finally, the minimization of a composite objective involving the sum of $\mathcal{Q}_{A,B}$ with $\langle C, \cdot\rangle$ is known as the fused GW problem [Vayer et al., 2018].

## 3 Unbalanced Low-Rank Transport

### 3.1 Unbalanced Low-rank Linear Optimal Transport

We incorporate unbalancedness into low-rank solvers [Scetbon et al., 2021, 2022], moving gradually from the linear problem to the more involved GW and FGW problems. Using the framework of [Frogner et al., 2015, Chizat et al., 2018], we first extend the definition of LOT, introduced in (2), to the unbalanced case by considering the following optimization problem:
$$\mathrm{ULOT}_r(\mu,\nu) := \min_{P :\ \mathrm{rk}_+(P)\le r} \langle C, P\rangle + \tau_1\,\mathrm{KL}(P\mathbf{1}_m|a) + \tau_2\,\mathrm{KL}(P^T\mathbf{1}_n|b), \quad (5)$$
where $\mathrm{rk}_+(P)$ denotes the non-negative rank of $P$.
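For concreteness, the sketch below evaluates the objective of (5) for a factored candidate plan $(Q, R, g)$, combining the generalized KL of (3) with the linear cost of (2). It is a minimal illustration under the assumption that the factors have positive entries; `ulot_objective` is an illustrative name, not an ott-jax function.

```python
import jax.numpy as jnp

def generalized_kl(p, q):
    # KL(p|q) = sum_i p_i log(p_i/q_i) + q_i - p_i, as defined in (3).
    return jnp.sum(p * jnp.log(p / q) + q - p)

def ulot_objective(C, Q, R, g, a, b, tau1, tau2):
    # Linear cost <C, Q diag(1/g) R^T>, computed without materializing the n x m plan P.
    lin = jnp.sum((C @ R) * (Q / g[None, :]))
    # Marginals of P = Q diag(1/g) R^T, valid for any positive factors.
    row_marginal = Q @ (R.sum(axis=0) / g)   # = P 1_m
    col_marginal = R @ (Q.sum(axis=0) / g)   # = P^T 1_n
    return (lin
            + tau1 * generalized_kl(row_marginal, a)
            + tau2 * generalized_kl(col_marginal, b))
```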
Therefore, denoting $\mathcal{C}_r := \{(Q, R, g) \in \mathbb{R}^{n\times r}_+ \times \mathbb{R}^{m\times r}_+ \times \mathbb{R}^r_+ : Q^T\mathbf{1}_n = R^T\mathbf{1}_m = g\}$, and using the reparameterization of low-rank couplings, we obtain the following equivalent formulation of ULOT:
$$\mathrm{ULOT}_r(\mu,\nu) = \min_{(Q,R,g)\in\mathcal{C}_r} \underbrace{\langle C, Q\,\mathrm{diag}(1/g)R^T\rangle}_{\mathcal{L}_C(Q,R,g)} + \underbrace{\tau_1\,\mathrm{KL}(Q\mathbf{1}_r|a) + \tau_2\,\mathrm{KL}(R\mathbf{1}_r|b)}_{G_{a,b}(Q,R,g)}. \quad (6)$$
We introduce the more compact notation $G_{a,b}(Q,R,g) := F_{\tau_1,a}(Q\mathbf{1}_r) + F_{\tau_2,b}(R\mathbf{1}_r)$, where $F_{\tau,z}(s) := \tau\,\mathrm{KL}(s|z)$ for $\tau > 0$ and $z \geq 0$ coordinate-wise. To solve (6), and using this split, we move away from mirror descent and apply instead a proximal gradient descent for the KL divergence. At each iteration, we consider a linear approximation of $\mathcal{L}_C$, to which a KL penalization is added (as in the classical mirror-descent scheme), while leaving $G_{a,b}$ intact. Borrowing notations from [Scetbon et al., 2021], we must solve at each iteration the convex optimization problem
$$(Q_{k+1}, R_{k+1}, g_{k+1}) := \mathop{\mathrm{argmin}}_{(Q,R,g)\in\mathcal{C}_r}\ \tfrac{1}{\gamma_k}\mathrm{KL}\big((Q,R,g),\xi_k\big) + \tau_1\,\mathrm{KL}(Q\mathbf{1}_r|a) + \tau_2\,\mathrm{KL}(R\mathbf{1}_r|b), \quad (7)$$
where $\mathrm{KL}((Q,R,g),\xi_k) := \mathrm{KL}(Q|\xi_k^{(1)}) + \mathrm{KL}(R|\xi_k^{(2)}) + \mathrm{KL}(g|\xi_k^{(3)})$, $(Q_0, R_0, g_0) \in \mathcal{C}_r$ is the initialization, and the triplet $\xi_k := (\xi_k^{(1)}, \xi_k^{(2)}, \xi_k^{(3)})$ holds synthetic cost matrices that are re-computed at each iteration $k$:
$$\xi_k^{(1)} := Q_k \odot e^{-\gamma_k C R_k \mathrm{diag}(1/g_k)}, \qquad \xi_k^{(2)} := R_k \odot e^{-\gamma_k C^T Q_k \mathrm{diag}(1/g_k)}, \qquad \xi_k^{(3)} := g_k \odot e^{\gamma_k \omega_k/g_k^2},$$
with $[\omega_k]_i := [Q_k^T C R_k]_{i,i} \in \mathbb{R}$ for $i \in \{1,\dots,r\}$, and $(\gamma_k)_{k\ge 0}$ a sequence of positive step sizes.

Reformulation using Duality. To solve (7), we apply Dykstra's algorithm [Dykstra, 1983], whose iterations correspond to an alternating maximization on the dual formulation of (7):

Proposition 1. The convex optimization problem defined in (7) admits the following dual:
$$\sup_{f_1, h_1, f_2, h_2} \mathcal{D}_k(f_1,h_1,f_2,h_2) := -F^*_{\tau_1,a}(-f_1) - F^*_{\tau_2,b}(-f_2) - \tfrac{1}{\gamma_k}\Big[\big\langle e^{\gamma_k(f_1\oplus h_1)} - 1,\, \xi_k^{(1)}\big\rangle + \big\langle e^{\gamma_k(f_2\oplus h_2)} - 1,\, \xi_k^{(2)}\big\rangle + \big\langle e^{-\gamma_k(h_1+h_2)} - 1,\, \xi_k^{(3)}\big\rangle\Big], \quad (8)$$
where $h_1, h_2 \in \mathbb{R}^r$, $f_1 \in \mathbb{R}^n$, $f_2 \in \mathbb{R}^m$, $(f\oplus h)_{ij} := f_i + h_j$, and $F^*_{\tau,z}(\cdot) := \sup_y\{\langle y, \cdot\rangle - F_{\tau,z}(y)\}$ is the convex conjugate of $F_{\tau,z}$. In addition, strong duality holds and the primal problem admits a unique minimizer.

Remark 1. While we stick to KL regularizers in this work for simplicity, it is worth noting that this can be extended to more generic regularizers $F_{\tau_1,a}$ and $F_{\tau_2,b}$, as considered by Chizat et al. [2018].

We use an alternating maximization scheme to solve (8). Starting from $h_1^{(0)} = h_2^{(0)} = \mathbf{0}_r$, we apply for $\ell \geq 0$ the following updates (dropping the iteration number $k$ of (7) for simplicity):
$$f_1^{(\ell+1)} := \mathop{\mathrm{argsup}}_z\ \mathcal{D}(z, h_1^{(\ell)}, f_2^{(\ell)}, h_2^{(\ell)}), \qquad f_2^{(\ell+1)} := \mathop{\mathrm{argsup}}_z\ \mathcal{D}(f_1^{(\ell+1)}, h_1^{(\ell)}, z, h_2^{(\ell)}), \qquad (h_1^{(\ell+1)}, h_2^{(\ell+1)}) := \mathop{\mathrm{argsup}}_{z_1, z_2}\ \mathcal{D}(f_1^{(\ell+1)}, z_1, f_2^{(\ell+1)}, z_2).$$
These maximizations can all be obtained in closed form, resulting in the updates (with entry-wise divisions, powers and exponentials):
$$e^{\gamma f_1^{(\ell+1)}} = \Big(\tfrac{a}{\xi^{(1)} e^{\gamma h_1^{(\ell)}}}\Big)^{\frac{\tau_1}{\tau_1+1/\gamma}}, \qquad e^{\gamma f_2^{(\ell+1)}} = \Big(\tfrac{b}{\xi^{(2)} e^{\gamma h_2^{(\ell)}}}\Big)^{\frac{\tau_2}{\tau_2+1/\gamma}},$$
$$g^{(\ell+1)} = \Big(\xi^{(3)} \odot \big((\xi^{(1)})^T e^{\gamma f_1^{(\ell+1)}}\big) \odot \big((\xi^{(2)})^T e^{\gamma f_2^{(\ell+1)}}\big)\Big)^{1/3}, \qquad e^{\gamma h_1^{(\ell+1)}} = \frac{g^{(\ell+1)}}{(\xi^{(1)})^T e^{\gamma f_1^{(\ell+1)}}}, \qquad e^{\gamma h_2^{(\ell+1)}} = \frac{g^{(\ell+1)}}{(\xi^{(2)})^T e^{\gamma f_2^{(\ell+1)}}}.$$
When using scaling representations for these dual variables, $u_i^{(\ell)} := \exp(\gamma f_i^{(\ell)})$ and $v_i^{(\ell)} := \exp(\gamma h_i^{(\ell)})$ for $i \in \{1,2\}$ and $\ell \geq 0$, we obtain a simple update, provided in the appendix (Alg. 5).

Initialization and Termination. We use the stopping criterion proposed in [Scetbon et al., 2021] to terminate the algorithm, $\Delta\big((Q,R,g), (\bar Q, \bar R, \bar g), \gamma\big) := \tfrac{1}{\gamma^2}\big(\mathrm{KL}((Q,R,g)|(\bar Q,\bar R,\bar g)) + \mathrm{KL}((\bar Q,\bar R,\bar g)|(Q,R,g))\big)$. Finding an efficient initialization for that problem is challenging, and various choices have been implemented, for instance in [Cuturi et al., 2022]. We adopt the practical choices proposed in [Scetbon and Cuturi, 2022], using either random sub-coupling matrices or a k-means approach, and also follow them in adapting the choice of $\gamma_k$ at each iteration $k$ of the outer loop. We summarize our proposal in Algorithm 1, which can be seen as an extension of [Scetbon et al., 2021, Alg. 2].
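To make the outer loop of Algorithm 1 concrete, the following sketch computes, in JAX, the gradients, adaptive step size and kernels $\xi^{(1)}, \xi^{(2)}, \xi^{(3)}$ of one outer iteration, assuming a dense cost matrix $C$; the inner Dykstra routine (Alg. 5) is left abstract. This is a hedged illustration of the formulas above, not the ott-jax implementation.

```python
import jax.numpy as jnp

def ulot_kernels(C, Q, R, g, gamma0):
    # Gradients of the linearized objective L_C at (Q, R, g).
    grad_Q = C @ R / g[None, :]                    # = C R diag(1/g)
    grad_R = C.T @ Q / g[None, :]                  # = C^T Q diag(1/g)
    omega = jnp.einsum("ik,ij,jk->k", Q, C, R)     # omega_l = [Q^T C R]_{ll}
    grad_g = -omega / g**2
    # Adaptive step size, normalized by the largest gradient entry.
    sup = jnp.maximum(jnp.abs(grad_Q).max(),
                      jnp.maximum(jnp.abs(grad_R).max(), jnp.abs(grad_g).max()))
    gamma = gamma0 / sup**2
    # Synthetic kernels of the proximal subproblem (7).
    xi1 = Q * jnp.exp(-gamma * grad_Q)
    xi2 = R * jnp.exp(-gamma * grad_R)
    xi3 = g * jnp.exp(-gamma * grad_g)             # = g * exp(gamma * omega / g^2)
    return xi1, xi2, xi3, gamma
```

A full outer loop would then pass `(xi1, xi2, xi3)` and `gamma` to the inner routine and iterate until the criterion $\Delta$ falls below $\delta$.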
Convergence. The convergence proof for Dykstra's algorithm, as implemented in Alg. 5 (see appendix), follows from [Bauschke and Combettes, 2008]. Scetbon et al. [2021] show the convergence of their scheme towards a stationary point, w.r.t. the criterion $\Delta(\cdot,\cdot,\gamma)$, for fixed $\gamma$. The stationary convergence of our proposed algorithm can be directly derived from their result.

Complexity. Given $\xi_k$, solving Eq. (7) requires a time and memory complexity of $O((n+m)r)$. However, computing $\xi_k$ requires in general $O((n^2+m^2)r)$ time and $O(n^2+m^2)$ memory. Scetbon et al. [2021] propose to consider low-rank factorizations of the cost matrix $C$ of the form $C \simeq C_1C_2^T$ where $C_1 \in \mathbb{R}^{n\times d}$ and $C_2 \in \mathbb{R}^{m\times d}$. In that case, computing $\xi_k$ can be done in $O((n+m)rd)$ time and $O((n+m)(r+d))$ memory. Such factorizations are either known explicitly (e.g. when using squared-Euclidean distances) or can be obtained as approximations using the algorithm of [Indyk et al., 2019], which guarantees that, for any distance matrix $C \in \mathbb{R}^{n\times m}$ and $\epsilon > 0$, it can output matrices $C_1 \in \mathbb{R}^{n\times d}$, $C_2 \in \mathbb{R}^{m\times d}$ in $O((m+n)\,\mathrm{poly}(d/\epsilon))$ algebraic operations such that, with probability at least 0.99, $\|C - C_1C_2^T\|_F$ is within an additive $\epsilon\|C\|_F$ of $\|C - C_d\|_F$, where $C_d$ denotes the best rank-$d$ approximation to $C$.

Algorithm 1 ULOT$(C, a, b, r, \gamma_0, \tau_1, \tau_2, \delta)$
- Inputs: $C, a, b, r, \gamma_0, \tau_1, \tau_2, \delta$.
- $Q, R, g \leftarrow$ initialization as proposed in [Scetbon and Cuturi, 2022].
- repeat:
  - $\bar Q, \bar R, \bar g \leftarrow Q, R, g$;
  - $\nabla_Q \leftarrow CR\,\mathrm{diag}(1/g)$, $\nabla_R \leftarrow C^TQ\,\mathrm{diag}(1/g)$, $\omega \leftarrow \mathcal{D}(Q^TCR)$, $\nabla_g \leftarrow -\omega/g^2$ (where $\mathcal{D}(M)$ denotes the vector of diagonal entries of $M$);
  - $\gamma \leftarrow \gamma_0 / \max(\|\nabla_Q\|^2_\infty, \|\nabla_R\|^2_\infty, \|\nabla_g\|^2_\infty)$;
  - $\xi^{(1)} \leftarrow Q \odot \exp(-\gamma\nabla_Q)$, $\xi^{(2)} \leftarrow R \odot \exp(-\gamma\nabla_R)$, $\xi^{(3)} \leftarrow g \odot \exp(-\gamma\nabla_g)$;
  - $Q, R, g \leftarrow$ ULR-Dykstra$(a, b, \xi, \gamma, \tau_1, \tau_2, \delta)$ (Alg. 5);
- until $\Delta((Q,R,g), (\bar Q,\bar R,\bar g), \gamma) < \delta$.
- Result: $Q, R, g$.

### 3.2 Improvements on the Unbalanced Dykstra Algorithm

A well-documented source of instability of unbalanced formulations of OT lies in the fact that the total mass of the optimal unbalanced coupling is not known beforehand. Séjourné et al. [2022] have proposed a technique to address this issue, with the benefit of reduced computational costs. They propose first a dual objective that is translation invariant (TI). We take inspiration from this strategy and adapt it to our problem, to propose the following variant of (8):
$$\sup_{\bar f_1, \bar h_1, \bar f_2, \bar h_2} \mathcal{D}^{\mathrm{TI}}(\bar f_1, \bar h_1, \bar f_2, \bar h_2) := \sup_{\lambda_1, \lambda_2 \in \mathbb{R}} \mathcal{D}(\bar f_1 + \lambda_1, \bar h_1 - \lambda_1, \bar f_2 + \lambda_2, \bar h_2 - \lambda_2). \quad (9)$$
It is clear from the reparameterization that problems (8) and (9) have the same value, and that $(\bar f_1, \bar h_1, \bar f_2, \bar h_2)$ is a solution of (9) if and only if $(\bar f_1 + \lambda_1^\star, \bar h_1 - \lambda_1^\star, \bar f_2 + \lambda_2^\star, \bar h_2 - \lambda_2^\star)$ is a solution of (8), where $(\lambda_1^\star, \lambda_2^\star)$ attains the inner supremum defining $\mathcal{D}^{\mathrm{TI}}(\bar f_1, \bar h_1, \bar f_2, \bar h_2)$. To solve (9), we show that this inner supremum can be obtained in closed form.

Proposition 2. Let $\bar f_1 \in \mathbb{R}^n$, $\bar f_2 \in \mathbb{R}^m$ and $\bar h_1, \bar h_2 \in \mathbb{R}^r$. Then the inner problem defined in (9) by $\mathcal{D}^{\mathrm{TI}}(\bar f_1, \bar h_1, \bar f_2, \bar h_2)$ admits a unique solution $(\lambda_1^\star, \lambda_2^\star)$, given by
$$\lambda_1^\star = \frac{(1/\tau_2 + \gamma)\,c_1 - \gamma\,c_2}{(1/\tau_1+\gamma)(1/\tau_2+\gamma) - \gamma^2}, \quad (10)$$
$$\lambda_2^\star = \frac{(1/\tau_1 + \gamma)\,c_2 - \gamma\,c_1}{(1/\tau_1+\gamma)(1/\tau_2+\gamma) - \gamma^2}, \quad (11)$$
where
$$c_1 := \log\frac{\langle \exp(-\bar f_1/\tau_1),\, a\rangle}{\langle \exp(-\gamma(\bar h_1+\bar h_2)),\, \xi^{(3)}\rangle}, \qquad c_2 := \log\frac{\langle \exp(-\bar f_2/\tau_2),\, b\rangle}{\langle \exp(-\gamma(\bar h_1+\bar h_2)),\, \xi^{(3)}\rangle}.$$

Using Proposition 2, we perform an alternating maximization scheme on the TI formulation of the dual, $\mathcal{D}^{\mathrm{TI}}$. Indeed, using Danskin's theorem (under the assumption that $\lambda_1^\star, \lambda_2^\star$ do not diverge), one obtains a variant of Algorithm 5. That TI approach is summarized below in Algorithm 3, using Algorithm 2 as a subroutine. We show in the experiments section (Exp. 1) that the TI approach has better computational performance on a simple task.

Algorithm 2 compute-lambdas$(a, b, \xi^{(3)}, u_1, v_1, u_2, v_2, \gamma, \tau_1, \tau_2)$
- Inputs: $a, b, \xi^{(3)}, u_1, v_1, u_2, v_2, \gamma, \tau_1, \tau_2$.
- $\tilde u_1 \leftarrow u_1^{-1/(\gamma\tau_1)}$, $\tilde u_2 \leftarrow u_2^{-1/(\gamma\tau_2)}$;
- $c_1 \leftarrow \log(\langle \tilde u_1, a\rangle) - \log(\langle \xi^{(3)}, (v_1\odot v_2)^{-1}\rangle)$, $c_2 \leftarrow \log(\langle \tilde u_2, b\rangle) - \log(\langle \xi^{(3)}, (v_1\odot v_2)^{-1}\rangle)$;
- Result: $\lambda_1^\star, \lambda_2^\star$ as in (10), (11).
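A hedged sketch of this subroutine, assuming the scaling parameterization $u_i = \exp(\gamma \bar f_i)$, $v_i = \exp(\gamma \bar h_i)$ and the first-order conditions behind (10)-(11); rather than hard-coding the explicit inverse, the code solves the corresponding 2x2 linear system. `compute_lambdas` is an illustrative name, not an ott-jax function.

```python
import jax.numpy as jnp

def compute_lambdas(a, b, xi3, u1, v1, u2, v2, gamma, tau1, tau2):
    # c_1, c_2 as in Proposition 2, written with the scalings:
    # u1 ** (-1/(gamma*tau1)) = exp(-f1/tau1) and 1/(v1*v2) = exp(-gamma*(h1+h2)).
    log_den = jnp.log(jnp.sum(xi3 / (v1 * v2)))
    c1 = jnp.log(jnp.sum(a * u1 ** (-1.0 / (gamma * tau1)))) - log_den
    c2 = jnp.log(jnp.sum(b * u2 ** (-1.0 / (gamma * tau2)))) - log_den
    # First-order conditions of the inner problem in (9): a 2x2 system in (lambda1, lambda2).
    M = jnp.array([[1.0 / tau1 + gamma, gamma],
                   [gamma, 1.0 / tau2 + gamma]])
    lam = jnp.linalg.solve(M, jnp.array([c1, c2]))
    return lam[0], lam[1]
```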
Algorithm 3 ULR-TI-Dykstra$(a, b, \xi, \gamma, \tau_1, \tau_2, \delta)$
- Inputs: $a, b, \xi = (\xi^{(1)}, \xi^{(2)}, \xi^{(3)}), \gamma, \tau_1, \tau_2, \delta$.
- $v_1 = v_2 = \mathbf{1}_r$, $u_1 = \mathbf{1}_n$, $u_2 = \mathbf{1}_m$.
- repeat:
  - $\bar v_1 = v_1$, $\bar v_2 = v_2$, $\bar u_1 = u_1$, $\bar u_2 = u_2$;
  - $\lambda_1, \lambda_2 \leftarrow$ compute-lambdas$(a, b, \xi^{(3)}, u_1, v_1, u_2, v_2, \gamma, \tau_1, \tau_2)$ (Alg. 2);
  - $u_1 = \big(\tfrac{a}{\xi^{(1)}v_1}\big)^{\frac{\tau_1}{\tau_1+1/\gamma}} \odot \exp\big(\tfrac{-\lambda_1}{\tau_1+1/\gamma}\big)$, $u_2 = \big(\tfrac{b}{\xi^{(2)}v_2}\big)^{\frac{\tau_2}{\tau_2+1/\gamma}} \odot \exp\big(\tfrac{-\lambda_2}{\tau_2+1/\gamma}\big)$;
  - $\lambda_1, \lambda_2 \leftarrow$ compute-lambdas$(a, b, \xi^{(3)}, u_1, v_1, u_2, v_2, \gamma, \tau_1, \tau_2)$ (Alg. 2);
  - $g = \big(e^{\gamma(\lambda_1+\lambda_2)}\,\xi^{(3)} \odot ((\xi^{(1)})^Tu_1) \odot ((\xi^{(2)})^Tu_2)\big)^{1/3}$, $v_1 = \tfrac{g}{(\xi^{(1)})^Tu_1}$, $v_2 = \tfrac{g}{(\xi^{(2)})^Tu_2}$;
- until $\tfrac{1}{\gamma}\max_{i\in\{1,2\}}\big(\|\log(u_i/\bar u_i)\|_\infty, \|\log(v_i/\bar v_i)\|_\infty\big) < \delta$.
- Result: $\mathrm{diag}(u_1)\,\xi^{(1)}\,\mathrm{diag}(v_1)$, $\mathrm{diag}(u_2)\,\xi^{(2)}\,\mathrm{diag}(v_2)$, $g$.

### 3.3 Unbalanced Low-rank Gromov-Wasserstein

The low-rank Gromov-Wasserstein (LGW) problem [Scetbon et al., 2022] between the two discrete metric measure spaces $(\mu, d_\mathcal{X})$ and $(\nu, d_\mathcal{Y})$, written for compactness using $(a, A)$ and $(b, B)$, reads
$$\mathrm{LGW}_r((a,A), (b,B)) = \min_{P \in \Pi_{a,b}(r)} \mathcal{Q}_{A,B}(P). \quad (12)$$
Building upon §3.1 and leveraging the TI variant presented in §3.2, we introduce the unbalanced low-rank Gromov-Wasserstein (ULGW) problem. There is, however, a significant challenge that appears when introducing unbalanced regularizers in (12): when $P$ is constrained to be in $\Pi_{a,b}$, the first two terms in the right-hand side of (4) simplify to $a^TA^{\odot 2}a + b^TB^{\odot 2}b$. Hence, they are constant and discarded when optimizing. In an unbalanced setting, these terms vary and must be accounted for:
$$\mathrm{ULGW}_r((a,A),(b,B)) := \min_{(Q,R,g)\in\mathcal{C}_r} \langle A^{\odot 2}Q\mathbf{1}_r, Q\mathbf{1}_r\rangle + \langle B^{\odot 2}R\mathbf{1}_r, R\mathbf{1}_r\rangle - 2\langle AQ\,\mathrm{diag}(1/g)R^TB,\ Q\,\mathrm{diag}(1/g)R^T\rangle + \tau_1\,\mathrm{KL}(Q\mathbf{1}_r|a) + \tau_2\,\mathrm{KL}(R\mathbf{1}_r|b). \quad (13)$$
To solve the problem, we apply the same scheme as proposed for ULOT, that is, a proximal gradient descent where we linearize $\mathcal{Q}_{A,B}$ and add a KL penalization while leaving the soft marginal constraints unchanged. The algorithm solving ULGW is therefore the same as that solving ULOT; however, the kernels $\xi_k$ now take into account the quadratic terms of the original LGW problem. More formally, at each iteration $k$ of the outer loop, we propose to solve
$$(Q_{k+1}, R_{k+1}, g_{k+1}) := \mathop{\mathrm{argmin}}_{(Q,R,g)\in\mathcal{C}_r}\ \tfrac{1}{\gamma_k}\mathrm{KL}\big((Q,R,g)|\xi_k\big) + \tau_1\,\mathrm{KL}(Q\mathbf{1}_r|a) + \tau_2\,\mathrm{KL}(R\mathbf{1}_r|b), \quad (14)$$
where $(Q_0,R_0,g_0)\in\mathcal{C}_r$ is the initialization and $(\gamma_k)_{k\ge0}$ a sequence of positive step sizes. Using the notation $P_k = Q_k\,\mathrm{diag}(1/g_k)R_k^T$, the synthetic cost matrices $\xi_k := (\xi_k^{(1)}, \xi_k^{(2)}, \xi_k^{(3)})$ are updated as:
$$\xi_k^{(1)} := Q_k \odot \exp\big(-2\gamma_k A^{\odot 2}Q_k\mathbf{1}_r\mathbf{1}_r^T\big) \odot \exp\big(4\gamma_k AP_kBR_k\,\mathrm{diag}(1/g_k)\big),$$
$$\xi_k^{(2)} := R_k \odot \exp\big(-2\gamma_k B^{\odot 2}R_k\mathbf{1}_r\mathbf{1}_r^T\big) \odot \exp\big(4\gamma_k BP_k^TAQ_k\,\mathrm{diag}(1/g_k)\big),$$
$$\xi_k^{(3)} := g_k \odot \exp\big(-4\gamma_k\omega_k/g_k^2\big), \quad \text{with } [\omega_k]_i := [Q_k^TAP_kBR_k]_{i,i} \in \mathbb{R},\ i \in \{1,\dots,r\}.$$
Note that (14) is the exact same optimization problem as (7), where only $\xi_k$ has changed, and can therefore be solved using Algorithm 3. Algorithm 4 summarizes our strategy to solve (13).

Algorithm 4 ULGW$(A, B, a, b, r, \gamma_0, \tau_1, \tau_2, \delta)$
- Inputs: $A, B, a, b, r, \gamma_0, \tau_1, \tau_2, \delta$.
- $Q, R, g \leftarrow$ initialization as proposed in [Scetbon and Cuturi, 2022].
- repeat:
  - $\bar Q, \bar R, \bar g \leftarrow Q, R, g$;
  - $\nabla_Q = -4AQ\,\mathrm{diag}(1/g)R^TBR\,\mathrm{diag}(1/g) + 2A^{\odot 2}Q\mathbf{1}_r\mathbf{1}_r^T$;
  - $\nabla_R = -4BR\,\mathrm{diag}(1/g)Q^TAQ\,\mathrm{diag}(1/g) + 2B^{\odot 2}R\mathbf{1}_r\mathbf{1}_r^T$;
  - $\omega = \mathcal{D}(Q^TAQ\,\mathrm{diag}(1/g)R^TBR)$, $\nabla_g = 4\omega/g^2$;
  - $\gamma \leftarrow \gamma_0 / \max(\|\nabla_Q\|^2_\infty, \|\nabla_R\|^2_\infty, \|\nabla_g\|^2_\infty)$;
  - $\xi^{(1)} \leftarrow Q \odot \exp(-\gamma\nabla_Q)$, $\xi^{(2)} \leftarrow R \odot \exp(-\gamma\nabla_R)$, $\xi^{(3)} \leftarrow g \odot \exp(-\gamma\nabla_g)$;
  - $Q, R, g \leftarrow$ ULR-TI-Dykstra$(a, b, \xi, \gamma, \tau_1, \tau_2, \delta)$ (Alg. 3);
- until $\Delta((Q,R,g), (\bar Q,\bar R,\bar g), \gamma) < \delta$.
- Result: $Q, R, g$.

Convergence and Complexity. Similarly to the linear ULOT case, the unbalanced Dykstra algorithm is guaranteed to converge [Bauschke and Lewis, 2000]. Because we use Algorithm 5, we retain exactly the same complexity, both in terms of time and memory, to solve these inner problems. The slight variation in kernels compared to ULOT still yields the same $O((n^2+m^2)r)$ time and $O(n^2+m^2)$ memory complexities. However, as in ULOT, we can take advantage of low-rank approximations of both cost matrices $A$ and $B$ to reach linear complexity. Indeed, assuming $A \simeq A_1A_2^T$ and $B \simeq B_1B_2^T$ where $A_1, A_2 \in \mathbb{R}^{n\times d_\mathcal{X}}$ and $B_1, B_2 \in \mathbb{R}^{m\times d_\mathcal{Y}}$, the total time and memory complexities become respectively $O(mr(r+d_\mathcal{Y}) + nr(r+d_\mathcal{X}))$ and $O((n+m)(r+d_\mathcal{X}+d_\mathcal{Y}))$. Again, when $A$ and $B$ are distance matrices, we use the algorithm from [Indyk et al., 2019].
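For illustration, the sketch below evaluates the ULGW objective (13) for given factors $(Q, R, g)$, never materializing the $n \times m$ plan; with factored $A$ and $B$ the same contractions become linear in $n$ and $m$, as discussed above. It is a hedged standalone example; names are illustrative.

```python
import jax.numpy as jnp

def ulgw_objective(A, B, Q, R, g, a, b, tau1, tau2):
    # On the feasible set C_r, Q 1_r and R 1_r are the marginals of P = Q diag(1/g) R^T.
    p = Q.sum(axis=1)                     # = P 1_m
    q = R.sum(axis=1)                     # = P^T 1_n
    quad = p @ (A**2) @ p + q @ (B**2) @ q        # first two terms of (13), A**2 is Hadamard square
    # Cross term -2 <A P B, P>, computed through (n x r) / (r x r) intermediates only.
    APB_R = A @ (Q / g[None, :]) @ (R.T @ B @ R)  # = A P B R
    cross = -2.0 * jnp.sum((APB_R / g[None, :]) * Q)
    kl = lambda p_, q_: jnp.sum(p_ * jnp.log(p_ / q_) + q_ - p_)
    return quad + cross + tau1 * kl(p, a) + tau2 * kl(q, b)
```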
### 3.4 Unbalanced Low-rank Fused-Gromov-Wasserstein

We finally focus on the increasingly impactful [Klein et al., 2023] fused-Gromov-Wasserstein problem, which merges linear and quadratic objectives [Vayer et al., 2018]:
$$\mathrm{FGW}(\mu,\nu) := \min_{P\in\Pi_{a,b}} \bar\alpha\langle C, P\rangle + \alpha\,\mathcal{Q}_{A,B}(P), \quad (15)$$
where $\alpha \in [0,1]$ and $\bar\alpha := 1-\alpha$ allows interpolating between the GW and linear OT geometries. This problem remains a GW problem, in which one replaces the 4-way cost $M[i,i',j,j'] := (A_{i,i'}-B_{j,j'})^2$ appearing in (4) by a composite cost interpolating between the OT and GW geometries, redefined as $M[i,i',j,j'] = \bar\alpha C_{i,j} + \alpha(A_{i,i'}-B_{j,j'})^2$. Our proposed unbalanced and low-rank version of the FGW problem includes $|P| := \|P\|_1$, the mass of $P$, to homogenize linear and quadratic terms:
$$\mathrm{ULFGW}_r(\mu,\nu) := \min_{P:\ \mathrm{rk}_+(P)\le r} \bar\alpha|P|\langle C,P\rangle + \alpha\,\mathcal{Q}_{A,B}(P) + \tau_1\,\mathrm{KL}(P\mathbf{1}_m|a) + \tau_2\,\mathrm{KL}(P^T\mathbf{1}_n|b), \quad (16)$$
which is expanded through the explicit factorization of $P$, noticing that $|P| = |g| := \|g\|_1$:
$$\mathrm{ULFGW}_r(\mu,\nu) = \min_{(Q,R,g)\in\mathcal{C}_r} \bar\alpha|g|\,\mathcal{L}_C(Q,R,g) + \alpha\,\mathcal{Q}_{A,B}(Q,R,g) + G_{a,b}(Q,R,g). \quad (17)$$
Then, by linearizing again $H : (Q,R,g) \mapsto \bar\alpha|g|\,\mathcal{L}_C(Q,R,g) + \alpha\,\mathcal{Q}_{A,B}(Q,R,g)$ with an added KL penalty and leaving $G_{a,b}$ unchanged, we obtain at each iteration the same optimization problem as in (14), where the kernels $\xi_k$ are now defined as
$$\xi_k^{(1)} := Q_k\odot\exp(-\gamma_k\nabla_QH_k), \qquad \xi_k^{(2)} := R_k\odot\exp(-\gamma_k\nabla_RH_k), \qquad \xi_k^{(3)} := g_k\odot\exp(-\gamma_k\nabla_gH_k),$$
with
$$\nabla_QH_k := \bar\alpha|g_k|\,CR_k\,\mathrm{diag}(1/g_k) + \alpha\big(2A^{\odot 2}Q_k\mathbf{1}_r\mathbf{1}_r^T - 4AP_kBR_k\,\mathrm{diag}(1/g_k)\big),$$
$$\nabla_RH_k := \bar\alpha|g_k|\,C^TQ_k\,\mathrm{diag}(1/g_k) + \alpha\big(2B^{\odot 2}R_k\mathbf{1}_r\mathbf{1}_r^T - 4BP_k^TAQ_k\,\mathrm{diag}(1/g_k)\big),$$
$$\nabla_gH_k := \bar\alpha\big(\langle C, P_k\rangle\mathbf{1}_r - |g_k|\,\omega^{\mathrm{lin}}_k/g_k^2\big) + 4\alpha\,\omega^{\mathrm{quad}}_k/g_k^2,$$
where $[\omega^{\mathrm{lin}}_k]_i := [Q_k^TCR_k]_{i,i}$ and $[\omega^{\mathrm{quad}}_k]_i := [Q_k^TAP_kBR_k]_{i,i}$ for all $i \in \{1,\dots,r\}$. These steps are summarized in Alg. 6 (see appendix). They result in a quadratic complexity, both in time and memory, with respect to the numbers of points $n$ and $m$. However, these complexities become linear when the square matrices $A$, $B$ and the rectangular matrix $C$ all admit low-rank factorizations.
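To illustrate how the linear and quadratic parts are combined in (16)-(17), the sketch below evaluates the ULFGW objective by reusing the structure of the two previous sketches, adding the mass term $|g|$ and the $\alpha$-interpolation. As before, this is only an illustrative standalone helper under the assumption of positive factors, not the ott-jax API.

```python
import jax.numpy as jnp

def ulfgw_objective(C, A, B, Q, R, g, a, b, alpha, tau1, tau2):
    mass = jnp.sum(g)                                 # |P| = |g| on the feasible set C_r
    # Linear part <C, P>, scaled by the mass as in (16)-(17).
    lin = jnp.sum((C @ R) * (Q / g[None, :]))
    # Quadratic (GW) part, as in (13).
    p, q = Q.sum(axis=1), R.sum(axis=1)
    APB_R = A @ (Q / g[None, :]) @ (R.T @ B @ R)
    quad = p @ (A**2) @ p + q @ (B**2) @ q - 2.0 * jnp.sum((APB_R / g[None, :]) * Q)
    kl = lambda p_, q_: jnp.sum(p_ * jnp.log(p_ / q_) + q_ - p_)
    return ((1.0 - alpha) * mass * lin + alpha * quad
            + tau1 * kl(p, a) + tau2 * kl(q, b))
```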
Figure 1: Exp. 2: Spatial visualization of two mouse brain sections, contrasting observed vs. predicted (using ULFGW) spatial distributions of expression levels, for two different genes. (a) Gene Nrgn: measured vs. predicted expression. (b) Gene Slc24a2: measured vs. predicted expression.

## 4 Experiments

We focus first in Exp. 1 on demonstrating the empirical benefits of the TI variant of our algorithm to solve linear ULOT, as implemented in Alg. 3 vs. Alg. 5; that algorithm is subsequently used as an inner routine to solve all quadratic ULR problems. We compare in Exp. 2 unbalanced low-rank (ULR) solvers to balanced low-rank (LR) counterparts on a spatial transcriptomics task, and follow in Exp. 3 by comparing ULR solvers to entropic (E) counterparts on a smaller task, to accommodate the quadratic complexity of entropic solvers. We conclude in Exp. 4 by comparing ULR solvers to [Thual et al., 2022], which can learn a sparse transport coupling, in the unbalanced FGW setting.

Figure 2: Visualization of measured and predicted tissue regions in the mouse brain in Exp. 2.

Datasets. We consider two real-world datasets, described in Appendix B.1, and two synthetic datasets that are large enough to showcase our solvers. The real-world datasets consist of both a shared feature space, used to compute the cost matrices for the linear term in the OT and FGW settings, as well as geometries that are specific to each source $s$ and target $t$ measure, which are used to compute the cost matrices for the quadratic term in the GW and FGW settings. In Exp. 1, we simply consider two isotropic Gaussians in $d = 30$ to evaluate the performance of the TI variant on a linear problem. We use the mouse brain STARmap spatial transcriptomics data from [Shi et al., 2022] for Exp. 2 and Exp. 3. We use data from the Individual Brain Charting dataset [Pinho et al., 2018], to replicate the settings of [Thual et al., 2022], in Exp. 4.

Metrics. Following Klein et al. [2023], we evaluate maps by focusing on the two following metrics: (i) the Pearson correlation $\rho$, computed between the (ground-truth) source feature matrix $F^s \in \mathbb{R}^{n\times d}$ and the barycentric projection of the target $t$ onto the source, scaled by the target marginals $b^t$; writing $P$ for the transport matrix from source to target, this projection can be computed as $P\,\mathrm{diag}(1/b^t)F^t$; (ii) the F1 score when assessing class transfer (among 11 possible classes), computed between the original source vector of labels $l^s$, taken in $\{1,\dots,11\}^n$, and the inferred labels for the same points, predicted for each $i$ by taking $\mathrm{argmax}_j\,B_{i,j}$, where $B$ is an $n \times 11$ matrix of row probabilities, each row being the barycentric projection of the target's one-hot encoded labels $L^t \in \{0,1\}^{m\times 11}$, i.e. $B := P\,\mathrm{diag}(1/b^t)L^t$.
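A short sketch of these two evaluation steps is given below, assuming a dense plan `P` for readability; with a low-rank plan one would instead apply `Q diag(1/g) (R.T @ X)` wherever `P @ X` appears, and the F1 scores can then be computed from the predicted labels with any standard implementation. Function names are ours and hypothetical.

```python
import jax.numpy as jnp

def barycentric_projection(P, bt, Ft):
    # Project target features onto source points: P diag(1/b^t) F^t.
    return P @ (Ft / bt[:, None])

def mean_pearson(P, bt, Fs, Ft):
    # Average, over genes, of the Pearson correlation between measured and projected expression.
    proj = barycentric_projection(P, bt, Ft)
    Fs_c = Fs - Fs.mean(axis=0)
    pr_c = proj - proj.mean(axis=0)
    corr = (Fs_c * pr_c).sum(axis=0) / (
        jnp.linalg.norm(Fs_c, axis=0) * jnp.linalg.norm(pr_c, axis=0))
    return corr.mean()

def transferred_labels(P, bt, Lt_onehot):
    # Class transfer: argmax over the projected one-hot labels B = P diag(1/b^t) L^t.
    B = barycentric_projection(P, bt, Lt_onehot)
    return jnp.argmax(B, axis=1)
```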
Experiment 1: Benchmarking the Translation Invariant Variant. We evaluate the effect of the proposed TI procedure on the computational cost of ULR solvers: we compare the time taken when solving unbalanced LR problems with or without the TI objective. In Figure 3, we compare the execution time (using our ott-jax implementation, and a single NVIDIA GeForce RTX 2080 Ti card) of unbalanced LR Sinkhorn on large and high-dimensional Gaussian distributions. The results presented are averaged over 10 random seeds, with error bars. We use a convergence threshold of $\delta = 10^{-9}$ and a maximal number of 1000 iterations for Dykstra, in 64-bit precision. We observe that the use of our proposed TI objective is consistently beneficial when solving ULR problems. See also Appendix B.3 for additional experiments.

Figure 3: Execution time of unbalanced LR Sinkhorn, with (Alg. 3) or without (Alg. 5) the TI variant. We fix the rank to $r = 10$; $n$ points (displayed in thousands) are sampled from two Gaussian distributions in $d = 30$, of means respectively 1.2 and 1.3, and standard deviations 1 and 0.2. (left) uses large marginal penalties $\tau_1, \tau_2$ (close to balanced); (right) uses smaller values (more unbalanced). We use the same convergence threshold for the outer loop for all sample sizes. As $n$ gets bigger, this results in a relatively looser threshold, explaining why timings can slightly decrease w.r.t. $n$. What matters is, therefore, the comparative performance of TI vs. non-TI for a fixed $n$, not the behaviour w.r.t. $n$.

Table 1: Exp. 2, results for the spatial transcriptomics dataset (brain coronal section from Shi et al. [2022]).

| solver | mass % | ρ val | ρ test | F1 macro | F1 micro | F1 weighted |
|---|---|---|---|---|---|---|
| LOT | 1.000 | 0.282 | 0.386 | 0.210 | 0.411 | 0.360 |
| ULOT | 0.889 | 0.301 | 0.409 | 0.200 | 0.425 | 0.363 |
| LGW | 1.000 | 0.227 | 0.288 | 0.487 | 0.716 | 0.692 |
| ULGW | 1.001 | 0.222 | 0.287 | 0.463 | 0.701 | 0.665 |
| LFGW | 1.000 | 0.365 | 0.443 | 0.576 | 0.720 | 0.714 |
| ULFGW | 0.443 | 0.379 | 0.463 | 0.582 | 0.733 | 0.724 |

Experiment 2: ULOT vs. LOT on Gene Expression / Cell Type Annotation. We evaluate the accuracy of ULOT solvers on a large-scale spatial transcriptomics task, using gene expression mapping and cell type annotation. We compare them to the balanced LR alternatives using the Pearson correlation described in the Metrics paragraph. We leverage two coronal sections of the mouse brain profiled with STARmap spatial transcriptomics by [Shi et al., 2022]. They consist of $n \approx 40{,}000$ cells in both the source and target brain sections. Each cell is described by 1000 gene features, in addition to 2D spatial coordinates. As a result, $A$ and $B$ are 40k × 40k matrices, and the fused term $C$ is a squared-Euclidean distance matrix on a 30-dimensional PCA space computed on the gene expression space. We selected 10 marker genes for the validation and test sets from the HPF_CA cluster. We run an extensive grid search, as reported in Appendix B.2; we pick the best hyperparameter combination using performance on the 10 validation genes as a criterion, and we report that metric on the other genes in Table 1, as well as qualitative results in Figure 1 and Figure 2. Clearly, ULFGW is the best-performing solver across all metrics. Interestingly, ULOT does not consistently outperform its balanced version, and unbalancedness seems to hurt performance for the LGW solvers. Nevertheless, both solvers display inconsistent performance across metrics, whereas ULFGW and LFGW are consistently superior to the rest of the solvers. These results highlight how the flexibility given by the FGW formulation to leverage common and disparate geometries, paired with the unbalancedness relaxation, can provide state-of-the-art algorithms for matching problems in large-scale, real-world biological settings.

Experiment 3: ULOT vs. UEOT. We compare the performance of ULOT solvers to their unbalanced entropic alternatives (UEOT). We use the same datasets as in Exp. 2, but must pick a smaller subset (olfactory bulb) to avoid OOM errors for entropic UGW solvers, since they cannot handle the 40k sizes considered in Exp. 2 (see Appendix B.1). This results in $n \approx 20{,}000$ source and $m \approx 15{,}000$ target cells, and 1000 genes. Similarly to Exp. 2, the fused term $C$ is a squared-Euclidean distance matrix on a 30-dimensional PCA space computed on gene expressions. As in Exp. 2, we select 10 marker genes for the validation set and 10 genes for the test set, from cluster OB_1. We run an extensive grid search, as in Exp. 2 and Appendix B.2.

Figure 4: Visualization of measured and predicted right auditory click contrast maps in Exp. 4.

Table 2: Exp. 3, results for the spatial transcriptomics dataset (olfactory bulb section from Shi et al. [2022]).

| solver | mass % | ρ val | ρ test | F1 macro | F1 micro | F1 weighted |
|---|---|---|---|---|---|---|
| UEOT | 1.012 | 0.368 | 0.479 | 0.511 | 0.763 | 0.751 |
| LOT | 1.000 | 0.335 | 0.440 | 0.511 | 0.760 | 0.751 |
| ULOT | 0.998 | 0.356 | 0.461 | 0.518 | 0.770 | 0.762 |
| UEFGW | 1.015 | 0.343 | 0.475 | 0.564 | 0.839 | 0.831 |
| LFGW | 1.000 | 0.348 | 0.453 | 0.512 | 0.762 | 0.753 |
| ULFGW | 0.339 | 0.368 | 0.491 | 0.556 | 0.826 | 0.818 |

Table 2 shows that ULFGW outperforms entropic solvers w.r.t. the Pearson correlation $\rho$, but is worse when considering F1 scores. On the other hand, ULFGW confirms its superiority compared to its balanced alternative LFGW. Taken together, these results suggest that while unbalanced LR solvers are on par with unbalanced entropic solvers in terms of performance in small data regimes, they remain much faster and can unlock applications of unbalanced OT at larger scales.
Experiment 4: ULOT to align brain meshes. In this experiment, we compare the performance of our ULFGW solver to FUGW-sparse [Thual et al., 2022], an alternative approach to solving unbalanced FGW problems that uses a two-scale (coarse/fine grid) strategy to handle large sample sizes. This method was demonstrated to be effective in aligning brain anatomies, encompassing both mesh structures and functional signals associated with each vertex. For their empirical analysis, they use the Individual Brain Charting dataset [Pinho et al., 2018].

Table 3: Results on the brain anatomy with functional signal data from Pinho et al. [2018] in Exp. 4.

| solver | mass | ρ val | ρ test |
|---|---|---|---|
| FUGW-sparse | 0.999 | 0.492 | 0.472 |
| LFGW | 1.000 | 0.513 | 0.663 |
| ULFGW | 0.981 | 0.533 | 0.643 |

In the absence of other information in the original paper, we draw inspiration from Pinho et al.'s smaller-scale notebook implementations: we embed the $n \approx 160{,}000$ vertices of the fsaverage7 mesh into a 30-dimensional space, using an approximation of the geodesic distances with landmark multi-dimensional scaling [De Silva and Tenenbaum, 2004], where 2048 points were used as anchors. Each vertex has an associated functional signal that entails 22 features. For both the quadratic and linear terms, we compute the costs based on the squared Euclidean distance. The coarse grid for FUGW-sparse is built using one-tenth of $n$, i.e. 16k points. We evaluate all methods by comparing the performance of the best hyperparameter combination, based on the average correlation between the barycentric projection and the ground-truth value of 5 features, across a test set of 5 contrast maps. In Table 3, we observe that ULFGW and LFGW outperform FUGW-sparse. In this setting, there is no clear evidence that the unbalanced version performs better than its balanced counterpart for low-rank methods. See also Appendix B.2 for additional experimental details and results.

Conclusion. The practical success of OT methods in the natural sciences demonstrates the relevance of OT to their analysis pipelines. Practitioners must, however, often deal with the poor scalability of OT algorithms, as well as their rigid assumptions w.r.t. mass conservation. While low-rank OT approaches hold the promise of scaling OT methods to large sizes, unbalanced formulations have proved useful to relax mass conservation for entropic OT solvers. We have proposed in this paper to merge these two strains, and demonstrated the practical relevance of these unbalanced low-rank solvers on various challenging alignment tasks.

## References

Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.

Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.

Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.

Heinz H. Bauschke and Patrick L. Combettes. A Dykstra-like algorithm for two monotone operators. Pacific Journal of Optimization, 4(3):383–391, 2008.

Heinz H. Bauschke and Adrian S. Lewis. Dykstra's algorithm with Bregman projections: a convergence proof. Optimization, 48(4):409–427, 2000.

Andrew J. Blumberg, Mathieu Carriere, Michael A. Mandell, Raul Rabadan, and Soledad Villar. MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data. arXiv preprint arXiv:2001.01666, 2020.
Emmanuel Boissard and Thibaut Le Gouic. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. In Annales de l'IHP Probabilités et Statistiques, volume 50, pages 539–563, 2014.

Charlotte Bunne, Stefan G. Stark, Gabriele Gut, Jacobo Sarabia del Castillo, Kjong-Van Lehmann, Lucas Pelkmans, Andreas Krause, and Gunnar Rätsch. Learning single-cell perturbation responses using neural optimal transport. bioRxiv, 2021. doi: 10.1101/2021.12.15.472775. URL https://www.biorxiv.org/content/early/2021/12/15/2021.12.15.472775.

Charlotte Bunne, Andreas Krause, and Marco Cuturi. Supervised training of conditional Monge maps. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

Zixuan Cang, Yanxiang Zhao, Axel A. Almet, Adam Stabell, Raul Ramos, Maksim V. Plikus, Scott X. Atwood, and Qing Nie. Screening cell–cell communication in spatial transcriptomics via collective optimal transport. Nature Methods, 20(2):218–228, 2023.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.

Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: dynamic and Kantorovich formulations. Journal of Functional Analysis, 274(11):3090–3123, 2018.

Lénaïc Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33, 2020.

Samir Chowdhury, David Miller, and Tom Needham. Quantized Gromov-Wasserstein. arXiv preprint arXiv:2104.02013, 2021.

Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. Optimal Transport Tools (OTT): a JAX toolbox for all things Wasserstein. arXiv preprint arXiv:2201.12324, 2022.

Vin De Silva and Joshua B. Tenenbaum. Sparse multidimensional scaling using landmark points. Technical report, Stanford University, 2004.

Pinar Demetci, Rebecca Santorella, Björn Sandstede, William Stafford Noble, and Ritambhara Singh. Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv, 2020. doi: 10.1101/2020.04.28.066787.

Richard Mansfield Dudley et al. Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois Journal of Mathematics, 10(1):109–126, 1966.

Richard L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.

Kilian Fatras, Thibault Sejourne, Rémi Flamary, and Nicolas Courty. Unbalanced minibatch optimal transport; applications to domain adaptation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3186–3197. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/fatras21a.html.

Jean Feydy, Pierre Roussillon, Alain Trouvé, and Pietro Gori. Fast and scalable optimal transport for brain tractograms. In Medical Image Computing and Computer Assisted Intervention, MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, pages 636–644. Springer, 2019.
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

Aden Forrow, Jan-Christian Hütter, Mor Nitzan, Philippe Rigollet, Geoffrey Schiebinger, and Jonathan Weed. Statistical optimal transport via factored couplings, 2018.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.

Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, 2019.

Piotr Indyk, Ali Vakilian, Tal Wagner, and David Woodruff. Sample-optimal low-rank approximation of distance matrices, 2019.

Dominik Klein, Giovanni Palla, Marius Lange, Michal Klein, Zoe Piran, Manuel Gander, Laetitia Meng-Papaxanthos, Michael Sterr, Aimee Bastidas-Ponce, Marta Tarquis-Medina, Heiko Lickert, Mostafa Bakhti, Mor Nitzan, Marco Cuturi, and Fabian J. Theis. Mapping cells through time and space with moscot. bioRxiv, 2023. doi: 10.1101/2023.05.11.540374. URL https://www.biorxiv.org/content/early/2023/05/11/2023.05.11.540374.

Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

Khang Le, Huy Nguyen, Quang M. Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust optimal transport: computational complexity and barycenter computation. Advances in Neural Information Processing Systems, 34:21947–21959, 2021.

Jiajin Li, Jianheng Tang, Lemin Kong, Huikang Liu, Jia Li, Anthony Man-Cho So, and Jose Blanchet. A convergent single-loop algorithm for relaxation of Gromov-Wasserstein in graph data. In The Eleventh International Conference on Learning Representations, 2023.

Facundo Mémoli. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.

Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. Advances in Neural Information Processing Systems, 32, 2019.

Mor Nitzan, Nikos Karaiskos, Nir Friedman, and Nikolaus Rajewsky. Gene expression cartography. Nature, 576(7785), 2019.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672, 2016.

Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6), 2019. ISSN 1935-8245.
Khiem Pham, Khang Le, Nhat Ho, Tung Pham, and Hung Bui. On unbalanced optimal transport: an analysis of Sinkhorn algorithm. In International Conference on Machine Learning, pages 7673–7682. PMLR, 2020.

Ana Luísa Pinho, Alexis Amadon, Torsten Ruest, Murielle Fabre, Elvis Dohmatob, Isabelle Denghien, Chantal Ginisty, Séverine Becuwe-Desmidt, Séverine Roger, Laurence Laurier, Véronique Joly-Testault, Gaëlle Médiouni-Cloarec, Christine Doublé, Bernadette Martins, Philippe Pinel, Evelyn Eger, Gael Varoquaux, Christophe Pallier, Stanislas Dehaene, Lucie Hertz-Pannier, and Bertrand Thirion. Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping. Scientific Data, 5:180105, June 2018. doi: 10.1038/sdata.2018.105. URL https://hal.science/hal-01817528.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, November 2000.

Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022.

Ryoma Sato, Marco Cuturi, Makoto Yamada, and Hisashi Kashima. Fast and robust comparison of probability measures in heterogeneous spaces. arXiv preprint arXiv:2002.01615, 2020.

Meyer Scetbon and Marco Cuturi. Linear time Sinkhorn divergences using positive features. Advances in Neural Information Processing Systems, 33:13468–13480, 2020.

Meyer Scetbon and Marco Cuturi. Low-rank optimal transport: approximation, statistics and debiasing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 6802–6814. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/2d69e771d9f274f7c624198ea74f5b98-Paper-Conference.pdf.

Meyer Scetbon, Marco Cuturi, and Gabriel Peyré. Low-rank Sinkhorn factorization. In International Conference on Machine Learning, pages 9344–9354. PMLR, 2021.

Meyer Scetbon, Gabriel Peyré, and Marco Cuturi. Linear-time Gromov-Wasserstein distances using low rank couplings and costs. ICML, 2022.

Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.

Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. The unbalanced Gromov-Wasserstein distance: conic formulation and relaxation. Advances in Neural Information Processing Systems, 34:8766–8779, 2021.

Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. Faster unbalanced optimal transport: translation invariant Sinkhorn and 1-D Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4995–5021. PMLR, 2022.

Hailing Shi, Yichun He, Yiming Zhou, Jiahao Huang, Brandon Wang, Zefang Tang, Peng Tan, Morgan Wu, Zuwan Lin, Jingyi Ren, Yaman Thapa, Xin Tang, Albert Liu, Jia Liu, and Xiao Wang. Spatial atlas of the mouse central nervous system at molecular resolution. bioRxiv, 2022. doi: 10.1101/2022.06.20.496914. URL https://www.biorxiv.org/content/early/2022/06/22/2022.06.20.496914.

Justin Solomon, Gabriel Peyré, Vladimir G. Kim, and Suvrit Sra. Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (TOG), 35(4):1–13, 2016.
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse Sinkhorn attention. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9438–9447. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/tay20a.html.

Alexis Thual, Huy Tran, Tatiana Zemskova, Nicolas Courty, Rémi Flamary, Stanislas Dehaene, and Bertrand Thirion. Aligning individual brains with fused unbalanced Gromov-Wasserstein. arXiv, 2022.

Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018.

Cédric Vincent-Cuaz, Rémi Flamary, Marco Corneli, Titouan Vayer, and Nicolas Courty. Semi-relaxed Gromov-Wasserstein divergence and applications on graphs. In International Conference on Learning Representations, 2023.

F. Alexander Wolf, Philipp Angerer, and Fabian J. Theis. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 2018.

Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable Gromov-Wasserstein learning for graph partitioning and matching. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/6e62a992c676f611616097dbea8ea030-Paper.pdf.