Low-Rank Sinkhorn Factorization

Meyer Scetbon (CREST, ENSAE), Marco Cuturi (CREST, ENSAE; Google Brain), Gabriel Peyré (École Normale Supérieure, PSL University; CNRS)

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

Several recent applications of optimal transport (OT) theory to machine learning have relied on regularization, notably entropy and the Sinkhorn algorithm. Because matrix-vector products are pervasive in the Sinkhorn algorithm, several works have proposed to approximate kernel matrices appearing in its iterations using low-rank factors. Another route lies instead in imposing low-nonnegative-rank constraints on the feasible set of couplings considered in OT problems, with no approximation of cost or kernel matrices. This route was first explored by Forrow et al. (2018), who proposed an algorithm tailored for the squared Euclidean ground cost, using a proxy objective that can be solved through the machinery of regularized 2-Wasserstein barycenters. Building on this, we introduce in this work a generic approach that aims at solving, in full generality, the OT problem under low-nonnegative-rank constraints with arbitrary costs. Our algorithm relies on an explicit factorization of low-rank couplings as a product of sub-coupling factors linked by a common marginal; similar to an NMF approach, we alternately update these factors. We prove the non-asymptotic stationary convergence of this algorithm and illustrate its efficiency on benchmark experiments.

1. Introduction

By providing a simple and comprehensive framework to compare probability distributions, optimal transport (OT) theory has inspired many developments in machine learning (Peyré & Cuturi, 2019). A flurry of works have recently connected it to other trending topics, such as normalizing flows or convex neural networks (Makkuva et al., 2020; Korotin et al., 2021; Tong et al., 2020), while the scope of its applications now reaches several fields of science, such as single-cell biology (Schiebinger et al., 2019; Yang et al., 2020), imaging (Schmitz et al., 2018; Heitz et al., 2020) or neuroscience (Janati et al., 2020; Koundal et al., 2020).

Challenges when computing OT. Solving optimal transport problems at scale poses, however, formidable challenges. The most obvious among them is computational: instantiating the Kantorovich (1942) problem on discrete measures of size n requires solving a linear program (LP) of complexity O(n³ log n). A second and equally important challenge lies in the statistical performance of using that LP to estimate OT between densities: the LP solution between i.i.d. samples converges very slowly to that between densities (Fournier & Guillin, 2015). It is now increasingly clear that regularizing OT in some way or another is the only way to mitigate these two issues (Genevay et al., 2018; Chizat et al., 2020; Clason et al., 2021). A popular approach consists in penalizing the OT problem with a strongly convex function of the coupling (Cuturi, 2013; Dessein et al., 2018). We explore in this work an alternative and more direct approach to add regularity: we restrict, instead, the set of feasible couplings to have a small nonnegative rank.

Low-Rank Kernel Factorization. Low-rank factorizations are not new to regularized OT.
They have been used to speed up the resolution of entropy-regularized OT with the Sinkhorn algorithm, pending some approximations: given a data-dependent n × m cost matrix C, the Sinkhorn iterations consist in matrix-vector products of the form Kv or Kᵀu, where K := exp(−C/ε) and u, v are n- and m-dimensional vectors. Altschuler et al. (2018) and Altschuler & Boix-Adsera (2020) have proposed to approximate the kernel K with a product of thin rank-r matrices, K̃ = ABᵀ. Naturally, the ability to approximate K with a low-rank K̃ degrades as ε decreases, making this approach valid only for sufficiently large ε. Thanks to this approximation, however, each Sinkhorn iteration is linear in n or m, and the coupling output by the Sinkhorn algorithm is of the form P̃ = C̃D̃ᵀ, where C̃ = diag(u)A and D̃ = diag(v)B. This approximation therefore yields a low-rank solution that is not, however, rigorously optimal for the original problem defined by K, but rather for that defined by K̃. The solution obtained with K̃ can nevertheless be made arbitrarily close to the true solution by increasing the rank. Similarly, Scetbon & Cuturi (2020) consider instead nonnegative low-rank approximations for K of the form K̃ = QRᵀ, where Q, R ≥ 0.

Figure 1. Two Gaussian mixture densities evaluated on n = 200 and m = 220 sized grids in 1D, displayed as blue/red curves. Between them, n × m optimal coupling matrices obtained by our proposed low-rank OT method for varying rank constraint values r (in increasing order, top row: r = 10, 50, 100) and by the Sinkhorn algorithm for various ε (in decreasing order, bottom row: ε = 0.05, 0.005, 0.001, 0.0005). The ground cost is the 1.5-norm.

Low-Nonnegative-Rank Couplings. To our knowledge, only Forrow et al. (2018) have used low-rank considerations for couplings, rather than for costs or kernels. Their work studies the case where the ground cost is the squared Euclidean distance. They obtain for that cost a proxy for rank-constrained OT problems using 2-Wasserstein barycenters (Agueh & Carlier, 2011). Their algorithm blends those in (Cuturi & Doucet, 2014; Benamou et al., 2015) and results in an intuitive mass transfer plan that goes through a small number r of points, where r is the coupling's nonnegative rank.

Our Contributions. In this work, we tackle directly the low-rank problem formulated by Forrow et al. (2018) but make no assumption on the cost matrix; we address instead the low-rank OT problem in its full generality. We consider couplings P = Q diag(1/g) Rᵀ decomposed as the product of two sub-couplings Q, R with common right marginal g, and left marginals given by those of P on each side. Each of these sub-couplings minimizes a transport cost that involves the original cost matrix C and the other sub-coupling. We handle this problem by optimizing jointly over Q, R and g using a mirror-descent approach. We prove the non-asymptotic stationary convergence of this approach. In addition, we show that the time complexity of our algorithm can become linear when exploiting low-rank assumptions on the cost (not the kernel) involved in the OT problem.
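To make the factored-coupling parameterization concrete, the following minimal NumPy sketch (our illustration, not the paper's code; the sub-couplings chosen here are simply the independent ones) assembles a coupling P = Q diag(1/g) Rᵀ from sub-couplings sharing the inner marginal g and checks that P has the prescribed outer marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 3

a = rng.random(n); a /= a.sum()          # left marginal
b = rng.random(m); b /= b.sum()          # right marginal
g = rng.random(r); g /= g.sum()          # common inner marginal, g > 0

# Simplest feasible sub-couplings: the independent ones, Q = a g^T in
# Pi_{a,g} and R = b g^T in Pi_{b,g}; any Q, R with these marginals work.
Q = np.outer(a, g)
R = np.outer(b, g)

# Factored coupling P = Q diag(1/g) R^T; its nonnegative rank is <= r.
P = (Q / g) @ R.T

assert np.allclose(P.sum(axis=1), a)     # P 1_m = a
assert np.allclose(P.sum(axis=0), b)     # P^T 1_n = b
```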
Differences with previous work. Our approach borrows ideas from (Forrow et al., 2018) but is generic, as it applies to all ground costs. It constrains the nonnegative rank of the coupling solution P by construction, rather than relying on a low-rank approximation K̃ of the kernel K = e^{−C/ε}. This is a crucial point, because the ability to approximate K with a low-rank K̃ degrades significantly as ε decreases. By contrast, our approach applies to all ranks, small and large. Interestingly, we also show that a low-rank assumption on the cost matrix (not on the kernel) can also be leveraged, providing therefore a best-of-both-worlds scenario in which both the coupling's and the cost's (not the kernel's) low-rank properties can be enforced and exploited. Finally, a useful parallel can be drawn between our approach and that of the vanilla Sinkhorn algorithm, in the sense that they propose different regularization schemes. Indeed, the (discrete) path of solutions obtained by our algorithm when varying r between 1 and min(n, m) can be seen as an alternative to the entropic regularization path. Both paths contain at their extremes the original OT solution (maximal rank and minimal entropy) and the product of the marginals (minimal rank and maximal entropy), as illustrated in Fig. 1.

2. Discrete Optimal Transport

OT as a linear program. Let a and b be two histograms in Δ_n and Δ_m, the probability simplices of respective sizes n and m. Assuming a > 0 and b > 0, set X := (x₁, . . . , x_n) and Y := (y₁, . . . , y_m), two families of points taken each within arbitrary sets, and define the discrete distributions μ := Σᵢ₌₁ⁿ aᵢ δ_{xᵢ} and ν := Σⱼ₌₁ᵐ bⱼ δ_{yⱼ}. The set of couplings with marginals a, b is

Π_{a,b} := {P ∈ ℝ₊^{n×m} s.t. P1_m = a, Pᵀ1_n = b}.

Given a cost function c defined on pairs of points in X, Y and writing C := [c(xᵢ, yⱼ)]_{i,j} for its associated matrix, the optimal transport (OT) problem can be written as follows:

OT(μ, ν) := min_{P ∈ Π_{a,b}} ⟨C, P⟩.   (1)

Entropic regularization. Several works have shown recently (Genevay et al., 2018; Chizat et al., 2020) that when X and Y are sampled from a continuous space, it is preferable to regularize (1) using, for instance, an entropic regularizer (Cuturi, 2013), to achieve both better computational and statistical efficiency:

OT_ε(μ, ν) := min_{P ∈ Π_{a,b}} ⟨C, P⟩ − ε H(P),   (2)

where ε ≥ 0 and H is the Shannon entropy, defined as H(P) := −Σ_{ij} P_{ij}(log P_{ij} − 1). As ε goes to 0, one recovers the classical OT problem, and for any ε > 0, Eq. (2) becomes ε-strongly convex on Π_{a,b} and admits a unique solution P_ε of the form

P_ε = diag(u_ε) K diag(v_ε),   (3)

where K := exp(−C/ε). Cuturi (2013) shows that the scaling vectors u_ε and v_ε can be obtained efficiently thanks to the Sinkhorn algorithm (see Alg. 1, where ⊙ and / denote entry-wise operations). Each iteration can be performed in O(nm) algebraic operations, as it involves only matrix-vector products. The number of Sinkhorn iterations needed to converge to a precision δ (monitored by the difference between the column sums of diag(u)K diag(v) and b) is controlled by the scale of the elements of C relative to ε (Franklin & Lorenz, 1989). That convergence deteriorates with smaller ε, as studied in more detail by (Altschuler et al., 2017; Dvurechensky et al., 2018).

Algorithm 1 Sinkhorn(K, a, b, δ)
Inputs: K, a, b, δ, u
repeat
  v ← b / (Kᵀu), u ← a / (Kv)
until ‖u ⊙ Kv − a‖₁ + ‖v ⊙ Kᵀu − b‖₁ < δ
Result: u, v
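Alg. 1 admits a direct transcription into a few lines of code; below is a minimal NumPy sketch (ours, not the authors' implementation), assuming a dense cost matrix C and strictly positive marginals a, b.

```python
import numpy as np

def sinkhorn(C, a, b, eps, delta=1e-6, max_iter=10_000):
    """Minimal Sinkhorn scaling loop: returns u, v such that
    P = diag(u) K diag(v) with K = exp(-C / eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(max_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
        # stop when both marginals of diag(u) K diag(v) match a and b
        if (np.abs(u * (K @ v) - a).sum()
                + np.abs(v * (K.T @ u) - b).sum()) < delta:
            break
    return u, v
```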
Mirror descent and ε schedule. A possible interpretation of the entropic regularization in the OT problem is that it can be seen as the k_ε-th update of a mirror descent (MD) algorithm applied to the objective (1), where k_ε ≥ 1 depends on ε and on the gradient steps used in the MD. Several works have exploited this link, gradually decreasing ε to obtain a better approximation of the unregularized OT problem (Schmitzer, 2019; Lin et al., 2019; Xie et al., 2020). More precisely, the MD algorithm associated with the Kullback-Leibler divergence (KL) applied to the objective (1) makes, for all k ≥ 0, the following update:

Q_{k+1} := argmin_{Q ∈ Π_{a,b}} ⟨C, Q⟩ + (1/γ_k) KL(Q, Q_k),   (4)

where (γ_k)_{k≥0} is a sequence of positive real numbers, Q₀ ∈ Π_{a,b} is an initial point, and KL is the Kullback-Leibler divergence, KL(P, Q) := Σ_{ij} P_{ij} log(P_{ij}/Q_{ij}). If Q₀ := abᵀ, then one obtains that for all k ≥ 0, updating the coupling according to Eq. (4) is the same as solving Q_{k+1} := argmin_{Q ∈ Π_{a,b}} ⟨C, Q⟩ − ε_k H(Q), where ε_k := (Σ_{j=0}^{k} γ_j)^{−1}. Therefore the MD algorithm applied to (1) produces the sequence (P_{ε_k})_{k≥0} of optimal couplings for the objective (2). We show next that this viewpoint can also be applied when one adds some structure to the couplings considered in the OT problem (1), leading to a new regularized approach.

3. Nonnegative Factorization of the Optimal Transport Problem

Here we aim at regularizing the OT problem by decomposing the couplings involved into a product of two low-rank couplings. We introduce the associated non-convex problem and develop a mirror-descent algorithm which operates by solving a succession of convex programs.

3.1. Low-Rank and Factored Couplings

We introduce low-rank couplings and explain how they can be parameterized as factored couplings.

Definition 1. Given M ∈ ℝ^{n×m}, the nonnegative rank of M is the smallest number of nonnegative rank-one matrices into which the matrix can be decomposed additively:

rk₊(M) := min {q : M = Σᵢ₌₁^q Rᵢ, ∀i, rk(Rᵢ) = 1, Rᵢ ≥ 0}.

Let r ≥ 1, and denote Π_{a,b}(r) := {P ∈ Π_{a,b} : rk₊(P) ≤ r}. From Definition 1, any P ∈ Π_{a,b}(r) can be written as P = Σᵢ₌₁^r gᵢ qᵢ rᵢᵀ with qᵢ ∈ Δ_n, rᵢ ∈ Δ_m for all i, g ∈ Δ_r, Σᵢ gᵢqᵢ = a and Σᵢ gᵢrᵢ = b, from which we deduce directly that Π_{a,b}(r) is compact. Moreover, for g ∈ Δ_r* := {h ∈ Δ_r s.t. ∀i, hᵢ > 0}, we write

Π_{a,g,b} := {P ∈ ℝ₊^{n×m} : P = Q diag(1/g) Rᵀ, Q ∈ Π_{a,g} and R ∈ Π_{b,g}}.

Note that Π_{a,g,b} is compact and a subset of Π_{a,b}(r), since for all P ∈ Π_{a,g,b} one has P ∈ Π_{a,b} and rk(P) ≤ rk₊(P) ≤ r. Moreover, for any P ∈ Π_{a,b} such that rk₊(P) ≤ r, there exist g ∈ Δ_r*, Q ∈ Π_{a,g} and R ∈ Π_{b,g} such that P = Q diag(1/g) Rᵀ (Cohen & Rothblum, 1993). Therefore

∪_{g ∈ Δ_r*} Π_{a,g,b} = Π_{a,b}(r).   (5)

We exploit next this identity to build an efficient algorithm to solve the optimal transport problem under low-nonnegative-rank constraints.

3.2. The Low-Rank OT Problem (LOT)

The problem of interest in this work is

LOT_r(μ, ν) := min_{P ∈ Π_{a,b}(r)} ⟨C, P⟩.   (6)

Here the minimum is always attained, as Π_{a,b}(r) is compact and the objective is continuous. Thanks to (5), problem (6) is equivalent to

min_{(Q,R,g) ∈ C(a,b,r)} ⟨C, Q diag(1/g) Rᵀ⟩,   (7)

where C(a, b, r) := C₁(a, b, r) ∩ C₂(r), with

C₁(a, b, r) := {(Q, R, g) ∈ ℝ₊^{n×r} × ℝ₊^{m×r} × (ℝ₊*)^r s.t. Q1_r = a, R1_r = b},
C₂(r) := {(Q, R, g) ∈ ℝ₊^{n×r} × ℝ₊^{m×r} × ℝ₊^r s.t. Qᵀ1_n = Rᵀ1_m = g}.

In the following, we also consider a regularized version of problem (7), obtained by adding an entropic term to the objective, which leads for all ε ≥ 0 to

LOT_{r,ε}(μ, ν) := inf_{(Q,R,g) ∈ C(a,b,r)} ⟨C, Q diag(1/g) Rᵀ⟩ − ε H((Q, R, g)).   (8)

Here the entropy of (Q, R, g) is understood as the sum H(Q) + H(R) + H(g) of the three respective entropies evaluated on each term.
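A practical remark on evaluating this objective (our observation, which follows from the definitions above): the transport cost ⟨C, Q diag(1/g) Rᵀ⟩ equals Σᵢ [QᵀCR]ᵢᵢ / gᵢ, so it can be computed without ever materializing the n × m coupling. A minimal NumPy sketch:

```python
import numpy as np

def lot_cost(C, Q, R, g):
    """<C, Q diag(1/g) R^T> = sum_i [Q^T C R]_{ii} / g_i,
    computed without forming the n x m coupling."""
    diag_QtCR = np.einsum('ir,ij,jr->r', Q, C, R, optimize=True)
    return float(np.sum(diag_QtCR / g))
```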
We will see that adding an entropic term to the objective helps stabilize the MD scheme employed to solve (6). For all ε ≥ 0, the objective function defined in (8) is lower semi-continuous, and therefore admits a minimum on C̄₁(a, b, r) ∩ C₂(r), where C̄₁(a, b, r) is the closure of C₁(a, b, r). However, the existence of a solution to problem (8) itself requires more care, as shown in the following proposition.

Proposition 1. If ε = 0, then the infimum of (8) is always attained. If ε > 0, then for r = 1 the infimum of (8) is attained, and for r ≥ 2, problem (8) admits a minimum if LOT_{r,ε}(μ, ν) < LOT_{r−1,ε}(μ, ν).

Stabilized Formulation using Lower Bounds. In order to ensure the stability of the mirror-descent method, and to enable its theoretical analysis, we introduce a lower bound on the weight vector g. Let us assume in the following that (r, ε) satisfies the conditions of Proposition 1; in particular, if ε = 0, r can be chosen arbitrarily and we recover the problem defined in (6). Under this assumption, there exists a solution (Q_ε*, R_ε*, g_ε*) ∈ C₁(a, b, r) ∩ C₂(r) of Eq. (8), from which follows the existence of 1/r ≥ α* > 0 such that g_ε* ≥ α* coordinate-wise. Let us now define, for any 1/r ≥ α > 0, the set

C₁(a, b, r, α) := {(Q, R, g) ∈ ℝ₊^{n×r} × ℝ₊^{m×r} × (ℝ₊*)^r s.t. Q1_r = a, R1_r = b, g ≥ α}.

Then, if α is sufficiently small (i.e. α ≤ α*), problem (8) is equivalent to

LOT_{r,ε,α}(μ, ν) = min_{(Q,R,g) ∈ C(a,b,r,α)} ⟨C, Q diag(1/g) Rᵀ⟩ − ε H((Q, R, g)),   (9)

where C(a, b, r, α) := C₁(a, b, r, α) ∩ C₂(r). Note that for any 1/r ≥ α > 0, the constraint set is non-empty and compact, so the minimum always exists.

3.3. Mirror Descent Optimization Scheme

Figure 2. Comparison of the Sinkhorn algorithm with our proposed mirror-descent outer loop.

We propose to use a mirror-descent scheme with a KL divergence to solve Eq. (9). It leads, for all k ≥ 0, to the following updates, which require solving a convex problem at each step:

(Q_{k+1}, R_{k+1}, g_{k+1}) := argmin_{ξ ∈ C(a,b,r,α)} KL(ξ, ξ_k),   (10)

where (Q₀, R₀, g₀) ∈ C(a, b, r, α) is an initial point such that Q₀ > 0 and R₀ > 0, and ξ_k := (ξ_k^{(1)}, ξ_k^{(2)}, ξ_k^{(3)}) with

ξ_k^{(1)} := exp(−γ_k C R_k diag(1/g_k) − (γ_k ε − 1) log(Q_k)),
ξ_k^{(2)} := exp(−γ_k Cᵀ Q_k diag(1/g_k) − (γ_k ε − 1) log(R_k)),
ξ_k^{(3)} := exp(γ_k ω_k / g_k² − (γ_k ε − 1) log(g_k)),

where [ω_k]ᵢ := [Q_kᵀ C R_k]_{i,i} for all i ∈ {1, . . . , r} and (γ_k)_{k≥0} is a sequence of positive step sizes. Note that for all k ≥ 0, (Q_k, R_k, g_k) has strictly positive entries, so ξ_k is well defined and also has strictly positive entries.

Dykstra's inner loop. In order to solve Eq. (10), we use Dykstra's algorithm (Dykstra, 1983). Given a closed convex set C ⊂ ℝ₊^{n×r} × ℝ₊^{m×r} × ℝ₊^r, we denote, for any ξ with strictly positive entries, the projection according to the Kullback-Leibler divergence

P_C^{KL}(ξ) := argmin_{ζ ∈ C} KL(ζ, ξ).

Starting from ξ₀ := ξ and q₀ = q₋₁ = (1_{n×r}, 1_{m×r}, 1_r), Dykstra's algorithm computes, for all j ≥ 0,

ξ_{2j+1} = P^{KL}_{C₁(a,b,r,α)}(ξ_{2j} ⊙ q_{2j−1}),  q_{2j+1} = q_{2j−1} ⊙ ξ_{2j} / ξ_{2j+1},
ξ_{2j+2} = P^{KL}_{C₂(r)}(ξ_{2j+1} ⊙ q_{2j}),  q_{2j+2} = q_{2j} ⊙ ξ_{2j+1} / ξ_{2j+2}.

As C₁(a, b, r, α) and C₂(r) are closed convex sets and ξ has strictly positive entries, one can show that (ξ_j)_{j≥0} converges towards the unique solution of Eq. (10) (Bauschke & Lewis, 2000). The following propositions detail how to compute the relevant projections involved in Dykstra's algorithm.

Proposition 2. For ξ̄ = (Q̄, R̄, ḡ) with strictly positive entries, one has, denoting ĝ := max(ḡ, α) entry-wise,

P^{KL}_{C₁(a,b,r,α)}(ξ̄) = (diag(a / Q̄1_r) Q̄, diag(b / R̄1_r) R̄, ĝ).
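A quick numerical check of this closed form (our sketch; the sizes, α and the random inputs are arbitrary, and the projection formula is the one reconstructed in Proposition 2 above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r, alpha = 5, 4, 3, 1e-5
Qb, Rb, gb = rng.random((n, r)), rng.random((m, r)), rng.random(r)
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()

# KL projection onto C1(a, b, r, alpha): rescale the rows of Q and R to
# match the prescribed left marginals, and clip g from below at alpha.
Q = (a / Qb.sum(axis=1))[:, None] * Qb
R = (b / Rb.sum(axis=1))[:, None] * Rb
g = np.maximum(gb, alpha)

assert np.allclose(Q.sum(axis=1), a) and np.allclose(R.sum(axis=1), b)
assert (g >= alpha).all()
```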
Let us now give the solution of the projection onto C₂(r).

Proposition 3. For ξ̄ = (Q̄, R̄, ḡ) with strictly positive entries, the projection (Q, R, g) = P^{KL}_{C₂(r)}(ξ̄) satisfies

Q = Q̄ diag(g / Q̄ᵀ1_n),  R = R̄ diag(g / R̄ᵀ1_m),  g = (ḡ ⊙ Q̄ᵀ1_n ⊙ R̄ᵀ1_m)^{1/3}.

Efficient computation of the updates. The projections obtained in Propositions 2 and 3 lead to simple updates of the couplings. Indeed, starting from ξ₀ := ξ = (ξ^{(1)}, ξ^{(2)}, ξ^{(3)}), Dykstra's algorithm applied to our problem (10) only needs to compute scaling vectors, as presented in Alg. 2. More precisely, Dykstra's algorithm produces iterates (ξ_j)_{j≥0} which satisfy, for all j ≥ 0, ξ_j = (Q_j, R_j, g_j) with

Q_j = diag(u_j¹) ξ^{(1)} diag(v_j¹),  R_j = diag(u_j²) ξ^{(2)} diag(v_j²),

where the scaling vectors (u_jⁱ, v_jⁱ)_{j≥0}, i ∈ {1, 2}, are initialized as vⁱ₀ = 1_r, qⁱ₀ = 1_r and q^{(3)}₀ = 1_r, and updated, together with g_j, through the iterations detailed in Alg. 2. We denote p₁ := a and p₂ := b to simplify the notation.

Algorithm 2 LR-Dykstra((ξ^{(i)})_{1≤i≤3}, p₁, p₂, α, δ)
Inputs: ξ^{(1)}, ξ^{(2)}, g̃ := ξ^{(3)}, p₁, p₂, α, δ, q^{(3)}₁ = q^{(3)}₂ = 1_r, and for i ∈ {1, 2}: v^{(i)} = 1_r, q^{(i)} = 1_r
repeat
  u^{(i)} ← pᵢ / (ξ^{(i)} v^{(i)}) for i ∈ {1, 2}
  g ← max(α, g̃ ⊙ q^{(3)}₁);  q^{(3)}₁ ← (g̃ ⊙ q^{(3)}₁) / g;  g̃ ← g
  g ← (g̃ ⊙ q^{(3)}₂ ⊙ Πᵢ₌₁² (v^{(i)} ⊙ q^{(i)} ⊙ (ξ^{(i)})ᵀ u^{(i)}))^{1/3}
  ṽ^{(i)} ← g / ((ξ^{(i)})ᵀ u^{(i)}) for i ∈ {1, 2}
  q^{(i)} ← (v^{(i)} ⊙ q^{(i)}) / ṽ^{(i)} for i ∈ {1, 2};  q^{(3)}₂ ← (g̃ ⊙ q^{(3)}₂) / g
  v^{(i)} ← ṽ^{(i)} for i ∈ {1, 2};  g̃ ← g
until Σᵢ₌₁² ‖u^{(i)} ⊙ ξ^{(i)} v^{(i)} − pᵢ‖₁ < δ
Q ← diag(u^{(1)}) ξ^{(1)} diag(v^{(1)});  R ← diag(u^{(2)}) ξ^{(2)} diag(v^{(2)})
Result: Q, R, g

Let us now introduce the proposed MD algorithm applied to (9). Denoting by D(·) the operator extracting the diagonal of a square matrix, we obtain Alg. 3. See also Figure 2 for an illustration of the proposed algorithm.

Algorithm 3 LOT(C, a, b, r, α, δ)
Inputs: C, a, b, (γ_k)_{k≥0}, Q, R, g, α, δ
for k = 1, . . . do
  ξ^{(1)} ← exp(−γ_k C R diag(1/g) − (γ_k ε − 1) log(Q))
  ξ^{(2)} ← exp(−γ_k Cᵀ Q diag(1/g) − (γ_k ε − 1) log(R))
  ω ← D(Qᵀ C R);  ξ^{(3)} ← exp(γ_k ω / g² − (γ_k ε − 1) log(g))
  Q, R, g ← LR-Dykstra((ξ^{(i)})_{1≤i≤3}, a, b, α, δ)   (Alg. 2)
end for
Result: ⟨C, Q diag(1/g) Rᵀ⟩

Computational Cost. The matrices (ξ^{(i)})_{1≤i≤3} considered in Alg. 3 live in ℝ₊^{n×r} × ℝ₊^{m×r} × ℝ₊^r; given those matrices, each iteration of Alg. 2 therefore requires O((n+m)r) algebraic operations, since it involves only matrix-vector multiplications of the form ξ^{(i)}v^{(i)} and (ξ^{(i)})ᵀu^{(i)}. However, without any assumption on the cost matrix C, computing (ξ^{(i)})_{1≤i≤3} requires O(nmr) algebraic operations, since CR and CᵀQ must be evaluated. We show in Section 3.5 how to reduce this quadratic cost to a cost linear in the number of samples, provided the cost matrix can be factored, either exactly (as with a squared Euclidean cost) or approximately (when the cost is a distance). Writing N for the number of iterations of the MD scheme and T for the number of iterations of Alg. 2 at each MD step, we end up with a total computational cost of O(NT(n + m)r + Nnmr).

Remark 1. Our algorithm can be applied in the specific case ε = 0 in order to solve Eq. (6) directly. Moreover, it can be applied for an arbitrary choice of cost function; for example, in Figure 5 we run our algorithm on graphs where the distance considered is the shortest-path distance.
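For concreteness, the three matrices computed at each outer step of Alg. 3 translate into a few dense linear-algebra operations; a minimal NumPy sketch (ours, not the authors' implementation):

```python
import numpy as np

def lot_md_matrices(C, Q, R, g, gamma, eps):
    """Form xi^(1), xi^(2), xi^(3) for one outer mirror-descent step of
    Alg. 3, before projecting with LR-Dykstra (Alg. 2)."""
    xi1 = np.exp(-gamma * (C @ R) / g - (gamma * eps - 1.0) * np.log(Q))
    xi2 = np.exp(-gamma * (C.T @ Q) / g - (gamma * eps - 1.0) * np.log(R))
    omega = np.einsum('ir,ij,jr->r', Q, C, R, optimize=True)  # diag(Q^T C R)
    xi3 = np.exp(gamma * omega / g**2 - (gamma * eps - 1.0) * np.log(g))
    return xi1, xi2, xi3
```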
3.4. Convergence of the Mirror Descent

Even though the objective (9) is not convex in (Q, R, g), we obtain the non-asymptotic stationary convergence of the MD algorithm in this setting. For that purpose, we introduce a convergence criterion stronger than the one presented in (Ghadimi et al., 2013). Indeed, let F_ε be the objective function of problem (9) defined on C(a, b, r, α), and denote, for any γ > 0 and ξ ∈ C(a, b, r, α),

G_{ε,α}(ξ, γ) := argmin_{ζ ∈ C(a,b,r,α)} {⟨∇F_ε(ξ), ζ⟩ + (1/γ) KL(ζ, ξ)}.

The criterion used in (Ghadimi et al., 2013) to show the stationary convergence of the MD scheme is the squared norm of the vector

P_{C(a,b,r,α)}(ξ, γ) := (1/γ)(ξ − G_{ε,α}(ξ, γ)),

which can be seen as a generalized projected gradient of F_ε at ξ. Indeed, if X = ℝ^d and the prox-function KL(ζ, ξ) were replaced by ½‖ζ − ξ‖₂², we would have P_X(ξ, γ) = ∇F_ε(ξ). Here we consider instead the following criterion to establish convergence:

Δ_{ε,α}(ξ, γ) := (1/γ²) (KL(ξ, G_{ε,α}(ξ, γ)) + KL(G_{ε,α}(ξ, γ), ξ)).

This criterion is in fact stronger than the one used in (Ghadimi et al., 2013), as

Δ_{ε,α}(ξ, γ) = (1/γ²) ⟨∇h(G_{ε,α}(ξ, γ)) − ∇h(ξ), G_{ε,α}(ξ, γ) − ξ⟩ ≥ (1/(2γ²)) ‖ξ − G_{ε,α}(ξ, γ)‖² = ½ ‖P_{C(a,b,r,α)}(ξ, γ)‖²,

where h denotes the negative entropy and the last inequality comes from the strong convexity of h on C(a, b, r, α). For any ε ≥ 0 and 1/r ≥ α > 0, we show in the following proposition the non-asymptotic stationary convergence of the MD scheme applied to problem (9). To prove this result, we show that for any ε ≥ 0 the objective is smooth relative to the negative entropy (Bauschke et al., 2017), and we extend the proof of (Ghadimi et al., 2013) to this setting.

Proposition 4. Let ε ≥ 0, 1/r ≥ α > 0 and N ≥ 1. Denoting L_{ε,α} the smoothness constant of the objective relative to the negative entropy, and considering in the MD scheme (10) the constant step size γ_k = 1/(2L_{ε,α}) for all k = 1, . . . , N, we obtain

min_{1≤k≤N} Δ_{ε,α}((Q_k, R_k, g_k), γ_k) ≤ 4L_{ε,α} D₀ / N,

where D₀ := F_ε(Q₀, R₀, g₀) − LOT_{r,ε,α} is the gap between the initial objective value and the optimal one. In particular, reaching a stationarity level τ requires at most N = ⌈4L_{ε,α} D₀ / τ⌉ iterations.

Thanks to Proposition 4, for α sufficiently small (i.e. α ≤ α*), we have LOT_{r,ε,α} = LOT_{r,ε}, and therefore we obtain a stationary point of (8). In particular, if ε = 0, the proposed algorithm converges towards a stationary point of (6).

Remark 2. We also propose an algorithm to solve (8) directly. The main difference is that the updates of the MD can then be solved using the Iterative Bregman Projections (IBP) algorithm. See Appendix F for more details.

Remark 3. For all ε ≥ 0, the MD scheme ensures that each iteration k of our proposed algorithm outputs (Q_k, R_k, g_k) ∈ C₁(a, b, r, α) ∩ C₂(r); therefore the matrix obtained at each iteration, P_k^{LOT} = Q_k diag(1/g_k) R_kᵀ, is a coupling which satisfies the marginal constraints, whereas in the Sinkhorn algorithm the matrix defined at each iteration, P_k^{Sin} = diag(u_k) K diag(v_k), satisfies the marginal constraints only at convergence.

In the following section, we aim at accelerating our method in order to obtain a linear-time algorithm to solve (8).
Figure 3. In this experiment, we consider two Gaussian distributions supported on n = m = 5000 points in 2D. The first has mean (1, 1)ᵀ and identity covariance I₂, while the other has mean 0 and covariance 0.1·I₂. The ground cost is the squared Euclidean distance. Note that for this cost, an exact low-rank factorization of the cost is available, and therefore all low-rank methods, including ours, have a linear time complexity. Left: when ε = 0, our method quickly obtains the exact OT by forcing the nonnegative rank of the coupling to be small relative to the number of samples; in this setting, none of the other methods can be applied. Middle left, middle right: our method reaches high accuracy, for estimating either the true OT or its regularized version, orders of magnitude faster than the other low-rank methods for any rank r; moreover, it outperforms Sin in these regimes of small regularization. Note that Sin does not converge for ε = 0.001, as we do not use its stabilized log-sum-exp version but rather its classical, cheaper version. Right: here we change the scale of the y-axis. The regimes of entropic regularization for the Sinkhorn algorithm and for our method differ: Sinkhorn admits a larger range of ε for which it provides an efficient approximation of the OT, whereas LOT regularizes twice, with respect to both rank and entropy.

3.5. Linear-Time Approximation of Low-Rank Optimal Transport

Here we aim at obtaining the optimal solution of Eq. (8) in time linear in the number of samples. For that purpose, let us introduce our main assumption on the cost matrix C.

Assumption 1. C admits a low-rank factorization: there exist A ∈ ℝ^{n×d} and B ∈ ℝ^{m×d} such that C = ABᵀ.

Under Assumption 1, one can accelerate the computations in the iterations of the proposed Alg. 3 and obtain an algorithm whose cost is linear in the number of samples. Indeed, recall that given ξ = (ξ^{(i)})_{1≤i≤3}, each iteration of the Dykstra Alg. 2 can be performed in linear time. Moreover, thanks to Assumption 1, the computation of ξ, which requires computing both CR = A(BᵀR) and CᵀQ = B(AᵀQ), can be performed in O((n + m)dr) algebraic operations; thus each iteration of Alg. 3 requires only a number of algebraic operations linear in the number of samples. Let us now justify why Assumption 1 is well suited to the problem of computing optimal transport.

Squared Euclidean metric. In the specific case where C is a squared Euclidean distance matrix, it admits an exact low-rank decomposition. Indeed, let X := [x₁, . . . , x_n] ∈ ℝ^{d×n}, Y := [y₁, . . . , y_m] ∈ ℝ^{d×m} and D := (‖xᵢ − yⱼ‖₂²)_{i,j}. Then, denoting p := [‖x₁‖₂², . . . , ‖x_n‖₂²]ᵀ ∈ ℝⁿ and q := [‖y₁‖₂², . . . , ‖y_m‖₂²]ᵀ ∈ ℝᵐ, we can rewrite D as

D = p1_mᵀ + 1_n qᵀ − 2XᵀY.

Therefore, denoting A = [p, 1_n, −2Xᵀ] ∈ ℝ^{n×(d+2)} and B = [1_m, q, Yᵀ] ∈ ℝ^{m×(d+2)}, we obtain D = ABᵀ.

General case: distance matrices. In the following, we call a distance matrix any D ∈ ℝ^{n×m} such that there exist a metric space (X, d), {xᵢ}ᵢ₌₁ⁿ ∈ Xⁿ and {yⱼ}ⱼ₌₁ᵐ ∈ Xᵐ satisfying Dᵢ,ⱼ = d(xᵢ, yⱼ) for all i, j. It is in fact always possible to obtain a low-rank approximation of a distance matrix in linear time. In (Bakshi & Woodruff, 2018; Indyk et al., 2019), the authors propose an algorithm which, for any distance matrix D ∈ ℝ^{n×m} and γ > 0, outputs matrices M ∈ ℝ^{n×d}, N ∈ ℝ^{m×d} in O((m + n) poly(d/γ)) algebraic operations such that, with probability at least 0.99,

‖D − MNᵀ‖_F² ≤ ‖D − D_d‖_F² + γ‖D‖_F²,

where D_d denotes the best rank-d approximation of D. Therefore one can always obtain a low-rank factorization of a distance matrix in time linear in the number of samples. See Appendix D for more details.
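The exact factorization in the squared Euclidean case is a two-line computation; a minimal NumPy sketch (ours; here, unlike in the derivation above, points are stored as rows, so X is n × d rather than d × n):

```python
import numpy as np

def sq_euclidean_factors(X, Y):
    """Exact factorization C = A B^T of C_{ij} = ||x_i - y_j||_2^2,
    with A of size n x (d+2) and B of size m x (d+2)."""
    p = (X ** 2).sum(axis=1)                 # squared norms ||x_i||^2
    q = (Y ** 2).sum(axis=1)                 # squared norms ||y_j||^2
    A = np.column_stack([p, np.ones(X.shape[0]), -2.0 * X])
    B = np.column_stack([np.ones(Y.shape[0]), q, Y])
    return A, B

rng = np.random.default_rng(2)
X, Y = rng.random((7, 3)), rng.random((5, 3))
A, B = sq_euclidean_factors(X, Y)
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
assert np.allclose(A @ B.T, C)
```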
4. Numerical Results

4.1. Comparison with Other Regularization Schemes

We consider three synthetic problems in which we study the time-accuracy trade-off, as well as the couplings obtained, by comparing our method with other low-rank methods and with Sinkhorn's algorithm. More precisely, we compare our proposed method, LOT, with the factored optimal transport of (Forrow et al., 2018), Factored OT, the Nyström-based method of (Altschuler et al., 2018), Nys, the random-feature-based method of (Scetbon & Cuturi, 2020), RF, and the Sinkhorn algorithm (Cuturi, 2013), Sin. For LOT, we set the lower bound on g to α = 10⁻⁵.

Time-accuracy trade-off. We consider two problems where the ground cost involved in the OT problem is either the squared Euclidean distance or the Euclidean distance. In the first, we consider measures supported on n = 5000 points in ℝ², while in the second we consider n = 10000 samples in ℝ². The method proposed by (Forrow et al., 2018) can only be used with the squared Euclidean distance (2-Wasserstein), while ours works for any cost. For all the low-rank methods, we vary the rank between 10 and 500. For all the randomized methods, we report the mean over 10 runs. In Figs. 3 and 4, we plot the ratio with respect to the (non-regularized) optimal transport cost, defined as R := ⟨C, P̃⟩ / ⟨C, P*⟩, where P̃ is the coupling obtained by the method considered and P* is the ground truth (we ensure this optimal cost is large enough to avoid spurious divisions by 0). We present the time-accuracy trade-offs of the methods for different regularizations ε and ranks r. Our method consistently provides a better approximation of the OT cost while being much faster than the other low-rank methods for various target rank values r. It is also able to approximate the OT cost arbitrarily well, and does so faster than the Sinkhorn algorithm thanks to the low-rank constraint. We compare the methods in the same setting with problems of higher dimension and observe similar results; see Appendix G for more details.

Remark 4. Adding an entropic regularization to our objective stabilizes the MD scheme and therefore yields faster convergence. Indeed, if ε > 0, the number of iterations required by Alg. 2 to solve each MD step (10) to a given precision δ is governed by ε, whereas if ε = 0, the number of iterations required by Alg. 2 to reach precision δ grows as the MD scheme progresses.

Comparison of the couplings. Seeking to take a deeper look at the phenomenon highlighted in Fig. 1, we study differences in the regularization paths of LOT and Sin. We consider distributions supported on graphs of n = 1000 nodes, endowed with the shortest-path distance (Bondy et al., 1976). We run LOT with no entropic regularization (i.e. ε = 0 in Eq. (9)) against Sin for various pairs of regularizers. Results are displayed in Fig. 5, where the discrete path of regularizations parameterized by the rank r of LOT is compared with that obtained by Sin when varying ε, and the gaps between the couplings produced by the two methods (in ℓ₁ norm) are displayed. Both methods are able to approximate the OT coupling arbitrarily well, but they offer two different paths to interpolate from the independent coupling abᵀ, of rank 1, to the optimal one. More precisely, we see that the range of ε for which entropic OT provides an efficient approximation of the true coupling is very localized, while the range of ranks r for which LOT yields such an approximation is wider. Moreover, the ratio achieved by LOT decays faster with respect to r than that of Sin with respect to ε.
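To give an idea of this experimental setup, here is an illustrative sketch under our own assumptions (smaller sizes, SciPy's csgraph routines, which the paper does not mention) of building a shortest-path ground cost between two supports split from one graph:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

# Sample 2n points, weight the complete graph by squared Euclidean
# distances (which violate the triangle inequality, so shortest paths are
# nontrivial), and split the nodes into two supports of equal size.
rng = np.random.default_rng(0)
n = 100
Z = rng.standard_normal((2 * n, 2))
D = shortest_path(cdist(Z, Z) ** 2, method='D')   # Dijkstra, dense input
idx = rng.permutation(2 * n)
C = D[np.ix_(idx[:n], idx[n:])]                   # ground cost between supports
a = np.full(n, 1.0 / n)                           # uniform marginals
b = np.full(n, 1.0 / n)
```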
Remark 5. A comparative advantage of the low-rank parameterization of OT over the Sinkhorn approach lies in the simple bounds that r admits, between 1 and n, and in the fact that r encodes directly, through an integer, a property of the resulting coupling. In that sense, the same value of r can be used across experiments that compare measures of various sizes and supports. By contrast, selecting a suitable regularization strength ε for the Sinkhorn algorithm is usually challenging, as the parameter is continuous and its magnitude depends directly on the cost matrix values, making a common choice across experiments difficult.

Real-world application. In Figure 6, we consider the single-cell trajectory inference problem (Schiebinger et al., 2019), where the goal is to infer the ancestors of some specific cells (iPSCs) from temporal snapshots sampled several times a day over a period of 18 days. We apply exactly the pre-processing suggested by (Schiebinger et al., 2019), and our proposed method recovers a path similar to the one obtained by the Sinkhorn algorithm.

4.2. On the Non-Convexity of LOT

As our problem (6) is non-convex, we investigate the effect of the initialization as well as of the choice of the gradient step γ in the proposed MD scheme. In addition, we consider a specific situation where the optimal coupling solving (1) admits a low nonnegative rank, to see whether our method is able to recover the global minimum in that situation. In the following experiments we set ε = 0 and the lower bound on g to α = 10⁻⁵.

Figure 7. In this experiment, we consider the same situation as in Figure 3 with n = m = 1000, varying γ, for r = 50 or 500.

Effect of γ. In Figure 7, we plot the ratio R for the experiment of Figure 3 when varying γ. Our algorithm is robust to the choice of γ, as it converges for a large range of values; moreover, if the rank is large enough, it finds the optimal solution of the true OT problem (1).

Figure 4. Here we consider two Gaussian mixture densities sampled with n = m = 10000 points in 2D (see Appendix G for more details). The ground cost is the Euclidean distance. As this cost is a distance, we can apply the linear version of our algorithm, and we write LOT Quad for its quadratic counterpart. LOT and LOT Quad provide similar results, while LOT is faster. All kernel-based methods (Nys, RF) fail to converge in this setting. As in Fig. 3, our method approximates the true OT faster than Sin thanks to the low-rank constraint.

Figure 5. We illustrate in this plot the gaps between the couplings reached by Sin and LOT for varying regularization strengths. Measures were sampled on a complete graph obtained by drawing 2n = 2000 points from a 2-D standard normal distribution, with edge weights set to squared Euclidean distances; the supports are obtained by randomly splitting the nodes of the graph into two subsets of equal size. We vary the entropic regularization ε and the nonnegative rank r, with ε in log-scale ranging from 0.001 to 1 and r ranging from 1 to 1000, represented as a fraction of n. The blue (resp. red) curve stands for Sin (resp. LOT). We plot the ℓ₁ distance between their respective couplings.
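The initialization study below (Figure 8) draws random feasible triples (Q, R, g); one simple way to construct such a triple, a sketch of our own making in the spirit of the Gaussian-entry initialization described next:

```python
import numpy as np

def random_feasible_init(a, b, r, rng):
    """Positive Q, R with rows rescaled so that Q 1_r = a and R 1_r = b,
    i.e. (Q, R, g) lies in C1; the first LR-Dykstra pass then enforces
    the common inner-marginal constraint of C2."""
    Q = np.abs(rng.standard_normal((a.size, r))) + 1e-8
    R = np.abs(rng.standard_normal((b.size, r))) + 1e-8
    Q *= (a / Q.sum(axis=1))[:, None]
    R *= (b / R.sum(axis=1))[:, None]
    g = 0.5 * (Q.sum(axis=0) + R.sum(axis=0))   # heuristic inner marginal
    return Q, R, g
```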
Figure 6. Here we compare the paths recovered by the Sinkhorn algorithm with ε = 5 and by our method with γ = 1/ε and r = 500. Each sub-optimal-transport problem between two temporal snapshots contains n ≈ 5000 cells.

Figure 8. Same setting as in Figure 7.

Effect of the initialization. In Figure 8, we plot the ratios of LOT costs to the LP solution for 50 random initializations (Gaussian entries for Q, R, rescaled to have left/right marginals a and b). Our method is robust to the choice of initialization. We also design an OT problem where the ground-truth OT solution of Eq. (1) has low nonnegative rank: fixing anchors z₁, . . . , z_r ∈ ℝ^d and defining the cost c(x, y) = min_{k ∈ {1,...,r}} ‖x − z_k‖ + ‖z_k − y‖, we show that the true optimal coupling has nonnegative rank at most r. Our algorithm consistently recovers the OT coupling for multiple random initializations. See Appendix H for more details.

Conclusion. We proposed a new approach to regularize the OT problem by restricting solutions to have a small nonnegative rank. Our algorithm leverages both low-rank constraints and entropic smoothing. Our method can exploit a factorization of the ground cost (and not of the kernel usually associated with Sinkhorn) to provide a linear-time alternative for solving OT problems.

Acknowledgements. This work was supported by a "Chaire d'excellence de l'IDEX Paris Saclay", by the European Research Council (ERC project NORIA) and by the French government under the management of ANR as part of the "Investissements d'avenir" program (ANR-19-P3IA-0001, PRAIRIE 3IA Institute).

References

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.

Altschuler, J., Bach, F., Rudi, A., and Niles-Weed, J. Massively scalable Sinkhorn distances via the Nyström method, 2018.

Altschuler, J. M. and Boix-Adsera, E. Polynomial-time algorithms for multimarginal optimal transport problems with structure, 2020.

Bakshi, A. and Woodruff, D. P. Sublinear time low-rank approximation of distance matrices, 2018.

Bauschke, H. H. and Lewis, A. S. Dykstra's algorithm with Bregman projections: a convergence proof. Optimization, 48(4):409-427, 2000.

Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330-348, 2017.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111-A1138, 2015.

Bondy, J. A., Murty, U. S. R., et al. Graph Theory with Applications, volume 290. Macmillan London, 1976.

Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., and Peyré, G. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33, 2020.
Clason, C., Lorenz, D. A., Mahler, H., and Wirth, B. Entropic regularization of continuous optimal transport problems. Journal of Mathematical Analysis and Applications, 494(1):124432, 2021.

Cohen, J. E. and Rothblum, U. G. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149-168, 1993. URL http://www.sciencedirect.com/science/article/pii/002437959390224C.

Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292-2300, 2013.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of ICML, volume 32, pp. 685-693, 2014.

Dessein, A., Papadakis, N., and Rouas, J.-L. Regularized optimal transport and the rot mover's distance. The Journal of Machine Learning Research, 19(1):590-642, 2018.

Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1367-1376. PMLR, 2018.

Dykstra, R. L. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837-842, 1983.

Forrow, A., Hütter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings, 2018.

Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707-738, 2015.

Franklin, J. and Lorenz, J. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717-735, 1989.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. arXiv preprint arXiv:1810.02733, 2018.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, 2013.

Heitz, M., Bonneel, N., Coeurjolly, D., Cuturi, M., and Peyré, G. Ground metric learning on graphs. Journal of Mathematical Imaging and Vision, pp. 1-19, 2020.

Indyk, P., Vakilian, A., Wagner, T., and Woodruff, D. Sample-optimal low-rank approximation of distance matrices, 2019.

Janati, H., Bazeille, T., Thirion, B., Cuturi, M., and Gramfort, A. Multi-subject MEG/EEG source imaging with sparse multi-task regression. NeuroImage, pp. 116847, 2020.

Kantorovich, L. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227-229, 1942.

Korotin, A., Li, L., Solomon, J., and Burnaev, E. Continuous Wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=3tFAs5E-Pe.

Koundal, S., Elkin, R., Nadeem, S., Xue, Y., Constantinou, S., Sanggaard, S., Liu, X., Monte, B., Xu, F., Van Nostrand, W., et al. Optimal mass transport with Lagrangian workflow reveals advective and diffusion driven solute transport in the glymphatic system. Scientific Reports, 10(1):1-18, 2020.

Lin, T., Ho, N., and Jordan, M. On efficient optimal transport: an analysis of greedy and accelerated mirror descent algorithms. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3982-3991. PMLR, 2019.
Lu, H., Freund, R. M., and Nesterov, Y. Relatively-smooth convex optimization by first-order methods, and applications, 2017.

Makkuva, A., Taghvaei, A., Oh, S., and Lee, J. Optimal transport mapping via input convex neural networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6672-6681. PMLR, 2020.

Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6), 2019.

Scetbon, M. and Cuturi, M. Linear time Sinkhorn divergences using positive features, 2020.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928-943, 2019.

Schmitz, M. A., Heitz, M., Bonneel, N., Ngole, F., Coeurjolly, D., Cuturi, M., Peyré, G., and Starck, J.-L. Wasserstein dictionary learning: optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643-678, 2018.

Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing, 41(3):A1443-A1481, 2019.

Tong, A., Huang, J., Wolf, G., Van Dijk, D., and Krishnaswamy, S. TrajectoryNet: a dynamic optimal transport network for modeling cellular dynamics. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 9526-9536. PMLR, 2020.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence, pp. 433-453. PMLR, 2020.

Yang, K. D., Damodaran, K., Venkatachalapathy, S., Soylemezoglu, A. C., Shivashankar, G., and Uhler, C. Predicting cell lineages using autoencoders and optimal transport. PLoS Computational Biology, 16(4):e1007828, 2020.

Zhang, K. S., Peyré, G., Fadili, J., and Pereyra, M. Wasserstein control of mirror Langevin Monte Carlo, 2020.