Published as a conference paper at ICLR 2025

Linear Partial Gromov-Wasserstein Embedding

Yikun Bai1, Abihith Kothapalli*1, Hengrong Du*2, Rocio Diaz Martin*3, and Soheil Kolouri1
1Department of Computer Science, Vanderbilt University
2Department of Mathematics, University of California, Irvine
3Department of Mathematics, Tufts University
1{yikun.bai, abi.kothapalli, soheil.kolouri}@vanderbilt.edu
2hengrond@uci.edu
3rocio.diaz martin@tufts.edu
*These authors contributed equally to this work.

Abstract

The Gromov-Wasserstein (GW) problem, a variant of the classical optimal transport (OT) problem, has attracted growing interest in the machine learning and data science communities due to its ability to quantify similarity between measures in different metric spaces. However, like the classical OT problem, GW imposes an equal mass constraint between measures, which restricts its application in many machine learning tasks. To address this limitation, the partial Gromov-Wasserstein (PGW) problem has been introduced. It relaxes the equal mass constraint, allowing the comparison of general positive Radon measures. Despite this, both GW and PGW face significant computational challenges due to their non-convex nature. To overcome these challenges, we propose the linear partial Gromov-Wasserstein (LPGW) embedding, a linearized embedding technique for the PGW problem. For K different metric measure spaces, the pairwise computation of the PGW distance requires solving the PGW problem O(K^2) times. In contrast, the proposed linearization technique reduces this to O(K) times. Similar to the linearization technique for the classical OT problem, we prove that LPGW defines a valid metric for metric measure spaces. Finally, we demonstrate the effectiveness of LPGW in practical applications such as shape retrieval and learning with transport-based embeddings, showing that LPGW preserves the advantages of PGW in partial matching while significantly enhancing computational efficiency. The code is available at https://github.com/mint-vu/Linearized_Partial_Gromov_Wasserstein.

1 Introduction

Optimal transport (OT), unbalanced and partial OT, and linearized OT. The core objective of this work, and a general challenge in machine learning, is to match objects and devise practical, computationally efficient notions of similarity between shapes. One of the most well-known techniques for this task arises from optimal transport (OT) theory, which has found extensive applications in generative modeling (Arjovsky et al., 2017; Genevay et al., 2016), domain adaptation (Courty et al., 2016; Balaji et al., 2020), representation learning (Wu et al., 2023), and many other domains. The classical OT problem, as detailed by (Villani, 2009), involves matching two probability distributions to minimize the expected transportation cost, giving rise to the so-called Wasserstein distances. Although the classical OT problem imposes a mass conservation requirement, recent advances have extended the utility of OT beyond probability measures, allowing for comparisons of non-negative measures with different total masses and partial matching of sets through the unbalanced and partial OT frameworks (Chizat et al., 2018a;b; Fatras et al., 2021; Séjourné et al., 2021).
Additionally, in large-scale machine learning applications, since the OT approach is computationally expensive, it has motivated numerous approximations that lead to significant speedups, such as those in (Cuturi, 2013; Chizat et al., 2020; Scetbon & Cuturi, 2022), and the linearization technique, termed linear optimal transportation (LOT), proposed by (Wang et al., 2013). There have also been advances in reducing the computational complexity of comparing unbalanced measures (Cai et al., 2022; Bai et al., 2023).

The Gromov-Wasserstein (GW) distance, unbalanced and partial GW, and linearized GW. Another line of research has focused on the comparison of probability measures across different metric spaces using Gromov-Wasserstein (GW) distances (Mémoli, 2011). These innovations have applications in areas ranging from quantum chemistry (Peyré et al., 2016) to natural language processing (Alvarez-Melis & Jaakkola, 2018), enhancing fundamental tasks such as clustering (Chowdhury & Needham, 2021), dimensionality reduction (Van Assel et al., 2024), shape correspondence (Kong et al., 2024), shape analysis (Mémoli & Needham, 2022), and simulation-based inference (Hur et al., 2024). The GW distance is a valuable tool across a variety of domains, as it allows for comparisons between distributions while being independent of the ambient spaces. To extend GW to unbalanced settings, GW-like comparisons have been developed for metric measure spaces with differing total masses (Liu et al., 2020; Chapel et al., 2020; Séjourné et al., 2021; Bai et al., 2024; Zhang et al., 2022). Furthermore, to address the computational expense of pairwise GW distance calculations in large-scale applications, a linearized version of GW has been proposed in (Beier et al., 2022). Similar to the LOT framework (Wang et al., 2013), this approach is based on barycentric projection maps of transport plans.

Contributions. We propose the linear Partial Gromov-Wasserstein embedding (LPGW) technique, which extends the classical LOT embedding technique to the partial GW setting. Based on the embedding, we propose the so-called LPGW distance, which we prove to be a metric under certain assumptions. Given K different metric measure spaces, computing their pairwise PGW distances requires solving the PGW problem O(K^2) times. However, computing the LPGW distance pairwise requires only O(K) distance computations. Numerically, we test our proposed LPGW embedding and LPGW distance in two experiments: shape retrieval and learning with transform-based embeddings. In both experiments, we observe that the LPGW-based approach can preserve the partial matching property of PGW while significantly improving computational efficiency.

2 Background

2.1 The Wasserstein Distance

Let $\mathcal{P}(\mathbb{R}^d)$ be the space of probability measures on the Borel $\sigma$-algebra on $\mathbb{R}^d$, and let $\mathcal{P}_2(\mathbb{R}^d) := \{\mu \in \mathcal{P}(\mathbb{R}^d) : \int_{\mathbb{R}^d} \|x\|^2 \, d\mu(x) < \infty\}$. Given a (measurable) mapping $T : \mathbb{R}^d \to \mathbb{R}^d$, the pushforward measure $T_\#\mu$ is defined as $T_\#\mu(B) := \mu(T^{-1}(B))$ for all Borel sets $B \subset \mathbb{R}^d$, where $T^{-1}(B) := \{x : T(x) \in B\}$ is the preimage of $B$ under $T$. Given $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$, the 2-Wasserstein distance is defined as
$$W_2^2(\mu, \nu) := \min_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, d\gamma(x, y), \qquad (1)$$
where $\Gamma(\mu, \nu)$ is the set of joint probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ whose first and second marginals are $\mu$ and $\nu$, respectively. The classical OT theory (Villani, 2021; 2009) guarantees that a minimizer to (1) always exists, so such a formulation is well-defined.
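In the discrete case, problem (1) is a linear program. Below is a minimal, hedged sketch using the POT library (Flamary et al., 2021); the point clouds, weights, and random seed are illustrative placeholders, not data from the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))             # support of mu (illustrative)
y = rng.normal(size=(60, 2)) + 1.0       # support of nu (illustrative)
p = np.full(50, 1 / 50)                  # weights of mu
q = np.full(60, 1 / 60)                  # weights of nu

M = ot.dist(x, y, metric="sqeuclidean")  # cost matrix ||x_i - y_j||^2
gamma = ot.emd(p, q, M)                  # an optimal plan in Gamma(mu, nu)
W2_sq = np.sum(gamma * M)                # value of (1); equals ot.emd2(p, q, M)
print(W2_sq)
```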
In fact, there might exist multiple minimizers, and we denote by Γ (µ, ν) the set of all optimal transportation plans for problem (1). When an optimal transportation plan γ is induced by a mapping T : Rd Rd, that is, γ = (id T)#µ Γ (µ, ν), with T#µ = ν, (2) we say that the Monge mapping assumption holds, and the function T is called a Monge mapping. In particular, by Brenier s theorem, when µ is a continuous measure (i.e., Published as a conference paper at ICLR 2025 absolutely continuous with respect to the Lebesgue measure on Rd), a Monge mapping T always exists and it is unique. We refer to Appendix A for a detailed introduction to the LOT formulation. 2.2 The Gromov-Wasserstein Distance Consider two compact gauged measure spaces (gm-spaces) X = (X, g X, µ) and Y = (Y, g Y , ν), where X Rdx, Y Rdy are non-empty (compact) convex sets (with dx, dy N); g X : X 2 R is a gauge function in L2 sys(X 2, µ 2), i.e., the space of symmetric, real-valued functions on X 2 := X X that are square-integrable with respect to µ 2 := µ µ (similarly for g Y L2 sys(Y 2, ν 2); and µ, ν are probability measures defined on X, Y respectively. In addition, assume that X and Y are metric measure spaces (mm-spaces); that is, there exist distances d X and d Y on X and Y , respectively. In general, we will assume that g X is a continuous function on X 2 with respect to the metric d X 2((x1, x 1), (x2, x 2)) := d X(x1, x2) + d X(x 1, x 2) on X 2 (respectively, for g Y ). Thus, since X is compact, g X (respectively, g Y ) is a Lipschitz continuous function, i.e., there exists K R>0 such that |g X(x1, x 1) g X(x2, x 2)| K(d X(x1, x2) + d X(x 1, x 2)), x1, x 1, x2, x 2 X. (3) For the reader s convenience, unless stated otherwise, we default to g X(x1, x2) = d2 X(x1, x2) (which is the classical setting for the Gromov-Wasserstein problem (Alvarez-Melis & Jaakkola, 2018; Vayer, 2020)), or to the inner product g X(x1, x2) = x 1 x2 (which has been studied in (Vayer, 2020; Sturm, 2012)). The same applies to g Y . The Gromov-Wasserstein (GW) problem between X and Y is defined as GW 2(X, Y) := min γ Γ(µ,ν) |g X g Y |2, γ 2 (4) |g X g Y |2, γ 2 := Z (X Y ) 2 |g X(x, x ) g Y (y, y )|2 dγ(x, y)dγ(x , y ). A minimizer for (4) always exists (M emoli, 2011), and we use Γ (X, Y) to denote the set of all optimal transportation plans γ Γ(µ, ν) for the GW problem defined in (4). In addition, when g X = ds X and g Y = ds Y for s 1, GW( , ) defines a metric on the space of mm-spaces up to isomorphism1. Similar to classical OT theory, we say that the GW-Monge mapping assumption holds if there exists a minimizer γ for (4) of the form γ = (id T)#µ, where id : X X is the identity map on X, and T : X Y is a measurable function that pushes forward µ to ν. In recent years, the existence of Monge mappings for the GW problem has been studied in works such as (Vayer, 2020; Maron & Lipman, 2018; Dumont et al., 2024). Note that the GW minimization problem defined in (4) is quadratic but non-convex. The computational cost of GW( , ) for discrete measures µ, ν of size n is O(n3 L) to obtain a solution that is a local minimum, where n3 is the computational cost of one iteration and L is the required number of iterations2. Thus, when faced with the task of computing the pairwise GW distances between discrete gm-spaces Y1, . . . , YK with n points each, the resulting complexity is O(K2 n3 L). 
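To make the preceding cost discussion concrete, the following hedged sketch solves a small instance of (4) with $g_X = d_X^2$ and $g_Y = d_Y^2$ using the Frank-Wolfe-type solver in POT (Flamary et al., 2021). The returned plan is only a local minimizer, consistent with the non-convexity noted above, and the sample data are illustrative.

```python
import numpy as np
import ot
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))             # points of X in R^2 (illustrative)
Y = rng.normal(size=(55, 3))             # points of Y in R^3 (illustrative)
p = np.full(40, 1 / 40)
q = np.full(55, 1 / 55)

gX = cdist(X, X, "sqeuclidean")          # gauge g_X(x, x') = d_X^2(x, x')
gY = cdist(Y, Y, "sqeuclidean")          # gauge g_Y(y, y') = d_Y^2(y, y')

# Each Frank-Wolfe-type iteration costs O(n^3) for n points; the plan below is
# a local minimizer of the non-convex problem (4).
gamma = ot.gromov.gromov_wasserstein(gX, gY, p, q, loss_fun="square_loss")

# Objective value at gamma, expanded as
# sum_{i,j,k,l} (gX_ik - gY_jl)^2 gamma_ij gamma_kl
gw_sq = p @ gX**2 @ p + q @ gY**2 @ q - 2 * np.sum(gX * (gamma @ gY @ gamma.T))
print(gw_sq)
```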
To reduce the complexity of computing the pairwise distances in this setting, (Beier et al., 2022) introduces a linear embedding technique called linear Gromov-Wasserstein (LGW), which we discuss in the following section. 2.3 The Linear Gromov-Wasserstein Embedding and Distance Consider a fixed gm-space X = (X, g X, µ), which serves as a reference space. Given a target gm-space Y = (Y, g Y , ν), suppose that the GW-Monge mapping assumption holds, i.e., there exists an optimal transportation plan γ = (id T)#µ for the GW problem 1Two mm-spaces X, Y are isomorphic is there exists an isometry ψ : X Y , i.e., d X(x, x ) = d Y (ψ(x), ψ(x )), which is surjective and such that ν = ψ#µ. 2We refer to Appendix D for details. Published as a conference paper at ICLR 2025 (4) between X and Y. Then, the linear Gromov-Wasserstein (LGW) embedding is defined as the following mapping Y 7 kγ := g X( 1, 2) g Y (T( 1), T( 2)), (5) which assigns each gm-space Y to a gauge function kγ : X X R defined by (5). This embedding can recover the GW distance GW(X, Y) by the following: GW(X, Y) = Z X X |kγ(x, x )|2dµ(x)dµ(x ). Given two gm-spaces Y1 = {Y 1, g Y 1, ν1}, Y2 = {Y 2, g Y 2, ν2}, assume that there exist optimal transportation plans γ1 = (id T 1)#µ Γ (X, Y1), γ2 = (id T 2)#µ Γ (X, Y2). Let kγ1, kγ2 denote the corresponding linear GW embeddings as defined by (5). Then, the LGW distance conditioned on γ1, γ2, is defined as LGW(Y1, Y2; X, γ1, γ2) := kγ1 kγ2 L2(µ 2). (6) When the GW-Monge mapping assumption does not hold, (Beier et al., 2022) extend the above distance conditioned on a pair of arbitrary plans γ1 Γ (X, Y1), γ2 Γ (X, Y2): LGW(Y1, Y2; X, γ1, γ2) = inf γ Γ(γ1,γ2;µ) |g Y 1 g Y 2|2, γ 2 (7) = inf γ Γ(γ1,γ2;µ) (X Y 1 Y 2) 2 |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2dγ(x, y1, y2)γ(x , y1 , y2 ), Γ(γ1, γ2; µ) := {γ M+(X Y 1 Y 2) : (πX,Y 1)#γ = γ1, (πX,Y 2)#γ = γ2, (πX)#γ = µ}, for πX,Y i(x, y1, y2) := (x, yi) for i = 1, 2, and πX(x, y1, y2) := x. Finally, the LGW distance between Y1 and Y2 is defined as LGW(Y1, Y2; X) = inf γ1 Γ (µ,ν1) γ2 Γ (µ,ν2) LGW(Y1, Y2; X, γ1, γ2). (8) We refer to Appendix B for further details. To simplify the computation of the above formulation, (Beier et al., 2022) propose the barycentric projection method, which we will briefly review here. Let γ Γ(µ, ν) be a transportation plan between µ P(X) and ν P(Y ). By using [Definition 5.4.2, Ambrosio et al. (2005)], the barycentric projection map Tγ : X Y is defined as Y y dγY |X(y|x). (9) where {γY |X( |x)}x X is the corresponding disintegration of γ with respect to its first marginal µ. That is, {γY |X( |x)}x X is a family of measures satisfying Z X Y ϕ(x, y) dγ(x, y) = Z Y ϕ(x, y) dγY |X(y|x)dµ(x), ϕ C(X Y ). If γ Γ (X, Y), we define the measure eνγ := (Tγ )#µ P(Y ) and the gm-space e Yγ := (Y, g Y , νγ ). We provide the following results as a complement to the work in (Beier et al., 2022). Proposition 2.1. Let γ Γ (X, Y) and γ0 Γ (X, X). Then, we have the following: (1) The LGW embedding of Y can recover the GW distance, i.e., LGW(X, Y; X, γ0, γ ) = GW(X, Y). Published as a conference paper at ICLR 2025 (2) If T is a Monge mapping for GW(X, Y) such that γ = (id T)#µ, then Tγ = T, µ-a.e. As a consequence, eνγ = ν, and thus e Yγ = Y. (3) If g Y ( 1, 2) = α 1, 2 Y , where 1, 2 Y is an inner product restricted to Y and α R, then the barycentric projection map Tγ defined by (9) is a Monge mapping in the sense that eγ = (id Tγ )#µ is optimal for the GW problem GW(X, e Yγ ). We refer the reader to Appendix E for the proof. 
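As a concrete illustration of the barycentric projection (9) and the embedding (5) in the discrete case, the sketch below builds LGW embeddings of two target spaces against a common reference and compares them with the fixed-pair distance (6). This is a minimal reconstruction under the stated choices of gauge, not the authors' released implementation, and the data are placeholders.

```python
import numpy as np
import ot
from scipy.spatial.distance import cdist

def lgw_embedding(X_ref, p, Y, q):
    """Return k_gamma as an (n, n) matrix for the target gm-space (Y, ||.||^2, q)."""
    gX = cdist(X_ref, X_ref, "sqeuclidean")
    gY = cdist(Y, Y, "sqeuclidean")
    gamma = ot.gromov.gromov_wasserstein(gX, gY, p, q, loss_fun="square_loss")
    T = (gamma @ Y) / p[:, None]              # barycentric projection T_gamma(x_i), eq. (9)
    return gX - cdist(T, T, "sqeuclidean")    # k_gamma(x_i, x_i'), eq. (5)

def lgw_distance(k1, k2, p):
    """Fixed-pair LGW distance (6): ||k1 - k2||_{L^2(mu x mu)}."""
    w = np.outer(p, p)
    return np.sqrt(np.sum(w * (k1 - k2) ** 2))

rng = np.random.default_rng(2)
X_ref = rng.normal(size=(30, 2)); p = np.full(30, 1 / 30)   # reference gm-space
Y1 = rng.normal(size=(45, 2));    q1 = np.full(45, 1 / 45)
Y2 = rng.normal(size=(50, 3));    q2 = np.full(50, 1 / 50)

k1 = lgw_embedding(X_ref, p, Y1, q1)   # one GW solve per target: O(K) solves in total
k2 = lgw_embedding(X_ref, p, Y2, q2)
print(lgw_distance(k1, k2, p))         # cheap O(n^2) comparison afterwards
```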
Based on the above proposition, given $\mathcal{Y}^1 = (Y^1, g_{Y^1}, \nu^1)$ and $\mathcal{Y}^2 = (Y^2, g_{Y^2}, \nu^2)$, let $\gamma^1 \in \Gamma^*(\mathcal{X}, \mathcal{Y}^1)$ and $\gamma^2 \in \Gamma^*(\mathcal{X}, \mathcal{Y}^2)$. The following approximated LGW (a-LGW) distance can be used as a proxy for the LGW and GW distances. Furthermore, in practice, we can fix a single pair $(\gamma^1, \gamma^2)$ to compute the distance rather than compute the infimum:
$$a\text{-}LGW(\mathcal{Y}^1, \mathcal{Y}^2; \mathcal{X}) := \inf_{\gamma^1 \in \Gamma^*(\mathcal{X},\mathcal{Y}^1),\, \gamma^2 \in \Gamma^*(\mathcal{X},\mathcal{Y}^2)} LGW(\widetilde{\mathcal{Y}}^1_{\gamma^1}, \widetilde{\mathcal{Y}}^2_{\gamma^2}; \mathcal{X}, \widetilde{\gamma}^1, \widetilde{\gamma}^2)$$
$$= \inf_{\gamma^1 \in \Gamma^*(\mathcal{X},\mathcal{Y}^1),\, \gamma^2 \in \Gamma^*(\mathcal{X},\mathcal{Y}^2)} \big\|g_{Y^1}(T_{\gamma^1}(\cdot_1), T_{\gamma^1}(\cdot_2)) - g_{Y^2}(T_{\gamma^2}(\cdot_1), T_{\gamma^2}(\cdot_2))\big\|^2_{L^2(\mu^{\otimes 2})} \qquad (10)$$
$$\approx \big\|g_{Y^1}(T_{\gamma^1}(\cdot_1), T_{\gamma^1}(\cdot_2)) - g_{Y^2}(T_{\gamma^2}(\cdot_1), T_{\gamma^2}(\cdot_2))\big\|^2_{L^2(\mu^{\otimes 2})}. \qquad (11)$$

2.4 The Partial Gromov-Wasserstein Distance

Partial Gromov-Wasserstein (PGW), or more generally, unbalanced GW, relaxes the assumption that $\mu, \nu$ are normalized probability measures (Chapel et al., 2020; Séjourné et al., 2021; Bai et al., 2024). In particular, the PGW distance is defined as
$$PGW_\lambda^2(\mathcal{X}, \mathcal{Y}) := \inf_{\gamma \in \Gamma_{\leq}(\mu,\nu)} \big\langle |g_X - g_Y|^2, \gamma^{\otimes 2} \big\rangle + \lambda\big(|\mu^{\otimes 2} - \gamma_X^{\otimes 2}| + |\nu^{\otimes 2} - \gamma_Y^{\otimes 2}|\big), \qquad (12)$$
where $\lambda > 0$ is a parameter; $\gamma_X := (\pi_X)_\#\gamma$, $\gamma_Y := (\pi_Y)_\#\gamma$ are the corresponding marginals of $\gamma$ on $X$ and $Y$, respectively; and $\Gamma_{\leq}(\mu, \nu) := \{\gamma \in \mathcal{M}_+(X \times Y) : \gamma_X \leq \mu, \gamma_Y \leq \nu\}$, where $\mathcal{M}_+(X \times Y)$ denotes the set of finite non-negative measures on $X \times Y$, and $\gamma_X \leq \mu$ denotes that for any Borel set $A$, $\gamma_X(A) \leq \mu(A)$ (similarly for $\gamma_Y \leq \nu$). In the discrete setting, the PGW problem can be solved by variants of the Frank-Wolfe algorithm (see the algorithms in (Bai et al., 2024) for details). From a theoretical perspective, there always exists a minimizer for the optimization problem in (12) (see [Proposition 3.3, (Bai et al., 2024)]). We use the notation $\Gamma^*_\lambda(\mathcal{X}, \mathcal{Y})$ to denote the set of all optimal transportation plans for $PGW_\lambda(\mathcal{X}, \mathcal{Y})$. For convenience, when $\lambda$ is fixed, we may omit the subscript $\lambda$ and write $\Gamma^*(\mathcal{X}, \mathcal{Y})$ instead. We say that the PGW-Monge mapping assumption holds if there exists an optimal transportation plan $\gamma^*$ for the optimization problem in (12) of the form $\gamma^* = (\mathrm{id} \times T)_\#\gamma^*_X$ for some $T : X \to Y$.

3 Linear Partial Gromov-Wasserstein Embedding and Distance

Inspired by the work in (Beier et al., 2022), we extend the LGW distance to the unbalanced setting. In particular, we present a linearization technique for the PGW problem proposed in (Bai et al., 2024). Given gm-spaces $\mathcal{X} = (X, g_X, \mu)$ and $\mathcal{Y} = (Y, g_Y, \nu)$, let $\gamma^* \in \Gamma^*_\lambda(\mathcal{X}, \mathcal{Y})$. We first suppose that the PGW-Monge mapping assumption holds; that is, $\gamma^* = (\mathrm{id} \times T)_\#\gamma^*_X$ for some $T : X \to Y$. We subsequently define the linear partial Gromov-Wasserstein (LPGW) embedding as
$$\mathcal{Y} \mapsto (\gamma^*_X, k_{\gamma^*}, \gamma^*_c), \qquad (13)$$
where $\gamma^*_X := (\pi_X)_\#\gamma^*$, $k_{\gamma^*}(\cdot_1, \cdot_2) := g_X(\cdot_1, \cdot_2) - g_Y(T(\cdot_1), T(\cdot_2))$, and $\gamma^*_c := \nu^{\otimes 2} - (\gamma^*_Y)^{\otimes 2}$ for $\gamma^*_Y = (\pi_Y)_\#\gamma^*$. Here, the first component of the embedding, $\gamma^*_X$, encodes the mass transportation from the source domain $X$. The second component, $k_{\gamma^*}$, describes the transportation cost function. Finally, the third component, $\gamma^*_c$, encodes the mass creation in the target domain $Y$.

Similarly to the LGW embedding given by (5), we show in Proposition 3.1 that the above embedding can recover $PGW_\lambda(\mathcal{X}, \mathcal{Y})$ in the sense that
$$\|k_{\gamma^*}\|^2_{L^2((\gamma^*_X)^{\otimes 2})} + \lambda\big(|\mu|^2 - |\gamma^*_X|^2 + |\gamma^*_c|\big) = PGW_\lambda(\mathcal{X}, \mathcal{Y}).$$
Given two gm-spaces $\mathcal{Y}^1, \mathcal{Y}^2$, let $\gamma^1 \in \Gamma^*_\lambda(\mathcal{X}, \mathcal{Y}^1)$ and $\gamma^2 \in \Gamma^*_\lambda(\mathcal{X}, \mathcal{Y}^2)$. Suppose $\gamma^1, \gamma^2$ are induced by Monge mappings $T^1, T^2$ and compute the corresponding LPGW embeddings via (13). We define the LPGW discrepancy conditioned on $\gamma^1, \gamma^2$ by
$$LPGW_\lambda(\mathcal{Y}^1, \mathcal{Y}^2; \mathcal{X}, \gamma^1, \gamma^2) := \inf_{\mu' \leq \gamma^1_X \wedge \gamma^2_X} \|k_{\gamma^1} - k_{\gamma^2}\|^2_{L^2(\mu'^{\otimes 2})} + \lambda\big(|\nu^1|^2 + |\nu^2|^2 - 2|\mu'|^2\big), \qquad (14)$$
where $(\gamma^1_X \wedge \gamma^2_X)(E) = \min_{F \subset E}\{\gamma^1_X(F) + \gamma^2_X(E \setminus F)\}$.
Note that the above formulation can be written entirely in terms of the LPGW embeddings since |ν1|2 = |γ1 c| + |γ1 X|2, and similarly |ν2|2 = |γ2 c| + |γ2 X|2. When the PGW-Monge mapping assumption does not hold, we extend the above LPGW discrepancy conditioned on arbitrary transportation plans γ1 Γ ,λ(X, Y1), γ2 Γ ,λ(X, Y2) as follows: LPGWλ(Y1, Y2; X, γ1, γ2) := inf γ Γ (γ1,γ2;µ) g Y 1 g Y 2 2 L2(γ 2) + λ(|ν1|2 + |ν2|2 2|γ|2), (15) Γ (γ1, γ2; µ) := {γ M+(X Y 1 Y 2) : (πX,Y 1)#γ γ1, (πX,Y 2)#γ γ2}. Finally, to gain independence from the transportation plans γ1, γ2, we formally define the LPGW discrepancy as LPGWλ(Y1, Y2; X) := inf γ1 Γ ,λ(X,Y1) γ2 Γ ,λ(X,Y2) LPGWλ(Y1, Y2; X, γ1, γ2). (16) Proposition 3.1. Given compact gm-spaces X = (X, g X, µ) and Y = (Y, g Y , ν), let γ0 Γ ,λ(X, X) and γ Γ ,λ(X, Y). We have the following: (1) Problem (15) admits a minimizer. (2) Under the PGW-Monge mapping assumption, the problems (14) and (15) coincide. (3) Under the PGW-Monge mapping assumption, when 2λ max y1,y1 Y 1, y2,y2 Y 2 |g Y (y1, y1 ) g Y (y2, y2 )|2 (17) problem (14) achieves the value LPGWλ(Y1, Y2; X, γ1, γ2) := kγ1 kγ2 2 L((γ1 X γ2 X) 2) + λ(|ν1|2 + |ν2|2 2|γ1 X γ2 X|2). (18) (4) The LPGW embedding of Y can recover PGWλ(X, Y) in the sense that, under the PGW-Monge mapping assumption, we have kγ 2 L2((γ X) 2) + λ(|µ|2 |γ X|2 + |γ c |) = PGWλ(X, Y), and in general, we have LPGWλ(X, Y; γ0, γ ) = PGWλ(X, Y). (5) The LPGW discrepancies defined in (15) and (16) are pseudo-metrics. Indeed, under certain conditions, (16) is a rigorous metric. We refer to Appendix F for detailed discussion. Published as a conference paper at ICLR 2025 The proof of the above proposition is included in Appendix F. Similar to LGW, to accelerate the computation of the above formulation, we propose the following barycentric projection method in the LPGW setting. First, similar to Proposition 2.1, we present the following theorem, for which we provide the proof in Appendix H. Theorem 3.2. Given gm-spaces X = (X, g X, µ) and Y = (Y, g Y , ν), let γ0 Γ ,λ(X, X) and γ M+(X Y ). Further, let eν := eνγ := (Tγ )#γ and e Yγ = (Y, d Y , eν). Then, (1) If γ is induced by a mapping T, then Tγ = T, γ X a.s. (2) If γ Γ (X, Y), and the conditions for g Y ( 1, 2) in statement (3) of Proposition 2.1 hold, then eγ := (id Tγ )#γ X is optimal for PGWλ(X, e Yγ ). (3) If γ1 Γ (X, Y1) and γ2 Γ (X, Y2), and the PGW-Monge mapping assumption holds, then LPGWλ(Y1, Y2; X, γ1, γ2) = LPGWλ(e Yγ1, e Yγ2; X, eγ1, eγ2) + λ(|γ1 c| + |γ2 c|). (19) Based on part (3) in the above theorem, we define the approximated LPGW (a LPGW) discrepancy by a LPGWλ(Y1, Y2; X) := inf γ1 Γ ,λ(X,Y1), γ2 Γ ,λ(X,Y2) LPGWλ(e Yγ1, e Yγ2; X, γ1, γ2) + λ(|γ1 c| + |γ2 c|). (20) Proposition 3.3. If X, Y are compact sets, when λ is sufficiently large, in particular, (17) is satisfied, the above formulation becomes: a LPGWλ(Y1, Y2; X) = inf γ1 Γ ,λ(X,Y1), γ2 Γ ,λ(X,Y2) kγ1 kγ2 2 (γ1 X γ2 X) 2 + λ(|ν1|2 + |ν2|2 2|γ1 X γ2 X|2). (21) In practice, similar to LGW, we only select one pair of optimal transportation plans γ1, γ2 to compute the embedding and the above a LPGW discrepancy. In addition, we apply (21) to approximate the original formula (20). That is, in our experiments, we directly compute kγ1 kγ2 2 (γ1 X γ2 X) 2 + λ(|ν1|2 + |ν2|2 2|γ1 X γ2 X|2). (22) 3.1 Numerical implementation of LPGW in discrete setting In practice, we consider discrete distributions (or, more generally, discrete Radon measures). 
Suppose $\mathcal{X} = (X, d_0, \mu = \sum_{i=1}^n p_i \delta_{x_i})$, where $X \subset \mathbb{R}^{d_0}$ is a convex compact set containing $\{0_{d_0}, x_1, \ldots, x_n\}$. We similarly let $\mathcal{Y}^1 = (Y^1, d_1, \nu^1 = \sum_{j=1}^{m_1} q^1_j \delta_{y^1_j})$ and $\mathcal{Y}^2 = (Y^2, d_2, \nu^2 = \sum_{j=1}^{m_2} q^2_j \delta_{y^2_j})$. Suppose $\gamma^1 \in \mathbb{R}^{n \times m_1}_+$ is an optimal transportation plan for $PGW_\lambda(\mathcal{X}, \mathcal{Y}^1)$. By convexity of $Y$, the barycentric projection given by (9) becomes
$$\tilde{y}_i = \begin{cases} \frac{1}{\tilde{q}_i}\,[\gamma^1 y^1]_i & \text{if } \tilde{q}_i > 0, \\ 0_{d_1} & \text{if } \tilde{q}_i = 0, \end{cases} \qquad \text{where } \tilde{q} = \gamma^1_X = \gamma^1 1_{m_1}. \qquad (23)$$
Thus, the corresponding projected measure becomes $\tilde{\nu}_{\gamma^1} = \sum_i \tilde{q}_i \delta_{\tilde{y}_i}$. In addition, we have $|\gamma_c| = (\sum_{i=1}^n p_i)^2 - (\sum_{i=1}^n \tilde{q}^1_i)^2$. Let
$$\widetilde{K}^1 := k_{\widetilde{\gamma}} = \big[\|\tilde{y}_i - \tilde{y}_{i'}\|^2 - \|x_i - x_{i'}\|^2\big]_{i,i' \in [1:n]} \in \mathbb{R}^{n \times n}. \qquad (24)$$
Suppose $\gamma^2 \in \mathbb{R}^{n \times m_2}_+$ is an optimal transportation plan for $PGW_\lambda(\mathcal{X}, \mathcal{Y}^2)$. We define all corresponding terms analogously. Let $\tilde{q}^{12} = \tilde{q}^1 \wedge \tilde{q}^2$, $|\tilde{q}^1 - \tilde{q}^2|_{TV} = \sum_{i=1}^n |\tilde{q}^1_i - \tilde{q}^2_i|$, and $|q^1| = \sum_{j=1}^{m_1} q^1_j$. The a-LPGW distance given by (20) and (21) becomes
$$a\text{-}LPGW(\mathcal{Y}^1, \mathcal{Y}^2; \mathcal{X}) \approx \sum_{i,i'} \big(\|\tilde{y}^1_i - \tilde{y}^1_{i'}\|^2 - \|\tilde{y}^2_i - \tilde{y}^2_{i'}\|^2\big)^2 \, \tilde{q}^{12}_i \tilde{q}^{12}_{i'} + \lambda\big(|q^1|^2 + |q^2|^2 - 2|\tilde{q}^{12}|^2\big)$$
$$= (\tilde{q}^{12})^\top \big[(\widetilde{K}^1 - \widetilde{K}^2)^{\odot 2}\big] (\tilde{q}^{12}) + \lambda\big(|\gamma^1_c| + |\gamma^2_c| + |\tilde{q}_1^{\otimes 2} - \tilde{q}_2^{\otimes 2}|_{TV}\big), \qquad (25)$$
where $(\widetilde{K}^1 - \widetilde{K}^2)^{\odot 2}$ denotes the element-wise squared matrix and $\tilde{q}_1^{\otimes 2} = \tilde{q}^1 (\tilde{q}^1)^\top$ (similarly for $\tilde{q}_2^{\otimes 2}$). The above quantity (25) can be used to approximate the original PGW distance between $\mathcal{Y}^1$ and $\mathcal{Y}^2$. In addition, the original LPGW embedding (13) of $\mathcal{Y}^1$ (in fact, of $\widetilde{\mathcal{Y}}^1$) is $(\widetilde{K}^1, \tilde{\nu}_{\gamma^1}, \gamma^1_c)$. We reduce this embedding to $(\widetilde{K}^1, \tilde{q}^1, |\gamma^1_c|)$, which is sufficient to compute the above a-LPGW discrepancy. Thus, $(\widetilde{K}^1, \tilde{q}^1, |\gamma^1_c|)$ can be regarded as a numerical implementation of the LPGW embedding (13).

4 Experiments

4.1 Elliptical Disks

In this experiment, we apply the LPGW distance and the PGW distance to the 2D dataset presented in (Beier et al., 2022), consisting of 100 elliptical disks. We first compute the pairwise distances between the samples using the PGW distance and then compare them against the LPGW distance using nine different reference spaces. We present the resulting wall-clock time for each method and evaluate the quality of the approximation of PGW by LPGW using the mean relative error (MRE) and the Pearson correlation coefficient (PCC).

Experiment setup. We represent each 2D shape as an mm-space $\mathcal{X}^i = (\mathbb{R}^2, \|\cdot\|_2, \mu^i)$, where $\mu^i = \sum_{k=1}^{n} \frac{1}{n}\delta_{x_k}$. We normalize each shape so that the largest pairwise distance in each mm-space is 1. Based on [Lemma E.2, Bai et al. (2023)], the largest possible choice of $\lambda$ is given by $2\lambda = 1$. We hence test $\lambda \in \{0.05, 0.08, 0.1, 0.3, 0.5\}$, and for each reference space, we compute the pairwise LPGW distances and compute the wall-clock time, MRE, and PCC. We refer to Appendix K for full numerical details, MDS visualizations, and complete results.

Performance analysis. We present the results when $\lambda = 0.1$ in Table 1. First, we observe that LPGW is significantly faster than PGW as it only requires N = 100 transport plan computations, whereas the PGW method requires N^2 transport plan computations. Second, we observe that when the reference space is one of S6, S7, S8, S9, LPGW admits a relatively smaller MRE and higher PCC, demonstrating the importance of the chosen reference space.

             PGW     S1      S2      S3      S4      S5      S6      S7      S8      S9
points       --      441     676     625     52      289     545     882     882     317
time (min)   46.97   0.76    3.78    3.13    0.08    0.62    1.56    1.91    1.98    0.71
MRE          --      0.1941  0.1264  0.1431  0.2542  0.0538  0.0444  0.0205  0.0198  0.0245
PCC          --      0.5781  0.5738  0.5881  0.8581  0.9871  0.9930  0.9952  0.9954  0.9949

Table 1: Experimental results when λ = 0.1. The first row shows the total wall-clock time for PGW and LPGW. The second and third rows display the MRE (mean relative error) and PCC (Pearson correlation coefficient), respectively.
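For reference, the following hedged sketch assembles the reduced embedding $(\widetilde{K}, \tilde{q}, |\gamma_c|)$ of (23)-(24) and the fixed-pair a-LPGW value of (25) from given transport plans, as used for the pairwise LPGW distances above. The plans should come from a solver for the λ-penalized problem (12), e.g., the authors' released code; here POT's mass-constrained partial-GW solver (Chapel et al., 2020) is used only as a stand-in, which is an approximation rather than the formulation in (12), and the data and λ value are synthetic placeholders.

```python
import numpy as np
import ot
from scipy.spatial.distance import cdist

def lpgw_embedding(X_ref, Y, gamma):
    """Reduced LPGW embedding (K_tilde, q_tilde) from a plan gamma, eqs. (23)-(24)."""
    q_t = gamma.sum(axis=1)                              # q_tilde = gamma 1_m
    T = np.zeros((X_ref.shape[0], Y.shape[1]))
    mask = q_t > 0
    T[mask] = (gamma @ Y)[mask] / q_t[mask, None]        # barycentric projection (23)
    K = cdist(T, T, "sqeuclidean") - cdist(X_ref, X_ref, "sqeuclidean")  # eq. (24)
    return K, q_t

def a_lpgw(K1, qt1, K2, qt2, nu1_mass, nu2_mass, lam):
    """Fixed-pair a-LPGW discrepancy, following the first expression in (25)."""
    q12 = np.minimum(qt1, qt2)                           # q_tilde^1 wedge q_tilde^2
    quad = q12 @ ((K1 - K2) ** 2) @ q12                  # (q12)^T [(K1 - K2)^{.2}] (q12)
    return quad + lam * (nu1_mass ** 2 + nu2_mass ** 2 - 2 * q12.sum() ** 2)

rng = np.random.default_rng(3)
X_ref = rng.normal(size=(40, 2)); p = np.full(40, 0.025)
Y1 = rng.normal(size=(60, 2));       q1 = np.full(60, 1 / 60)
Y2 = rng.normal(size=(60, 2)) + 2.0; q2 = np.full(60, 1 / 60)

# Stand-in plans from the mass-constrained partial-GW solver of Chapel et al. (2020).
C0 = cdist(X_ref, X_ref, "sqeuclidean")
g1 = ot.partial.partial_gromov_wasserstein(C0, cdist(Y1, Y1, "sqeuclidean"), p, q1, m=0.8)
g2 = ot.partial.partial_gromov_wasserstein(C0, cdist(Y2, Y2, "sqeuclidean"), p, q2, m=0.8)

K1, qt1 = lpgw_embedding(X_ref, Y1, g1)
K2, qt2 = lpgw_embedding(X_ref, Y2, g2)
print(a_lpgw(K1, qt1, K2, qt2, q1.sum(), q2.sum(), lam=0.1))  # lam is a placeholder
```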
4.2 Shape Retrieval

Experiment setup. We now employ the PGW distance to distinguish between 2D and 3D shapes, similarly to what is done in (Beier et al., 2022). We use GW, LGW, and PGW as baselines for comparison against the results of our proposed LPGW. Given a series of shapes, we represent the shapes as mm-spaces $\mathcal{X}_i = (\mathbb{R}^d, \|\cdot\|_2, \mu_i)$, where $\mu_i = \sum_{k=1}^{n_i} \alpha_i \delta_{x^i_k}$. For the GW and LGW methods, we normalize the mass for the balanced mass constraint setting (i.e., $\alpha_i = \frac{1}{n_i}$), and for the PGW and LPGW methods, we let $\alpha_i = \alpha$ for all the shapes, where $\alpha > 0$ is a fixed constant. In this manner, we compute the pairwise distances between the shapes. Next, using the approach given by (Beier et al., 2022), we combine each distance matrix with a support vector machine (SVM), applying stratified 10-fold cross-validation. In each iteration of cross-validation, we train an SVM using $\exp(-\sigma D)$ as the kernel, where $D$ is the matrix of pairwise distances (w.r.t. one of the considered distances) restricted to 9 folds, and compute the accuracy of the model on the remaining fold. We report the accuracy averaged over all 10 folds for each model.

Dataset setup. We test one 2D shape dataset and one 3D shape dataset. We use the 2D dataset presented in (Bai et al., 2024), consisting of 8 classes, each containing 20 shapes. The 3D dataset is given by (Pan et al., 2021), which provides 100-200 complete shapes in each of 16 different classes, and for each complete shape, provides 26 corresponding incomplete shapes. We choose four classes of shapes from this dataset, and for each, we sample 15 complete and 45 incomplete shapes, yielding a total of 240 shapes. The datasets are visualized in Figure 4 in the Appendix.

                        GW       PGW      LGW      LPGW (ours)
2D Dataset  Accuracy    98.1%    96.2%    93.7%    97.5%
            Time (s)    43s      39s      0.4s     0.5s
3D Dataset  Accuracy    92.5%    93.8%    92.5%    93.7%
            Time (m)    203.0m   203.6m   1.3m     1.8m

Table 2: Accuracy and wall-clock time comparison in the shape retrieval experiment.

Performance analysis. We refer to Appendix L for full numerical details, parameter settings, and the visualization of the resulting confusion matrices. We visualize the resulting pairwise distance matrices in Figure 7. For the SVM experiments, on the 2D dataset, GW achieves the highest accuracy, 98.1%, while the second-best method is LPGW, at 97.5%. On the 3D dataset, PGW achieves the highest accuracy of 93.8%, and LPGW achieves the next highest accuracy of 93.7%. We refer to Table 2 for details. In addition, we report the wall-clock time required to compute all pairwise distances for each method in Table 2. In both datasets, we observe that LGW/LPGW admit similar computational time and are much faster than GW/PGW. This difference is more pronounced in the 3D dataset. We note that when λ is sufficiently large, the performance of the LPGW method and the LGW method is the same (see Appendix L for details).
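The evaluation protocol described above (an SVM on the precomputed kernel $\exp(-\sigma D)$ with stratified 10-fold cross-validation) can be sketched as follows; the distance matrix D, the labels, and the bandwidth σ below are placeholders to be replaced by the actual pairwise distance matrices and shape labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def kernel_svm_accuracy(D, y, sigma=1.0, n_splits=10, seed=0):
    """Stratified 10-fold CV accuracy of an SVM with the precomputed kernel exp(-sigma*D)."""
    K = np.exp(-sigma * D)                               # kernel matrix from distances
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in skf.split(np.zeros((len(y), 1)), y):
        clf = SVC(kernel="precomputed")
        clf.fit(K[np.ix_(train, train)], y[train])               # train on 9 folds
        scores.append(clf.score(K[np.ix_(test, train)], y[test]))  # score on held-out fold
    return float(np.mean(scores))

# Example with placeholder data: a random symmetric "distance" matrix and 8 classes
# of 20 shapes each, mirroring the size of the 2D dataset above.
rng = np.random.default_rng(4)
D = np.abs(rng.normal(size=(160, 160))); D = (D + D.T) / 2; np.fill_diagonal(D, 0)
y = np.repeat(np.arange(8), 20)
print(kernel_svm_accuracy(D, y, sigma=2.0))
```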
4.3 Learning with Transform-Based Embeddings

Data          Method     LOT        LGW         LPGW (ours)
no rotation   Accuracy   89.0%      82.5%       82.5%
              Time       183s+16s   405s+77s    309s+84s
η = 0         Accuracy   51.2%      82.5%       82.5%
              Time       183s+15s   405s+77s    309s+84s
η = 0.1       Accuracy   13.4%      17.0%       81.8%
              Time       183s+17s   405s+88s    309s+91s
η = 0.3       Accuracy   12.5%      13.3%       75.8%
              Time       183s+23s   405s+145s   309s+123s
η = 0.5       Accuracy   12.5%      12.5%       72.9%
              Time       183s+27s   405s+248s   309s+168s

Table 3: Accuracy and wall-clock time comparison for the learning with transform-based embeddings experiments. We report each time as t1 + t2, where t1 is the time required to compute the training set embeddings and t2 is the time required to compute the testing set embeddings.

In this experiment, we perform machine learning tasks using transform-based embeddings. Specifically, given training and testing datasets, we apply various transform-based embedding methods to both. Additionally, we assume that the testing data is randomly rotated and has been corrupted with random noise. The objective is to evaluate the robustness of machine learning models trained using the different embedding techniques. The full numerical details for this experiment can be found in Appendix M.

Baseline methods. In this experiment, we consider several classical embedding methods, namely the LOT embedding (Wang et al., 2013; Kolouri et al., 2020; Nenna & Pass, 2023), the LGW embedding (Beier et al., 2022), and our proposed LPGW embedding.

Figure 1: MNIST digits and their reconstructions. (a) MNIST point cloud dataset (train, test, and corrupted test samples). (b) Reconstructed digits based on the LOT/LGW/LPGW embeddings.

Dataset setup. We adopt the MNIST 2D point cloud dataset for this experiment. Specifically, for each digit, we sample N1 = 500 point clouds per class from the training set and N2 = 100 point clouds per class from the testing set. Each point cloud is represented as a gm-space in $\mathbb{R}^2$ with the measure $\mu = \sum_{i=1}^n p_i \delta_{x_i}$, where $p_i > 0$, $i \in \{1, \ldots, n\}$, are the pixel intensities provided by the original dataset. We normalize $p_i$ such that $\sum_{i=1}^n p_i = 1$ for all sampled point clouds. For each shape in the training set, we randomly rotate or flip the shape horizontally and add the transformed shape to the training set. As a result, each class contains 500 × 2 = 1000 shapes in the training set. For each test shape, we first randomly rotate or flip the shape, and we then corrupt the shape by adding uniformly distributed noise. The mass of each added point is 1/n, where n is the number of points in the original shape, and the total mass of the added points is $\eta \in \{0, 0.1, 0.3, 0.5\}$.

Performance analysis. For each method (LOT, LGW, and LPGW), we first compute the embeddings of the training data and optimize a classification model (logistic regression) using these embeddings. Then, we compute the embeddings for the test dataset and evaluate the accuracy of the model on the test embeddings. The results are presented in Table 3. In this table, we observe that for the original test dataset, LOT achieves the highest accuracy at 89.0%. However, on the corrupted dataset, LOT's performance drops significantly, classifying only 51.2% correctly when η = 0 (no noise but with random rotation/flip) and dropping to 12.5% (i.e., random guessing) when η > 0.1. LGW is more robust to the data with random rotations/flips, but its accuracy also reduces to random guessing as η is increased.
In contrast, LPGW embeddings exhibit substantially greater performance, with its accuracy ranging from 70-85% when η 0.1. For an intuitive comparison of the performance of these embeddings, we present the data reconstruction results in Figure 1b and the t-SNE visualization of the embeddings produced by each method in Figure 9. Further details can be found in Appendix N. 5 Summary In this paper, we propose the linear partial Gromov-Wasserstein (LPGW) embedding, a linearization technique for the PGW problem. Theoretically, we prove that LPGW admits a metric with certain assumptions, and numerically, we demonstrate the utility of our proposed LPGW method in shape retrieval and learning with transform-based embedding tasks. We demonstrate that the LPGW-based method can preserve the partial matching property of PGW while significantly improving computational efficiency. Acknowledgments This research was partially supported by the NSF CAREER Award No. 2339898. Published as a conference paper at ICLR 2025 David Alvarez-Melis and Tommi S Jaakkola. Gromov-wasserstein alignment of word embedding spaces. ar Xiv preprint ar Xiv:1809.00013, 2018. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savar e. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. Yikun Bai, Bernard Schmitzer, Mathew Thorpe, and Soheil Kolouri. Sliced optimal partial transport. ar Xiv preprint ar Xiv:2212.08049, 2022. Yikun Bai, Ivan Vladimir Medri, Rocio Diaz Martin, Rana Shahroz, and Soheil Kolouri. Linear optimal partial transport embedding. In International Conference on Machine Learning, pp. 1492 1520. PMLR, 2023. Yikun Bai, Rocio Diaz Martin, Abihith Kothapalli, Hengrong Du, Xinran Liu, and Soheil Kolouri. Partial gromov-wasserstein metric. ar Xiv preprint ar Xiv:2402.03664, 2024. Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33:12934 12944, 2020. Florian Beier, Robert Beinert, and Gabriele Steidl. On a linear gromov wasserstein distance. IEEE Transactions on Image Processing, 31:7292 7305, 2022. Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia conference, pp. 1 12, 2011. Tianji Cai, Junyi Cheng, Bernhard Schmitzer, and Matthew Thorpe. The linearized hellinger kantorovich distance. SIAM Journal on Imaging Sciences, 15(1):45 83, 2022. Laetitia Chapel, Mokhtar Z Alaya, and Gilles Gasso. Partial optimal tranport with applications on positive-unlabeled learning. Advances in Neural Information Processing Systems, 33:2903 2913, 2020. Lenaic Chizat, Gabriel Peyr e, Bernhard Schmitzer, and Fran cois-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87 (314):2563 2609, 2018a. Lenaic Chizat, Gabriel Peyr e, Bernhard Schmitzer, and Fran cois-Xavier Vialard. Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis, 274(11):3090 3123, 2018b. Lenaic Chizat, Pierre Roussillon, Flavien L eger, Fran cois-Xavier Vialard, and Gabriel Peyr e. Faster wasserstein distance estimation with the sinkhorn divergence. 
Advances in Neural Information Processing Systems, 33:2257 2269, 2020. Samir Chowdhury and Tom Needham. Generalized spectral clustering via gromovwasserstein learning. In International Conference on Artificial Intelligence and Statistics, pp. 712 720. PMLR, 2021. Nicolas Courty, R emi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853 1865, 2016. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Th eo Dumont, Th eo Lacombe, and Fran cois-Xavier Vialard. On the existence of monge maps for the gromov wasserstein problem. Foundations of Computational Mathematics, pp. 1 48, 2024. Published as a conference paper at ICLR 2025 Kilian Fatras, Thibault S ejourn e, R emi Flamary, and Nicolas Courty. Unbalanced minibatch optimal transport; applications to domain adaptation. In International Conference on Machine Learning, pp. 3186 3197. PMLR, 2021. R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, L eo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/papers/v22/ 20-451.html. Aude Genevay, Marco Cuturi, Gabriel Peyr e, and Francis Bach. Stochastic optimization for large-scale optimal transport. Advances in neural information processing systems, 29, 2016. Yoon Haeng Hur, Wenxuan Guo, and Tengyuan Liang. Reversible gromov monge sampler for simulation-based inference. SIAM Journal on Mathematics of Data Science, 6(2): 283 310, 2024. Soheil Kolouri, Navid Naderializadeh, Gustavo K Rohde, and Heiko Hoffmann. Wasserstein embedding for graph learning. ar Xiv preprint ar Xiv:2006.09430, 2020. Lemin Kong, Jiajin Li, Jianheng Tang, and Anthony Man-Cho So. Outlier-robust gromovwasserstein for graph data. Advances in Neural Information Processing Systems, 36, 2024. Weijie Liu, Chao Zhang, Jiahao Xie, Zebang Shen, Hui Qian, and Nenggan Zheng. Partial gromov-wasserstein learning for partial graph matching. ar Xiv preprint ar Xiv:2012.01252, pp. 115, 2020. Xinran Liu, Yikun Bai, Huy Tran, Zhanqi Zhu, Matthew Thorpe, and Soheil Kolouri. Ptlp: Partial transport lp distances. In Neur IPS 2023 Workshop Optimal Transport and Machine Learning, 2023. Haggai Maron and Yaron Lipman. (probably) concave graph matching. Advances in Neural Information Processing Systems, 31, 2018. Facundo M emoli. Gromov wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417 487, 2011. Facundo M emoli and Tom Needham. Distance distributions and inverse problems for metric measure spaces. Studies in Applied Mathematics, 149(4):943 1001, 2022. Caroline Moosm uller and Alexander Cloninger. Linear optimal transport embedding: provable Wasserstein classification for certain rigid transformations and perturbations. Information and Inference: A Journal of the IMA, 12(1):363 389, 2023. Luca Nenna and Brendan Pass. Transport type metrics on the space of probability measures involving singular base measures. Applied Mathematics & Optimization, 87(2):28, 2023. 
Liang Pan, Tong Wu, Zhongang Cai, Ziwei Liu, Xumin Yu, Yongming Rao, Jiwen Lu, Jie Zhou, Mingye Xu, Xiaoyuan Luo, et al. Multi-view partial (mvp) point cloud challenge 2021 on completion and registration: Methods and results. ar Xiv preprint ar Xiv:2112.12053, 2021. Gabriel Peyr e, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In International conference on machine learning, pp. 2664 2672. PMLR, 2016. Filippo Santambrogio. Optimal transport for applied mathematicians. Birk auser, NY, 55 (58-63):94, 2015. Published as a conference paper at ICLR 2025 Meyer Scetbon and Marco Cuturi. Low-rank optimal transport: Approximation, statistics and debiasing. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=4bt Ne XKFAQ. Thibault S ejourn e, Fran cois-Xavier Vialard, and Gabriel Peyr e. The unbalanced gromov wasserstein distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems, 34:8766 8779, 2021. Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. ar Xiv preprint ar Xiv:1208.0434, 2012. Hugues Van Assel, C edric Vincent-Cuaz, Nicolas Courty, R emi Flamary, Pascal Frossard, and Titouan Vayer. Distributional reduction: Unifying dimensionality reduction and clustering with gromov-wasserstein projection. ar Xiv preprint ar Xiv:2402.02239, 2024. Titouan Vayer. A contribution to optimal transport on incomparable spaces. ar Xiv preprint ar Xiv:2011.04447, 2020. Cedric Villani. Optimal transport: old and new. Springer, 2009. URL https://www. cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf. C edric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021. Wei Wang, Dejan Slepˇcev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254 269, 2013. Fang Wu, Nicolas Courty, Shuting Jin, and Stan Z Li. Improving molecular representation learning with metric learning-enhanced optimal transport. Patterns, 4(4), 2023. Zhengxin Zhang, Youssef Mroueh, Ziv Goldfeld, and Bharath Sriperumbudur. Cycle consistent probability divergences across different spaces. In International Conference on Artificial Intelligence and Statistics, pp. 7257 7285. PMLR, 2022. Published as a conference paper at ICLR 2025 A Background: Linear Optimal Transportation Distance The space of measures P2(Rd) can be endowed with the two-Wasserstein metric W2, and the resulting space defines a Riemannian manifold (see, e.g., (Villani, 2021)), which is referred to in the literature as the Wasserstein space. Under the Monge mapping assumption, consider µ P2(Rd) and let Tanµ(P2(Rd)) denote the corresponding tangent space at µ. Then, each tangent vector v Tanµ(P2(Rd)) can be regarded as an L2(µ) function, that is, v : Rd Rd such that v 2 L2(µ) = Z Rd v(x) 2 dµ(x) < . Given ν P2(Rd), suppose γ = (id T)#µ is an optimal transportation plan for W2(µ, ν). Then, the logarithm mapping is defined by P2(Rd) ν 7 vγ := (T id) Tanµ(P2(Rd)), (26) and the resulting image vγ is the so-called Linear Optimal Transportation (LOT) embedding (Wang et al., 2013). The LOT embedding vγ is a representation of the measure ν and encodes the optimal displacement from the fixed measure µ to ν. 
Indeed, W 2 2 (µ, ν) = vγ 0 2 L2(µ). In general, given two probability measures ν1, ν2 P2(Rd), if γ1, γ2 are optimal solutions for W2(µ, ν1), W2(µ, ν2) of the form (2), one can use the following so-called LOT distance to approximate the original OT distance W2(ν1, ν2) (Moosm uller & Cloninger, 2023): LOT 2 2 (ν1, ν2; µ, γ1, γ2) := vγ1 vγ2 2 L2(µ). (27) The above definition relies on the Monge mapping assumption. In (Wang et al., 2013; Moosm uller & Cloninger, 2023), the authors generalize the LOT distance without the Monge mapping assumption; however, in what follows, we will avoid such an assumption. First, we discuss the notion of a geodesic in the classical OT setting. Suppose γ is an optimal transportation plan for the OT problem between µ, ν P2(Rd). Then, the geodesic from µ to ν is defined by t 7 γt := ((1 t)πX + tπY )#γ (0 t 1). (28) Remark A.1. In the particular case where γ is of the form γ = (id T)#µ, the above formula reduces to t 7 ((1 t)id + t T)#µ (0 t 1). Remark A.2. In the discrete case, µ = Pn i=1 piδxi and ν = Pm j=1 qjδyj. By abuse of notation, let γ Γ (µ, ν) be interpreted as a matrix. Then, the above geodesic becomes j=1 δ(1 t)xi+tyjγij (0 t 1). Given curves {γ1 t }t [0,1], {γ2 t }t [0,1] both originating from µ, we claim they are equivalent if there exists ϵ > 0 such that for all t [0, ϵ], γ1 t = γ2 t . Let Gµ denote the set of all the equivalence classes. The generalized tangent space at µ, denoted by Tang µ(P2(Rd)), is defined as the closure of Gµ (see (Beier et al., 2022) for details). Now, given ν1, ν2 P2(Rd), let γ1, γ2 be some optimal transportation plan in Γ (µ, ν1), Γ (µ, ν2), respectively. We define the following distance, defined on the space Tang µP2(Rd), as the LOT distance conditioned on γ1, γ2: LOT 2 2 (ν1, ν2; µ, γ1, γ2) := inf γ Γ(γ1,γ2;µ) R3d y1 y2 2dγ(x, y1, y2). (29) where Γ(γ1, γ2; µ) := {γ P(R3d) : (π01)#γ = γ1, (π02)#γ = γ2}, where π01(x, y1, y2) = (x, y1), π02(x, y1, y2) = (x, y2). Published as a conference paper at ICLR 2025 Remark A.3. When γ1 = (id T 1)#µ, γ2 = (id T 2)#µ for some mappings T 1, T 2, there exists only one element in Γ(γ1, γ2; µ), which is (id T 1 T 2)#µ. Thus, the right-hand side of (29) becomes T 1(x) T 2(x) 2 L2(µ). (30) That is, (29) and (27) coincide. Finally, to formulate a distance that is independent of the optimal transportation plans γ1, γ2, we define the LOT distance between ν1, ν2 with respect to µ by LOT 2 2 (ν1, ν2; µ) := inf γ1 Γ (µ,ν1) γ2 Γ (µ,ν2) LOT 2 2 (ν1, ν2; µ, γ1, γ2). (31) Barycentric projection. The above LOT distance is complicated to compute. To simplify the computation, (Wang et al., 2013) introduces the barycentric projection method that we will recall here. First, we review some basic concepts in measure theory. Given γ P(R2d), with π1#γ = µ, where π1 : R2d Rd is given by π1(x, y) = x, by the disintegration theorem (see, for e.g., (Ambrosio et al., 2005, Thm 5.3.1) and (Dumont et al., 2024, Theorem 6)), there exists a µ a.e. uniquely defined family of probability measures {γ( |x)}x Rd P(Rd), such that Z R2d ϕ(x, y)dγ(x, y) = Z Rd ϕ(x, y)dγ(y|x)dµ(x), ϕ C0(R2d). (32) In this paper, we call γ( |x) the disintegration of γ with respect to its first marginal (µ), given x (for each x supp(µ)). The barycentric projection map Tγ : Rd Rd of γ is defined by Tγ(x) := arg min y Y Y y y 2 dγ(y |x ) = Z Y y dγ(y |x ). (33) Note that the second equality holds since the distance between y, y is simply the quadratic Euclidean distance. In the third term, we have a vector integration. 
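In the discrete case, the barycentric projection (33) and the LOT embedding (26) reduce to simple matrix operations. The sketch below, assuming POT (Flamary et al., 2021) and illustrative data, builds two embeddings against a common reference and compares them in $L^2(\mu)$ as in (27).

```python
import numpy as np
import ot

def lot_embedding(x_ref, p, y, q):
    """Return v_gamma(x_i) = T_gamma(x_i) - x_i as an (n, d) array."""
    M = ot.dist(x_ref, y, metric="sqeuclidean")
    gamma = ot.emd(p, q, M)                      # optimal plan for W_2(mu, nu)
    T = (gamma @ y) / p[:, None]                 # barycentric projection, eq. (33)
    return T - x_ref                             # logarithm map / LOT embedding (26)

rng = np.random.default_rng(5)
x_ref = rng.normal(size=(50, 2)); p = np.full(50, 1 / 50)
y1 = rng.normal(size=(70, 2)) + 1.0; q1 = np.full(70, 1 / 70)
y2 = rng.normal(size=(80, 2)) - 1.0; q2 = np.full(80, 1 / 80)

v1 = lot_embedding(x_ref, p, y1, q1)
v2 = lot_embedding(x_ref, p, y2, q2)
lot_dist = np.sqrt(np.sum(p[:, None] * (v1 - v2) ** 2))  # ||v1 - v2||_{L^2(mu)}, cf. (27)
print(lot_dist)                                  # a proxy for W_2(nu_1, nu_2)
```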
Given an optimal transportation plan γ Γ (µ, ν), we define eνγ := (Tγ)#µ. Proposition A.4 (Prop. II.4 (Beier et al., 2022)). If γ Γ (µ, ν), then the transportation plan induced by the barycentric projection map Tγ, denoted as eγ := (id Tγ)#µ is optimal for the OT problem W2(µ, eνγ). In addition, if γ = (id T)#µ satisfies the Monge mapping assumption, then Tγ = T µ-a.e. and is a Monge map. Thus, if we replace ν by eνγ, the OT problem W2(µ, eνγ) can be solved by the map Tγ. Based on this property, given ν1, ν2 P(Rd), consider optimal transportation plans γ1 Γ (µ, ν1), γ2 Γ(µ, ν2), then define the following approximated LOT distance (a LOT) to estimate W2(ν1, ν2): a LOT 2 2 (ν1, ν2; µ) := inf γ1 Γ (µ,ν1) γ2 Γ (µ,ν2) LOT 2 2 (eνγ1, eνγ2; µ, eγ1, eγ2) = inf γ1 Γ (µ,ν1) γ2 Γ (µ,ν2) Tγ1 Tγ2 2 L2(µ) (34) Tγ1 Tγ2 2 L2(µ) (35) Note, in most of the literature, (35) is referred to as LOT2, and for simplicity, it is approximated by any pair of optimal transportation plans (γ1, γ2). Published as a conference paper at ICLR 2025 B Background: Linear Gromov-Wasserstein Distance Let X Rd0 be a compact and convex subset, and consider a Borel measure µ on X. A symmetric function g X : X 2 R with R X 2 g X(x, x )dµ 2 < is called a gauge function. The space of all gauge functions is denoted as L2 sym(X 2, µ 2). In this paper, without specific description, we default set g X to be continuous (thus, Lipschitz continuous) on X. For example g X(x, x ) = x x 2 or g X(x, x ) = x x . The space X = (X, g X, µ) is called a gauged measure space (gm-space), which can be regarded as a generalized version of an mm-space. The GW problem between two gm-spaces X, Y is: GW 2(X, Y) := inf γ Γ(µ,ν) (X X) 2 |g X(x, x ) g Y (y, y )|2dγ 2. Lemma B.1. There exists a minimizer γ of GW(X, Y) in Γ(µ, ν). The existence of minimizers of GW(X, Y) could be shown by a standard compact-SNEss argument, see for instance, Lemma 10.3 in (M emoli, 2011). Suppose γ is one optimal solution of the above generalized GW problem. The geodesic between X and Y is defined by t 7 γt := (X Y, (1 t)g X + tg Y , γ), t [0, 1]. (36) Note, even if g X, g Y are metrics, γt is a gm-space (rather than mm-space), and this is the reason mm-space is generalized to gm-spaces in (Beier et al., 2022) and in this paper. Two gm-spaces X, Y are equivalent if GW 2(X, Y) = 0. Let [X] denote the equivalence class of X. The tangent space at [X] is defined by (X ,g X ,µ ) [X] L2 sym(X X , µ 2) where, if Tk1 = (X1, g X1, µ1), Tk2 = (X2, g X2, µ2) [X] are representatives of k1 L2 sys((X1) 2, (µ1) 2), k2 L2 sys((X2) 2, (µ2) 2), we say that k1 k2 if there exists γ Γ(X1, X2) such that GW(X1, X2) = 0. (We refer to (Beier et al., 2022) for more details.) Given spaces X1 := (X Y 1, d X, γ1), X2 := (X Y 2, d X, γ2) where γ1, γ2 are optimal transportation plans for GW(X, Y1), GW(X, Y2), respectively, it is straightforward to verify GW(X, X1) = GW(X, X2) = 0, thus X1, X2 [X]. In addition, choose k1 L2 sys(X1) 2, (γ1) 2), k2 L2 sys((X2) 2, (γ2) 2), then we have k1, k2 Tan X, where Γ (Tk1, Tk2) is same to Γ (γ1, γ2) in GW(X1, X2). Their inner product distance is defined by D2(k1, k2) := inf γ Γ (Tk1,Tk2) k1 k2 2 γ 2. Now, we set k1 = kγ1 = g Y 1 g X, k2 = kγ2 = g Y 2 g X. Thus, the linear GW distance between Y1 and Y2 given γ1, γ2 is defined by the above inner product distance. 
That is, LGW(Y1, Y2; X, γ1, γ2) := D(k1, k2) = inf γ Γ (Tk1,Tk2) k1 k2 2 L2(γ 2) = inf γ Γ(γ1,γ2;µ) (X Y 1 Y 2) 2 g Y 1(y1, y1 ) g Y 2(y2, y2 ) 2dγ 2 Published as a conference paper at ICLR 2025 where Γ(γ1, γ2; µ) := {γ P(X Y 1 Y 2) : (πX,Y 1)#γ = γ1, (πX,Y 2)#γ = γ2}. The third line follows from Proposition III.1 in (Beier et al., 2022) (or equivalently from the identity (49) in the next section). Note, under Monge mapping assumptions, γ1 = (id T 1)#µ, γ2 = (id T 2)#µ, Γ(γ1, γ2; µ) = {(id T 1 T 2)#µ} and the above distance (38) coincides with (7). The original Linear Gromov-Wasserstein distance between Y1 and Y2 with respect to X is defined as LGW(Y1, Y2; X) := inf γ1 Γ(X,Y1) γ2 Γ(X,Y2) LGW(Y1, Y2; X, γ1, γ2). (39) C LOT/LGW/LPGW embedding without Monge mapping. In this section, we briefly discuss the linear transportation embedding under linear optimal transport, linear Gromov Wasserstein, and linear partial Gromov Wasserstein setting. In summary, the formulations of these embeddings do not rely on the Monge mapping assumption. Linear OT embedding without Monge mapping. Given probability measures ν1, ν2 P(Ω), and reference measure µ P(Ω), where Ω Rd is non-empty set. Choose optimal transportation plans γ1 Γ (µ, ν1), γ2 Γ (µ, ν2), the linear OT embedding 26 can be generalized to ν1 7 ˆT 1 := {x 7 γ1( |x)}, (40) where the conditional distribution γ1( |x) can be treated as a random mapping . When γ1 = (id T 1)#µ, it is straightforward to verify T 1 = ˆT 1). That is, the embedding (40),(26) coincide. We define the distance between two such two random mappings T 1, T 2 via: ˆT 1 ˆT 2 2 L(µ) := Z inf γ Γ(γ1( |x),γ2( |x)) y1 y2 2dγ(y1, y2)dµ(x) (41) From (29), we have LOT 2 2 (ν1, ν2; µ) := Z γ Γ(ν1,ν2;µ) y1 y2 γ(x, y1, y2) Ω inf γ Γ(γ1( |x),γ2( |x) y1 y2 2dµ(x) = ˆT 1 ˆT 2 2 L(µ). Linear GW embedding without Monge mapping. Similar to above formulation, given two gm spaces Y1 = (Y 1, g Y 1, ν1), Y2 = (Y 1, g Y 2, ν2) and a reference gm space X = (X, g X, µ), where µ, ν1, ν2 are probability measures. Suppose γ1, γ2 are optimal transportation plans for GW(X, Y1), GW(X, Y2) respectively. Then the embedding (5) can be generalized to: Y1 7 ˆk1 := {(x, x ) 7 (g Y 1( 1, 2))#(γ1( 1|x)γ1( 2|x )). (42) Note, similar to LOT, when γ1 = (id T 1)#µ, we have ˆT 1 = T 1. That is (5),(42) coincide. Similarly, we define the distance between two embeddings ˆk1, ˆk2 via: ˆk1 ˆk2 L(µ 2) X2 inf γ Γ(γ1( |x),γ2( |x) γ Γ(γ1( |x ),γ2( |x )) g Y 1(y1, y1 ) g Y 2(y2, y2 ) 2dγ((y1, y2|x)dγ (y1 , y2 |x )dµ 2(x, x ). Published as a conference paper at ICLR 2025 From (8), we have LGW(Y1, Y2; Y1, Y2, X) = inf γ Γ(γ1,γ2;µ) |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2γ 2(x, y1, y2) X2 inf γ Γ(γ1( |x),γ2( |x) γ Γ(γ1( |x ),γ2( |x )) g Y 1(y1, y1 ) g Y 2(y2, y2 ) 2dγ((y1, y2|x)dγ (y1 , y2 |x )dµ 2(x, x ) = ˆk1 ˆk2 L(µ 2). That is, the distance between two embeddings coincides with the LGW distance between two gm spaces. LPGW embedding without Monge mapping. Similar to above setting, given gm-spaces X, Y1, Y2, and suppose γ1, γ2 are optimal transportation plans for PGWλ(X, Y1), PGWλ(X, Y2), the LPGW embedding in general case is defined by Y1 7 E1 := (ˆk1, γ1 X, γ1 c := (ν1) 2 (γ1 Y 1) 2). (44) Note, when the Monge mapping assumption holds, (13),(44) coincide. The distance between two embeddings is defined as: D(E1, E2) := inf µ γ1 X γ2 X ˆk1 ˆk1 (µ ) 2 + λ(|γ1 X|2 + |γ2 X|2 2|µ |2 + |γ1 c| + |γ2 c|). 
(45) LPGWλ(Y1, Y2; X) = inf γ Γ (ν1,ν2;µ) (X Y 1 Y 2)2 g Y 1(y1, y1 ) g Y 2(y2, y2 ) 2dγ(x, y1, y2)dγ(x , y1 , y2 ) + λ(|ν1|2 + |ν2|2 |γ|2) = inf µ γ1 γ2 inf γ Γ(γ1( |x),γ2( |x)) γ Γ(γ1( |x ),γ2( |x )) X2 g Y 1(y1, y1 ) g Y 2(y2, y2 ) d(µ ) 2(x, x ) ˆk1 ˆk2 L((µ ) 2) + λ(|ν1|2 |γ1 X|2 | {z } |γ1c | + |ν2|2 |γ2 X|2 | {z } |γ2 X|2 +|γ1 X|2 + |γ2 X|2 2|µ |2) = D(E1, E2) That is, the distance between two embeddings coincides with the LPGW distance between the two gm-spaces. D Computational Complexity of LPGW Computational Complexity of GW and PGW. The classical solvers for the GW and PGW problems are variants of the Frank-Wolf algorithms (Chapel et al., 2020; Bai et al., 2024). Consider metric measure spaces X and Y, with measures given by µ = Pn i=1 piδxi and ν = Pm j=1 qjδyj, respectively. To achieve an ϵ accurate solution for the GW problem, the number of required iterations of the FW algorithm is 2L1, 2 n2m2 max({2(CX)2 + 2(CY )2}) Published as a conference paper at ICLR 2025 and for PGW is 2L1, 2 min(|p|, |q|) n2m2 max({2(CX)2 + 2(CY )2, 2λ}) Here, L1 is a value determined by the initial guess γ0 used in the FW algorithm and CX, CY are the corresponding cost matrices for X, Y. Note, since the largest value for λ is 2λ = max(2(CX)2, 2(CY )2), the above two quantities coincide. In each iteration of FW, we are required to solve an OT/POT problem, whose complexity is O(nm(n + m)) (Bonneel et al., 2011). Thus, the theoretical time complexity to solve GW/PGW is O( 1 ϵ2 nm(n + m)n2m2). Given K mm-spaces, X1, . . . XK, then, computing their pairwise GW/PGW distances requires K 2 O( 1 ϵ2 nm(n + m)n2m2). Meanwhile, computing the pairwise LPGW distances between them requires only ϵ2 nm(n + m) + n2m2 + K 2 O(n2 0), (46) where the term O(n2 0) represents the time complexity of computing the distance between two embeddings, and n0 denotes the size of the reference space. In general, we would like to set n0 to be the mean/median/maximum of the sizes of K mm-spaces, and thus, this term can be ignored as compared to the first term. Remark D.1. We generally impose a fixed limit on the maximum number of iterations to be used in the FW algorithm (e.g., in Python OT (Flamary et al., 2021), it is set to 1000 by default). In practice, this maximum number of iterations is generally not achieved. Hence, un-rigorously speaking, we can think of the complexity of GW/PGW as being O(nm(n+m)). E Proof of Proposition 2.1 E.1 Clarification Although statements (1) and (2) in this proposition are not explicitly introduced in (Beier et al., 2022), they are implicitly mentioned, for example, in Proposition III.1 of such article. Thus, we do not claim the proofs of statements (1) and (2) as contributions of this paper. Additionally, to the best of our knowledge, the statement (3) has not yet been studied. We present the related proof as a complement to the work in (Beier et al., 2022). E.2 Conjecture and Understanding Regarding statement (3), we conjecture that this conclusion can be extended to the case where g X and g Y are squared Euclidean distances. However, this statement may not hold for other gauge mappings. In our understanding, for general gauge mappings, achieving a similar result to the statement (3) would require defining a generalized barycentric projection mapping that is dependent on the specific gauge mapping. The formulation of a new barycentric projection mapping based on the chosen gauge mapping, as well as a generalized version of statement (3), is left for future work. 
E.3 Proof of Proposition 2.1: Part (1) Proof in a particular case. We first show the result in a simplified case. In particular, suppose γ satisfies the Monge mapping, i.e. γ = (id T)#µ, where T : X Y , and γ0 = (id id)#µ Γ (X, X). Then kγ0 0 and thus LGW(X, Y; X, γ0, γ ) = kγ kγ0 2 µ 2 = kγ 2 µ 2 (X Y ) 2 |g X(x, x ) g Y (T(x), T(x ))|2dµ 2 = GW(X, Y). Published as a conference paper at ICLR 2025 Proof in the general case. Consider γ0 Γ (X, X), and γ Γ(γ0, γ ). We claim the following lemma Lemma E.1. Choose γ0 Γ (X, X), γ M+(X, Y), for each γ Γ(γ0, γ ; µ), we have the following identity: Z (X X Y ) 2 |g X(x2, x2 ) g Y (y, y )|2d(γ) 2((x1, x1 ), (x2, x2 ), (y, y )) (X Y ) 2 |g X(x, x ) g Y (y, y )|2d(γ ) 2((x, x ), (y, y )) (47) Proof of Lemma (E.1). Since GW(X, X) = Z (X X) 2 |g X(x1, x1 ) g X(x2, x2 )|2dγ0 = 0, we have g X(x1, x1 ) g X(x2, x2 ) = 0, γ0 a.s. For convenience, let X1 = X2 = X be independent copies of X. We have: Z (X X Y ) 2 |g X(x2, x2 ) g Y (y, y )|2dγ 2((x1, x1 ), (x2, x2 ), (y, y )) (X X0 Y ) 2 |g X(x2, x2 ) g X(x1, x1 ) + g X(x1, x1 ) g Y (y, y )|2dγ 2((x1, x1 )(x2, x2 ), (y, y )) (Y ) 2 |g X(x2, x2 ) g X(x1, x1 ) + g X(x1, x1 ) g Y (y, y )|2 dγ 2 Y |(X1,X2)((y, y )|(x1, x1 ), (x2, x2 )) i d(γ0) 2((x1, x1 ), (x2, x2 )) (48) Y 2 |g X(x1, x1 ) g Y (y, y )|2dγ 2 Y |X1,X2((y, y )|(x1, x1 ), (x2, x2 ) d(γ0) 2((x1, x1 ), (x2, x2 )) (X1 Y ) 2 |g X(x1, x1 ) g Y (y, y )|2dγ 2((x1, x2, y), (x1 , x2 , y )) (X Y ) 2 |g X(x, x ) g Y (y, y )|2d(γ ) 2((x, y), (x , y )) (49) where in (48) γY |X1,X2( |x1, x2) is the disintegration of γ with respect to (πX1,X2)#γ = γ0. Based on the identity (47) and the fact γ1 is optimal for problem GW(X, Y), we have: LGW(X, Y; X, γ0, γ ) = inf γ Γ(γ0,γ ) (X X Y ) 2 |d X(x0, x0 ) d Y (y, y )|2dγ 2((x, x ), (x0, x0 ), (y, y )) = inf γ Γ(γ0,γ ) GW(X, Y) = GW(X, Y), and we have completed the proof. E.4 Proof of Proposition 2.1: Part (2) In this case, γ = (id T)#µ, for each x Supp(µ), we have that the disintegration of γ with respect to its first marginal (πX)#γ = γ X, given first component is x, is γ Y |X( 2|y ) = δ( 2 T(x)), Published as a conference paper at ICLR 2025 where δ is the Dirac measure. Thus, for all x supp(µ), we have: Y ydγ Y |X(y|x) = Z Y y dδ(y T(x)) = T(x). where the last equality holds from the fact T(x) supp(ν) Y . Therefore, Tγ = T µ a.s. Since T#µ = ν, we have eνγ := (Tγ )#µ = T#µ = ν. E.5 Proof of Proposition 2.1: Part (3) E.5.1 Proof in the Discrete Case We first demonstrate the proof in the following simplified discrete case: Suppose g X(x, x ) = x x and g Y (y, y ) = y y , and consider i=1 piδxi, ν = j=1 qjδyj with i=1 pi = 1 = In addition, we suppose that X, Y are compact convex sets which contains 0 (besides containing {x1, . . . , xn} and {y1, . . . , ym}, respectively). Let γ Γ (X, Y) be an optimal transportation plan. Then, the corresponding barycentric projected measure is given by eν = eνγ := i=1 piδeyi, where eyi := 1 j=1 γ ijyj. By convexity, ey1, . . . , eyn Y . Thus e Y := e Yγ = (X, g Y , eν). Then, it induces a transportation plan eγ := diag(p0 1, , p0 n) Rn n + with first and second marginals µ and eν, respectively. We will show that for any transportation plan eγ Γ(µ, eν), it holds that C(eγ ; X, e Y) C(eγ; X, e Y), C(γ; X, Y) = j,j =1 (x i xi y j yj )2γijγi j , in other words, we will show that eγ is optimal for GW(X, e Y). First, we notice that (eγ ) 1 = diag(1/p0 1, , 1/p0 n). Then, we set γ := eγ(eγ ) 1γ Rn m + . Secondly, we check that γ Γ(µ, ν). In fact, let p = (p1, . . . , pn), q = (q1, . . . 
, qm) be the vectors of the weights of finite discrete measures µ and ν, respectively, then, since γ 1m = p and γT 1n = q, we have γ1m = eγ(eγ ) 1γ 1m = eγ(eγ ) 1p = eγ1n = q, γ 1n = (γ ) (eγ ) 1eγ 1n = (γ ) (eγ ) 1q = (γ ) 1m = p. Lastly, we compute the costs as follows: C(γ ; X, Y) = i,i =1 (x i xi )2pipi + j,j =1 (y j yj )2qjqj 2 j,j =1 (x i xi )(y j yj )γ ijγ i j j,j =1 (x i xi )(y j yj )γ ijγ i j . Published as a conference paper at ICLR 2025 where J1 = Pn i,i =1(x i xi )2pipi + Pm j,j =1(y j yj )2qjqj is independent of γ . By the multilinearity of the last term in the identity, we can show C(eγ ; X, e Y) = J2 2 ℓ,ℓ =1 (x i xi )(ey ℓeyℓ )eγ iℓeγ i ,ℓ ℓ,ℓ =1 x i xi = J2 J1 + C(γ ; µ, ν1), where J2 = Pn i,i =1(x i xi )2pip0 i + Pn ℓ,ℓ =1(ey ℓeyℓ )2p0 ℓp0 ℓ . Similarly, we can compute C(eγ; X, e Y) = J2 J1 + C(γ; X, Y) (51) The optimality of γ together with identities (50) and (51) implies that C(eγ ; X, e Y) C(eγ; X, e Y). E.5.2 Proof in the General Case Similar to the discrete case, given γ Γ (X, Y), the barycentric projection is given by Y y dγ Y |X(y|x). We have that eγ = (id Tγ )#µ is a joint measure with first and second marginals µ and eν := eνγ := (Tγ )#µ, respectively. The goal is to show that eγ is optimal for GW(X, e Y) where e Y := e Yγ := (Tγ )#µ. Let eγ Γ(µ, eνγ ) be an arbitrary plan. Let eγ( 2|x) := eγY |X( 2|x) denote the disintegration of eγ with respect to µ, given that the first component is x, for each x Supp(µ). Similarly, we adopt the notation γ ( 2|x) for each x Supp(µ). We define the joint measure γ := γ Y |Xeγ X|Y eγ (52) on X Y . In particular, the above notation means that for each test function ϕ C0(X Y ), we have X Y ϕ(x, y)dγ(x, y) = Z X Y X Y ϕ(x, y)dγ (y|x0)deγ (x0|ey)deγ (x, ey) Y ϕ(x, y)dγ (y|x0) deγ (x0|ey) deγ (x, ey) Published as a conference paper at ICLR 2025 For any test functions ϕX C0(X), ϕY C0(Y ) we have Z X Y ϕX(x)dγ(x, y) = Z X Y X Y ϕX(x)dγ (y|x0)deγ (x0|ey)deγ(x, ey) X Y ϕX(x)deγ(x, ey) X ϕX(x)dµ(x) Z X Y ϕY (y)dγ(x, y) = Z X Y X Y ϕY (y)dγ (y|x0)deγ (x0|ey)deγ(x, ey) X Y X ϕY (y)dγ (y|x0)deγ (x0|ey)deν(ey) X Y ϕY (y)dγ (y|x0)dµ(x0) X Y ϕY (y)dγ (x0, y) Y ϕY (y)dν(y) Thus γ Γ(µ, ν). In addition, we observe that we have the following property that holds for the case of considering inner products: Lemma E.2. Suppose X Rdx, Y Rdy are finite-dimensional, convex compact sets. Let γ M+(X Y ) and define Tγ as in (9). In addition, let g Y (y, y ) = α y, y , where (y, y ) 7 y, y := Pd1 i=1 yiy i is an inner product (the standard one, for example). Then for each x, x X, let ey = Tγ(x), ey = Tγ(x ), we have: g Y (ey, ey ) = Z Y Y g Y (y, y )dγ(y|x)dγ(y |x ). (53) Proof of Lemma E.2. For each i [1 : dy], we have: ey[i]ey [i] = Z X y[i]dγ(y|x) Z X y [i]dγ(y |x ) y y [i]dγ(y|x)dγ(y |x ) Y Y y[i]y [i]dγ(y|x)dγ(y |x ) where the third equality follows from Fubini s theorem. g Y (ey, ey ) = α i=1 ey[i]ey [i] Y Y y[i]y [i]dγ(y|x)dγ(y |x ) i=1 y[i]y [i]dγ(y|x)dγ(y |x ) Y Y g X(y, y )dγ(y|x)dγ(y |x ) Published as a conference paper at ICLR 2025 Now, we continue with the proof of Proposition 2.1, part (3). C(γ ; X, Y) = Z (X Y ) 2 |g X(x, x ) g Y (y, y )|2dγ (x, y)dγ (x , y ) = g X(x, x ) 2 L2(µ 2) + g Y (y, y ) 2 L2(ν 2) 2 Z (X X) 2 g X(x, x )g Y (y, y )dγ (x, y)dγ (x , y ) = J1 2 g Xg Y , (γ ) 2 (54) where J1 = g X 2 L2(µ 2) + g Y 2 L2(ν 2) is independent to γ . 
Similarly, C(eγ ; X, e Y) (X Y ) 2 |g X(x, x ) g Y (ey, ey )|2deγ (x, y)deγ (x , y ) = g X 2 L2(µ 2) + g Y 2 L2(eν 2) 2 Z (X Y ) 2 g X(x, x )g Y (ey, ey )deγ (x, ey)deγ (x , ey ) X 2 g X(x, x )g Y (Tγ (x), Tγ (x ))dµ(x)dµ(x ) (55) X 2 g X(x, x ) Z Y 2 g Y (y, y )dγ (y|x)dγ (y |y ) dµ(x)dµ(x ) (56) (X Y ) 2 g X(x, x )g Y (y, y )dγ (x, y)dγ (x , y ) = J2 2 g Xg Y , (γ ) 2 (57) where in (55), J2 = g X( 1, 2) 2 L2(µ 2) + g Y ( 1, 2) 2 L2(eν 2) (which is independent of eγ ); and (56) follows from Lemma E.2. Combining (54) and (57), we have C(eγ ; X, e Y) = C(γ ; X, Y) J1 + J2. (58) Similarly, we can show that C(eγ; X, e Y) = C(γ; X, Y) J1 + J2. (59) Also, similarly to (54) and (57), we have C(γ; X, Y) = J1 2 g Xg Y , γ 2 (60) C(eγ; X, e Y) = J2 2 g Xg Y , eγ 2 (61) It remains to show g Xg Y , γ 2 = g Xg Y , eγ 2 . (62) Published as a conference paper at ICLR 2025 g Xg Y , eγ 2 = Z (X Y ) 2 g X(x, x )g Y (ey, ey )deγ(x, ey)deγ(x , ey ) (X Y ) 2 g X(x, x )g Y (ey, ey )deγ(ey|x)deγ(ey |x )dµ(x)dµ(x ) (X Y ) 2 g X(x, x ) Z X 2 g Y (Tγ (x0), Tγ (x0 ))deγ (x0|ey)deγ (x0 |ey ) deγ(ey|x)deγ(ey |x )dµ(x)dµ(x ) (63) (X Y X) 2 g X(x, x )g Y (Tγ (x0), Tγ (x0 )) deγ (x0|ey)deγ (x0 |ey )deγ(ey|x)deγ(ey |x )dµ(x)dµ(x ) (X Y X Y ) 2 g X(x, x ) Z Y 2 g Y (y, y )dγ (y|x0)dγ (y |x0 ) deγ (x0|ey)deγ (x0 |ey )deγ(ey|x)deγ(ey |x )dµ(x)dµ(x ) (64) (X Y X Y ) 2 g X(x, x )g Y (y, y ) dγ (y|x0)deγ (x0|ey)deγ(ey|x)dµ(x) dγ (y |x 0 )deγ (x0 |ey )deγ(ey |x )dµ(x ) (X Y ) 2 g X(x, x )g Y (y, y )dγ(x, x)dγ(x , y ) = g Xg Y , γ 2 where (63) holds since given ey, theprobability measure eγ ( 1|ey) is only supported on {x0 X : ey = Tγ (x0)}, similarly to measure γ( 1|ey ); (64) follows from Lemma E.2 (that is, from equality (53)). Combining (59), (58), (62) and the fact C(γ ; X, Y) C(γ; X, Y), we obtain that C(eγ ; X, Y) C(eγ; X, Y) which completes the proof. F Proof of Proposition 3.1 Consider gm-spaces X, Y1, Y2, and select γ1 Γ ,λ(X, Y1), γ2 Γ ,λ(X, Y2). Inspired by the LGW distance (38), we propose the LPGW distance (conditioned on γ1, γ2) (15) in the general case: LPGWλ(Y1, Y2; X, γ1, γ2) := inf γ Γ (γ1,γ2;µ) (X Y 1 Y 2) 2 |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2dγ 2 + λ(|ν1|2 + |ν2|2 2|γ|2). (65) where Γ (γ1, γ2; µ) := {γ M+(X Y 1 Y 2) : (πX,Y 1)#γ γ1, (πX,Y 2)#γ γ2}. Next, we discuss the proof of Proposition 3.1. F.1 Proof of Proposition 3.1: Part (1) In this section, we discuss the proof of statement (1) of Proposition 3.1; namely, we prove the existence of a minimizer to the LPGW problem. First, we introduce a series of lemmas. Lemma F.1. Given Radon measures µ, ν1, ν2, and γ1 Γ (µ, ν1), γ2 Γ (µ, ν2), then Γ (γ1, γ2; µ) is sequentially compact set. Published as a conference paper at ICLR 2025 Proof. The main idea is similar to the proof in, e.g., Lemma B.2 (Liu et al., 2023). In particular, consider a sequence (γn) Γ ,X(γX, γY ). It is straightforward to verify that this sequence is bounded above in total variation (since γ1 and γ2 are finite measures) and that it is also a tight sequence. This verification is even simpler because X, Y 1, and Y 2 are compact. Thus, by Prokhorov s theorem for signed measures, the closure of Γ (γ1, γ2) (in the weak topology) is weakly sequentially compact in M(X Y 1 Y 2). It remains to show Γ (γ1, γ2; µ) is closed. Suppose γn w γ M(X Y 1 Y 2), for each nonnegative test function ϕ C(X Y 1), we have X Y 1 ϕ(x, y1)d(πX,Y 1)#γn(x, y1) = lim n X Y 1 Y 2 ϕ(x, y1)dγn(x, y1, y2) X Y 1 Y 2 ϕ(x, y1)dγ(x, y1, y2) X Y 1 ϕ(x, y1)d(πX,Y 1)#γ(x, y1) That is, (πX,Y 1)#γn w (πX,Y 1)#γ. 
By Lemma B.1 in (Liu et al., 2023), we have (πX,Y 1)#γ γ1. Similarly we have (πX,Y 2)#γ γ2. Thus, γ Γ (γ1, γ2; µ) and we have completed the proof. Lemma F.2. Suppose g X, g Y are continuous functions (see assumption (3)), we claim that the mapping (X Y ) 2 ((x, y), (x , y )) 7 |g X(x, x ) g Y (y, y )|2 R is a Lipschitz function with respect to metric DX Y ((x, y), (x , y )) = d X(x, x ) + d Y (y, y ). Proof of Lemma F.2. Let the mapping in the statement of the lemma be denoted by Φ. For x1, x 1, x2, x 2 X, y1, y 1, y2, y 2 Y we have |Φ((x1, y1), (x 1, x 1)) Φ((x2, y2), (x 2, y 2))| = ||g X(x1, x 1) g Y (y1, y 1)|2 |g X(x2, x 2) g Y (y2, y 2)|2| ||g X(x1, x 1) g Y (y1, y 1)|2 |g X(x2, x 2) g Y (y1, y 1)|2| + ||g X(x2, x 2) g Y (y1, y 1)|2 |g X(x2, x 2) g Y (y2, y 2)|2| K1(|g X(x1, x 1) g X(x2, x 2)| + |g Y (y1, y 1) g Y (y2, y 2)|) (66) = K1K2|d X(x1, x2) + d X(x 1, x 2) + d Y (y1, y2) + d Y (y 1, y 2)| (67) where (66) follows from the fact that the function r1 7 |r1 r2|2 is Lipschitz on a compact set (see, e.g., Lemma C.1 in (Bai et al., 2024)), and K1 0 is its Lipschitz constant. Then, (67) follows from assumption (3) for g X and g Y , with K2 0 being the maximum of their Lipschitz constants. Consider (γn) Γ (γ1, γ2; µ), such that |g Y 1 g Y 2|2 2λ, (γn) 2 inf γ Γ (γ1,γ2;µ) |g Y 1 g Y 2|2 2λ, γ 2 . By compactness of Γ (γ1, γ2; µ) (see Lemma F.1), we have that there exists γ Γ (γ1, γ2; µ) which is sub-sequence limit of γn (in weak convergence). By Lemma F.2 and Lemma C.2.(3) in (Bai et al., 2024), we have |g Y 1 g Y 2|2 2λ, (γ ) 2 = inf γ Γ (γ1,γ2;µ) |g Y 1 g Y 2|2 2λ, γ 2 . Thus, γ is a minimizer, and we have completed the proof. Published as a conference paper at ICLR 2025 F.2 Proof of Proposition 3.1: Part (2) Proof. Under the Monge mapping assumptions, we have Γ (γ1, γ2; µ) = {(id T 1 T 2)#µ : µ γ1 X γ2 X}. Thus, (65) becomes LPGW(Y1, Y2; X, γ1, γ2) = inf µ γ1x γ2 X X 2 |g Y 1(T 1(x), T 1(x )) g Y 2(T 2(x), T 2(x ))|2 + λ(|ν1|2 + |ν2|2 2|(id T 1 T 2)#µ |2) = inf µ γ1 X γ2 X X 2 |g Y 1(T 1(x), T 1(x )) g Y 2(T 2(x), T 2(x ))|2 + λ(|ν1|2 + |ν2|2 2|µ |2) and we have proven the statement. F.3 Proof of Proposition 3.1: Part (3) Since X, Y 1, Y 2 are compact, by continuity (see (3)), we have that g X, g Y 1, g Y 2 are bounded functions. By Lemma E.2 in (Bai et al., 2024), when 2λ > max(|g Y 1(y1, y1 ) g Y 2(y2, y2 )|2), pick any optimal γ Γ (γ1, γ2; µ) = {(id T 1 T 2)#µ : µ γ1 X γ2 X}, we have |γ| = min(|γ1|, |γ2|) = min(|γ1 X|, |γ2 X|). Thus γ = (id T 1 T 2)#γ1 X γ2 X. Plug in this optimal γ into the LPGW problem (14) (or (15)), we obtain LPGW(Y1, Y2; X, γ1, γ2) = |g Y 1 g Y 2|2, (γ1 X γ2 X) 2 + λ(|ν1|2 + |ν2|2 2|γ1 X γ2 X|2) = |kγ1 kγ2|2, (γ1 X γ2 X) 2 + λ(|ν1|2 + |ν2|2 2|γ1 X γ2 X|2) and we complete the proof. F.4 Proof of Proposition 3.1: Part (4) First, under the Monge mapping assumption, an optimal PGW plan is of the form γ = (id T)#γ X for some γ X µ, we have kγ 2 L2((γ X) 2) + λ(|µ|2 |γ X|2 + |γc|) X 2 |g X(x, x ) g Y (T(x), T(x ))|2d(γ ) 2 + λ(|µ 2 (γ X) 2| + |ν 2 (γ Y ) 2|) = PGWλ(X, Y). In the general case, we first introduce the following lemma. Lemma F.3. If γ0 Γ ,λ(X, X), then (πX)#γ0 = µ, (πY )#γ0 = µ. Proof of Lemma F.3. Pick any γ Γ (X, X), suppose |γ| < |µ|, we have C(γ; X, X, λ) = Z X 2 |g X(x, x ) g X(x0, x0 )|dγ 2 + λ(2|µ|2 2|γ|2) λ(2|µ|2 2|γ|2) > 0. (68) However, PGWλ(X, X) = 0. Thus,γ is not optimal. Published as a conference paper at ICLR 2025 By the above lemma, we have |γ | |γ0|. 
For each γ Γ (γ0, γ ; µ), from (Bai et al., 2024), there exists γ Γ (γ0, γ ) such that (πX,Y )#γ = γ and γ γ . Thus we have |γ | = |γ |. C(γ; X, X, Y, λ) := Z (X X Y ) 2 |g X(x, x ) g Y (y, y )|2dγ 2((x, x ), (x, x ), (y, y )) + λ(|µ|2 + |ν|2 2|γ|2) and similarly we define C(γ ; X, X, Y, λ). Then C(γ ; X, X, Y, λ) C(γ; X, X, Y, λ) |g X(x, x ) g Y (y, y )|2 2λ d(γ 2 γ 2)((x, x ), (x, x ), (y, y )) In addition, C(γ ; X, X, Y, λ) (X X Y ) 2 |g X(x, x ) g Y (y, y )|2dγ 2((x, x ), (x, x ), (y, y )) + λ(|µ|2 + |ν|2 2|γ |2) (X Y ) 2 |g X(x, x ) g Y (y, y )|2d(γ ) 2((x, x ), (y, y )) + λ(|µ|2 + |ν|2 2|γ |2) = PGWλ(X, Y) where the second equality holds by (49) and the fact |γ | = |γ |. Thus, we have LPGWλ(X, Y; X, γ0, γ ) C(γ ; X, X, Y, λ) = PGWλ(X, Y). Another direction is trivial since for each γ Γ (γ0, γ ; µ), (π2,3)#γ := (πX,Y )#γ Γ (µ, ν). Thus PGWλ(X, Y; γ) C(γ; X, X, Y, λ) LPGWλ(X, Y; X, γ0, γ ). and we have completed the proof. F.5 Proof of Proposition 3.1: Part (5) This section discusses the proof of statement (5) of Proposition 3.1. That is, we discuss the metric properties of the proposed LPGW distance. F.5.1 Formal Statement In this section, we set g Y = ds Y . Let G := {Y = (Y, g Y , ν)} where Y Rd for some d N such that Y is nonempty convex compact set; ν M+(Y ) with ν 2(|g Y |) < . Then Y1, Y2 G, we define Y1 Y2 iff GW(Y1, Y2) = 0. By (M emoli, 2011), in this case, optimal transportation plan GW(Y1, Y2) is induced by the Monge mapping T. Furthermore, T is bijection ν1-a.s. and is called isomorphism since d Y 1(y1, y1 ) = d Y 2(T(y1), T(y1 )), ν1 a.s. Next, given G G we introduce the following assumptions: Assumption F.4. For each Y G , the problem PGWλ(X, Y) admits a unique solution. Published as a conference paper at ICLR 2025 Assumption F.5. For subset G , define K1, K2 by K1 = sup{|ν|, Y = (Y, g Y , ν) G }, K2 = sup{ max y,y supp(ν) |g Y (y, y )|2 : Y = (X, g Y , ν) G )} We suppose K1, K2 < . Assumption F.6. g X, g Y 1, g Y 2 are induced by metrics. In particular, g X( , ) = dq X( , ), g Y 1( , ) = dq Y 1( , ), g Y 2( , ) = dq Y 2( , ), where d X is a metric defined in X, q 1, similar to d Y 1, d Y 2. Remark F.7. The uniqueness of optimal transportation plan assumption (F.4) is also introduced by (Nenna & Pass, 2023) to prove the triangle inequality of linear optimal transport. The assumption (F.5) discusses the bounded total mass and transportation cost. Proposition F.8 (Metric property of LPGW). Fix X G and for each Y1, we select γ1 Γ ,λ(X, Y1), γ2 Γ ,λ(X, Y2). In addition, set g X = ds X, g Y = ds Y where s 1. Then we have: (1) The LPGW distances (15),(16) are nonnegative: LPGW(Y1, Y2; X, γ1, γ2, λ), LPGWλ(Y1, Y2; X) 0 (69) (2) The LPGW distances (15),(16) are symmetric: LPGW(Y1, Y2; X, γ1, γ2, λ) = LPGW(Y2, Y1X, γ2, γ1, λ) LPGW(Y2, Y1; X, λ) = LPGWλ(Y1, Y2; X). (70) (3) The LPGW distances (15) satisfy triangle inequality. In addition, under assumption (F.4), (16) satisfies the triangle inequality. (4) Under assumption (F.6), if LPGWλ(Y1, Y2; X, γ1, γ2) = 0, we have Y1 Y2. Similarly to LPGWλ(Y1, Y2; X). (5) Under assumption (F.5), suppose |µ| K1, and 2λ K2 + supx,x supp(µ) |g X(x, x )|2, and Y1 Y2, where K1, K2 are defined in (F.5). Then there exists γ1, γ2 such that LPGW(Y1, Y2; X, γ1, γ2, λ) = LPGWλ(Y1, Y2; X) = 0. Therefore, under the assumptions (F.4), (F.5), and (F.6), we have that (16) defines a metric in G / . Note, the above statement implies the metric property of LGW by setting λ . We refer to the next section for details. 
F.5.2 Proof of Proposition F.8: Parts (1), (2), (4), (5) Proof. Choose Y1, Y2, Y3 G. (1), (2) It is straightforward to verify the nonnegativity (1) and symmetry (2). (4) In this setting, we have: 0 = LPGWλ(Y1, Y2; X, γ1, γ2) = inf γ Γ (γ1,γ2;µ) (X Y 1 Y 2) 2 |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2dγ 2 + λ(|ν1|2 + |ν2| 2|γ|2) Pick one corresponding minimizer γ, we obtain Z (X Y 1 Y 2) 2 |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2dγ 2 = 0, Published as a conference paper at ICLR 2025 and |ν1|2 + |ν2|2 2|γ|2 = 0. Thus γ Γ(ν1, ν2) and therefore GW(Y1, Y2) = 0. That is Y1 Y2. By a similar process, if LPGWλ(Y1, Y2; X) = 0, we have Y1 Y2. (5) Since Y1 Y2, from triangle inequality of PGWλ( , ), we have PGWλ(X, Y1) = PGWλ(X, Y2). In addition, there exists a bijection mapping T : supp(ν1) supp(ν2) such that g Y 1(x1, x 1) = g X2(T(x1), T(x 1)), ν1 a.s. such that T#ν1 = ν2. Pick γ1 Γ ,λ(X, Y1). Let γ2 := (id T)#γ1, that is, for each test function ϕ C0(X Y 2), X Y 1 ϕ(x, T(y1))dγ1(x, y1). We claim γ2 is optimal for PGWλ(X1, Y2). (π1)#γ2 = (π1)#γ1 µ, (π2)#γ2 = T#((π2)#γ1)) ν2 since (πY )#γ1 ν1. Thus, γ2 Γ (µ, ν2). In addition, C(γ2; X, Y2) (X Y 1) 2 |g X(x, x ) g Y 2(T(y1), T(y1) )|2d(γ1) 2 + λ(|µ|2 + |ν2|2 2|γ2|2) (X Y 1) 2 |g X(x, x ) g Y 2(T(y1), T(y1) )|2d(γ1) 2(x, x |y1, y1 ))d(ν1) 2(y1, y1 ) + λ(|µ|2 + |ν1| 2|γ1|1) (X Y 1) 2 |g X(x, x ) g Y 1(y1, y1 )|2d(γ1) 2(x, x |y1, y1 ))d(ν1) 2(y1, y1 ) + λ(|µ|2 |γ1|2) (71) = PGWλ(X, Y1) = PGWλ(X, Y2). and thus γ2 is optimal in PGWλ(X, Y2). Since 2λ is sufficiently large, by lemma E.2 in (Bai et al., 2024), there exists optimal γ1 such that: |γ1| = min(|µ|, |ν1|) = |ν1|. Thus we have |γ2| = |γ1| = |ν1| = |ν2|. Plug γ1, γ2 into (15), we have: LPGW(Y1, Y2; X, γ1, γ2) = Z g Y 1(y1, y1 ) g Y 2(T(y1), T(y1 ))d(ν1) 2(y1, y1 ) F.5.3 Proof of Proposition F.8: Part (3) Notation Setup and Summary Choose γ1 Γ ,λ(X, Y1), γ2 Γ ,λ(X, Y2), γ3 Γ ,λ(X, Y3), the goal is to show triangle inequality in this section: LPGWλ(Y1, Y2; X, γ1, γ2) LPGWλ(Y1, Y3; X, γ1, γ3) + LPGWλ(Y2, Y3; X, γ2, γ3). (73) Published as a conference paper at ICLR 2025 Related Lemmas Lemma F.9 (Varient Gluing lemma). Suppose γ1 Γ(µ, ν1), γ2 Γ(µ, ν2), γ3 Γ(µ, ν3) are probability measures. Pick γ1,2 Γ(γ1, γ2; µ), γ1,3 Γ(γ1, γ3; µ), γ2,3 Γ(γ1, γ3 ; µ), there exists γ P(X Y 1 Y 2 Y 3) such that (πX,Y 1,Y 3)#γ = γ1,3, (πX,Y 2,Y 3)#γ = γ2,3. (74) Note, if we set X = {0} and µ = δ0, the above lemma becomes the classical Gluing lemma (see e.g. Lemma 5.5 (Santambrogio, 2015)). For this, we call it the variant gluing lemma. Proof of Lemma F.9. For convenience, let X , Y 3 be independent copy of X, Y 3. By gluing lemma, there exists γ P((X Y 1 Y 2) X ) such that (πX,Y 1,Y 2)#γ = γ1,2, (πX,X )#γ = (id id)#µ. (75) Apply gluing lemma again between γ and γ1,3, we can find γ P((X Y 1 Y 3) (X Y 2 Y 3 )) such that (πX,Y 1,Y 3,X )#γ = γ , (πX ,Y 2,Y 3)#γ = γ2,3. Applying γ = (πX,Y 1,Y 2,Y 3)#γ we complete the proof. Alternatively, we can set γ = γ1,3 Y 1,Y 3|Xγ2,3 Y 2,Y 3|Xµ(x) and it is straightforward to verify γ satisfies (74). Lemma F.10 (Triangle inequality for LGW). Suppose X, Y1, Y2, Y3 are under the balanced GW setting, choose γ1 Γ (µ, ν1), γ2 Γ (µ, ν2), γ3 Γ (µ, ν3), we have LGW(Y1, Y2; X, γ1, γ2) LGW(Y1, Y3; X, γ1, γ3) + LGW(Y2, Y3; X, γ2, γ3) (76) Proof of lemma F.10. Pick the γ from lemma F.9, we have (πX,Y 1)#γX,Y 1 = (πX,Y 1)#γ1,3 = γ1, (πX,Y 2)#γX,Y 2 = (πX,Y 2)#γ2,3 = γ2. Thus (πX,Y 1,Y 2)#γ Γ(γ1, γ2; µ). 
Then we have LGW(Y1, Y2; X, γ1, γ2) = |g Y 1 g Y 2|2, (γ1,2) 2 1/2 |g Y 1 g Y 2|2, ((πX,Y 1,Y 2)#γ) 2 1/2 (77) X Y Y 2 Y 3 |g Y 1(y1, y1 ) g Y 2(y2, y2 )|2dγ 2((x, x ), (y1, y1 ), (y2, y2 ), (y3, y3 )) 1/2 X Y Y 2 Y 3 |g Y 1(y1, y1 ) g Y 3(y3, y3 )| + |g Y 2(y2, y2 ) g Y 3(y3, y3 | 2 dγ 2 1/2 X Y Y 2 Y 3 |g Y 1(y1, y1 ) g Y 3(y3, y3 )| 2 dγ 2 1/2 X Y Y 2 Y 3 |g Y 2(y2, y2 ) g Y 3(y3, y3 | 2 dγ 2 1/2 (78) = LGW(Y1, Y3; X, γ1, γ3) + LGW(Y2, Y3; X, γ2, γ3) Published as a conference paper at ICLR 2025 where (77) follows from the fact γ1,2 is optimal in Γ(γ1, γ2; µ); (78) holds by the Minkowski inequality. In the uniqueness assumption (F.4), we have LGW(Y1, Y2; X) LGW(Y1, Y3; X) + LGW(Y2, Y3; X). Convert LPGW to LGW The main idea of this step is similar to the proof of Proposition 3.3 in (Bai et al., 2024). Notations Setup We can redefine Γ (γ1, γ2; µ) as Γ (γ1, γ2; µ) = γ M+(X Y 1 X Y 2) : (π1,2)#γ γ1, (π3,4)#γ γ2, (π1,3)#γ = (id id)#µ , µ µ} . (79) We define auxiliary points ˆ 0, ˆ 1, ˆ 2, ˆ 3 and then define ˆX = X { ˆ 0}, ˆY 1 = Y 1 { ˆ 1, ˆ 2, ˆ 3}, ˆY 2 = Y 2 { ˆ 1, ˆ 2, ˆ 3}, ˆY 3 = Y 3 { ˆ 1, ˆ 2, ˆ 3}. ˆγ1 = γ1 + |γ2|δ ˆ 0, ˆ 2 + |γ3|δ ˆ 0, ˆ 3 M+( ˆX ˆY 1), ˆγ2 = γ2 + |γ1|δ ˆ 0, ˆ 1 + |γ3|δ ˆ 0, ˆ 3 M+( ˆX ˆY 2), ˆγ3 = γ3 + |γ1|δ ˆ 0, ˆ 1 + |γ2|δ ˆ 0, ˆ 2 M+( ˆX ˆY 3). g ˆY 1(y1, y1 ) = g Y 1(y1, y1 ) if y1, y1 Y 1, elsewhere and g ˆY 2, g ˆY 3 are defined similarly. Finally, define Dλ(r1, r2) = |r1 r2| if r1, r2 R, λ elsewhere. . Let ˆX = X { ˆ 0} to be a copy of set ˆX. Then, we can set ˆΓ(ˆγ1, ˆγ2; µ) := n ˆγ M+( ˆX ˆY 1 ˆX ˆY 2), (π1,2)#ˆγ = ˆγ1, (π3,4)#ˆγ = ˆγ2, (80) (π1,3)#ˆγ |X 2= (id id)#µ , µ µ} . (81) Consider the mapping: F : Γ (γ1, γ2; µ) 7 ˆΓ(ˆγ1, γ2; µ) γ 7 ˆγ := γ + (γ1 (π1,2)#γ) δ( ˆ 0, ˆ 1) + δ( ˆ 0, ˆ 2) (γ2 (π3,4)#γ) + |γ|δ( ˆ 0, ˆ 2),( 0, ˆ 1) + |γ3|δ( ˆ 0, ˆ 3),( ˆ 0, ˆ 3). (82) By the following lemma, we show that F defines an equivalent relation between the two sets. Lemma F.11. The mapping F defined in (82) is well-defined. In addition, it is a bijection if we set the identity ˆ 0 = ˆ 1 = ˆ 2 = ˆ 3. Published as a conference paper at ICLR 2025 Proof of Lemma F.11. By (Bai et al., 2022, Proposition B.1) we have for each γ Γ(γ1, γ2; µ), (π1,2)#ˆγ = ˆγ1, (π3,4)#ˆγ = ˆγ2. In addition, (π1,3)#ˆγ = (π1,3)# γ + ((π1)#γ1 (π1)#γ) δ ˆ 0 + δ ˆ 0 ((π1)#γ2 (π3)#γ) + (| γ| + |γ3|)δ ˆ 0, ˆ 0. Thus, (π1,3)#ˆγ |X 2= (π1,3)#γ = (id id)#(π1)#γ, where (π1)#γ (π1)#γ1 µ, and we have F is well-defined. It remains to show F is a bijection. For each ˆγ ˆΓ (ˆγ1, ˆγ2) and we define mapping F 1 : bΓ(ˆγ1, ˆγ2; µ) Γ (γ1, γ2; µ) (83) ˆγ 7 ˆγ |X Y 1 X Y 2 It remains to show F 1 is inverse of F. First, we claim F 1 is well-defined. Pick ˆγ ˆΓ(ˆγ1, ˆγ2), by Lemma (Bai et al., 2022, Lemma B.1.), we have (πX,Y 1)#ˆγ γ1, (π3,4)#ˆγ γ2. In addition, (π1,3)#γ = (π1,3)#ˆγ |X 2= (id id)#µ , where µ µ. Thus, F 1(ˆγ) Γ (γ1, γ2; µ). In addition, it is straightforward to verify F 1(F(γ)) = γ, γ Γ (γ1, γ2; µ),and then we complete the proof. It is straightforward to verify: D2 λ(g ˆY 1, g ˆY 2, ˆγ 2) = |g Y 1 g Y 2|2, γ 2 + λ(|γ1|2 + |γ2|2 |γ|2) Combine it with the above lemma, take the infimum over Γ (γ1, γ2; µ) (equivalently, over ˆΓ(ˆγ1, ˆγ2; µ), we obtain: inf ˆγ bΓ(ˆγ1,ˆγ2;µ) D2 λ(g ˆY 1, g ˆY 2, ˆγ 2) = inf γ Γ (γ1,γ2;µ) |g Y 1 g Y 2|2, γ 2 + λ(|γ1|2 + |γ2|2 |γ|2) = LPGW(Y1, Y2; X, γ1, γ2) where the second equality holds from the fact |ν1| = |ν2| = |ν3| = |γ1| = |γ2| = |γ3|. Similar identity holds for LPGW(Y1, Y3; X, γ1, γ3), LPGW(Y2, Y3; X, γ2, γ3). 
Verifying Inequality It remains to show inf ˆγ bΓ(ˆγ1,ˆγ2;µ) D2 λ(g ˆY 1, g ˆY 2, ˆγ 2) 1/2 inf ˆγ bΓ(ˆγ1,ˆγ3;µ) D2 λ(g ˆY 1, g ˆY 3, ˆγ 2) 1/2+ inf ˆγ bΓ(ˆγ2,ˆγ3;µ) D2 λ(g ˆY 2, g ˆY 3, ˆγ 2) 1/2. Pick γ1,2 Γ (γ1, γ2; µ) that is optimal for D2 λ(g ˆY 1, g ˆY 2, ˆγ 2) 1/2. Similalry, pick optimal γ1,3 Γ (γ1, γ3), γ2,3 Γ (γ2, γ3), we construct the corresponding optimal ˆγ1,3 Γ(ˆγ1, ˆγ3; µ), ˆγ2,3 Γ(ˆγ2, ˆγ3; µ). Thus, by Gluing lemma, there exists γ M+((X Y 1) (X Y 2) (X Y 3))) such that (π(1,2),(5,6))#γ = ˆγ1,3, (π(3,4),(5,6))#γ = ˆγ2,3. Now by the definition of ˆΓ(ˆγ1, ˆγ3; µ), ˆΓ(ˆγ2, ˆγ3; µ), we have (π1,5)#γ = (id id)#µ , (84) (π3,5)#γ = (id id)#µ (85) Published as a conference paper at ICLR 2025 for some radon measures µ , µ µ. Thus, we have µ = µ µ. Therefore, (π1,3)#γ = (id id)#µ . In addition, (π1,2)#γ = (π1,2)#ˆγ1,3 = ˆγ1, (π3,4)#γ = (π1,2)#ˆγ2,3 = ˆγ2. So (π1,2,3,4)#γ Γ(ˆγ1, ˆγ2; µ). Therefore, we have D2 λ(k Y 1, k Y 2), (ˆγ1,2) 2 1/2 D2 λ(k Y 1, k Y 2), γ 2 1/2 D2 λ(k Y 1, k Y 3), γ 2 1/2 + D2 λ(k Y 2, k Y 3), γ 2 1/2 where the second inequality holds from (Bai et al., 2024, Eq.(57)). G Relation Between LPGW and LGW Theorem G.1. Suppose |µ| = |ν1| = |ν2| = 1 and x, x is bounded, g Y 1, g Y 2 are continuous function. Choose sequence λn , if n is sufficiently large, we have λn, then LPGWλn(Y1, Y2; X) LGWλn(Y1, Y2; X), a LPGWλn(Y1, Y2; X) a LGW(Y1, Y2; X). Proof of theorem G.1. By the lemma F.1 in (Bai et al., 2024), when n is sufficiently large, in particular, when λn > M := sup y1,y1 Y 1 |g Y 1(y1, y1 ) g Y 2(y1, y1 )|, for each γ1 Γ (X, Y1), |γ1| = min(|µ|, |ν1|) = 1. That is Γ ,λn(X, Y1) = Γ (X, Y1). Similarly, when n is sufficiently large, we have , Γ ,λn(X, Y2) = Γ (X, Y2). Pick γ1 Γ (X, Y1) = Γ (X, Y1), γ2 Γ (X, Y2) = Γ (X, Y2), since |γ1| = |µ| = |ν1| = |ν2| = 1, the mass penalty term vanishes in (65), thus we have LPGWλn(Y1, Y2; X, γ1, γ2) = LGW(Y1, Y2; X, γ1, γ2). Take the infimum over all γ1, γ2 and the take a limit for n , we prove LPGWλn(Y1, Y2; X) LGW(Y1, Y2; X). Similarly, in this case, γ1 c = γ2 c = 0 LPGWλn(e Yγ1, e Yγ2; X, eγ1, eγ2) + λ(|γ1 c| + |γ2 c|) = LGW(e Yγ1, e Yγ2; X, eγ1, eγ2). Take the infimum on both sides over γ1, γ2 and take λn , we obtain a LPGW(Y1, Y2; X) a LGW(Y1, Y2; X). Published as a conference paper at ICLR 2025 H Proof of Theorem 3.2 H.1 Proof of Theorem 3.2: Parts (1), (3) (1) Similar to proof in Theorem 2.1, in this case, we have Tγ = T, γ X a.s. (3) Based on statement (2) of Theorem 3.2 (see the proof in next section), we have eγ1, eγ2 are optimal solutions for PGWλ(X, e Yγ1), PGWλ(X, e Yγ2) respectively. By (1), we have T 1 = Tγ1, T 2 = Tγ2 γ1 X γ2 Y a.s. Thus, we have: LPGWλ(Y1, Y2; X, γ1, γ2) = inf µ γ1 X γ2 X X 2 |d Y 1(T 1( 1), T 1( 2)) d Y 2(T 1( 1), T 2( 2))|2d(µ ) 2 + λ(|ν1|2 + |ν2|2 2|µ |2) = inf µ γ1 X γ2 X X 2 |d Y 1(Tγ1( 1), Tγ1( 2)) d Y 2(Tγ2( 1), Tγ2( 1)|2d(µ ) 2 + λ(|eνγ1|2 + |eνγ2|2 2|µ |2) + λ(|ν1|2 |eνγ1|2 + |ν2|2 |eνγ2|2) = inf µ γ1 X γ2 X X 2 |d Y 1(Tγ1( 1), Tγ1( 2)) d Y 2(Tγ2( 1), Tγ2( 1)|2d(µ ) 2 + λ(|eνγ1|2 + |eνγ2|2 2|µ |2) + λ(|γ1 c| + |γ2 c|) (86) = LPGWλ(e Yγ1, e Yγ2; X, γ1, γ2) + λ(|γc|1 + |γc|2) where (86) holds since eν1 = Tγ1γ1 1 = T 1γ1 1 = (πY )#γ1 ν1. Thus |ν1|2 |eν1|2 = |(ν1) 2 ((πY )#γ1) 2| = |γ1 c|; and similarly we have |ν2|2 |eν2|2 = |γ2 c|. H.2 Proof of Theorem 3.2: Part (2) H.2.1 Proof in the discrete case. Notation setup We first demonstrate a simplified proof in the discrete case. Next, we will provide the proof of the statement in the general case. 
i=1 pixi, ν = g X(x, x ) = x x , x, x X, g X(y, y ) = y y , y, y Y. Choose γ Γ (X, Y), we obtain the corresponding barycentric projected measure: eν := eνγ = Pn i=1 eqieyi with j=1 γ i,j, i [1 : n], (87) ( 1 eq1 i γ ijxi if eqi > 0, 0 elsewhere. . (88) Then e Yγ = (Y, g Y , eν), eγ = diag(eq1, . . . eqn). Our goal is to show eγ = diag(eq1, . . . , eqn) is optimal in PGWλ(X, e Yγ ). Published as a conference paper at ICLR 2025 Pick eγ Γ (p, eq) := Γ (µ, eν), similar to section E.5, we set diagonal matrix (eγ ) 1 Rn n + as (eγ ) 1 ii = ( 1 eqi if eqi > 0, 0 elsewhere. Let γ := eγ(eγ ) 1γ Rn m + . Let γX := γ1m and γY := γ 1n. In addition, we set D = {i : eqi > 0}, and 1D Rn with 1D[i] = 1, i D and 1D[i] = 0 elsewhere. Then we have: γ1m = eγ(eγ ) 1γ 1m = eγ(eγ ) 1eq = eγ1D = eγX p0, (89) γ 1n = (γ ) (eγ ) 1eγ 1n (γ ) (eγ ) 1eq = (γ ) 1D = γ Y q. (90) Therefore, γX eγX p, γY γ Y q. Thus, γ Γ (p, q). Relation between PGWλ(X, Y) and PGWλ(X, e Y). Similar to (54),(57), we obtain C(γ ; X, Y, λ) = (x x )2, (γ X) 2 + (y y )2, (γ Y ) 2 + λ(|p|2 + |q|2 2|γ |) j,j =1 x i x iy j y jγ i,jγ i ,j C(eγ ; X, e Y, λ) = (x x )2, (eγ X) 2 + (y y )2, (eγ Y ) 2 + λ(|p|2 + |eq|2 2|eγ |) j,j =1 x i x iy j yjγ i,jγ i ,j where (x x )2 := [(x i x i )2]i,i [1:n] Rn n, (γ X) 2 = γ X(γ X) , (x x )2, (γ X) 2 := Pn i,i =1(x i x i )2(γ X)i(γ X)i , is the element-wise dot product. All other notations are defined similarly. Combined the above two equalities with the fact |γ | = |γ Y | and γ X = eγ X, we obtain: C(eγ ; µ, eν, λ) = C(γ ; X, Y, λ) + (ey ey )2, (eγ Y ) 2 (y y )2, (γ Y ) 2 + λ(|q|2 |eq|2). (91) Similarly, from the fact γX = eγX (see (89)) we obtain C(eγ; µ, eν, λ) = C(γ; X, Y, λ) + (ey ey )2, (eγY ) 2 (y y )2, (γY ) 2 + λ(|q|2 |eq|2). (92) C(eγ ; X, e Y, λ) C(eγ; X, e Y, λ) = C(γ ; X, Y, λ) C(γ; X, Y, λ) + (ey ey )2, (eγ Y ) 2 (eγY ) 2 (y y )2, (γ Y ) 2 (γY ) 2 (ey ey )2, (eγ Y ) 2 (eγY ) 2 (y y )2, (γ Y ) 2 (γY ) 2 (93) i,i D (ey i eyi )2((eγ Y )i(eγ Y )i (eγY )i(eγY )i ) j,j =1 (y j yj )2((γ Y )j(γ Y )j (γY )j(γY )j ) γ i,jγ i,j eq1 i eq1 i ((x i xi )2)((eγ Y )i(eγ Y )i (eγY )i(eγY )i ) j,j =1 (y j yj )2((γ Y )j(γ Y )j (γY )j(γY )j ) (94) j,j =1 (y j yj )2 γ i,jγ i,j eq1 i eq1 i ((eγ Y )i(eγ Y )i (eγY )i(eγY )i ) ((γ Y )j(γ Y )j (γY )j(γY )j ) Published as a conference paper at ICLR 2025 where (93) holds since γ1 is optimal in PGWλ(X, Y); (94) follows from the facts γY γ Y (see (90)), and the fact ey i ey i = γi,jγi ,j y j y j eqieq i , i, i D and Jensen s inequality: (ey i ey i)2 γi,jγi ,j (y j y i)2 The last equality (95) holds from the following: For each (j, j [1 : m]), we have γ i,jγ i,j eq1 i eq1 i ((eγ1 Y )i(eγ Y )i (eγY )i(eγY )i ) ((γ Y )j(γ Y )j (γY )j(γY )j ) γ i,jγ i,j eq1 i eq1 i (eγ Y )i(eγ Y )i (γ Y )j(γ Y )j γ i,jγ i,j eqieqi (eγY )i(eγY )i (γY )j(γY )j γ i,jγ i ,j eq1 i eq1 i eq1 i eq1 i eq1 j eq1 j i,i D γ i,jγ i ,j eq1 j eq1 j eq1 j eq1 j = 0 γ i,jγ i,j eqieqi k,k =1 eγk,ieγk ,i k,k =1 γk,jγk ,j γ i,jγ i,j eq1 i eqi eγk,ieγk ,i eγk,iγ i,j eqi )( X eγk ,i γ i ,j eqi ) γ i,jγ i,j eq1 i eqi eγk,ieγk ,i X eγk,ieγk ,i γ i,jγ i ,j eq1 i eq1 i and thus we complete the proof. H.3 Proof for the general case H.3.1 Notation setup and related lemma Suppose g X(x, x ) = α0x x , g Y = αx y , x, x X, Y, y X. Choose γ Γ ,λ(X, Y). The barycentric projection mapping is Tγ (x) := Z y dγ (y |x ), x supp(µ). We obtain: eν := eνγ = (Tγ )#µ, eγ = (id Tγ )#µ and we e Yγ := (X, g Y , eν). Published as a conference paper at ICLR 2025 Our goal is to show eν is optimal in PGWλ(X, e Y). 
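As a concrete illustration of the barycentric projection used throughout this section (and of its discrete form in (87)-(88)), here is a minimal numpy sketch; the function name and the handling of zero-mass rows follow our reading of (88) and are not part of the released code.

```python
# Minimal sketch of the discrete barycentric projection in (87)-(88):
# given a (partial) plan gamma of shape (n, m) and target points Y of shape (m, d),
# each source point x_i is mapped to the gamma-weighted average of the y_j.
import numpy as np

def barycentric_projection(gamma, Y):
    q_tilde = gamma.sum(axis=1)                       # transported mass per source point, eq. (87)
    Y_tilde = np.zeros((gamma.shape[0], Y.shape[1]))  # rows with zero mass stay at 0, eq. (88)
    pos = q_tilde > 0
    Y_tilde[pos] = (gamma[pos] @ Y) / q_tilde[pos, None]
    return Y_tilde, q_tilde
```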
Equivalently, pick eγ Γ (µ, eν), we need to show C(eγ ; X, Y, λ) C(eγ; X, Y, λ). Set γ from (52): γ = γ Y |Xeγ X|Y eγ. We have γ satisfies the following two lemma: Lemma H.1. Choose γ Γ (µ, ν), eγ Γ (µ, eν). Set eν = (Tγ )#µ, then γ defined in (52) satisfies the following: (a) γ Γ ((πX)#eγ, (πY )#γ ) Γ (µ, ν1), furthermore: (πX)#γ = (πX)#eγ (96) (πY )#γ (πY )#γ (97) (b) If eγ = eγ , then γ = γ . (c) Regarding the second marginal of γ, we have for each test function ϕX C0(X): Z X ϕY (y)dγY (y) = Z Y X Y X ϕY (y)deγ (y|x0)deγ (x0|ey)deγY (ey). (98) (a) Note, in a discrete setting, this statement has been proved in (89),(90). Pick test functions ϕX C0(X), ϕY C0(Y ) with ϕY 0, we have: X Y ϕX(x)dγ(x, y) Y ϕX(x)dγ (y|x0)deγ (x0|ey)deγ(x, ey) X Y ϕX(x)deγ(x, ey) X ϕX(x)deγX(x) ϕY , γY = Z X Y ϕY (y)dγ(x, y) X Y X Y ϕY (y)dγ (y|x0)deγ (x0|ey)deγ(x, ey) Y ϕY (y)dγ (y|x0)deγ (x0|ey)deγY (ey) Y ϕY (y)dγ (y|x0)deγ (x0|ey)deγ Y (ey) X Y ϕY (y)dγ (y|x0)deγ X(x0) X Y ϕY (y)dγ (y|x0)dγ X(x0) Y ϕY (y)dγ Y (y) where the inequality holds from the fact eγY eγ Y . Published as a conference paper at ICLR 2025 (b) If eγ = eγ , pick ϕ C0(X X), we have: Y ϕ(x, y)dγ (y|x0)deγ (x0|ey)deγ (ey|x)deγ X(x) Y ϕ(x, y)dγ (y|x0)deγ X(x) X ϕ(x, y)dγ (y|x0)δ(x x0)dγ X(x) X Y ϕ(x, y)dγ (y|x)dγ X(x) X X ϕ(x, y)dγ (x, y) (c) It follows directly from the definition of γ. H.3.2 Relation between PGWλ(X, Y) and PGWλ(X, e Y). From (54),(57),(60),(61) and (62), we obtain: C(eγ ; X, e Y, λ) = C(γ ; X, Y, λ) + g2 Y , (eγ Y ) 2 g2 Y , (γ Y ) 2 + λ(|ν|2 |eν|2) (99) C(eγ; X, e Y, λ) = C(γ; X, Y, λ) + g2 Y , (eγY ) 2 g2 Y , (γY ) 2 + λ(|ν|2 |eν|2) (100) From (99)-(100), we obtain C(eγ ; X, e Y, λ) C(eγ; X, e Y, λ) = (C(γ ; X, Y, λ) C(γ; X, Y, λ)) + ((eγ ) 2 eγ 2))(g2 Y (ey, ey )) ((γ ) 2 γ 2)(g2 Y (x, y )) g2 Y , (eγ Y ) 2 (eγY ) 2 | {z } A g2 Y , (γ Y ) 2 γ 2 Y (101) where (101) holds from the fact γ is optimal in PGWλ(X, Y). In addition, pick ey, ey supp(eγ Y ), then for each x supp(eγ ( 1|y)), x supp(eγ ( 1|y )), we have: ey = Tγ (x), ey = Tγ (x ). X 2 g2 Y (Tγ (x0), Tγ (x0 ))d(γ 2 X|Y )((x0, x0 )|e(y, ey ))d((eγ Y ) 2 (eγY ) 2)((x, x ), (ey, ey )) From Lemma E.2 and Jensen s inequality g2 Y (Tγ (x0), Tγ (x0 )) Y 2 g Y (y, y )dγ (y|x0)dγ (y |x 0 ) 2 Y 2 g2 Y (y, y )dγ (y|x0)dγ (y |x 0 ). Combined it with the fact eγ Y eγY (see Lemma H.1 (a)), we obtain X 2 g2 Y (y, y )dγ (y|x0)dγ (y |x 0 )deγ (x0|y)deγ (x0 |y )d (eγ Y ) 2 eγ 2 Y ) ((x, x ), (y, y )) = g2 Y , (γ ) 2 Y |X(eγ ) 2 X|Y ((eγ Y ) 2 (eγY ) 2) . (102) Published as a conference paper at ICLR 2025 Thus, we can continue bound (101): (101) g2 Y , (γ ) 2 Y |X(eγ ) 2 X|Y ((eγ Y ) 2 (eγY ) 2) g2 Y , (γ Y ) 2 γ 2 Y = g Y , (γ ) 2 Y |X(eγ ) 2 X|Y (eγ Y ) 2 (γ Y ) 2 | {z } B1 g Y , (γ ) 2 Y |X(eγ ) 2 X|Y (eγY ) 2 (γY ) 2 | {z } B2 where B2 = 0 from lemma H.1 (c) and B1 = B2 from lemma H.1(b). Therefore w, we complete the proof. H.4 Proof of proposition 3.3 The proposition is directly implied by definition (20) and Proposition 3.1 (3). I Special Case: Linear Mass-Constrained Partial Gromov-Wasserstein Distance In this section, we introduce the linearization technique for the mass-constraint partial Gromov-Wasserstein distance, which can be regarded as a special case of the proposed LPGW distance. Mass-constraint Partial Gromov-Wasserstein distance is defined as: MPGWρ(X, Y) = inf γ Γη (µ,ν) (X Y ) 2 |g X(x, x ) g Y (y, y )|2dγ , (104) where η [0, min(|µ|, |ν1|)] is a fixed number. 
Γη (µ, ν) = {γ M+(X Y ) : (πX)#γ µ, (πY )#γ ν, |γ| = η}, We use Γη (X, Y1) to denote the set of all optimal transportation plans. Consider the following gm-spaces X = (X, g X, µ), Y1 = (Y 1, g Y 1, ν1), Y2 = (Y 2, g Y 2, ν2), given η [0, min(|ν1|, |ν2|) and we suppose |µ| = η. Choose γ1 Γη (X, Y1), γ2 Γη (X, Y2). Suppose the Monge mapping assumption holds and let T 1, T 2 be the corresponding transportation plan. We define kγ1 : ( 1, 2) 7 d X( 1, 2) d Y 1(T1( 1), T1( 2)) as linear MPGW embedding of Y1 given γ1. kγ2 is defined similalry. Thus, similar to (15) and (7), the linear M-PGW distance between Y1, Y2, given γ1, γ2 is defined as LMPGWη(Y1, Y2; X, γ1, γ2) := kγ1 kγ2 2 µ 2. (105) and similar to (7),(15), the general case without Monge mapping assumption is LMPGWη(Y1, Y; X, γ1, γ2) := inf γ Γη (γ1,γ2;µ) (X Y ) 2 |g X g Y |2dγ 2, (106) where Γρ (γ1, γ2; µ) := {γ M+(X Y 1 Y 2) : (πX,Y 1)#γ γ1, (πX,Y 2)#γ γ2, |γ| = η}. And similar to (10),(20), by applying barycentric projection Tγ1, Tγ2, we define the approximated linear-MPGW distance as follows: a LMPGWη(Y1, Y2; X) := inf γ1 Γρ (X,Y1) γ2 Γρ (X,Y2) LMPGW(b Y1, b Y2; X) = inf γ1 Γρ (X,Y1) γ2 Γρ (X,Y2) X 2 |d Y 1(Tγ1(x), Tγ1(x ) d Y 2(Tγ2(x), Tγ2(x ))|dµ 2. (107) Similar to theorem 3.2, we have: Published as a conference paper at ICLR 2025 Proposition I.1. Choose γ Γρ, (X, Y), γ1 Γρ, (X, Y1), γ2 Γρ, (X, Y2), we have: (1) linear MPGW embedding of Y1 can recover MPGW discrepancy between X and Y1, i.e. kγ 2 µ 2 = MPGWη(x). In general, we have: LMPGWη(X, Y; γ0, γ ) := MPGWη(X, Y), γ0 Γη, (X, X). (2) If Monge mapping assumption hold for γ1, γ2, then (106),(105) coincide. (3) ˆγ = (id Tγ )#µ is optimal for MPGWη(X, e Yγ ) = GW(X, e Yγ ). (4) When λ is sufficiently large, we have: LMPGWη(Y1, Y2; X) = LPGWλ(Y1, Y2; X) λ(|ν1|2 + |ν2|2 2η2). a LMPGWη(Y1, Y2; X) = a LPGWλ(Y1, Y2; X) λ(|ν1|2 + |ν2|2 2η2). (5) If |µ| = |ν1| = |ν2| = η = 1, LMPGWη, LGW coincide and a LMPGWη, a LGW coincide. (1) If Monge mapping assumption holds for γ , i.e. γ = (id T)#µ, we have: MPGWη(X, Y) = Z (X Y ) 2 |g X(x, x ) g Y (y, y )|2d(γ ) 2 = kγ 2 µ 2. Without Monge mapping assumption, pick γ0 Γη, (X, X), since |γ0| = η = |µ|, we have γ0 Γ(µ, µ). Thus Z (X X) 2 |g X(x, x ) g X(x0, x0 )|2d((γ0)2) 2 = MPGWη(X, X) = 0 = GW(X, X). Therefore γ0 Γ (X, X). Pick γ Γη (γ0, γ ) = Γ(γ0, γ ), by (49), we have Z (X X Y ) 2 |g X(x, x ) g Y (y, y )|2dγ 2 (X Y ) 2 |g X(x, x ) g Y (y, y )|2d(γ ) 2 = MPGWη(X, Y1; X, γ0, γ ) It holds for all γ Γη (γ0, γ ), thus LMPGWη(X, Y; X, γ0, γ ) = MPGWη(X, Y ). (2) Under Monge mapping assumption, we have γ1 = (id T 1)#µ , γ2 = (id T 2)#µ for some mappings T 1, T 2, where µ , µ µ. Since |µ | = |µ | = η = |µ|, then we have µ = µ = µ. Thus, Γ (γ1, γ2; µ) = {(id T 1 T 2)#µ , µ µ, |µ | = η} = {(id T 1 T 2)#µ} and we have (106),(105) coincide. Published as a conference paper at ICLR 2025 (3) Since η = |µ|, we have |eνγ1| = |γ | = η = |µ|, thus Γη (µ, eνγ1) = Γ(µ, eνγ1). Thus, MPGWη(X, e Yγ ) = GW(X, e Yγ ). Since γ is optimal in MPGWη(X, Y) = GW(X, Y), from Proposition 2.1, we have eγγ1 is optimal for GW(X, Yγ1) = MPGWη(X, e Yγ1) and we complete the proof. (4) Since X, Y 1, Y 2 are compact, by Lemma E.2 in (Bai et al., 2024) when λ is sufficiently large, for each γ1 Γ ,λ(X, Y1), γ2 Γ ,λ(X, Y2), we have |γ1| = |γ2| = η = |µ|. By Proposition M.1. (Bai et al., 2024), we have γ1, γ2 are optimal for MPGWη(X, Y1), MPGWη(X, Y2). That is Γ ,λ(X, Y1) Γη, (X, Y1), Γ ,λ(X, Y2) Γη, (X, Y2). For the other direction, pick γ1 Γη, (X, Y1), γ Γ ,λ(X, Y1). 
Thus $|\gamma'| = \eta$, and we have
$$C(\gamma_1; \mathbb{X}, \mathbb{Y}_1, \lambda) = \int_{(X\times Y_1)^2} |g_X(x,x') - g_{Y_1}(y_1,y_1')|^2 \, d(\gamma_1)^{\otimes 2} + \lambda(|\mu|^2 + |\nu_1|^2 - 2\eta^2) \le \int_{(X\times Y_1)^2} |g_X(x,x') - g_{Y_1}(y_1,y_1')|^2 \, d(\gamma')^{\otimes 2} + \lambda(|\mu|^2 + |\nu_1|^2 - 2\eta^2) = C(\gamma'; \mathbb{X}, \mathbb{Y}_1, \lambda) = \mathrm{PGW}_\lambda(\mathbb{X}, \mathbb{Y}_1).$$
Thus, $\gamma_1 \in \Gamma^*_\lambda(\mathbb{X}, \mathbb{Y}_1)$. Therefore, $\Gamma^*_\lambda(\mathbb{X}, \mathbb{Y}_1) = \Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_1)$ and $\Gamma^*_\lambda(\mathbb{X}, \mathbb{Y}_2) = \Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_2)$. Pick $\gamma_1 \in \Gamma^*_\lambda(\mathbb{X}, \mathbb{Y}_1) = \Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_1)$ and $\gamma_2 \in \Gamma^*_\lambda(\mathbb{X}, \mathbb{Y}_2) = \Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_2)$; we have
$$\mathrm{LMPGW}_\eta(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}, \gamma_1, \gamma_2) = \mathrm{LPGW}_\lambda(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}, \gamma_1, \gamma_2) - \lambda(|\nu_1|^2 + |\nu_2|^2 - 2\eta^2).$$
Taking the infimum over all $\gamma_1, \gamma_2$, we obtain
$$\mathrm{LMPGW}_\eta(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}) = \mathrm{LPGW}_\lambda(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}) - \lambda(|\nu_1|^2 + |\nu_2|^2 - 2\eta^2).$$
Similarly, we have
$$a\text{-}\mathrm{LMPGW}_\eta(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}) = a\text{-}\mathrm{LPGW}_\lambda(\mathbb{Y}_1, \mathbb{Y}_2; \mathbb{X}) - \lambda(|\nu_1|^2 + |\nu_2|^2 - 2\eta^2). \quad (108)$$
(5) In this case, we have $\Gamma_\eta(\mu, \nu_1) = \Gamma(\mu, \nu_1)$ and $\Gamma_\eta(\mu, \nu_2) = \Gamma(\mu, \nu_2)$. Thus $\Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_1) = \Gamma^*(\mathbb{X}, \mathbb{Y}_1)$ and $\Gamma^*_\eta(\mathbb{X}, \mathbb{Y}_2) = \Gamma^*(\mathbb{X}, \mathbb{Y}_2)$. Therefore LMPGW and LGW coincide, and a-LMPGW and a-LGW coincide.
Figure 2: (a) Shapes in the ellipse dataset; (b) reference spaces. In the first figure, we visualize the ellipse dataset (Beier et al., 2022). For each ellipse shape $X \in \mathbb{R}^{n \times 2}$, we normalize the scaling of the shape such that $\max_{i,i' \in [1:n]} \|X[i,:] - X[i',:]\| = 1$. The sizes of these shapes range from 90 to 650. In the second figure, we visualize the reference spaces. In each shape, the color represents the value of the probability mass at the corresponding location.
J Numerical implementation of the LOT, LGW, and LPGW distances
LOT distance. In previous sections, we introduced the LOT distance (31), the a-LOT distance (34), and its approximation formulation (35). Their relationship can be described as follows: the LOT distance (31) is proposed to approximate the OT distance; the a-LOT distance (34) is proposed to approximate the LOT distance; and formulation (35) is proposed to approximate the a-LOT distance. However, in practice, this is not a multi-layer approximation. The LOT distance (31) and the a-LOT distance (34) are introduced only for theoretical completeness and are not computationally feasible. In practice, formulation (35), $\|T_{\gamma_1} - T_{\gamma_2}\|_{L^2(\mu)}$, is used to approximate the OT distance between $\nu_1$ and $\nu_2$; in most references that cite (Wang et al., 2013), (34) is referred to as the LOT distance, and the original formulation (31) is not mentioned. A similar convention is adopted for LGW and LPGW.
LGW distance. The relation between the LGW distance (39), the a-LGW distance (10), and its approximation formulation (11) can be described as follows: the LGW distance (39) is proposed to approximate the GW distance; the a-LGW distance (10) is proposed to approximate the LGW distance; and formulation (11) is proposed to approximate the a-LGW distance. Similarly, this is not a multi-layer approximation: the LGW distance (39) and the a-LGW distance (10) are introduced only for theoretical completeness and are not computationally feasible. In practice, formulation (11),
$$\| g_{Y_1}(T_{\gamma_1}(\cdot_1), T_{\gamma_1}(\cdot_2)) - g_{Y_2}(T_{\gamma_2}(\cdot_1), T_{\gamma_2}(\cdot_2)) \|^2_{L^2(\mu^{\otimes 2})},$$
is used to approximate the GW distance.
LPGW distance. The relation between the LPGW distance (16), the a-LPGW distance (20), and its approximation formulation (22) can be described as follows: the LPGW distance (16) is proposed to approximate the PGW distance; the a-LPGW distance (20) is proposed to approximate the LPGW distance; and formulation (22) is proposed to approximate the a-LPGW distance. Similarly, this is not a multi-layer approximation: the LPGW distance (16) and the a-LPGW distance (20) are introduced only for theoretical completeness and are not computationally feasible. In practice, formulation (22),
$$\| g_{Y_1}(T_{\gamma_1}(\cdot_1), T_{\gamma_1}(\cdot_2)) - g_{Y_2}(T_{\gamma_2}(\cdot_1), T_{\gamma_2}(\cdot_2)) \|^2_{L^2((\gamma^1_X \wedge \gamma^2_X)^{\otimes 2})} + \lambda(|\nu_1|^2 + |\nu_2|^2 - 2|\gamma^1_X \wedge \gamma^2_X|^2),$$
is used to approximate the PGW distance.
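To make the practical formulation (22) concrete, the following is a minimal numpy sketch of the comparison between two LPGW embeddings. It assumes each embedding is stored as a matrix K_i with K_i[a, b] = g_{Y_i}(T_{gamma_i}(x_a), T_{gamma_i}(x_b)), together with the reference-marginal masses u_i of gamma_i and the total target mass |nu_i|; this storage convention is our assumption, not the authors' released interface.

```python
# Minimal numpy sketch of the practical LPGW comparison in (22); illustrative only.
import numpy as np

def lpgw_embedding_distance(K1, u1, nu1_mass, K2, u2, nu2_mass, lam):
    # K1, K2: (n0, n0) projected gauge matrices on the reference points
    # u1, u2: (n0,) masses of gamma^1_X and gamma^2_X on the reference points
    # nu1_mass, nu2_mass: total masses |nu_1|, |nu_2|
    w = np.minimum(u1, u2)                    # discrete version of gamma^1_X ^ gamma^2_X
    W = np.outer(w, w)                        # product weights for the double sum
    quad = np.sum((K1 - K2) ** 2 * W)         # L^2 term w.r.t. (gamma^1_X ^ gamma^2_X)^{(x)2}
    penalty = lam * (nu1_mass ** 2 + nu2_mass ** 2 - 2.0 * w.sum() ** 2)
    return quad + penalty
```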
K Details of Elliptical Disks Experiment
In this section, we present the details of the elliptical disks experiment.
Dataset and numerical setting details. The dataset we use is the ellipse dataset from (Beier et al., 2022), which consists of 100 distinct ellipses. Each ellipse is represented as an $n \times 2$ matrix, where $n$ ranges from 90 to 600. For reference shapes, we selected 9 different 2D shapes, including disks, squares, triangles, and others. See Figure 2 for a visualization of the dataset and the reference spaces. In this experiment, each shape is modeled as an empirical measure $\sum_{i=1}^n \frac{1}{n}\delta_{x_i}$.
Performance analysis. We present the results in Table 4. As mentioned in the main text, LPGW is significantly faster than PGW, as it requires only $N = 100$ PGW computations, while PGW requires $N^2$ computations. Furthermore, we observe that for some reference spaces (e.g., S5, S7, S9), the MRE is relatively low. Moreover, for most reference spaces, including S7, S8, and S9, LPGW admits a PCC greater than 0.85. Finally, when $\lambda$ is larger, the PCC tends to be higher and the MRE lower across all reference spaces, as discussed further in the next section. These results highlight the importance of the choice of reference space, as is commonly the case for linear OT-based methods.

lambda | metric      | PGW   | S1     | S2     | S3     | S4     | S5     | S6     | S7     | S8     | S9
       | points      |       | 441    | 676    | 625    | 52     | 289    | 545    | 882    | 882    | 317
0.05   | time (mins) | 45.34 | 0.86   | 3.8    | 3.08   | 0.66   | 0.63   | 1.43   | 1.89   | 2.1    | 0.69
       | MRE         |       | 0.1982 | 0.1263 | 0.1428 | 0.6315 | 0.4732 | 0.1454 | 0.0394 | 0.0793 | 0.0245
       | PCC         |       | 0.5774 | 0.5741 | 0.5884 | 0.5584 | 0.6514 | 0.7698 | 0.9303 | 0.8711 | 0.9944
0.08   | time (mins) | 43.86 | 1.02   | 3.33   | 3.55   | 0.34   | 0.99   | 1.25   | 1.29   | 1.55   | 1.22
       | MRE         |       | 0.1941 | 0.1264 | 0.1431 | 0.2542 | 0.0705 | 0.0444 | 0.0205 | 0.0198 | 0.0245
       | PCC         |       | 0.5781 | 0.5738 | 0.5881 | 0.8581 | 0.8741 | 0.993  | 0.9952 | 0.9954 | 0.9949
0.1    | time (mins) | 46.97 | 0.76   | 3.78   | 3.13   | 0.08   | 0.62   | 1.56   | 1.91   | 1.98   | 0.71
       | MRE         |       | 0.1941 | 0.1264 | 0.1431 | 0.2542 | 0.0538 | 0.0444 | 0.0205 | 0.0198 | 0.0245
       | PCC         |       | 0.5781 | 0.5738 | 0.5881 | 0.8581 | 0.9871 | 0.993  | 0.9952 | 0.9954 | 0.9949
0.5    | time (mins) | 44.77 | 0.74   | 3.68   | 3.02   | 0.08   | 0.61   | 1.51   | 1.89   | 1.98   | 0.69
       | MRE         |       | 0.1932 | 0.1262 | 0.1428 | 0.2542 | 0.0538 | 0.0443 | 0.0205 | 0.0198 | 0.0246
       | PCC         |       | 0.5779 | 0.5737 | 0.5879 | 0.8583 | 0.9871 | 0.993  | 0.9952 | 0.9953 | 0.9949

Table 4: In the first column, the values 0.05, 0.08, 0.1, and 0.5 are the selected $\lambda$ values. For each $\lambda$, the first row shows the wall-clock time for PGW and for LPGW with each reference space S1-S9; the second and third rows display the MRE (mean relative error) and PCC (Pearson correlation coefficient) of LPGW against PGW, respectively. The points row reports the number of points in each reference space.

Relative error analysis. Given $\nu_1, \nu_2, \ldots, \nu_K$ and a reference measure $\mu$, the relative error is defined as
$$\mathrm{MRE} = \frac{1}{K^2} \sum_{i,j} \frac{|\mathrm{PGW}(\nu_i, \nu_j) - \mathrm{LPGW}(\nu_i, \nu_j; \mu)|}{\mathrm{PGW}(\nu_i, \nu_j)}. \quad (109)$$
Remark K.1. For numerical stability, when $\lambda$ is small (i.e., $\lambda = 0.05, 0.08$), we remove the PGW/LPGW distance pairs whenever $\mathrm{PGW} \le 1 \times 10^{-10}$, since $1 \times 10^{-10}$ is the tolerance in the PGW algorithm. In these cases, PGW is on the order of $10^{-10}$ and LPGW on the order of $10^{-11}$, which renders the relative error uninformative.
We decompose this error into the following four aspects. First, the transportation plan induced by LPGW may not be an optimal transportation plan for the PGW problem. Second, in practice we use the barycentric projected measure $\hat{\nu}_i$ to approximate $\nu_i$ for each $i$; these two measures can be distinct, especially when the optimal transportation plan is not induced by a Monge mapping. Third, the solvers for both PGW and LPGW rely on the Frank-Wolfe algorithm, and due to the non-convexity of these problems, the Frank-Wolfe algorithm may yield a local minimizer instead of a global one.
Therefore, the computed PGW transportation plan might not be optimal. In practice, we approximate the original LPGW distance (16) (or (10)) using the approximation formulation (22). This introduces a gap between the real LPGW distance and the approximation. It is important to note that the first issue arises from a theoretical perspective, while the remaining aspects are due to the numerical implementation. The first two errors are also present in other linear OT-based methods (Wang et al., 2013; Cai et al., 2022; Bai et al., 2023; Beier et al., 2022). The third issue stems from the non-convexity of GW/PGW, affecting both LGW and LPGW similarly. The final error is specific to LPGW due to the approximation formulation used. Published as a conference paper at ICLR 2025 In practice, several methods can be employed to reduce these errors. For example, the dataset can be represented as empirical measures with equal mass at each point. This approach is effective for unbalanced linear OT techniques (e.g., (Bai et al., 2023; Cai et al., 2022)), as these methods do not require mass normalization. With high probability, this will result in a Monge mapping, reducing the second error. Additionally, to minimize the last error, as stated in Theorem 3.2, using a higher λ leads to a lower error, and when λ is sufficiently large, this error becomes zero. MDS visualization and analysis. We visualize the multi-dimensional scaling (MDS) embeddings for both PGW and LPGW with respect to each reference space in Figure 3. We observe that when λ is large (i.e. λ = 0.1, 0.5), LPGW with reference space S5, S6, S7, S8, S9 admit similar patterns. When λ = 0.05 or λ = 0.08, we observe LPGW with reference S7, S8, S9 and PGW admit similar patterns. From this figure, we observe that S7, S8, S9 admit better performance than other reference spaces. Summary. Although LPGW methods are subject to potential errors, the high PCC observed between LPGW and PGW (when, e.g., the reference space is S7, S8, S9), suggests that LPGW can serve as a good proxy for PGW, rather than simply as an approximation. Published as a conference paper at ICLR 2025 (a) λ = 0.05 (b) λ = 0.08 (c) λ = 0.1 (d) λ = 0.5 Figure 3: MDS visualization for λ = 0.05, 0.08, 0.10, 0.50. Each subfigure shows the MDS visualization for PGW and LPGW based on different reference spaces. In each figure, the first subfigure in the first row is the MDS visualization of PGW. Published as a conference paper at ICLR 2025 L Details of Shape Retrieval Experiment (a) 2D Dataset (b) 3D Dataset Figure 4: Datasets for shape retrieval experiment. Dataset details. We test two datasets in this experiment: a 2D dataset and a 3D dataset. The visualization of the two datasets is presented in Figure 4. In addition, for each shape X = {x1, . . . xn}, we normalize the scaling such that maxi =j xi xj = 1. For the 3D dataset, we apply the k-means clustering to reduce the size of each shape to 500, from its original size 2048. Numerical details. We represent the shapes in each dataset as mm-spaces Xi = Rd, 2, µi = Pni k=1 αiδxi k We use αi = 1 ni to compute the GW/LGW distances for the balanced mass constraint setting. For the PGW/LPGW distances, we set α = 1 N , where N is the median number of points across all shapes in the dataset. For the SVM experiments, we use exp( σD) as the kernel for the SVM model. Here, we normalize the matrix D and choose the best σ {0.001, 0.01, 0.1, 1, 5, 8, 10, 100} for each method used in order to facilitate a fair comparison of the resulting performances. 
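The kernel-SVM evaluation described above can be sketched as follows; the train/test split, label handling, and the regularization constant are our assumptions, while the exp(-sigma*D) kernel and the sigma grid follow the text.

```python
# Hedged sketch of the kernel-SVM evaluation: turn a pairwise distance matrix D
# into the kernel exp(-sigma * D) and train an SVM with a precomputed kernel.
import numpy as np
from sklearn.svm import SVC

def svm_accuracy(D, labels, train_idx, test_idx, sigma):
    D = D / D.max()                                   # normalize the distance matrix
    K = np.exp(-sigma * D)                            # kernel used in the experiments
    clf = SVC(kernel="precomputed", C=1.0)            # C=1.0 is an assumed default
    clf.fit(K[np.ix_(train_idx, train_idx)], labels[train_idx])
    pred = clf.predict(K[np.ix_(test_idx, train_idx)])
    return np.mean(pred == labels[test_idx])
```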
We note that the resulting kernel matrix is not necessarily positive semidefinite.
Parameter selection for PGW and LPGW. The parameter $\lambda$ for PGW and LPGW is chosen as follows. For the 2D dataset, we set $\lambda$ such that $\lambda \le \lambda_{\max} = \frac{1}{2}\max_i(|C_i|^2) = 0.5$, since all shapes are normalized to a scale of 1. We perform a line search for $\lambda$ over the set $\{0.2, 0.3, 0.5\}$ and select $\lambda = 0.2$ for both PGW and LPGW. For the 3D dataset, we randomly select 2-3 shapes and compute the PGW distance between these shapes and the reference space. We find that $\lambda_{\max} \approx 0.15$. Thus, we conduct a line search for $\lambda$ over the set $\{0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.1, 0.15\}$ and choose $\lambda = 0.06$ for both PGW and LPGW.
Reference space setting for LGW and LPGW. Similar to the classical LOT method (Wang et al., 2013), the ideal reference space should represent the center of the tested measures. In practice, we typically use the GW barycenter (Peyre et al., 2016) or the PGW barycenter (Bai et al., 2024). For each dataset, we randomly select one point cloud from each class. Specifically, for the 2D dataset, we select 8 shapes, and for the 3D dataset, we select 4 shapes. We then compute the GW/PGW barycenter of these selected shapes. The wall-clock time for this computation is about 10 seconds on the 2D dataset and about 5 minutes on the 3D dataset. For fairness of comparison, we use the same reference space for both the LGW and LPGW methods in all experiments.
Nearest neighbor classification. Once the pairwise distance matrices have been computed between the shapes using each method, we use the computed distances for a nearest-neighbor classification experiment. We choose a single representative at random from each class in the dataset and classify each shape according to its nearest representative. This is repeated over 10,000 iterations, and we generate a confusion matrix for each distance used.
Figure 5: Confusion matrices computed from the nearest neighbor classification experiments. (a) 2D dataset; (b) 3D dataset.
Performance analysis. The pairwise distance matrices are visualized for each dataset in Figure 7, and the confusion matrices from the nearest neighbor classification experiment on each dataset are shown in Figure 5. Finally, the classification accuracy for the SVM experiments is reported in Table 2. The results indicate that the LPGW distance consistently obtains high performance across both datasets.
Figure 6: We visualize the accuracy of LPGW for different $\lambda$. For each $\lambda$, we select the best $\sigma$ to compute the accuracy. We observe that when $\lambda \ge 0.08$, LPGW and LGW have the same accuracy.
From Figure 7, we observe that PGW and LPGW qualitatively admit a more reasonable similarity measure than the other considered methods. For example, on the 2D dataset, the classes "bone" and "rectangle" should have a relatively smaller distance than "bone" and "annulus"; ideally, a reasonable distance should satisfy 0 < d(bone, rectangle) < d(bone, annulus). However, we do not observe this relation for GW or LGW. In the 3D dataset, the main challenge is the presence of incomplete shapes, which represent only a part of the corresponding complete shape and can have a large dissimilarity to each other. In this experiment, we observe that PGW/LPGW admit slightly better performance. We suspect that, due to the unbalanced formulation of these two methods, they are more robust to the classification of incomplete shapes.
Published as a conference paper at ICLR 2025 (a) 2D Dataset. (b) 3D Dataset. Figure 7: Visualization of pairwise distance matrices resulting from each of the considered methods: GW, PGW, LGW, and LPGW (ours). In (b), the label inc. airplane denotes the incomplete airplane shape class, and similarly for each of the other 3 such classes. Note that in this dataset all shapes have the same size. Thus, when λ is sufficiently large (i.e. λ = 0.5), LPGW can recover the performance of LGW. We refer to Figure 6 for visualization. M Details of Learning with Transform-Based Embeddings Experiment Reference space. Similar to the shape retrieval experiment, for each class/digit, we select one shape from the training set and compute the GW barycenter based on the selected shapes. Note, in this step, the reference space we obtain is (M0 Rn0 n0, p0), where n0 N is the size of the reference space, M0 denotes the pairwise distance matrix, and p0 is the PMF of the reference measure. Here, n0, p0 are inputs to the GW barycenter algorithm, and in this experiment, we set n0 to be the mean of the sizes of the selected shapes, and p0 is set to be p0 = 1 n0 1n0. However, M0 cannot be applied as reference space for the LOT method. Thus, we apply the MDS method for M0 and obtain the supported points X0 = {x0 1, . . . x0 n0} of the reference space, i.e., X = arg min X i=1 (M0)i,j xi xj 2, where xi = X[i, :], i. We use (M0, p0, Pn i=1 xi) as the reference space for LOT/LGW/LPGW. The wall-clock time of barycenter computation is 26 seconds. Numerical details. Note, since the goal of this experiment is to test the performance of learning from embeddings under corruption by random rotation/flipping and noise, we remove digits {5, 9} from the data since the rotated and flipped digit 2 is highly similar to 5 . Similarly, rotated 6 is nearly identical to 9 . The classification model we selected is the logistic regression model as provided by the scikitlearn package. For the LOT and LGW methods, due to the balanced mass requirement, we Published as a conference paper at ICLR 2025 Figure 8: We visualize the interpolation between two shapes using GW, PGW, LGW, and LPGW. In the first column, the three shapes shown are the source shape, the target shape, and the reference space used in LGW and LPGW. The target shape, digit 4 , is corrupted by the addition of noise points, with a total mass of η = 0.3. normalize the PMF of all shapes in the testing dataset. For LPGW, this normalization is not applied. Parameter setting for LPGW. Note that in this case, the training data is not corrupted. Thus, we can set λ = λmax = 1 2 maxi C2 i where Ci is the cost matrix for shape i. In this experiment, we set λ = 40. In addition, to improve the computational speed of LGW/LPGW, the augmented training data obtained by rotation and flipping is NOT contained in these two methods since the rotated shape is equivalent to the original shape in the setting of GW/PGW. Performance analysis. The LOT method achieves 51.3% accuracy when η = 0. This indicates that even with the addition of rotated/flipped data to the training set, the LOT embedding method struggles to classify the rotated digits with high accuracy. In contrast, LGW/LPGW achieve 82.5% accuracy, indicating that these two embedding techniques are more robust to corruption by rotation/flipping. When η 0.1, the accuracy of both LOT and LGW drops to 10-20%. However, we observe that the LPGW embedding maintains a strong accuracy between 70% and 85%. 
This demonstrates that the LPGW embedding is more robust to corrupted test data. M.1 Toy Example: Point Cloud Interpolation In this subsection, we select one shape from the training data and a noise-corrupted shape from the testing data (with rotation and flipping removed). We demonstrate the interpolation between these two shapes using the GW barycenter (Peyr e et al., 2016), PGW barycenter (Bai et al., 2024), LGW geodesic (Beier et al., 2022), and our LPGW interpolation. The goal is to intuitively visualize the LGW/LPGW embeddings for the reader s understanding. Our method and baseline methods. Let X be the reference mm-space applied in the classification experiment. Note, numerically, it can be described by (C0, X0, p0), where p0 Rn0 + is the PMF; n0 is the size of p0; X0 Rn0 2 is the set of 2D supported points; and C0 = [ X0 i X0 j 2 i,j [1:n0]] is the corresponding cost matrix. That is, we can use (C0, p0) to represent X3. Similarly, we use (C1, p1) and (C2, p2) to represent the source and target mm-spaces (shapes). 3X0 is not required as C0 contains all the information of X0 in the GW/PGW problem. Published as a conference paper at ICLR 2025 Now we introduce the GW barycenter method. For each time t {0, 1/6, . . . , 6/6 = 1}, we solve the barycenter problem C = arg min C Rn0 n0 (1 t) GW((C, p0), (C1, p1)) + t GW((C, p0), (C2, p2)) . When t = 0, (C , p0) and (C1, p1) represent similar shapes. In fact, if p0 = p1, the two shapes are identical. Similarly, if t = 1, (C , p1) is similar to the target (C2, p2). For t (0, 1), (C , p0) represents an interpolation shape between the source shape (C1, p1) and the target shape (C2, p2). The PGW barycenter method is defined similarly. In the LGW geodesic method, let Pn i=1 δˆy1 i p0 i denote the barycentric projection obtained by γ, where γ is the optimal transportation plan for the GW problem between (C0, p0) and (C1, p1). The numerical implementation of the LGW embedding is given by E1 = C0 [ ˆy1 i ˆy1 j 2]i,j [1:n0] Rn0 n0 (we refer to (5) for the original formulation). Similarly, we can define the LGW embedding for the target shape (C2, p2), denoted as E2 Rn0 n0. Then for each t, the interpolation/geodesic of LGW is defined by C0 + (1 t)E1 + t E2. (110) For the LPGW interpolation method, the numerical LPGW embeddings E1, E2 Rn0 n0 are defined similarly, and we refer to (24) for details. We use (110) for the interpolation. To visualize these interpolations, for each Ct, we adapt the MDS method, and the solution Xt Rn0 2 is a point cloud. We visualize these point clouds in Figure 8. Performance comparison. In this experiment, the size of these shapes is in the range of 200 250. GW/PGW requires 230 270 seconds, while LGW/LPGW requires 1 2 seconds. In Figure 8, we observe that the shapes generated by GW/LGW contain more noise points. Additionally, at times such as t = 3/6 and t = 4/6, the generated shapes are difficult to distinguish due to the noise points. In contrast, the shapes generated by the PGW/LPGW methods contain significantly fewer noise points. Note that at t = 0 and t = 1, the shapes generated by the LGW/LPGW methods can be treated as visualizations of the corresponding embeddings for the source shape (C1, p1) and target shape (C2, p2), respectively. M.2 Other baselines OT/GW/PGW Note that OT, GW, and PGW methods cannot be directly applied to this experiment since we use a linear model (logistic regression) as the classifier. 
However, we can adapt the kernel-SVM methods described in the shape retrieval experiment for this classification task as well. The primary distinction between these three methods and LOT, LGW, and LPGW lies in their computational complexity. Suppose $N_{\mathrm{train}}$ and $N_{\mathrm{test}}$ represent the sizes of the training and testing datasets, respectively. Let $n$ and $d$ denote the average size and dimension of all the point clouds or digits. Additionally, let $N$ represent the average number of iterations required by the FW algorithm for GW and PGW. The computational cost of these methods is summarized in Table 5. In the OT and LOT methods, the term $(n^3 + n^2 d)$ is the complexity of computing one OT distance, while $nd$ is the complexity of computing one LOT distance given two embeddings.

Method | Training process                          | Testing process
OT     | $O(N_{\mathrm{train}}^2 (n^3 + n^2 d))$   | $O(N_{\mathrm{train}} N_{\mathrm{test}} (n^3 + n^2 d))$
LOT    | $O(N_{\mathrm{train}} (n^3 + n^2 d))$     | $O(N_{\mathrm{test}} (n^3 + n^2 d) + N_{\mathrm{test}} N_{\mathrm{train}} \, nd)$
GW     | $O(N_{\mathrm{train}}^2 (n^3 N + n^2 d))$ | $O(N_{\mathrm{test}} N_{\mathrm{train}} (n^3 N + n^2 d))$
LGW    | $O(N_{\mathrm{train}} (n^3 N + n^2 d))$   | $O(N_{\mathrm{test}} (n^3 N + n^2 d) + N_{\mathrm{test}} N_{\mathrm{train}} \, n^2)$
PGW    | $O(N_{\mathrm{train}}^2 (n^3 N + n^2 d))$ | $O(N_{\mathrm{test}} N_{\mathrm{train}} (n^3 N + n^2 d))$
LPGW   | $O(N_{\mathrm{train}} (n^3 N + n^2 d))$   | $O(N_{\mathrm{test}} (n^3 N + n^2 d) + N_{\mathrm{test}} N_{\mathrm{train}} \, n^2)$

Table 5: Computational complexity of the different methods.

Similarly, in the GW and LGW methods, the term $(n^3 N + n^2 d)$ is the computational complexity of computing one GW distance, and the term $n^2$ is the complexity of computing one LGW distance given two embeddings. The complexities for PGW and LPGW follow the same pattern. In this experiment, $N_{\mathrm{train}} = 8000$ and $N_{\mathrm{test}} = 1000$. OT-based methods require approximately 50 hours (2-3 days) to train the model, while GW and PGW require about 400 hours (over 15 days). In contrast, LOT, LGW, and LPGW take only 3-5 minutes.
N Complementary Results for the MNIST Classification Experiment
Data reconstruction. We visualize the reconstructed digits using LOT, LGW, and LPGW embeddings under two settings: $\eta = 0$ and $\eta = 0.2$ (see Figure 1b). When the data is not corrupted by noise points ($\eta = 0$), the reconstructed digits from all methods closely resemble the original data. However, when 20% noise points are added ($\eta = 0.2$), LPGW's reconstructed digits effectively exclude most of the noise points. In contrast, the embeddings produced by LOT and LGW retain information from the noisy data. This demonstrates that the LPGW embedding leverages the partial matching property of PGW, ensuring that most of the noise points are excluded during the embedding process, which explains the robustness of LPGW embeddings to noise corruption.
t-SNE. From Figure 9, we observe that when $\eta \ge 0.1$, the LPGW embeddings show greater separability than those of LOT or LGW.
O Compute Resources
All experiments presented in this paper were conducted on a machine with an AMD EPYC 7713 64-core processor, 8 x 32GB DDR4 DIMMs at 3200 MHz, and an NVIDIA RTX A6000 GPU.
Figure 9: t-SNE visualization of embeddings. In the first column, the label "orig" indicates the experimental results for the original testing dataset without random rotations/flips or noise. In the remaining columns, random rotations/flips are applied to the testing data, and $\eta$ denotes the total mass of noise points.