# slicing_unbalanced_optimal_transport__edc944e0.pdf

Published in Transactions on Machine Learning Research (12/2024)

Slicing Unbalanced Optimal Transport

Clément Bonet clement.bonet@ensae.fr CREST, ENSAE, IP Paris, Palaiseau, France

Kimia Nadjahi kimia.nadjahi@ens.fr CNRS, ENS, Paris, France

Thibault Séjourné thibault.sejourne@epfl.ch LTS4, EPFL, Lausanne, Switzerland

Kilian Fatras kilian.fatras@mila.quebec Mila, Mc Gill University, Montreal, Canada

Nicolas Courty nicolas.courty@irisa.fr IRISA, Université Bretagne-Sud, Vannes, France

Reviewed on Open Review: https: // openreview. net/ forum? id= Aj JTg5M0r8

Optimal transport (OT) is a powerful framework to compare probability measures, a fundamental task in many statistical and machine learning problems. Substantial advances have been made in designing OT variants which are either computationally and statistically more efficient or robust. Among them, sliced OT distances have been extensively used to mitigate optimal transport s cubic algorithmic complexity and curse of dimensionality. In parallel, unbalanced OT was designed to allow comparisons of more general positive measures, while being more robust to outliers. In this paper, we bridge the gap between those two concepts and develop a general framework for efficiently comparing positive measures. We notably formulate two different versions of sliced unbalanced OT, and study the associated topology and statistical properties. We then develop a GPU-friendly Frank-Wolfe like algorithm to compute the corresponding loss functions, and show that the resulting methodology is modular as it encompasses and extends prior related work. We finally conduct an empirical analysis of our loss functions and methodology on both synthetic and real datasets, to illustrate their computational efficiency, relevance and applicability to real-world scenarios including geophysical data.

1 Introduction

Many machine learning tasks involve aligning objects such as images, graphs, datasets or their representations after transformations. This is particularly relevant in transfer learning tasks like domain adaptation (Fatras et al., 2021) or multimodal machine learning (Baltrušaitis et al., 2018). These objects can be conveniently represented as positive measures, i.e., a set of samples associated with non-negative weights. Aligning then consists in minimizing a distance or discrepancy between two measures. It is crucial to choose a meaningful discrepancy that has desirable statistical, robustness and computational properties. In particular, some settings require comparing arbitrary positive measures, i.e., measures whose total mass can have an arbitrary value, as opposed to probability distributions whose total mass is equal to 1. In cell biology, for instance, measures are used to represent and compare gene expressions of cell populations, and the total mass corresponds to the population size (Schiebinger et al., 2019).

Equal contribution

Published in Transactions on Machine Learning Research (12/2024)

(Unbalanced) Optimal Transport. Optimal transport (OT) has been frequently chosen as a loss function to align objects. OT defines a distance between two positive measures α and β of the same mass (m(α) = m(β)) by moving the mass of α toward the mass of β with the least possible effort. However, in some applications, the mass equality constraint is not satisfied, i.e., m(α) = m(β). It can still be enforced by a re-normalization of the mass, which is potentially spurious and makes the problem less interpretable. This setting has motivated the development of a new OT framework, called unbalanced OT (UOT), that can naturally compare measures of different masses by softly relaxing the mass conservation constraints (Kondratyev et al., 2016; Liero et al., 2018; Chizat et al., 2018b). An appealing outcome of this new OT variant is its robustness to outliers which is achieved by discarding them before transporting α to β. UOT has been useful for many theoretical and practical applications, e.g., theory of deep learning (Chizat & Bach, 2018; Rotskoff et al., 2019), biology (Schiebinger et al., 2019; Demetci et al., 2022) and domain adaptation (Fatras et al., 2021). We refer to (Séjourné et al., 2022a) for an extensive survey of UOT. Computing UOT requires to solve a linear program whose complexity is cubical in the number n of samples (O(n3 log n)) (Pele & Werman, 2009; Peyré et al., 2019). Besides, accurately estimating UOT distances through empirical distributions is challenging as they suffer from the curse of dimension (Dudley, 1969). A common workaround is to rely on variants with lower complexities and better statistical properties. Among the most popular, we can list entropic OT (Cuturi, 2013; Pham et al., 2020) or minibatch OT (Fatras et al., 2020; 2021). In this paper, we focus on developing sliced UOT approaches.

Sliced Optimal Transport. Sliced OT (SOT) defines an alternative metric by leveraging the closed-form solution of OT between univariate measures (Rabin et al., 2012; Bonneel et al., 2015). It averages the OT cost between projections of (α, β) on 1D subspaces of Rd. For 1D data, the OT solution can be computed through a sort algorithm, leading to an appealing O(n log(n)) complexity (Peyré et al., 2019). Furthermore, it has been shown to lift useful topological and statistical properties of OT from 1-dimensional to multi-dimensional settings (Nadjahi et al., 2020b; Bayraktar & Guo, 2021; Goldfeld & Greenewald, 2021). It therefore helps to mitigate the curse of dimensionality making SOT-based algorithms theoretically grounded, statistically efficient, and practical to solve even on large-scale settings. These appealing properties motivated the development of several variants and generalizations, e.g., by considering different types or distributions of projections (Kolouri et al., 2019; Deshpande et al., 2019; Nguyen et al., 2020; Ohana et al., 2023; Nguyen et al., 2023) or manifold data (Bonet et al., 2023a;b;c). Fast computations of partial OT (a particular case of UOT) between univariate measures (Bonneel & Coeurjolly, 2019; Bai et al., 2023) or more generally on trees (Sato et al., 2020; Le & Nguyen, 2021), have been developed, so that slicing partial OT benefits from these efficient implementations and allows to compare large unnormalized measures. However, while (sliced) partial OT allows to compare measures with different masses, it assumes that each input measure is discrete and supported on points that all share the same mass (typically 1). In contrast, the Gaussian-Hellinger Kantorovich (GHK) distance (Liero et al., 2018), another popular formulation of UOT, allows to compare measures with different masses and supported on points with varying masses, and has not been studied jointly with slicing.

Contributions. In this paper, we present the first general framework combining UOT and slicing between arbitrary distributions. Our main contribution is the introduction of two novel sliced variants of UOT, called Sliced UOT (SUOT) and Unbalanced Sliced OT (USOT). SUOT and USOT are both defined as regularized OT problems which leverage one-dimensional projections, but differ on how they relax the mass preservation constraint: USOT essentially performs a global reweighting of the inputs measures (α, β), while SUOT reweights each projection of (α, β). We provide a theoretical analysis of SUOT and USOT, which reveals that they share topological properties with UOT while being statistically more efficient in high-dimensional regimes, thanks to the slicing operation. Additionally, we propose fast and GPU-friendly algorithms to compute SUOT and USOT, based on the (non-trivial) dual derivation of our SUOT and USOT losses and a Frank-Wolfe strategy (Séjourné et al., 2022b). Finally, we illustrate the efficiency of our framework on various experiments: we deploy SUOT and USOT to compare distributions on non-Euclidean hyperbolic manifolds, classify documents, transfer image colors and aggregate large-scale geophysical data, and discuss their advantages over existing approaches.

Outline. In Section 2, we provide background knowledge on unbalanced OT and sliced OT. In Section 3, we introduce our new loss functions (SUOT and USOT) and prove their metric, topological, statistical and

Published in Transactions on Machine Learning Research (12/2024)

duality properties in wide generality. We then explain in Section 4 how to compute SUOT and USOT via a Frank-Wolfe-based approach. We finally analyze the performance of SUOT and USOT on different practical tasks in Section 5.

2 Background

In this section, we first state our notations. Then, we provide the necessary background on unbalanced optimal transport and sliced optimal transport.

2.1 Notations

In what follows, M+(Rd) denotes the set of all positive Radon measures of finite mass on Rd. For any α M+(Rd), supp(α) is the support of α, and m(α) = R

Rd dα(x) < + is the mass of α. For α M+(Rd) and a map T : Rd Rp, T#α is the pushforward measure of α by T, defined for all A Rd as T#µ(A) = µ T 1(A) . Let δz be the Dirac measure at z and for n 1, define the empirical measure ˆαn = Pn i=1 wiδZi, where (Zi)n i=1 are n independent and identically distributed (i.i.d.) samples from α M+(Rd), and wi > 0. For any convex function φ : R R {+ }, we denote by φ its Legendre transform, i.e., for x R, φ (x) = supy 0 xy φ(y). We will also use the notation φ (x) = φ ( x). For α, β M+(Rd), α β is the product measure, and for f, g : Rd R, denote by f g : Rd Rd R the mapping defined as (f g)(x, y) = f(x)+g(y) for all x, y Rd. Sd 1 {θ Rd : θ = 1} is the unit sphere, and for θ Sd 1, θ : Rd R is the mapping defined as θ (x) = θ, x for all x Rd.

2.2 Unbalanced Optimal Transport

We recall the static formulation of unbalanced OT proposed by Liero et al. (2018), which uses φ-divergences as penalty terms.

Definition 2.1 (φ-divergences). Let (α, β) M+(Rd) M+(Rd). Let φ : R R {+ } be an entropy function, i.e., φ is convex, lower semicontinuous, dom(φ) {x R, φ(x) < + } [0, + ) and φ(1) = 0. Denote φ limx + φ(x)

x . The φ-divergence between α and β is

dβ (x) dβ(x) + φ

Rd dα (x) , (1)

where α is defined as α = dα

Definition 2.2 (Unbalanced OT (Liero et al., 2018)). Let (φ1, φ2) be a pair of entropy functions and Cd : Rd Rd R a cost function. The unbalanced OT problem between (α, β) M+(Rd) M+(Rd) reads

UOT(α, β) inf π M+(Rd Rd)

Z Cd(x, y)dπ(x, y) + Dφ1(π1|α) + Dφ2(π2|β) , (2)

where π1 and π2 denote the marginal distributions of π with respect to (w.r.t.) the first and second variable respectively.

When φ1 = φ2 and φ1(x) = 0 for x = 1, φ1(x) = + otherwise, (2) boils down to the Kantorovich formulation of OT (or balanced OT), denoted by OT(α, β). Indeed, in that case, Dφ1(π1|α) = Dφ2(π2|β) = 0 if π1 = α and π2 = β, Dφ1(π1|α) = Dφ2(π2|β) = + otherwise.

Under other suitable choices of entropy functions φ1 and φ2, UOT(α, β) is more robust than OT(α, β), since it can discard outliers and compare α and β with different masses. We refer to (Séjourné et al., 2022a, Section 4.2) for a detailed discussion on the choice of entropies and its consequences on the transport plan computed by UOT. Two common choices are φi(x) = ρ |x 1| and φi(x) = ρ(x log(x) x + 1), where ρ > 0 is a characteristic radius w.r.t. Cd. They respectively correspond to Dφi = ρTV (total variation distance (Chizat et al., 2018a)) and Dφi = ρKL (Kullback-Leibler divergence), and operate differently: KL smooths out geometric outliers, while TV either keeps or removes samples (Séjourné et al., 2022a). The GHK distance corresponds to (2) with Cd(x, y) = ||x y||2 and Dφi = ρi KL (Liero et al., 2018).

Published in Transactions on Machine Learning Research (12/2024)

One can obtain an equivalent formulation of UOT by deriving the dual of (2) and proving strong duality. We recall this result below.

Proposition 2.3 (Corollary 4.12 in (Liero et al., 2018)). The UOT problem (2) can equivalently be written as UOT(α, β) = supf g Cd D(f, g; α, β), with

D(f, g; α, β) Z φ 1 f(x) dα(x) + Z φ 2 g(y) dβ(y), (3)

where for i {1, 2}, φ i (x) φ i ( x) with φ i (x) supy 0 xy φi(y) the Legendre transform of φi, and f g Cd means that for (x, y) α β, f(x) + g(y) Cd(x, y).

When clear from the context, we will omit the dependence on (α, β) and write D(f, g) instead of D(f, g; α, β). The Legendre transform of φi is well known for typical choices of φi-divergences. For example, if Dφi = ρi KL, then φ i (x) = ρi(ex/ρi 1).

Based on Proposition 2.3, one can compute UOT(α, β) by optimizing a pair of continuous functions (f, g). However, UOT(α, β) is known to be computationally intensive (Pham et al., 2020), which motivates the development of methods able to scale to the large dimensions and sample sizes encountered in ML applications.

2.3 Sliced Optimal Transport

Among the many workarounds that have been proposed to overcome the OT computational bottleneck (Peyré et al., 2019), Sliced OT (Rabin et al., 2012) has attracted a lot of attention due to its computational benefits and theoretical guarantees.

Definition 2.4 (Sliced OT). Let Sd 1 {θ Rd : θ = 1} be the unit sphere in Rd. For θ Sd 1, denote by θ : Rd R the linear map such that for x Rd, θ (x) θ, x . Let σ be the uniform probability over Sd 1. Consider (α, β) M+(Rd) M+(Rd). The Sliced OT problem is defined as

SOT(α, β) Z

Sd 1 OT(θ α, θ β)dσ(θ) , (4)

where for any measurable function f and ξ M+(Rd), f ξ is the push-forward measure of ξ by f, i.e., for any measurable set A R, f ξ(A) ξ(f 1 A) , f 1(A) {x Rd : f(x) A}.

Since (θ α, θ β) are two measures supported on R, OT(θ µ, θ ν) is defined in terms of a cost function C1 : R R R, and can be efficiently computed. Therefore, SOT(α, β) can provide significant computational advantages over OT(α, β) in large-scale settings. In practice, if α = Pn i=1 αiδxi and β = Pn i=1 βiδyi are discrete measures, the standard procedure for approximating SOT(α, β) consists in sampling m i.i.d. samples {θj}m j=1 from σ, then computing OT (θ j ) α, (θ j ) β for j = 1, . . . , m. This second step involves sorting the n support points of α and β (Peyré et al., 2019, Section 2.6), thus involves O(n log n) operations per θj.

SOT(α, β) relies on the Kantorovich formulation of OT, thus SOT(α, β) < + only when m(α) = m(β), and may not provide meaningful comparisons in presence of outliers. To overcome such limitations, prior works have proposed slicing a particular instance of UOT that is partial OT (Bonneel & Coeurjolly, 2019; Bai et al., 2023), for which Dφ is the total variation distance. More precisely, noting POT the UOT problem with Dφ = ρTV, they consider for α, β M+(Rd) the problem

SPOT(α, β) Z

Sd 1 POT(θ α, θ β)dσ(θ). (5)

For the 1D partial OT problem, Bonneel & Coeurjolly (2019) solve a one dimensional injective partial assignment in quasilinear complexity, but which does not allow for mass destruction in the source measure, while Bai et al. (2023) proposed an efficient procedure with a quadratic worst case complexity. However, their algorithms only apply to measures whose samples have constant mass (e.g., αi = βj = 1). In the next section, we generalize their line of work and propose a new way of combining sliced OT and unbalanced OT.

Published in Transactions on Machine Learning Research (12/2024)

3 Sliced Unbalanced OT and Unbalanced Sliced OT

We present two new scalable and robust OT problems, by combining the unbalanced and slicing strategies in two different ways. We conduct a theoretical analysis of both strategies and provide a comparison of the two. For ease of exposition, all proofs of the results in this section are provided in Appendix A.

First, we propose to slice the unbalanced OT problem: we average the UOT problem over different projections of the compared measures, similar to the approach of sliced partial OT (Bonneel & Coeurjolly, 2019; Bai et al., 2023). We refer to this problem as Sliced Unbalanced OT (SUOT) and introduce it in Section 3.1. Next, we explore the reverse strategy, i.e., we unbalance the sliced OT problem: the weights of SUOT are penalized to introduce imbalance, analogous to how UOT relates to OT. We call this method Unbalanced Sliced OT (USOT) and present it in Section 3.2.

3.1 Sliced Unbalanced Optimal Transport

Our first strategy consists in slicing the unbalanced OT problem and leads to the following definition.

Definition 3.1 (Sliced Unbalanced OT). Let (φ1, φ2) be a pair of entropy functions and C1 : R R R a cost function. The sliced unbalanced OT problem between (α, β) M+(Rd) M+(Rd) reads

SUOT(α, β) Z

Sd 1 UOT(θ α, θ β)dσ(θ) , (6)

where for θ σ, UOT(θ α, θ β) = infπθ M+(R R) R

R R C1(x, y)dπθ(x, y)+Dφ1 (πθ)1|θ α +Dφ2 (πθ)2|θ β

with (πθ)1, (πθ)2 the marginal distributions of πθ.

By definition, SUOT is a specific instance of the class of sliced probability divergences (Nadjahi et al., 2020a), where the base divergence is chosen as UOT. SUOT can also be interpreted as a general expression of the sliced partial OT problem (Bonneel & Coeurjolly, 2019; Bai et al., 2023): while the latter imposes Dφi = ρi TV, SUOT allows for the use of arbitrary φ-divergences.

In the following, we establish a set of theoretical properties for SUOT with different choices of φ-divergences and cost functions C1. First, we identify sufficient conditions for which the solution of (6) exists.

Proposition 3.2 (SUOT: Existence of solutions). Assume that C1 is lower-semicontinuous and that either (i) φ 1, = φ 2, = + , or (ii) C1 has compact sublevels on R R and φ 1, + φ 2, + inf C1 > 0. Then, the solution of SUOT(α, β) exists, in the sense that for any θ σ, there exists π θ M+(R R) attaining the infimum in UOT(θ α, θ β).

The assumptions of Proposition 3.2 are met for some settings of interest, including Dφ1 = Dφ2 = KL (since φ = + ), or Dφ1 = Dφ2 = TV and C1(x, y) = |x y|p (p 1) (since φ = 1): see (Séjourné et al., 2022a, Section 2.1) for more details.

Next, we show some topological properties of SUOT. In the next proposition, we prove that SUOT preserves the metric properties of UOT, which is consistent with (Nadjahi et al., 2020a, Proposition 1). In Section 3.3, we study the metrization of the weak -topology with SUOT.

Proposition 3.3 (SUOT: Metric properties). Suppose UOT is non-negative, symmetric and/or definite on M+(R) M+(R). Then, SUOT is respectively non-negative, symmetric and/or definite on M+(Rd) M+(Rd). If there exists p [1, + ) s.t. for any α, β, γ M+(R), UOT1/p(α, β) UOT1/p(α, γ) + UOT1/p(γ, β), then SUOT1/p(α, β) SUOT1/p(α, γ) + SUOT1/p(γ, β).

By Proposition 3.3, establishing the metric axioms of UOT between univariate measures (as detailed in (Séjourné et al., 2022a, Section 3.3.1)) is sufficient to prove the metric properties of SUOT between multivariate measures. For example, since GHK is a metric for the order p = 2 (Liero et al., 2018), so is the induced SUOT.

We move on to the statistical aspects and study the sample complexity of SUOT. For (α, β) M+(Rd) M+(Rd), we establish the speed of convergence of SUOT(ˆαn, ˆβn) toward SUOT(α, β), where ˆαn, ˆβn denote

Published in Transactions on Machine Learning Research (12/2024)

the empirical measures supported on n independent samples from α, β respectively (as defined in Section 2.1). We prove below that SUOT extends the sample complexity of UOT from one-dimensional settings to multidimensional ones.

Theorem 3.4 (SUOT: Sample complexity). (i) Assume for (µ, ν) M M with M M+(R), E|UOT(µ, ν) UOT(ˆµn, ˆνn)| κ(n). Then, for (α, β) e M e M with e M {η M+(Rd) : θ Sd 1, θ η M}, E|SUOT(α, β) SUOT(ˆαn, ˆβn)| κ(n).

(ii) Assume for µ M with M M+(R), E|UOT(µ, ˆµn)| ξ(n). Then, for α e M with e M {η M+(Rd) : θ Sd 1, θ η M}, E|SUOT(α, ˆαn)| ξ(n).

Note that the expectations in Theorem 3.4 are taken with respect to the samples of the empirical measures, which are random. Theorem 3.4 shows that SUOT enjoys a dimension-free sample complexity, even when comparing multivariate measures. This advantage is recurrent of sliced divergences (Nadjahi et al., 2020b) and further motivates their use on high-dimensional settings. The sample complexity rates κ(n) or ξ(n) can be deduced from the literature on UOT for univariate measures. For instance, in the GHK setting, the rate is given by κ(n) n 1/2 for measures with compact, convex support and continuously differentiable densities (Vacher & Vialard, 2023, Corollary 3.4), and a suitable class M can be defined.

Finally, we derive the dual formulation of SUOT and prove that strong duality holds. This result has important practical implications, as we will leverage it in Section 4 to develop a methodology for computing SUOT. Note that the computation of SUOT involves integration with respect to σ, which generally cannot be done in closed from, as is the case for most sliced divergences. Since our goal is to develop a practical and implementable method, we will consider the Monte Carlo approximation commonly used by practitioners to compute sliced divergences (Nadjahi et al., 2020a): we approximate SUOT(α, β) as R

Sd 1 UOT(θ α, θ β)dˆσK(θ), where ˆσK is a discrete distribution supported on K i.i.d. samples drawn from σ.

Theorem 3.5 (SUOT: Strong duality). For i {1, 2}, let φi be an entropy function such that dom(φ i ) R is non-empty, and either 0 dom(φi) or m(α), m(β) dom(φi). Let E {(fθ, gθ)θ supp(ˆσK) : θ supp(ˆσK), fθ gθ C1}. Then,

SUOT(α, β) = sup (fθ,gθ) E

Sd 1 D(fθ, gθ; θ #α, θ #β)dˆσK(θ) . (7)

3.2 Unbalanced Sliced Optimal Transport

As a second strategy to make unbalanced OT scalable, we propose to unbalance sliced OT. To this end, we start with the following formulation of UOT (Liero et al., 2018),

UOT(α, β) = inf (π1,π2) M+(Rd) M+(Rd) OT(π1, π2) + Dφ1(π1|α) + Dφ2(π2|β) , (8)

and we replace UOT by its sliced counterpart, SOT. This yields the following definition:

Definition 3.6 (Unbalanced Sliced OT). Let (φ1, φ2) be a pair of entropy functions. The unbalanced sliced OT problem between (α, β) M+(Rd) M+(Rd) reads

USOT(α, β) inf (π1,π2) M+(Rd) M+(Rd) SOT(π1, π2) + Dφ1(π1|α) + Dφ2(π2|β) . (9)

This approach is entirely novel since, to the best of our knowledge, it has never been studied in prior work, even for specific choices of entropy. To gain a better grasp of this new object, USOT, we examine how the theoretical properties discussed in the previous section apply here.

We first prove that the solution of (9) exists under the same conditions as those for SUOT outlined in Proposition 3.2.

Proposition 3.7 (USOT: Existence of solutions). Assume that C1 is lower-semicontinuous and that either (i) φ 1, = φ 2, = + , or (ii) C1 has compact sublevels on R R and φ 1, + φ 2, + inf C1 > 0. Then, the solution of USOT(α, β) exists: there exists (π 1, π 2) M+(Rd) M+(Rd) attaining the infimum in (9).

Published in Transactions on Machine Learning Research (12/2024)

Figure 1: Toy illustration on the behaviors of SUOT and USOT. (left) Original 2D samples and slices used for illustration. KDE density estimations of the projected samples: grey, original distributions, colored, distributions reweighed by SUOT (center), and reweighed by USOT (right).

We then review the conditions under which USOT is a (pseudo-)metric, and we prove that strong duality holds. Similar to SUOT, the dual formulation that we derive will enable the design of an algorithm for effectively computing USOT in practice.

Proposition 3.8 (USOT: Metric properties). For any (α, β) M+(Rd) M+(Rd), USOT(α, β) 0. If φ1 = φ2, USOT is symmetric. If Dφ1 and Dφ2 are definite, then USOT is definite. If C1(x, y) = |x y| and Dφ1 = Dφ2 = ρTV, then, USOT satisfies the triangle inequality.

Theorem 3.9 (USOT: Strong duality). For i {1, 2}, let φi be an entropy function such that dom(φ i ) R is non-empty, and either 0 dom(φi) or m(α), m(β) dom(φi). Let E {(fθ, gθ)θ supp(ˆσK) : θ supp(ˆσK), fθ gθ C1}. Then,

USOT(α, β) = sup (fθ,gθ) E D Z

Sd 1 fθ θ dˆσK(θ), Z

Sd 1 gθ θ dˆσK(θ); α, β . (10)

Since USOT does not belong to the class of sliced divergences, establishing its sample complexity is more challenging compared to SUOT. Based on the literature, one standard technique involves deriving covering number bounds on the space of the dual potentials of USOT. This theoretical question is highly non-trivial given the complex structure of E, and as such is out of the scope of this paper. Nevertheless, we investigate the sample complexity on empirical settings: our experimental results presented in Appendix C.5 suggest that USOT might also enjoy a dimension-free rate.

3.3 Comparative Analysis of Sliced Unbalanced and Unbalanced Sliced Optimal Transport

In addition to the theoretical analysis previously conducted for SUOT and USOT independently, this section provides further insights to better grasp the differences between these two strategies.

First, by comparing Definition 3.1 with Definition 3.6, SUOT and USOT clearly differ at the conceptual level. Specifically, SUOT(α, β) penalizes the marginals of πθ for θ σ, where πθ is the coupling that transports mass from θ α to θ β. In contrast, USOT(α, β) directly regularizes the marginals of the coupling between α and β. To illustrate this difference, we consider (α, β) M+(R2) M+(R2) with α contaminated by outliers, then compute SUOT(α, β) and USOT(α, β). We plot (α, β) and the sampled projections (θk)k (Figure 1, left), the marginals of (πθk)k obtained with SUOT(α, β) (Figure 1, center), and the marginals of ((θ k) π)k with USOT(α, β) (Figure 1, right). We observe that the source outliers in α have been successfully removed by USOT(α, β) for all θk, while they may still appear with SUOT(α, β) (e.g., Figure 1, center: note the bimodal marginal in blue for θ = 120 ). This difference is due to the marignal penalization terms in USOT(α, β), which operate directly w.r.t. (α, β) rather than their projections (θ α, θ β), unlike SUOT(α, β).

A question of particular interest regarding probability divergences is how they relate to each other, specifically whether they yield equivalent topologies. We explore this question for SUOT and USOT. To do so, we

Published in Transactions on Machine Learning Research (12/2024)

consider a notion of equivalence that is frequently studied in the literature, as in (Bayraktar & Guo, 2021, Theorem 2.3.(i)).

We start by proving a first set of inequalities that relate SUOT, USOT and UOT: our next theorem shows that USOT is always greater than SUOT and that, for appropriate choices of cost functions, USOT is upper-bounded by UOT. Theorem 3.10. For any (α, β) M+(Rd) M+(Rd),

SUOT(α, β) USOT(α, β) . (11)

Moreover, suppose that, (x, y) Rd Rd, θ Sd 1, C1 θ (x), θ (y) Cd(x, y). Then, for any (α, β) M+(Rd) M+(Rd), SUOT(α, β) USOT(α, β) UOT(α, β) . (12)

In particular, Theorem 3.10 holds for the following common choice of costs: (s, t) R2, C1(s, t) = |s t|p

and (x, y) Rd Rd, Cd(x, y) = x y p, with p [1, + ).

Next, we prove that UOT(α, β) can be upper-bounded by a functional of SUOT(α, β) when (α, β) have compact supports, by adapting the reasoning from Bonnotte (2013, Lemma 5.1.4) to our setting and considering the duals of UOT (Proposition 2.3) and SUOT (Theorem 3.5) instead of the dual of OT and SOT. Most arguments in (Bonnotte, 2013) adapt well to our setting, but establishing a Lipschitz condition on the integrand of the dual required a more technical approach. To this end, we prove Lemma A.13, which results in a different constant value, denoted as c(m(α), m(β), ρ, R). Theorem 3.11. Let X Rd be a compact set with radius R. Define the cost functions as C1(s, t) = |s t|p , (s, t) R2, and Cd(x, y) = x y p, (x, y) Rd Rd, with p [1, + ). Assume either, (i) Dφ1 = Dφ2 = ρKL; or (ii) p = 1 and Dφ1 = Dφ2 = ρTV. Then, for any (α, β) M+(X) M+(X),

UOT(α, β) c(m(α), m(β), ρ, R) SUOT(α, β) 1 d+1 ,

where c(m(α), m(β), ρ, R) is a constant depending on m(α), m(β), ρ, R, which is non-decreasing in m(α) and m(β).

We show the equivalence of SUOT, USOT and UOT by combining Theorem 3.11 and Theorem 3.10, assuming that the constant c(m(α), m(β), ρ, R) does not depend on m(α), m(β). This occurs, for example, when the masses of α and β are uniformly bounded; that is, there exists M R+ such that m(α) M and m(β) M.

The equivalence of SUOT, USOT and UOT is a key result for proving that SUOT and USOT metrize weak

convergence, provided that UOT does (as in the GHK setting (Liero et al., 2018, Theorem 7.25)). Recall that a sequence of positive measures (αn)n N converges weakly to α M+(Rd) (denoted by αn α) if, for any continuous and bounded f : Rd R, limn + R fdαn = R fdα. Theorem 3.12 (Metrizability of the weak topology by SUOT, USOT). Assume the conditions in Theorem 3.11 are met. Let (αn)n N be a sequence of measures in M+(X) and α M+(X), where X Rd is a compact set with radius R. Then, SUOT and USOT metrize the weak convergence, i.e., αn α limn + SUOT(αn, α) = 0, and αn α limn + USOT(αn, α) = 0.

The metrizability of weak convergence was not studied in related work, including in existing instances of our framework, such as partial OT (Bonneel & Coeurjolly, 2019; Bai et al., 2023). In addition to complementing prior work, our result paves the way for other research directions. For instance, it can be used to justify the well-posedness of approximating an unbalanced Wasserstein gradient flow (Ambrosio et al., 2005) using SUOT, as done for SOT in (Candau-Tilh, 2020; Bonet et al., 2022). Unbalanced Wasserstein gradient flows have been a key tool in deep learning theory, e.g., to prove global convergence of one-hidden layer neural networks (Chizat & Bach, 2018; Rotskoff et al., 2019).

4 Computing SUOT and USOT with Frank-Wolfe algorithms

In this section, we propose two algorithms to compute SUOT and USOT in practice. The resulting procedures are given in Algorithms 2 and 3 respectively, and require smooth penalty terms (Dφ1, Dφ2). This condition

Published in Transactions on Machine Learning Research (12/2024)

is satisfied in the GHK setting (Dφi = ρi KL), but not for sliced partial OT (Dφi = ρi TV, Bai et al. (2023)). Our strategy is inspired by Séjourné et al. (2022b), where they proposed to solve the unbalanced OT problem between univariate measures using the Frank-Wolfe algorithm (as recalled in Appendix B.2). More precisely, we apply FW to optimize translation-invariant forms of the dual problems derived in Theorems 3.5 and 3.9.

4.1 Background: Frank-Wolfe Algorithm and Application to One-Dimensional Unbalanced OT

FW is a popular iterative first-order optimization algorithm for solving maxx E H(x), where E is a compact convex set and H : E R a concave, differentiable function. The procedure consists in maximizing a linear approximation of H at each iteration: given the current iterate xt, FW solves the linear oracle rt+1 arg maxr E H(xt), r , then performs xt+1 = (1 γt+1)xt + γt+1rt+1 with stepsize γt+1 typically chosen as γt+1 = 2 2+t+1. We refer to this step as FWStep and report the pseudo-code in Appendix B.2.

Séjourné et al. (2022b) apply FW to solve a translation-invariant formulation of the dual of UOT(α, β) for (α, β) M+(R) M+(R), and show that the linear oracle in FWStep is the dual of OT(αt, βt) where (αt, βt) are normalized versions of (α, β), i.e., m(αt) = m(βt) = 1. Therefore, computing UOT amounts to solve a sequence of OT problems, which can efficiently be done since (αt, βt) are univariate probability measures. The expression of (αt, βt) depend on the input measures (α, β), the current iterates (ft, gt) and the penalty coefficients (ρ1, ρ2).

4.2 Frank-Wolfe Solvers for Sliced Unbalanced and Unbalanced Sliced OT

Translation-invariant duals. We compute SUOT(α, β) and USOT(α, β) for any (α, β) M+(Rd) M+(Rd) by solving translation-invariant formulations of their duals with FW. By Theorems 3.5 and 3.9, and adapting the reasoning of Séjourné et al. (2022b), we prove that

SUOT(α, β) = sup (fθ,gθ) E

Sd 1 H(fθ, gθ; θ #α, θ #β)dˆσK(θ) , (13)

USOT(α, β) = sup (fθ,gθ) E H Z

Sd 1 fθ θ dˆσK(θ), Z

Sd 1 gθ θ dˆσK(θ); α, β (14)

where H(f, g; α, β) supλ R D(f + λ, g λ; α, β). These alternative duals are translation-invariant since, for any λ R, H(f + λ, g λ; α, β) = H(f, g; α, β). If (φ 1, φ 2) are smooth and strictly concave, then the maximizer in H, denoted by λ (f, g), exists and is unique. In particular, when Dφ1 = ρ1KL and Dφ2 = ρ2KL, λ (f, g) admits an analytical expression, which is given in the normalization routine (Algorithm 1). This is convenient as it avoids the need for approximate solvers to compute H(f, g; α, β).

Algorithm 1 Norm(α, β, f, g, ρ1, ρ2) Input: α, β, f, g, ρ1, ρ2 Output: Normalized measures ( α, β)

λ ρ1ρ2 ρ1+ρ2 log R e f(x)/ρ1 dα(x) R e g(y)/ρ2 dβ(y)

α e (f(x)+λ )

ρ1 α β e (g(y) λ )

ρ2 β Return ( α, β)

Frank-Wolfe iterations. We then apply FW to solve (13) and (14). We show that each iteration consists in solving a particular sliced OT problem between probability measures that depend on the input (α, β) and the iterates. To clarify this point, we present below the updates of FWStep tailored for each problem, starting with SUOT.

Proposition 4.1 (Frank-Wolfe iterations for SUOT). Let (α, β) M+(Rd) M+(Rd) and consider solving (13) with FW. Assume that (φ 1, φ 2) are smooth and strictly concave. Given current iterates (f t θ, gt θ)θ supp(ˆσK) E, the solutions of the linear oracle (rt θ, st θ)θ supp(ˆσK) are the dual potentials of R

Sd 1 OT(θ αt θ, θ βt θ)dˆσK(θ), where (αt θ, βt θ) are measures given by αt θ = φ f t θ + λ (f t θ, gt θ) α and βt θ = φ gt θ λ (f t θ, gt θ) β.

Proposition 4.1 shows that each FW iteration for solving the translation-invariant dual of SUOT(α, β) reduces to solving a balanced sliced OT problem: by (Séjourné et al., 2022b, Proposition 1), the measures (αt θ, βt θ) have the same mass, i.e., m(αt θ) = m(βt θ). When using KL-based penalty terms, the procedure for computing (αt θ, βt θ) is detailed in Algorithm 1, and reports the closed-form expression of λ (f t θ, gt θ).

Published in Transactions on Machine Learning Research (12/2024)

Algorithm 2 SUOT

Input: α, β, F, (θk)K k=1, ρ1, ρ2 Output: SUOT(α, β), (fθ, gθ)

(fθ, gθ) (0, 0) for t = 0, 1, . . . , F 1 do

for θ (θk)K k=1 do (αθ, βθ) Norm(θ α, θ β, fθ, gθ, ρ1, ρ2) (rθ, sθ) Sliced Dual(αθ, βθ) fθ (1 γt)fθ + γtrθ (FWStep) gθ (1 γt)gθ + γtsθ (FWStep) end for end for Return SUOT(α, β), (fθ, gθ) as in (7)

Algorithm 3 USOT

Input: α, β, F, (θk)K k=1, ρ1, ρ2 Output: USOT(α, β), (favg, gavg)

(fθ, gθ, favg, gavg) (0, 0, 0, 0) for t = 0, 1, . . . , F 1 do

for θ (θk)K k=1 do (π1, π2) Norm(α, β, favg, gavg, ρ1, ρ2) (rθ, sθ) Sliced Dual(θ π1, θ π2) end for (ravg, savg) 1 K PK k=1 rθk, 1

K PK k=1 sθk favg (1 γt)favg + γtravg (FWStep) gavg (1 γt)gavg + γtsavg (FWStep) end for Return USOT(α, β), (favg, gavg) as in (10)

Each iteration requires computing the dual potentials of a sliced OT problem, which is non-trivial: previous implementations related to sliced OT only output the value of the loss, SOT(α, β), typically in the context of training generative models (Deshpande et al., 2019; Nguyen et al., 2020). We thus design two novel implementations in Py Torch (Paszke et al., 2019) to compute the dual potentials of sliced OT. The first one leverages that the gradient of OT(α, β) w.r.t. (α, β) are optimal (f, g), which allows to backpropagate OT(θ α, θ β) w.r.t. (α, β) to obtain (rθ, sθ). The second one computes them in parallel on GPUs using their closed form, which to the best of our knowledge, is a new sliced algorithm. We call Sliced Dual(α, β) the step returning optimal (rθ, sθ) solving OT(θ α, θ β) for all θ supp(ˆσK), and refer to Appendix B.3 for the algorithms.

Building on Proposition 4.1 and the discussion above, we develop the FW methodology to compute SUOT(α, β) and detail it in Algorithm 2. Next, we derive the FW iterates for USOT(α, β).

Proposition 4.2 (Frank-Wolfe iterations for USOT). Let (α, β) M+(Rd) M+(Rd) and consider solving (14) with FW. Assume that (φ 1, φ 2) are smooth and strictly concave. Given current iterates (f t θ, gt θ)θ supp(ˆσK) E, the solutions of the linear oracle (rt θ, st θ)θ supp(ˆσK) are the dual potentials of SOT( αt, βt), where ( αt, βt) are measures given by αt = φ (favg + λ (favg, gavg))α and βt = φ (gavg λ (favg, gavg))β, with favg(x) R

Sd 1 f t θ(θ (x))dˆσK(θ), gavg(y) R

Sd 1 gt θ(θ (y))dˆσK(θ).

The resulting FW methodology, detailed in Algorithm 3, also leverages the Norm and Sliced Dual routines. The key difference from SUOT(α, β) is in where the integral over θ supp(ˆσK) is performed, leading to a different balanced sliced OT problem to solve.

Marginals of UOT/USOT. The optimal primal marginals of UOT and USOT are geometric normalizations of inputs (α, β) with discarded outliers. Their computation involves the Norm routine detailed in Algorithm 1, using optimal dual potentials. This is how we compute marginals in Figure 1 and in the experiments of Section 5: see Appendix B.4 for more details.

4.3 Convergence Properties and Complexity

Convergence and stochastic Frank-Wolfe. Our theoretical setting verifies the assumptions of (Lacoste Julien & Jaggi, 2015, Theorem 8), thus ensuring fast convergence of our methods. The number of FW iterations needed to converge remains low in our experiments. We give in Appendix B.5 empirical evidences that few iterations of FW (F 20) suffice to reach numerical precision.

Formally, the preceding algorithms assume that the functional H is given through integrals over the hypersphere, describing the set of all possible directions θ. However, in practice, SOT is computed by Monte-Carlo approximations, i.e., drawing a fixed number K of directions (θk)K k=1 and solving independently the different 1D OT problems. In the specific case of SUOT, this does not change much: K FW procedures are ran

Published in Transactions on Machine Learning Research (12/2024)

Table 1: Accuracy on document classification

BBCSport Goodreads

Acc t ( 10 3s) Acc (genre) Acc (like) t ( 10 3s)

OT 94.55 3.12 1.61 55.22 71.00 440.30 250 UOT 96.73 243.39 9.24 - - - Sinkh UOT 95.45 46.22 2.17 53.55 67.81 2021.68 356

SOT 89.39 0.76 1.80 0.22 50.09 0.51 65.60 0.20 4.49 1.44 SUOT 90.12 0.15 13.9 1.21 50.15 0.04 66.72 0.38 14.32 0.95 USOT 93.52 0.04 14.37 1.29 52.67 0.62 67.78 0.39 14.45 0.88 SUOT (CV on ρ) 90.00 0.59 - 49.67 0.79 66.43 0.44 - USOT (CV on ρ) 92.61 0.55 - 52.06 7.20 66.61 0.72 -

10 4 10 3 10 2 10 1 100 0.80

SOT USOT SUOT

Figure 2: Ablation on BBCSport of ρ.

independently (eventually in parallel) over the fixed set of directions. The case of USOT relies on a global FW scheme, where favg, gavg are computed w.r.t. a fixed distribution ˆσK = (1/K) PK k=1 δθk. This empirical distribution of directions can be considered fixed throughout the FW iterations, or can be drawn independently for each iteration of the FW procedure. This actually corresponds to a Stochastic FW algorithm, which also converges as our setting verifies the assumptions of (Hazan & Luo, 2016, Theorem 3). We call this procedure Stochastic USOT, which corresponds to Algorithm 3 except that (θk)K k=1 are sampled at each iteration. Since this procedure performs well in our experiments (e.g., Table 4) and Eθk σ[ˆσK] = σ, this suggests the dual in Theorem 3.9 holds for σ.

Algorithmic complexity. FW algorithms and its variants have been widely studied theoretically. Computing Sliced Dual has theoretically a complexity O(KN log N), where N is the number of samples, and K the number of projections of ˆσK. However, we note that the sorting operation, which yields the super linear complexity, can be computed once for all FW iterations. Consequently, the overall complexity of SUOT and USOT is thus O(KN log N + FN), where F is the number of FW iterations needed to reach convergence, with a O(N) complexity. Thus, our formulation enjoy a similar complexity than SOT, which is particularly appealing. However, Stochastic USOT is more costly, as each iteration requires sorting data projected along newly-sampled (θk)K k=1. Its complexity is therefore O(KFN log N). We finally note that due to the independent nature of the treatments of every projections, computing both Norm and Sliced Dual operations can be done in parallel, leveraging GPU computations when available.

Extension to non-Euclidean settings. Interestingly, our algorithms offer great modularity, in the sense they can easily be used to compute unbalanced versions of existing variants of SOT. Indeed, while such variants differ in the one-dimensional representations of α and β they use, they all consist in solving 1D OT problems to compare α and β, which our FW strategy can solve. To illustrate this point, we combined our FW routine with hyperbolic SOT (Bonet et al., 2023c) to compare measures supported on hyperbolic spaces: see Appendix C.3.

5 Experiments

This section presents a set of numerical experiments, which illustrate the effectiveness, robustness and computational efficiency of USOT1. We first showcase the benefit of USOT over SUOT and SOT on a document classification task. Then, we consider experiments in very large scale settings such as color transfer on every pixels and the computation of barycenters of geophysical datasets.

5.1 Document classification

We first consider a document classification problem (Kusner et al., 2015). Documents are represented as distributions of words embedded with word2vec (Mikolov et al., 2013) in dimension d = 300. Let Dk be the k-th document and xk 1, . . . , xk nk Rd be the set of words in Dk. Then, Dk = Pnk i=1 wk i δxk i where wk i is the frequency of xk i in Dk normalized s.t. Pnk i=1 wk i = 1. Given a loss function L, the document classification task is solved by computing the matrix L(Dk, Dℓ)

k,ℓ, then using a k-nearest neighbor classifier. The aim

1The code is available at https://github.com/clbonet/Slicing_Unbalanced_Optimal_Transport.

Published in Transactions on Machine Learning Research (12/2024)

Figure 3: Color transfer between a source and a target image (first and second columns). We compare SOT gradient flows operated in the color space (third column) and the same procedure with a reweighing of the distributions by USOT (fourth column). The last column shows a percentage of mass change given by USOT, i.e., (π 2 β)

β , where red indicates mass creation and blue mass destruction.

of this experiment is to show that by discarding possible outliers using a well chosen parameter ρ, USOT is able to outperform SOT and SUOT on this task. Since a word typically appears several times in a document, the measures are not uniform and sliced partial OT (Bonneel & Coeurjolly, 2019; Bai et al., 2023) cannot be used in this setting. We detail in Appendix C additional experiments without normalizing histograms to compare with (Bai et al., 2023) (i.e., Pnk i=1 wk i is the sentence length). We consider the BBCSport dataset (Kusner et al., 2015), a standard benchmark with small documents for which OT can be used effectively, and the Goodreads dataset (Maharjan et al., 2017) on two tasks (genre and likability predictions), a dataset with large-scale documents for which the computational burden of performing OT and UOT is substantial. We report on Table 1 the accuracy and average runtimes of OT, UOT computed with the majorization minimization algorithm (Chapel et al., 2021) or approximated with the Sinkhorn algorithm (Sinkh UOT) (Pham et al., 2020), as well as SOT, SUOT and USOT. All the benchmark methods are computed using the Python OT library (Flamary et al., 2021) on a Nvidia Tesla V100 GPU. For sliced methods, we average over 3 computations of the loss matrix and report the standard deviation in Table 1. The number of neighbors was selected via cross validation. The results for UOT, Sinkh UOT, SUOT and USOT are reported for ρ yielding the best accuracy among a grid (see Appendix C.1 for more details), and we display an ablation of this parameter on the BBCSport dataset in Figure 2. We also add on Table 1 the results obtained with a cross validation (CV) on ρ for USOT and SUOT. Our findings demonstrate that our approaches surpass SOT in performance, incurring only a minor computational overhead. Moreover, our methods closely rival the performance of OT, while being 40 times faster on large-scale datasets. This highlight their practical significance, particularly when OT is computationally unfeasible.

5.2 Color transfer

Color transfer is a long-standing problem in OT, which dates back to the seminal work of Rabin et al. (2010). It consists in aligning the color distributions of two images. While previous works, e.g. (Ferradans et al., 2013; Bonneel et al., 2016), considered color palettes to deal with the complexity of OT, we illustrate the scalability of our methods by considering here the full distributions of pixels within images, in a way similar to (Bonneel & Coeurjolly, 2019). We express the color transfer as a gradient flow, where every pixel is a sample in the 3D RGB color space. Formally, let α(t) = 1 N PN i=1 δxi(t), β = 1 M PM j=1 δyj, where α (resp. β) represents the color distribution of the source (resp. target) image. The SOT gradient flow performing color transfer consists in iterating the following scheme: X(t + 1) = X(t) γ XSOT(α(t), β), where α(t) is the color distribution of the source image at iterations t, supported by pixels from X(t). One of the major

Published in Transactions on Machine Learning Research (12/2024)

Figure 4: Barycenter of geophysical data. (First row) Simulated output of 4 different climate models depicting different scenarios for the evolution of a tropical cyclone (Second row) Results of different aggregation strategies.

problem is color bleeding: potential color artefacts are likely to appear since the object proportions between images are likely to differ. We propose to correct this issue in a two steps procedure, relying on USOT: i) we first obtain optimal marginals (π 1, π 2) by solving (9) and using the Norm routine, then ii) we solve the classical SOT gradient flow with the measures (π 1, π 2).

We present results on 300 300 images produced by the generative model Midjourney (Mid, 2023). The images correspond to two landscapes and photo-realistic portraits of two women (see first and second column of Figure 3). For every results, we iterate the gradient flow for 100 iterations, and the learning rate γ is set to 10 2. The computation of the final result is produced in less than one minute with a commodity GPU, while it is out of reach for OT or UOT solvers on this distributions size (90K pixels). The third column shows the color transfer using SOT. It reveals instances of color bleeding: green clouds appear in the sky of landscape scenes, and a red superimposed logo results in unwanted red pixels in the portrait. Contrasting, the fourth column displays the outcomes achieved through our approach, effectively addressing the color bleeding issue. We chose ρ1 = 104 and ρ2 = 0.02 for this task and we observe the resulting change in mass distribution on the target image in the last column of Figure 3. It provides insightful indications of the discarded colors (the darkened landscape in the first case and the removal of the logo in the second), which is only possible when all the pixels are considered in the transfer.

5.3 Barycenter of geophysical data

OT barycenters are an important topic of interest (Le et al., 2021) for their ability to capture mass changes and spatial deformations over several reference measures. In order to compute barycenters under the USOT geometry on a fixed grid, we employ a mirror-descent strategy similar to (Cuturi & Doucet, 2014a, Algorithm (1)) and described more in depth in Appendix C. We compute unbalanced sliced OT barycenter for climate model data. Ensembles of multiple models are commonly employed to reduce biases and evaluate uncertainties in climate projections (Sanderson et al., 2015; Thao et al., 2022). The commonly used Multi-Model Mean approach assumes models are centered around true values and averages the ensemble with equal or varying weights. However, spatial averaging may fail in capturing specific characteristics of the physical system at stake, and we propose to use USOT barycenter instead. We use the Climate Net dataset (Prabhat et al., 2021), and more specifically the TMQ (precipitable water) indicator. The Climate Net dataset is a human-expert-labeled curated dataset which captures tropical cyclones (TCs), among other things. To simulate the output of several climate models, we take a specific instant (first date of 2011) and apply the elastic deformation from Torch Vision (Paszke et al., 2019) in an area close to the eastern part of the U.S.A. As a result, we obtain 4 different TCs, as shown in the first row of Figure 4. The classical L2 spatial mean is

Published in Transactions on Machine Learning Research (12/2024)

displayed on the second row of Figure 4 and reveals 4 different TCs centers/modes, which is undesirable. As the total TMQ mass in the considered zone varies between the different models, a direct application of SOT is impossible, or requires a normalization of the mass that has undesired effect as can be seen on the second row. Finally, we show the result of the USOT barycenter with ρ1 = 1e1 (related to the data) and ρ2 = 1e4 (related to the barycenter). This barycenter has only one apparent mode, which is the expected behaviour. The considered measures have a size of 100 200, and we run the barycenter algorithm for 500 iterations (with K = 64 projections), which takes 3 minutes on a commodity GPU. UOT barycenters for this size of problems are intractable, and to the best of our knowledge, our experiment is the very first instance where unbalanced OT barycenters can be computed on such a large scale.

6 Conclusion

We proposed two losses merging unbalanced and sliced OT, with theoretical guarantees and an efficient and modular Frank-Wolfe algorithm. We illustrate the performance improvement over SOT on various experiments, and described novel applications of unbalanced OT barycenters of positive measures, with a new case study on geophysical data. These novel results and algorithms pave the way to numerous new applications of sliced variants of OT, and we believe that our contributions will motivate practitioners to further explore their use in general ML applications, without the cumbersome task of pre-processing probability measures.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. KF was supported by NSERC Discovery grant (RGPIN-2019-06512) and a Samsung grant. CB was supported by project Dyna Learn from Labex Comin Labs and Région Bretagne ARED DLearn Me, and by the ANR PEPR PDE-AI. NC was supported by the ANR AI Chair OTTOPIA ANR-20-CHIA-0030

Midjourney, 2023. URL "https://docs.midjourney.com/legacy/en". (Cited on p. 13)

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005. (Cited on p. 8)

Yikun Bai, Bernhard Schmitzer, Matthew Thorpe, and Soheil Kolouri. Sliced optimal partial transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13681 13690, 2023. (Cited on p. 2, 4, 5, 8, 9, 12, 39)

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423 443, 2018. (Cited on p. 1)

Erhan Bayraktar and Gaoyue Guo. Strong equivalence between metrics of Wasserstein type. Electronic Communications in Probability, 26(none):1 13, 2021. doi: 10.1214/21-ECP383. (Cited on p. 2, 8)

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. (Cited on p. 41)

Vladimir Igorevich Bogachev and Maria Aparecida Soares Ruas. Measure theory, volume 1. Springer, 2007. (Cited on p. 23)

Clément Bonet, Benoît Malézieux, Alain Rakotomamonjy, Lucas Drumetz, Thomas Moreau, Matthieu Kowalski, and Nicolas Courty. Sliced-wasserstein on symmetric positive definite matrices for m/eeg signals. In Proceedings of the 40th International Conference on Machine Learning, 2023a. (Cited on p. 2)

Clément Bonet, Nicolas Courty, François Septier, and Lucas Drumetz. Efficient gradient flows in slicedwasserstein space. Transactions on Machine Learning Research, 2022. (Cited on p. 8)

Published in Transactions on Machine Learning Research (12/2024)

Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh-Tan Pham. Spherical sliced-wasserstein. In The Eleventh International Conference on Learning Representations, 2023b. (Cited on p. 2)

Clément Bonet, Laetitia Chapel, Lucas Drumetz, and Nicolas Courty. Hyperbolic sliced-wasserstein via geodesic and horospherical projections. In Proceedings of 2nd Annual Workshop on Topology, Algebra, and Geometry in Machine Learning (TAG-ML), pp. 334 370. PMLR, 2023c. (Cited on p. 2, 11, 41)

Nicolas Bonneel and David Coeurjolly. Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1 13, 2019. (Cited on p. 2, 4, 5, 8, 12)

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22 45, 2015. (Cited on p. 2)

Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Trans. Graph., 35(4), jul 2016. (Cited on p. 12)

Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. Ph D thesis, Université Paris Sud-Paris XI; Scuola normale superiore (Pise, Italie), 2013. (Cited on p. 8, 25, 26)

Jules Candau-Tilh. Wasserstein and sliced-wasserstein distances. Master s thesis, Université Pierre et Marie Curie, 2020. (Cited on p. 8)

Laetitia Chapel, Rémi Flamary, Haoran Wu, Cédric Févotte, and Gilles Gasso. Unbalanced optimal transport through non-negative penalized linear regression. Advances in Neural Information Processing Systems, 34: 23270 23282, 2021. (Cited on p. 12, 37)

Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018. (Cited on p. 2, 8)

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and fisher rao metrics. Foundations of Computational Mathematics, 18:1 44, 2018a. (Cited on p. 3)

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and kantorovich formulations. Journal of Functional Analysis, 274(11):3090 3123, 2018b. (Cited on p. 2)

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. (Cited on p. 2)

Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 685 693, Bejing, China, 22 24 Jun 2014a. PMLR. (Cited on p. 13)

Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International conference on machine learning, pp. 685 693. PMLR, 2014b. (Cited on p. 41)

Pinar Demetci, Rebecca Santorella, Manav Chakravarthy, Bjorn Sandstede, and Ritambhara Singh. Scotv2: Single-cell multiomic alignment with disproportionate cell-type representation. Journal of Computational Biology, 29(11):1213 1228, 2022. (Cited on p. 2)

Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648 10656, 2019. (Cited on p. 2, 10)

Published in Transactions on Machine Learning Research (12/2024)

Richard Mansfield Dudley. The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40 50, 1969. (Cited on p. 2)

Kilian Fatras, Younes Zine, Rémi Flamary, Remi Gribonval, and Nicolas Courty. Learning with minibatch wasserstein : asymptotic and gradient properties. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2131 2141, Online, 26 28 Aug 2020. PMLR. (Cited on p. 2)

Kilian Fatras, Thibault Sejourne, Rémi Flamary, and Nicolas Courty. Unbalanced minibatch optimal transport; applications to domain adaptation. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 3186 3197. PMLR, 18 24 Jul 2021. (Cited on p. 1, 2)

Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. In Scale Space and Variational Methods in Computer Vision, pp. 428 439, 2013. (Cited on p. 12)

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. Pot: Python optimal transport. The Journal of Machine Learning Research, 22(1):3571 3578, 2021. (Cited on p. 12)

Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd international conference on artificial intelligence and statistics, pp. 1574 1583. PMLR, 2019. (Cited on p. 30)

Ziv Goldfeld and Kristjan Greenewald. Sliced mutual information: A scalable measure of statistical dependence. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 17567 17578. Curran Associates, Inc., 2021. (Cited on p. 2)

Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pp. 1263 1271. PMLR, 2016. (Cited on p. 11)

Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019. (Cited on p. 2)

Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vorotnikov. A fitness-driven cross-diffusion system from population dynamics as a gradient flow. Journal of Differential Equations, 261(5):2784 2808, 2016. (Cited on p. 2)

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International conference on machine learning, pp. 957 966. PMLR, 2015. (Cited on p. 11, 12, 37)

Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of frank-wolfe optimization variants. Advances in neural information processing systems, 28, 2015. (Cited on p. 10)

Khang Le, Huy Nguyen, Quang M Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust optimal transport: Computational complexity and barycenter computation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21947 21959, 2021. (Cited on p. 13)

Tam Le and Truyen Nguyen. Entropy partial transport with tree metrics: Theory and practice. In International Conference on Artificial Intelligence and Statistics, pp. 3835 3843. PMLR, 2021. (Cited on p. 2)

Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new hellinger kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969 1117, 2018. (Cited on p. 2, 3, 4, 5, 6, 8, 19, 21, 29, 30, 36)

Published in Transactions on Machine Learning Research (12/2024)

Suraj Maharjan, John Arevalo, Manuel Montes, Fabio A González, and Thamar Solorio. A multi-task approach to predict likability of books. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1217 1227, 2017. (Cited on p. 12, 37)

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013. (Cited on p. 11, 37)

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 20802 20812. Curran Associates, Inc., 2020a. (Cited on p. 5, 6)

Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Simsekli. Statistical and topological properties of sliced probability divergences. Advances in Neural Information Processing Systems, 33:20802 20812, 2020b. (Cited on p. 2, 6, 24, 32)

Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, and Masanori Koyama. A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693 4702. PMLR, 2019. (Cited on p. 41)

Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-wasserstein and applications to generative modeling. ar Xiv preprint ar Xiv:2002.07367, 2020. (Cited on p. 2, 10)

Khai Nguyen, Tongzheng Ren, Huy Nguyen, Litu Rout, Tan Minh Nguyen, and Nhat Ho. Hierarchical sliced wasserstein distance. In The Eleventh International Conference on Learning Representations, 2023. (Cited on p. 2)

Ruben Ohana, Kimia Nadjahi, Alain Rakotomamonjy, and Liva Ralaivola. Shedding a pac-bayesian light on adaptive sliced-wasserstein distances. In Proceedings of the 40th International Conference on Machine Learning, 2023. (Cited on p. 2)

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of EMNLP, pp. 79 86, 2002. (Cited on p. 37)

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. (Cited on p. 10, 13)

Ofir Pele and Michael Werman. Fast and robust earth mover s distances. In 2009 IEEE 12th international conference on computer vision, pp. 460 467. IEEE, 2009. (Cited on p. 2)

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. (Cited on p. 2, 4)

Khiem Pham, Khang Le, Nhat Ho, Tung Pham, and Hung Bui. On unbalanced optimal transport: An analysis of Sinkhorn algorithm. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7673 7682. PMLR, 13 18 Jul 2020. (Cited on p. 2, 4, 12)

Benedetto Piccoli and Francesco Rossi. Generalized wasserstein distance and its application to transport equations with source. Archive for Rational Mechanics and Analysis, 211:335 358, 2014. (Cited on p. 30)

Prabhat, K. Kashinath, M. Mudigonda, S. Kim, L. Kapp-Schwoerer, A. Graubner, E. Karaismailoglu, L. von Kleist, T. Kurth, A. Greiner, A. Mahesh, K. Yang, C. Lewis, J. Chen, A. Lou, S. Chandran, B. Toms, W. Chapman, K. Dagon, C. A. Shields, T. O Brien, M. Wehner, and W. Collins. Climatenet: an expert-labeled open dataset and deep learning architecture for enabling high-precision analyses of extreme weather. Geoscientific Model Development, 14(1):107 124, 2021. doi: 10.5194/gmd-14-107-2021. (Cited on p. 13)

Published in Transactions on Machine Learning Research (12/2024)

Julien Rabin, Julie Delon, and Yann Gousseau. Regularization of transportation maps for color and contrast transfer. In International Conference on Image Processing, pp. 1933 1936, 2010. (Cited on p. 12)

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision: Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29 June 2, 2011, Revised Selected Papers 3, pp. 435 446. Springer, 2012. (Cited on p. 2, 4)

Grant Rotskoff, Samy Jelassi, Joan Bruna, and Eric Vanden-Eijnden. Global convergence of neuron birthdeath dynamics. ar Xiv preprint ar Xiv:1902.01843, 2019. (Cited on p. 2, 8)

Benjamin M Sanderson, Reto Knutti, and Peter Caldwell. A representative democracy to reduce interdependency in a multimodel ensemble. Journal of Climate, 28(13):5171 5194, 2015. (Cited on p. 13)

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. (Cited on p. 19, 23, 29, 31, 34, 35)

Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Fast unbalanced optimal transport on a tree. Advances in neural information processing systems, 33:19039 19051, 2020. (Cited on p. 2)

Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928 943, 2019. (Cited on p. 1, 2)

Thibault Séjourné, Jean Feydy, François-Xavier Vialard, Alain Trouvé, and Gabriel Peyré. Sinkhorn divergences for unbalanced optimal transport. ar Xiv preprint ar Xiv:1910.12958, 2019. (Cited on p. 21)

Thibault Séjourné, Gabriel Peyré, and François-Xavier Vialard. Unbalanced optimal transport, from theory to numerics. ar Xiv preprint ar Xiv:2211.08775, 2022a. (Cited on p. 2, 3, 5)

Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. Faster unbalanced optimal transport: Translation invariant sinkhorn and 1-d frank-wolfe. In International Conference on Artificial Intelligence and Statistics, pp. 4995 5021. PMLR, 2022b. (Cited on p. 2, 9, 23, 29, 34)

Stephen Simons. Minimax and monotonicity. Springer, 2006. (Cited on p. 21)

Soulivanh Thao, Mats Garvik, Gregoire Mariethoz, and Mathieu Vrac. Combining global climate models using graph cuts. Climate Dynamics, 59:2345 2361, 2022. (Cited on p. 13)

Adrien Vacher and François-Xavier Vialard. Semi-dual unbalanced quadratic optimal transport: fast statistical rates and convergent algorithm. In International Conference on Machine Learning, pp. 34734 34758. PMLR, 2023. (Cited on p. 6)

Jiaqi Xi and Jonathan Niles-Weed. Distributional convergence of the sliced wasserstein process. ar Xiv preprint ar Xiv:2206.00156, 2022. (Cited on p. 23)

Published in Transactions on Machine Learning Research (12/2024)

A Postponed proofs for Section 3

A.1 Existence of minimizers: Proof of Proposition 3.2 and Proposition 3.7

We provide the formal statement and detailed proof on the existence of a solution for both SUOT and USOT, as mentioned in Section 3.

Proposition A.1. (Existence of minimizers) Assume that C1 is lower-semicontinuous and that either (i) φ 1, = φ 2, = + , or (ii) C1 has compact sublevels on R R and φ 1, + φ 2, + inf C1 > 0. Then the solution of SUOT(α, β) and USOT(α, β) exist, i.e., the infimum in (6) and (9) is attained. More precisely, there exists (π1, π2) which attains the infimum for USOT(α, β) (see Equation 9). Concerning SUOT(α, β), there exists for any θ supp(σ) a plan πθ attaining the infimum in UOT(θ α, θ β) (see Equation 2).

Proof. We leverage (Liero et al., 2018, Theorem 3.3) to prove this proposition. In the setting of SUOT, if such assumptions (i) or (ii) are satisfied for (α, β), then they also hold for (θ α, θ β) for any θ Sd 1. Hence, UOT(θ α, θ β) admits a solution πθ.

Concerning USOT, note that one necessarily has m(π1) = m(π2), otherwise SOT(π1, π2) = + . From (Liero et al., 2018, Equation (3.10)), for any admissible (π1, π2, π), one has

USOT(α, β) m(π) inf C1 + m(α)φ1( m(π)

m(α)) + m(β)φ2( m(π)

In both settings the above bounds implies coercivity of the functional of USOT w.r.t. the masses of the measures (π1, π2, π). Thus there exists M > 0 such that m(π1) = m(π2) = m(π) < M, otherwise USOT(α, β) = + . By the Banach-Alaoglu theorem, the set of bounded measures (π1, π2) is compact, and the set of plans π with such marginals is also compact because Rd is Polish and C1 is lower-semicontinuous (Santambrogio, 2015, Theorem 1.7). Because the functional of USOT is lower-semicontinuous in (π1, π2, π) and we can restrict optimization over a compact set, we have existence of minimizers for USOT by standard proofs of calculus of variations.

A.2 Strong duality: Proof of Theorem 3.5 and Theorem 3.9

Note that the result for SUOT (Theorem 3.5) is proved in Lemma A.4. Thus we focus on the proof of duality for USOT.

Proof of Theorem 3.9. We start from the definition of USOT, reformulate it to apply the strong duality result of Proposition A.2 and obtain our reformulation. We first have that

USOT(α, β) = inf (π1,π2) M+(Rd)2 SOT(π1, π2) + Dφ1(π1|α) + Dφ2(π2|β) ,

= inf (π1,π2) M+(Rd)2

sup fθ gθ C1

Z fθd(θ π1) + Z gθd(θ π2)

+ sup f E(Rd)

Z φ 1( f(x))dα(x) Z f(x)dπ1(x)

+ sup g E(Rd)

Z φ 2( g(y))dβ(y) Z g(y)dπ2(y)

= inf (π1,π2) M+(Rd)2

sup fθ gθ C1

" Z fθd(θ π1) + Z gθd(θ π2)

+ sup f E(Rd)

Z φ 1( f(x))dα(x) Z f(x)dπ1(x)

+ sup g E(Rd)

Z φ 2( g(y))dβ(y) Z g(y)dπ2(y)

Published in Transactions on Machine Learning Research (12/2024)

where E(Rd) denotes a set of lower-semicontinuous functions, and the last equality holds thanks to Lemma A.3.

We focus now on verifying that Proposition A.2 holds, so that we can swap the infimum and the supremum. Define the functional

L((π1, π2), ((fθ)θ, (gθ)θ, f, g)) Z

" Z fθd(θ π1) + Z gθd(θ π2)

+ Z φ 1( f(x))dα(x) Z f(x)dπ1(x)

+ Z φ 2( g(y))dβ(y) Z g(y)dπ2(y) .

One has that,

For any ((fθ)θ, (gθ)θ, f, g), L is linear (thus convex) and lower-semicontinuous.

For any (π1, π2), L is concave in ((fθ)θ, (gθ)θ, f, g) because φ i is concave and thus L is a sum of linear or concave functions.

Furthermore, since we assumed that 0 dom(φ), then

sup ((fθ)θ,(gθ)θ, f, g) inf (π1,π2) M+(Rd)2 L USOT(α, β) φ1(0)m(α) + φ2(0)m(β) ,

because the marginals (π1, π2) = (0, 0) are admissible and suboptimal. If we consider instead that (m(α), m(β)) dom(φ), then we take the marginals π1 = α/m(α) and π2 = β/m(β), which yields an upperbound by m(α)φ1( 1 m(α)) + m(β)φ2( 1 m(β)). Then we consider an anchor dual point b = ((fθ)θ, (gθ)θ, f, g) to bound L over a compact set. We take fθ = 0, gθ = 0, which are always admissible since we take C1(x, y) 0. Then, since we assume there exists pi 0 in dom(φ i ), we take f = p1 and g = p2. For these potentials one has:

L((π1, π2), b ) = φ 1(p1)m(α) p1m(π1) + φ 2(p2)m(α) p2m(π2).

Note that the functional at this point only depends on the masses of the marginals (π1, π2). Since (p1, p2) 0 the set of (π1, π2) such that L((π1, π2), b ) φ1(0)m(α) + φ2(0)m(β) is non-empty (at least in a neighbourhood of (π1, π2) = (0, 0), and that (m(π1), m(π2)) are uniformly bounded by some constant M > 0. By the Banach-Alaoglu theorem, such set of measures is compact for the weak* topology.

Therefore, Proposition A.2 holds and we have strong duality, i.e.,

USOT(α, β) = sup fθ gθ C1 ( f, g) E(Rd)

inf (π1,π2) M+(Rd)2 L((π1, π2), ((fθ)θ, (gθ)θ, f, g)).

To achieve the proof, note that taking the infimum in (π1, π2) (for fixed dual variables) reads

inf π1,π2 0

Sd 1 fθ(θ (x))dˆσK(θ)

dπ1(x) Z f(x)dπ1(x)

Sd 1 gθ(θ (y))dˆσK(θ)

dπ2(y) Z g(y)dπ2(y).

Note that we applied Fubini s theorem here, which holds here because all measures have compact support, thus all quantities are finite. It allows to rephrase the minimization over π1, π2 0 as the following constraint Z

Sd 1 fθ(θ (x))dˆσK(θ) f(x), Z

Sd 1 gθ(θ (y))dˆσK(θ) g(y),

Published in Transactions on Machine Learning Research (12/2024)

otherwise the infimum is . However, the function φ is non-decreasing (see (Séjourné et al., 2019, Proposition 2)). Thus the maximization in ( f, g) is optimal when the above inequality is actually an equality, i.e., Z

Sd 1 fθ(θ (x))dˆσK(θ) = f(x), Z

Sd 1 gθ(θ (y))dˆσK(θ) = g(y).

Plugging the above relation in the functional L yields the desired result on the dual of USOT and ends the proof.

We mention a strong duality result which is very general and which we use in the proof of Theorem 3.9. This result is taken from (Liero et al., 2018, Theorem 2.4) which itself takes it from (Simons, 2006).

Proposition A.2. (Liero et al., 2018, Theorem 2.4) Consider two sets A and B be nonempty convex sets of some vector spaces. Assume A is endowed with a Hausdorff topology. Let L : A B R be a function such that

1. a 7 L(a, b) is convex and lower-semicontinuous on A, for every b B

2. b 7 L(a, b) is concave on B, for every a A.

If there exists b B and κ > supb B infa A L(a, b) such that the set {a A, L(a, b ) < κ} is compact in A, then

inf a A sup b B L(a, b) = sup b B inf a A L(a, b)

We also consider the following to swap the supremum in the integral which defines sliced-UOT (and in particular sliced-OT). In what follows we note sliced potentials as functions fθ(z) with (θ, z) Sd 1 R, such that

SUOT(α, β) = Z

h sup fθ gθ C1

Z φ fθd(θ α) + Z φ gθd(θ β) i dˆσK(θ).

Note that with the above definition, z 7 fθ(z) is continuous for any θ, but θ 7 fθ(z) is only ˆσK-measurable.

Lemma A.3. Consider two sets X and Y , a measure σ such that σ(X) < + . Assume Y is compact. Consider a function F : X Y R. Assume there exists a sequence (yn) in Y such that F( , yn) supy Y F( , y) uniformly. Then one has

X F(x, y)dσ(x) = Z

X sup y Y F(x, y)dσ(x).

Proof. Define G(x) = supy Y F(x, y) and H(x, y) G(x) F(x, y). One has H 0 by definition, and the desired equality can be rewritten as

X F(x, y)dσ(x) = Z

X sup y Y F(x, y)dσ(x)

X H(x, y)dσ(x) = 0.

Since the integral involving H is non-negative, the infimum is zero if and only if we have a sequence (yn) such that R

X H( , yn)dσ 0. By assumption, one has F( , yn) supy Y F( , y) uniformly, i.e., ||H( , yn)|| 0. This implies thanks to Holder s inequality that

X H( , yn)dσ σ(X)||H( , yn)||

Published in Transactions on Machine Learning Research (12/2024)

Thus by assumption one has R

X F( , yn)dσ R

X Gdσ, which indeed means that we have the desired permutation between supremum and integral.

Lemma A.4. Let p [1, + ) and assume that C1(x, y) = |x y|p. Consider two positive measures (α, β) with compact support. Assume that the measure ˆσK is discrete, i.e., ˆσK = 1 K PK i=1 δθi with θi Sd 1, i = 1, . . . , n. Then, one can swap the integral over the sphere and the supremum in the dual formulation of SUOT, such that

SUOT(α, β) = sup fθ gθ C1

h Z φ fθd(θ α) + Z φ gθd(θ β) i dˆσK(θ).

In particular, this result is valid for SOT.

Proof. The proof consists in applying Lemma A.3 for (X, Y ) chosen as X = supp(ˆσK) Sd 1 and

Y = θ supp(ˆσK), fθ : R R, gθ : R R, fθ(x) + gθ(y) C1(x, y) .

The functions in Y are dual potentials, and by definition are continuous for any θ. Let F : X Y R be the functional defined as

F : (θ, (fθ)θ, (gθ)θ) 7 Z fθd(θ α) + Z gθd(θ β) .

Since the measures (α, β) have compact support, then by Lemma A.5, the supremum is attained over a subset of dual potentials of Y such that for any fixed θ X, (fθ, gθ) are Lipschitz-continuous and bounded, thus uniformly equicontinuous functions (with constants independent of θ). By the Ascoli-Arzela theorem, the set of uniformly equicontinuous functions is compact for the uniform convergence. Hence, for any θ X, there exists a sequence of dual potentials (fθ,n, gθ,n) which uniformly converges to optimal dual potentials (fθ, gθ) (up to extraction of subsequence). Besides, we have OT(θ α, θ β) = F(θ, fθ, gθ) and F(θ, (fθ,n)θ, (gθ,n)θ) OT(θ α, θ β) as n + . Denote Fn(θ) F(θ, (fθ,n)θ, (gθ,n)θ) and OT(θ) OT(θ α, θ β). In order to apply Lemma A.3, we need to prove that the convergence of (Fn(θ))n N to OT(θ α, θ β) is uniform w.r.t. θ, i.e., supθ X |Fn(θ) OT(θ)| 0 as n + .

First, note that for any θ X,

|Fn(θ) OT(θ)| m(α) fθ,n fθ + m(β) gθ,n gθ .

Since for a fixed θ X, (fθ,n, gθ,n)n N uniformly converge to (fθ, gθ), this means that

θ X, ε > 0, N(ε, θ), n N(ε, θ), m(α) fθ,n fθ + m(β) gθ,n gθ < ε.

Since we assume that σ is supported on a discrete set, then the cardinal of X is finite and one can define N(ε) maxθ X N(ε, θ). This yields,

ε > 0, N(ε), n N(ε), sup θ X |Fn(θ) OT(θ)| < ε.

which means that supθ X |Fn(θ) OT(θ)| 0, thus concludes the proof.

Lemma A.5. Let p [1, + ) and C1(x, y) = |x y|p. Consider two positive measures (α, β) M+(Rd) whose support is such that Cd(x, y) = ||x y||p R. Then for any θ Sd 1, one can restrict without loss of generality the problem UOT(θ α, θ β) as a supremum over dual potentials satisfying fθ(x)+gθ(y) C1(x, y), uniformly bounded by M and uniformly L-Lipschitz, where M and L do not depend on θ.

Published in Transactions on Machine Learning Research (12/2024)

Proof. We adapt the proof of (Santambrogio, 2015, Proposition 1.11), and focus on showing that the uniform boundedness and Lipschitz constant are independent of θ Sd 1 in this setting. Here we consider the translation-invariant formulation of UOT from (Séjourné et al., 2022b), i.e., UOT(α, β) = supf g Cd H(f, g), where H(f, g) = supλ R D(f + λ, g λ). It is proved in (Séjourné et al., 2022b, Proposition 9) that the above problem has the same primal and is thus equivalent to optimize D. By definition one has H(f, g) = H(f +λ, g λ) for any λ R, i.e., this formulation shares the same invariance as Balanced OT. Thus we can reuse all arguments from (Santambrogio, 2015, Proposition 1.11), such that for UOT(α, β), one can use the constraint f(x) + g(y) Cd(x, y) and the assumption Cd(x, y) R to prove that without loss of generality, on can restrict to potentials such that f(x) [0, R] and g(y) [ R, R]. Furthermore if the cost satisfies in Rd

|Cd(x, y) Cd(x , y )| L(||x x || + ||y y ||),

then one can also restrict w.l.o.g. to potentials which are L-Lipschitz. For the cost Cd(x, y) = ||x y||p

with p 1, this holds with constant L = p Rp 1 because the support is bounded and the gradient of Cd is radially non-decreasing.

Regarding OT(θ α, θ β), the bounds (Mθ, Lθ) could be refined by considering the dependence in θ Sd 1. However we prove now these constants can be upper-bounded by a finite constant independent of θ. In this setting we consider the cost

C1(θ (x), θ (y)) = | θ, x y |p ||θ||p||x y||p ||x y||p,

by Cauchy-Schwarz inequality. Therefore, if (α, β) have supports such that ||x y||p R, then (θ α, θ β) also have supports bounded by R in R. Similarly note that the derivative of h(x) = xp is non-decreasing for p 1. Hence the cost C1(θ (x), θ (y)) has a bounded derivative, which reads

p| θ, x y |p 1 p||θ||p 1||x y||p 1 p|x y||p 1 p Rp 1.

Thus on the supports of (θ α, θ β) one can also bound the Lipschitz constant of the cost C1(x, y) = |x y|p

by the same constant L.

Remark: Extending Theorem 3.9. We conjecture that Theorem 3.9 also holds when σ is the uniform measures over Sd 1, since the above holds for any N N and ˆσN converges weakly* to σ. Proving this result would require that potentials (fθ, gθ) are also regular (i.e., Lipschitz and bounded) w.r.t θ Sd 1. This regularity is proved in (Xi & Niles-Weed, 2022) assuming (α, β) have densities, but remains unknown for discrete measures. Since discretizing σ corresponds to the computational approach, we assume it to be discrete, so that no additional assumption than boundedness on (α, β) is required. For instance, such result remains valid for semi-discrete UOT computation.

A.3 Metric properties: Proof of Proposition 3.3 and Proposition 3.8

Proof of Proposition 3.3. Metric properties of SUOT. Symmetry and non-negativity are immediate. Assume SUOT(α, β) = 0. Since σ is the uniform distribution on Sd 1, then for any θ Sd 1, UOT(θ α, θ β) = 0, and since UOT is assumed to be definite, then θ α = θ β. By (Bogachev & Ruas, 2007, Proposition 3.8.6), this implies that α and β have the same Fourier transform. By injectivity of the Fourier transform, we conclude that α = β, hence SUOT is definite. The triangle inequality results from applying

Published in Transactions on Machine Learning Research (12/2024)

the Minkowski inequality then the triangle inequality for UOT1/p for p [1, + ): for any α, β, γ M+(Rd),

SUOT1/p(α, β)

Sd 1 UOT(θ α, θ β)dσ(θ)

UOT1/p(θ α, θ γ) + UOT1/p(θ γ, θ β) pdσ(θ)

UOT1/p(θ α, θ γ) pdσ(θ)

UOT1/p(θ γ, θ β) pdσ(θ)

= SUOT1/p(α, γ) + SUOT1/p(γ, β).

Proof of Proposition 3.8. Metric properties of USOT. Let (α, β) M+(Rd). Non-negativity is immediate, as USOT is defined as a program minimizing a sum of positive terms. SOT is symmetric, thus when φ1 = φ2, we obtain symmetry of the functional w.r.t. (α, β). Assume Dφ is definite, i.e., Dφ(α|β) = 0 implies α = β. Assume now that USOT(α, β) = 0, and denote by (π1, π2) the optimal marginals attaining the infimum in (9). USOT(α, β) = 0 implies that SOT(π1, π2) = 0, Dφ(π1|α) = 0 and Dφ(π2|β) = 0. These three terms are definite, which yields α = π1 = π2 = β, hence the definiteness of USOT. The Partial OT setting (i.e., Dφ = ρTV) is treated in Appendix A.7.

A.4 Sample complexity: Proof of Theorem 3.4

Theorem 3.4 is obtained by adapting (Nadjahi et al., 2020b, Theorems 4 and 5). We provide the detailed derivations below.

Proof of Theorem 3.4. Let (α, β) e M e M with respective empirical approximations ˆαn, ˆβn over n samples. By using the definition of SUOT, the triangle inequality and the assumed sample complexity of UOT for univariate measures, we show that

E SUOT(α, β) SUOT(ˆαn, ˆβn) (15)

UOT(θ α, θ β) UOT(θ ˆαn, θ ˆβn) dσ(θ) (16)

UOT(θ α, θ β) UOT(θ ˆαn, θ ˆβn) dσ(θ) (17)

Sd 1 E UOT(θ α, θ β) UOT(θ ˆαn, θ ˆβn) dσ(θ) (18)

Sd 1 κ(n)dσ(θ) = κ(n) , (19)

which completes the proof for the first setting.

Next, let α e M with corresponding empirical approximation ˆαn. Then, using the definition of SUOT, the triangle inequality (w.r.t. integral) and the assumed convergence rate in UOT,

E |SUOT(ˆαn, α)| (20)

Sd 1 UOT(θ ˆαn, θ α)dσ(θ) E Z

UOT(θ ˆαn, θ α) dσ(θ) (21)

Sd 1 E UOT(θ ˆαn, θ α) dσ(θ) Z

Sd 1 ξ(n)dσ(θ) = ξ(n) . (22)

Published in Transactions on Machine Learning Research (12/2024)

Corollary A.6. Assume for µ M+(R), E|UOT(µ, ˆµn)| ξ(n) and that for p 1, UOT1/p satisfies non-negativity, symmetry and the triangle inequality on M+(R) M+(R). Then, for α, β M+(Rd),

E SUOT1/p(α, β) SUOT1/p(ˆαn, ˆβn) 2ξ(n)1/p. (23)

Proof. Since UOT1/p satisfies non-negativity, symmetry and the triangle inequality on M+(R) M+(R), SUOT1/p verifies these three metric properties on M+(Rd) M+(Rd) by Proposition 3.3, and we can derive its sample complexity as follows. For any α, β in M+(Rd) with respective empirical approximations ˆαn, ˆβn, applying the triangle inequality yields for p [1, + ), UOT1/p(α, β) UOT1/p(ˆαn, ˆβn) UOT1/p(ˆαn, α) + UOT1/p(ˆβn, β) . (24)

Taking the expectation of (24) with respect to ˆαn, ˆβn gives,

E SUOT1/p(α, β) SUOT1/p(ˆαn, ˆβn) E|SUOT1/p(ˆαn, α)| + E|SUOT1/p(ˆβn, β)| (25)

{E |SUOT(ˆαn, α)|}1/p + {E|SUOT(ˆβn, β)|}1/p (26)

ξ(n)1/p + ξ(n)1/p = 2ξ(n)1/p , (27)

where (26) is immediate if p = 1, and results from applying Hölder s inequality on Sd 1 if p > 1, and (27) follows from (22).

A.5 Comparison of SUOT, USOT, SOT, and proof of Theorem 3.10 and Theorem 3.11

In this section, we establish several bounds to compare SUOT, USOT and SOT on the space of compactlysupported measures. We provide the detailed derivations and auxiliary lemmas needed for the proofs. The Partial OT setting (i.e., Dφ = ρTV) is treated in Appendix A.7.

Proof of Theorem 3.10. Theorem 3.10 is a direct consequence from Theorems A.7 and A.8.

Theorem A.7. Let (α, β) M+(Rd) M+(Rd). Then, SUOT(α, β) USOT(α, β).

Proof. To show that SUOT(α, β) USOT(α, β), we use a sub-optimality argument. Let π be the solution USOT(α, β) and denote by (π1, π2) the marginals of π. For any θ Sd 1, denote by πθ the solution of OT(θ π1, θ π2). By definition of USOT, the marginals of πθ are given by (θ π1, θ π2). Since the sequence (πθ)θ is suboptimal for the problem SUOT(α, β), one has

SUOT(α, β) Z

Z C1dπθ + Dφ1(θ π1|θ α) + Dφ2(θ π2|θ β) dσ(θ) (28)

Z C1dπθdσ(θ) + Dφ1(π1|α) + Dφ2(π2|β) (29)

= USOT(α, β), (30)

where the second inequality results from Lemma A.10, and the last equality follows from the definition of USOT(α, β).

Theorem A.8. Let (α, β) M+(Rd) M+(Rd). Additionally, let p [1, + ) and assume that for all x, y Rd, θ Sd 1, C1 θ (x), θ (y) Cd(x, y). Then, USOT(α, β) UOT(α, β).

Proof. By (Bonnotte, 2013, Proposition 5.1.3), SOT(µ, ν) OT(µ, ν) as C1 θ (x), θ (y) Cd(x, y) for all x, y Rd, θ Sd 1. Let π be the solution of UOT(α, β) with marginals (π1, π2). These marginals are

Published in Transactions on Machine Learning Research (12/2024)

sub-optimal for USOT(α, β), we have

USOT(α, β) SOT(π1, π2) + Dφ1(π1|α) + Dφ2(π2|β) , (31)

OT(π1, π2) + Dφ1(π1|α) + Dφ2(π2|β) , (32)

= UOT(α, β) , (33)

where the last equality is obtained because π is optimal in UOT(α, β).

Proof of Theorem 3.11.

Theorem A.9. Let X be a compact subset of Rd with radius R and consider α, β M+(X). Additionally, let p [1, + ) and assume C1(x, y) = |x y|p for (x, y) R and Cd(x, y) = x y p for (x, y) Rd. Let ρ > 0 and assume Dφ1 = Dφ2 = ρKL. Then, UOT(α, β) c SUOT(α, β)1/(d+1), where c = c(m(α), m(β), ρ, R) is a non-decreasing function of m(α) and m(β).

Proof. We adapt the proof of (Bonnotte, 2013, Lemma 5.1.4), which establishes a bound between OT and SOT. The first step consists in bounding from above the distance between two regularized measures.

Let ψ : Rd R+ be a smooth and radial function such that supp(ψ) Bd(0, 1) and R

Rd ψ(x)d Leb(x) = 1. Let ψλ(x) = λ dψ(x/λ). For any function f defined on Rs (s 1), denote by F[f] the Fourier transform of f defined for x Rs as F[f](x) = R

Rs f(w)e i w,x dw. Let αλ = α φλ and βλ = β φλ where is the convolution operator. Let (f, g) such that f g Cd. By using the isometry properties of the Fourier transform and the definition of ψλ, then representing the variables with polar coordinates, we have Z

Rd φ (f(x))dαλ(x) = Z

Rd F[φ f](w)F[α](w)F[ψ](λw)dw (34)

0 F[φ f](rθ)F[α](rθ)F[ψ](λr)rd 1drdθ . (35)

Since φ f is a real-valued function, F[φ f] is an even function, then Z

Rd φ (f(x))dαλ(x) (36)

R F[φ f](rθ)F[α](rθ)F[ψ](λr)|r|d 1drdθ (37)

R F[φ f](rθ)F[θ α](r)F[ψ](λr)|r|d 1drdθ (38)

R F[φ f](rθ)

R e irudθ α(u)

F[ψ](λr) |r|d 1 drdθ (39)

R φ (f(x))e ir(u+ θ,x )dθ α(u)

F[ψ](λr) |r|d 1 dxdrdθ . (40)

Equation 38 follows from the property of push-forward measures, (39) results from the definition of the Fourier transform and u [ R, R], and (40) results from the definition of the Fourier transform and Fubini s theorem. By making a change of variables (x becomes x uθ), we obtain Z

Rd φ (f(x))dαλ(x) (41)

R φ (f(x uθ))e ir θ,x dθ α(u)F[ψ](λr) |r|d 1 dxdrdθ (42)

R φ (f(x uθ))e ir θ,x dθ α(u)F[ψ](λr) |r|d 1 dxdrdθ , (43)

Published in Transactions on Machine Learning Research (12/2024)

where (43) follows from supp(α) Bd(0, R). Indeed, this implies that supp(αλ) Bd(0, R + λ), thus the domain of x 7 φ f(x uθ) is contained in Bd(0, 2R + λ).

Similarly, one can show that Z

Rd φ (g(y))dβλ(y) (44)

R φ (g(y uθ))e ir θ,y dθ β(u)F[ψ](λr) |r|d 1 dydrdθ . (45)

By (43) and (45), we obtain

Rd φ (f(x))dαλ(x) + Z

Rd φ (g(y))dβλ(y) (46)

R φ (f(x uθ))dθ α(u)

R φ (g(x uθ))dθ β(u) o e ir θ,x F[ψ](λr) |r|d 1 dxdrdθ , (47)

R φ (f(x uθ))dθ α(u) + Z R

R φ (g(x uθ))dθ β(y)

e ir θ,x dx

Bd(0,2R+λ) UOT(θ α, θ β)|e ir θ,x |dx (49)

(2R + λ)d UOT(θ α, θ β) , (50)

where (49) is obtained by taking the supremum of (48) over the set of potentials ( f, g) such that for u [ R, R], (x, θ) Bd(0, 2R + λ) Sd 1, f(u) = f(x uθ), g(u) = g(x uθ), which is included in the set of potentials (f , g ) s.t. f : R R, g : R R and f g C1. Therefore,

Rd φ (f(x))dαλ(x) + Z

Rd φ (g(y))dβλ(y)

2(2R + λ)d A(Sd 1) Z

Sd 1 UOT(θ α, θ β)dσ(θ) Z

R λ d F[ψ](r) |r|d 1 dr (51)

c(2R + λ)dλ d SUOT(α, β) , (52)

where c = 1

R F[ψ](r) |r|d 1 dr. We deduce from the dual formulation of UOT (3) and (49) that,

UOT(αλ, βλ) c(2R + λ)dλ d SUOT(α, β) . (53)

The last step of the proof consists in relating UOT(αλ, βλ) with UOT(α, β). For any (f, g) such that f g Cd, we have Z

Rd φ (f(x))dα(x) + Z

Rd φ (g(y))dβ(y) UOT(αλ, βλ) (54)

Rd φ (f(x))dα(x) + Z

Rd φ (g(x))dβ(x) Z

Rd φ (f(x))dαλ(x) Z

Rd φ (g(y))dβλ(y) (55)

Rd{φ (f(x)) ψλ φ (f(x))}dα(x) + Z

Rd{φ (g(y)) ψλ φ (g(y))}dβ(y) . (56)

Published in Transactions on Machine Learning Research (12/2024)

φ (f(x)) ψλ φ (f(x)) = λ d Z

φ (f(x)) φ (f(y)) ψ x y

φ (f(x)) φ (f(y)) ψ x y

Since Dφ = ρKL, then for z R, φ (z) = ρ(1 e z/ρ), so for (x, y) Rd Rd,

φ (f(x)) φ (f(y)) = ρ(e f(y)/ρ e f(x)/ρ) (59)

By Lemma A.13, the potentials (f, g) are bounded by constants depending on m(α), m(β). Therefore, we can bound (59) as follows. |φ (f(x)) φ (f(y))| e λ /ρ x y . (60)

We thus derive the following upper-bound on (58).

φ (f(x)) ψλ φ (f(x)) λ de λ /ρ Z

Rd x y ψ x y

λ d+1e λ /ρ Z

By doing the change of variables z = (y x)/λ and using the fact that ψ is a radial function, we obtain

φ (f(x)) ψλ φ (f(x)) λe λ /ρ Z

Rd z ψ(z)dz . (63)

Similarly, using the bounds on g in Lemma A.13, one can show that

|φ (g(x)) φ (g(y))| e(λ +R)/ρ x y , (64)

φ (g(x)) ψλ φ (g(x)) λe(λ +R)/ρ Z

Rd z ψ(z)dz . (65)

We conclude that, Z

Rd φ (f(x))dα(x) + Z

Rd φ (g(y))dβ(y) UOT(αλ, βλ) e λ /ρ + e(λ +R)/ρ λM1(ψ) , (66)

where M1(ψ) R

Rd z ψ(z)dz. Taking the supremum on both sides over (f, g) such that f g Cd yields,

UOT(α, β) UOT(αλ, βλ) e λ /ρ + e(λ +R)/ρ λM1(ψ) . (67)

Finally, by combining (53) with the above inequality, we obtain

UOT(α, β) e λ /ρ + e(λ +R)/ρ λM1(ψ) + c(2R + λ)dλ d SUOT(α, β) (68)

c λ 1 + (2R + λ)dλ (d+1)SUOT(α, β) , (69)

where c is a constant satisfying c c and c e λ /ρ + e(λ +R)/ρ M1(ψ). By choosing λ = Rd/(d+1)SUOT(α, β)1/(d+1), (69) becomes

UOT(α, β) c Rd/(d+1)SUOT(α, β)1/(d+1) 1 + (2R + λ)d R d . (70)

We conclude using that SUOT(α, β) is bounded from above. Indeed, SUOT(α, β) ρ(m(α) + m(β)) since on the one hand, π is suboptimal in (3) thus UOT(α, β) ρ(m(α) + m(β)), and on the other hand, m(α) = m(θ α) for any θ Sd 1. This yields λ Rd/(d+1)ρ1/(d+1)(m(α) + m(β))1/(d+1), hence

UOT(α, β) c(m(α), m(β), ρ, R)SUOT(α, β)1/(d+1) , (71)

where c(m(α), m(β), ρ, R) is a non-decreasing function of m(α) and m(β).

Published in Transactions on Machine Learning Research (12/2024)

Additional Lemmas.

Lemma A.10. For any θ Sd 1 and α, β M+(Rd), Dφ(θ α|θ β) Dφ(α|β).

Proof. For α, β M+(Rs) with s 1, the dual characterization of φ-divergences reads (Liero et al., 2018, Theorem 2.7)

Dφ(α|β) = sup f E(Rs)

Rs φ (f(x))dβ(x) Z

Rs f(x)dα(x),

where E(Rs) denotes the space of lower semi-continuous functions from Rs to R {+ }. Therefore, for any θ Sd 1 and α, β M+(Rd),

Dφ(θ α|θ β) = sup f E(R)

R φ (f(t))d(θ β)(t) Z

R f(t)d(θ α)(t) (72)

= sup g:Rd R s.t. f E(R), g=f θ

Rd φ (g(x))dβ(x) Z

Rd g(x)dα(x) (73)

where (73) results from the definition of push-forward measures. We conclude the proof by observing that the supremum in (73) is taken over a subset of E(Rd).

Lemma A.11. (Santambrogio, 2015, Proposition 1.11) Let p [1, + ) and assume Cd(x, y) = x y p. Let α, β with compact support, such that Cd(x, y) Rp for (x, y) supp(α) supp(β). Then without loss of generality the dual potentials (f, g) of UOT(α, β) satisfy f(x) [0, R] and g(y) [ R, R].

Lemma A.12. (Séjourné et al., 2022b, Proposition 2) Define the translation-invariant dual formulation

UOT(α, β) = sup f g Cd sup λ R

Z φ 1(f + λ)dα + Z φ 2(g λ)dβ. (74)

Let ρ > 0 and assume Dφ1 = Dφ2 = ρKL. Take optimal potentials (f, g) in (74). Then optimal potentials in (3) are given by (f + λ (f, g), g λ (f, g)), where the optimal translation λ reads

Sβ ρ(g) Sα ρ (f)

, Sα ρ (f) ρ log Z e f/ρdα,

and we call Sα ρ (f) the soft-minimum of f. When m(α) = 1 and m f(x) M, then m Sα ρ (f) M.

Lemma A.13. Assume (α, β) have compact support such that, for (x, y) supp(α) supp(β), C(x, y) R. Then, without loss of generality, one can restrict the optimization of the dual formulation (3) of UOT(α, β) over the set of potentials satisfying for (x, y) supp(α) supp(β),

f(x) [λ , λ + R] , g(y) [ λ R, λ + R] ,

where λ [ R + ρ

m(β)]. In particular, one has

f(x) [ R + ρ

m(β)] , g(y) [ 3R

Proof. Consider the translation-invariant dual formulation (74): if (f, g) are optimal, then for any λ R, (f +λ, g λ) are also optimal. We leverage the structure of the dual constraint f g Cd with Lemma A.11. Since for (x, y) supp(α) supp(β), Cd(x, y) R, then without loss of generality, f(x) [0, R] and g(y) [ R, R]. The potentials (f, g) are optimal for the translation-invariant dual energy, and we need a bound for the original dual functional (3). To this end, we leverage Lemma A.12 to compute the optimal

Published in Transactions on Machine Learning Research (12/2024)

translation, such that (f, g) = (f+λ (f, g), g λ (f, g)). Let α = α/m(α) and β = β/m(β) be the normalized probability measures. The translation can be written as,

λ (f, g) = 1

S β ρ(g) S α ρ (f)

where the functional Sα ρ is defined in Lemma A.12. Since α and β are probability measures, then by (Genevay

et al., 2019, Proposition 1), f(x) [0, R] and g(x) [ R, R] respectively imply S α ρ (f) [0, R] and S β ρ(g)

[ R, R]. Combining these bounds on S α ρ (f), S β ρ(g) with the expression of λ (f, g), (75) yields the desired bounds on the optimal potentials (f, g) of the dual formulation (3).

A.6 Metrizing weak convergence: Proof of Theorem 3.12

The Kullback-Leibler setting is treated here. The Partial OT setting (i.e., Dφ = ρTV) is treated in Appendix A.7.

Proof. Let (αn) be a sequence of measures in M+(X) and α M+(X), where X Rd is compact with radius R > 0. First, we assume that αn α. Then, by (Liero et al., 2018, Theorem 2.25), under our assumptions, αn α is equivalent to limn + UOT(αn, α) = 0. This implies that limn + SUOT(αn, α) = 0 and limn + USOT(αn, α) = 0, since by Theorem 3.11 and non-negativity of SUOT (Proposition 3.3),

0 SUOT(αn, α) USOT(αn, α) UOT(αn, α) .

Conversely, assume either that limn + SUOT(αn, α) = 0 or limn + USOT(αn, α) = 0. First assume there exists M > 0 such that for large enough n N , m(αn) M, then by Theorem 3.11, there exists c > 0 such that UOT(αn, α) c SUOT(αn, α) 1/(d+1). Since c is doesn t depend on the masses (m(αn), m(α)), it does not depend on n. By Theorem 3.11, it yields metric equivalence between SUOT, USOT and UOT, thus limn + UOT(αn, α) = 0. By (Liero et al., 2018, Theorem 2.25), we eventually obtain αn α, which is the desired result.

The remaining step thus consists in proving that the sequence of masses (m(αn))n N is indeed uniformly bounded by M > 0 for large enough n. Note that for any (α, β) M+(Rd), one has UOT(α, β) ρ( p

m(β))2. Indeed one has UOT(α, β) D(λ, λ), where D denotes the dual functional (3) and λ = ρ

m(β). Note that the pair (λ, λ) are feasible dual potentials for the constraint f g Cd, because the cost Cd is positive in our setting. The property of push-forwards measures means that for any θ Sd 1, one has m(θ α) = m(α). Therefore, we obtain the following bounds for n large enough.

USOT(αn, α) SUOT(αn, α) Z

m(θ α) 2 dσ(θ),

Hence, limn + SUOT(αn, α) = 0 or limn + USOT(αn, α) = 0 implies limn + m(αn) = m(α). In other terms the mass of sequence converges and is thus uniformly bounded for large enough n. Since we proved that m(αn) < M and m(α) is finite, it ends the proof.

A.7 Properties of sliced partial OT

We provide in this subsection the proofs of Proposition 3.3, Theorems 3.11 and 3.12 for the setting of sliced partial OT. To this end, we rely on a formulation for SUOT and USOT when Dφ1 = Dφ2 = ρTV, which we prove below. Equation 76 is proved in (Piccoli & Rossi, 2014) and can then be applied to SUOT: we include it for completeness. Equation 77 is our contribution and is specific to USOT.

Published in Transactions on Machine Learning Research (12/2024)

Lemma A.14. Let ρ > 0 and assume Dφ1 = Dφ2 = ρTV and Cd(x, y) = ||x y||. Then, for any (α, β) M+(Rd),

UOT(α, β) = sup f E

Z f(x)d(α β)(x), (76)

where E = {f : Rd R, ||f||Lip 1, ||f|| ρ}, ||f|| supx Rd |f(x)| and ||f||Lip sup(x,y) Rd |f(x) f(y)|

Furthermore, for C1(x, y) = |x y| and an empirical approximation ˆσN = 1

N PN i=1 δθi of σ, one has

USOT(α, β) = sup (fθ) E

Sd 1 fθ(θ (x))dˆσN(θ) d(α β)(x) , (77)

E = { θ supp(ˆσN), fθ : R R, ||fθ||Lip 1, || Z

Sd 1 fθ θ dˆσN(θ)|| ρ},

and the Lipschitz norm here is defined w.r.t. C1 as ||f||Lip sup(x,y) Rd |f(x) f(y)|

Proof. We start with the formulation of Equation 3 and Theorem 3.9. For USOT one has

USOT(α, β) = sup fθ( ) gθ( ) C1

Sd 1 fθ(θ (x))dσN(θ) dα(x)

Sd 1 gθ(θ (y))dσN(θ) dβ(y) .

When Dφ = ρTV, the function φ reads φ (x) = x for x [ ρ, ρ], φ (x) = ρ when x ρ, and φ (x) = otherwise. Noting favg(x) = R

Sd 1 fθ(θ (x))dσN(θ) and gavg(x) = R

Sd 1 gθ(θ (x))dσN(θ). This formula on φ imposes favg(x) ρ and gavg(x) ρ. Furthermore, since we perform a supremum w.r.t. (favg, gavg) where φ attains a plateau, then without loss of generality, we can impose the constraint favg(x) ρ and gavg(x) ρ, as it will have no impact on the optimal dual functional value. Thus we have that ||favg|| ρ and ||gavg|| ρ. To obtain the Lipschitz property, we use the constraint that fθ( ) gθ( ) C1 for any θ supp(σN), as well as (Santambrogio, 2015, Proposition 3.1). Thus by using c-transform for the cost C1(x, y) = |x y|, we can take w.l.o.g fθ( ) = gθ( ) with fθ( ) a 1-Lipschitz function. Thus w.l.o.g we can perform the supremum over (fθ)θ E, and rephrase the functional as desired, since we have that φ (favg) = favg.

The proof for UOT is exactly the same, except that our inputs are (f, g) instead of (fθ, gθ).

We can now prove Proposition 3.3, Theorems 3.11 and 3.12 in the setting of sliced Partial OT. All those results are summarized in the following statement.

Theorem A.15. (Properties of Sliced Partial OT) Assume C1(x, y) = |x y| and Dφ1 = Dφ2 = ρTV. Then, USOT satisfies the triangle inequality. Additionally, for any (α, β) M+(X) where X Rd is compact with radius R, UOT(α, β) c(ρ, R) SUOT(α, β)1/(d+1), and USOT and SUOT both metrize the weak convergence.

Published in Transactions on Machine Learning Research (12/2024)

Proof of Sliced Partial OT properties. First we prove that in that setting USOT is a metric. Reusing Lemma A.14, we have that for any measures (α, β, γ)

USOT(α, γ) = sup (fθ)θ E

Sd 1 fθ(θ (x))dσN

= sup (fθ)θ E

Sd 1 fθ(θ (x))dσN

d(α β + β γ)(x)

sup (fθ)θ E

Sd 1 fθ(θ (x))dσN

+ sup (fθ)θ E

Sd 1 fθ(θ (x))dσN

= USOT(α, β) + USOT(β, γ).

Note that reusing Lemma A.14, we have that SUOT is a sliced integral probability metric over the space of bounded and Lipschitz functions. More precisely, we satisfy the assumptions of (Nadjahi et al., 2020b, Theorem 3), so that one has UOT(α, β) c(ρ, R)(SUOT(α, β))1/(d+1).

To prove that USOT and SUOT metrize the weak* convergence, the proof is very similar to that of Theorem 3.12 detailed above. Assuming that αn α implies SUOT(αn, α) 0 and USOT(αn, α) 0 is already proved in Appendix A.6. To prove the converse, the proof is also the same, i.e., we use the property that SUOT, USOT and UOT are equivalent metrics, which holds as we assumed that supports of (α, β) are compact in a ball of radius R. Note that since the bound UOT(α, β) c(ρ, R)(SUOT(α, β))1/(d+1) holds independently of the measure s masses, we do not need to uniformly bound m(αn), compared to the KL setting of Theorem 3.12.

B Additional details for Section 4

B.1 Postponed Proofs for Section 4

Proof of Proposition 4.1. Our goal is to compute the first order variation of the SUOT functional. Given that SUOT(α, β) = R

Sd 1 UOT(θ α, θ β)dσ(θ), one can apply Proposition B.1 slice-wise. Since measures are assumed to have compact support, one can apply the dominated convergence theorem and differentiate under the integral sign. Furthermore, the translation-invariant formulation in the setting of SUOT reads

SUOT(α, β) = Z

Sd 1 sup fθ gθ C1

Z φ fθ( ) + λθ dθ α (78)

+ Z φ gθ( ) λθ dθ β

In the setting where φ is smooth and strictly concave (such as Dφ = ρKL), there always exists a unique optimal λ θ. Furthermore, one can apply the envelope theorem such that the Fréchet differential w.r.t. to a perturbation (rθ, sθ) of (fθ, gθ) reads Z

" Z rθ( ) φ fθ( ) + λ θ(fθ, gθ) dθ α (80)

+ Z sθ( ) φ gθ( ) λ θ(fθ, gθ) dθ β

αθ = φ fθ( ) + λ (fθ, gθ) α, βθ = φ gθ( ) λ (fθ, gθ) β ,

Published in Transactions on Machine Learning Research (12/2024)

yields the desired result, i.e., the first order variation is Z

" Z rθ( )d(θ αθ) + Z sθ( )d(θ βθ)

Proof of Proposition 4.2. Our goal is to compute the first order variation of the USOT functional. First, we leverage Theorem 3.9 such that USOT reads

USOT(α, β) = sup fθ( ) gθ( ) C1

Sd 1 fθ(θ (x))dˆσK(θ) dα(x) (83)

Sd 1 gθ(θ (y))dˆσK(θ) dβ(y) (84)

= sup fθ( ) gθ( ) C1

Z φ 1 favg(x) dα(x) + Z φ 2 gavg(y) dβ(y), (85)

favg(x) = Z

Sd 1 fθ(θ (x))dˆσK(θ), gavg(y) = Z

Sd 1 gθ(θ (y))dˆσK(θ).

From this, we derive the translation-invariant formulation as follows.

USOT(α, β) = sup fθ( ) gθ( ) C1 sup λ R

Z φ 1 favg(x) + λ dα(x) (86)

+ Z φ 2 gavg(y) λ dβ(y), (87)

For smooth and strictly concave φ , there exists a unique λ (favg, gavg) attaining the supremum. Furthermore, one can apply the enveloppe theorem and differentiate under the integral sign (since the support is compact). Consider perturbations (rθ( ), sθ( )) of (fθ( ), gθ( )). Write

ravg(x) = Z

Sd 1 rθ(θ (x))dˆσK(θ), savg(y) = Z

Sd 1 sθ(θ (y))dˆσK(θ).

Given that φ 1(favg + ravg) = φ 1(favg) + ravg φ 1(favg) + o(||ravg|| ), the first order variation reads Z ravg(x) φ 1 favg(x) + λ (favg, gavg) dα(x) (88)

+ Z savg(y) φ 2 gavg(y) λ (favg, gavg) dβ(y). (89)

Then we define

α = φ 1(favg + λ (favg, gavg))α, β = φ 2(gavg λ (favg, gavg))β,

such that the first order variation reads Z ravg(x)d α(x) + Z savg(y)d β(y). (90)

One can then explicit the definition of (ravg, savg), such that it reads Z

Z rθ(θ (x))d α(x) + Z

Z sθ(θ (y))d β(y) (91)

Z rθdθ α(x) + Z

Z sθdθ β(y). (92)

By optimizing the above over the constraint set {rθ sθ C1}, we identify the computation of SOT( α, β), which concludes the proof.

Published in Transactions on Machine Learning Research (12/2024)

B.2 Frank-Wolfe methodology for computing UOT

Background: FW for UOT. Our approach to compute SUOT and USOT takes inspiration from the construction of (Séjourné et al., 2022b). It consists in applying a Frank-Wolfe (FW) procedure over the dual formulation of UOT. Such approach is equivalent to solve a sequence of balanced OT problems between measures ( α, β) which are iterative renormalizations of (α, β). While the idea holds in wide generality, it is especially efficient in 1D where OT has low algorithmic complexity, and we reuse it in our sliced setting.

FW algorithm consists in optimizing a functional H over a compact, convex set C by optimizing its linearization H. Given a current iterate xt of FW algorithm, one computes rt+1 arg maxr C H(xt), r , and performs a convex update xt+1 = (1 γt+1)xt + γt+1rt+1. One typically chooses the learning rate γt = 2 2+t. This yields the routine FWStep of Section 4 which is detailed below.

Algorithm 4 FWStep(f, g, r, s, γ) Input: α, β, f, g, γ Output: Normalized measures (α, β) as in (96)

f(x) (1 γ)f(x) + γr(x) g(y) (1 γ)g(y) + γs(y) Return (f, g)

In the setting of UOT, one would take C = {f g Cd}. However, this set is not compact as it contains (λ, λ) for any λ R. Thus, Séjourné et al. (2022b) propose to optimise a translation-invariant dual functional H(f, g; α, β) supλ R D(f + λ, g λ; α, β), with D defined in (3). Similar to the balanced OT dual, one has H(f + λ, g λ; α, β) = H(f, g; α, β), thus one can apply (Santambrogio, 2015, Proposition 1.11) to assume w.l.o.g. that, e.g., f(0) = 0 and restrict to a compact set of functions. We emphasize that FW algorithm is well-posed to optimize H, but not D.

Note that once we have the dual variables (f, g) maximizing H, we retrieve optimal dual variables maximizing D as (f + λ (f, g), g λ (f, g)), where λ (f, g) arg maxλ R D(f + λ, g λ; α, β). The KL setting where Dφ1 = ρ1KL and Dφ2 = ρ2KL is especially convenient, because λ (f, g) admits a closed form, which avoids iterative subroutines to compute it. In that case, it reads

λ (f, g) = ρ1ρ2 ρ1 + ρ2 log

R e f(x)/ρ1dα(x) R e g(y)/ρ2dβ(y)

We summarize the FW algorithm for UOT in the proposition below. We refer to (Séjourné et al., 2022b) for more details on the algorithm and pseudo-code. We adapt this approach and result for SUOT and USOT.

Proposition B.1. (Séjourné et al., 2022b) Assume φ is smooth. Given current iterates (f (t), g(t)), the linear FW oracle of UOT(α, β) is OT( α(t), β(t)), where α(t) = φ (f (t) + λ (f (t), g(t)))α and β(t) = φ (g(t) λ (f (t), g(t)))β. In particular, one has m( α(t)) = m( β(t)), thus the balanced OT problem always has finite value. More precisely, the FW update reads

(f (t+1), g(t+1)) = (1 γ(t+1))(f (t), g(t)) + γ(t+1)(r(t+1), s(t+1)), (94)

where (r(t+1), s(t+1)) arg max r s Cd

Z r(x)d α(t)(x) + Z s(y)d β(t)(y). (95)

Recall that the in KL setting one has φ i (x) = ρi(1 e x/ρi), thus φ i (x) = e x/ρi. Thus in that case one normalizes the measures as

f + λ (f, g)

where λ is defined in (93).

This defines the Norm routine in Algorithm 1.

Published in Transactions on Machine Learning Research (12/2024)

B.3 Implementation of Sliced OT to return dual potentials

Recall from Section 4, Algorithms 2 and 3 and more precisely, Propositions 4.1 and 4.2, that FW linear oracle is a sliced OT program, i.e., a set of OT problems computed between univariate distributions of M+(R). Therefore, a key building block of our algorithm is to compute the loss and dual variables of these univariate OT problems. We explain below how one can compute the sliced OT loss and dual potentials. The computation of the loss consists in implementing closed formulas of OT between univariate distributions, as detailed in (Santambrogio, 2015, Proposition 2.17). More precisely, when C1(x, y) = |x y|p and (µ, ν) M+(R), then

OT(µ, ν) = Z 1

0 |F [ 1] µ (t) F [ 1] ν (t)|pdt, (97)

where F [ 1] µ denotes the inverse cumulative distribution function (ICDF) of µ.

Algorithm 5 Sliced OTLoss(α, β, {θ}, p)

Input: α, β, projections {θ}, exponent p Output: OT(θ α, θ β) as in (97) for θ {θ} do

Project support of θ α and θ β Sort weights of (θ α, θ β) and support (θ (x)), (θ (y)) s.t. support is non-decreasing Compute ICDF of θ α and θ β Compute OT(θ α, θ β) as in (97) with exponent p end for

To compute dual potentials using backpropagation, one computes the sliced OT losses (using Algorithm 5) then calls the backpropagation w.r.t to inputs (α, β), because their gradients are optimal dual potentials (Santambrogio, 2015, Proposition 7.17). We describe this procedure in Algorithm 6.

Algorithm 6 Sliced OTPotentials Backprop(α, β, {θ}, p)

Input: α, β, projections {θ}, exponent p Output: Dual potentials (fθ, gθ) solving OT(θ α, θ β) Enable gradients w.r.t. (θ α, θ β) Call Sliced OTLoss(α, β, {θ}, p) Sum (but do not average) losses L = P

θ OT(θ α, θ β). Backpropagate L w.r.t. (α, β) Return (fθ, gθ) as gradients of L w.r.t. (α, β).

The implementation of the dual potentials using 1D closed forms relies on the north-west corner rule principle, which can be vectorized in Py Torch in order to be computed in parallel. The contribution of our implementation thus consists in making such algorithm GPU-compatible and allowing for a parallel computation for every slice simultaneously. We stress that this constitutes a non-trivial piece of code, and we refer the interested reader to the code in our supplementary material for more details on the implementation.

B.4 Output optimal sliced marginals

In all our algorithms, we focus on dual formulations of SUOT and USOT, which optimize the dual potentials. However, one might want the output variables of the primal formulation (See Definition 3.6). In particular, the marginals of optimal transport plans are interesting because they are interpreted as normalized versions of inputs (α, β) where geometric outliers have been removed. We detail where this interpretation comes from in the setting of UOT, and then give how it is adapted to SUOT and USOT. In particular, we justify that the Norm routine suffices to compute them.

Published in Transactions on Machine Learning Research (12/2024)

Figure 5: |SUOT(α, β) \ SUOTt| and |USOT(α, β) \ USOTt| against iteration t, where \ SUOTt, \ USOTt are the estimated SUOT, USOT using t FW iterations. Plots are in log-scale. All figures are issued from the same run, but zoomed on a subset of first iterations: (left) 1000 iterations of FW, (middle) 200 iterations, (right) 20 iterations.

Case of UOT. We focus on the Dφi = ρi KL. As per (Liero et al., 2018, Equation 4.21), we have at optimality that the optimal transport π plan solving UOT(α, β) as in (2) has marginals (π 1, π 2) which read π 1 = e f /ρ1α and π 2 = e g /ρ2β, where (f , g ) are the optimal dual potentials solving (3). Since on supp(π ) one also has f (x) + g (y) = Cd(x, y), if the transportation cost Cd(x, y) is large (i.e., we are matching a geometric outlier), so are f (x) and g (y), and eventually the weights π 1(x) and π 2(y) are small, hence the interpretation of the geometric normalization of the measures. Note that in that case, one obtain (π 1, π 2) by calling Norm(α, β, f , g , ρ1, ρ2).

Case of SUOT. Since SUOT(α, β) consists in integrating UOT(θ α, θ β) w.r.t. σ, it shares many similarities with UOT. For any θ, we consider πθ and (fθ, gθ) solving the primal and dual formulation of UOT(θ α, θ β). The marginals of πθ are thus given by (e fθ/ρ1α, e gθ/ρ2β). In particular, we retrieve the observation made in Figure 1 that the optimal marginals change for each θ. In that case we call for each θ the routine Norm(α, β, fθ, gθ, ρ1, ρ2).

Case of USOT. Recall that the optimal marginals (π1, π2) in USOT(α, β) do not depend on θ, contrary to SUOT(α, β). Leveraging the dual formulation of Theorem 3.9, and looking at the Lagrangian which is defined in the proof of Theorem 3.9 (see Appendix A.2), we have the optimality condition that π1 = e favg/ρ1α and π2 = e gavg/ρ2β. Thus in that case, calling Norm(α, β, favg, gavg, ρ1, ρ2) yields the desired marginals.

B.5 Convergence of Frank-Wolfe iterations: Empirical analysis

We display below an experiment on synthetic dataset to illustrate the convergence of Frank-Wolfe iterations. We also provide insights on the number of iterations that yields a reasonable approximation: a few iterations suffices in our practical settings, typically F = 20.

The results are displayed in Figure 5. We consider the empirical distributions (α, β) computed over respectively, N = 400 and M = 500 samples over the unit hypercube [0, 1]d, d = 10. Moreover, β is slightly shifted by a vector of uniform coordinates 0.5 1d. We choose ρ = 1 and report the estimation of SUOT(α, β) and USOT(α, β) through Frank-Wolfe iterations. We estimate the true values by running F = 5000 iterations, and display the difference between the estimated score and the true values. Appendix B.5 shows that numerical precision is reached in a few tens of iterations. As learning tasks do not usually require an estimation of losses up to numerical precision, we think that it is hence reasonable to take F 20 in numerical applications.

Published in Transactions on Machine Learning Research (12/2024)

Table 2: Dataset characteristics.

BBCSport Movies Goodreads genre Goodreads like

Doc 737 2000 1003 1003 Train 517 1500 752 752 Test 220 500 251 251 Classes 5 2 8 2 Mean words by doc 116 54 182 65 1491 538 1491 538 Median words by doc 104 175 1518 1518 Max words by doc 469 577 3499 3499

C Additional details on Section 5

C.1 Document classification: Technical details and additional results

C.1.1 Datasets

We sum up the statistics of the different datasets in Table 2.

BBCSport. The BBCSport dataset (Kusner et al., 2015) contains articles between 2004 and 2005, and is composed of 5 classes. We average over the 5 same train/test split of (Kusner et al., 2015). The dataset can be found in https://github.com/mkusner/wmd/tree/master.

Movie Reviews. The movie reviews dataset (Pang et al., 2002) is composed of 1000 positive and 1000 negative reviews. We take five different random 75/25 train/test split. The data can be found in http: //www.cs.cornell.edu/people/pabo/movie-review-data/.

Goodreads. This dataset, proposed in (Maharjan et al., 2017), and which can be found at https: //ritual.uh.edu/multi_task_book_success_2017/, is composed of 1003 books from 8 genres. A first possible classification task is to predict the genre. A second task is to predict the likability, which is a binary task where a book is said to have success if it has an average rating 3.5 on the website Goodreads (https://www.goodreads.com). The five train/test split are randomly drawn with 75/25 proportions.

C.1.2 Technical Details

All documents are embedded with the Word2Vec model (Mikolov et al., 2013) in dimension d = 300. The embedding can be found in https://drive.google.com/file/d/0B7Xk Cwp I5KDYNl NUTTl SS21p Qm M/view? resourcekey=0-wj GZd NAUop6Wyk Tt Mip30g.

In this experiment, we report the results averaged over 5 random train/test split. For discrepancies which are approximated using random projections, we additionally average the results over 3 different computations, and we report this standard deviation in Table 1. Furthermore, we always use 500 projections to approximate the sliced discrepancies. For Frank-Wolfe based methods, we use 10 iterations, which we found to be enough to have a good accuracy. We added an ablation of these two hyperparameters in Figure 7. We report the results obtained with the best ρ for USOT and SUOT computed among a grid ρ {10 4, 5 10 4, 10 3, 5 10 3, 10 2, 10 1, 1}. For USOT, the best ρ is consistently around 5 10 3 for the Movies and Goodreads datasets, and around 5 10 4 for the BBCSport dataset. We used a second finer grid and reported the results obtained with ρ = 0.00021 on BBCSport, ρ = 0.004 for Goodreads on the likability task and ρ = 0.003 for the genre task. For SUOT, the best ρ obtained was 0.01 for the BBCSport dataset, 1.0 for the movies dataset and 0.5 for the goodreads dataset. For UOT, we used ρ = 1.0 on the BBCSport dataset. For the movies dataset, the best ρ obtained on a subset was 50, but it took an unreasonable amount of time to run on the full dataset as the runtime increases with ρ (see (Chapel et al., 2021, Figure 3)). On the goodreads dataset, it took too much memory on the GPU. For Sinkhorn UOT, we used ε = 0.1 and ρ = 1.0 on the BBCSport and ε = 0.001, ρ = 0.1 on the Goodreads dataset, and ε = 0.01 and ρ = 0.1 on the Movies dataset. Note

Published in Transactions on Machine Learning Research (12/2024)

Figure 6: Runtime on the BBCSport dataset (left) and on the Goodreads dataset (right).

that we also tested other set of parameters for UOT and Sinkhorn UOT, but on a coarser grid than USOT and SUOT given their computational time. For each method, the number of neighbors used for the k-NN method is obtained via cross-validation.

C.1.3 Additional experiments

Runtime. We report in Figure 6 the runtime of computing the different discrepancies between each pair of documents, and in Table 3 the full runtimes. On the BBCSport dataset, the documents have in average 116 words, thus the main bottleneck is the projection step for sliced OT methods. Hence, we observe that OT runs slightly faster than SOT and the sliced unbalanced counterparts. Goodreads is a dataset with larger documents, with on average 1491 words by document. Therefore, as OT scales cubically with the number of samples, we observe here that all sliced methods run faster than OT, which confirms that sliced methods scale better w.r.t. the number of samples. In this setting, we were not able to compute UOT with the POT implementation in a reasonable time. Computations have been performed with a NVIDIA Tesla V100 GPU.

Table 3: Runtimes on Document Classification

BBCSport Goodreads

OT Average ( 10 3 s) 3.29 1.61 440.30 259 Full (s) 891 221252

SOT Average ( 10 3s) 1.80 6.22 4.49 1.44 Full (s) 487 2256

USOT Average ( 10 3s) 14.67 1.29 14.45 0.88 Full (s) 3897 7260

SUOT Average ( 10 3s) 13.9 1.21 14.32 0.95 Full (s) 3770 7193

Ablations. We plot in Figure 7 accuracy as a function of the number of projections and the number of iterations of the Frank-Wolfe algorithm. We averaged the accuracy obtained with the same setting described in Appendix C.1.2, with varying number of projections K {4, 10, 21, 46, 100, 215, 464, 1000} and number of FW iterations F {1, 2, 3, 4, 5, 10, 15, 20}. Regarding the hyperparameter ρ, we selected the one returning the best accuracy, i.e., ρ = 5 10 4 for USOT and ρ = 10 2 for SUOT.

Choice of ρ. In Table 4, we also add the results obtained when ρ is chosen by cross validation for USOT and SUOT. In this case, the results are slightly below the best one, but are still better than SOT.

Unnormalizing measures. As we have mentioned in Section 5, we have performed document classification experiments by normalizing word histograms to be probabilities. It allows to compare SUOT and USOT with SOT since sliced OT is only well-defined between probabilities. However, it seems reasonnable to compare

Published in Transactions on Machine Learning Research (12/2024)

100 101 102 103

Number of Projections

SOT USOT SUOT

5 10 15 20 Iterations FW

SOT USOT SUOT

Figure 7: Ablation on BBCSport of the number of projections (left) and of the number of Frank-Wolfe iterations (right).

Table 4: Accuracy on document classification. Grid-search over ρ is performed. We report the best accuracy over ρ, which corresponds to ρ = 5.10 6 for Unnormalized USOT and ρ = 5.10 6 for SOPT.

BBCSport Movies Goodreads genre Goodreads like

OT 94.55 74.44 55.22 71.00 UOT 96.73 - - - Sinkhorn UOT 95.45 72.48 53.55 67.81 SOT 89.39 0.76 66.95 0.45 50.09 0.51 65.60 0.20 SUOT 90.12 0.15 67.84 0.37 50.15 0.04 66.72 0.38 USOT 93.52 0.04 69.21 0.37 52.67 0.62 67.78 0.39 SUSOT 92.73 0.27 69.53 0.53 51.93 0.53 67.33 0.26 SUOT (+CV on ρ) 90.00 0.59 67.40 0.64 49.67 0.79 66.43 0.44 USOT (+CV on ρ) 92.61 0.55 68.64 0.29 52.06 7.20 66.61 0.72 USOT (Unnormalized) 86 0.56 - - - SOPT (Unnormalized) 87.27 0.20 - - -

with other unbalanced sliced methods such as SOPT (Bai et al., 2023). We chose to compare with this competitor since their code is available in Python. However, a numerical restriction of their algorithm is that it only outputs measures with constants weights, i.e., distributions α = P αiδxi and β = P βjδyj where αi = βj = 1, but the number of samples in α and β may differ. Under this modeling assumption, the total mass of each measure corresponds to the number of words in the sentence. We performed the comparison on the BBC dataset, using 500 projections for both SOPT and USOT. Unfortunately, the quadratic footprint of computing the similarity kernel does not scale reasonably for SOPT for larger datasets such as Movies or Goodreads, especially because their algorithm is not GPU-compatible compared to ours. We cross-validated the parameter ρ {p.10 k, k J0, 6K, p {1., 5.}}

The result is detailed in the table below. What is noticeable is that the performance degrades for both USOT and SOPT using this parametrization. Furthermore, we observed that the paramater ρ yielding the best accuracy is much smaller for unnormalized measures than for the best one for normalized histograms (i.e., 1e 5 here compared to 1e 3 with normalized measures). Our interpretation of this observation is that considering unnormalized measures adds an additional information of the sentence length via the masses of (α, β). It seems that this additional information dominates the comparison of measures, instead of focusing on the measures support (i.e., the word embedding) which encodes the semantic information of words. When ρ is large the kernel value of USOT/SOPT is mainly dictated by the mass (i.e., sentence length) comparison. Thus smaller ρ seems to give less importance on sentence length, hence a better performance. We also note that performance of SOPT and USOT on unnormalized measures are rather similar. It means that for the choice of marginal prior Dφ = ρTV or Dφ = ρKL does not significantly matter for this specific task, compared to the preprocessing normalization of measures.

C.2 Unbalanced sliced Wasserstein barycenters

We define below the formulation of the USOT barycenter which was used in the experiments of Figure 4 to average predictions of geophysical data. We then detail how we computed it.

Published in Transactions on Machine Learning Research (12/2024)

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

(a) SOT barycenters

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

(b) USOT barycenters with ρ1 = ρ2 = 100

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

(c) USOT barycenters with ρ1 = ρ2 = 0.01

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

0 25 50 75 100 125 150 175

(d) USOT barycenters with ρ1 = 0.01 and ρ2 = 100

Figure 8: Interpolation with USOT as a barycenter computation. We compare different interpolations using SOT or USOT with different settings for the ρ values

Definition C.1. Consider a set of measures (α1, . . . , αB) M+(Rd)B, and a set of non-negative coefficients (ω1, . . . , ωB) 0 such that PB b=1 ωb = 1. We define the barycenter problem (in the KL setting) as

B((αb)b, (ωb)b) inf β P(Rd)

b=1 ωb USOT(αb, β), (98)

= inf β P(Rd)

b=1 inf (πb,1,πb,2) SOT(πb,1, πb,2) + ρ1KL(πb,1|αb) + ρ2KL(πb,2|β), (99)

where P(Rd) denotes the set of probability measures.

Published in Transactions on Machine Learning Research (12/2024)

To compute the barycenter, we aggregate several building blocks. First, since we consider that the barycenter β P(Rd) is a probability, we perform mirror descent as in (Beck & Teboulle, 2003; Cuturi & Doucet, 2014b). More precisely, we use a Nesterov accelerated version of mirror descent. We also tried projected gradient descent, but it did not yield consistent outputs (due to convergence speed (Beck & Teboulle, 2003)). Second, we use a Stochastic-USOT version (see Section 4), i.e., we sample new projections at each iteration of the barycenter update (but not a each iteration of the FW subroutines in Algorithm 3). This procedure is described in Algorithm 7.

Algorithm 7 Barycenter((αb)b, (ωb)b, ρ1, ρ2, lr)

Input: measures (αb)b, weights (ωb)b, ρ1, ρ2, learning rate lr, FW iter F Output: Optimal barycenter β of (98)

t 1 Init (β, β, ˆβ) as uniform distribution over a grid while not converged do do

β (1 γ)ˆβ + γ β Sample projections (θk)K k=1 Compute B((αb)b, (ωb)b) by calling USOT(αb, β, F, (θk)K k=1, ρ1, ρ2) in Algorithm 3 for each b Compute g as the gradient of B((αb)b, (ωb)b) w.r.t. variable β

β exp( lr γ 1 g)β β β/m( β) ˆβ (1 γ)ˆβ + γ β t t + 1 end while

We illustrate this algorithm with several examples of interpolation in Figure 8. We propose to compute an interpolation between two measures located on a fixed grid of size 200 200 with different values of ρi in Dφi = ρi KL. For illustration purposes, we construct the source distribution as a mixture of two Gaussians with a small and a larger mode, and the target distribution as a single Gaussian. Those distributions are normalized over the grid such that both total norms are equal to one (which is not required by our unbalanced sliced variants but grants more interpretability and possible comparisons with SOT). Figure 8a shows the result of the interpolation at three timestamps (t = 0.25, 0.5 and 0.75) of a SOT interpolation (within this setting, ω1 = 1 t and ω2 = t). As expected, the two modes of the source distribution are transported over the target one. We verify in Figure 8b that for a large value of ρ1 = ρ2 = 100, the USOT interpolation behaves similarly as SOT, as expected from the theory. When ρ1 = ρ2 = 0.01, the smaller mode is not moved during the interpolation, whereas the larger one is stretched toward the target (Figure 8c). Finally, in Figure 8d, an asymmetric configuration of ρ1 = 0.01 and ρ2 = 100 allows to get an interpolation when only the big mode of the source distribution is displaced toward the target. In all those cases, the mirror-descent algorithm 7 is run for 500 iterations. Even for a large grid of 200 200, those different results are obtained in a 2 3 minutes on a commodity GPU, while the OT or UOT barycenters are untractable with a limited computational budget.

C.3 Unbalanced version of hyperbolic SOT.

To illustrate the modularity of our FW algorithm, we aim at comparing synthetic mixtures of Wrapped Normal Distribution on the 2-hyperbolic manifold H (Nagano et al., 2019), so that the FW oracle is hyperbolic sliced OT (Bonet et al., 2023c). The parameter θ characterizes on H any geodesic curve passing through the origin, and each sample is projected by taking the shortest path to such geodesics. Once projected on a geodesic curve, we sort data and compute SOT w.r.t. hyperbolic metric d H. We consider the 2-hyperbolic manifold on the Poincaré disc. As illustrated in Figure 9, the input measure α (in red) is a mixture of 3 isotropic normal distributions, with a mode at the top of the disc playing the role of an outlier. The measure β is a mixture of two anisotropic normal distributions, whose means are close to two modes of α, but are slightly shifted at the disc s center. We show on Figure 9 the impact of the parameter ρ = ρ1 = ρ2 on the optimal marginals of USOT.

Published in Transactions on Machine Learning Research (12/2024)

ρ = 10 3 ρ = 10 2 ρ = 10 1 ρ = 100 ρ = 101 Inputs (ρ = )

Marginal π1 Marginal π2

Figure 9: KDE estimation (kernel e d2 H/σ) of optimal (π1, π2) of USOT(α, β) when Dφi = ρKL.

This experiment illustrates several take-home messages, mentioned in Section 3. First, the optimal marginals (π1, π2) are renormalisation of (α, β) accounting for their geometry, which are able to remove outliers for properly tuned ρ. When ρ is large, (π1, π2) (α, β) and we retrieve SOT. When ρ is too small, outliers are removed, but we see a shift of the modes, so that modes of (π1, π2) are closer to each other, but do not exactly correspond to those of (α, β). Second, note that such plot cannot be made with SUOT, since the optimal marginals depend on the projection θ (see Figure 1). Third, we emphasize that we are indeed able to reuse any variant of SOT.

C.4 Choice and interpretation of hyperparameter ρ

An immediate drawback of our framework is the induced additional computational cost w.r.t. SOT. While the above experimental results show that SUOT and USOT improve performance significantly over SOT, and though the complexity is still sub-quadratic in number of samples, our FW approach uses SOT as a subroutine, rendering it necessarily more expensive. Additionally, another practical burden comes from the introduction of extra hyperparameters (ρ1, ρ2), which may be tuned using cross-validation. Therefore, a future direction would be to derive efficient strategies to tune (ρ1, ρ2), maybe w.r.t. the applicative context, and further complement the possible interpretations of ρ as a threshold for the geometric information encode by C1, Cd. While we leave the automation of tuning (ρ1, ρ2) for future works, we provide below some details and intuitions on the choice of ρ for the previous experiments. We hope these insights will help the practitioner on how they should chose tune this additional parameter.

General intuition on ρ. The parameter ρ when ρ1 = ρ2 = ρ can be understood as a characteristic distance to decide whether or not two sample should be matched by the coupling π in the primal formulation of (2). Typically, transportation happens for samples (x, y) such that Cd(x, y) ρ, while samples such that Cd(x, y) ρ are interpreted as geometric outliers, and are discarded in the matching π(x, y). In the case of SUOT and USOT, there is somehow a similar interpretation, but not for the same quantities, and we rely on their definitions (Equations 7 and 10), as well as the constraint set E in Theorem 3.9.

One sees that for SUOT(α, β) we have a set of 1D-UOT problems between (θ α, θ β), thus the threshold interpretation holds depending on whether C1(θ (x), θ (y)) ρ or C1(θ (x), θ (y)) ρ. In particular the dependence in θ explains why the outlier threshold depends on the considered projection. Note also we consider C1 instead of Cd.

For USOT(α, β) it is different because the marginals (π1, π2) which we optimize in Equation (9) are independent of θ, and common to all projections. Informally speaking, we interpret that the threshold value to discard a matching between (x, y) depends on whether some quantity proportional to R

Sd 1 C1(θ (x), θ (y))dπθ(x, y)dσ(θ) is larger or smaller than ρ. This quantity is not properly defined as it depends on the optimized variables (πθ)θ, hence the informality of our intuition. However, we wish to emphasize that the parameter ρ should be interpreted differently between SUOT and USOT. As highlighted

Published in Transactions on Machine Learning Research (12/2024)

number of samples n

ρ1 = 1, ρ2 = 1

USOT, d = 5 SUOT, d = 5 USOT, d = 10 SUOT, d = 10 USOT, d = 20 SUOT, d = 20 USOT, d = 50 SUOT, d = 50 USOT, d = 100 SUOT, d = 100 1/n

number of samples n

ρ1 = 1, ρ2 = 0.1

USOT, d = 5 SUOT, d = 5 USOT, d = 10 SUOT, d = 10 USOT, d = 20 SUOT, d = 20 USOT, d = 50 SUOT, d = 50 USOT, d = 100 SUOT, d = 100 1/n

number of samples n

ρ1 = 0.1, ρ2 = 0.1

USOT, d = 5 SUOT, d = 5 USOT, d = 10 SUOT, d = 10 USOT, d = 20 SUOT, d = 20 USOT, d = 50 SUOT, d = 50 USOT, d = 100 SUOT, d = 100 1/n

Figure 10: Sample complexity: E[|L(ˆαn, ˆβn) L(α, β)|] against n, with L = SUOT or USOT and α = β = N(0d, Id) (thus L(α, β) = 0). Results are averaged over 20 runs and the shaded areas represent the 10th and 90th percentiles. All curves exhibit the same convergence rate for any d, that is 1/n (see solid line black curve).

experimentally for document classification in Figure 2, we observe that the value of ρ yielding the best performance is not the same for each loss.

Choice of ρ for hyperbolic data. In Figure 9, the hyperbolic distance between overlapping modes is 0.96, while distance from side modes to the top red outlier is 2.83. Thus, a proper choice of should lie in between, which seems consistent with the observation of Figure 9 for ρ = 1. Indeed we see that we have a satisfying trade-off between removing the top mode and preserving the crescent shape structure of main blue modes.

Choice of ρ for barycenter experiments. For the barycenter, we used insights from Figure 8 which interpolates circular blobs using asymmetric (ρ1, ρ2), where ρ1 is the parameter penalizing the input measures fidelity, and ρ2 the parameter of the barycenter. For Figure 4 (especially line (d)), we also took assymetric (ρ1, ρ2) with large ρ2 = 1e4 for the barycenter to force data matching. Then for inputs ρ1 = 1e1 is roughly the distance between cyclones (see Figure 4), to keep them in the barycenter. All in all, we force the barycenter to match the cyclone structure which matters most, while any structure who would be beyond this ρ1 distance between input measure would be discarded.

Interpretation of ρ for document classification. In this task the measures support are given by word embedding in high dimension, for which we have no intuition of what is for instance the characteristic distance between different semantic clusters, and thus no idea on how ρ should be tuned. For this reason (and more generally in ML tasks), we need to perform a cross-validation over this hyperparameter. We would like to comment the dependence of the document classification accuracy w.r.t. ρ, which can be observed in Figure 2. One can notice that as ρ increases, the accuracy increases until it reaches a peak , until then it decreases to reach a plateau as ρ . When ρ , SUOT and USOT converge to SOT (see Definitions 2.4 and 3.6), and we get similar performances. As ρ 0, marginals (π1, π2) are allowed to differ significantly from inputs (α, β), meaning that SUOT/USOT almost ignore input data. Therefore, ρ should be tuned to extract information from inputs while removing noise. In Figure 2, the peaks correspond to such optimal ρ, and the gain in performance justify the use of SUOT/USOT over SOT.

C.5 Illustration of the sample complexity

We investigate the sample complexity of SUOT and USOT in practice and report the results in Figure 10. Our goal is to empirically verify Theorem 3.4 for SUOT, and explore the convergence rate for USOT. To this end, we consider α = β = N(0d, Id) and compute SUOT(αn, βn) and USOT(αn, βn) for different number of samples n and dimension d. This allows us to explore the convergence rate of SUOT(αn, βn) to SUOT(α, β) = 0 (respectively, of USOT(αn, βn) to USOT(α, β) = 0) as a function of n and d.

Published in Transactions on Machine Learning Research (12/2024)

Figure 10 shows that all curves share the same slope w.r.t n, for any d and for both SUOT and USOT. This experiment is consistent with the dimension-free rate we established in Theorem 3.4 for SUOT. Interestingly, it also reveals that the dimension-free rate holds for USOT as well in that specific setting. More experiments and/or theoretical justification are needed to verify if this holds for more general distributions.