# LCOT: Linear Circular Optimal Transport

*Published as a conference paper at ICLR 2024*

Rocío Díaz Martín¹, Ivan Medri², Yikun Bai³, Xinran Liu³, Kangbai Yan¹, Gustavo K. Rohde⁴, Soheil Kolouri³

¹ Department of Mathematics, Vanderbilt University, Nashville, TN 37240, USA.
² Department of Computer Science, Tennessee State University, Nashville, TN 37209, USA.
³ Department of Computer Science, Vanderbilt University, Nashville, TN 37240, USA.
⁴ Department of Biomedical Engineering, Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22908, USA.

¹ {rocio.p.diaz.martin,kangbai.yan}@vanderbilt.edu, ³ {yikun.bai,xinran.liu,soheil.kolouri}@vanderbilt.edu, ² imedri@tnstate.edu, ⁴ gustavo@virginia.edu

*(Footnotes from the title page: these authors contributed equally to this work; acknowledges support from ONR N000142212505 and NIH GM130825; acknowledges partial support from the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR00112190135 and HR00112090023, and the Wellcome LEAP Foundation.)*

## Abstract

The optimal transport problem for measures supported on non-Euclidean spaces has recently gained ample interest in diverse applications involving representation learning. In this paper, we focus on circular probability measures, i.e., probability measures supported on the unit circle, and introduce a new computationally efficient metric for these measures, denoted as Linear Circular Optimal Transport (LCOT). The proposed metric comes with an explicit linear embedding that allows one to apply Machine Learning (ML) algorithms to the embedded measures and seamlessly modify the underlying metric for the ML algorithm to LCOT. We show that the proposed metric is rooted in the Circular Optimal Transport (COT) and can be considered the linearization of the COT metric with respect to a fixed reference measure. We provide a theoretical analysis of the proposed metric and derive the computational complexities for pairwise comparison of circular probability measures. Lastly, through a set of numerical experiments, we demonstrate the benefits of LCOT in learning representations of circular measures.

## 1 Introduction

Optimal transport (OT) (Villani, 2009; Peyré et al., 2019) is a mathematical framework that seeks the most efficient way of transforming one probability measure into another. The OT framework leads to a geometrically intuitive and robust metric on the set of probability measures, referred to as the Wasserstein distance. It has become an increasingly popular tool in machine learning, data analysis, and computer vision (Kolouri et al., 2017; Khamis et al., 2023). OT's applications encompass generative modeling (Arjovsky et al., 2017; Tolstikhin et al., 2017; Kolouri et al., 2018), domain adaptation (Courty et al., 2017; Damodaran et al., 2018), transfer learning (Alvarez-Melis & Fusi, 2020; Liu et al., 2022), supervised learning (Frogner et al., 2015), clustering (Ho et al., 2017), image and point-cloud registration (Haker et al., 2004; Bai et al., 2022; Le et al., 2023), and even inverse problems (Mukherjee et al., 2021), among others. Recently, there has been increasing interest in OT for measures supported on manifolds (Bonet et al., 2023; Sarrazin & Schmitzer, 2023). This surging interest is primarily due to: 1) real-world data often being supported on a low-dimensional manifold embedded in a larger-dimensional Euclidean space, and 2) many applications inherently involving non-Euclidean geometry, e.g., geophysical data or cortical signals in the brain. In this paper, we are interested in efficiently comparing probability measures supported on the unit circle, a.k.a. circular probability measures, using the optimal transport framework.
Such probability measures, with their densities often represented as circular/rose histograms, are prevalent in many applications, from computer vision and signal processing to geology and astronomy. For instance, in classic computer vision, the color content of an image can be captured by its hue in the HSV space, leading to one-dimensional circular histograms. Additionally, local image/shape descriptors are often represented via circular histograms, as evidenced in classic computer vision papers like SIFT (Lowe, 2004) and Shape Context (Belongie et al., 2000). In structural geology, the orientation of rock formations, such as bedding planes, fault lines, and joint sets, can be represented via circular histograms (Twiss & Moores, 1992). In signal processing, circular histograms are commonly used to represent the phase distribution of periodic signals (Levine et al., 2002). Additionally, a periodic signal can be normalized and represented as a circular probability density function (PDF). Notably, a large body of literature exists on circular statistics (Jammalamadaka & SenGupta, 2001). More specific to our work, however, are the seminal works of Delon et al. (2010) and Rabin et al. (2011), which provide a thorough study of the OT problem and transportation distances on the circle (see also Cabrelli & Molter (1998)). OT on circles has also been recently revisited in various papers (Hundrieser et al., 2022; Bonet et al., 2023; Beraha & Pegoraro, 2023; Quellmalz et al., 2023), further highlighting the topic's timeliness.

Unlike OT on the real line, the OT problem between probability measures defined on the circle generally does not have a closed-form solution. This stems from the intrinsic metric on the circle and the fact that there are two paths between any pair of points on a circle (i.e., clockwise and counter-clockwise). Interestingly, however, when one of the probability measures is the Lebesgue measure, i.e., the uniform distribution, the 2-Wasserstein distance on the circle has a closed-form solution, which we will discuss in the Background section.

We present the Linear Circular OT (LCOT), a new transport-based distance for circular probability measures. By leveraging the closed-form solution of the circular 2-Wasserstein distance between each distribution and the uniform distribution on the circle, our method sidesteps the need for optimization. Concisely, we determine the Monge maps that push the uniform distribution to each input measure using the closed-form solution, then define the distance between the input measures based on the disparities between their respective Monge maps. Our approach draws parallels with the Linear Optimal Transport (LOT) framework proposed by Wang et al. (2013) and can be seen as an extension of the cumulative distribution transform (CDT) presented by Park et al. (2018) to circular probability measures (see also Aldroubi et al. (2022; 2021)). The idea of linearized (unbalanced) optimal transport was also studied recently in various works (Cai et al., 2022; Moosmüller & Cloninger, 2023; Sarrazin & Schmitzer, 2023; Cloninger et al., 2023).
From a geometric perspective, we provide explicit logarithmic and exponential maps between the space of probability measures on the unit circle and the tangent space at a reference measure (e.g., the Lebesgue measure) (Wang et al., 2013; Cai et al., 2022; Sarrazin & Schmitzer, 2023). We then define our distance in this tangent space, giving rise to the terminology Linear Circular OT. The logarithmic map provides a linear embedding for the LCOT distance, while the exponential map inverts this embedding. We provide a theoretical analysis of the proposed metric, LCOT, and demonstrate its utility in various problems.

**Contributions.** Our specific contributions in this paper include 1) proposing a computationally efficient metric for circular probability measures, 2) providing a theoretical analysis of the proposed metric, including its computational complexity for pairwise comparison of a set of circular measures, and 3) demonstrating the robustness of the proposed metric in manifold learning, measure interpolation, and clustering/classification of probability measures.

## 2 Background

### 2.1 Circle space

The unit circle $S^1$ can be defined as the quotient space
$$S^1 = \mathbb{R}/\mathbb{Z} = \{\{x + n : n \in \mathbb{Z}\} : x \in [0, 1)\}.$$
The above definition is equivalent to $[0,1]$ under the identification $0 = 1$. For the sake of simplicity, in this article we treat them as indistinguishable. Furthermore, we adopt a parametrization of the circle as $[0,1)$, designating the North Pole as $0$ and adopting a clockwise orientation. This will serve as our canonical parametrization.

Let $|\cdot|$ denote the absolute value on $\mathbb{R}$. With the aim of avoiding any confusion, when necessary, we will denote it by $|\cdot|_{\mathbb{R}}$. Then, a metric on $S^1$ can be defined as
$$|x - y|_{S^1} := \min\{|x - y|_{\mathbb{R}},\ 1 - |x - y|_{\mathbb{R}}\}, \qquad x, y \in [0, 1),$$
or, equivalently, as
$$|x - y|_{S^1} := \min_{k \in \mathbb{Z}} |x - y + k|_{\mathbb{R}},$$
where for the second formula $x, y \in \mathbb{R}$ are understood as representatives of two equivalence classes in $\mathbb{R}/\mathbb{Z}$, but these two representatives $x, y$ do not need to belong to $[0,1)$. It turns out that such a metric defines a geodesic distance: it is the smaller of the two arc lengths between the points $x, y$ along the circumference (cf. Jammalamadaka & SenGupta (2001), where here we parametrize angles between $0$ and $1$ instead of between $0$ and $2\pi$).

Besides, the circle $S^1$ can be endowed with a group structure. Indeed, as the quotient space $\mathbb{R}/\mathbb{Z}$, it inherits the addition from $\mathbb{R}$ modulo $\mathbb{Z}$. Equivalently, for any $x, y \in [0,1)$, we can define the operation $\oplus$ as
$$x \oplus y := \begin{cases} x + y, & \text{if } x + y \in [0, 1), \\ x + y - 1, & \text{otherwise}, \end{cases} \tag{1}$$
and the operation $\ominus$ analogously.

### 2.2 Distributions on the circle

Regarded as a set, $S^1$ can be identified with $[0,1)$. Thus, signals over $S^1$ can be interpreted as 1-periodic functions on $\mathbb{R}$. More generally, every measure $\mu \in \mathcal{P}(S^1)$ can be regarded as a measure on $\mathbb{R}$ by
$$\mu(A + n) := \mu(A), \qquad \text{for every Borel subset } A \subseteq [0, 1) \text{ and every } n \in \mathbb{Z}. \tag{2}$$
Then, its cumulative distribution function (CDF), denoted by $F_\mu$, is defined as
$$F_\mu(y) := \mu([0, y)) = \int_0^y d\mu, \qquad y \in [0, 1), \tag{3}$$
and can be extended to a function on $\mathbb{R}$ by
$$F_\mu(y + n) := F_\mu(y) + n, \qquad y \in [0, 1),\ n \in \mathbb{Z}. \tag{4}$$
Figure 2 shows the concept of $F_\mu$ and its extension to $\mathbb{R}$. In the rest of this article, we do not distinguish between measures on $S^1$ and their periodic extensions to $\mathbb{R}$, nor between their CDFs and their extended CDFs on $\mathbb{R}$.

**Definition 2.1** (Cumulative distribution function with respect to a reference point). Let $\mu \in \mathcal{P}(S^1)$, and consider a reference point $x_0 \in S^1$. Assume that $S^1$ is identified with $[0,1)$ according to our canonical parametrization. By abuse of notation, also denote by $x_0$ the point in $[0,1)$ that corresponds to the given reference point under the canonical parametrization. We define
$$F_{\mu,x_0}(y) := F_\mu(x_0 + y) - F_\mu(x_0).$$

Figure 1: Visualization of densities (blue and orange) on $S^1$ and after unrolling them to $[0,1)$ by considering a cutting point $x_0$. The blue density is the uniform distribution on $S^1$, represented as having height 1 over the unit circle in black.

The reference point $x_0$ can be considered as the origin for parametrizing the circle as $[0,1)$ starting from $x_0$. That is, $x_0$ will correspond to $0$, and from there, we move according to the clockwise orientation. Thus, we can think of $x_0$ in the above definition as a "cutting point": a point where we cut $S^1$ into a line, so that we can unroll PDFs and CDFs over the circle into $\mathbb{R}$. See Figures 1 and 2. Besides, note that $F_{\mu,x_0}(0) = 0$ and $F_{\mu,x_0}(1) = 1$ by the 1-periodicity of $\mu$. This is to emphasize that in the new system of coordinates, i.e., in the new parametrization of $S^1$ as $[0,1)$ starting from $x_0$, the new origin $x_0$ plays the role of $0$. Finally, notice that if $x_0$ is the North Pole, which corresponds to $0$ in the canonical parametrization of the circle, then $F_{\mu,x_0} = F_\mu$.

**Definition 2.2.** The quantile function $F^{-1}_{\mu,x_0} : [0,1] \to [0,1]$ is defined as
$$F^{-1}_{\mu,x_0}(y) := \inf\{x : F_{\mu,x_0}(x) > y\}.$$
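For concreteness, the following is a minimal NumPy sketch (ours, not part of the paper's implementation) of the circular distance, the periodically extended CDF of equation 3 and equation 4, the CDF with respect to a reference point of Definition 2.1, and the quantile function of Definition 2.2, for a discrete measure $\nu = \sum_j m_j \delta_{s_j}$. The representation by a sorted NumPy array `supp` of support points in $[0,1)$ with weights `mass` summing to one is our assumption.

```python
import numpy as np

def circ_dist(a, b):
    """|a - b|_{S^1} = min(|a - b|, 1 - |a - b|) on the circle parametrized as [0, 1)."""
    d = np.abs(np.asarray(a) - np.asarray(b)) % 1.0
    return np.minimum(d, 1.0 - d)

def cdf(x, supp, mass):
    """F_mu(x) = mu([0, x)) for mu = sum_j mass[j] * delta_{supp[j]} (equation 3),
    extended 1-periodically via F_mu(x + n) = F_mu(x) + n (equation 4)."""
    n, r = np.divmod(x, 1.0)
    cum = np.concatenate(([0.0], np.cumsum(mass)))
    return cum[np.searchsorted(supp, r, side='left')] + n  # mass strictly below r

def cdf_ref(y, supp, mass, x0):
    """F_{mu,x0}(y) = F_mu(x0 + y) - F_mu(x0) (Definition 2.1)."""
    return cdf(x0 + y, supp, mass) - cdf(x0, supp, mass)

def quantile(y, supp, mass):
    """F_mu^{-1}(y) = inf{x : F_mu(x) > y} (Definition 2.2), extended by periodicity."""
    n, r = np.divmod(y, 1.0)
    cum = np.cumsum(mass)
    j = np.minimum(np.searchsorted(cum, r, side='right'), len(supp) - 1)
    return supp[j] + n
```

These helpers are reused in the sketches that follow.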
### 2.3 Optimal transport on the circle

#### 2.3.1 Problem setup

Given $\mu, \nu \in \mathcal{P}(S^1)$, let $c(x,y) := h(|x-y|_{S^1})$ be the cost of transporting a unit mass from $x$ to $y$ on the circle, where $h : \mathbb{R} \to \mathbb{R}_+$ is a convex increasing function. The Circular Optimal Transport cost between $\mu$ and $\nu$ is defined as
$$\mathrm{COT}_h(\mu,\nu) := \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{S^1 \times S^1} c(x,y)\, d\gamma(x,y), \tag{5}$$
where $\Gamma(\mu,\nu)$ is the set of all transport plans from $\mu$ to $\nu$; that is, $\gamma \in \Gamma(\mu,\nu)$ is a measure $\gamma \in \mathcal{P}(S^1 \times S^1)$ having first and second marginals $\mu$ and $\nu$, respectively. There always exists a minimizer $\gamma$ of equation 5, called a Kantorovich optimal plan (see, for example, Santambrogio (2015, Th. 1.4)). When $h(x) = |x|^p$, for $1 \le p < \infty$, we denote $\mathrm{COT}_h(\cdot,\cdot) = \mathrm{COT}_p(\cdot,\cdot)$, and $\mathrm{COT}_p(\cdot,\cdot)^{1/p}$ defines a distance on $\mathcal{P}(S^1)$. In general,
$$\mathrm{COT}_h(\mu,\nu) \le \inf_{M :\ M_\#\mu = \nu} \int_{S^1} h(|M(x) - x|_{S^1})\, d\mu(x), \tag{6}$$
and a minimizer $M : S^1 \to S^1$ of the right-hand side of equation 6, among all maps $M$ that push forward $\mu$ to $\nu$,¹ might not exist. In this work, we will consider the cases where a minimizer $M$ does exist, for example, when the reference measure $\mu$ is absolutely continuous with respect to the Lebesgue measure on $S^1$ (see McCann (2001); Santambrogio (2015)). In these scenarios, such a map $M$ is called an optimal transportation map or a Monge map. Moreover, as $\mu, \nu \in \mathcal{P}(S^1)$ can be regarded as measures on $\mathbb{R}$ according to equation 2, we can work with transportation maps $M : \mathbb{R} \to \mathbb{R}$ that are 1-periodic functions satisfying $M_\#\mu = \nu$.

**Proposition 2.3.** Two equivalent formulations of $\mathrm{COT}_h$ are the following:
$$\mathrm{COT}_h(\mu,\nu) = \inf_{x_0 \in [0,1)} \int_0^1 h\big(|F^{-1}_{\mu,x_0}(x) - F^{-1}_{\nu,x_0}(x)|_{\mathbb{R}}\big)\, dx \tag{7}$$
$$= \inf_{\alpha \in \mathbb{R}} \int_0^1 h\big(|F^{-1}_\mu(x) - F^{-1}_\nu(x - \alpha)|_{\mathbb{R}}\big)\, dx. \tag{8}$$
When there exist minimizers $x_{\mathrm{cut}}$ and $\alpha_{\mu,\nu}$ of equation 7 and equation 8, respectively, the relation between them is given by
$$\alpha_{\mu,\nu} = F_\mu(x_{\mathrm{cut}}) - F_\nu(x_{\mathrm{cut}}). \tag{9}$$
Moreover, if $\mu = \mathrm{Unif}(S^1)$ and $h(x) = |x|^2$, it can be verified that $\alpha_{\mu,\nu}$ is the antipodal of $E(\nu)$, i.e.,
$$\alpha_{\mu,\nu} = x_{\mathrm{cut}} - F_\nu(x_{\mathrm{cut}}) = E(\nu) - 1/2. \tag{10}$$

The proof of equation 8 in Proposition 2.3 is provided in Delon et al. (2010) for the optimal coupling of any pair of probability measures on $S^1$. For the particular and enlightening case of discrete probability measures on $S^1$, we refer the reader to Rabin et al. (2011); in that article, equation 7 is introduced. Finally, equation 10 is given, for example, in Bonet et al. (2023, Proposition 1).

¹ The pushforward $M_\#\mu$ is defined by the change of variables $\int \varphi(y)\, dM_\#\mu(y) := \int \varphi(M(x))\, d\mu(x)$, for every continuous function $\varphi : S^1 \to \mathbb{C}$.

Figure 2: Left: the density of a probability measure $\mu$. Middle: visualization of the periodic extension to $\mathbb{R}$ of the CDF $F_\mu$ of a measure $\mu$ on $[0,1) \cong S^1$. Right: visualization of $F_{\mu,x_0}$ given in Definition 2.1, where the parametrization of the circle is changed; now the origin $0$ is the point $x_0$.

Proposition 2.3 allows us to view the problem of transporting measures supported on the circle as an optimization problem on the real line, by looking for the best cutting point so that the circle can be unrolled into the real line by 1-periodicity.

**Remark 2.4.** In general, if $h$ is strictly convex, the minimizer of equation 8 is unique (see Appendix A.2), but there can be multiple minimizers for equation 7 (see Figure 8 in Appendix A.2 and equation 9). However, when a minimizer $x_{\mathrm{cut}}$ of equation 7 exists, it will lead to the optimal transportation displacement on a circular domain (see Section 2.3.2 below).
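To make equation 8 and the closed form equation 10 concrete, here is a small numerical sketch (our illustration; the grid resolutions are arbitrary choices) that evaluates the unrolled cost for $\mu = \mathrm{Unif}(S^1)$ and $h(x) = |x|^2$, and checks that a grid search over the shift $\alpha$ lands near $E(\nu) - 1/2$. It reuses `quantile` from the previous sketch.

```python
import numpy as np

def alpha_uniform(supp, mass):
    """Closed-form minimizer of equation 8 for mu = Unif(S^1), h(x) = |x|^2:
    alpha_{mu,nu} = E(nu) - 1/2 (equation 10)."""
    return float(np.dot(mass, supp)) - 0.5

def cot2_cost(alpha, supp, mass, n_grid=4000):
    """Unrolled cost of equation 8 with F_mu^{-1}(x) = x (uniform reference):
    int_0^1 |x + alpha - F_nu^{-1}(x)|^2 dx, approximated on a midpoint grid."""
    x = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean((x + alpha - quantile(x, supp, mass)) ** 2))

# sanity check: grid search over alpha vs. the closed form of equation 10
rng = np.random.default_rng(1)
supp = np.sort(rng.random(6)); mass = np.full(6, 1 / 6)
alphas = np.linspace(-0.5, 0.5, 2001)
a_grid = alphas[np.argmin([cot2_cost(a, supp, mass) for a in alphas])]
print(a_grid, alpha_uniform(supp, mass))   # the two values should roughly agree
```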
#### 2.3.2 A closed-form formula for the optimal circular displacement

Let $x_{\mathrm{cut}}$ be a minimizer of equation 7, that is,
$$\mathrm{COT}_h(\mu,\nu) = \int_0^1 h\big(|F^{-1}_{\mu,x_{\mathrm{cut}}}(x) - F^{-1}_{\nu,x_{\mathrm{cut}}}(x)|_{\mathbb{R}}\big)\, dx. \tag{11}$$
From equation 11, one can now emulate optimal transport theory on the real line (see, e.g., Santambrogio (2015)): the point $x_{\mathrm{cut}}$ provides a reference where one can cut the circle. Subsequently, computing the optimal transport between $\mu$ and $\nu$ boils down to solving an optimal transport problem between two distributions on the real line. We consider the parametrization of $S^1$ as $[0,1)$ obtained by setting $x_{\mathrm{cut}}$ as the origin and moving along the clockwise orientation. Let us use the notation $\tilde{x} \in [0,1)$ for the points given by such a parametrization, and the notation $x \in [0,1)$ for the canonical parametrization. That is, the change of coordinates between the two parametrizations is given by $x = \tilde{x} + x_{\mathrm{cut}}$. Then, if $\mu$ does not give mass to atoms, by equation 11 and the classical theory of optimal transport on the real line, the optimal transport map (Monge map) that takes a point $\tilde{x}$ to a point $\tilde{y}$ is given by
$$F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}(\tilde{x}) = \tilde{y}. \tag{12}$$
That is, equation 12 defines a circular optimal transportation map from $\mu$ to $\nu$ written in the parametrization that establishes $x_{\mathrm{cut}}$ as the origin. If we want to refer everything back to the original labeling of the circle, that is, if we want to write equation 12 with respect to the canonical parametrization, we need to change coordinates:
$$\tilde{x} = x - x_{\mathrm{cut}}, \qquad \tilde{y} = y - x_{\mathrm{cut}}. \tag{13}$$
Therefore, a closed-form formula for an optimal circular transportation map in canonical coordinates is given by
$$M^\nu_\mu(x) := F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}}) + x_{\mathrm{cut}} = y, \qquad x \in [0,1), \tag{14}$$
and the corresponding optimal circular transport displacement that takes $x$ to $y$ is
$$M^\nu_\mu(x) - x = F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}}) - (x - x_{\mathrm{cut}}), \qquad x \in [0,1). \tag{15}$$

In summary, we condense the preceding discussion in the following result. The proof is provided in Appendix A.1. While the result builds upon prior work, drawing from Bonet et al. (2023); Rabin et al. (2011); Santambrogio (2015), it offers an explicit formula for the optimal Monge map.

**Theorem 2.5.** Let $\mu, \nu \in \mathcal{P}(S^1)$. Assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure on $S^1$ (in particular, it does not give mass to atoms).

1. If $x_{\mathrm{cut}}$ is a minimizer of equation 7, then equation 14 defines an optimal circular transportation map. (We will use the notation $M^\nu_\mu$ for the Monge map from $\mu$ to $\nu$.)
2. If $\alpha_{\mu,\nu}$ minimizes equation 8, then
$$M^\nu_\mu(x) = F^{-1}_\nu(F_\mu(x) - \alpha_{\mu,\nu}). \tag{16}$$
3. If $x_0, x_1$ are two minimizers of equation 7, then
$$F^{-1}_{\nu,x_0} \circ F_{\mu,x_0}(x - x_0) + x_0 = F^{-1}_{\nu,x_1} \circ F_{\mu,x_1}(x - x_1) + x_1 \qquad \forall x \in [0,1).$$
4. If the cost function $h$ is strictly convex, the optimal map defined by the formula in equation 14 is unique. (The uniqueness is as functions on $S^1$, or as functions on $\mathbb{R}$ up to modulo $\mathbb{Z}$.)
5. If $\nu$ also does not give mass to atoms, then $(M^\nu_\mu)^{-1} = M^\mu_\nu$.
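As an illustration of equation 16 with a uniform reference (so that $F_\mu(x) = x$ and $\alpha_{\mu,\nu} = E(\nu) - 1/2$), the sketch below (ours) builds the Monge map from $\mathrm{Unif}(S^1)$ to a discrete target and checks the pushforward condition empirically, reusing `quantile`, `cdf`, and `alpha_uniform` from the previous sketches.

```python
import numpy as np

def monge_map_from_uniform(x, supp, mass):
    """M_mu^nu(x) = F_nu^{-1}(F_mu(x) - alpha_{mu,nu}) (equation 16) for
    mu = Unif(S^1), reduced mod 1 to the canonical parametrization."""
    return quantile(x - alpha_uniform(supp, mass), supp, mass) % 1.0

# empirical pushforward check: M_mu^nu maps uniform samples to samples of nu,
# so the empirical CDF of the mapped samples should match F_nu
rng = np.random.default_rng(0)
supp = np.sort(rng.random(5)); mass = np.full(5, 0.2)
y = monge_map_from_uniform(rng.random(200_000), supp, mass)
for t in (0.1, 0.4, 0.8):
    print(float(cdf(t, supp, mass)), float(np.mean(y < t)))  # columns should roughly agree
```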
Having established the necessary background, we are now poised to introduce our proposed metric. In the subsequent section, we present the Linear Circular Optimal Transport (LCOT) metric.

### 3.1 Linear Circular Optimal Transport embedding (LCOT)

Following the footsteps of Wang et al. (2013), starting from the COT framework, we define an embedding for circular measures by computing the optimal displacement from a fixed reference measure. The $L^p$-distance on the embedding space then defines a new distance between circular measures (Theorem 3.6 below).

**Definition 3.1** (LCOT embedding). For a fixed reference measure $\mu \in \mathcal{P}(S^1)$ that is absolutely continuous with respect to the Lebesgue measure on $S^1$, we define the Linear Circular Optimal Transport (LCOT) embedding of a target measure $\nu \in \mathcal{P}(S^1)$ with respect to the cost $\mathrm{COT}_h(\cdot,\cdot)$, for a strictly convex increasing function $h : \mathbb{R} \to \mathbb{R}_+$, by
$$\hat{\nu}_{\mu,h}(x) := F^{-1}_{\nu,x_{\mathrm{cut}}}\big(F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}})\big) - (x - x_{\mathrm{cut}}) = F^{-1}_\nu(F_\mu(x) - \alpha_{\mu,\nu}) - x, \qquad x \in [0,1), \tag{17}$$
where $x_{\mathrm{cut}}$ is any minimizer of equation 7 and $\alpha_{\mu,\nu}$ is the minimizer of equation 8.

The LCOT embedding corresponds to the optimal (circular) displacement arising from the problem of transporting the reference measure $\mu$ to the given target measure $\nu$ with respect to a general cost $\mathrm{COT}_h(\cdot,\cdot)$ (see equation 16 from Theorem 2.5 and equation 15).

**Definition 3.2** (LCOT discrepancy). Under the settings of Definition 3.1, we define the LCOT discrepancy by
$$\mathrm{LCOT}_{\mu,h}(\nu_1,\nu_2) := \int_0^1 h\Big(\min_{k \in \mathbb{Z}} |\widehat{\nu_1}_{\mu,h}(t) - \widehat{\nu_2}_{\mu,h}(t) + k|_{\mathbb{R}}\Big)\, d\mu(t), \qquad \nu_1, \nu_2 \in \mathcal{P}(S^1).$$
In particular, when $h(\cdot) = |\cdot|^p$, for $1 < p < \infty$, we define the $\mathrm{LCOT}_{\mu,p}$ distance as
$$\mathrm{LCOT}_{\mu,p}(\nu_1,\nu_2) := \|\widehat{\nu_1}_{\mu,h} - \widehat{\nu_2}_{\mu,h}\|^p_{L^p(S^1, d\mu)} = \int_0^1 \Big(\min_{k \in \mathbb{Z}} |\widehat{\nu_1}_{\mu,h}(t) - \widehat{\nu_2}_{\mu,h}(t) + k|_{\mathbb{R}}\Big)^p\, d\mu(t),$$
where
$$L^p(S^1, d\mu) := \Big\{f : S^1 \to \mathbb{R} \ \Big|\ \|f\|_{L^p(S^1,d\mu)} := \Big(\int_{S^1} |f(t)|^p_{S^1}\, d\mu(t)\Big)^{1/p} < \infty\Big\}. \tag{18}$$
If $\mu = \mathrm{Unif}(S^1)$, we use the notation $L^p(S^1) := L^p(S^1, d\mu)$.

The embedding $\nu \mapsto \hat{\nu}$ outlined by equation 17 is consistent with the definition of the logarithm function given in (Sarrazin & Schmitzer, 2023, Definition 2.7) (we also refer to Wang et al. (2013) for the LOT framework). However, the emphasis of the embedding in this paper is on computational efficiency, and a closed-form solution is provided. Additional details are available in Appendix A.6.

**Remark 3.3.** If the reference measure is $\mu = \mathrm{Unif}(S^1)$, given a target measure $\nu \in \mathcal{P}(S^1)$, we denote the LCOT embedding $\hat{\nu}_{\mu,h}$ of $\nu$ with respect to the cost $\mathrm{COT}_2(\cdot,\cdot)$ (i.e., $h(x) = |x|^2$) simply by $\hat{\nu}$. Due to Theorem 2.5 and equation 10, the expression in equation 17 reduces to
$$\hat{\nu}(x) := F^{-1}_\nu(x - \alpha_{\mu,\nu}) - x, \qquad x \in [0,1). \tag{19}$$
In this setting, we denote $\mathrm{LCOT}_{\mu,h}(\cdot,\cdot)$ simply by $\mathrm{LCOT}(\cdot,\cdot)$. That is, given $\nu_1, \nu_2 \in \mathcal{P}(S^1)$,
$$\mathrm{LCOT}(\nu_1,\nu_2) := \|\hat{\nu}_1 - \hat{\nu}_2\|^2_{L^2(S^1)} = \int_0^1 \Big(\min_{k \in \mathbb{Z}} |\hat{\nu}_1(t) - \hat{\nu}_2(t) + k|_{\mathbb{R}}\Big)^2\, dt. \tag{20}$$
All our experiments are performed using the embedding $\hat{\nu}$ given by equation 19, due to the robustness of the closed-form formula equation 10 for the minimizer $\alpha_{\mu,\nu}$ of equation 8 when $h(x) = |x|^2$ and $\mu = \mathrm{Unif}(S^1)$.
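The following sketch (ours; the regular-grid discretization of $[0,1)$ is an assumption made for illustration) evaluates the embedding of equation 19 on a grid and computes the LCOT distance of equation 20, reusing `quantile`, `circ_dist`, and `alpha_uniform` from the sketches above.

```python
import numpy as np

def lcot_embed(supp, mass, n_grid=2048):
    """hat{nu}(x) = F_nu^{-1}(x - alpha_{mu,nu}) - x (equation 19), evaluated on a
    midpoint grid and wrapped to the displacement representative in [-0.5, 0.5)."""
    x = (np.arange(n_grid) + 0.5) / n_grid
    v = quantile(x - alpha_uniform(supp, mass), supp, mass) - x
    return (v + 0.5) % 1.0 - 0.5

def lcot_dist(emb1, emb2):
    """LCOT(nu1, nu2) = int_0^1 (min_k |hat{nu1}(t) - hat{nu2}(t) + k|)^2 dt
    (equation 20), approximated on the shared grid."""
    return float(np.mean(circ_dist(emb1, emb2) ** 2))
```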
**Remark 3.4.** Let $\mu \in \mathcal{P}(S^1)$ be absolutely continuous with respect to the Lebesgue measure on $S^1$, and $h : \mathbb{R} \to \mathbb{R}_+$ a strictly convex increasing function. Given $\nu \in \mathcal{P}(S^1)$,
$$\mathrm{COT}_h(\mu,\nu) = \int_0^1 h\big(|\hat{\nu}_{\mu,h}(t)|_{S^1}\big)\, d\mu(t) = \int_0^1 h\big(|\hat{\nu}_{\mu,h}(t) - \hat{\mu}_{\mu,h}(t)|_{S^1}\big)\, d\mu(t) = \mathrm{LCOT}_{\mu,h}(\mu,\nu).$$
In particular, $\mathrm{COT}_2(\mu,\nu) = \|\hat{\nu}\|^2_{L^2(S^1)} = \|\hat{\nu} - \hat{\mu}\|^2_{L^2(S^1)} = \mathrm{LCOT}(\mu,\nu)$.

**Proposition 3.5** (Invertibility of the LCOT embedding). Let $\mu \in \mathcal{P}(S^1)$ be absolutely continuous with respect to the Lebesgue measure on $S^1$, $h : \mathbb{R} \to \mathbb{R}_+$ a strictly convex increasing function, and let $\nu \in \mathcal{P}(S^1)$. Then,
$$\nu = (\hat{\nu}_{\mu,h} + \mathrm{id})_\# \mu.$$
We refer to Proposition A.2 in the Appendix for more properties of the LCOT embedding.

**Theorem 3.6.** Let $\mu \in \mathcal{P}(S^1)$ be absolutely continuous with respect to the Lebesgue measure on $S^1$, and let $h(x) = |x|^p$, for $1 < p < \infty$. Then $\mathrm{LCOT}_{\mu,p}(\cdot,\cdot)^{1/p}$ is a distance on $\mathcal{P}(S^1)$. In particular, $\mathrm{LCOT}(\cdot,\cdot)^{1/2}$ is a distance on $\mathcal{P}(S^1)$.

### 3.2 LCOT interpolation between circular measures

Given a COT Monge map and the LCOT embedding, we can compute a linear interpolation between circular measures (refer to Wang et al. (2013) for a similar approach in the Euclidean setting). First, for arbitrary measures $\sigma, \nu \in \mathcal{P}(S^1)$, the COT interpolation can be written as
$$\rho^{\mathrm{COT}}_t := \big((1-t)\,\mathrm{id} + t\, M^\nu_\sigma\big)_\# \sigma, \qquad t \in [0,1]. \tag{21}$$
Similarly, for a fixed reference measure $\mu \in \mathcal{P}(S^1)$, we can write the LCOT interpolation as
$$\rho^{\mathrm{LCOT}}_t := \big((1-t)(\hat{\sigma} + \mathrm{id}) + t(\hat{\nu} + \mathrm{id})\big)_\# \mu, \qquad t \in [0,1], \tag{22}$$
where $\rho^{\mathrm{COT}}_{t=0} = \rho^{\mathrm{LCOT}}_{t=0} = \sigma$ and $\rho^{\mathrm{COT}}_{t=1} = \rho^{\mathrm{LCOT}}_{t=1} = \nu$. In Figure 3, we show such interpolations between the reference measure $\mu$ and two arbitrary measures $\nu_1$ and $\nu_2$ for COT and LCOT. As can be seen, the COT and LCOT interpolations between $\mu$ and the $\nu_i$'s coincide (by definition), while the interpolation between $\nu_1$ and $\nu_2$ differs between the two methods. We also provide an illustration of the logarithmic and exponential maps to, and from, the LCOT embedding.

Figure 3: Left: illustration of the LCOT embedding, the linearization process (logarithmic map), and measure interpolations; machine learning algorithms are applied in the Euclidean embedding space. Right: pairwise interpolations between the reference measure $\mu$ and measures $\nu_1$ and $\nu_2$, using the formulas in equation 21 (COT) and equation 22 (LCOT).

We refer to Appendix A.5.1 for a real-data application.
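A minimal sketch of the LCOT interpolation of equation 22 (ours): push grid samples of the uniform reference through the convex combination of the two maps, reusing `lcot_embed` from the previous sketch. Note that this sketch combines the real-valued displacement representatives returned by `lcot_embed`, which matches equation 22 when the embeddings are read as functions into $\mathbb{R}$.

```python
import numpy as np

def lcot_interpolate(t, emb_sigma, emb_nu):
    """rho_t^{LCOT} = ((1 - t)(hat{sigma} + id) + t(hat{nu} + id))_# mu
    (equation 22); returns equally weighted particles of rho_t on the circle."""
    n_grid = len(emb_sigma)
    x = (np.arange(n_grid) + 0.5) / n_grid          # samples of mu = Unif(S^1)
    return ((1 - t) * (emb_sigma + x) + t * (emb_nu + x)) % 1.0
```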
### 3.3 Time complexity of the Linear COT distance between discrete measures

According to (Delon et al., 2010, Theorem 6.2), for discrete measures $\nu_1, \nu_2$ with $N_1, N_2$ sorted points, the binary search algorithm requires $O((N_1+N_2)\log(1/\epsilon))$ computational time to find an $\epsilon$-approximate solution for $\alpha_{\nu_1,\nu_2}$. If $M$ is the least common denominator of all probability masses, an exact solution can be obtained in $O((N_1+N_2)\ln M)$. Then, for a given $\epsilon > 0$ and $K$ probability measures $\{\nu_k\}_{k=1}^K$, each with $N$ points, the total time to compute the pairwise COT distances is $O(K^2 N \ln(1/\epsilon))$. For LCOT, when the reference $\mu$ is the Lebesgue measure, the optimal $\alpha_{\mu,\nu_k}$ has a closed-form solution (see equation 10), and the time complexity for computing the LCOT embedding via equation 19 is $O(N)$. The LCOT distance calculation between $\nu_i$ and $\nu_j$ according to equation 20 requires $O(N)$ computations. Hence, the total time for pairwise LCOT distance computation between $K$ probability measures $\{\nu_k\}_{k=1}^K$, each with $N$ points, is $O(K^2 N + KN)$. See Appendix A.3 for further explanation.

To verify these time complexities, we evaluate the computational time of the COT and LCOT algorithms and present the results in Figure 4. We generate $K$ random discrete measures $\{\nu_k\}_{k=1}^K$, each with $N$ samples, and for the reference measure $\mu$ we choose: 1) the uniform discrete measure, and 2) a random discrete measure, both with $N_0 = N$ samples. To calculate $\alpha_{\mu,\nu_k}$, we considered two scenarios: using the binary search of Delon et al. (2010) for the non-uniform reference, and using equation 10 for the uniform reference. We label them "uniform ref." and "non-uniform ref." Then, in our first experiment, we set $K = 2$ and measure the wall-clock time for calculating COT and LCOT while varying $N \in \{100, 200, \ldots, 20000\}$. For our second experiment, and for $N \in \{500, 1000, 5000\}$, we vary $K \in \{2, 4, 6, \ldots, 64\}$ and measure the total time for calculating pairwise COT and LCOT distances. The computational benefit of LCOT is evident from Figure 4.

Figure 4: Computational time analysis of COT and LCOT for pairwise comparison of $K$ discrete measures, each with $N$ samples. Left: wall-clock time in seconds for $K = 2$ and $N \in \{100, 200, \ldots, 20000\}$. Right: wall-clock time in seconds for $N \in \{500, 1000, 5000\}$ and $K \in \{2, 4, 6, \ldots, 64\}$. Solid lines are COT, dotted lines are LCOT with a uniform reference, and dash-dotted lines are LCOT with a non-uniform reference.
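The $O(K^2 N + KN)$ pattern, namely embed each measure once and then compare embeddings pairwise, looks as follows in a short sketch (ours), reusing `lcot_embed` and `lcot_dist` from the previous sketches.

```python
import numpy as np

def pairwise_lcot(measures, n_grid=2048):
    """Pairwise LCOT distances for K discrete measures given as (supp, mass)
    pairs: K embeddings in O(K N) total, then O(K^2) grid comparisons."""
    embs = [lcot_embed(supp, mass, n_grid) for supp, mass in measures]
    K = len(embs)
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            D[i, j] = D[j, i] = lcot_dist(embs[i], embs[j])
    return D
```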
## 4 Experiments

To better understand the geometry induced by the LCOT metric, we perform Multidimensional Scaling (MDS) (Kruskal, 1964) on a family of densities, where the discrepancy matrices are computed using LCOT, COT, OT (with a fixed cutting point), and the Euclidean distance.

**Experiment 1.** We generate three families of circular densities, calculate pairwise distances between them, and depict their MDS embeddings in Figure 5. In short, the densities are chosen as follows: we start with two initial densities, (1) a von Mises density centered at the south pole of the circle ($\mu = 0.5$), and (2) a bimodal von Mises density centered at the east ($\mu = 0.25$) and west ($\mu = 0.75$) ends of the circle. Then, we create 20 equally spaced circular translations of each of these densities to capture the geometry of the circle. Finally, we parametrize the COT geodesic between the initial densities and generate 20 extra densities on the geodesic. Figure 5 shows these densities in green, blue, and red, respectively. The MDS visualizations show that LCOT and COT capture the geometry of the circle, encoded in the translation property, in an intuitive fashion. In contrast, OT and Euclidean distances do not capture the underlying geometry of the problem.

**Experiment 2.** To assess the separability properties of the LCOT embedding, we follow a similar experiment design as in Landler et al. (2021). We consider six groups of circular density functions, as in the third row of Figure 5: unimodal von Mises (axial: 0°), wrapped skew-normal, symmetric bimodal von Mises (axial: 0° and 180°), asymmetric bimodal von Mises (axial: 0° and 120°), symmetric trimodal von Mises (axial: 0°, 120°, and 240°), and asymmetric trimodal von Mises (axial: 0°, 240°, and 225°). We assign a von Mises distribution with a small spread ($\kappa = 200$) to each distribution's axis/axes to introduce random perturbations of these distributions. We generate 20 sets of simulated densities and sample each with 50-100 samples. Following the computation of pairwise distances among the sets of samples using the LCOT, COT, OT, and Euclidean methods, we again employ MDS to visualize the separability of each approach across the six circular density classes mentioned above. The outcomes are presented in the bottom row of Figure 5. LCOT stands out for its superior clustering outcomes, featuring distinct boundaries between the actual classes and outperforming the other methods.

**Experiment 3.** In our last experiment, we consider the calculation of the barycenter of circular densities. Building upon Experiments 1 and 2, we generated unimodal, bimodal, and trimodal von Mises distributions. For each distribution's axis/axes, we assigned a von Mises distribution with a small spread ($\kappa = 200$) to introduce random perturbations. These distributions are shown in Figure 6 (left). Subsequently, we computed both the Euclidean average of these densities and the LCOT barycenter. Notably, unlike COT, the invertible nature of the LCOT embedding allows us to directly calculate the barycenter as the inverse of the average of the embedded distributions (see Appendix A.4). The resulting barycenters are illustrated in Figure 6. As observed, the LCOT method accurately recovers the correct barycenter without necessitating additional optimization steps.

Figure 5: MDS for embedding classes of probability densities (panel titles: unimodal von Mises distributions; transition from unimodal to bimodal; bimodal von Mises distributions) into a Euclidean space of dimension 2, where the original pairwise distances (COT distance, LCOT distance, Euclidean or L2 distance) are preserved as well as possible.

Figure 6: The LCOT barycenter compared to the Euclidean mean.

## 5 Conclusion and discussion

In this paper, we presented the Linear Circular Optimal Transport (LCOT) discrepancy, a new metric for circular measures derived from the Linear Optimal Transport (LOT) framework (Wang et al., 2013; Kolouri et al., 2016; Park et al., 2018; Cai et al., 2022; Aldroubi et al., 2022; Moosmüller & Cloninger, 2023). LCOT offers 1) notable computational benefits over the COT metric, particularly in pairwise comparisons of numerous measures, and 2) a linear embedding in which the $L^2(S^1)$ distance between embedded distributions equates to the LCOT metric. We consolidated scattered results on circular OT into Theorem 2.5 and introduced the LCOT metric and embedding, validating LCOT as a metric in Theorem 3.6. In Section 3.3, we assessed LCOT's computational complexity for pairwise comparisons of $K$ circular measures, juxtaposing it with COT. We concluded by showcasing LCOT's empirical strengths via MDS embeddings on varied circular densities using different metrics.

## References

Akram Aldroubi, Shiying Li, and Gustavo K. Rohde. Partitioning signal classes using transport transforms for data analysis and machine learning. Sampling Theory, Signal Processing, and Data Analysis, 19(1):6, 2021.

Akram Aldroubi, Rocio Diaz Martin, Ivan Medri, Gustavo K. Rohde, and Sumati Thareja. The signed cumulative distribution transform for 1-D signal analysis and classification. Foundations of Data Science, 4:137-163, 2022.

David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428-21439, 2020.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.
Yikun Bai, Bernhard Schmitzer, Matthew Thorpe, and Soheil Kolouri. Sliced optimal partial transport. arXiv preprint arXiv:2212.08049, 2022.

Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape context: A new descriptor for shape matching and object recognition. Advances in Neural Information Processing Systems, 13, 2000.

Mario Beraha and Matteo Pegoraro. Wasserstein principal component analysis for circular measures. arXiv preprint arXiv:2304.02402, 2023.

Clément Bonet, Paul Berg, Nicolas Courty, François Septier, Lucas Drumetz, and Minh-Tan Pham. Spherical sliced-Wasserstein. ICLR, 2023.

Carlos A. Cabrelli and Ursula M. Molter. A linear time algorithm for a matching problem on the circle. Information Processing Letters, 66(3):161-164, 1998.

Tianji Cai, Junyi Cheng, Bernhard Schmitzer, and Matthew Thorpe. The linearized Hellinger-Kantorovich distance. SIAM Journal on Imaging Sciences, 15(1):45-83, 2022.

Alexander Cloninger, Keaton Hamm, Varun Khurana, and Caroline Moosmüller. Linearized Wasserstein dimensionality reduction with approximation guarantees. arXiv preprint arXiv:2302.07373, 2023.

Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.

Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 447-463, 2018.

Julie Delon, Julien Salomon, and Andrei Sobolevski. Fast transport optimization for Monge costs on the circle. SIAM Journal on Applied Mathematics, 70(7):2239-2258, 2010.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. Advances in Neural Information Processing Systems, 28, 2015.

Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60:225-240, 2004.

Nhat Ho, Xuan Long Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501-1509. PMLR, 2017.

Shayan Hundrieser, Marcel Klatt, and Axel Munk. The statistics of circular optimal transport. In Directional Statistics for Innovative Applications: A Bicentennial Tribute to Florence Nightingale, pp. 57-82. Springer, 2022.

S. Rao Jammalamadaka and Ashis SenGupta. Topics in Circular Statistics, volume 5. World Scientific, 2001.

Abdelwahed Khamis, Russell Tsuchida, Mohamed Tarek, Vivien Rolland, and Lars Petersson. Earth movers in the big data era: A review of optimal transport in machine learning. arXiv preprint arXiv:2305.05080, 2023.

Soheil Kolouri, Akif B. Tosun, John A. Ozolek, and Gustavo K. Rohde. A continuous linear optimal transport approach for pattern analysis in image datasets. Pattern Recognition, 51:453-462, 2016.

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43-59, 2017.

Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein autoencoders. In International Conference on Learning Representations, 2018.

Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.
Lukas Landler, Graeme D. Ruxton, and E. Pascal Malkemper. Advice on comparing two independent samples of circular data in biology. Scientific Reports, 11(1):20337, 2021.

Tung Le, Khai Nguyen, Shanlin Sun, Kun Han, Nhat Ho, and Xiaohui Xie. Diffeomorphic deformation via sliced Wasserstein distance optimization for cortical surface reconstruction. arXiv preprint arXiv:2305.17555, 2023.

Joel D. Levine, Pablo Funes, Harold B. Dowse, and Jeffrey C. Hall. Signal analysis of behavioral and molecular cycles. BMC Neuroscience, 3(1):1-25, 2002.

Xinran Liu, Yikun Bai, Yuzhe Lu, Andrea Soltoggio, and Soheil Kolouri. Wasserstein task embedding for measuring task similarities. arXiv preprint arXiv:2208.11726, 2022.

David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.

Robert J. McCann. Polar factorization of maps on Riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589-608, 2001.

Caroline Moosmüller and Alexander Cloninger. Linear optimal transport embedding: provable Wasserstein classification for certain rigid transformations and perturbations. Information and Inference: A Journal of the IMA, 12(1):363-389, 2023.

Subhadip Mukherjee, Marcello Carioni, Ozan Öktem, and Carola-Bibiane Schönlieb. End-to-end reconstruction meets data-driven regularization for inverse problems. Advances in Neural Information Processing Systems, 34:21413-21425, 2021.

Se Rim Park, Soheil Kolouri, Shinjini Kundu, and Gustavo K. Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 45(3):616-641, 2018.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Michael Quellmalz, Robert Beinert, and Gabriele Steidl. Sliced optimal transport on the sphere. arXiv preprint arXiv:2304.09092, 2023.

Julien Rabin, Julie Delon, and Yann Gousseau. Transportation distances on the circle. Journal of Mathematical Imaging and Vision, 41:147-167, 2011.

F. Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs and Modeling. Birkhäuser, 2015.

Clément Sarrazin and Bernhard Schmitzer. Linearized optimal transport on manifolds. arXiv preprint arXiv:2303.13901, 2023.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Robert J. Twiss and Eldridge M. Moores. Structural Geology. Macmillan, 1992.

Cédric Villani. Optimal Transport: Old and New. Springer, 2009.

Wei Wang, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101:254-269, 2013.

## Appendix

### A.1 Proofs

**Proof of Proposition 2.3.** The proof of Proposition 2.3 is provided in Delon et al. (2010) for the optimal coupling of any pair of probability measures on $S^1$. For the particular and enlightening case of discrete probability measures on $S^1$, we refer the reader to Rabin et al. (2011). For completeness, notice that the relation between $x_0$ and $\alpha$ holds by changing variables, using the 1-periodicity of $\mu$ and $\nu$ and Definition 2.2 (see also Bonet et al. (2023, Proposition 1)):
$$\int_0^1 h\big(|F^{-1}_{\mu,x_0}(x) - F^{-1}_{\nu,x_0}(x)|_{\mathbb{R}}\big)\, dx = \int_0^1 h\big(|(F_\mu(\cdot + x_0) - F_\mu(x_0))^{-1}(x) - (F_\nu(\cdot + x_0) - F_\nu(x_0))^{-1}(x)|_{\mathbb{R}}\big)\, dx$$
$$= \int_0^1 h\big(|F^{-1}_\mu(x + F_\mu(x_0)) - F^{-1}_\nu(x + F_\nu(x_0))|_{\mathbb{R}}\big)\, dx = \int_0^1 h\big(|F^{-1}_\mu(x + \underbrace{F_\mu(x_0) - F_\nu(x_0)}_{\alpha}) - F^{-1}_\nu(x)|_{\mathbb{R}}\big)\, dx,$$
where the translation by $-x_0$ arising from inverting the recentered CDFs cancels in the difference, and the last equality uses the change of variables $x \mapsto x - F_\nu(x_0)$ together with 1-periodicity.

In particular, if $h(x) = |x|^2$ and $\mu = \mathrm{Unif}(S^1)$ (so that $F^{-1}_\mu(x) = x$), then
$$\mathrm{COT}_2(\mu,\nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 |F^{-1}_\mu(x + \alpha) - F^{-1}_\nu(x)|^2_{\mathbb{R}}\, dx = \inf_{\alpha \in \mathbb{R}} \int_0^1 |x + \alpha - F^{-1}_\nu(x)|^2\, dx$$
$$= \inf_{\alpha \in \mathbb{R}} \left[ \int_0^1 |F^{-1}_\nu(x) - x|^2\, dx - 2\alpha \int_0^1 \big(F^{-1}_\nu(x) - x\big)\, dx + \alpha^2 \right] = \inf_{\alpha \in \mathbb{R}} \left[ \int_0^1 |F^{-1}_\nu(x) - x|^2\, dx - 2\alpha \Big(E(\nu) - \frac{1}{2}\Big) + \alpha^2 \right],$$
where we used $\int_0^1 F^{-1}_\nu(x)\, dx = \int_0^1 x\, d\nu(x) = E(\nu)$. Therefore, in this case, the minimizer $\alpha_{\mu,\nu}$ of equation 8 is unique and has the closed form $\alpha_{\mu,\nu} = E(\nu) - 1/2$. ∎

**Proof of Remark 2.4.** We will show that, in general, the minimizer $\alpha_{\mu,\nu}$ of equation 8 is unique. Our arguments are based on the paper Delon et al. (2010). Specifically, the role played by the function $(F^\theta)^{-1}$, where $F^\theta(x) = F(x) + \theta$ in Delon et al. (2010), is played by $F^{-1}(x - \alpha)$ in our case (i.e., our parameter $\alpha$ corresponds to $\theta$ in the mentioned paper). Our hypotheses are the following:

1. Let $c(x,y) := h(|x - y|)$ for $h : \mathbb{R} \to \mathbb{R}_+$ strictly convex (for example, $h(x) = |x|^p$ with $p > 1$), or, more generally, let $c : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfy the Monge condition, or (strict) cyclical monotonicity condition:
$$c(u_1, v_1) + c(u_2, v_2) < c(u_1, v_2) + c(u_2, v_1) \qquad \forall\, u_1 < u_2,\ v_1 < v_2. \tag{23}$$
2. Let $\mu, \nu$ be two probability measures absolutely continuous with respect to the Lebesgue measure on $S^1$.

The idea of the proof relies on showing that the cost function
$$\mathrm{Cost}(\alpha) := \int_0^1 c\big(F^{-1}_\mu(x), F^{-1}_\nu(x - \alpha)\big)\, dx \tag{24}$$
is strictly convex and continuous (as a function of $\alpha$), and so it has a unique global minimum (which we denote by $\alpha_{\mu,\nu}$). Let
$$c_{\mu,\nu}(x, y) := c\big(F^{-1}_\mu(x), F^{-1}_\nu(y)\big). \tag{25}$$
Under the above conditions, $c_{\mu,\nu}(\cdot,\cdot)$ satisfies the Monge condition equation 23. Then, to prove strict convexity of $\mathrm{Cost}(\alpha)$, it is sufficient to show that
$$2\,\mathrm{Cost}(\bar{\alpha}) < \mathrm{Cost}(\alpha') + \mathrm{Cost}(\alpha'') \qquad \text{whenever } \alpha' \ne \alpha'', \text{ where } \bar{\alpha} := \frac{\alpha' + \alpha''}{2}. \tag{26}$$
Assume $\alpha' < \alpha''$. On the one hand, since
$$\mathrm{Cost}(\bar{\alpha}) = \int_0^1 c_{\mu,\nu}(x, x - \bar{\alpha})\, dx = \int_0^1 c_{\mu,\nu}(y + \alpha'' - \bar{\alpha},\, y - \alpha')\, dy,$$
where in the last equality we used the change of variables $y = x + \bar{\alpha} - \alpha''$, the fact that $2\bar{\alpha} - \alpha'' = \alpha'$, and 1-periodicity to bring the shifted integration interval back to $[0,1]$, we obtain
$$2\,\mathrm{Cost}(\bar{\alpha}) = \int_0^1 c_{\mu,\nu}(z, z - \bar{\alpha})\, dz + \int_0^1 c_{\mu,\nu}(z + \alpha'' - \bar{\alpha},\, z - \alpha')\, dz. \tag{27}$$
On the other hand, by repeating the same idea,
$$\mathrm{Cost}(\alpha'') = \int_0^1 c_{\mu,\nu}(x, x - \alpha'')\, dx = \int_0^1 c_{\mu,\nu}(y + \alpha'' - \bar{\alpha},\, y - \bar{\alpha})\, dy,$$
and so
$$\mathrm{Cost}(\alpha') + \mathrm{Cost}(\alpha'') = \int_0^1 c_{\mu,\nu}(z, z - \alpha')\, dz + \int_0^1 c_{\mu,\nu}(z + \alpha'' - \bar{\alpha},\, z - \bar{\alpha})\, dz. \tag{28}$$
Since $u_1(z) := z < z + \alpha'' - \bar{\alpha} =: u_2(z)$ and $v_1(z) := z - \bar{\alpha} < z - \alpha' =: v_2(z)$, we have that
$$c_{\mu,\nu}(u_1(z), v_1(z)) + c_{\mu,\nu}(u_2(z), v_2(z)) < c_{\mu,\nu}(u_1(z), v_2(z)) + c_{\mu,\nu}(u_2(z), v_1(z))$$
because $c_{\mu,\nu}(\cdot,\cdot)$ satisfies the Monge condition equation 23. Thus, from equation 27 and equation 28 we obtain
$$2\,\mathrm{Cost}(\bar{\alpha}) = \int_0^1 c_{\mu,\nu}(u_1(z), v_1(z))\, dz + \int_0^1 c_{\mu,\nu}(u_2(z), v_2(z))\, dz < \int_0^1 c_{\mu,\nu}(u_1(z), v_2(z))\, dz + \int_0^1 c_{\mu,\nu}(u_2(z), v_1(z))\, dz = \mathrm{Cost}(\alpha') + \mathrm{Cost}(\alpha''),$$
and so equation 26 holds. The continuity of $\alpha \mapsto \mathrm{Cost}(\alpha)$ holds as the integral in equation 24 is finite for all $\alpha$. ∎

**Remark A.1.** We mention that in the case $c(x,y) = h(x - y)$ with $h(x) = |x|$ (studied, for example, in Cabrelli & Molter (1998); Hundrieser et al. (2022)), we have convexity but not strict convexity. However, the authors in Hundrieser et al. (2022) prove a closed formula for a minimizer $\alpha_{\mu,\nu}$ of equation 8:
$$\alpha_{\mu,\nu} = \min\, \operatorname*{arg\,min}_{u \in \mathbb{R}} \int_0^1 |(F_\mu - F_\nu)(t) - u|\, dt,$$
called the level median of the function $F_\mu - F_\nu$. To show this, one uses the fact that in this case equation 8 takes the form
$$\mathrm{COT}_h(\mu,\nu) = \inf_{\alpha \in \mathbb{R}} \int_0^1 |F_\mu(t) - F_\nu(t) - \alpha|\, dt.$$
**Proof of Theorem 2.5.**

1. First, we show that the map $M^\nu_\mu$ given by equation 14 satisfies $(M^\nu_\mu)_\#\mu = \nu$. Here $\mu$ and $\nu$ are the measures extended from $S^1$ to $\mathbb{R}$ with CDFs $F_\mu$ and $F_\nu$, respectively, as defined by equation 3 and equation 4. Choosing the system of coordinates $\tilde{x} \in [0,1)$ that starts at $x_{\mathrm{cut}}$ (see Figure 7), we have $M^\nu_\mu(\tilde{x}) = F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}(\tilde{x})$ (see equation 12). Let $\mu_{x_{\mathrm{cut}}}$ and $\nu_{x_{\mathrm{cut}}}$ be the (1-periodic) measures on $\mathbb{R}$ having CDFs $F_{\mu,x_{\mathrm{cut}}}$ and $F_{\nu,x_{\mathrm{cut}}}$, respectively, i.e., $F_{\mu,x_{\mathrm{cut}}}(\tilde{x}) = \mu_{x_{\mathrm{cut}}}([0,\tilde{x}))$ (and analogously for $\nu_{x_{\mathrm{cut}}}$). That is, we have unrolled $\mu$ and $\nu$ from $S^1$ to $\mathbb{R}$, where the origin $0 \in \mathbb{R}$ corresponds to $x_{\mathrm{cut}} \in S^1$ (see Figure 1). Thus, a classic computation yields
$$\big(F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}\big)_\# \mu_{x_{\mathrm{cut}}} = \big(F^{-1}_{\nu,x_{\mathrm{cut}}}\big)_\# \big((F_{\mu,x_{\mathrm{cut}}})_\# \mu_{x_{\mathrm{cut}}}\big) = \big(F^{-1}_{\nu,x_{\mathrm{cut}}}\big)_\# \mathcal{L}_{S^1} = \nu_{x_{\mathrm{cut}}},$$
where $\mathcal{L}_{S^1} = \mathrm{Unif}(S^1)$ denotes the Lebesgue measure on the circle. We used that $(F_\mu)_\#\mu = \mathcal{L}_{S^1}$ as $\mu$ does not give mass to atoms, and so, after changing the system of coordinates, we also have $(F_{\mu,x_{\mathrm{cut}}})_\#\mu_{x_{\mathrm{cut}}} = \mathcal{L}_{S^1}$.

Finally, we have to switch coordinates. Let $z(\tilde{x}) := \tilde{x} + x_{\mathrm{cut}}$ (that is, $z(\tilde{x}) = x$); see Figure 7 for a visualization. It holds that
$$z_\# \nu_{x_{\mathrm{cut}}} = \nu \tag{29}$$
(where we recall that $\nu$ is the measure extended from $S^1$ to $\mathbb{R}$ having CDF $F_\nu$ as in equation 3 and equation 4). Let us check this fact for intervals:
$$z_\#\nu_{x_{\mathrm{cut}}}([a,b]) = \nu_{x_{\mathrm{cut}}}\big(z^{-1}([a,b])\big) = \nu_{x_{\mathrm{cut}}}([a - x_{\mathrm{cut}},\, b - x_{\mathrm{cut}}]) = F_{\nu,x_{\mathrm{cut}}}(b - x_{\mathrm{cut}}) - F_{\nu,x_{\mathrm{cut}}}(a - x_{\mathrm{cut}})$$
$$= F_\nu(b) - F_\nu(x_{\mathrm{cut}}) - \big(F_\nu(a) - F_\nu(x_{\mathrm{cut}})\big) = F_\nu(b) - F_\nu(a) = \nu([a,b]).$$
Besides, it holds that
$$\big(F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\big)_\# \mu = \mathrm{Unif}(S^1), \tag{30}$$
in the sense that it is the Lebesgue measure on $S^1$ extended periodically (with period 1) to the real line, which we denote by $\mathcal{L}_{S^1}$. Let us sketch the proof for intervals. First, notice that $F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}}) = F_\mu(x) - F_\mu(x_{\mathrm{cut}})$, and so its inverse is $y \mapsto F^{-1}_\mu(y + F_\mu(x_{\mathrm{cut}}))$. Therefore,
$$\big(F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\big)_\#\mu\,([a,b]) = \mu\Big(\big[F^{-1}_\mu(a + F_\mu(x_{\mathrm{cut}})),\ F^{-1}_\mu(b + F_\mu(x_{\mathrm{cut}}))\big]\Big) = b - a.$$

Figure 7: The unit circle (black) can be parametrized as $[0,1)$ in many different ways. In the figure, we mark in black the North Pole as $0$; the canonical parametrization of $S^1$ identifies the North Pole with $0$. Then, also in black, we pick a point $x_{\mathrm{cut}}$. The distance in blue, $x$, that starts at $0$ equals the distance in red, $\tilde{x}$, that starts at $x_{\mathrm{cut}}$, plus the corresponding starting point $x_{\mathrm{cut}}$. This allows us to visualize the change of coordinates given by equation 13.

Putting the pieces together,
$$(M^\nu_\mu)_\#\mu = \Big(z \circ F^{-1}_{\nu,x_{\mathrm{cut}}} \circ F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\Big)_\#\mu = z_\#\big(F^{-1}_{\nu,x_{\mathrm{cut}}}\big)_\#\big(F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\big)_\#\mu = z_\#\big(F^{-1}_{\nu,x_{\mathrm{cut}}}\big)_\#\mathcal{L}_{S^1} \ \text{(by equation 30)} = z_\#\nu_{x_{\mathrm{cut}}} = \nu \ \text{(by equation 29)}.$$

Now, let us prove that $M^\nu_\mu$ is optimal. First, assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure on $S^1$, and let $f_\mu$ denote its density function. We use the change of variables $u = F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}}) = F_\mu(x) - F_\mu(x_{\mathrm{cut}})$, $du = f_\mu(x)\, dx$:
$$\int_0^1 h(|M^\nu_\mu(x) - x|_{\mathbb{R}})\, d\mu(x) = \int_0^1 h\Big(\big|F^{-1}_{\nu,x_{\mathrm{cut}}}\big(F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}})\big) - (x - x_{\mathrm{cut}})\big|_{\mathbb{R}}\Big)\, \underbrace{f_\mu(x)\, dx}_{d\mu(x)}$$
$$= \int_{-F_\mu(x_{\mathrm{cut}})}^{1 - F_\mu(x_{\mathrm{cut}})} h\big(|F^{-1}_{\nu,x_{\mathrm{cut}}}(u) - F^{-1}_{\mu,x_{\mathrm{cut}}}(u)|_{\mathbb{R}}\big)\, du = \int_0^1 h\big(|F^{-1}_{\nu,x_{\mathrm{cut}}}(u) - F^{-1}_{\mu,x_{\mathrm{cut}}}(u)|_{\mathbb{R}}\big)\, du = \mathrm{COT}_h(\mu,\nu),$$
using 1-periodicity for the second-to-last equality. Now, let us do the proof in general (without assuming a density):
$$\int_0^1 h(|M^\nu_\mu(x) - x|_{\mathbb{R}})\, d\mu(x) = \int_0^1 h\big(|F^{-1}_{\nu,x_{\mathrm{cut}}}(y) - F^{-1}_{\mu,x_{\mathrm{cut}}}(y)|_{\mathbb{R}}\big)\, d\big(F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\big)_\#\mu(y) = \int_0^1 h\big(|F^{-1}_{\nu,x_{\mathrm{cut}}}(u) - F^{-1}_{\mu,x_{\mathrm{cut}}}(u)|_{\mathbb{R}}\big)\, du = \mathrm{COT}_h(\mu,\nu).$$
In the last equality we have used that $\big(F_{\mu,x_{\mathrm{cut}}}(\cdot - x_{\mathrm{cut}})\big)_\#\mu$ is the Lebesgue measure (see equation 30).
2. Using the definition of the generalized inverse (quantile function), we have
$$M^\nu_\mu(x) = F^{-1}_{\nu,x_{\mathrm{cut}}}\big(F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}})\big) + x_{\mathrm{cut}}$$
$$= \inf\{x' : F_{\nu,x_{\mathrm{cut}}}(x') > F_{\mu,x_{\mathrm{cut}}}(x - x_{\mathrm{cut}})\} + x_{\mathrm{cut}}$$
$$= \inf\{x' : F_\nu(x' + x_{\mathrm{cut}}) - F_\nu(x_{\mathrm{cut}}) > F_\mu(x) - F_\mu(x_{\mathrm{cut}})\} + x_{\mathrm{cut}}$$
$$= \inf\{x' : F_\nu(x' + x_{\mathrm{cut}}) > F_\mu(x) - \alpha_{\mu,\nu}\} + x_{\mathrm{cut}}$$
$$= \inf\{y - x_{\mathrm{cut}} : F_\nu(y) > F_\mu(x) - \alpha_{\mu,\nu}\} + x_{\mathrm{cut}}$$
$$= \inf\{y : F_\nu(y) > F_\mu(x) - \alpha_{\mu,\nu}\} = F^{-1}_\nu\big(F_\mu(x) - \alpha_{\mu,\nu}\big),$$
where we used equation 9, i.e., $\alpha_{\mu,\nu} = F_\mu(x_{\mathrm{cut}}) - F_\nu(x_{\mathrm{cut}})$.

3. This part follows from the previous item, as the right-hand side of equation 16 does not depend on the particular minimizer of equation 7.

4. From (McCann, 2001, Theorem 13), there exists a unique optimal Monge map for the optimal transport problem on the unit circle. Therefore, by using Remark 2.4, $M^\nu_\mu$ is the unique optimal transport map from $\mu$ to $\nu$. For the quadratic case $h(x) = |x|^2$, we refer, for example, to Santambrogio (2015, Th. 1.25, Sec. 1.3.2). Moreover, in this particular case, there exists a function $\varphi$ such that $M^\nu_\mu(x) = x - \varphi'(x)$, where $\varphi$ is a Kantorovich potential (that is, a solution to the dual optimal transport problem on $S^1$) and the difference is taken modulo $\mathbb{Z}$.

5. The identity $(M^\nu_\mu)^{-1} = M^\mu_\nu$ holds from the symmetry of the cost equation 5 that one optimizes. It can also be verified using equation 16 and the fact that, from equation 9, $\alpha_{\mu,\nu} = -\alpha_{\nu,\mu}$:
$$M^\mu_\nu \circ M^\nu_\mu(x) = F^{-1}_\mu\Big(F_\nu\big(F^{-1}_\nu(F_\mu(x) - \alpha_{\mu,\nu})\big) - \alpha_{\nu,\mu}\Big) = F^{-1}_\mu\big(F_\mu(x) - \alpha_{\mu,\nu} + \alpha_{\mu,\nu}\big) = x. \qquad \blacksquare$$

**Proposition A.2** (Properties of the LCOT embedding). Let $\mu \in \mathcal{P}(S^1)$ be absolutely continuous with respect to the Lebesgue measure on $S^1$, and let $\nu \in \mathcal{P}(S^1)$.

1. $\hat{\mu}_{\mu,h} \equiv 0$.
2. $\hat{\nu}_{\mu,h}(x) \in [-0.5, 0.5]$ for every $x \in [0,1)$.
3. Let $\nu_1, \nu_2 \in \mathcal{P}(S^1)$ with $\nu_1$ not giving mass to atoms. Then the map
$$M := (\widehat{\nu_2}_{\mu,h} - \widehat{\nu_1}_{\mu,h}) \circ (\widehat{\nu_1}_{\mu,h} + \mathrm{id})^{-1} + \mathrm{id} \tag{31}$$
satisfies $M_\#\nu_1 = \nu_2$ (however, it is not necessarily an optimal circular transport map).

**Proof of Proposition A.2.**

1. It trivially holds that the optimal Monge map from the distribution $\mu$ to itself is the identity $\mathrm{id}$, or, equivalently, that the optimal displacement is zero for all particles.
2. This holds from the embedding being the optimal displacement, that is,
$$\mathrm{COT}_h(\mu,\nu) = \inf_{M:\ M_\#\mu = \nu} \int_{S^1} h(|M(x) - x|_{S^1})\, d\mu(x) = \int_{S^1} h\big(|\hat{\nu}_{\mu,h}(x)|_{S^1}\big)\, d\mu(x),$$
and from the fact that $|z|_{S^1}$ is at most $0.5$.
3. We use that $\hat{\nu}_{\mu,h} = M^\nu_\mu - \mathrm{id}$ and that $(M^\nu_\mu)^{-1} = M^\mu_\nu$, so that $(\widehat{\nu_1}_{\mu,h} + \mathrm{id})^{-1} = M^\mu_{\nu_1}$:
$$M(x) = (\widehat{\nu_2}_{\mu,h} - \widehat{\nu_1}_{\mu,h}) \circ M^\mu_{\nu_1}(x) + x = (M^{\nu_2}_\mu - M^{\nu_1}_\mu) \circ M^\mu_{\nu_1}(x) + x = M^{\nu_2}_\mu \circ M^\mu_{\nu_1}(x) - x + x = M^{\nu_2}_\mu \circ M^\mu_{\nu_1}(x).$$
Finally, notice that
$$(M^{\nu_2}_\mu \circ M^\mu_{\nu_1})_\#\nu_1 = (M^{\nu_2}_\mu)_\#\big((M^\mu_{\nu_1})_\#\nu_1\big) = (M^{\nu_2}_\mu)_\#\mu = \nu_2. \qquad \blacksquare$$
Now we proceed to prove Theorem 3.6. With this result, it is worth noticing that $\mathrm{LCOT}_{\mu,p}(\cdot,\cdot)^{1/p}$ endows $\mathcal{P}(S^1)$ with a metric-space structure. The proof is based on the fact that we have introduced an explicit embedding and then considered an $L^p$-distance. It follows that we have defined a kernel distance (which is, in fact, positive semidefinite).

**Proof of Theorem 3.6.** From equation 20, it is straightforward to prove symmetry and non-negativity. If $\nu_1 = \nu_2$, by the uniqueness of the optimal COT map (see Theorem 2.5, Part 3), we have $\widehat{\nu_1}_{\mu,h} = \widehat{\nu_2}_{\mu,h}$. Thus, $\mathrm{LCOT}_{\mu,h}(\nu_1,\nu_2) = 0$. For the reverse direction, if $\mathrm{LCOT}_{\mu,h}(\nu_1,\nu_2) = 0$, then
$$h\Big(\min_{k \in \mathbb{Z}} |\widehat{\nu_1}_{\mu,h}(x) - \widehat{\nu_2}_{\mu,h}(x) + k|\Big) = 0 \qquad \mu\text{-a.s.}$$
Thus, $\widehat{\nu_1}_{\mu,h}(x) \equiv_1 \widehat{\nu_2}_{\mu,h}(x)$ $\mu$-a.s. (where $\equiv_1$ stands for equality modulo $\mathbb{Z}$). That is,
$$M^{\nu_1}_\mu(x) = \widehat{\nu_1}_{\mu,h}(x) + x \equiv_1 \widehat{\nu_2}_{\mu,h}(x) + x = M^{\nu_2}_\mu(x) \qquad \mu\text{-a.s.}$$
Let $S \subseteq [0,1)$ denote the set of $x$ for which the equality above holds; we have $\mu(S) = 1$ and $\mu(S^1 \setminus S) = 0$. Equivalently, for any measurable $B \subseteq S^1$, $\mu(B \cap S) = \mu(B)$. Pick any Borel set $A \subseteq S^1$. Then
$$\nu_1(A) = \mu\big((M^{\nu_1}_\mu)^{-1}(A)\big) = \mu\big((M^{\nu_1}_\mu)^{-1}(A) \cap S\big) = \mu\big((M^{\nu_2}_\mu)^{-1}(A) \cap S\big) = \mu\big((M^{\nu_2}_\mu)^{-1}(A)\big) = \nu_2(A), \tag{32}$$
where the first and last equalities follow from the fact that $M^{\nu_1}_\mu, M^{\nu_2}_\mu$ push forward $\mu$ to $\nu_1, \nu_2$, respectively.

Finally, we verify the triangle inequality. Here we use that $h(x) = |x|^p$, for $1 \le p < \infty$. Let $\nu_1, \nu_2, \nu_3 \in \mathcal{P}(S^1)$:
$$\mathrm{LCOT}_{\mu,p}(\nu_1,\nu_2)^{1/p} = \Big(\int_0^1 |\hat{\nu}_1(t) - \hat{\nu}_2(t)|^p_{S^1}\, d\mu(t)\Big)^{1/p} \le \Big(\int_0^1 \big(|\hat{\nu}_1(t) - \hat{\nu}_3(t)|_{S^1} + |\hat{\nu}_3(t) - \hat{\nu}_2(t)|_{S^1}\big)^p\, d\mu(t)\Big)^{1/p}$$
$$\le \Big(\int_0^1 |\hat{\nu}_1(t) - \hat{\nu}_3(t)|^p_{S^1}\, d\mu(t)\Big)^{1/p} + \Big(\int_0^1 |\hat{\nu}_3(t) - \hat{\nu}_2(t)|^p_{S^1}\, d\mu(t)\Big)^{1/p} = \mathrm{LCOT}_{\mu,p}(\nu_1,\nu_3)^{1/p} + \mathrm{LCOT}_{\mu,p}(\nu_2,\nu_3)^{1/p},$$
where the last inequality holds from Minkowski's inequality. ∎

### A.2 Understanding the relation between the minimizers of equation 7 and equation 8

We briefly revisit the discussion in Section 2.3.1, specifically in Remark 2.4, concerning the optimizers $x_{\mathrm{cut}}$ and $\alpha_{\mu,\nu}$ of equation 7 and equation 8, respectively. Assuming minimizers exist for equation 7 and equation 8, we first explain why we adopt the terminology "cutting point" ($x_{\mathrm{cut}}$) for a minimizer of equation 7 and not for the minimizer $\alpha_{\mu,\nu}$ of equation 8.

On the one hand, the cost function presented in equation 7 is given by
$$\mathrm{Cost}(x_0) := \int_0^1 h\big(|F^{-1}_{\mu,x_0}(x) - F^{-1}_{\nu,x_0}(x)|_{\mathbb{R}}\big)\, dx. \tag{33}$$
We minimize over $x_0 \in [0,1) \cong S^1$, aiming to find an optimal $x_0$ that affects both CDFs, $F_\mu$ and $F_\nu$. Looking at the cost equation 33, for each fixed $x_0$, we change the system of reference by adopting $x_0$ as the origin. Then, once an optimal $x_0$ is found (called $x_{\mathrm{cut}}$), it leads to the optimal transportation displacement, providing a change of coordinates that unrolls the CDFs of $\mu$ and $\nu$ into $\mathbb{R}$ and allows the use of classical optimal transport theory on the real line (see Section 2.3.2 and the proofs in Appendix A.1).

On the other hand, the cost function in equation 8 is
$$\mathrm{Cost}(\alpha) := \int_0^1 h\big(|F^{-1}_\mu(x + \alpha) - F^{-1}_\nu(x)|_{\mathbb{R}}\big)\, dx,$$
and the minimization runs over every real number $\alpha$. Here, the shift by $\alpha$ affects only one of the CDFs, not both. Therefore, it does not allow for a consistent change of the system of reference. This is why we do not refer to $\alpha$ as a cutting point in this paper, but we do refer to the minimizer of equation 7 as $x_{\mathrm{cut}}$.

Finally, Figure 8 below provides a visualization of Remark 2.4; that is, it shows through an example that, when minimizers of equation 7 and equation 8 do exist, equation 7 can have multiple minimizers while the minimizer of equation 8 is unique.

Figure 8: Top left: uniform density $\mu$ and a random target density $\nu$ on $S^1$. Top right: the circular transportation cost $\int_0^1 |F^{-1}_{\mu,x_0}(x) - F^{-1}_{\nu,x_0}(x)|^2\, dx$ depicted as a function of the cut $x_0$, showing that the optimization in equation 7 can have multiple minimizers. Bottom right: following equation 9, we depict the difference between the two CDFs, $F_\mu(x) - F_\nu(x)$, for each $x \in [0,1) \cong S^1$; as can be seen, for the optimal cuts (dotted red lines), the difference is constant, indicating that the optimal $\alpha$ for equation 8 is unique. Bottom left: the optimizer of the circular transportation cost in equation 8 is unique, and given that $\mu$ is the uniform measure, it has the closed-form solution $E(\nu) - 1/2$.
### A.3 Time complexity of Linear COT

In this section, we assume that we are given discrete or empirical measures.²

First, we mention that, according to (Delon et al., 2010, Section 6), given two non-decreasing step functions $F$ and $G$ represented by $[[x_1,\ldots,x_{N_1}], [F(x_1),\ldots,F(x_{N_1})]]$ and $[[y_1,\ldots,y_{N_2}], [G(y_1),\ldots,G(y_{N_2})]]$, the computation of an integral of the form
$$\int c\big(F^{-1}(x), G^{-1}(x)\big)\, dx$$
requires $O(N_1 + N_2)$ evaluations of a given cost function $c(\cdot,\cdot)$.

Now, considering the reference measure $\mu = \mathrm{Unif}(S^1)$, we detail our algorithm for computing $\mathrm{LCOT}(\nu_1,\nu_2)$. Let us assume that $\nu_1, \nu_2$ are two discrete probability measures on $S^1$ having $N_1$ and $N_2$ masses, respectively. We represent these measures $\nu_i = \sum_{j=1}^{N_i} m^i_j \delta_{x^i_j}$ (that is, $\nu_i$ has mass $m^i_j$ at location $x^i_j$ for $j = 1,\ldots,N_i$) as arrays of the form
$$\nu_i = \big[[x^i_1,\ldots,x^i_{N_i}],\ [m^i_1,\ldots,m^i_{N_i}]\big], \qquad i = 1,2.$$

**Algorithm to compute LCOT** (a code sketch follows this list):

1. For $i = 1, 2$, compute $\alpha_{\mu,\nu_i} = E(\nu_i) - 1/2$.
2. For $i = 1, 2$, represent $F_{\nu_i}(\cdot) + \alpha_{\mu,\nu_i}$ as the arrays $[[x^i_1,\ldots,x^i_{N_i}], [c^i_1,\ldots,c^i_{N_i}]]$, where
$$c^i_1 := m^i_1 + \alpha_{\mu,\nu_i}, \qquad c^i_j := c^i_{j-1} + m^i_j, \quad j = 2,\ldots,N_i.$$
3. Use that $F^{-1}_\nu(x - \alpha_{\mu,\nu}) = (F_\nu(\cdot) + \alpha_{\mu,\nu})^{-1}(x)$ and the algorithm provided in (Delon et al., 2010, Section 6) mentioned above, with $F = F_{\nu_1}(\cdot) + \alpha_{\mu,\nu_1}$ and $G = F_{\nu_2}(\cdot) + \alpha_{\mu,\nu_2}$, to compute
$$\mathrm{LCOT}(\nu_1,\nu_2) = \|\hat{\nu}_1 - \hat{\nu}_2\|^2_{L^2(S^1)} = \int_0^1 \Big|\big(F^{-1}_{\nu_1}(x - \alpha_{\mu,\nu_1}) - x\big) - \big(F^{-1}_{\nu_2}(x - \alpha_{\mu,\nu_2}) - x\big)\Big|^2_{S^1}\, dx = \int_0^1 |F^{-1}(x) - G^{-1}(x)|^2_{S^1}\, dx.$$

Each step requires $O(N_1 + N_2)$ operations. Therefore, the full algorithm to compute $\mathrm{LCOT}(\nu_1,\nu_2)$ is of order $O(N_1 + N_2)$.

² It is worth mentioning that for some applications, the LCOT framework can also be used for continuous densities, as in the case of the CDT (Park et al., 2018).
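The following NumPy sketch (ours) follows the three steps above. Both shifted quantiles are piecewise constant, so the integral can be computed exactly over the common refinement of their jump locations; for brevity, this sketch sorts the merged breakpoints, costing $O((N_1+N_2)\log(N_1+N_2))$ instead of the $O(N_1+N_2)$ merge of pre-sorted arrays described in the text. It reuses `circ_dist` from the sketch in Section 2.2.

```python
import numpy as np

def shifted_quantile(t, supp, cum, alpha):
    """F_nu^{-1}(t - alpha) mod 1: the support point whose shifted cumulative
    mass first exceeds (t - alpha) mod 1 (step 3)."""
    r = (t - alpha) % 1.0
    j = np.minimum(np.searchsorted(cum, r, side='right'), len(supp) - 1)
    return supp[j]

def lcot_discrete(supp1, mass1, supp2, mass2):
    """Exact LCOT(nu1, nu2) for discrete measures with a uniform reference."""
    a1 = float(np.dot(mass1, supp1)) - 0.5      # step 1: alpha_i = E(nu_i) - 1/2
    a2 = float(np.dot(mass2, supp2)) - 0.5
    c1, c2 = np.cumsum(mass1), np.cumsum(mass2) # step 2: shifted CDF levels
    # jump locations of the two shifted quantiles, wrapped into [0, 1)
    brk = np.unique(np.concatenate([(c1 + a1) % 1.0, (c2 + a2) % 1.0, [0.0, 1.0]]))
    mids, lens = (brk[:-1] + brk[1:]) / 2.0, np.diff(brk)
    q1 = shifted_quantile(mids, supp1, c1, a1)  # both constant on each sub-interval
    q2 = shifted_quantile(mids, supp2, c2, a2)
    return float(np.sum(circ_dist(q1, q2) ** 2 * lens))
```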
### A.4 LCOT barycenter

Although the following definition holds for any non-atomic reference measure $\mu \in \mathcal{P}(S^1)$, for simplicity we consider the reference measure $\mu = \mathrm{Unif}(S^1)$. Given $N$ target measures $\nu_1,\ldots,\nu_N \in \mathcal{P}(S^1)$, since $\mathrm{LCOT}(\cdot,\cdot)^{1/2}$ is a distance, their LCOT barycenter is defined as the measure $\nu_b$ such that
$$\nu_b = \operatorname*{arg\,min}_{\nu \in \mathcal{P}(S^1)} \frac{1}{N} \sum_{j=1}^N \mathrm{LCOT}(\nu, \nu_j) = \operatorname*{arg\,min}_{\nu \in \mathcal{P}(S^1)} \frac{1}{N} \sum_{j=1}^N \|\hat{\nu} - \hat{\nu}_j\|^2_{L^2(S^1)}.$$
In the embedding space, it can be shown that the minimizer of
$$\operatorname*{arg\,min}_{\hat{\nu}} \frac{1}{N} \sum_{j=1}^N \|\hat{\nu} - \hat{\nu}_j\|^2_{L^2(S^1)}$$
is given by the pointwise circular mean
$$\overline{\nu}(x) := \mathrm{circmean}\big(\{\hat{\nu}_1(x),\ldots,\hat{\nu}_N(x)\}\big) := \frac{1}{2\pi} \arctan\!\left(\frac{\sum_{i=1}^N \sin(2\pi \hat{\nu}_i(x))}{\sum_{i=1}^N \cos(2\pi \hat{\nu}_i(x))}\right),$$
with the inverse tangent understood in its four-quadrant sense. For each $x \in [0,1)$, this value is the average of the angles $\{2\pi\hat{\nu}_1(x),\ldots,2\pi\hat{\nu}_N(x)\}$, normalized to fall within the range $[-0.5, 0.5]$. By using the closed formula for the inverse of the LCOT embedding provided by Proposition 3.5, we can go back to the measure space, obtaining the LCOT barycenter of $\nu_1,\ldots,\nu_N$ as
$$\nu_b = (\overline{\nu} + \mathrm{id})_\# \mu. \tag{34}$$
In our experiments, we use the expression in equation 34.
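A minimal sketch (ours) of the barycenter computation: take the pointwise circular mean of the embeddings using the four-quadrant `arctan2`, then push the uniform reference through $\overline{\nu} + \mathrm{id}$ as in equation 34. The grid discretization is our assumption, matching `lcot_embed` above.

```python
import numpy as np

def lcot_barycenter(embeddings):
    """Barycenter via equation 34: pointwise circular mean of the embedded
    measures, then the inverse map; returns equally weighted particles of nu_b."""
    E = np.stack(embeddings)                      # K x n_grid displacement fields
    s = np.sin(2 * np.pi * E).sum(axis=0)
    c = np.cos(2 * np.pi * E).sum(axis=0)
    mean_disp = np.arctan2(s, c) / (2 * np.pi)    # circular mean in [-0.5, 0.5]
    n_grid = E.shape[1]
    x = (np.arange(n_grid) + 0.5) / n_grid        # samples of mu = Unif(S^1)
    return (x + mean_disp) % 1.0
```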
### A.5 Extra figures and experiments

Figure 9 shows an experiment analogous to Experiment 1, but for a different family of measures (Figure 9, left). We include it to give an intuition of how LCOT behaves under translations and dilations of an initial von Mises density.

Figure 9: MDS for embedding classes of probability densities into a Euclidean space of dimension 2, where the original pairwise distances (COT distance, LCOT distance, Euclidean or L2 distance) are preserved as well as possible.

#### A.5.1 LCOT interpolation: real-data application

**Hue transfer experiment.** In Figures 10, 11, and 12, we interpolate the hue channel between pairs of images using the LCOT interpolation (given by equation 22) and the COT interpolation (given by equation 21). Given a pair of images, one is considered the source and the other the target. Each image of $M \times N$ pixels is represented using Hue, Saturation, and Value channels (HSV). We compute a density-normalized histogram of the hue values across all pixels and consider this histogram as a circular density. Each bin represents, at the same time, a color variety (hue value) and a point (angle) on the circle. Thus, displacements on the circle correspond to color conversions. Each interpolation (LCOT/COT) provides a curve of color conversions parametrized between $t = 0$ (source) and $t = 1$ (target). For three different pairs of images, Figures 10, 11, and 12 depict color-converted images using steps $t = 0.25, 0.5, 0.75$ for each interpolation type.

Figures 10, 11, and 12: LCOT and COT color interpolations.

#### A.5.2 LCOT for hue-based image retrieval

Given a data set of $N = 100$ flower images represented using hue, saturation, and value channels (HSV), we use the hue channel for image retrieval. For this, the hue information of an image is used as the primary feature for searching. We extract the hue component from the images of the data set and compute density-normalized histograms of the hue values across all pixels. We consider these histograms as circular densities $\{\nu_i\}_{i=1}^N$ (similarly to A.5.1). For LCOT comparison and retrieval, we compute the LCOT transforms $\{\hat{\nu}_i\}_{i=1}^N$. Given a new query image, we compute its hue histogram, denoted $\sigma$. Then, we perform LCOT matching; that is, we embed the input image in LCOT space by computing $\hat{\sigma}$, so that we can calculate $N$ squared Euclidean distances $\|\hat{\sigma} - \hat{\nu}_i\|^2_2$ (i.e., we compute $\mathrm{LCOT}(\sigma,\nu_i)$ for $i = 1,\ldots,N$). We sort the obtained LCOT distance values in ascending order. In Figure 13, we show the four closest and four furthest images recovered using this technique. In Figure 14, we repeat the same experiment but using classic COT matching on the same data set. As before, given a query image, we first compute its hue histogram, denoted by $\sigma$. Then, we compute the $N$ COT distances $\mathrm{COT}_2(\sigma,\nu_i)$, for $i = 1,\ldots,N$, and sort the obtained COT distance values.

Figure 13: LCOT approach for hue-based image retrieval. The leftmost image is the original query image. The upper row shows the 4 closest images in hue space according to LCOT, while the bottom row shows the 4 furthest images with respect to the LCOT distance.

Figure 14: COT distance for hue-based image retrieval. The leftmost image is the original query image. The upper row shows the 4 closest images in hue space according to the COT distance, while the bottom row shows the 4 furthest images with respect to the COT distance.

In both figures, the retrieval of images with similar color content is evident. The advantage of using LCOT over COT is that, with COT, each new image requires solving $N$ new circular optimal transport problems, whereas LCOT only requires solving one, followed by $N$ Euclidean distance calculations for comparison and sorting. For $M$ queries, we have to compute $MN$ COT distances when using the COT approach ($N$ for each query), but only solve $N + M$ COT problems when using LCOT ($N$ for pre-processing the data set, plus one per query).

### A.6 Understanding the embedding in differential geometry

Our embedding $\nu \mapsto \hat{\nu}$, as given by equation 17, aligns with the definition of the logarithm function presented in (Sarrazin & Schmitzer, 2023, Definition 2.7). To be specific, for $\mu, \nu \in \mathcal{P}(S^1)$ and the Monge map $M^\nu_\mu$, the logarithm function introduced in Sarrazin & Schmitzer (2023) is expressed as
$$\mathcal{P}_2(S^1) \ni \nu \mapsto \log^{\mathrm{COT}}_\mu(\nu) \in L^2(S^1, TS^1; \mu).$$
Here, the tangent bundle of $S^1$ is represented as $TS^1 := \{(x, T_x(S^1)) \mid x \in S^1\}$, where $T_x(S^1)$ denotes the tangent space at the point $x \in S^1$. The space $L^2(S^1, TS^1;\mu)$ is the set of vector fields on $S^1$ whose squared norms (based on the metric on $TS^1$) are $\mu$-integrable. The function (vector field) $\log^{\mathrm{COT}}_\mu(\nu)$ is defined as
$$\log^{\mathrm{COT}}_\mu(\nu) := \big(S^1 \ni x \mapsto (x, v_x)\big) \in TS^1,$$
where $v_x \in T_x(S^1)$ is the initial velocity of the unique constant-speed geodesic curve from $x$ to $M^\nu_\mu(x)$.

The relation between $\log^{\mathrm{COT}}_\mu(\nu)$ and $\hat{\nu}$ in equation 19 can be established as follows. For any $x$ in $S^1$, the spaces $T_x(S^1)$ and $S^1$ can be parametrized by $\mathbb{R}$ and $[0,1)$, respectively. Then, the unique constant-speed curve from $x$ to $M^\nu_\mu(x)$ is given by
$$x(t) := x + t\,\big(M^\nu_\mu(x) - x\big), \qquad t \in [0,1].$$
Its initial velocity is $M^\nu_\mu(x) - x$. Drawing from equation 15, Theorem 2.5, and Proposition A.2, we find $\hat{\nu}(x) = M^\nu_\mu(x) - x$ for all $x$ in $S^1$, making $\hat{\nu}$ and $\log^{\mathrm{COT}}_\mu(\nu)$ equivalent. However, it is important to note that while $\log^{\mathrm{COT}}_\mu$ is defined for a generic (connected, compact, and complete³) manifold, the general definition does not provide a concrete computational method for the embedding $\log^{\mathrm{COT}}_\mu$. Our focus in this paper is on computational efficiency, delivering a closed-form formula.

Regarding the embedding space: in Sarrazin & Schmitzer (2023), the space $L^2(S^1, TS^1;\mu)$ is equipped with the $L^2$-norm induced by $TS^1$. Explicitly, for any $f$ belonging to $L^2(S^1, TS^1;\mu)$,
$$\int_0^1 \|f(x)\|^2_x\, dx = \int_0^1 |f(x)|^2\, dx,$$
where $\|f(x)\|^2_x$ represents the squared norm in the tangent space $T_x(S^1)$ of the vector $f(x)$. By parametrizing $S^1$ and $T_x(S^1)$ as $[0,1)$ and $\mathbb{R}$, respectively, this squared norm becomes $|f(x)|^2$. Consequently, $L^2(S^1, TS^1;\mu)$ becomes an inner product space, whereby the polarization identity
$$\langle f, g \rangle = \frac{\|f + g\|^2 - \|f\|^2 - \|g\|^2}{2}$$
establishes an inner product between $f$ and $g$. However, the embedding space $L^2(S^1, d\mu)$ introduced in this paper, presented in equation 18, uses the $L^2$-norm on the circle, defined for each $f$ in $L^2(S^1, d\mu)$ as
$$\|f\|^2_{L^2(S^1;d\mu)} = \int_{S^1} |f(x)|^2_{S^1}\, d\mu.$$
Unlike the previous space, this does not induce an inner product (in fact, $|\cdot|_{S^1}$ is not a norm). As such, throughout this paper, we term our embedding a "linear embedding" rather than a "Euclidean embedding".

³ In Sarrazin & Schmitzer (2023), the Riemannian manifold is not necessarily compact; however, the measures $\mu, \nu$ must have compact support sets. For brevity, we have slightly overlooked this difference.