# Riemannian Convex Potential Maps

Samuel Cohen*¹, Brandon Amos*², Yaron Lipman²˒³

Abstract. Modeling distributions on Riemannian manifolds is a crucial component in understanding non-Euclidean data that arises, e.g., in physics and geology. The budding approaches in this space are limited by representational and computational tradeoffs. We propose and study a class of flows that uses convex potentials from Riemannian optimal transport. These are universal and can model distributions on any compact Riemannian manifold without requiring domain knowledge of the manifold to be integrated into the architecture. We demonstrate that these flows can model standard distributions on spheres and tori, on synthetic and geological data. Our source code is freely available online at github.com/facebookresearch/rcpm.

## 1. Introduction

Today's generative models have had wide-ranging success at modeling non-trivial probability distributions that naturally arise in fields such as physics (Köhler et al., 2019; Rezende et al., 2019), climate science (Mathieu & Nickel, 2020), and reinforcement learning (Haarnoja et al., 2018). Generative models on Euclidean spaces are well-developed and include (continuous) normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016; Chen et al., 2018), generative adversarial networks (Goodfellow et al., 2014), and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014). In many applications, however, data resides on spaces with more complicated structure, e.g., Riemannian manifolds such as spheres, tori, and cylinders. Using Euclidean generative models on such data is problematic in two respects: first, Euclidean models will allocate mass in infeasible areas of the space; and second, Euclidean models will often need to squeeze mass into zero-volume subspaces. Moreover, knowledge of the space geometry can improve the learning process by incorporating an efficient geometric inductive bias into the modeling and learning pipeline.

*Equal contribution. ¹University College London, ²Facebook AI Research, ³Weizmann Institute of Science. Correspondence to: Samuel Cohen, Brandon Amos, Yaron Lipman. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Illustration of a discrete c-concave function $\phi$ (blue) over a base manifold $\mathcal{M}$ (bold line). These functions consist of discrete components $\{\alpha_i, y_i\}$ and have a Riemannian gradient $\nabla\phi(x) \in T_x\mathcal{M}$.

Flow-based generative models are the state of the art in Euclidean settings and are starting to be extended to Riemannian manifolds (Rezende et al., 2020; Mathieu & Nickel, 2020; Lou et al., 2020). However, in contrast with some models in the Euclidean case (Kong & Chaudhuri, 2020; Huang et al., 2020), the representational capacity and universality of these models are not well understood. Some of these approaches are efficiently tailored to specific choices of manifolds, but the methods and theory of flows on general Riemannian manifolds remain open.

In this paper we introduce the Riemannian Convex Potential Map (RCPM), a generic model for generative modeling on arbitrary Riemannian manifolds that enjoys universal representational power. RCPM (illustrated in fig. 2) is based on Optimal Transport (OT) over Riemannian manifolds (McCann, 2001; Villani, 2008; Sei, 2013; Rezende et al., 2020) and generalizes the convex potential flows in the Euclidean setting by Huang et al. (2020).
We prove that RCPMs are universal on any compact Riemannian manifold, a consequence of the fact that our discrete c-concave potential functions are universal. Our experimental demonstrations show that RCPMs are competitive and model standard distributions on spheres and tori. We further show a case study in modeling continental drift, where we transport Earth's land mass on the sphere.

Figure 2. Illustration of a Riemannian convex potential map on a sphere. From left to right: 1) base distribution $\mu$, a mixture of wrapped Gaussians; 2) learned c-convex potential; 3) mesh grid distorted by the exponential map of the Riemannian gradient of the potential; 4) transformed distribution $\nu$.

## 2. Related Work

Euclidean potential flows. Most related to our work is that of Huang et al. (2020), which leveraged Euclidean optimal transport, parameterized using input convex neural networks (ICNNs) (Amos et al., 2017), to construct universal normalizing flows on Euclidean spaces. Similarly, Korotin et al. (2021); Makkuva et al. (2020) compute optimal transport maps via ICNNs. Riemannian optimal transport replaces the standard Euclidean convex functions with so-called c-convex or c-concave functions, and Euclidean translations with the exponential map. Unfortunately, the notion of c-convex or c-concave functions is intricate, and a simple characterization of such functions is not known. Our approach is to approximate arbitrary c-concave functions on general Riemannian manifolds using discrete c-concave functions, which are simply the minimum of a finite number of translated squared intrinsic distance functions, see fig. 1. Intuitively, this construction resembles the approximation of a Euclidean concave function as the minimum of a finite collection of affine tangents. Although simple, we prove that discrete c-concave functions are in fact dense in the space of c-concave functions, and therefore replacing general c-concave functions with discrete c-concave functions leads to a universal Riemannian OT model. Relatedly, Gangbo & McCann (1996) considered OT maps of discrete measures, which are defined via discrete c-concave functions.

Exponential-map flows. Sei (2013); Rezende et al. (2020) propose distinct parameterizations for c-convex functions living specifically on the sphere. The latter applies them to training flows on the sphere using the construction from McCann's theorem. Our work can be seen as a generalization of the exponential-map approach in Rezende et al. (2020) to arbitrary Riemannian manifolds. In contrast to this work, the maps from our discrete c-concave layers are universal.

Other Riemannian flows. Mathieu & Nickel (2020); Lou et al. (2020) propose extensions of continuous normalizing flows to the Riemannian manifold setting. These are flexible with respect to the choice of manifold, but their representational capacity is not well understood and solving ODEs on manifolds can be expensive. In parallel, Brehmer & Cranmer (2020) proposed a method for simultaneously learning the manifold data lives on and a normalizing flow on the learned manifold. Bose et al. (2020) consider hyperbolic normalizing flows.

Optimal transport on Riemannian manifolds. Optimal transport on spherical manifolds has been extensively studied from a theoretical standpoint. Figalli & Rifford (2009); Loeper (2009); Kim & McCann (2012) study the regularity (continuity, smoothness) of transport maps on spheres and other non-negatively curved manifolds.
Regularity and smoothness are more intricate on negatively curved manifolds, e.g., hyperbolic spaces. Nevertheless, several works demonstrated that transport can be made smooth through a minor change to the Riemannian cost (Lee & Li, 2012). Alvarez-Melis et al. (2020); Hoyos-Idrobo (2020) leverage this to learn transport maps on hyperbolic spaces, in which case the maps are parameterized as hyperbolic neural networks.

## 3. Background

In this section, we introduce the relevant background on normalizing flows and Riemannian optimal transport theory.

### 3.1. Normalizing flows

Normalizing flows parameterize probability distributions $\nu \in \mathcal{P}(\mathcal{M})$ on a manifold $\mathcal{M}$ by pushing a simple base (prior) distribution $\mu \in \mathcal{P}(\mathcal{M})$ through a diffeomorphism¹ $s$. In turn, sampling from the distribution $\nu$ amounts to transforming samples $x$ taken from the base distribution via $s$:

$$y = s(x), \quad \text{where } x \sim \mu. \tag{1}$$

In the language of measures, $\nu$ is the push-forward of the base measure $\mu$ through the transformation $s$, denoted by $\nu = s_\#\mu$. If densities exist, then they adhere to the change-of-variables formula

$$\nu(y) = \mu(x)\,\lvert\det J_s(x)\rvert^{-1}, \tag{2}$$

where we slightly abuse notation by denoting the densities again as $\mu, \nu$. In practice, a normalizing flow $s$ is often defined as a composition of simpler, primitive diffeomorphisms $s_1, \ldots, s_T : \mathcal{M} \to \mathcal{M}$, i.e.,

$$s = s_T \circ \cdots \circ s_1. \tag{3}$$

For a more substantial review of the computational and representational trade-offs inherent to this class of models on Euclidean spaces, we refer to Papamakarios et al. (2019).

¹ A diffeomorphism is a differentiable bijective mapping with a differentiable inverse.

### 3.2. c-convexity and c-concavity

Let $(\mathcal{M}, g)$ be a smooth compact Riemannian manifold without boundary, and $c(x, y) = \frac{1}{2}d(x, y)^2$, where $d(x, y)$ is the intrinsic distance function on the manifold. We use the following generalizations of convex and concave functions:

Definition 1. A function $\phi : \mathcal{M} \to \mathbb{R} \cup \{+\infty\}$ is c-convex if it is not identically $+\infty$ and there exists $\psi : \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$ such that

$$\phi(x) = \sup_{y \in \mathcal{M}} \big( -c(x, y) + \psi(y) \big). \tag{4}$$

Definition 2. A function $\phi : \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$ is c-concave if it is not identically $-\infty$ and there exists $\psi : \mathcal{M} \to \mathbb{R} \cup \{+\infty\}$ such that

$$\phi(x) = \inf_{y \in \mathcal{M}} \big( c(x, y) + \psi(y) \big). \tag{5}$$

We denote the space of c-concave functions on $\mathcal{M}$ as $\bar{\mathcal{C}}(\mathcal{M})$. We also note that if $\phi$ is c-concave, then $-\phi$ is c-convex; hence c-concavity results can be directly extended into c-convexity results by negation. We also use the c-infimal convolution

$$\phi^c(y) = \inf_{x \in \mathcal{M}} \big( c(x, y) - \phi(x) \big). \tag{6}$$

c-concave functions $\phi$ satisfy the involution property

$$\phi^{cc} = \phi. \tag{7}$$

When $\mathcal{M}$ is a product of spheres or a Euclidean space (e.g., spheres, tori), $\bar{\mathcal{C}}(\mathcal{M})$ is a convex space (Figalli et al., 2010b; Figalli & Villani, 2011), i.e., a convex combination of c-concave functions is c-concave. In the case $\mathcal{M} = \mathbb{R}^d$ and $c(x, y) = -x^\top y$, Euclidean concavity is recovered.

### 3.3. Riemannian optimal transport

Optimal transport deals with finding efficient ways to push a base probability measure $\mu \in \mathcal{P}(\mathcal{M})$ to a target measure $\nu \in \mathcal{P}(\mathcal{M})$, i.e., $s_\#\mu = \nu$. Often the $s$ considered is more general than a diffeomorphism, namely a transport plan, which is a bi-measure on $\mathcal{M} \times \mathcal{M}$. When $\mathcal{M}$ is a smooth compact manifold with no boundary, $\mu, \nu \in \mathcal{P}(\mathcal{M})$, and $\mu$ has a density (i.e., is absolutely continuous w.r.t. the volume measure of $\mathcal{M}$), Theorem 9 of McCann (2001) shows that there is a unique (up to $\mu$-zero sets) transport map $t : \mathcal{M} \to \mathcal{M}$ that pushes $\mu$ to $\nu$, i.e., $t_\#\mu = \nu$, and minimizes the transport cost

$$\int_{\mathcal{M}} c(x, s(x))\, d\mu(x) \tag{8}$$

over maps $s$ satisfying $s_\#\mu = \nu$.
Furthermore, this OT map is given by

$$t(x) = \exp_x\!\big[-\nabla\phi(x)\big], \tag{9}$$

where $\phi$ is a c-concave function, $\exp$ is the Riemannian exponential map, and $\nabla$ is the Riemannian gradient. Note that, equivalently, $t(x) = \exp_x[\nabla\psi(x)]$ for a c-convex $\psi$. As a consequence, there always exists a (Borel) mapping $t : \mathcal{M} \to \mathcal{M}$ such that $t_\#\mu = \nu$, where $t$ is of the form of eq. (9). The issue of regularity and smoothness of OT maps is a delicate one and has been extensively studied (see, e.g., Villani (2008); Figalli et al. (2010b); Figalli & Villani (2011)); in general, OT maps are not smooth, but they can be seen as a natural generalization of normalizing flows, relaxing the smoothness of $s$. Henceforth, we will call OT maps flows. In fact, our discrete c-concave functions, the gradients of which are shown to approximate general OT maps, define piecewise smooth maps.

Constant-speed geodesics $\gamma : [0, 1] \to \mathcal{M}$ between a sample $x$ (from $\mu$) and $t(x)$ can also be recovered $\mu$-almost everywhere on the manifold (Figalli & Villani, 2011) as

$$\gamma(l) = \exp_x\!\big[-l\,\nabla\phi(x)\big]. \tag{10}$$

For a geodesic starting at $x_0 \in \mathcal{M}$, $\gamma(0) = x_0$ and $\gamma(1) = t(x_0)$.

## 4. Riemannian Convex Potential Maps²

Our goal is to represent optimal transport maps $t : \mathcal{M} \to \mathcal{M}$ on Riemannian manifolds. The key idea is to build upon the theory of McCann (2001) and parameterize the space of optimal transport maps by c-concave functions $\phi : \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$, see definition 2. Given a c-concave function, the map $t$ is computed via eq. (9). This requires computing the intrinsic gradient of $\phi$ and the exponential map on $\mathcal{M}$.

² As mentioned in sects. 3.2 and 3.3, both c-convex and c-concave functions can be used; we follow McCann (2001) and use c-concavity in the theory and derivations here.

### 4.1. Discrete c-concave functions

Let $\{y_i\}_{i \in [m]} \subset \mathcal{M}$ be a set of $m$ discrete points, where $[m] = \{1, 2, \ldots, m\}$, and define the function

$$\psi(x) = \begin{cases} \alpha_i & \text{if } x = y_i \\ +\infty & \text{otherwise,} \end{cases} \tag{11}$$

where the $\alpha_i \in \mathbb{R}$ are arbitrary. Plugging this choice into definition 2 of c-concave functions, we get that

$$\phi(x) = \min_{i \in [m]} \big( c(x, y_i) + \alpha_i \big) \tag{12}$$

is c-concave. We denote the collection of these functions over $\mathcal{M}$ by $\bar{\mathcal{C}}_d(\mathcal{M})$, and use this construction for parameterizing c-concave functions. The learnable parameters of a single c-concave function consist of

$$\theta = \{(y_i, \alpha_i)\}_{i \in [m]} \subset \mathcal{M} \times \mathbb{R}. \tag{13}$$

Let $i^\star = \arg\min_{i \in [m]} \big( c(x, y_i) + \alpha_i \big)$. The discrete c-concave function in eq. (12) is differentiable except where two pieces $c(x, y_i) + \alpha_i$ meet, and where $x$ belongs to the cut locus of $y_{i^\star}$ on $\mathcal{M}$, which is of volume measure zero (Takashi, 1996). Excluding such cases, the gradient of $\phi$ at $x$ is

$$\nabla_x \phi(x) = \nabla_x \big( c(x, y_{i^\star}) + \alpha_{i^\star} \big) = \nabla_x c(x, y_{i^\star}) = -\log_x(y_{i^\star}), \tag{14}$$

where $\log$ is the logarithmic map on the manifold. See fig. 1 for an illustration of a discrete c-concave function. Intuitively, the optimal transport generated by discrete c-concave functions is piecewise constant, since $\exp_x(-\nabla_x\phi(x)) = \exp_x(-(-\log_x(y_{i^\star}))) = y_{i^\star}$. This can be seen as generalizing semi-discrete optimal transport (Peyré & Cuturi, 2019), which aims at finding transport maps between continuous and discrete probability measures, to the manifold setting.
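To make the construction concrete, below is a minimal JAX sketch of a discrete c-concave potential and its Riemannian gradient on the unit sphere $S^2$ embedded in $\mathbb{R}^3$ (eqs. (12) and (14)). It is an illustration under our own naming and numerical-stability choices, not an excerpt from the released rcpm code.

```python
# Minimal sketch: discrete c-concave potential on S^2 and its Riemannian
# gradient via the logarithmic map. Helper names are ours (illustrative only).
import jax
import jax.numpy as jnp

def sphere_dist(x, y):
    """Intrinsic distance on the unit sphere, d(x, y) = arccos(<x, y>)."""
    return jnp.arccos(jnp.clip(jnp.dot(x, y), -1.0 + 1e-7, 1.0 - 1e-7))

def sphere_log(x, y):
    """Logarithmic map log_x(y): the tangent vector at x pointing towards y."""
    v = y - jnp.dot(x, y) * x                      # project y onto T_x S^2
    return sphere_dist(x, y) * v / (jnp.linalg.norm(v) + 1e-12)

def discrete_c_concave(x, ys, alphas):
    """phi(x) = min_i ( 0.5 * d(x, y_i)^2 + alpha_i ),  eq. (12)."""
    costs = 0.5 * jax.vmap(lambda y: sphere_dist(x, y))(ys) ** 2
    return jnp.min(costs + alphas)

def grad_phi(x, ys, alphas):
    """Riemannian gradient of phi at x,  eq. (14):  -log_x(y_{i*})."""
    costs = 0.5 * jax.vmap(lambda y: sphere_dist(x, y))(ys) ** 2
    i_star = jnp.argmin(costs + alphas)
    return -sphere_log(x, ys[i_star])
```

Applying the exponential map to `-grad_phi(x, ...)` returns the minimizing component $y_{i^\star}$, matching the piecewise-constant behavior discussed above.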
Relation to the Euclidean concave case. In the Euclidean setting, i.e., when $\mathcal{M} = \mathbb{R}^d$ and $c(x, y) = -x^\top y$, a Euclidean concave (closed) function $\phi$ can be expressed as

$$\phi(x) = \inf_{y \in \mathbb{R}^d} \big( -x^\top y + \psi(y) \big).$$

Replacing $\mathbb{R}^d$ with a finite set of points $y_i \in \mathbb{R}^d$, $i \in [m]$, leads to the discrete Legendre-Fenchel transform (Lucet, 1997); it basically amounts to approximating the concave function $\phi$ by the minimum of a collection of affine functions. This transform can be shown to converge to $\phi$ under refinement (Lucet, 1997). We next prove convergence of discrete c-concave functions to their continuous counterparts.

Expressive power of discrete c-concave functions. Let us show that eq. (12) can approximate arbitrary c-concave functions $\phi : \mathcal{M} \to \mathbb{R} \cup \{-\infty\}$ on compact manifolds $\mathcal{M}$. We will prove the following theorem:

Theorem 1. For a compact, boundaryless, smooth manifold $\mathcal{M}$, $\bar{\mathcal{C}}_d(\mathcal{M})$ is dense in $\bar{\mathcal{C}}(\mathcal{M})$.

By dense we mean that for every $\hat{\phi} \in \bar{\mathcal{C}}(\mathcal{M})$ there exists a sequence $\phi_\epsilon \in \bar{\mathcal{C}}_d(\mathcal{M})$, where $\epsilon \downarrow 0$, so that for almost all $x \in \mathcal{M}$ we have $\phi_\epsilon(x) \to \hat{\phi}(x)$ and $\nabla_x\phi_\epsilon(x) \to \nabla_x\hat{\phi}(x)$ as $\epsilon \downarrow 0$.

The proof is based on a construction of $\phi_\epsilon$ using an $\epsilon$-net of $\mathcal{M}$. A set of points $\{y_i\}_{i \in [m]} \subset \mathcal{M}$ is called an $\epsilon$-net if $\mathcal{M} \subset \cup_{i \in [m]} B(y_i, \epsilon)$, where $B(y_i, \epsilon)$ is the $\epsilon$-radius ball centered at $y_i$. Formulated differently, every point $y \in \mathcal{M}$ has a point in the net that is at most $\epsilon$ away. On compact manifolds, for arbitrary $\epsilon > 0$, there exists a finite $\epsilon$-net $\{y_i\}_{i \in [m]}$. Note that $m \to \infty$ as $\epsilon \downarrow 0$, but $m$ is finite for every particular $\epsilon$. Our candidate for approximating $\hat{\phi}$ is

$$\phi_\epsilon(x) = \min_{i \in [m]} \big( c(x, y_i) - \hat{\phi}^c(y_i) \big), \tag{15}$$

where $\hat{\phi}^c$ is the c-infimal convolution (see eq. (6)) of $\hat{\phi}$. The approximation in eq. (15) is motivated by the involution property (eq. (7)): $\hat{\phi} = (\hat{\phi}^c)^c$ and therefore

$$\hat{\phi}(x) = \inf_{y \in \mathcal{M}} \big( c(x, y) - \hat{\phi}^c(y) \big).$$

Proof. Let $\hat{\phi} : \mathcal{M} \to \mathbb{R}$ be an arbitrary c-concave function over $\mathcal{M}$. Let $\epsilon \downarrow 0$ denote a sequence of positive numbers converging monotonically to zero. We will show that $\phi_\epsilon$ defined in eq. (15) converges uniformly to $\hat{\phi}$ over $\mathcal{M}$ and, furthermore, that the Riemannian gradients $\nabla\phi_\epsilon(x)$ converge pointwise to $\nabla\hat{\phi}(x)$ for almost all $x \in \mathcal{M}$ (i.e., up to a set of zero volume).

Uniform convergence. We start by noting that $\hat{\phi}^c$ is also c-concave by definition, and Lemma 2 in McCann (2001) implies that $\hat{\phi}^c$ is $|\mathcal{M}|$-Lipschitz, namely

$$\big|\hat{\phi}^c(x) - \hat{\phi}^c(y)\big| \le |\mathcal{M}|\, d(x, y) \quad \text{for all } x, y \in \mathcal{M},$$

where we denote by $|\mathcal{M}|$ the diameter of $\mathcal{M}$, that is,

$$|\mathcal{M}| = \sup_{x, y \in \mathcal{M}} d(x, y), \tag{16}$$

and $|\mathcal{M}| < \infty$ since $\mathcal{M}$ is compact. In particular, $\hat{\phi}^c$ is either everywhere infinite (the non-interesting case) or finite (in fact, bounded) over $\mathcal{M}$. Next, we establish an upper bound. For all $x \in \mathcal{M}$ and every $i \in [m]$,

$$\hat{\phi}(x) = \inf_{y \in \mathcal{M}} \big( c(x, y) - \hat{\phi}^c(y) \big) \le c(x, y_i) - \hat{\phi}^c(y_i).$$

Since this upper bound holds for all choices of $y_i$, we get $\hat{\phi}(x) \le \phi_\epsilon(x)$. Next, we show a tight lower bound. Lemma 1 in McCann (2001) asserts that $c(x, y) = \frac{1}{2}d(x, y)^2$ is also $|\mathcal{M}|$-Lipschitz as a function of each of its variables. Therefore, using the $\epsilon$-net, we get that for each $x, y \in \mathcal{M}$ there exists $i \in [m]$ (with $d(y, y_i) \le \epsilon$) so that

$$c(x, y) - \hat{\phi}^c(y) \ge c(x, y_i) - \hat{\phi}^c(y_i) - 2|\mathcal{M}|\epsilon \ge \phi_\epsilon(x) - 2|\mathcal{M}|\epsilon.$$

Taking the infimum over $y \in \mathcal{M}$ gives $\hat{\phi}(x) \ge \phi_\epsilon(x) - 2|\mathcal{M}|\epsilon$. Therefore $\phi_\epsilon$ converges uniformly on $\mathcal{M}$ to $\hat{\phi}$.

Pointwise convergence of gradients. Let $O \subset \mathcal{M}$ be the set of points where the gradients of $\hat{\phi}$ and $\phi_\epsilon$ (for the entire countable sequence $\epsilon$) are not defined; then $O$ is of volume-measure zero on $\mathcal{M}$. Indeed, the functions $\hat{\phi}, \phi_\epsilon$ are differentiable almost everywhere on $\mathcal{M}$ by Lemmas 2 and 4 in McCann (2001). Furthermore, if we denote by $\hat{t}$ the optimal transport defined by $\hat{\phi}$, then, as discussed in Chapter 13 of Villani (2008), the set of all $x \in \mathcal{M}$ for which $\hat{t}(x)$ belongs to the cut locus is of measure zero. We add this set to $O$, keeping it of measure zero. Fix $x \in \mathcal{M} \setminus O$ and choose an arbitrary $\epsilon' > 0$. We show convergence of $\nabla_x\phi_\epsilon(x) \to \nabla_x\hat{\phi}(x)$ by showing that we can take $\epsilon$ small enough so that the two tangent vectors $\nabla_x\phi_\epsilon(x), \nabla_x\hat{\phi}(x) \in T_x\mathcal{M}$ are at most $\epsilon'$ apart.
Lemma 7 in McCann (2001) shows that the unique minimizer of $h(y) = c(x, y) - \hat{\phi}^c(y)$ is achieved at $y^\star = \exp_x[-\nabla_x\hat{\phi}(x)]$. In particular, $\nabla_x\hat{\phi}(x) = -\log_x(y^\star)$. As explained above, $y^\star$ is not on the cut locus of $x$. Recall that $\nabla_x c(x, y) = -\log_x(y)$ (McCann, 2001), which is a continuous function of $y$ in the vicinity of $y^\star$. Therefore there exists an $\epsilon_0 > 0$ so that if $y \in B(y^\star, \epsilon_0)$ we have $\|\log_x(y) - \log_x(y^\star)\| < \epsilon'$, where the norm is the Riemannian norm in the tangent space at $x$, i.e., $T_x\mathcal{M}$. Consider the set $A_\delta = \{y \in \mathcal{M} \mid h(y) < h(y^\star) + \delta\}$. Since $h(y)$ is continuous (in fact, Lipschitz) and $y^\star$ is its unique minimizer, we can find a $\delta > 0$ sufficiently small so that $A_\delta \subset B(y^\star, \epsilon_0)$. This means that any $y \notin B(y^\star, \epsilon_0)$ satisfies $h(y) \ge h(y^\star) + \delta$. On the other hand, from continuity of $h$ we can find $\epsilon_1 < \epsilon_0$ so that all $y \in B(y^\star, \epsilon_1)$ satisfy $h(y) < h(y^\star) + \delta$.

Now consider the element $\phi_\epsilon$ with $\epsilon \le \epsilon_1$. Due to the $\epsilon$-net, we know there is at least one $y_i \in B(y^\star, \epsilon_1)$, leading to $h(y_i) < h(y^\star) + \delta$, and, as mentioned above, every $y \notin B(y^\star, \epsilon_0)$ satisfies $h(y) \ge h(y^\star) + \delta$. This means that the $y_i$ that achieves the minimum of $h(y_i)$ among all $i \in [m]$ in eq. (15) has to reside in $B(y^\star, \epsilon_0)$, and $\phi_\epsilon(x) = c(x, y_i) - \hat{\phi}^c(y_i)$ in a small neighborhood of $x$. Therefore, $\nabla_x\phi_\epsilon(x) = -\log_x(y_i)$. Since $\nabla_x\hat{\phi}(x) = -\log_x(y^\star)$ and $d(y_i, y^\star) < \epsilon_0$, our choice of $\epsilon_0$ implies that

$$\big\|\nabla_x\hat{\phi}(x) - \nabla_x\phi_\epsilon(x)\big\| < \epsilon'. \qquad \square$$

### 4.2. RCPM architecture

Now that we have set up an expressive approximation to c-concave functions, we can take the same route as Rezende et al. (2020) and define individual flow blocks $s_j$, $j \in [T]$ (see eq. (3)), using the exponential map, as suggested by McCann's theorem. Each flow block $s_j$ is defined as

$$s_j(x) = \exp_x\!\big(-\nabla_x\phi_j(x)\big), \quad j = 1, \ldots, T, \tag{19}$$

$$\phi_j(x) = \min_{i \in [m]} \big( c(x, y_i^{(j)}) + \alpha_i^{(j)} \big). \tag{20}$$

We learn both the locations $y_i^{(j)} \in \mathcal{M}$ and the offsets $\alpha_i^{(j)} \in \mathbb{R}$ for $i \in [m]$ and $j \in [T]$; these form our model parameters $\theta$. We also consider multi-layer blocks, as detailed later.

### 4.3. Universality of RCPM

We next build upon theorem 1 to show that RCPM is universal. We show that a single block $s$, i.e., eqs. (19) and (20) with $T = 1$, can already approximate an arbitrary optimal transport $t : \mathcal{M} \to \mathcal{M}$. Due to the theory of McCann (2001) (see sect. 3.3), this means that $s$ can push any absolutely continuous base probability $\mu$ to a general target $\nu$ arbitrarily well.

Theorem 2. If $\mu, \nu$ are two probability measures in $\mathcal{P}(\mathcal{M})$ and $\mu$ is absolutely continuous w.r.t. the volume measure of $\mathcal{M}$, then there exists a sequence of discrete c-concave potentials $\phi_\epsilon$, where $\epsilon \downarrow 0$, such that $\exp[-\nabla\phi_\epsilon] \xrightarrow{p} t$, where $t$ is the optimal map pushing $\mu$ to $\nu$ and $\xrightarrow{p}$ denotes convergence in probability.

Proof. Let $\phi_\epsilon$ be the sequence from eq. (15). It is enough to show pointwise convergence of $\exp[-\nabla\phi_\epsilon(x)]$ to $t(x) = \exp[-\nabla\hat{\phi}(x)]$ for $\mu$-almost every $x$. Note, as above, that the set of points $O \subset \mathcal{M}$ where the gradients of $\hat{\phi}$ and $\phi_\epsilon$ are not defined is of $\mu$-measure zero. So fix $x \in \mathcal{M} \setminus O$. Theorem 1 implies that the tangent vector $\nabla_x\phi_\epsilon(x) \in T_x\mathcal{M}$ converges in the Riemannian norm over $T_x\mathcal{M}$ to $\nabla_x\hat{\phi}(x) \in T_x\mathcal{M}$. Furthermore, by the Hopf-Rinow theorem, $\exp$ is defined over all of $T_x\mathcal{M}$, and it is continuous where it is defined (McCann, 2001). This shows the pointwise convergence. $\square$

As a result of theorem 2, the multi-block version of RCPM is also universal, because individual blocks can approximate the identity arbitrarily well according to theorem 1.
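As a concrete illustration of the block construction in eqs. (19) and (20) and of the geodesics in eq. (10), here is a hedged JAX sketch of a single RCPM block and its composition on $S^2$. It reuses `sphere_log`, `grad_phi`, and the other helpers from the sketch in sect. 4.1; all names are again ours.

```python
# Minimal sketch: RCPM blocks on S^2, building on the helpers defined above.
import jax.numpy as jnp

def sphere_exp(x, v):
    """Exponential map exp_x(v) on the unit sphere."""
    n = jnp.linalg.norm(v) + 1e-12
    return jnp.cos(n) * x + jnp.sin(n) * v / n

def rcpm_block(x, ys, alphas):
    """One flow block, eq. (19): s(x) = exp_x( -grad phi(x) )."""
    return sphere_exp(x, -grad_phi(x, ys, alphas))

def rcpm_flow(x, params):
    """Composition s = s_T o ... o s_1, eq. (3); params is a list of (ys, alphas)."""
    for ys, alphas in params:
        x = rcpm_block(x, ys, alphas)
    return x

def transport_geodesic(x, ys, alphas, l):
    """Constant-speed geodesic gamma(l) = exp_x( -l * grad phi(x) ), eq. (10)."""
    return sphere_exp(x, -l * grad_phi(x, ys, alphas))
```

With the hard min of eq. (12), each block maps $x$ to its minimizing component $y_{i^\star}$ (piecewise constant); the smoothed variant used for training is introduced in sect. 5.3.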
## 5. On Implementing and Training RCPMs

We now describe how to train Riemannian convex potential maps and how to increase their flexibility and expressivity through architectural choices that preserve c-concavity.

### 5.1. Variants of RCPM

Our basic model is the multi-block composition $s = s_T \circ \cdots \circ s_1$, where the $s_i$ are defined in eqs. (19) and (20). We also consider two variants of this model. Let us denote by $\sigma(s) = \min\{0, s\}$ the concave analog of the ReLU.

Multi-layer block on convex spaces. First, on some manifolds c-concave functions form a convex space, that is, a convex combination of c-concave functions is again c-concave. Examples of such spaces include Euclidean spaces, spheres (Sei, 2013), and products of spheres (e.g., tori) (Figalli et al., 2010a). One possibility to enrich our discrete c-concave model on such spaces is to convexly combine and compose multiple c-concave potentials, which preserves c-concavity, similar in spirit to ICNNs (Amos et al., 2017). In more detail, we define the c-concave potential of a single block $\varphi_j$, $j \in [T]$, to be a convex combination and composition of several discrete c-concave functions. For brevity let $\varphi = \varphi_j$, and define $\varphi = \eta_K$, where $\eta_K$ is given recursively by

$$\eta_k = (1 - w_{k-1})\,\phi_{k-1} + w_{k-1}\,\sigma(\eta_{k-1}), \tag{21}$$

where $k \in [K]$, the $w_k \in [0, 1]$ are learnable weights, and the $\phi_k \in \bar{\mathcal{C}}_d(\mathcal{M})$ are discrete c-concave functions used to define the $j$-th block. The RCPMs from sect. 4.2 can be reproduced with $K = 1$. In general, RCPMs in this case are composed of $T$ blocks, each built out of $K$ discrete c-concave functions.

Identity initialization. In the general case (i.e., even on manifolds where the c-concave functions do not form a convex space), one can still define

$$\varphi_j(x) = \sigma(\phi_j(x)). \tag{22}$$

We note that if all $\alpha_i \ge 0$ at initialization, then $\sigma(\phi_j(x)) \equiv 0$. In this case, the initial flow is the identity map, that is, $s(x) = x$. Indeed, the gradient of a constant function vanishes everywhere, and by definition of the OT map, $s(x) = \exp_x[0] = x$.

### 5.2. Learning

We now discuss how to train the proposed flow model. We denote by $\rho_\theta = s_{\theta\#}\mu$ the prior density pushed forward by our RCPM model $s_\theta$ with parameters $\theta$. To learn a target distribution $\nu$ we consider either minimizing the KL divergence between the generated distribution and the data distribution $\nu$,

$$\mathrm{KL}(\rho_\theta \,\|\, \nu) = \mathbb{E}_{x \sim \rho_\theta}\big[ \log\rho_\theta(x) - \log\nu(x) \big], \tag{23}$$

or minimizing the negative log-likelihood under the model,

$$\mathrm{NLL}(\theta) = -\mathbb{E}_{x \sim \nu}\big[ \log\rho_\theta(x) \big]. \tag{24}$$

We optimize these with gradient-based methods. For low-dimensional manifolds, the Jacobian log-determinants appearing in the computation of the KL/likelihood losses can be computed exactly and efficiently. For higher-dimensional manifolds, stochastic trace estimation techniques can be leveraged (Huang et al., 2020). Depending on the considered application, it may be more practical to parameterize either the forward mapping (from base to target) or the backward mapping (from target to base). For instance, in the density estimation context, the backward map from target samples to base samples is typically parameterized and can be trained by maximum likelihood (minimizing the NLL) using eq. (24).
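To make this concrete, here is a hedged JAX sketch of the maximum-likelihood objective of eq. (24) on $S^2$, where the parameterized map sends target samples to base samples and the log-determinant is taken over orthonormal tangent bases, following the reformulation of Rezende et al. (2020) mentioned later in sect. 5.4. The helper names (`tangent_basis`, `log_det_jac`, `nll`) are ours, and the sketch assumes a differentiable map, e.g., the soft-min-smoothed blocks introduced in the next subsection; it is an illustration, not the paper's training code.

```python
# Hedged sketch: maximum-likelihood objective (eq. (24)) on S^2 for a
# differentiable backward map `backward` (target -> base). Names are ours.
import jax
import jax.numpy as jnp

def tangent_basis(x):
    """An (arbitrary) orthonormal basis of T_x S^2, returned as a 3x2 matrix."""
    a = jnp.where(jnp.abs(x[0]) < 0.9,
                  jnp.array([1.0, 0.0, 0.0]), jnp.array([0.0, 1.0, 0.0]))
    e1 = a - jnp.dot(a, x) * x
    e1 = e1 / jnp.linalg.norm(e1)
    return jnp.stack([e1, jnp.cross(x, e1)], axis=1)

def log_det_jac(f, x):
    """log |det J_f(x)| of f restricted to the tangent planes at x and f(x)."""
    J = jax.jacfwd(f)(x)                                  # 3x3 ambient Jacobian
    Jt = tangent_basis(f(x)).T @ J @ tangent_basis(x)     # 2x2 tangent Jacobian
    return jnp.log(jnp.abs(jnp.linalg.det(Jt)))

def nll(backward, data, log_base_density):
    """Negative log-likelihood of `data` under the model, eq. (24)."""
    def log_lik(y):
        x = backward(y)                                   # map data to the base
        return log_base_density(x) + log_det_jac(backward, y)
    return -jnp.mean(jax.vmap(log_lik)(data))
```

For a uniform base on the unit sphere, `log_base_density` is the constant $-\log(4\pi)$; the reverse-KL objective of eq. (23) can be assembled analogously by pushing base samples forward and comparing the model and target log-densities.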
### 5.3. Smoothing via the soft-min operation

While the proposed layers $s_i$ are universal, they are defined using the gradients of the discrete c-concave potentials, which take the form $\nabla_x\phi(x) = -\log_x(y_{i^\star})$, where $y_{i^\star}$ is the argument minimizing the r.h.s. of eq. (12) (see also eq. (14)). This means that the offsets $\alpha_i$ receive no gradients. Intuitively, considering fig. 1, the $\alpha_i$ represent the heights of the different c-concave pieces, and since we only work with their derivatives, the heights are not seen by the optimizer. Furthermore, the potential gradients $\nabla_x\phi$ are discontinuous at meeting points of c-concave pieces.

We alleviate both problems by replacing the min operation with a soft-min operation, $\mathrm{min}_\gamma$, similarly to Cuturi & Blondel (2017). The soft-min operation $\mathrm{min}_\gamma$ is defined as

$$\mathrm{min}_\gamma(a_1, \ldots, a_n) = -\gamma \log \sum_{i=1}^{n} \exp(-a_i/\gamma). \tag{25}$$

In the limit $\gamma \to 0$, $\mathrm{min}_\gamma \to \min$. While c-concavity is not guaranteed to be preserved by this modification, it is recovered in the $\gamma \to 0$ limit. Also, gradients with respect to the offsets are no longer zero, and the $\alpha_i^{(j)}$ are optimized through the training process.
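The following is a minimal JAX sketch of this smoothing on $S^2$, reusing `sphere_dist` from the earlier sketch; the temperature default and the projection of the ambient gradient onto the tangent plane are our choices, not taken from the released code.

```python
# Minimal sketch: soft-min smoothing of the discrete c-concave potential, eq. (25).
import jax
import jax.numpy as jnp

def softmin(values, gamma):
    """min_gamma(a_1, ..., a_n) = -gamma * log sum_i exp(-a_i / gamma)."""
    return -gamma * jax.scipy.special.logsumexp(-values / gamma)

def smooth_phi(x, ys, alphas, gamma=0.1):
    """Smoothed version of eq. (12): the soft-min replaces the hard min."""
    costs = 0.5 * jax.vmap(lambda y: sphere_dist(x, y))(ys) ** 2
    return softmin(costs + alphas, gamma)

def smooth_grad_phi(x, ys, alphas, gamma=0.1):
    """Riemannian gradient: ambient gradient projected onto T_x S^2."""
    g = jax.grad(smooth_phi)(x, ys, alphas, gamma)
    return g - jnp.dot(g, x) * x
```

Because the soft-min is differentiable in all of its arguments, gradients now also flow to the offsets, e.g., via `jax.grad(smooth_phi, argnums=2)`, which is precisely the point of the smoothing.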
### 5.4. Discussion

We now discuss some practice-theory gaps and mark interesting open questions and future work directions.

Model smoothing and optimization. The construction in Section 4, i.e., the exponential map of a discrete c-concave function, is an optimal transport map and is universal in the sense that it can approximate any OT map between an absolutely continuous $\mu$ and an arbitrary $\nu$ over a compact Riemannian manifold. It is not, however, a diffeomorphism. As a practical way of optimizing this model to approximate an arbitrary $\nu$, we suggested smoothing the min operation with a soft-min. If this, now differentiable, function is c-concave, then the smoothed version leads to a diffeomorphism (flow). While we are unable to prove that the soft version is c-concave, we verified numerically that it indeed leads to a diffeomorphism (see fig. 8, Appendix). We leave the question of whether the soft-min operator preserves c-concavity on the sphere and more general manifolds to future work. Furthermore, it is worth noting that the proposed model can potentially be optimized by methods other than as a flow, for instance by directly optimizing the Wasserstein loss similarly to Makkuva et al. (2019) in the Euclidean case, or by semi-discrete transport methods (Peyré & Cuturi, 2019). Both would not require the map to be a diffeomorphism. We leave such directions to future work as well.

Scalability. We follow Rezende et al. (2020), which relies on reformulating the log-determinant in terms of the Jacobian expressed via an orthonormal basis of the tangent space. The Jacobian determinant term is similar to the Euclidean case, suggesting that the high-dimensional case can reuse techniques from Huang et al. (2020). Table 2 shows that our running times are comparable to Rezende et al. (2020).

Table 1. We trained an RCPM to optimize the KL on the 4-mode dataset shown in fig. 3 and compare the KL and ESS to the Möbius-spline flow (MS) and exponential-map sum-of-radial flow (EMSRE) from Rezende et al. (2020). We report the mean and standard deviation over 10 trials of the RCPM.

| Model | KL [nats] | ESS |
| --- | --- | --- |
| Möbius-spline Flow | 0.05 (0.01) | 90% |
| Radial Flow | 0.10 (0.10) | 85% |
| RCPM | 0.003 (0.0004) | 99.3% |

Table 2. Comparison of the runtime per training iteration of our model with Rezende et al. (2020) over 1000 trials with a batch size of 256.

| Method | Runtime (sec/iteration) |
| --- | --- |
| Radial (N_T = 1, K = 12) | 2.05×10⁻³ ± 1.33×10⁻⁴ |
| Radial (N_T = 6, K = 5) | 6.26×10⁻³ ± 2.95×10⁻⁴ |
| Radial (N_T = 24, K = 1) | 1.92×10⁻² ± 5.24×10⁻⁴ |
| RCPM (N_T = 5, K = 68) | 8.79×10⁻³ ± 1.81×10⁻⁴ |

## 6. Experiments

This section empirically demonstrates the practicality and flexibility of RCPMs. We consider synthetic manifold learning tasks similar to Rezende et al. (2020); Lou et al. (2020) on both spheres and tori, as well as a real-life application over the sphere. We cover the different use cases of RCPMs: density estimation, mapping estimation, and geodesic transport.

### 6.1. Synthetic Sphere Experiments

KL training. Our first experiment is taken from Rezende et al. (2020); the task is to train a Riemannian flow generating a 4-modal distribution defined on the sphere $S^2$ via reverse KL minimization. This experiment allows a quantitative comparison of the different models' theoretical and practical expressiveness. We report results obtained with their best performing models: a Möbius-spline flow and a radial flow. The latter is an exponential-map flow with radial layers (24 blocks of 1 component each). We train a 5-block RCPM. The exponential map and the intrinsic distance required for RCPMs are closed-form for the sphere. More implementation details are provided in app. C.

Figure 3. Learned RCPMs on the sphere. Following Rezende et al. (2020), we learn the first density with the reverse KL, and following Lou et al. (2020), we learn the second with maximum likelihood.

Results are logged in table 1. Notably, our model significantly outperforms both baseline models with a KL of 0.003, almost an order of magnitude smaller than the runner-up with a KL of 0.05. This highlights the expressive power of the RCPM model class. We also provide a visualization of the trained RCPM in fig. 3 (top row), where we show KDE estimates performed in spherical coordinates with a bandwidth of 0.2. Finally, table 2 compares the runtime per training iteration of our model with the models of Rezende et al. (2020) over 1000 trials with a batch size of 256. Our model's speed is comparable to theirs while leading to significantly improved KL/ESS.

Likelihood training. We demonstrate an RCPM trained via maximum likelihood on a more challenging dataset, the checkerboard, also studied in Lou et al. (2020). Figure 3 (bottom) shows the RCPM-generated density on the right. We found visualizing the density of our model challenging because some regions had unusually high density values around the poles. We hence binarized the density plot and provide the original density in fig. 9 (Supplementary).

### 6.2. Synthetic Torus Experiment

We now consider an experiment on the torus $T^2 = S^1 \times S^1$. The exponential map and intrinsic distance required for RCPMs are known in closed form. The exponential-map flows in Rezende et al. (2020) do not apply to this setting, as their c-concave layers are specific to the sphere. We train a 6-block RCPM model (with 1 layer per block) by KL minimization. As can be inspected from fig. 5, the RCPM model is able to recover the target density accurately, achieving a KL of 0.03 and an ESS of 94.7.
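For reference, here is a hedged sketch of the closed-form ingredients an RCPM needs on the flat torus $T^2 = S^1 \times S^1$, representing each circle factor by an angle in $[0, 2\pi)$; this parameterization and the helper names are our choices, not taken from the paper's released code.

```python
# Hedged sketch: exponential map, log map, and intrinsic distance on the flat
# torus T^2 = S^1 x S^1, with points stored as angle pairs in [0, 2*pi).
import jax.numpy as jnp

def circle_log(theta_x, theta_y):
    """Log map on S^1: signed angular difference wrapped to (-pi, pi]."""
    return jnp.mod(theta_y - theta_x + jnp.pi, 2.0 * jnp.pi) - jnp.pi

def torus_log(x, y):
    """Log map on T^2, applied factor-wise to the two angles."""
    return circle_log(x, y)

def torus_dist(x, y):
    """Intrinsic distance on T^2 under the product metric."""
    return jnp.sqrt(jnp.sum(torus_log(x, y) ** 2))

def torus_exp(x, v):
    """Exponential map on T^2: move along each angle and wrap back to [0, 2*pi)."""
    return jnp.mod(x + v, 2.0 * jnp.pi)
```

Swapping these for the sphere helpers in the earlier sketches gives the torus version of the discrete c-concave potential and flow blocks.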
### 6.3. Case Study: Continental Drift³

Finally, we consider a real-world application of our model on geological data in the context of continental drift (Wilson, 1963). We aim to demonstrate the versatility and flexibility of the framework with three distinct settings: mapping estimation, density estimation, and geodesic transport, all through the lens of RCPMs.

Figure 4. We trained a 7-block RCPM flow to map a base density of ground mass on Earth 90 million years ago (left panel) to such a density over current Earth (right panel); the middle panel shows the RCPM result. To learn, we minimize the KL divergence between the model and the target distribution.

Figure 5. We trained an RCPM to learn a 3-modal density on the torus $T^2 = S^1 \times S^1$ (KL: 0.03, ESS: 94.7).

Mapping estimation. We begin with mapping estimation. We aim to learn a flow $t$ mapping the base distribution of ground mass on Earth 90 million years ago (fig. 4, left) to the target, a ground-mass distribution on current Earth (fig. 4, right). We train a 7-block RCPM with 3-layer blocks (see sect. 5.1) by minimizing the KL divergence between the model and target distributions. In fig. 4 (middle), we show the RCPM result, where it successfully learns to recover the target density over current Earth. Hence, the mapping $t$ can be used to map mass from old Earth to current Earth.

Transport geodesics. We next demonstrate the use of transport geodesics induced by exponential-map flows. We train a 1-block RCPM, which allows us to recover approximations of optimal-transport geodesics following eq. (10). These curves are induced by the transport mappings $\exp_x(-l\,\nabla\phi(x))$, $l \in [0, 1]$, which we visualize for a grid of starting points $x_0$ on the sphere in fig. 6. Such geodesics illustrate the optimal transport evolution of Earth's ground mass across time. This relates to the well-known and studied geological process of continental drift (Wilson, 1963). The North American and Eurasian tectonic plates move away from each other at a small rate per year, which is illustrated in fig. 6. Denote as the junction the boundary between the Eurasian and North American continents on old Earth. We observe that particles $x_0$ located to the right of the junction have geodesics transporting them towards the right, while particles located to the left of the junction are transported towards the left, which is the expected behavior given the evolution of continental locations across time (see fig. 4, left and right).

Figure 6. Plot of the transport geodesics arising from a 1-block RCPM trained in the setting of fig. 4, following eq. (10). We observe that samples stretch according to continental movements.

Density estimation. Finally, we consider RCPMs as density estimation tools. In this setting, we aim to learn a flow from a known base distribution (e.g., uniform on the sphere) to a target distribution (e.g., the distribution of mass over Earth) given samples from the latter. We train an RCPM model with 6 blocks (and 1 layer per block) by maximum likelihood. We show the results for this experiment in fig. 7 and observe that the model is able to recover the distribution of mass on current Earth.

Figure 7. We trained a 6-block RCPM in the density estimation setting. The base distribution is the uniform distribution on the sphere and the target is the ground mass of current Earth.

³ The source maps of figs. 4, 6 and 7 are © 2020 Colorado Plateau Geosystems Inc.

## 7. Conclusion

In this paper, we propose to build flows on compact Riemannian manifolds following the celebrated theory of McCann (2001), that is, using the exponential map applied to gradients of c-concave functions. Our main contribution is observing that the rather intricate space of c-concave functions over an arbitrary compact Riemannian manifold can be approximated with discrete c-concave functions. We provide a theoretical result showing that discrete c-concave functions are dense in the space of c-concave functions. We use this theoretical result to prove that maps defined via discrete c-concave potentials are universal; namely, they can approximate arbitrary optimal transport maps between an absolutely continuous source distribution and an arbitrary target distribution on the manifold. We build upon this theory to design a practical model, RCPM, that can be applied to any manifold where the exponential map and the intrinsic distance are known, and that enjoys maximal expressive power. We experimented with RCPM and used it to train flows on spheres and tori, on both synthetic and real data. We observed that RCPM outperforms previous approaches on standard manifold flow tasks.
We also provided a case study demonstrating the potential of RCPMs for applications in geology. Future work includes training flows on more general manifolds, e.g., manifolds defined by signed distance functions, and using RCPMs on other manifold learning tasks where the expressive power of RCPM can potentially make a difference. One particularly interesting avenue is generalizing the estimation of barycenters (means) of probability measures from Euclidean spaces to the Riemannian setting through the use of discrete c-concave functions.

## Acknowledgments

We thank Ricky Chen, Laurent Dinh, Maximilian Nickel, and Marc Deisenroth for insightful discussions and acknowledge the Python community (Van Rossum & Drake Jr, 1995; Oliphant, 2007) for creating the core tools that enabled our work, including JAX (Bradbury et al., 2018), Hydra (Yadan, 2019), Jupyter (Kluyver et al., 2016), Matplotlib (Hunter, 2007), numpy (Oliphant, 2006; Van Der Walt et al., 2011), pandas (McKinney, 2012), and SciPy (Jones et al., 2014).

## References

Alvarez-Melis, D., Mroueh, Y., and Jaakkola, T. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp. 1606-1617, 2020.

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146-155, 2017.

Bose, J., Smofsky, A., Liao, R., Panangaden, P., and Hamilton, W. Latent variable modelling with hyperbolic normalizing flows. In International Conference on Machine Learning, pp. 1045-1055. PMLR, 2020.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Brehmer, J. and Cranmer, K. Flows for simultaneous manifold learning and density estimation. arXiv preprint arXiv:2003.13913, 2020.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6572-6583, 2018.

Cuturi, M. and Blondel, M. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 894-903, 2017.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Figalli, A. and Rifford, L. Continuity of optimal transport maps and convexity of injectivity domains on small deformations of S². Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 62(12):1670-1706, 2009.

Figalli, A. and Villani, C. Optimal Transport and Curvature, pp. 171-217. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. ISBN 978-3-642-21861-3. doi: 10.1007/978-3-642-21861-3_4. URL https://doi.org/10.1007/978-3-642-21861-3_4.

Figalli, A., Kim, Y.-H., and McCann, R. Regularity of optimal transport maps on multiple products of spheres. Journal of the European Mathematical Society, 15, 06 2010a. doi: 10.4171/JEMS/388.

Figalli, A., Kim, Y.-H., and McCann, R. Regularity of optimal transport maps on multiple products of spheres. Journal of the European Mathematical Society, 15, 06 2010b. doi: 10.4171/JEMS/388.

Gangbo, W. and McCann, R. J. The geometry of optimal transportation. Acta Math., 177(2):113-161, 1996. doi: 10.1007/BF02392620. URL https://doi.org/10.1007/BF02392620.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pp. 2672-2680, 2014.

Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1851-1860, 2018.

Hoyos-Idrobo, A. Aligning hyperbolic representations: an optimal transport-based approach. arXiv: Machine Learning, 2020.

Huang, C.-W., Chen, R. T. Q., Tsirigotis, C., and Courville, A. Convex potential flows: Universal probability distributions with optimal transport and convex optimization, 2020.

Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90, 2007.

Jones, E., Oliphant, T., and Peterson, P. SciPy: Open source scientific tools for Python. 2014.

Kim, Y.-H. and McCann, R. J. Towards the smoothness of optimal maps on Riemannian submersions and Riemannian products (of round spheres in particular). Journal für die reine und angewandte Mathematik, 2012(664):1-27, 2012.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, 2014.

Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J. B., Grout, J., Corlay, S., et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB, pp. 87-90, 2016.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: sampling configurations for multi-body systems with symmetric energies. ArXiv, abs/1910.00753, 2019.

Kong, Z. and Chaudhuri, K. The expressive power of a class of normalizing flow models. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp. 3599-3609, 2020.

Korotin, A., Egiazarian, V., Asadulaev, A., Safin, A., and Burnaev, E. Wasserstein-2 generative networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bEoxzW_EXsa.

Lee, P. W. Y. and Li, J. New examples satisfying Ma-Trudinger-Wang conditions. SIAM J. Math. Anal., 44(1):61-73, 2012. doi: 10.1137/110820543. URL https://doi.org/10.1137/110820543.

Loeper, G. On the regularity of solutions of optimal transportation problems. Acta Math., 202(2):241-283, 2009. doi: 10.1007/s11511-009-0037-8. URL https://doi.org/10.1007/s11511-009-0037-8.

Lou, A., Lim, D., Katsman, I., Huang, L., Jiang, Q., Lim, S. N., and De Sa, C. M. Neural manifold ordinary differential equations. Advances in Neural Information Processing Systems, 33, 2020.

Lucet, Y. Faster than the fast Legendre transform, the linear-time Legendre transform. Numerical Algorithms, 16(2):171-185, 1997.

Makkuva, A., Taghvaei, A., Oh, S., and Lee, J. Optimal transport mapping via input convex neural networks. In Proceedings of the 37th International Conference on Machine Learning, pp. 6672-6681, 2020.

Makkuva, A. V., Taghvaei, A., Oh, S., and Lee, J. D. Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962, 2019.

Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33, 2020.

McCann, R. J. Polar factorization of maps on Riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589-608, 2001.
McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

Oliphant, T. E. A Guide to NumPy, volume 1. Trelgol Publishing, 2006.

Oliphant, T. E. Python for scientific computing. Computing in Science & Engineering, 9(3):10-20, 2007.

Papamakarios, G., Nalisnick, E. T., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. abs/1912.02762, 2019.

Peyré, G. and Cuturi, M. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530-1538, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278-1286, 2014.

Rezende, D. J., Racanière, S., Higgins, I., and Toth, P. Equivariant Hamiltonian flows. In ML for Physical Sciences, NeurIPS 2019 Workshop, 2019.

Rezende, D. J., Papamakarios, G., Racanière, S., Albergo, M. S., Kanwar, G., Shanahan, P. E., and Cranmer, K. Normalizing flows on tori and spheres. arXiv preprint arXiv:2002.02428, 2020.

Sei, T. A Jacobian inequality for gradient maps on the sphere and its application to directional statistics. Communications in Statistics - Theory and Methods, 42(14):2525-2542, 2013.

Takashi, S. Riemannian Geometry. Translations of Mathematical Monographs. American Mathematical Society, Providence, R.I., 1996. ISBN 0-8218-0284-4.

Van Der Walt, S., Colbert, S. C., and Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.

Van Rossum, G. and Drake Jr, F. L. Python Reference Manual. Centrum voor Wiskunde en Informatica, Amsterdam, 1995.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wilson, J. T. Continental drift. Scientific American, 208(4):86-103, 1963. ISSN 00368733, 19467087. URL http://www.jstor.org/stable/24936535.

Yadan, O. Hydra - a framework for elegantly configuring complex applications. GitHub, 2019. URL https://github.com/facebookresearch/hydra.