# Riemannian Continuous Normalizing Flows

Emile Mathieu, Maximilian Nickel
emile.mathieu@stats.ox.ac.uk, maxn@fb.com
Department of Statistics, University of Oxford, UK
Facebook Artificial Intelligence Research, New York, USA

*Work done while at Facebook AI Research. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

Normalizing flows have shown great promise for modelling flexible probability distributions in a computationally tractable way. However, whilst data is often naturally described on Riemannian manifolds such as spheres, tori, and hyperbolic spaces, most normalizing flows implicitly assume a flat geometry, making them either misspecified or ill-suited in these situations. To overcome this problem, we introduce Riemannian continuous normalizing flows, a model which admits the parametrization of flexible probability measures on smooth manifolds by defining flows as the solution to ordinary differential equations. We show that this approach can lead to substantial improvements on both synthetic and real-world data when compared to standard flows or previously introduced projected flows.

## 1 Introduction

*Figure 1: Trajectories generated on the sphere to model volcano eruptions. Note that these converge to the known Ring of Fire.*

Learning well-specified probabilistic models is at the heart of many problems in machine learning and statistics. Much focus has therefore been placed on developing methods for modelling and inferring expressive probability distributions. Normalizing flows (Rezende and Mohamed, 2015) have shown great promise for this task as they provide a general and extensible framework for modelling highly complex and multimodal distributions (Papamakarios et al., 2019). An orthogonal but equally important aspect of well-specified models is to correctly characterize the geometry which describes the proximity of data points. Riemannian manifolds provide a general framework for this purpose and are a natural approach to model tasks in many scientific fields ranging from earth and climate science to biology and computer vision. For instance, storm trajectories may be modelled as paths on the sphere (Karpatne et al., 2019), the shape of proteins can be parametrized using tori (Hamelryck et al., 2006), cell developmental processes can be described through paths in hyperbolic space (Klimovskaia et al., 2020), and human actions can be recognized in video using matrix manifolds (Lui, 2012). If appropriately chosen, manifold-informed methods can lead to improved sample complexity and generalization, improved fit in the low-parameter regime, and can guide inference methods towards interpretable models. They can also be understood as a geometric prior that encodes a practitioner's assumptions about the data and imposes an inductive bias.

However, conventional normalizing flows are not readily applicable to such manifold-valued data since their implicit Euclidean assumption makes them unaware of the underlying geometry or the borders of the manifold. As a result, they would yield distributions having some or all of their mass lying outside the manifold, rendering them ill-suited or even misspecified, so that central concepts like the reverse Kullback-Leibler (KL) divergence would not even be defined. In this work, we propose a principled way to combine both of these aspects and parametrize flexible probability distributions on Riemannian manifolds.
Specifically, we introduce Riemannian continuous normalizing flows, in which flows are defined via vector fields on manifolds and computed as the solutions to the associated ordinary differential equations (ODEs) (see Figure 1 for an illustration). Intuitively, our method operates by first parametrizing a vector field on the manifold with a neural network, then sampling particles from a base distribution, and finally approximating their flow along the vector field using a numerical solver. Both the neural network and the solver are aware of the underlying geometry, which ensures that the flow is always located on the manifold, yielding a Riemannian method. This approach allows us to combine multiple important advantages: One major challenge of normalizing flows lies in designing transformations that enable efficient sampling and density computation. By basing our approach on continuous normalizing flows (CNFs) (Chen et al., 2018; Grathwohl et al., 2019; Salman et al., 2018), we avoid imposing strong structural constraints on the flow, as is the case for most discrete normalizing flows. Such unconstrained free-form flows have empirically been shown to be highly expressive (Chen et al., 2019; Grathwohl et al., 2019). Moreover, projected methods require a differentiable mapping from a Euclidean space to the manifold, yet such a function cannot be bijective, which in turn leads to numerical challenges. By taking a Riemannian approach, our method is more versatile since it does not rely on an ad-hoc projection map and simultaneously reduces numerical artefacts that interfere with training. To the best of our knowledge, our method is the first to combine these properties, as existing methods for normalizing flows on manifolds are either discrete (Bose et al., 2020; Rezende et al., 2020), projected (Gemici et al., 2016; Falorsi et al., 2019; Bose et al., 2020) or manifold-specific (Sei, 2011; Bose et al., 2020; Rezende et al., 2020).

We empirically demonstrate the advantages of our method on constant curvature manifolds, i.e. the Poincaré disk and the sphere, and show the benefits of the proposed approach compared to non-Riemannian and projected methods for maximum likelihood estimation and reverse KL minimization. We also apply our method to density estimation on earth-sciences data (e.g., locations of earthquakes, floods and wildfires) and show that it yields better generalization performance and faster convergence.

## 2 Continuous Normalizing Flows on Riemannian Manifolds

Normalizing flows operate by pushing a simple base distribution through a series of parametrized invertible maps, referred to as the flow. This can yield a highly complex and multimodal distribution, which is typically assumed to live in a Euclidean vector space. Here, we propose a principled approach to extend normalizing flows to manifold-valued data, i.e. Riemannian continuous normalizing flows (RCNFs). Following CNFs (Chen et al., 2018; Grathwohl et al., 2019; Salman et al., 2018), we define manifold flows as the solutions to ODEs. The high-level idea is to parametrize flows through the time-evolution of manifold-valued particles $z$, in particular via their velocity $\dot{z}(t) = f_\theta(z(t), t)$, where $f_\theta$ denotes a vector field. Particles are first sampled from a simple base distribution, and their evolution is then integrated by a manifold-aware numerical solver, yielding a new, complex and multimodal distribution of the particles.
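To make this recipe concrete, the following is a minimal, self-contained sketch on the sphere $\mathbb{S}^2$. It is illustrative only: the network, step count, and fixed-step Euler rule below are placeholder assumptions, whereas the paper relies on an adaptive Runge-Kutta solver and the vector field architecture described later in this section.

```python
# Minimal sketch of the RCNF recipe on S^2 (illustrative, not the paper's code):
# (1) a neural vector field, (2) base samples, (3) geometry-aware integration
# that keeps particles on the manifold.
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Time-dependent vector field f_theta(z, t), projected onto the tangent space T_z S^2."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 3),
        )

    def forward(self, z, t):
        tt = torch.full_like(z[..., :1], t)            # broadcast scalar time
        v = self.net(torch.cat([z, tt], dim=-1))
        return v - (v * z).sum(-1, keepdim=True) * z   # remove the radial component

def flow(f, z0, n_steps=20):
    """Fixed-step Euler integration, projecting back onto S^2 after every step."""
    z = z0
    for k in range(n_steps):
        z = z + f(z, k / n_steps) / n_steps
        z = z / z.norm(dim=-1, keepdim=True)           # projective step: cheap l2 normalisation
    return z

f_theta = VectorField()
z0 = torch.randn(128, 3)
z0 = z0 / z0.norm(dim=-1, keepdim=True)                # uniform base samples on S^2
samples = flow(f_theta, z0)                            # pushforward samples, still on S^2
```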
This Riemannian and continuous approach has the advantages of allowing almost free-form neural networks and of not requiring any mapping from a Euclidean space, which would potentially lead to numerical challenges. For practical purposes, we focus our theoretical and experimental discussion on constant curvature manifolds (see Table 1). In addition to being widely used in the literature (Nickel and Kiela, 2017; Davidson et al., 2018; Mardia and Jupp, 2000; Hasnat et al., 2017), these manifolds are convenient to work with since most related geometrical quantities are available in closed form. However, our proposed approach is generic and could be used on a broad class of manifolds, such as product and matrix manifolds like tori and Grassmannians. For a brief overview of relevant concepts in Riemannian geometry, please see Appendix A.1 or Lee (2003) for a more thorough introduction.

*Table 1: Summary of d-dimensional continuous constant (sectional) curvature manifolds.*

| Geometry | Model | Curvature | Coordinates | $\sqrt{\det g} = \mathrm{dVol}/\mathrm{dLeb}$ | Compact |
|---|---|---|---|---|---|
| Euclidean | $\mathbb{R}^d$ real vector space | $K = 0$ | Cartesian $z$ | $1$ | No |
| Hyperbolic | $\mathbb{B}^d_K$ Poincaré ball | $K < 0$ | Cartesian $z$ | $\left(2 / (1 + K \|z\|^2)\right)^d$ | No |
| Elliptic | $\mathbb{S}^d_K$ hypersphere | $K > 0$ | $n$-spherical $\phi$ | $K^{-\frac{d-1}{2}} \prod_{i=1}^{d-2} \sin(\phi_i)^{d-i-1}$ | Yes |

In the following, we develop the key components which allow us to define continuous normalizing flows that are aware of the underlying Riemannian geometry: flow, likelihood, and vector field.

**Vector flows** Flows in conventional normalizing flows are defined as smooth mappings $\phi : \mathbb{R}^d \to \mathbb{R}^d$ which transform a base distribution $z \sim P_0$ into a complex distribution $P_\theta$. For normalizing flows to be well-behaved and convenient to work with, the flow is required to be bijective and differentiable, which introduces significant structural constraints on $\phi$. Continuous normalizing flows overcome this issue by defining the flow $\phi : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ generated by an ordinary differential equation, allowing for unrestricted neural network architectures. Here we show how vector fields can be used to define similar flows $\phi : M \times \mathbb{R} \to M$ on general Riemannian manifolds.

Consider the temporal evolution of a particle $z(t)$ lying on a d-dimensional manifold $M$, whose velocity is given by a vector field $f_\theta(z(t), t)$. Intuitively, $f_\theta(z(t), t)$ indicates the direction and speed along which the particle is moving on the manifold's surface. Classic examples of such vector fields include weathercocks giving wind direction and compasses pointing toward the magnetic north pole of the earth. Formally, let $T_z M$ denote the tangent space at $z$ and $TM = \bigcup_{z \in M} T_z M$ the associated tangent bundle. Furthermore, let $f_\theta : M \times \mathbb{R} \to TM$ denote a vector field on $M$. The particle's time-evolution according to $f_\theta$ is then given by the following ODE

$$\frac{\mathrm{d}z(t)}{\mathrm{d}t} = f_\theta(z(t), t). \tag{1}$$

To transform a base distribution using this vector field, we are then interested in a particle's position after time $t$. When starting at an initial position $z(0) = z_0$, the flow operator $\phi : M \times \mathbb{R} \to M$ gives the particle's position at any time $t$ as $z(t) = \phi(z_0, t)$. Leveraging the fundamental theorem of flows (Lee, 2003), we can show that under mild conditions this flow is bijective and differentiable. We write $C^1$ for the set of differentiable functions whose derivatives are continuous.

**Proposition 1 (Vector flows).** Let $M$ be a smooth complete manifold. Furthermore, let $f_\theta$ be a $C^1$-bounded time-dependent vector field. Then there exists a global flow $\phi : M \times \mathbb{R} \to M$ such that for each $t \in \mathbb{R}$, the map $\phi(\cdot, t) : M \to M$ is a $C^1$-diffeomorphism (i.e. a $C^1$ bijection with $C^1$ inverse).

*Proof.* See Appendix E.1 for a detailed derivation.
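As a toy illustration of Proposition 1 (not part of the paper), the rotation field $f(z) = \omega \times z$ is smooth, bounded, and tangent to $\mathbb{S}^2$, and its flow is a rotation of the sphere, hence a diffeomorphism. The sketch below checks numerically that flowing forward and then backward in time returns the particles to their starting points up to discretization error.

```python
# Toy check of Proposition 1 on S^2: the rotation field f(z) = omega x z generates
# a flow that is a diffeomorphism (a rotation); flowing forward then backward
# approximately recovers the initial particles.
import torch

omega = torch.tensor([0.0, 0.0, 1.5])                    # fixed angular-velocity vector

def f(z):
    return torch.cross(omega.expand_as(z), z, dim=-1)    # tangent to the sphere

def integrate(z, direction=1.0, n_steps=1000):
    for _ in range(n_steps):
        z = z + direction * f(z) / n_steps
        z = z / z.norm(dim=-1, keepdim=True)              # stay on the manifold
    return z

z0 = torch.randn(64, 3)
z0 = z0 / z0.norm(dim=-1, keepdim=True)
z1 = integrate(z0, direction=1.0)                         # flow forward for unit time
z0_rec = integrate(z1, direction=-1.0)                    # flow backward with -f
print((z0 - z0_rec).abs().max())                          # small, illustrating invertibility
```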
Note that scaling the vector field as $f^\alpha_\theta \triangleq \alpha f_\theta$ results in a time-scaled flow $\phi^\alpha(z, t) = \phi(z, \alpha t)$. The integration duration $t$ is therefore arbitrary; without loss of generality we set $t = 1$ and write $\phi \triangleq \phi(\cdot, 1)$.

Concerning the evaluation of the flow $\phi$, it generally does not admit a closed-form solution and thus needs to be approximated numerically. To this end, we rely on an explicit and adaptive Runge-Kutta (RK) integrator of order 4 (Dormand and Prince, 1980). However, standard integrators used in CNFs generally do not preserve manifold constraints (Hairer, 2006). To overcome this issue, we rely on a projective solver (Hairer, 2011). This solver works by conveniently solving the ODE in the ambient Cartesian coordinates and projecting each step onto the manifold. Projections onto $\mathbb{S}^d$ are computationally cheap since they amount to divisions by the $\ell_2$ norm. No projection is required for the Poincaré ball.

**Likelihood** Having a flow at hand, we are now interested in evaluating the likelihood of our pushforward model $P_\theta = \phi_* P_0$. Here, the pushforward operator $*$ indicates that one obtains samples $z \sim \phi_* P_0$ as $z = \phi(z_0)$ with $z_0 \sim P_0$. For this purpose, we derive in the following the change in density in terms of the geometry of the manifold and show how to efficiently estimate the likelihood.

**Change in density** In normalizing flows, we can compute the likelihood of a sample via the change in density from the base distribution to the pushforward. Applying the chain rule we get

$$\log p_\theta(z) - \log p_0(z_0) = -\log \left| \det \frac{\partial \phi(z_0)}{\partial z_0} \right|. \tag{2}$$

In general, computing the determinant of the flow's Jacobian is challenging since it requires $d$ reverse-mode automatic differentiations to obtain the full Jacobian matrix, and $O(d^3)$ operations to compute its determinant. CNFs side-step direct computation of the determinant by leveraging the time-continuity of the flow and re-expressing Equation 2 as the integral of the instantaneous change in log-density, $\int_0^t \frac{\partial \log p_\theta(z(t))}{\partial t} \, \mathrm{d}t$. However, standard CNFs make an implicit Euclidean assumption when computing this quantity, which is violated for general Riemannian manifolds.

To overcome this issue, we express the instantaneous change in log-density in terms of the Riemannian metric. In particular, let $G(z)$ denote the matrix representation of the Riemannian metric for a given manifold $M$; then $G(z)$ endows the tangent spaces $T_z M$ with an inner product. For instance, in the Poincaré ball $\mathbb{B}^d_K$ it holds that $G(z) = \left(2 / (1 + K \|z\|^2)\right)^2 I_d$, while in Euclidean space $\mathbb{R}^d$ we have $G(z) = I_d$, where $I_d$ denotes the identity matrix. Using the Liouville equation, we can then show that the instantaneous change of variables is given as follows.

**Proposition 2 (Instantaneous change of variables).** Let $z(t)$ be a continuous manifold-valued random variable given in local coordinates, which is described by the ODE from Equation 1 and has probability density $p_\theta(z(t))$. The change in log-probability then also follows a differential equation given by

$$\frac{\partial \log p_\theta(z(t))}{\partial t} = -\mathrm{div}\big(f_\theta(z(t), t)\big) = -|G(z(t))|^{-\frac{1}{2}} \, \mathrm{tr}\left( \frac{\partial}{\partial z}\left[ |G(z(t))|^{\frac{1}{2}} f_\theta(z(t), t) \right] \right) \tag{3}$$

$$= -\mathrm{tr}\left( \frac{\partial f_\theta(z(t), t)}{\partial z} \right) - |G(z(t))|^{-\frac{1}{2}} \left\langle f_\theta(z(t), t), \frac{\partial |G(z(t))|^{\frac{1}{2}}}{\partial z} \right\rangle. \tag{4}$$

*Proof.* For a detailed derivation of Equation 3, see Appendix C. Note that in the Euclidean setting $|G(z)| = 1$, thus the second term of Equation 4 vanishes and we recover the formula from Grathwohl et al. (2019) and Chen et al. (2018).
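To make Proposition 2 concrete, here is a small sketch of the exact Riemannian divergence of Equation 3 on the Poincaré ball with curvature $K = -1$, computed with autograd Jacobians. The vector field `f_theta` below is a toy stand-in, not the paper's neural architecture.

```python
# Illustrative computation of the Riemannian divergence in Eq. (3) on the
# Poincaré ball (curvature K = -1), using an exact autograd Jacobian at one point.
import torch

K = -1.0

def sqrt_det_G(z):
    """sqrt|G(z)| = (2 / (1 + K ||z||^2))^d for the Poincaré ball (cf. Table 1)."""
    d = z.shape[-1]
    return (2.0 / (1.0 + K * z.pow(2).sum(-1))) ** d

def f_theta(z, t):
    """A toy smooth vector field standing in for the neural network."""
    return torch.tanh(z) * torch.sin(t)

def riemannian_divergence(z, t):
    # div f = |G|^{-1/2} * sum_i d/dz_i ( |G|^{1/2} f_i )
    def weighted_field(z_):
        return sqrt_det_G(z_) * f_theta(z_, t)
    J = torch.autograd.functional.jacobian(weighted_field, z)   # full (d, d) Jacobian
    return J.diagonal().sum() / sqrt_det_G(z)

z = torch.tensor([0.3, -0.5])          # a point inside the unit disk B^2
print(riemannian_divergence(z, torch.tensor(0.2)))
```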
**Estimating the divergence** Even though the determinant of Equation 2 has been replaced in Equation 3 by a trace operator with lower computational complexity, we would still need to compute the full Jacobian matrix of $f_\theta$. Similarly to Grathwohl et al. (2019) and Salman et al. (2018), we make use of Hutchinson's trace estimator to estimate the trace efficiently. In particular, Hutchinson (1990) showed that $\mathrm{tr}(A) = \mathbb{E}_{p(\epsilon)}[\epsilon^\top A \epsilon]$, with $\epsilon$ a d-dimensional random vector such that $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Cov}(\epsilon) = I_d$. Leveraging this trace estimator to approximate the divergence in Equation 3 yields

$$\mathrm{div}\big(f_\theta(z(t), t)\big) = |G(z(t))|^{-\frac{1}{2}} \, \mathbb{E}_{p(\epsilon)}\left[ \epsilon^\top \frac{\partial \left( |G(z(t))|^{\frac{1}{2}} f_\theta(z(t), t) \right)}{\partial z} \, \epsilon \right]. \tag{5}$$

We note that the variance of this estimator can potentially be high since it scales with the inverse of the determinant term $|G(z(t))|$ (see Appendix D.2). By integrating Equation 3 over time with the stochastic divergence estimator from Equation 5, we get the following total change in log-density between the manifold-valued random variables $z$ and $z_0$:

$$\log p_\theta(z) - \log p_0(z_0) = -\int_0^1 \mathrm{div}\big(f_\theta(z(t), t)\big) \, \mathrm{d}t = -\mathbb{E}_{p(\epsilon)}\left[ \int_0^1 |G(z(t))|^{-\frac{1}{2}} \, \epsilon^\top \frac{\partial \left( |G(z(t))|^{\frac{1}{2}} f_\theta(z(t), t) \right)}{\partial z} \, \epsilon \, \mathrm{d}t \right]. \tag{6}$$

It can be seen that Equation 6 again accounts for the underlying geometry through the metric $G(z(t))$. Table 1 lists closed-form expressions of its determinant for constant curvature manifolds. Furthermore, the vector-Jacobian product can be computed through backward auto-differentiation with linear complexity, avoiding the quadratic cost of computing the full Jacobian matrix. Additionally, the integral is approximated via the discretization of the flow returned by the solver.

**Choice of base distribution $P_0$** The closer the initial base distribution is to the target distribution, the easier the learning task should be. However, it is challenging in practice to incorporate such prior knowledge. We consequently use a uniform distribution on $\mathbb{S}^d$ since it is the most "uncertain" distribution. For the Poincaré ball $\mathbb{B}^d$, we rely on a standard wrapped Gaussian distribution $\mathcal{N}^W$ (Nagano et al., 2019; Mathieu et al., 2019) because it is convenient to work with.

**Vector field** Finally, we discuss the form of the vector field $f_\theta : M \times \mathbb{R} \to TM$ which generates the flow $\phi$ used to push forward samples. We parametrize $f_\theta$ via a feed-forward neural network which takes manifold-valued particles as input and outputs their velocities. The architecture of the vector field has a direct impact on the expressiveness of the distribution and is thus crucially important. In order to take these geometrical properties into account, we make use of the specific input and output layers described below. The rest of the architecture is based on a multilayer perceptron.

**Input layer** To inform the neural network about the geometry of the manifold $M$, we use as the first layer a geodesic distance layer (Ganea et al., 2018; Mathieu et al., 2019), which generalizes linear layers to manifolds and can be seen as computing distances to decision boundaries on $M$. These boundaries are parametrized by geodesic hyperplanes $H_w$, and the associated neurons are $h_w(z) \triangleq d_M(z, H_w)$, with $d_M$ the geodesic distance. Horizontally stacking several of these neurons makes a geodesic distance layer. We refer to Appendix E.2 for more details.

**Output layer** To constrain the network's output to $TM$, we output vectors in $\mathbb{R}^{d+1}$ when $M = \mathbb{S}^d$ and then project them onto the tangent space, i.e. $f_\theta(z) = \mathrm{proj}_{T_z M}(\mathrm{neural\_net}(z))$. This is not necessary on $\mathbb{B}^d$ since the ambient space is of equal dimension. Yet, velocities scale as $\|f_\theta(z)\|_z = |G(z)|^{1/2} \|f_\theta(z)\|_2$, hence we rescale the network's output by $|G(z)|^{-1/2}$ so that $\|f_\theta(z)\|_z = \|\mathrm{neural\_net}(z)\|_2$.
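For concreteness, the sketch below shows how the stochastic divergence estimator of Equation 5 can be implemented with a single Rademacher probe and one vector-Jacobian product obtained by reverse-mode autodiff. It is an illustration only: the vector field is a toy stand-in rather than the architecture just described, and the metric is that of the Poincaré ball with $K = -1$.

```python
# Sketch of the stochastic divergence estimator in Eq. (5) with a single
# Rademacher probe epsilon; one vector-Jacobian product replaces the full Jacobian.
import torch

K = -1.0

def sqrt_det_G(z):
    # sqrt|G(z)| on the Poincaré ball (cf. Table 1); shape (batch,)
    d = z.shape[-1]
    return (2.0 / (1.0 + K * z.pow(2).sum(-1))) ** d

def f_theta(z, t):
    # placeholder smooth vector field standing in for the neural network
    return torch.tanh(z) * torch.sin(t)

def hutchinson_divergence(z, t):
    z = z.requires_grad_(True)
    h = sqrt_det_G(z).unsqueeze(-1) * f_theta(z, t)            # |G|^{1/2} f(z, t)
    eps = torch.randint(0, 2, z.shape).to(z.dtype) * 2 - 1     # Rademacher probe in {-1, +1}
    # vector-Jacobian product eps^T d(h)/dz, linear cost in d; create_graph=True
    # keeps the graph so the estimate can be backpropagated through during training
    vjp = torch.autograd.grad((h * eps).sum(), z, create_graph=True)[0]
    return (vjp * eps).sum(-1) / sqrt_det_G(z)

z = torch.rand(8, 2) * 0.6 - 0.3        # a batch of points inside the disk
print(hutchinson_divergence(z, torch.tensor(0.1)))
```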
**Regularity** For the flow to be bijective, the vector field $f_\theta$ is required to be $C^1$ and bounded (cf. Proposition 1). The boundedness and smoothness conditions can be satisfied by relying on bounded smooth non-linearities in $f_\theta$, such as tanh, along with bounded weights and biases at the last layer.

**Training** In density estimation and inference tasks, one aims to learn a model $P_\theta$ with parameters $\theta$ by minimising a divergence $\mathcal{L}(\theta) = D(P_D \,\|\, P_\theta)$ w.r.t. a target distribution $P_D$. In our case, the parameters $\theta$ are those of the vector field $f_\theta$. We minimize the loss $\mathcal{L}(\theta)$ using first-order stochastic optimization, which requires Monte Carlo estimates of the loss gradients $\nabla_\theta \mathcal{L}(\theta)$. We back-propagate gradients through the explicit solver with $O(1/h)$ memory cost, $h$ being the step size. When the loss $\mathcal{L}(\theta)$ is expressed as an expectation over the model $P_\theta$, as in the reverse KL divergence, we rely on the reparametrization trick (Kingma and Welling, 2014; Rezende et al., 2014). In our experiments we consider both the negative log-likelihood and reverse KL objectives

$$\mathcal{L}_{\mathrm{Like}}(\theta) = -\mathbb{E}_{z \sim P_D}\left[\log p_\theta(z)\right] \quad \text{and} \quad \mathcal{L}_{\mathrm{KL}}(\theta) = D_{\mathrm{KL}}\left(P_\theta \,\|\, P_D\right) = \mathbb{E}_{z \sim P_\theta}\left[\log p_\theta(z) - \log p_D(z)\right]. \tag{7}$$

Additionally, regularization terms can be added in the hope of improving training and generalization. See Appendix D for a discussion and connections to the dynamical formulation of optimal transport.

## 3 Related work

Here we discuss previous work that introduced normalizing flows on manifolds. For clarity, we split these into projected and Riemannian methods, which we describe below.

**Projected methods** These methods consist in parametrizing a normalizing flow on $\mathbb{R}^d$ and then pushing forward the resulting distribution along an invertible map $\psi : \mathbb{R}^d \to M$. Yet, the existence of such an invertible map is equivalent to $M$ being homeomorphic to $\mathbb{R}^d$ (i.e. being "flat"), hence limiting the scope of this approach. Moreover, there is no principled way to choose such a map, and different choices lead to different numerical or computational challenges, which we discuss below.

**Exponential map** The first generic projection map that comes to mind in this setting is the exponential map $\exp_\mu : T_\mu M \cong \mathbb{R}^d \to M$, which parametrizes geodesics starting from $\mu$ with velocity $v \in T_\mu M$. This leads to so-called wrapped distributions $P^W_\theta = {\exp_\mu}_* P$, with $P$ a probability measure on $\mathbb{R}^d$. This approach has been taken by Falorsi et al. (2019) to parametrize probability distributions on Lie groups. Yet, on compact manifolds such as spheres or the SO(3) group, computing the density of wrapped distributions requires an infinite summation, which in practice needs to be truncated. This is not the case on hyperbolic spaces (like the Poincaré ball), since the exponential map is bijective on these manifolds. This approach has been proposed by Bose et al. (2020), who extend Real-NVP (Dinh et al., 2017) to the hyperboloid model of hyperbolic geometry. In addition to this wrapped Real-NVP, they also introduced a hybrid coupling model which is empirically shown to be more expressive. We note, however, that the exponential map is believed to be "badly behaved" away from the origin (Dooley and Wildberger, 1993; Al-Mohy and Higham, 2010).

**Stereographic map** As an alternative to the exponential map, Gemici et al. (2016) proposed to parametrize probability distributions on $\mathbb{S}^d$ via the stereographic projection, defined as $\rho(z) = z_{2:d+1} / (1 + z_1)$ with projection point $\mu_0 = (-1, 0, \ldots, 0)$. Gemici et al. then push a probability measure $P$ defined on $\mathbb{R}^d$ along the inverse of the stereographic map $\rho$, yielding $P^S_\theta = \rho^{-1}_* P$.
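For intuition, the following illustrative sketch (with 1-based indexing of the ambient coordinates, as assumed above) writes out the stereographic projection and its inverse; points near the projection point $\mu_0$ are sent far from the origin, which is the source of the numerical issues discussed next.

```python
# Sketch of the stereographic projection rho and its inverse for S^d embedded in
# R^{d+1}, with projection point mu_0 = (-1, 0, ..., 0). Illustrative only.
import torch

def stereographic(z):
    """rho(z) = z_{2:d+1} / (1 + z_1), mapping S^d minus {mu_0} to R^d."""
    return z[..., 1:] / (1.0 + z[..., :1])

def stereographic_inverse(x):
    """Maps x in R^d back onto S^d."""
    sq = x.pow(2).sum(-1, keepdim=True)
    z1 = (1.0 - sq) / (1.0 + sq)
    rest = 2.0 * x / (1.0 + sq)
    return torch.cat([z1, rest], dim=-1)

# A point close to the projection point mu_0 maps far away from the origin:
z_near_mu0 = torch.tensor([-0.9999, 0.0141, 0.0])    # approximately on S^2
print(stereographic(z_near_mu0).norm())               # large norm
# Round trip for a generic point:
x = torch.tensor([0.3, -1.2])
print(stereographic(stereographic_inverse(x)))        # recovers x
```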
*Figure 2: Probability densities on $\mathbb{B}^2$. Models have been trained by maximum likelihood to fit $\mathcal{N}^W(\exp_0(2x), \Sigma)$. The black semi-circle indicates the disk's border. The best run out of twelve trainings is shown for each model.*

*Figure 3: Ablation study of the vector field architecture for the Riemannian model (unscaled vs. rescaled output). Models have been trained to fit a $\mathcal{N}^W(\exp_0(x), \Sigma)$.*

*Figure 4: Negative log-likelihood over training epochs of CNFs (naive, wrapped, Riemannian) trained to fit a $\mathcal{N}^W(\exp_0(\alpha x), \Sigma)$ target on $\mathbb{B}^2$.*

However, the stereographic map $\rho$ is not defined at the projection point $\mu_0$, which it sends to infinity. This implies that spherical points close to $\mu_0$ are mapped far away from the origin of the plane. Modelling probability distributions with mass close to $\mu_0$ may consequently be numerically challenging since the norm of the Euclidean flow would explode. Similarly, Rezende et al. (2020) introduced flows on hyperspheres and tori by using the inverse tangent function. Although this method is empirically shown to perform well, it similarly suffers from numerical instabilities near singularity points.

**Riemannian methods** In contrast to projected methods, which rely on mapping the manifold to a Euclidean space, Riemannian methods do not. As a consequence, they side-step any artefact or numerical instability arising from the projection of the manifold. Early work (Sei, 2011) proposed transformations along geodesics on the hypersphere by evaluating the exponential map at the gradient of a scalar manifold function. Recently, Rezende et al. (2020) introduced ad-hoc discrete Riemannian flows for hyperspheres and tori based on Möbius transformations and spherical splines. We contribute to this line of work by introducing continuous flows on general Riemannian manifolds. In contrast to discrete flows (e.g. Bose et al., 2020; Rezende et al., 2020), time-continuous flows such as ours alleviate strong structural constraints on the flow by implicitly parametrizing it as the solution to an ODE (Grathwohl et al., 2019). Additionally, recent and concurrent work (Lou et al., 2020; Falorsi and Forré, 2020) proposed to extend neural ODEs to smooth manifolds.

## 4 Experimental results

We evaluate the empirical performance of the above-mentioned models on hyperbolic and spherical geometry. We first discuss experiments on two synthetic datasets where we highlight specific pathologies of the naive and projected methods via unimodal distributions at the point (or in the limit) of the pathology. This removes additional modelling artefacts that would be introduced through more complex distributions and allows us to demonstrate the advantages of our approach on the respective manifolds. We further show that these advantages also translate to substantial gains on highly multi-modal real-world datasets. For all projected models (e.g. stereographic and wrapped, cf. Section 3), the vector field's architecture is chosen to be a multilayer perceptron as in Grathwohl et al. (2019), whilst the architecture described in Section 2 is used for our Riemannian (continuous normalizing flow) model. For fair comparisons, we also parametrize the projected models with a CNF.

*Figure 5: Probability distributions on $\mathbb{S}^2$. Models trained to fit a $\mathrm{vMF}(\mu = \mu_0, \kappa = 10)$; columns show the base $P_0$, the model $P_\theta$, and the target $P_D$.*
*Table 2: Performance of continuous flows on $\mathbb{S}^2$ with $\mathrm{vMF}(\mu = \mu_0, \kappa)$ targets (smaller is better). When a model perfectly fits the target $P_D$, then $\mathcal{L}_{\mathrm{Like}} = H[P_D]$, which decreases with $\kappa$, explaining the $\mathcal{L}_{\mathrm{Like}}$ results for the Riemannian model.*

| Loss | κ | Stereographic | Riemannian |
|---|---|---|---|
| $\mathcal{L}_{\mathrm{Like}}$ | 100 | 63.60 ± 3.56 | −1.78 ± 0.01 |
| | 50 | 32.68 ± 3.15 | −1.09 ± 0.01 |
| | 10 | 6.45 ± 2.42 | −0.52 ± 0.01 |
| $\mathcal{L}_{\mathrm{KL}}$ | 100 | 1.56 ± 0.34 | 0.04 ± 0.02 |
| | 50 | 0.68 ± 0.16 | 0.03 ± 0.02 |
| | 10 | 0.12 ± 0.01 | 0.01 ± 0.00 |

Also, all models are chosen to have approximately the same number of parameters. All models were implemented in PyTorch (Paszke et al., 2017) and trained by stochastic optimization with Adam (Kingma and Ba, 2015). All 95% confidence intervals are computed over 12 runs. Please refer to Appendix G for full experimental details.

**Hyperbolic geometry and limits of conventional and wrapped methods** First, we aim to show that conventional normalizing flows are ill-suited for modelling target distributions on manifolds. These flows are blind to the geometry, so we expect them to behave poorly when the target is located where the manifold behaves most differently from a Euclidean space. We refer to such models as naive and discuss their properties in more detail in Appendix B.1. Second, we wish to inspect the behaviour of wrapped models (see Section 3) when the target is located away from the origin of the exponential map. To this end, we parametrize a wrapped Gaussian target distribution $\mathcal{N}^W(\exp_0(\alpha x), \Sigma) = {\exp_0}_* \mathcal{N}(\alpha x, \Sigma)$ defined on the Poincaré disk $\mathbb{B}^2$ (Nagano et al., 2019; Mathieu et al., 2019). The scalar parameter $\alpha$ allows us to locate the target closer to or further away from the origin of the disk. We benchmark three CNF models: our Riemannian model (from Section 2), a conventional naive model, and a wrapped model. The base distribution $P_0$ is a standard Gaussian for the naive and wrapped models, and a standard wrapped Gaussian for the Riemannian model. Models are trained by maximum likelihood until convergence. Throughout training, Hutchinson's estimator is used to approximate the divergence as in Equation 5.

It can be seen from Figure 4 that the Riemannian model indeed outperforms the naive and wrapped models as we increase the value of $\alpha$, i.e., the closer the target moves to the boundary of the disk. Figure 2 shows that, qualitatively, the naive and wrapped models indeed fail to properly fit the target when it is located far from the origin. Additionally, we assess the architectural choice of the vector field used in our Riemannian model. In particular, we conduct an ablation study on the rescaling of the output layer by training rescaled and unscaled versions of our model for 10 iterations. Figure 3 shows that the number of function evaluations (NFE) tends to be large, and sometimes even diverges dramatically, when the vector field's output is unscaled. In addition to increasing the computational cost, this in turn appears to slow down the model's convergence. This further illustrates the benefits of our vector field parametrization.

**Spherical geometry and limits of the stereographic projection model** Next, we evaluate the ability of our model and of the stereographic projection model from Section 3 to approximate distributions on the sphere which are located around the projection point $\mu_0$. We empirically assess this phenomenon by choosing the target distribution to be a von Mises-Fisher distribution (Downs, 1972), $\mathrm{vMF}(\mu, \kappa)$, located at $\mu = \mu_0$ and with concentration $\kappa$ (which decreases with the variance).
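As an illustration of the wrapped Gaussian construction used as a target above, the sketch below pushes Euclidean Gaussian samples through the exponential map at the origin of the Poincaré disk. It assumes curvature $K = -1$ and the closed form $\exp_0(v) = \tanh(\|v\|)\, v / \|v\|$; the paper's exact parametrization of $\mathcal{N}^W$ may differ, and computing the density would additionally require the log-determinant of $\exp_0$, which is omitted here.

```python
# Sketch: sampling from a wrapped Gaussian N^W(exp_0(mu), Sigma) on the Poincaré
# disk by pushing Euclidean Gaussian samples through the exponential map at 0.
# Assumes K = -1 and an isotropic covariance; density computation is omitted.
import torch

def exp0(v, eps=1e-9):
    """Exponential map at the origin of the Poincaré ball (K = -1), assumed form."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def sample_wrapped_gaussian(mu_tangent, sigma, n):
    """mu_tangent: mean in the tangent space at 0 (a plain R^2 vector)."""
    v = mu_tangent + sigma * torch.randn(n, mu_tangent.shape[-1])
    return exp0(v)

alpha = 2.0
x_dir = torch.tensor([1.0, 1.0]) / 2 ** 0.5        # an arbitrary unit direction (illustrative)
samples = sample_wrapped_gaussian(alpha * x_dir, 0.3, n=512)
print(samples.norm(dim=-1).max())                  # all samples lie inside the unit disk
```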
Along with the stereographic projection method, we also consider our Riemannian model from Section 2. We include neither the naive model, since it is misspecified here (leading to an undefined reverse KL divergence), nor the wrapped model, since computing its density requires an infinite summation (see Section 3). The base distribution $P_0$ is chosen to be a standard Gaussian on $\mathbb{R}^2$ for the stereographic model and a uniform distribution on $\mathbb{S}^2$ for the Riemannian model. Models are trained by computing the exact divergence. The performance of these two models is quantitatively assessed with both the negative log-likelihood and the reverse KL criteria.

*Table 3: Negative test log-likelihood of continuous normalizing flows on $\mathbb{S}^2$ datasets.*

| | Volcano | Earthquake | Flood | Fire |
|---|---|---|---|---|
| Mixture vMF | −0.31 ± 0.07 | 0.59 ± 0.01 | 1.09 ± 0.01 | −0.23 ± 0.02 |
| Stereographic | −0.64 ± 0.20 | 0.43 ± 0.04 | 0.99 ± 0.04 | −0.40 ± 0.06 |
| Riemannian | −0.97 ± 0.15 | 0.19 ± 0.04 | 0.90 ± 0.03 | −0.66 ± 0.05 |
| Data size | 829 | 6124 | 4877 | 12810 |

Figure 5 shows densities of the target distribution along with the base and learned distributions. We observe that the stereographic model fails to push mass close enough to the singularity point $\mu_0$, as opposed to the Riemannian model, which perfectly fits the target. Table 2 shows the negative log-likelihood and reverse KL losses of both models when varying the concentration parameter $\kappa$ of the vMF target. The larger the concentration $\kappa$, the closer the target's mass gets to the singularity point $\mu_0$. We observe that the Riemannian model outperforms the stereographic one in fitting the target for both objectives, although this performance gap shrinks as the concentration gets smaller. We also believe that the gap in performance is particularly large for the log-likelihood objective because it heavily penalizes models that fail to cover the support of the target. When the vMF target is located away from the singularity point, we noted that both models performed similarly well.

**Density estimation of spherical data** Finally, we aim to measure the expressiveness and modelling capabilities of our method on real-world datasets. To this end, we gathered four earth-location datasets, representing respectively volcano eruptions (NOAA, 2020b), earthquakes (NOAA, 2020a), floods (Brakenridge, 2017) and wildfires (EOSDIS, 2020). We approximate the earth's surface (and thus also these data points) as a perfect sphere. Along with our Riemannian CNF, we also assess the fitting capacity of a mixture of von Mises-Fisher (vMF) distributions and of a stereographic projected CNF. The locations of the vMF components are learned via stochastic Riemannian optimization (Bonnabel, 2013; Becigneul and Ganea, 2019). The learning rate and the number of components are selected by a hyperparameter grid search. In our experiments, we split the datasets randomly into training and testing sets, and fit the models by maximum likelihood estimation on the training set. CNF models are trained by computing the exact divergence. We observe from Table 3 that, for all datasets, the Riemannian model outperforms its stereographic counterpart and the mixture of vMF distributions by a large margin. It can also be seen from the learning curves that the Riemannian model converges faster. Figure 6 shows the learned spherical distributions along with the training and testing datasets. We note that, qualitatively, the stereographic distribution is generally more diffuse than its Riemannian counterpart. It also appears to allocate some of its mass outside the target support and to cover fewer of the data points. Additional figures are shown in Appendix H.
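As a data-preparation aside (a sketch; the datasets' exact format and the paper's preprocessing are not specified here), locations given as latitude/longitude can be mapped to the unit sphere with the standard spherical-to-Cartesian conversion:

```python
# Sketch: converting latitude/longitude (in degrees) to points on the unit sphere S^2.
# The coordinates below are hypothetical examples, not taken from the datasets.
import torch

def latlon_to_sphere(lat_deg, lon_deg):
    lat = torch.deg2rad(lat_deg)
    lon = torch.deg2rad(lon_deg)
    x = torch.cos(lat) * torch.cos(lon)
    y = torch.cos(lat) * torch.sin(lon)
    z = torch.sin(lat)
    return torch.stack([x, y, z], dim=-1)

coords = latlon_to_sphere(torch.tensor([19.4, -39.3]), torch.tensor([-155.3, 175.4]))
print(coords.norm(dim=-1))   # all points have unit norm
```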
**Limitations** In the spherical setting, the stochastic estimator of the divergence from Equation 5 exhibits high variance. Its variance scales with the inverse of the determinant term, $\sqrt{|G(z)|} = \sin(\theta)$ on $\mathbb{S}^2$, which makes it too large around the north pole and thus requires the use of the exact estimator. On large-scale datasets, where the stochastic estimator has important runtime advantages, this issue could be alleviated by choosing a different vector field basis than the one induced by the local coordinates (e.g. Falorsi and Forré, 2020). For the Poincaré ball, no such variance behaviour of the stochastic estimator exists, and it can readily be applied to large-scale data. Compared to a well-optimized linear layer, the use of the geodesic distance layer (see Section 2) induces an extra computational cost, as shown in Figure 7. Empirically, the geodesic layer helps to improve performance in the hyperbolic setting but had less of an effect in the spherical setting. As such, the geodesic layer can be regarded as an optional component that can improve the quality of the model at an additional computational cost.

*Figure 6: Density estimation for earth-sciences data. Blue and red dots represent training and testing datapoints, respectively. Heatmaps depict the log-likelihood of the trained models.*

## 5 Discussion

In this paper we proposed a principled way to parametrize expressive probability distributions on Riemannian manifolds. Specifically, we introduced Riemannian continuous normalizing flows, in which flows are defined via vector fields on manifolds and computed as the solution to the associated ODE. We empirically demonstrated that this method can yield substantial improvements when modelling data on constant curvature manifolds compared to conventional or projected flows.

**Broader impact** The work presented in this paper focuses on learning well-specified probabilistic models for manifold-valued data. Consequently, its applications are especially promising for advancing scientific understanding in fields such as earth and climate science, computational biology, and computer vision. As a foundational method, our work inherits the broader ethical aspects and future societal consequences of machine learning in general.

**Acknowledgments** We are grateful to Adam Foster, Yann Dubois, Laura Ruis, Anthony Caterini, Adam Golinski, Chris Maddison, Salem Said, Alessandro Barp, Tom Rainforth and Yee Whye Teh for fruitful discussions and support. EM's research leading to these results received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement no. 617071, and he acknowledges Microsoft Research and EPSRC for funding his studentship.

## References

Al-Mohy, A. H. and Higham, N. J. (2010). A New Scaling and Squaring Algorithm for the Matrix Exponential. SIAM Journal on Matrix Analysis and Applications, 31(3):970–989.
Ambrosio, L. (2003). Optimal transport maps in Monge-Kantorovich problem. arXiv:math/0304389 [math].
Avron, H. and Toledo, S. (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2).
Becigneul, G. and Ganea, O.-E. (2019). Riemannian adaptive optimization methods. In International Conference on Learning Representations.
Benamou, J.-D. and Brenier, Y. (2000). A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393.
Blumenson, L. E. (1960). A derivation of n-dimensional spherical coordinates. The American Mathematical Monthly, 67(1):63–66.
Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229.
Bose, A. J., Smofsky, A., Liao, R., Panangaden, P., and Hamilton, W. L. (2020). Latent variable modelling with hyperbolic normalizing flows. Proceedings of the 37th International Conference on Machine Learning.
Brakenridge, G. (2017). Global active archive of large flood events. http://floodobservatory.colorado.edu/Archives/index.html. Dartmouth Flood Observatory, University of Colorado.
Chen, R. T. Q., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. (2019). Residual flows for invertible generative modeling. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 9916–9926. Curran Associates, Inc.
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc.
Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical variational auto-encoders. 34th Conference on Uncertainty in Artificial Intelligence (UAI-18).
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.
Dooley, A. and Wildberger, N. (1993). Harmonic analysis and the global exponential map for compact Lie groups. Functional Analysis and Its Applications, 27(1):21–27.
Dormand, R. J. and Prince, J. P. (1980). A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, pages 19–26.
Downs (1972). Orientational statistics. Biometrika, 59.
Drucker, H. and Cun, Y. L. (1992). Improving Generalization Performance Using Double Backpropagation. IEEE Transactions on Neural Networks and Learning Systems, 3(6):991–997.
EOSDIS (2020). Active fire data. https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data. Land, Atmosphere Near real-time Capability for EOS (LANCE) system operated by NASA's Earth Science Data and Information System (ESDIS).
Falorsi, L., de Haan, P., Davidson, T. R., and Forré, P. (2019). Reparameterizing distributions on Lie groups. Volume 89 of Proceedings of Machine Learning Research, pages 3244–3253. PMLR.
Falorsi, L. and Forré, P. (2020). Neural Ordinary Differential Equations on Manifolds. arXiv:2006.06663 [cs, stat].
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. (2020). How to train your neural ODE. Proceedings of the 37th International Conference on Machine Learning.
Ganea, O., Becigneul, G., and Hofmann, T. (2018). Hyperbolic neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 5345–5355. Curran Associates, Inc.
Gemici, M. C., Rezende, D., and Mohamed, S. (2016). Normalizing Flows on Riemannian Manifolds. arXiv:1611.02304 [cs, math, stat].
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. (2019). Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations.
Hairer, E. (2006). Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer Series in Computational Mathematics 31. Springer, Berlin, 2nd edition.
Hairer, E. (2011). Solving Differential Equations on Manifolds. Lecture notes.
Hamelryck, T., Kent, J. T., and Krogh, A. (2006). Sampling Realistic Protein Conformations Using Local Structural Bias. PLoS Computational Biology, 2(9).
Hasnat, M. A., Bohné, J., Milgram, J., Gentric, S., and Chen, L. (2017). von Mises-Fisher Mixture Model-based Deep Learning: Application to Face Verification. arXiv:1706.04264 [cs].
Hutchinson (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450.
Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H. A., and Kumar, V. (2019). Machine learning for the geosciences: Challenges and opportunities. IEEE Transactions on Knowledge and Data Engineering, 31(8):1544–1554.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Bengio, Y. and LeCun, Y., editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
Klimovskaia, A., Lopez-Paz, D., Bottou, L., and Nickel, M. (2020). Poincaré maps for analyzing complex hierarchies in single-cell data. Nature Communications, 11:2966.
Lee, J. M. (2003). Introduction to Smooth Manifolds. Number 218 in Graduate Texts in Mathematics. Springer, New York.
Lou, A., Lim, D., Katsman, I., Huang, L., Jiang, Q., Lim, S.-N., and De Sa, C. (2020). Neural manifold ordinary differential equations.
Lui, Y. M. (2012). Advances in matrix manifolds for computer vision. Image and Vision Computing, 30(6):380–388.
Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. Wiley Series in Probability and Statistics. J. Wiley, Chichester; New York.
Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. (2019). Continuous hierarchical representations with Poincaré variational auto-encoders. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 12565–12576. Curran Associates, Inc.
Nagano, Y., Yamaguchi, S., Fujita, Y., and Koyama, M. (2019). A Differentiable Gaussian-like Distribution on Hyperbolic Space for Gradient-Based Learning. In International Conference on Machine Learning (ICML).
Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6341–6350.
NOAA (2020a). Global significant earthquake database. https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153. National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Earthquake Database. NOAA National Centers for Environmental Information.
NOAA (2020b). Global significant volcanic eruptions database. https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ngdc.mgg.hazards:G10147. National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Volcanic Eruptions Database. NOAA National Centers for Environmental Information.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2018). Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations.
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2019). Normalizing Flows for Probabilistic Modeling and Inference. arXiv:1912.02762 [cs, stat].
Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 2338–2347. Curran Associates, Inc.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS-W.
Pennec, X. (2006). Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154.
Petersen, P. (2006). Riemannian Geometry. Springer-Verlag New York.
Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. Volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France. PMLR.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China. PMLR.
Rezende, D. J., Papamakarios, G., Racanière, S., Albergo, M. S., Kanwar, G., Shanahan, P. E., and Cranmer, K. (2020). Normalizing Flows on Tori and Spheres. Proceedings of the 37th International Conference on Machine Learning.
Salman, H., Yadollahpour, P., Fletcher, T., and Batmanghelich, K. (2018). Deep Diffeomorphic Normalizing Flows. arXiv:1810.03256 [cs, stat].
Sei, T. (2011). A Jacobian Inequality for Gradient Maps on the Sphere and Its Application to Directional Statistics. Communications in Statistics - Theory and Methods, 42(14):2525–2542.
Skopek, O., Ganea, O.-E., and Bécigneul, G. (2020). Mixed-curvature variational autoencoders. In International Conference on Learning Representations.
Ungar, A. A. (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194.