Differentiating through the Fréchet Mean

Aaron Lou* 1, Isay Katsman* 1, Qingxuan Jiang* 1, Serge Belongie 1, Ser-Nam Lim 2, Christopher De Sa 1

Abstract

Recent advances in deep representation learning on Riemannian manifolds extend classical deep learning operations to better capture the geometry of the manifold. One possible extension is the Fréchet mean, the generalization of the Euclidean mean; however, it has been difficult to apply because it lacks a closed form with an easily computable derivative. In this paper, we show how to differentiate through the Fréchet mean for arbitrary Riemannian manifolds. Then, focusing on hyperbolic space, we derive explicit gradient expressions and a fast, accurate, and hyperparameter-free Fréchet mean solver. This fully integrates the Fréchet mean into the hyperbolic neural network pipeline. To demonstrate this integration, we present two case studies. First, we apply our Fréchet mean to the existing Hyperbolic Graph Convolutional Network, replacing its projected aggregation to obtain state-of-the-art results on datasets with high hyperbolicity. Second, to demonstrate the Fréchet mean's capacity to generalize Euclidean neural network operations, we develop a hyperbolic batch normalization method that gives an improvement parallel to the one observed in the Euclidean setting.¹

*Equal contribution. 1 Department of Computer Science, Cornell University, Ithaca, NY, USA. 2 Facebook AI, New York, NY, USA. Correspondence to: Aaron Lou, Isay Katsman, Qingxuan Jiang. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

¹ Our PyTorch implementation of the differentiable Fréchet mean can be found at https://github.com/CUVL/Differentiable-Frechet-Mean.

Figure 1. Depicted above is the Fréchet mean, $\mu$, of three points, $x_1, x_2, x_3$, in the Lorentz model of hyperbolic space. As one can see, the Fréchet mean conforms with the geometry of the hyperboloid and is vastly different from the standard Euclidean mean.

1. Introduction

Recent advancements in geometric representation learning have utilized hyperbolic space for tree embedding tasks (Nickel & Kiela, 2017; 2018; Yu & De Sa, 2019). This is due to the natural non-Euclidean structure of hyperbolic space, in which distances grow exponentially as one moves away from the origin. Such a geometry is naturally equipped to embed trees, since if we embed the root of the tree near the origin and layers at successive radii, the geometry of hyperbolic space admits a natural hierarchical structure. More recent work has focused specifically on developing neural networks that exploit the structure of hyperbolic space (Ganea et al., 2018; Tifrea et al., 2019; Chami et al., 2019; Liu et al., 2019).

A useful structure that has thus far not been generalized to non-Euclidean neural networks is that of the Euclidean mean. The (trivially differentiable) Euclidean mean is necessary to perform aggregation operations such as attention (Vaswani et al., 2017) and stability-enhancing operations such as batch normalization (Ioffe & Szegedy, 2015) in the context of Euclidean neural networks. The Euclidean mean extends naturally to the Fréchet mean in non-Euclidean geometries (Fréchet, 1948). However, unlike the Euclidean mean, the Fréchet mean does not have a closed-form solution, and its computation involves an argmin operation that cannot be easily differentiated.
This makes the important operations we are able to perform in Euclidean space hard to generalize to their non-Euclidean counterparts. In this paper, we extend the methods in Gould et al. (2016) to differentiate through the Fréchet mean, and we apply our methods to downstream tasks. Concretely, our paper's contributions are that:

- We derive closed-form gradient expressions for the Fréchet mean on Riemannian manifolds.
- For the case of hyperbolic space, we present a novel algorithm for quickly computing the Fréchet mean and a closed-form expression for its derivative.
- We use our Fréchet mean computation in place of the neighborhood aggregation step in Hyperbolic Graph Convolutional Networks (Chami et al., 2019) and achieve state-of-the-art results on graph datasets with high hyperbolicity.
- We introduce a fully differentiable Riemannian batch normalization method which mimics the procedure and benefit of standard Euclidean batch normalization.

2. Related Work

Uses of Hyperbolic Space in Machine Learning. The usage of hyperbolic embeddings first appeared in Kleinberg (2007), in which the author uses them in a greedy embedding algorithm. Later analyses by Sarkar (2011) and Sala et al. (2018) demonstrate the empirical and theoretical improvement of this approach. However, only recently in Nickel & Kiela (2017; 2018) was this method extended to machine learning. Since then, models such as those in Ganea et al. (2018); Chami et al. (2019); Liu et al. (2019) have leveraged hyperbolic space operations to obtain better embeddings using a hyperbolic version of deep neural networks.

Fréchet Mean. The Fréchet mean (Fréchet, 1948), as the generalization of the classical Euclidean mean, offers a plethora of applications in downstream tasks. As a mathematical construct, the Fréchet mean has been thoroughly studied in texts such as Karcher (1977); Charlier (2013); Bačák (2014). However, the Fréchet mean is an operation not without complications; the general formulation requires an argmin operation and offers no closed-form solution. As a result, both computation and differentiation are problematic, although previous works have attempted to resolve such difficulties.

To address computation, Gu et al. (2019) show that a Riemannian gradient descent algorithm recovers the Fréchet mean in linear time for products of Riemannian model spaces. However, without a tuned learning rate, it is hard to ensure good performance. Brooks et al. (2019) instead use the Karcher Flow algorithm (Karcher, 1977); although this method is manifold-agnostic, it is slow in practice. We address these existing issues in the case of hyperbolic space by providing a fast, hyperparameter-free algorithm for computing the Fréchet mean.

Some works have addressed the differentiation issue by circumventing it, instead relying on pseudo-Fréchet means. In Law et al. (2019), the authors utilize a novel squared Lorentzian distance (as opposed to the canonical distance for hyperbolic space) to derive explicit formulas for the Fréchet mean in pseudo-hyperbolic space. In Chami et al. (2019), the authors use an aggregation method in the tangent space as a substitute. Our work, to the best of our knowledge, is the first to provide explicit derivative expressions for the Fréchet mean on Riemannian manifolds.

Differentiating through the argmin. Theoretical foundations of differentiating through the argmin operator have been provided in Gould et al. (2016).
Similar methods have subsequently been used to develop differentiable optimization layers in neural networks (Amos & Kolter, 2017; Agrawal et al., 2019). Given that the Fréchet mean is an argmin operation, one might consider utilizing the above differentiation techniques. However, a naïve application fails, as the Fréchet mean's argmin domain is a manifold, and Gould et al. (2016) deals specifically with Euclidean space. Our paper extends this theory to the case of general Riemannian manifolds, thereby allowing the computation of derivatives for more general argmin problems, and, in particular, for computing the derivative of the Fréchet mean in hyperbolic space.

3. Background

In this section, we establish relevant definitions and formulas of Riemannian manifolds and hyperbolic spaces. We also briefly introduce neural network layers in hyperbolic space.

3.1. Riemannian Geometry Background

Here we provide some of the useful definitions from Riemannian geometry. For a more in-depth introduction, we refer the interested reader to our Appendix C or texts such as Lee (2003) and Lee (1997).

Manifold and tangent space: An $n$-dimensional manifold $M$ is a topological space that is locally homeomorphic to $\mathbb{R}^n$. The tangent space $T_x M$ at $x$ is defined as the vector space of all tangent vectors at $x$ and is isomorphic to $\mathbb{R}^n$. We assume our manifolds are smooth, i.e. the maps are diffeomorphic. The manifold admits local coordinates $(x_1, \dots, x_n)$, which form a basis $(dx_1, \dots, dx_n)$ for the tangent space.

Riemannian metric and Riemannian manifold: For a manifold $M$, a Riemannian metric $\rho = (\rho_x)_{x \in M}$ is a smooth collection of inner products $\rho_x : T_x M \times T_x M \to \mathbb{R}$ on the tangent space of every $x \in M$. The resulting pair $(M, \rho)$ is called a Riemannian manifold. Note that $\rho$ induces a norm in each tangent space $T_x M$, given by $\|v\|_\rho = \sqrt{\rho_x(v, v)}$ for any $v \in T_x M$. We oftentimes associate $\rho$ with its matrix form $(\rho_{ij})$, where $\rho_{ij} = \rho(dx_i, dx_j)$ in the given local coordinates.

Geodesics and induced distance function: For a curve $\gamma : [a, b] \to M$, we define the length of $\gamma$ to be $L(\gamma) = \int_a^b \|\gamma'(t)\|_\rho\, dt$. For $x, y \in M$, the distance $d(x, y) = \inf L(\gamma)$, where $\gamma$ is any curve such that $\gamma(a) = x$, $\gamma(b) = y$. A geodesic $\gamma_{xy}$ from $x$ to $y$, in our context, should be thought of as a curve that minimizes this length.²

Exponential and logarithmic map: For each point $x \in M$ and vector $v \in T_x M$, there exists a unique geodesic $\gamma : [0, 1] \to M$ with $\gamma(0) = x$ and $\gamma'(0) = v$. The exponential map $\exp_x : T_x M \to M$ is defined as $\exp_x(v) = \gamma(1)$. Note that this is an isometry, i.e. $\|v\|_\rho = d(x, \exp_x(v))$. The logarithmic map $\log_x : M \to T_x M$ is defined as the inverse of $\exp_x$, although it can only be defined locally.³

Parallel transport: For $x, y \in M$, the parallel transport $\mathrm{PT}_{x \to y} : T_x M \to T_y M$ defines a way of transporting the local geometry from $x$ to $y$ along the unique geodesic that preserves the metric tensors.

² Formally, geodesics are curves with 0 acceleration w.r.t. the Levi-Civita connection. There are geodesics which are not minimizing curves, such as the larger arc between two points on a great circle of a sphere; hence this clarification is important.
³ Problems in definition arise in the case of conjugate points (Lee, 1997). However, exp is a local diffeomorphism by the inverse function theorem.

Table 1. Summary of operations in the Poincaré ball model and the hyperboloid model (K < 0).

| Operation | Poincaré Ball | Hyperboloid |
| Manifold | $\mathbb{D}^n_K = \{x \in \mathbb{R}^n : \langle x, x\rangle_2 < -\tfrac{1}{K}\}$ | $\mathbb{H}^n_K = \{x \in \mathbb{R}^{n+1} : \langle x, x\rangle_L = \tfrac{1}{K}\}$ |
| Metric | $g^K_x = (\lambda^K_x)^2 g^E$, where $\lambda^K_x = \tfrac{2}{1 + K\|x\|_2^2}$ and $g^E = I$ | $g^K_x = \eta$, where $\eta$ is $I$ except $\eta_{0,0} = -1$ |
| Distance | $d^K_{\mathbb{D}}(x, y) = \tfrac{1}{\sqrt{|K|}} \cosh^{-1}\!\left(1 - \tfrac{2K\|x - y\|_2^2}{(1 + K\|x\|_2^2)(1 + K\|y\|_2^2)}\right)$ | $d^K_{\mathbb{H}}(x, y) = \tfrac{1}{\sqrt{|K|}} \cosh^{-1}(K\langle x, y\rangle_L)$ |
| Exp map | $\exp^K_x(v) = x \oplus_K \left(\tanh\!\left(\sqrt{|K|}\,\tfrac{\lambda^K_x \|v\|_2}{2}\right) \tfrac{v}{\sqrt{|K|}\,\|v\|_2}\right)$ | $\exp^K_x(v) = \cosh(\sqrt{|K|}\,\|v\|_L)\,x + v\,\tfrac{\sinh(\sqrt{|K|}\,\|v\|_L)}{\sqrt{|K|}\,\|v\|_L}$ |
| Log map | $\log^K_x(y) = \tfrac{2}{\sqrt{|K|}\,\lambda^K_x} \tanh^{-1}\!\left(\sqrt{|K|}\,\|{-x} \oplus_K y\|_2\right) \tfrac{-x \oplus_K y}{\|{-x} \oplus_K y\|_2}$ | $\log^K_x(y) = \tfrac{\cosh^{-1}(K\langle x, y\rangle_L)}{\sinh\!\left(\cosh^{-1}(K\langle x, y\rangle_L)\right)}\,(y - K\langle x, y\rangle_L\, x)$ |
| Transport | $\mathrm{PT}^K_{x \to y}(v) = \tfrac{\lambda^K_x}{\lambda^K_y}\, \mathrm{gyr}[y, -x]v$ | $\mathrm{PT}^K_{x \to y}(v) = v - \tfrac{K\langle y, v\rangle_L}{1 + K\langle x, y\rangle_L}(x + y)$ |

Table 2. Summary of hyperbolic counterparts of Euclidean operations in neural networks.

| Operation | Formula |
| Matrix-vector multiplication | $A \otimes_K x = \exp^K_0(A \log^K_0(x))$ |
| Bias translation | $x \oplus_K b = \exp_x(\mathrm{PT}^K_{0 \to x}(b))$ |
| Activation function | $\sigma^{\otimes K_1, K_2}(x) = \exp^{K_1}_0(\sigma(\log^{K_2}_0(x)))$ |
3.2. Hyperbolic Geometry Background

We now examine hyperbolic space, which has constant curvature K < 0, and provide concrete formulas for computation. The two equivalent models of hyperbolic space frequently used are the Poincaré ball model and the hyperboloid model. We denote by $\mathbb{D}^n_K$ and $\mathbb{H}^n_K$ the $n$-dimensional Poincaré ball and hyperboloid models with curvature K < 0, respectively.

3.2.1. BASIC OPERATIONS

Inner products: We define $\langle x, y\rangle_2$ to be the standard Euclidean inner product and $\langle x, y\rangle_L$ to be the Lorentzian inner product $-x_0 y_0 + x_1 y_1 + \dots + x_n y_n$.

Gyrovector operations: For $x, y \in \mathbb{D}^n_K$, the Möbius addition (Ungar, 2008) is

$x \oplus_K y = \dfrac{(1 - 2K\langle x, y\rangle_2 - K\|y\|_2^2)\,x + (1 + K\|x\|_2^2)\,y}{1 - 2K\langle x, y\rangle_2 + K^2\|x\|_2^2\|y\|_2^2}$   (1)

This induces Möbius subtraction $\ominus_K$, which is defined as $x \ominus_K y = x \oplus_K (-y)$. In the theory of gyrogroups, the notion of the gyration operator (Ungar, 2008) is given by

$\mathrm{gyr}[x, y]v = \ominus_K(x \oplus_K y) \oplus_K (x \oplus_K (y \oplus_K v))$   (2)

Riemannian operations on hyperbolic space: We summarize computations for the Poincaré ball model and the hyperboloid model in Table 1.

3.3. Hyperbolic Neural Networks

Introduced in Ganea et al. (2018), hyperbolic neural networks provide a natural generalization of standard neural networks.

Hyperbolic linear layer: Recall that a Euclidean linear layer is defined as $f : \mathbb{R}^m \to \mathbb{R}^n$, $f = \sigma(Ax + b)$, where $A \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $b \in \mathbb{R}^n$, and $\sigma$ is some activation function. By analogy with Euclidean layers, a hyperbolic linear layer $g : \mathbb{H}^m \to \mathbb{H}^n$ is defined by $g = \sigma^{\otimes K,K}(A \otimes_K x \oplus_K b)$, where $A \in \mathbb{R}^{n \times m}$, $x \in \mathbb{H}^m$, $b \in \mathbb{H}^n$, and we replace the Euclidean operations by the hyperbolic counterparts outlined in Table 2. Hyperbolic neural networks are defined as compositions of these layers, similar to how conventional neural networks are defined as compositions of Euclidean layers.

4. A Differentiable Fréchet Mean Operation for General Riemannian Manifolds

In this section, we provide a few theorems that summarize our method of differentiating through the Fréchet mean.

4.1. Background on the Fréchet Mean

Fréchet mean and variance: On a Riemannian manifold $(M, \rho)$, the Fréchet mean $\mu_{fr} \in M$ and Fréchet variance $\sigma^2_{fr} \in \mathbb{R}$ of a set of points $B = \{x^{(1)}, \dots, x^{(t)}\}$ with each $x^{(l)} \in M$ are defined as the solution and optimal value of the following optimization problems (Bačák, 2014):

$\mu_{fr} = \arg\min_{\mu \in M} \sum_{l=1}^{t} d(x^{(l)}, \mu)^2$   (3)

$\sigma^2_{fr} = \min_{\mu \in M} \frac{1}{t}\sum_{l=1}^{t} d(x^{(l)}, \mu)^2$   (4)

In Appendix A, we provide proofs to illustrate that this definition is a natural generalization of the Euclidean mean and variance. The Fréchet mean can be further generalized with an arbitrary re-weighting. In particular, for positive weights $\{w_l\}_{l \in [t]}$, we can define the weighted Fréchet mean as

$\mu_{fr} = \arg\min_{\mu \in M} \sum_{l=1}^{t} w_l\, d(x^{(l)}, \mu)^2$   (5)

This generalizes the weighted Euclidean mean to Riemannian manifolds.
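To make Table 1 and the objective in (5) concrete, the snippet below is a minimal PyTorch sketch of the Poincaré ball operations specialized to $K = -1$; the helper names (`mobius_add`, `poincare_dist`, `poincare_expmap`, `poincare_logmap`, `frechet_objective`) are ours for illustration and are not taken from the released implementation.

```python
import torch

def mobius_add(x, y):
    # Mobius addition (Eq. 1) for K = -1.
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def poincare_dist(x, y, eps=1e-9):
    # Distance in the Poincare ball (Table 1) for K = -1.
    sq = ((x - y) ** 2).sum(-1)
    denom = (1 - (x * x).sum(-1)) * (1 - (y * y).sum(-1))
    return torch.acosh(1 + 2 * sq / denom.clamp_min(eps))

def poincare_expmap(x, v, eps=1e-9):
    # exp_x(v) (Table 1) for K = -1, with lambda_x = 2 / (1 - ||x||^2).
    lam = 2 / (1 - (x * x).sum(-1, keepdim=True)).clamp_min(eps)
    vnorm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return mobius_add(x, torch.tanh(lam * vnorm / 2) * v / vnorm)

def poincare_logmap(x, y, eps=1e-9):
    # log_x(y) (Table 1) for K = -1.
    lam = 2 / (1 - (x * x).sum(-1, keepdim=True)).clamp_min(eps)
    m = mobius_add(-x, y)
    mnorm = m.norm(dim=-1, keepdim=True).clamp_min(eps)
    return (2 / lam) * torch.atanh(mnorm.clamp_max(1 - eps)) * m / mnorm

def frechet_objective(xs, y, w):
    # Weighted Frechet objective (Eq. 5): xs is (t, n), y is (n,), w is (t,).
    return (w * poincare_dist(xs, y) ** 2).sum()
```

Later sketches in this paper's sections reuse helpers of this form; for general K < 0 the same code applies with the curvature factors from Table 1 restored.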
Figure 2. Depicted above is the Fréchet mean, $\mu$, of three points, $x_1, x_2, x_3$, in the Poincaré ball model of hyperbolic space, $\mathbb{D}^2_{-1}$, as well as the negative gradients (shown in red) with respect to the loss function $L = \|\mu\|^2$.

4.2. Differentiating Through the Fréchet Mean

All known methods for computing the Fréchet mean rely on some sort of iterative solver (Gu et al., 2019). While backpropagating through such a solver is possible, it is computationally inefficient and suffers from numerical instabilities akin to those found in RNNs (Pascanu et al., 2013). To circumvent these issues, recent works compute gradients at the solved value instead of differentiating directly, allowing for full neural network integration (Chen et al., 2018; Pogančić et al., 2020). However, to the best of our knowledge, no paper has investigated backpropagation on a manifold-based convex optimization solver. Hence, in this section, we construct the gradient, relying on the fact that the Fréchet mean is an argmin operation.

4.2.1. DIFFERENTIATING THROUGH THE ARGMIN OPERATION

Motivated by previous works on differentiating argmin problems (Gould et al., 2016), we propose a generalization which allows us to differentiate the argmin operation on the manifold. The full theory is presented in Appendix D.

4.2.2. CONSTRUCTION OF THE FRÉCHET MEAN DERIVATIVE

Since the Fréchet mean is an argmin operation, we can apply the theorems in Appendix D to obtain gradients with respect to the input points. This operation (as well as the resulting gradients) is visualized in Figure 2.

For the following theorems, we denote by $\nabla$ the total derivative (or Jacobian) for notational convenience.

Theorem 4.1. Let $M$ be an $n$-dimensional Riemannian manifold, and let $\{x\} = (x^{(1)}, \dots, x^{(t)}) \in (M)^t$ be a set of data points with weights $w_1, \dots, w_t \in \mathbb{R}_+$. Let $f : (M)^t \times M \to \mathbb{R}$ be given by $f(\{x\}, y) = \sum_{l=1}^{t} w_l\, d(x^{(l)}, y)^2$, and let $\overline{x} = \mu_{fr}(\{x\}) = \arg\min_{y \in M} f(\{x\}, y)$ be the Fréchet mean. Then, with respect to local coordinates, we have

$\nabla_{x^{(i)}} \mu_{fr}(\{x\}) = -f_{YY}(\{x\}, \overline{x})^{-1} f_{X^{(i)}Y}(\{x\}, \overline{x})$   (6)

where the functions $f_{X^{(i)}Y}(\{x\}, y) = \nabla_{x^{(i)}} \nabla_y f(\{x\}, y)$ and $f_{YY}(\{x\}, y) = \nabla^2_{yy} f(\{x\}, y)$ are defined in terms of local coordinates.

Proof. This is a special case of Theorem D.1 in the appendix. This is because the Fréchet objective function $f$ is a twice differentiable real-valued function for specific $x^{(i)}$ and $y$ (under our geodesic formulation); thus we obtain the desired formulation. The full explanation can be found in Remark D.

While the above theorem gives a nice theoretical framework with minimal assumptions, it is in practice too unwieldy to apply. In particular, the requirement of local coordinates renders most computations difficult. We now present a version of the above theorem which assumes that the manifold is embedded in Euclidean space.⁴

⁴ We also present a more general way to take the derivative that drops this restriction via an exponential map-based parameterization in Appendix D.

Theorem 4.2. Assume the conditions and values in Theorem 4.1. Furthermore, assume $M$ is embedded (as a Riemannian manifold) in $\mathbb{R}^m$ with $m \geq \dim M$. Then we can write

$\nabla_{x^{(i)}} \mu_{fr}(\{x\}) = -f^p_{YY}(\{x\}, \overline{x})^{-1} f^p_{X^{(i)}Y}(\{x\}, \overline{x})$   (7)

where $f^p_{YY}(\{x\}, y) = \nabla_y(\mathrm{proj}_{T_{\overline{x}}M}\, \nabla_y f)(\{x\}, y)$, $f^p_{X^{(i)}Y}(\{x\}, y) = \nabla_{x^{(i)}}(\mathrm{proj}_{T_{\overline{x}}M}\, \nabla_y f)(\{x\}, y)$, and $\mathrm{proj}_{T_{\overline{x}}M} : \mathbb{R}^m \to T_{\overline{x}}M \cong \mathbb{R}^n$ is the linear subspace projection operator.

Proof. Similar to the relationship between Theorem 4.1 and Theorem D.1, this is a special case of Theorem D.3 in the appendix.
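Equation (7) can be checked numerically with automatic differentiation. The sketch below uses `torch.func` (PyTorch 2.x; previously `functorch`) on the Poincaré ball with $K = -1$, where the tangent-space projection is the identity; it is an illustrative reimplementation under those assumptions, not the paper's released code.

```python
import torch
from torch.func import grad, jacrev

def frechet_objective(xs, y, w):
    # Weighted Frechet objective on the Poincare ball with K = -1.
    sq = ((xs - y) ** 2).sum(-1)
    denom = (1 - (xs * xs).sum(-1)) * (1 - (y * y).sum())
    return (w * torch.acosh(1 + 2 * sq / denom) ** 2).sum()

def frechet_mean_jacobians(xs, w, y_star):
    # Eq. (7) in ambient coordinates; the projection is the identity since dim R^n = dim D^n_K.
    grad_y = grad(frechet_objective, argnums=1)        # y-gradient of the objective
    f_yy = jacrev(grad_y, argnums=1)(xs, y_star, w)    # (n, n): Hessian in y at the mean
    f_xy = jacrev(grad_y, argnums=0)(xs, y_star, w)    # (n, t, n): derivative of the y-gradient w.r.t. each x^(l)
    # d(mu)/d(x^(l)) = -f_YY^{-1} f_{X^(l)Y}, stacked over l -> (t, n, n)
    return -torch.linalg.solve(f_yy, f_xy.permute(1, 0, 2))
```

Here `y_star` is the output of any Fréchet mean solver (for example, the Algorithm 1 sketch given below), and the result stacks the Jacobians $\nabla_{x^{(l)}}\mu_{fr}$, each of shape (n, n), over the $t$ input points.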
5. Hyperbolic Fréchet Mean

Although we have provided a formulation for differentiating through the Fréchet mean on general Riemannian manifolds, to properly integrate it in the hyperbolic setting we need to address two major difficulties:

1. The lack of a fast forward computation.
2. The lack of an explicit derivation of a backpropagation formula.

Resolving these difficulties will allow us to define a Fréchet mean neural network layer for geometric, and specifically hyperbolic, machine learning tasks.

5.1. Forward Computation of the Hyperbolic Fréchet Mean

Previous forward computations fall into one of two categories: (1) fast, inaccurate computations which aim to approximate the true mean with a pseudo-Fréchet mean, or (2) slow, exact computations. In this section we focus on outperforming methods in the latter category, since we strive to compute the exact Fréchet mean (pseudo-means warp geometry).

5.1.1. FORMER ATTEMPTS AT COMPUTING THE FRÉCHET MEAN

The two existing algorithms for Fréchet mean computation are (1) Riemannian gradient-based optimization (Gu et al., 2019) and (2) iterative averaging (Karcher, 1977). However, in practice both algorithms are slow to converge even for simple synthetic examples of points in hyperbolic space. To overcome this difficulty, which can cripple neural networks, we propose the following algorithm, which is much faster in practice.

Algorithm 1: Poincaré model Fréchet mean algorithm
Inputs: $x^{(1)}, \dots, x^{(t)} \in \mathbb{D}^n_K \subseteq \mathbb{R}^n$ and weights $w_1, \dots, w_t \in \mathbb{R}_+$.
Algorithm:
  $y_0 = x^{(1)}$
  Define $g(y) = \dfrac{2\, \mathrm{arccosh}(1 + 2y)}{\sqrt{y^2 + y}}$
  for $k = 0, 1, \dots, T$:
    for $l = 1, 2, \dots, t$:
      $\alpha_l = w_l\, g\!\left(\dfrac{\|x^{(l)} - y_k\|^2}{(1 + K\|x^{(l)}\|^2)(1 + K\|y_k\|^2)}\right) \cdot \dfrac{1}{1 + K\|x^{(l)}\|^2}$
    $a = \sum_{l=1}^{t} \alpha_l$,  $b = \sum_{l=1}^{t} \alpha_l x^{(l)}$,  $c = \sum_{l=1}^{t} \alpha_l \|x^{(l)}\|^2$
    $y_{k+1} = \dfrac{(a - cK) - \sqrt{(a - cK)^2 + 4K\|b\|^2}}{2(-K)\|b\|^2}\, b$
  return $y_{T+1}$

5.1.2. ALGORITHM FOR FRÉCHET MEAN COMPUTATION VIA FIRST-ORDER BOUND

The core idea of our algorithm relies on the fact that the squared distance metric is a concave function for both the Poincaré ball and hyperboloid models. Intuitively, we select an initial guess and use a first-order bound to minimize the Fréchet mean objective. The concrete algorithm for the Poincaré ball model is given as Algorithm 1 above. Note that the algorithm is entirely hyperparameter-free and does not require setting a step-size. Additionally, we introduce three different initializations:

1. Setting $y_0 = x^{(1)}$.
2. Setting $y_0 = x^{(\arg\max_i w_i)}$.
3. Setting $y_0$ to be the output of the first step of the Karcher Flow algorithm (Karcher, 1977).

We tried these initializations for our test tasks (in which weights were equal; tasks described in Section 6) and found little difference between them in terms of performance. Even for toy tasks with varying weights, these three methods produced nearly the same results. However, we give them here for completeness. Moreover, we can prove that the algorithm is guaranteed to converge.

Theorem 5.1. Let $x^{(1)}, \dots, x^{(t)} \in \mathbb{D}^n_K$ be $t$ points⁵ in the Poincaré ball, let $w_1, \dots, w_t \in \mathbb{R}_+$ be their weights, and let their weighted Fréchet mean be the solution to the following optimization problem:

$\mu_{fr} = \arg\min_{y \in \mathbb{D}^n_K} f(y)$   (8)

where

$f(y) = \sum_{l=1}^{t} w_l\, d_{\mathbb{D}^n_K}(x^{(l)}, y)^2 = \sum_{l=1}^{t} \dfrac{w_l}{|K|}\, \mathrm{arccosh}^2\!\left(1 - \dfrac{2K\|x^{(l)} - y\|^2}{(1 + K\|x^{(l)}\|^2)(1 + K\|y\|^2)}\right)$   (9)

Then Algorithm 1 gives a sequence of points $\{y_k\}$ whose limit $\lim_{k \to \infty} y_k = \mu_{fr}$ is the Fréchet mean.

Proof. See Theorem E.2 in the appendix.

⁵ Here we present the version for $K = -1$ for cleaner presentation. The generalization to arbitrary $K < 0$ is easy to compute, but clutters presentation.
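For concreteness, the update in Algorithm 1 can be written in a few lines. The following is a minimal PyTorch sketch specialized to $K = -1$ (the setting of Theorem 5.1); the function name and the small numerical clamps are our own additions, and this is an illustration of the update rule rather than the released implementation.

```python
import torch

def poincare_frechet_mean(x, w, num_iter=25, eps=1e-12):
    # Algorithm 1 for K = -1. x: (t, n) points in the Poincare ball, w: (t,) positive weights.
    y = x[0].clone()
    x_sq = (x * x).sum(-1)                      # ||x^(l)||^2 for each point
    for _ in range(num_iter):
        u = ((x - y) ** 2).sum(-1) / ((1 - x_sq) * (1 - (y * y).sum()))
        u = u.clamp_min(eps)                    # g(u) -> 4 as u -> 0
        g = 2 * torch.acosh(1 + 2 * u) / torch.sqrt(u ** 2 + u)
        alpha = w * g / (1 - x_sq)
        a = alpha.sum()
        b = (alpha.unsqueeze(-1) * x).sum(0)
        c = (alpha * x_sq).sum()
        b_sq = (b * b).sum().clamp_min(eps)
        # Closed-form minimizer of the first-order upper bound (a quadratic in ||y||).
        y = ((a + c) - torch.sqrt((a + c) ** 2 - 4 * b_sq)) / (2 * b_sq) * b
    return y
```

For example, `poincare_frechet_mean(x, torch.ones(x.shape[0]))` returns an unweighted Fréchet mean of the rows of `x`.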
The algorithm and proof of convergence for the hyperboloid model are given in Appendix E.1 and are omitted here for brevity.

5.1.3. EMPIRICAL COMPARISON TO PREVIOUS FRÉCHET MEAN COMPUTATION ALGORITHMS

To demonstrate the efficacy of our algorithm, we compare it to previous approaches on randomly generated data. Namely, we compare against a naïve Riemannian Gradient Descent (RGD) approach (Udrişte, 1994) and against the Karcher Flow algorithm (Karcher, 1977). We test our Fréchet mean algorithm against these methods on synthetic datasets of ten on-manifold, randomly generated 16-dimensional points. We run all algorithms until they are within $\epsilon = 10^{-12}$ of the true Fréchet mean in norm, and report the number of iterations this takes in Table 3 for both the hyperboloid (H) and Poincaré (P) models of hyperbolic space. Note that we significantly outperform the other algorithms.

We also observe that by allowing 200x more computation, a grid search on the learning rate hyperparameter⁶ in RGD obtains nearly comparable or better results (last row of Table 3 for both models). However, we stress that this requires much more computation, and note that our algorithm produces nearly the same result while being hyperparameter-free.

⁶ The grid search starts from lr = 0.2 and goes to lr = 0.4 in increments of 0.01 for the Poincaré ball model, and from lr = 0.2 to 0.28 for the hyperboloid model (same increment).

Table 3. Empirical computation of the Fréchet mean; the average number of iterations, as well as runtime, required to come within $\epsilon = 10^{-12}$ of the true Fréchet mean is reported. 10 trials are conducted, and standard deviations are reported. The primary baselines are the RGD (Udrişte, 1994) and Karcher Flow (Karcher, 1977) algorithms. (H) refers to hyperboloid and (P) refers to Poincaré.

| Model | Method | Iterations | Time (ms)⁷ |
| (H) | RGD (lr = 0.01) | 801.0 ± 21.0 | 932.9 ± 130.0 |
| (H) | Karcher Flow | 62.5 ± 6.0 | 50.9 ± 8.9 |
| (H) | Ours | 13.7 ± 0.9 | 6.1 ± 1.9 |
| (H) | RGD + Grid Search on lr | 27.7 ± 0.8 | 5333.5 ± 770.7 |
| (P) | RGD (lr = 0.01) | 773.8 ± 22.1 | 1157.3 ± 74.8 |
| (P) | Karcher Flow | 57.5 ± 9.1 | 59.8 ± 10.4 |
| (P) | Ours | 13.4 ± 0.5 | 9.1 ± 1.3 |
| (P) | RGD + Grid Search on lr | 10.5 ± 0.5 | 6050.6 ± 235.2 |

⁷ Experiments were run with an Intel Skylake Core i7-6700HQ 2.6 GHz quad-core CPU.

We also find that this convergence improvement translates to real-world applications. Specifically, we find that for the graph link prediction experimental setting in Section 6.1.3, our forward pass takes anywhere from 15-25 iterations, significantly outperforming the 1000+ needed with RGD and the 120 needed with Karcher Flow.
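For reference, the Karcher Flow baseline in Table 3 is a simple fixed-point iteration: average the points in the tangent space at the current estimate and map back to the manifold. The sketch below is our own rendering of that baseline (Karcher, 1977), with a weighting convention that we assume; `expmap(y, v)` and `logmap(y, x)` are manifold helpers supplied by the caller, e.g. the Poincaré-ball versions sketched in Section 3.

```python
import torch

def karcher_flow(x, w, expmap, logmap, num_iter=100):
    # Karcher Flow: repeatedly lift the points to the tangent space at the current
    # estimate, take their weighted Euclidean average, and map back to the manifold.
    # x: (t, n) points, w: (t,) positive weights.
    y = x[0].clone()
    for _ in range(num_iter):
        v = (w.unsqueeze(-1) * logmap(y, x)).sum(0) / w.sum()
        y = expmap(y, v)
    return y
```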
5.2. Backward Computation of the Hyperbolic Fréchet Mean

For the backward computation, we re-apply the general Riemannian theory for differentiating through the Fréchet mean in Section 4 to hyperbolic space. Since most autodifferentiation packages do not support manifold-aware higher-order differentiation, we derive the gradients explicitly. We begin with the Poincaré ball model by setting $M = \mathbb{D}^n_K$ and applying Theorem 4.2.

Theorem 5.2. Let $x^{(1)}, \dots, x^{(t)} \in \mathbb{D}^n_K \subseteq \mathbb{R}^n$ be $t$ points in the Poincaré ball and let $w_1, \dots, w_t \in \mathbb{R}_+$ be the weights. Let their weighted Fréchet mean $\mu_{fr}$ be the solution to the following optimization problem:

$\mu_{fr}(x^{(1)}, \dots, x^{(t)}) = \arg\min_{y \in \mathbb{D}^n_K} f(\{x\}, y)$   (10)

where

$f(\{x\}, y) = \sum_{l=1}^{t} w_l\, d_{\mathbb{D}^n_K}(x^{(l)}, y)^2 = \sum_{l=1}^{t} \dfrac{w_l}{|K|}\, \mathrm{arccosh}^2\!\left(1 - \dfrac{2K\|x^{(l)} - y\|_2^2}{(1 + K\|x^{(l)}\|_2^2)(1 + K\|y\|_2^2)}\right)$   (11)

Then the derivative of $\mu_{fr}$ with respect to $x^{(i)}$ is given by

$\nabla_{x^{(i)}} \mu_{fr}(\{x\}) = -f_{YY}(\{x\}, \overline{x})^{-1} f_{X^{(i)}Y}(\{x\}, \overline{x})$   (12)

where $\overline{x} = \mu_{fr}(\{x\})$ and $f_{YY}$, $f_{X^{(i)}Y}$ are defined as in Theorem 4.2.⁸ The full concrete derivation of the above terms for the geometry induced by this manifold choice is given in Appendix Theorem F.3.

Proof. This is a concrete application of Theorem 4.2, in particular since our manifold is embedded in $\mathbb{R}^n$ ($\mathbb{D}^n_K \subseteq \mathbb{R}^n$). Note that this is the total derivative in the ambient Euclidean space.⁹ For the full proof see Theorem F.3 in the Appendix.

⁸ The projection operation is trivial since $\dim \mathbb{R}^n = \dim \mathbb{D}^n_K$.
⁹ To transform Euclidean gradients into Riemannian ones, simply multiply by the inverse of the matrix of the metric.

The derivation for the hyperboloid model is given in Appendix F.2.

6. Case Studies

To demonstrate the efficacy of our developed theory, we investigate the following test settings. In the first setting, we directly modify the hyperbolic aggregation strategy in Hyperbolic GCNs (Chami et al., 2019) to use our differentiable Fréchet mean layer. This was the original intent¹⁰ but was not feasible without our formulation. In the second setting, we introduce Hyperbolic Batch Normalization (HBN) as an extension of regular Euclidean Batch Normalization (EBN). When combined with hyperbolic neural networks (Ganea et al., 2018), HBN exhibits benefits similar to those of EBN with Euclidean networks.

¹⁰ We quote directly from the paper Chami et al. (2019): "An analog of mean aggregation in hyperbolic space is the Fréchet mean, which, however, has no closed form solution. Instead, we propose to..."

6.1. Hyperbolic Graph Convolutional Neural Networks (HGCNs)

6.1.1. ORIGINAL FRAMEWORK

Introduced in Chami et al. (2019), Hyperbolic Graph Convolutional Networks (HGCNs) provide generalizations of Euclidean GCNs to hyperbolic space. The proposed network architecture is based on three different layer types: feature transformation, activation, and attention-based aggregation.

Feature transformation: The hyperbolic feature transformation consists of a gyrovector matrix multiplication followed by a gyrovector addition.

$h^l_i = (W^l \otimes_{K_{l-1}} x^{l-1}_i) \oplus_{K_{l-1}} b^l$   (13)

Attention-based aggregation: Neighborhood aggregation combines local data at a node. It does so by projecting the neighbors using the logarithmic map at the node, averaging in the tangent space, and projecting back with the exponential map at the node. Note that the weights $w_{ij}$ are positive and can be trained or defined by the graph adjacency matrix.

$\mathrm{AGG}^K(x_i) = \exp^K_{x_i}\!\left(\sum_{j \in N(i)} w_{ij} \log^K_{x_i} x_j\right)$   (14)

Activation: The activation layer applies a hyperbolic activation function.

$x^l_i = \sigma^{\otimes K_{l-1}, K_l}(y^l_i)$   (15)

6.1.2. PROPOSED CHANGES

The usage of tangent space aggregation in the HGCN framework stemmed from the lack of a differentiable Fréchet mean operation. As a natural extension, we substitute our Fréchet mean in place of the aggregation layer.
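The proposed change is local to the aggregation step: instead of averaging logarithmic-map images in the tangent space as in (14), the neighbors are passed directly to a weighted Fréchet mean solver. The sketch below is illustrative (the function names are ours, not from the HGCN codebase) and assumes manifold helpers such as those sketched earlier.

```python
import torch

def tangent_space_agg(x_i, neighbors, w, expmap, logmap):
    # Eq. (14): lift neighbors to T_{x_i}M, average, and map back to the manifold.
    v = (w.unsqueeze(-1) * logmap(x_i, neighbors)).sum(0)
    return expmap(x_i, v)

def frechet_agg(neighbors, w, frechet_mean):
    # Proposed replacement: the weighted Frechet mean of the neighbors themselves,
    # e.g. frechet_mean = poincare_frechet_mean from the Algorithm 1 sketch.
    return frechet_mean(neighbors, w)
```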
6.1.3. EXPERIMENTAL RESULTS

We use precisely the same architecture as in Chami et al. (2019), except we substitute all hyperbolic aggregation layers with our differentiable Fréchet mean layer. Furthermore, we test with precisely the same hyperparameters (learning rate, test/val split, and the like) as Chami et al. (2019) for a fair comparison.

Our new aggregation allows us to achieve new state-of-the-art results on the Disease and Disease-M graph datasets (Chami et al., 2019). These datasets induce ideal test tasks for hyperbolic learning since they have very low Gromov δ-hyperbolicity (Adcock et al., 2013), which indicates the structure is highly tree-like. Our results and comparison to the baseline are given in Table 4. We run experiments for 5 trials and report the mean and standard deviation. Due to practical considerations, we only test with the Poincaré model.¹¹ For reference, the strongest baseline results with the hyperboloid model are reported from Chami et al. (2019) (note that we outperform these results as well).

¹¹ The code for HGCN included only the Poincaré model implementation at the time this paper was submitted. Hence we use the Poincaré model for our experiments, although our contributions include derivations for both hyperboloid and Poincaré models.

On the rather non-hyperbolic CoRA (Sen et al., 2008) dataset, our performance is comparable to that of the best baseline. Note that this is similar to the performance exhibited by the vanilla HGCN. Hence we conjecture that when the underlying dataset is not hyperbolic in nature, we do not observe improvements over the best Euclidean baseline methods.

Table 4. ROC AUC results for Link Prediction (LP) on various graph datasets, averaged over 5 trials (with standard deviations). Graph hyperbolicity values are also reported (lower δ is more hyperbolic). Results are given for models learning in Euclidean (E), Hyperboloid (H), and Poincaré (P) spaces. Note that the best Euclidean method is GAT (Veličković et al., 2018), shown below for fair comparison on CoRA. We highlight the best result only if our result gives a p-value < 0.01 after running a paired-significance t-test.

| Method | Disease (δ = 0) | Disease-M (δ = 0) | CoRA (δ = 11) |
| MLP | 72.6 ± 0.6 | 55.3 ± 0.5 | 83.1 ± 0.5 |
| GAT | 69.8 ± 0.3 | 69.5 ± 0.4 | 93.7 ± 0.1 |
| HNN | 75.1 ± 0.3 | 60.9 ± 0.4 | 89.0 ± 0.1 |
| HGCN (H) | 90.8 ± 0.3 | 78.1 ± 0.4 | 92.9 ± 0.1 |
| HGCN (P) | 76.4 ± 8.9 | 81.4 ± 3.4 | 93.4 ± 0.4 |
| Ours | 93.7 ± 0.4 | 91.0 ± 0.6 | 92.9 ± 0.4 |

6.2. Hyperbolic Batch Normalization

Euclidean batch normalization (Ioffe & Szegedy, 2015) is one of the most widely used neural network operations and has, in many cases, obviated the need for explicit regularization such as dropout (Srivastava et al., 2014). In particular, analysis demonstrates that batch normalization induces a smoother loss surface which facilitates convergence and yields better final results (Santurkar et al., 2018). Generalizing this operation to Riemannian manifolds is a natural extension, and such a computation would involve a differentiable Fréchet mean.

6.2.1. THEORETICAL FORMULATION AND ALGORITHM

In this section we formulate Riemannian Batch Normalization as a natural extension of standard Euclidean Batch Normalization. This concept is, to the best of our knowledge, only touched upon by Brooks et al. (2019) in the specific instance of the manifold of positive semidefinite matrices. However, we argue in Appendix G that, unlike our method, their formulation is incomplete and lacks sufficient generality to be considered a true extension.

Algorithm 2: Riemannian Batch Normalization

Training Input: batches of data points $\{x^{(t)}_1, \dots, x^{(t)}_m\} \subseteq M$ for $t \in [1, \dots, T]$, testing momentum $\eta \in [0, 1]$
Learned Parameters: target mean $\mu' \in M$, target variance $(\sigma')^2 \in \mathbb{R}$
Training Algorithm:
  $\mu_{test} \leftarrow \mathrm{FrechetMean}(\{x^{(1)}_1, \dots, x^{(1)}_m\})$
  $\sigma_{test} \leftarrow 0$
  for $t = 1, \dots, T$:
    $\mu = \mathrm{FrechetMean}(\{x^{(t)}_1, \dots, x^{(t)}_m\})$
    $\sigma = \sqrt{\tfrac{1}{m}\sum_{i=1}^{m} d(x^{(t)}_i, \mu)^2}$
    $\mu_{test} \leftarrow \mathrm{FrechetMean}(\{\mu_{test}, \mu\}, \{\eta, 1-\eta\})$
    $\sigma_{test} \leftarrow \tfrac{(t-1)\sigma_{test} + \sigma}{t}$
    for $i = 1, \dots, m$:
      $\tilde{x}^{(t)}_i \leftarrow \exp_{\mu'}\!\left(\tfrac{\sigma'}{\sigma}\, \mathrm{PT}_{\mu \to \mu'}(\log_{\mu} x^{(t)}_i)\right)$
    return normalized batch $\tilde{x}^{(t)}_1, \dots, \tilde{x}^{(t)}_m$

Testing Input: test data points $\{x_1, \dots, x_s\} \subseteq M$, final running mean $\mu_{test}$ and running variance $\sigma_{test}$
Testing Algorithm:
  $\mu = \mathrm{FrechetMean}(\{x_1, \dots, x_s\})$
  $\sigma = \sqrt{\tfrac{1}{s}\sum_{i=1}^{s} d(x_i, \mu)^2}$
  for $i = 1, \dots, s$:
    $\tilde{x}_i \leftarrow \exp_{\mu_{test}}\!\left(\tfrac{\sigma_{test}}{\sigma}\, \mathrm{PT}_{\mu \to \mu_{test}}(\log_{\mu} x_i)\right)$
  return normalized batch $\tilde{x}_1, \dots, \tilde{x}_s$

Our full algorithm is given in Algorithm 2. Note that in practice we use $\sqrt{\sigma^2 + \epsilon}$ in place of $\sigma$, as in the original formulation, to avoid division by zero.
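In code, one training-time pass of the normalization in Algorithm 2 reduces to a log map, a parallel transport, a rescaling, and an exp map. The sketch below is a minimal illustration that omits the running statistics; the helper arguments (`frechet_mean`, `dist`, `expmap`, `logmap`, `transport`) are assumed to be supplied, e.g. from the earlier sketches and Table 1, and `eps` plays the role of the $\sigma^2 + \epsilon$ guard mentioned above.

```python
import torch

def riemannian_batch_norm(x, target_mu, target_sigma, frechet_mean, dist,
                          expmap, logmap, transport, eps=1e-6):
    # One training-time pass of Algorithm 2 (running statistics omitted for brevity).
    # x: (m, n) batch on the manifold; target_mu, target_sigma are the learned parameters.
    w = torch.ones(x.shape[0])
    mu = frechet_mean(x, w)                               # batch Frechet mean
    sigma = torch.sqrt((dist(x, mu) ** 2).mean() + eps)   # batch Frechet "std", cf. Eq. (4)
    v = logmap(mu, x)                                     # lift the batch to T_mu M
    v = transport(mu, target_mu, v)                       # carry it to T_{mu'} M
    return expmap(target_mu, (target_sigma / sigma) * v)  # rescale and map back
```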
6.2.2. EXPERIMENTAL RESULTS

We apply Riemannian Batch Normalization (specifically for hyperbolic space) to the encoding Hyperbolic Neural Network (HNN) (Ganea et al., 2018) in the framework of Chami et al. (2019). We run on the CoRA (Sen et al., 2008), Disease (Chami et al., 2019), and Disease-M (Chami et al., 2019) datasets and present the validation loss and ROC AUC diagrams in Figure 3. In terms of both loss and ROC, our method results in both faster convergence and a better final result. These improvements are expected, as they appear when applying standard batch normalization to Euclidean neural networks. So, our manifold generalization does seem to replicate the useful properties of standard batch normalization. Additionally, it is encouraging to see that, regardless of the hyperbolic nature of the underlying dataset, hyperbolic batch normalization produces an improvement when paired with a hyperbolic neural network.

Figure 3. The graphs above correspond to a comparison of the HNN baseline, which uses a two-layer hyperbolic neural network encoder, and the baseline augmented with hyperbolic batch normalization after each layer. The columns correspond to the CoRA (Sen et al., 2008), Disease (Chami et al., 2019), and Disease-M (Chami et al., 2019) datasets, respectively. The top row shows the comparison in terms of validation loss, and the bottom row shows the comparison in terms of validation ROC AUC. The figures show that we converge faster and attain better performance in terms of both loss and ROC. Note that although CoRA is not hyperbolic (as previously mentioned), we find it encouraging that introducing hyperbolic batch normalization produces an improvement regardless of dataset hyperbolicity.

7. Conclusion and Future Work

We have presented a fully differentiable Fréchet mean operation for use in any differentiable programming setting. Concretely, we introduced differentiation theory for the general Riemannian case, and for the demonstrably useful case of hyperbolic space, we provided a fast forward-pass algorithm and explicit derivative computations. We demonstrated that using the Fréchet mean in place of tangent space aggregation yields state-of-the-art performance on link prediction tasks in graphs with tree-like structure. Additionally, we extended batch normalization (a standard Euclidean operation) to the realm of hyperbolic space. On a graph link prediction test task, we showed that hyperbolic batch normalization gives benefits similar to those experienced in the Euclidean setting.

We hope our work paves the way for future developments in geometric representation learning. Potential future work can focus on speeding up our computation of the Fréchet mean gradient, finding applications of our theory on manifolds beyond hyperbolic space, and applying the Fréchet mean to generalize more standard neural network operations.
8. Acknowledgements

We would like to acknowledge Horace He for his helpful comments regarding implementation. In addition, we would like to thank Facebook AI for funding equipment that made this work possible.

References

Adcock, A. B., Sullivan, B. D., and Mahoney, M. W. Tree-like structure in large social and information networks. 2013 IEEE 13th International Conference on Data Mining, pp. 1-10, 2013.

Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558-9570, 2019.

Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136-145, 2017.

Bačák, M. Computing medians and means in Hadamard spaces. SIAM Journal on Optimization, 24(3):1542-1566, 2014.

Brooks, D., Schwander, O., Barbaresco, F., Schneider, J.-Y., and Cord, M. Riemannian batch normalization for SPD neural networks. In Advances in Neural Information Processing Systems, pp. 15463-15474, 2019.

Casado, M. L. Trivializations for gradient-based optimization on manifolds. In Advances in Neural Information Processing Systems, pp. 9154-9164, 2019.

Chami, I., Ying, Z., Ré, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4869-4880, 2019.

Charlier, B. Necessary and sufficient condition for the existence of a Fréchet mean on the circle. ESAIM: Probability and Statistics, 17:635-649, 2013.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571-6583, 2018.

Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l'institut Henri Poincaré, volume 10, pp. 215-310, 1948.

Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pp. 5345-5355, 2018.

Gould, S., Fernando, B., Cherian, A., Anderson, P., Cruz, R. S., and Guo, E. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv, abs/1607.05447, 2016.

Gu, A., Sala, F., Gunel, B., and Ré, C. Learning mixed-curvature representations in product spaces. In International Conference on Learning Representations, 2019.

Gulcehre, C., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., and de Freitas, N. Hyperbolic attention networks. In International Conference on Learning Representations, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Karcher, H. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509-541, 1977.

Kleinberg, R. D. Geographic routing using hyperbolic space. IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pp. 1902-1909, 2007.

Law, M., Liao, R., Snell, J., and Zemel, R. Lorentzian distance learning for hyperbolic representations. In International Conference on Machine Learning, pp. 3672-3681, 2019.

Lee, J. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics. Springer New York, 1997.

Lee, J. M. Introduction to smooth manifolds. Graduate Texts in Mathematics, 218, 2003.
Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pp. 8228-8239, 2019.

Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338-6347, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning, pp. 3779-3788, 2018.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310-1318, 2013.

Pogančić, M. V., Paulus, A., Musil, V., Martius, G., and Rolinek, M. Differentiation of blackbox combinatorial solvers. In International Conference on Learning Representations, 2020.

Sala, F., Sa, C. D., Gu, A., and Ré, C. Representation tradeoffs for hyperbolic embeddings. Proceedings of Machine Learning Research, 80:4460-4469, 2018.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483-2493, 2018.

Sarkar, R. Low distortion Delaunay embedding of trees in hyperbolic plane. In Proceedings of the 19th International Conference on Graph Drawing, GD'11, pp. 355-366, Berlin, Heidelberg, 2011. Springer-Verlag.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29:93-106, 2008.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

Tifrea, A., Becigneul, G., and Ganea, O.-E. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations, 2019.

Udrişte, C. Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and Its Applications. Springer, Dordrecht, 1994. doi: 10.1007/978-94-015-8390-9.

Ungar, A. A. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1-194, 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv, abs/1706.03762, 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.

Wolter, F.-E. Distance function and cut loci on a complete Riemannian manifold. Archiv der Mathematik, 32(1):92-96, 1979. doi: 10.1007/BF01238473.

Yu, T. and De Sa, C. M. Numerically accurate hyperbolic embeddings using tiling-based models. In Advances in Neural Information Processing Systems, pp. 2021-2031, 2019.