Hyperbolic Neural Networks

Octavian-Eugen Ganea, Gary Bécigneul, Thomas Hofmann
Dept. of Computer Science, ETH Zürich, Zurich, Switzerland
(Equal contribution; correspondence at {octavian.ganea,gary.becigneul}@inf.ethz.ch)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, mostly because of the absence of corresponding hyperbolic neural network layers. This makes it hard to use hyperbolic embeddings in downstream tasks. Here, we bridge this gap in a principled manner by combining the formalism of Möbius gyrovector spaces with the Riemannian geometry of the Poincaré model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks such as gated recurrent units. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even though hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.

1 Introduction

It is common in machine learning to represent data as being embedded in the Euclidean space $\mathbb{R}^n$. The main reason for such a choice is simply convenience, as this space has a vectorial structure, closed-form formulas for distances and inner products, and is the natural generalization of our intuition-friendly, visual three-dimensional space. Moreover, embedding entities in such a continuous space allows us to feed them as input to neural networks, which has led to unprecedented performance on a broad range of problems, including sentiment detection [15], machine translation [3], textual entailment [22] or knowledge base link prediction [20, 6].

Despite the success of Euclidean embeddings, recent research has shown that many types of complex data (e.g. graph data) from a multitude of fields (e.g. biology, network science, computer graphics or computer vision) exhibit a highly non-Euclidean latent anatomy [8]. In such cases, the Euclidean space does not provide the most powerful or meaningful geometrical representations. For example, [10] shows that arbitrary tree structures cannot be embedded with arbitrarily low distortion (i.e. almost preserving their metric) in Euclidean space, even with an unbounded number of dimensions, but this task becomes surprisingly easy in the hyperbolic space with only 2 dimensions, where the exponential growth of distances matches the exponential growth of the number of nodes with the tree depth.

The adoption of neural networks and deep learning in these non-Euclidean settings has been rather limited until very recently, the main reason being the non-trivial or impossible principled generalizations of basic operations (e.g. vector addition, matrix-vector multiplication, vector translation, vector inner product) as well as, in more complex geometries, the lack of closed-form expressions for basic objects (e.g. distances, geodesics, parallel transport). Thus, classic tools such as multinomial
logistic regression (MLR), feed-forward (FFNN) or recurrent neural networks (RNN) did not have a correspondence in these geometries.

How should one generalize deep neural models to non-Euclidean domains? In this paper we address this question for one of the simplest, yet useful, non-Euclidean domains: spaces of constant negative curvature, i.e. hyperbolic spaces. Their tree-likeness properties have been extensively studied [12, 13, 26] and used to visualize large taxonomies [18] or to embed heterogeneous complex networks [17]. In machine learning, hyperbolic representations have recently greatly outperformed Euclidean embeddings for hierarchical, taxonomic or entailment data [21, 10, 11]. Disjoint subtrees from the latent hierarchical structure surprisingly disentangle and cluster in the embedding space as a simple reflection of the space's negative curvature. However, appropriate deep learning tools are needed to embed feature data in this space and use it in downstream tasks. For example, implicitly hierarchical sequence data (e.g. textual entailment data, phylogenetic trees of DNA sequences or hierarchical captions of images) would benefit from suitable hyperbolic RNNs.

The main contribution of this paper is to bridge the gap between hyperbolic and Euclidean geometry in the context of neural networks and deep learning by generalizing in a principled manner both the basic operations as well as multinomial logistic regression (MLR), feed-forward (FFNN), simple and gated (GRU) recurrent neural networks (RNN) to the Poincaré model of hyperbolic geometry. We do so by connecting the theory of gyrovector spaces and generalized Möbius transformations introduced by [2, 26] with the Riemannian geometry properties of the manifold. We smoothly parametrize basic operations and objects in all spaces of constant negative curvature using a unified framework that depends only on the curvature value. Thus, we show how Euclidean and hyperbolic spaces can be continuously deformed into each other. On a series of experiments and datasets we showcase the effectiveness of our hyperbolic neural network layers compared to their "classic" Euclidean variants on textual entailment and noisy-prefix recognition tasks. We hope that this paper will open exciting future directions in the nascent field of geometric deep learning.

2 The Geometry of the Poincaré Ball

Basics of differential geometry are presented in appendix A.

2.1 Hyperbolic space: the Poincaré ball

The hyperbolic space has five isometric models that one can work with [9]. Similarly as in [21] and [11], we choose to work in the Poincaré ball. The Poincaré ball model $(\mathbb{D}^n, g^{\mathbb{D}})$ is defined by the manifold $\mathbb{D}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ equipped with the following Riemannian metric:

$$g^{\mathbb{D}}_x = \lambda_x^2\, g^E, \quad \text{where } \lambda_x := \frac{2}{1 - \|x\|^2}, \qquad (1)$$

$g^E = I_n$ being the Euclidean metric tensor. Note that the hyperbolic metric tensor is conformal to the Euclidean one. The induced distance between two points $x, y \in \mathbb{D}^n$ is known to be given by

$$d_{\mathbb{D}}(x, y) = \cosh^{-1}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \qquad (2)$$

Since the Poincaré ball is conformal to Euclidean space, the angle between two vectors $u, v \in T_x\mathbb{D}^n \setminus \{0\}$ is given by

$$\cos(\angle(u, v)) = \frac{g^{\mathbb{D}}_x(u, v)}{\sqrt{g^{\mathbb{D}}_x(u, u)}\,\sqrt{g^{\mathbb{D}}_x(v, v)}} = \frac{\langle u, v\rangle}{\|u\|\,\|v\|}. \qquad (3)$$
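As a quick numerical illustration of Eqs. (1) and (2), the following NumPy sketch computes the conformal factor and the induced Poincaré distance. It is our own illustration rather than the authors' released code, and the function names are ours.

```python
import numpy as np

def conformal_factor(x):
    """Conformal factor lambda_x = 2 / (1 - ||x||^2) of the Poincare metric (Eq. 1)."""
    return 2.0 / (1.0 - np.dot(x, x))

def poincare_distance(x, y):
    """Induced distance d_D(x, y) on the Poincare ball (Eq. 2)."""
    sq = np.dot(x - y, x - y)
    denom = (1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Example: points close to the boundary are far apart even if Euclidean-close.
x = np.array([0.95, 0.0])
y = np.array([0.0, 0.95])
print(poincare_distance(x, y))  # much larger than the Euclidean distance ||x - y||
```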
2.2 Gyrovector spaces

In Euclidean space, natural operations inherited from the vectorial structure, such as vector addition, subtraction and scalar multiplication, are often useful. The framework of gyrovector spaces provides an elegant non-associative algebraic formalism for hyperbolic geometry, just as vector spaces provide the algebraic setting for Euclidean geometry [2, 25, 26]. In particular, these operations are used in special relativity, allowing one to add velocity vectors belonging to the Poincaré ball of radius c (the celerity, i.e. the speed of light) so that they remain in the ball, hence not exceeding the speed of light. We will make extensive use of these operations in our definitions of hyperbolic neural networks.

For $c \geq 0$, denote $\mathbb{D}^n_c := \{x \in \mathbb{R}^n \mid c\|x\|^2 < 1\}$. Note that if $c = 0$, then $\mathbb{D}^n_c = \mathbb{R}^n$; if $c > 0$, then $\mathbb{D}^n_c$ is the open ball of radius $1/\sqrt{c}$. If $c = 1$ then we recover the usual ball $\mathbb{D}^n$. Note that for $c, c' > 0$, $\mathbb{D}^n_c$ and $\mathbb{D}^n_{c'}$ are isometric. (We use a different notation from [25], where the author uses $s = 1/\sqrt{c}$.)

Möbius addition. The Möbius addition of x and y in $\mathbb{D}^n_c$ is defined as

$$x \oplus_c y := \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}. \qquad (4)$$

In particular, when c = 0, one recovers the Euclidean addition of two vectors in $\mathbb{R}^n$. Note that without loss of generality, the case c > 0 can be reduced to c = 1. Unless stated otherwise, we will write $\oplus$ for $\oplus_1$ to simplify notation. For general c > 0, this operation is neither commutative nor associative. However, it satisfies $x \oplus_c 0 = 0 \oplus_c x = x$. Moreover, for any $x, y \in \mathbb{D}^n_c$, we have $(-x) \oplus_c x = x \oplus_c (-x) = 0$ and $(-x) \oplus_c (x \oplus_c y) = y$ (left-cancellation law). The Möbius subtraction is then defined by the notation $x \ominus_c y := x \oplus_c (-y)$. See [29, Section 2.1] for a geometric interpretation of the Möbius addition.

Möbius scalar multiplication. For c > 0, the Möbius scalar multiplication of $x \in \mathbb{D}^n_c \setminus \{0\}$ by $r \in \mathbb{R}$ is defined as

$$r \otimes_c x := (1/\sqrt{c}) \tanh\!\left(r \tanh^{-1}(\sqrt{c}\,\|x\|)\right)\frac{x}{\|x\|}, \qquad r \otimes_c 0 := 0. \qquad (5)$$

Note that, similarly as for the Möbius addition, one recovers the Euclidean scalar multiplication when c goes to zero: $\lim_{c\to 0} r \otimes_c x = rx$. This operation satisfies desirable properties such as $n \otimes_c x = x \oplus_c \cdots \oplus_c x$ (n additions), $(r + r') \otimes_c x = r \otimes_c x \oplus_c r' \otimes_c x$ (scalar distributivity; here $\otimes_c$ has priority over $\oplus_c$, in the sense that $a \otimes_c b \oplus_c c := (a \otimes_c b) \oplus_c c$ and $a \oplus_c b \otimes_c c := a \oplus_c (b \otimes_c c)$), $(rr') \otimes_c x = r \otimes_c (r' \otimes_c x)$ (scalar associativity) and $|r| \otimes_c x / \|r \otimes_c x\| = x/\|x\|$ (scaling property).

Distance. If one defines the generalized hyperbolic metric tensor $g^c$ as the metric conformal to the Euclidean one, with conformal factor $\lambda^c_x := 2/(1 - c\|x\|^2)$, then the induced distance function on $(\mathbb{D}^n_c, g^c)$ is given by (the notation $-x \oplus_c y$ should always be read as $(-x) \oplus_c y$ and not $-(x \oplus_c y)$):

$$d_c(x, y) = (2/\sqrt{c}) \tanh^{-1}\!\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right). \qquad (6)$$

Again, observe that $\lim_{c\to 0} d_c(x, y) = 2\|x - y\|$, i.e. we recover Euclidean geometry in the limit. (The factor 2 comes from the conformal factor $\lambda_x = 2/(1 - \|x\|^2)$, which is a convention setting the curvature to -1.) Moreover, for c = 1 we recover $d_{\mathbb{D}}$ of Eq. (2).

Hyperbolic trigonometry. Similarly as in the Euclidean space, one can define the notions of hyperbolic angles or gyroangles (when using $\oplus_c$), as well as the hyperbolic law of sines in the generalized Poincaré ball $(\mathbb{D}^n_c, g^c)$. We make use of these notions in our proofs. See Appendix B.
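The Möbius operations of Section 2.2 translate directly into code. The sketch below is our own illustration (function names such as `mobius_add` are ours, not from the paper's codebase); it implements Eqs. (4)-(6) and numerically checks the left-cancellation law and the Euclidean limits as c goes to 0.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition x (+)_c y of Eq. (4)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def mobius_scalar_mul(r, x, c):
    """Mobius scalar multiplication r (x)_c x of Eq. (5)."""
    nx = np.linalg.norm(x)
    if nx == 0:
        return x
    return np.tanh(r * np.arctanh(np.sqrt(c) * nx)) * x / (np.sqrt(c) * nx)

def dist_c(x, y, c):
    """Induced distance d_c(x, y) of Eq. (6)."""
    return 2 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

x, y = np.array([0.1, 0.2]), np.array([-0.3, 0.25])
# Left-cancellation law: (-x) (+)_c (x (+)_c y) = y
print(np.allclose(mobius_add(-x, mobius_add(x, y, 1.0), 1.0), y))               # True
# Euclidean limits as c -> 0: addition, scalar multiplication, distance
print(np.allclose(mobius_add(x, y, 1e-9), x + y, atol=1e-6))                    # True
print(np.allclose(mobius_scalar_mul(2.5, x, 1e-9), 2.5 * x, atol=1e-6))         # True
print(np.isclose(dist_c(x, y, 1e-9), 2 * np.linalg.norm(x - y), atol=1e-6))     # True
```

In a practical implementation one would additionally clip point norms slightly away from the ball boundary for numerical stability; the sketch omits this.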
2.3 Connecting Gyrovector spaces and Riemannian geometry of the Poincaré ball

In this subsection, we present how geodesics in the Poincaré ball model are usually described with Möbius operations, and push one step further the existing connection between gyrovector spaces and the Poincaré ball by finding new identities involving the exponential map, the logarithmic map and parallel transport. In particular, these findings provide us with a simpler formulation of Möbius scalar multiplication, yielding a natural definition of matrix-vector multiplication in the Poincaré ball.

Riemannian gyroline element. The Riemannian gyroline element is defined for an infinitesimal dx as $ds := (x + dx) \ominus_c x$, and its size is given by [26, Section 3.7]:

$$\|ds\| = \|(x + dx) \ominus_c x\| = \|dx\| / (1 - c\|x\|^2). \qquad (7)$$

What is remarkable is that it turns out to be identical, up to a scaling factor of 2, to the usual line element $2\|dx\|/(1 - c\|x\|^2)$ of the Riemannian manifold $(\mathbb{D}^n_c, g^c)$.

Geodesics. The geodesic connecting points $x, y \in \mathbb{D}^n_c$ is shown in [2, 26] to be given by:

$$\gamma_{x\to y}(t) := x \oplus_c (-x \oplus_c y) \otimes_c t, \quad \text{with } \gamma_{x\to y} : \mathbb{R} \to \mathbb{D}^n_c \text{ s.t. } \gamma_{x\to y}(0) = x \text{ and } \gamma_{x\to y}(1) = y. \qquad (8)$$

Note that when c goes to 0, geodesics become straight lines, recovering Euclidean geometry. In the remainder of this subsection, we connect the gyrospace framework with Riemannian geometry.

Lemma 1. For any $x \in \mathbb{D}^n_c$ and $v \in T_x\mathbb{D}^n_c$ s.t. $g^c_x(v, v) = 1$, the unit-speed geodesic starting from x with direction v is given by:

$$\gamma_{x,v}(t) = x \oplus_c \left(\tanh\!\left(\frac{\sqrt{c}\,t}{2}\right)\frac{v}{\sqrt{c}\,\|v\|}\right), \quad \text{where } \gamma_{x,v} : \mathbb{R} \to \mathbb{D}^n_c \text{ s.t. } \gamma_{x,v}(0) = x \text{ and } \dot\gamma_{x,v}(0) = v. \qquad (9)$$

Proof. One can use Eq. (8) and reparametrize it to unit speed using Eq. (6). Alternatively, direct computation and identification with the formula in [11, Thm. 1] gives the same result.

Using Eq. (6) and Eq. (9), one can sanity-check that $d_c(\gamma(0), \gamma(t)) = t$, for all $t \in [0, 1]$.

Exponential and logarithmic maps. The following lemma gives the closed-form derivation of the exponential and logarithmic maps.

Lemma 2. For any point $x \in \mathbb{D}^n_c$, the exponential map $\exp^c_x : T_x\mathbb{D}^n_c \to \mathbb{D}^n_c$ and the logarithmic map $\log^c_x : \mathbb{D}^n_c \to T_x\mathbb{D}^n_c$ are given for $v \neq 0$ and $y \neq x$ by:

$$\exp^c_x(v) = x \oplus_c \left(\tanh\!\left(\sqrt{c}\,\frac{\lambda^c_x\|v\|}{2}\right)\frac{v}{\sqrt{c}\,\|v\|}\right), \quad \log^c_x(y) = \frac{2}{\sqrt{c}\,\lambda^c_x}\tanh^{-1}\!\left(\sqrt{c}\,\|{-x}\oplus_c y\|\right)\frac{-x\oplus_c y}{\|{-x}\oplus_c y\|}. \qquad (10)$$

Proof. Following the proof of [11, Cor. 1.1], one gets $\exp^c_x(v) = \gamma_{x,\, v/(\lambda^c_x\|v\|)}(\lambda^c_x\|v\|)$. Using Eq. (9) gives the formula for $\exp^c_x$. An algebraic check of the identity $\log^c_x(\exp^c_x(v)) = v$ concludes.

The above maps have more appealing forms when x = 0, namely for $v \in T_0\mathbb{D}^n_c \setminus \{0\}$ and $y \in \mathbb{D}^n_c \setminus \{0\}$:

$$\exp^c_0(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}, \quad \log^c_0(y) = \tanh^{-1}(\sqrt{c}\,\|y\|)\,\frac{y}{\sqrt{c}\,\|y\|}. \qquad (11)$$

Moreover, we still recover Euclidean geometry in the limit c → 0, as $\lim_{c\to 0}\exp^c_x(v) = x + v$ is the Euclidean exponential map, and $\lim_{c\to 0}\log^c_x(y) = y - x$ is the Euclidean logarithmic map.

Möbius scalar multiplication using exponential and logarithmic maps. We studied the exponential and logarithmic maps in order to gain a better understanding of the Möbius scalar multiplication (Eq. (5)). We found the following:

Lemma 3. The quantity $r \otimes_c x$ can be obtained by projecting x into the tangent space at 0 with the logarithmic map, multiplying this projection by the scalar r in $T_0\mathbb{D}^n_c$, and then projecting it back onto the manifold with the exponential map:

$$r \otimes_c x = \exp^c_0(r\log^c_0(x)), \quad \forall r \in \mathbb{R},\ x \in \mathbb{D}^n_c. \qquad (12)$$

In addition, we recover the well-known relation between geodesics connecting two points and the exponential map:

$$\gamma_{x\to y}(t) = x \oplus_c (-x \oplus_c y) \otimes_c t = \exp^c_x(t\log^c_x(y)), \quad t \in [0, 1]. \qquad (13)$$

This last result enables us to generalize scalar multiplication in order to define matrix-vector multiplication between Poincaré balls, one of the essential building blocks of hyperbolic neural networks.
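To illustrate Lemma 3, the following sketch implements the exponential and logarithmic maps at the origin (Eq. (11)) and checks numerically that $r \otimes_c x = \exp^c_0(r\log^c_0(x))$. It reuses `mobius_scalar_mul` from the previous sketch and is, again, only our own illustration with names of our choosing.

```python
import numpy as np

def exp0(v, c):
    """Exponential map at the origin, Eq. (11)."""
    nv = np.linalg.norm(v)
    if nv == 0:
        return v
    return np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def log0(y, c):
    """Logarithmic map at the origin, Eq. (11)."""
    ny = np.linalg.norm(y)
    if ny == 0:
        return y
    return np.arctanh(np.sqrt(c) * ny) * y / (np.sqrt(c) * ny)

# Numerical check of Lemma 3: r (x)_c x = exp_0(r * log_0(x)).
c, r = 1.0, 1.7
x = np.array([0.2, -0.4, 0.1])
print(np.allclose(mobius_scalar_mul(r, x, c), exp0(r * log0(x, c), c)))  # True
```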
Parallel transport. Finally, we connect parallel transport along the unique geodesic from 0 to x to gyrovector spaces with the following theorem, which we prove in appendix C.

Theorem 4. In the manifold $(\mathbb{D}^n_c, g^c)$, the parallel transport w.r.t. the Levi-Civita connection of a vector $v \in T_0\mathbb{D}^n_c$ to another tangent space $T_x\mathbb{D}^n_c$ is given by the following isometry:

$$P^c_{0\to x}(v) = \log^c_x(x \oplus_c \exp^c_0(v)) = \frac{\lambda^c_0}{\lambda^c_x}\,v. \qquad (14)$$

As we will see later, this result is crucial in order to define and optimize parameters shared between different tangent spaces, such as biases in hyperbolic neural layers or parameters of hyperbolic MLR.

3 Hyperbolic Neural Networks

Neural networks can be seen as being made of compositions of basic operations, such as linear maps, bias translations, pointwise non-linearities and a final sigmoid or softmax layer. We first explain how to construct a softmax layer for logits lying in a Poincaré ball. Then, we explain how to transform a mapping between two Euclidean spaces into one between Poincaré balls, yielding matrix-vector multiplication and pointwise non-linearities in the Poincaré ball. Finally, we present possible adaptations of various recurrent neural networks to the hyperbolic domain.

3.1 Hyperbolic multiclass logistic regression

In order to perform multi-class classification on the Poincaré ball, one needs to generalize multinomial logistic regression (MLR), also called softmax regression, to the Poincaré ball.

Reformulating Euclidean MLR. Let us first reformulate Euclidean MLR from the perspective of distances to margin hyperplanes, as in [19, Section 5]. This will allow us to easily generalize it. Given K classes, one learns a margin hyperplane for each such class using softmax probabilities:

$$\forall k \in \{1, ..., K\}, \quad p(y = k \mid x) \propto \exp\left(\langle a_k, x\rangle - b_k\right), \quad \text{where } b_k \in \mathbb{R},\ x, a_k \in \mathbb{R}^n. \qquad (15)$$

Note that any affine hyperplane in $\mathbb{R}^n$ can be written with a normal vector a and a scalar shift b:

$$H_{a,b} = \{x \in \mathbb{R}^n : \langle a, x\rangle - b = 0\}, \quad \text{where } a \in \mathbb{R}^n \setminus \{0\} \text{ and } b \in \mathbb{R}. \qquad (16)$$

As in [19, Section 5], we note that $\langle a, x\rangle - b = \mathrm{sign}(\langle a, x\rangle - b)\,\|a\|\,d(x, H_{a,b})$. Using Eq. (15):

$$p(y = k \mid x) \propto \exp\left(\mathrm{sign}(\langle a_k, x\rangle - b_k)\,\|a_k\|\,d(x, H_{a_k,b_k})\right), \quad b_k \in \mathbb{R},\ x, a_k \in \mathbb{R}^n. \qquad (17)$$

As it is not immediately obvious how to generalize the Euclidean hyperplane of Eq. (16) to other spaces such as the Poincaré ball, we reformulate it as follows:

$$\tilde{H}_{a,p} = \{x \in \mathbb{R}^n : \langle -p + x, a\rangle = 0\} = p + \{a\}^{\perp}, \quad \text{where } p \in \mathbb{R}^n,\ a \in \mathbb{R}^n \setminus \{0\}. \qquad (18)$$

This new definition relates to the previous one as $\tilde{H}_{a,p} = H_{a,\langle a,p\rangle}$. Rewriting Eq. (17) with $b_k = \langle a_k, p_k\rangle$:

$$p(y = k \mid x) \propto \exp\left(\mathrm{sign}(\langle -p_k + x, a_k\rangle)\,\|a_k\|\,d(x, \tilde{H}_{a_k,p_k})\right), \quad \text{with } p_k, x, a_k \in \mathbb{R}^n. \qquad (19)$$

It is now natural to adapt the previous definition to the hyperbolic setting by replacing + by $\oplus_c$:

Definition 3.1 (Poincaré hyperplanes). For $p \in \mathbb{D}^n_c$ and $a \in T_p\mathbb{D}^n_c \setminus \{0\}$, let $\{a\}^{\perp} := \{z \in T_p\mathbb{D}^n_c : g^c_p(z, a) = 0\} = \{z \in T_p\mathbb{D}^n_c : \langle z, a\rangle = 0\}$. Then, we define Poincaré hyperplanes as

$$\tilde{H}^c_{a,p} := \{x \in \mathbb{D}^n_c : \langle \log^c_p(x), a\rangle_p = 0\} = \exp^c_p(\{a\}^{\perp}) = \{x \in \mathbb{D}^n_c : \langle -p \oplus_c x, a\rangle = 0\}, \qquad (20)$$

where $\langle\cdot,\cdot\rangle$ denotes the (Euclidean) inner product of the ambient space. The last equality is shown in appendix D. $\tilde{H}^c_{a,p}$ can also be described as the union of images of all geodesics in $\mathbb{D}^n_c$ orthogonal to a and containing p. Notice that our definition matches that of hypergyroplanes, see [27, Definition 5.8]. A 3D hyperplane example is depicted in Fig. 1. Next, we need the following theorem, proved in appendix E:

Theorem 5.

$$d_c(x, \tilde{H}^c_{a,p}) := \inf_{w \in \tilde{H}^c_{a,p}} d_c(x, w) = \frac{1}{\sqrt{c}}\sinh^{-1}\!\left(\frac{2\sqrt{c}\,|\langle -p \oplus_c x, a\rangle|}{(1 - c\,\|{-p} \oplus_c x\|^2)\,\|a\|}\right). \qquad (21)$$

Final formula for MLR in the Poincaré ball. Putting together Eq. (19) and Thm. 5, we get the hyperbolic MLR formulation. Given K classes and $k \in \{1, \ldots, K\}$, $p_k \in \mathbb{D}^n_c$, $a_k \in T_{p_k}\mathbb{D}^n_c \setminus \{0\}$:

$$p(y = k \mid x) \propto \exp\left(\mathrm{sign}(\langle -p_k \oplus_c x, a_k\rangle)\sqrt{g^c_{p_k}(a_k, a_k)}\;d_c(x, \tilde{H}^c_{a_k,p_k})\right), \quad \forall x \in \mathbb{D}^n_c, \qquad (22)$$

or, equivalently,

$$p(y = k \mid x) \propto \exp\left(\frac{\lambda^c_{p_k}\|a_k\|}{\sqrt{c}}\sinh^{-1}\!\left(\frac{2\sqrt{c}\,\langle -p_k \oplus_c x, a_k\rangle}{(1 - c\,\|{-p_k} \oplus_c x\|^2)\,\|a_k\|}\right)\right), \quad \forall x \in \mathbb{D}^n_c. \qquad (23)$$
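The closed-form logit of Eq. (23) is straightforward to implement. The sketch below is our own hedged illustration (helper names are ours; it reuses `mobius_add` from an earlier sketch): it computes the unnormalized log-probability of one class and turns the K logits into probabilities with a standard softmax. During training, each $a_k$ would be reparametrized through a Euclidean $a'_k$, as discussed in the next paragraph.

```python
import numpy as np

def hyperbolic_mlr_logit(x, p_k, a_k, c):
    """Unnormalized log-probability of class k from Eq. (23).

    x   : point in D^n_c to classify
    p_k : hyperbolic offset of class k (point in D^n_c)
    a_k : normal vector of class k (tangent vector at p_k)
    """
    z = mobius_add(-p_k, x, c)                      # -p_k (+)_c x
    lam_pk = 2.0 / (1.0 - c * np.dot(p_k, p_k))     # conformal factor at p_k
    norm_a = np.linalg.norm(a_k)
    arg = 2.0 * np.sqrt(c) * np.dot(z, a_k) / ((1.0 - c * np.dot(z, z)) * norm_a)
    return lam_pk * norm_a / np.sqrt(c) * np.arcsinh(arg)

def hyperbolic_mlr_probs(x, ps, As, c):
    """Class probabilities: a standard softmax over the K hyperbolic logits."""
    logits = np.array([hyperbolic_mlr_logit(x, p, a, c) for p, a in zip(ps, As)])
    e = np.exp(logits - logits.max())
    return e / e.sum()
```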
Notice that when c goes to zero, this tends to $p(y = k \mid x) \propto \exp(4\langle -p_k + x, a_k\rangle) = \exp((\lambda^0_{p_k})^2\langle -p_k + x, a_k\rangle) = \exp(\langle -p_k + x, a_k\rangle_0)$, recovering the usual Euclidean softmax. However, at this point it is unclear how to perform optimization over $a_k$, since it lives in $T_{p_k}\mathbb{D}^n_c$ and hence depends on $p_k$. The solution is to write $a_k = P^c_{0\to p_k}(a'_k) = (\lambda^c_0/\lambda^c_{p_k})\,a'_k$, where $a'_k \in T_0\mathbb{D}^n_c = \mathbb{R}^n$, and to optimize $a'_k$ as a Euclidean parameter.

3.2 Hyperbolic feed-forward layers

Figure 1: An example of a hyperbolic hyperplane in $\mathbb{D}^3_1$ plotted using sampling. The red point is p. The shown normal axis to the hyperplane through p is parallel to a.

In order to define hyperbolic neural networks, it is crucial to define a canonically simple parametric family of transformations, playing the role of linear mappings in usual Euclidean neural networks, and to know how to apply pointwise non-linearities. Inspired by our reformulation of Möbius scalar multiplication in Eq. (12), we define:

Definition 3.2 (Möbius version). For $f : \mathbb{R}^n \to \mathbb{R}^m$, we define the Möbius version of f as the map from $\mathbb{D}^n_c$ to $\mathbb{D}^m_c$ given by:

$$f^{\otimes_c}(x) := \exp^c_0(f(\log^c_0(x))), \qquad (24)$$

where $\exp^c_0 : T_{0_m}\mathbb{D}^m_c \to \mathbb{D}^m_c$ and $\log^c_0 : \mathbb{D}^n_c \to T_{0_n}\mathbb{D}^n_c$.

Note that, similarly as for other Möbius operations, we recover the Euclidean mapping in the limit c → 0 if f is continuous, as $\lim_{c\to 0} f^{\otimes_c}(x) = f(x)$. This definition satisfies a few desirable properties too, such as: $(f \circ g)^{\otimes_c} = f^{\otimes_c} \circ g^{\otimes_c}$ for $f : \mathbb{R}^m \to \mathbb{R}^l$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ (morphism property), and $f^{\otimes_c}(x)/\|f^{\otimes_c}(x)\| = f(x)/\|f(x)\|$ for $f(x) \neq 0$ (direction preserving). It is then straightforward to prove the following result:

Lemma 6 (Möbius matrix-vector multiplication). If $M : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, which we identify with its matrix representation, then for all $x \in \mathbb{D}^n_c$, if $Mx \neq 0$ we have

$$M^{\otimes_c}(x) = (1/\sqrt{c})\tanh\!\left(\frac{\|Mx\|}{\|x\|}\tanh^{-1}(\sqrt{c}\,\|x\|)\right)\frac{Mx}{\|Mx\|}, \qquad (25)$$

and $M^{\otimes_c}(x) = 0$ if $Mx = 0$. Moreover, if we define the Möbius matrix-vector multiplication of $M \in \mathcal{M}_{m,n}(\mathbb{R})$ and $x \in \mathbb{D}^n_c$ by $M \otimes_c x := M^{\otimes_c}(x)$, then we have $(MM') \otimes_c x = M \otimes_c (M' \otimes_c x)$ for $M \in \mathcal{M}_{l,m}(\mathbb{R})$ and $M' \in \mathcal{M}_{m,n}(\mathbb{R})$ (matrix associativity), $(rM) \otimes_c x = r \otimes_c (M \otimes_c x)$ for $r \in \mathbb{R}$ and $M \in \mathcal{M}_{m,n}(\mathbb{R})$ (scalar-matrix associativity), and $M \otimes_c x = Mx$ for all $M \in \mathcal{O}_n(\mathbb{R})$ (rotations are preserved).

Pointwise non-linearity. If $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ is a pointwise non-linearity, then its Möbius version $\varphi^{\otimes_c}$ can be applied to elements of the Poincaré ball.

Bias translation. The generalization of a translation in the Poincaré ball is naturally given by moving along geodesics. But should we use the Möbius sum $x \oplus_c b$ with a hyperbolic bias b, or the exponential map $\exp^c_x(b')$ with a Euclidean bias b'? These views are unified via parallel transport (see Thm. 4): the Möbius translation of a point $x \in \mathbb{D}^n_c$ by a bias $b \in \mathbb{D}^n_c$ is given by

$$x \mapsto x \oplus_c b = \exp^c_x(P^c_{0\to x}(\log^c_0(b))) = \exp^c_x\!\left(\frac{\lambda^c_0}{\lambda^c_x}\log^c_0(b)\right). \qquad (26)$$

We recover Euclidean translations in the limit c → 0. Note that bias translations play a particular role in this model. Indeed, consider multiple layers of the form $f_k(x) = \varphi_k(M_kx)$, each having Möbius version $f^{\otimes_c}_k(x) = \varphi^{\otimes_c}_k(M_k \otimes_c x)$. Then their composition can be re-written as $f^{\otimes_c}_k \circ \cdots \circ f^{\otimes_c}_1 = \exp^c_0 \circ f_k \circ \cdots \circ f_1 \circ \log^c_0$. This means that these operations can essentially be performed in Euclidean space. Therefore, it is the interposition of the bias translations of Eq. (26) between such layers which differentiates this model from its Euclidean counterpart.
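Below is a minimal sketch of the resulting hyperbolic feed-forward layer, assuming the helpers `exp0`, `log0` and `mobius_add` from the previous sketches (all names are ours). It implements the Möbius version of Eq. (24), Möbius matrix-vector multiplication, a layer of the form $\varphi^{\otimes_c}(M \otimes_c x \oplus_c b)$, and numerically checks the matrix associativity stated in Lemma 6.

```python
import numpy as np

def mobius_fn(f, x, c):
    """Mobius version f^{(x)_c}(x) = exp_0(f(log_0(x))) of Eq. (24)."""
    return exp0(f(log0(x, c)), c)

def mobius_matvec(M, x, c):
    """Mobius matrix-vector multiplication M (x)_c x (Lemma 6)."""
    return mobius_fn(lambda v: M @ v, x, c)

def hyperbolic_linear(M, b, x, c, phi=np.tanh):
    """One hyperbolic feed-forward layer: phi^{(x)_c}(M (x)_c x (+)_c b),
    with a hyperbolic bias b translated as in Eq. (26)."""
    h = mobius_add(mobius_matvec(M, x, c), b, c)
    return mobius_fn(phi, h, c)

# Numerical check of matrix associativity: (AB) (x)_c x = A (x)_c (B (x)_c x)
c = 1.0
x = np.array([0.1, -0.2, 0.3])
A, B = np.random.randn(2, 3), np.random.randn(3, 3)
lhs = mobius_matvec(A @ B, x, c)
rhs = mobius_matvec(A, mobius_matvec(B, x, c), c)
print(np.allclose(lhs, rhs))  # True
```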
Concatenation of multiple input vectors. If a vector $x \in \mathbb{R}^{n+p}$ is the (vertical) concatenation of two vectors $x_1 \in \mathbb{R}^n$ and $x_2 \in \mathbb{R}^p$, and $M \in \mathcal{M}_{m,n+p}(\mathbb{R})$ can be written as the (horizontal) concatenation of two matrices $M_1 \in \mathcal{M}_{m,n}(\mathbb{R})$ and $M_2 \in \mathcal{M}_{m,p}(\mathbb{R})$, then $Mx = M_1x_1 + M_2x_2$. We generalize this to hyperbolic spaces: if we are given $x_1 \in \mathbb{D}^n_c$, $x_2 \in \mathbb{D}^p_c$, $x = (x_1\ x_2)^T \in \mathbb{D}^n_c \times \mathbb{D}^p_c$, and $M, M_1, M_2$ as before, then we define $M \otimes_c x := M_1 \otimes_c x_1 \oplus_c M_2 \otimes_c x_2$. Note that when c goes to zero, we recover the Euclidean formulation, as $\lim_{c\to 0} M \otimes_c x = \lim_{c\to 0} (M_1 \otimes_c x_1 \oplus_c M_2 \otimes_c x_2) = M_1x_1 + M_2x_2 = Mx$. Moreover, hyperbolic vectors $x \in \mathbb{D}^n_c$ can also be "concatenated" with real features $y \in \mathbb{R}$ by computing $M \otimes_c x \oplus_c y \otimes_c b$, with learnable $b \in \mathbb{D}^m_c$ and $M \in \mathcal{M}_{m,n}(\mathbb{R})$.

3.3 Hyperbolic RNN

Naive RNN. A simple RNN can be defined by $h_{t+1} = \varphi(Wh_t + Ux_t + b)$, where $\varphi$ is a pointwise non-linearity, typically tanh, sigmoid, ReLU, etc. This formula can be naturally generalized to the hyperbolic space as follows. For parameters $W \in \mathcal{M}_{m,n}(\mathbb{R})$, $U \in \mathcal{M}_{m,d}(\mathbb{R})$, $b \in \mathbb{D}^m_c$, we define:

$$h_{t+1} = \varphi^{\otimes_c}(W \otimes_c h_t \oplus_c U \otimes_c x_t \oplus_c b), \quad h_t \in \mathbb{D}^n_c,\ x_t \in \mathbb{D}^d_c. \qquad (27)$$

Note that if the inputs $x_t$ are Euclidean, one can replace them by $\exp^c_0(x_t)$ and use the above formula, since $\exp^c_{W\otimes_c h_t}(P^c_{0\to W\otimes_c h_t}(Ux_t)) = W \otimes_c h_t \oplus_c \exp^c_0(Ux_t) = W \otimes_c h_t \oplus_c U \otimes_c \exp^c_0(x_t)$.

GRU architecture. One can also adapt the GRU architecture:

$$r_t = \sigma(W^r h_{t-1} + U^r x_t + b^r), \quad z_t = \sigma(W^z h_{t-1} + U^z x_t + b^z),$$
$$\tilde{h}_t = \varphi(W(r_t \odot h_{t-1}) + Ux_t + b), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \qquad (28)$$

where $\odot$ denotes the pointwise product. First, how should we adapt the pointwise multiplication by a scaling gate? Note that the definition of the Möbius version (see Eq. (24)) can be naturally extended to maps $f : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^m$ as $f^{\otimes_c} : (h, h') \in \mathbb{D}^n_c \times \mathbb{D}^p_c \mapsto \exp^c_0(f(\log^c_0(h), \log^c_0(h')))$. In particular, choosing $f(h, h') := \sigma(h) \odot h'$ yields $f^{\otimes_c}(h, h') = \exp^c_0(\sigma(\log^c_0(h)) \odot \log^c_0(h')) = \mathrm{diag}(\sigma(\log^c_0(h))) \otimes_c h'$, where $\mathrm{diag}(x)$ denotes the diagonal matrix with the coordinates $x_i$ of x on its diagonal. Hence we adapt $r_t \odot h_{t-1}$ to $\mathrm{diag}(r_t) \otimes_c h_{t-1}$ and the reset gate $r_t$ to:

$$r_t = \sigma\!\left(\log^c_0(W^r \otimes_c h_{t-1} \oplus_c U^r \otimes_c x_t \oplus_c b^r)\right), \qquad (29)$$

and similarly for the update gate $z_t$. Note that since the argument of $\sigma$ above is unbounded, $r_t$ and $z_t$ can a priori take values over the full range (0, 1). The intermediate hidden state now becomes:

$$\tilde{h}_t = \varphi^{\otimes_c}\!\left((W\mathrm{diag}(r_t)) \otimes_c h_{t-1} \oplus_c U \otimes_c x_t \oplus_c b\right), \qquad (30)$$

where Möbius matrix associativity simplifies $W \otimes_c (\mathrm{diag}(r_t) \otimes_c h_{t-1})$ into $(W\mathrm{diag}(r_t)) \otimes_c h_{t-1}$. Finally, we propose to adapt the update-gate equation as

$$h_t = h_{t-1} \oplus_c \mathrm{diag}(z_t) \otimes_c (-h_{t-1} \oplus_c \tilde{h}_t). \qquad (31)$$

Note that when c goes to zero, one recovers the usual GRU. Moreover, if $z_t = 0$ or $z_t = 1$, then $h_t$ becomes $h_{t-1}$ or $\tilde{h}_t$ respectively, similarly as in the usual GRU. This adaptation was obtained by adapting [24]: in that work, the authors re-derive the update-gate mechanism from a first principle called time-warping invariance. We adapted their derivation to the hyperbolic setting by using the notion of gyroderivative [4] and proving a gyro-chain-rule (see appendix F).
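Putting Eqs. (29)-(31) together, one step of the hyperbolic GRU can be sketched as follows. This is our own illustration (not the paper's TensorFlow implementation); it reuses the helpers defined in the earlier sketches, uses tanh for $\varphi$, and the parameter naming is ours.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hyp_gru_step(h_prev, x_t, params, c):
    """One step of the hyperbolic GRU, Eqs. (29)-(31).
    h_prev and x_t live on the Poincare ball; params holds Euclidean matrices
    Wr, Ur, Wz, Uz, W, U and hyperbolic biases br, bz, b."""
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    W,  U,  b  = params["W"],  params["U"],  params["b"]

    def gate(Wg, Ug, bg):
        pre = mobius_add(mobius_add(mobius_matvec(Wg, h_prev, c),
                                    mobius_matvec(Ug, x_t, c), c), bg, c)
        return sigmoid(log0(pre, c))                                           # Eq. (29)

    r_t, z_t = gate(Wr, Ur, br), gate(Wz, Uz, bz)
    h_tilde = mobius_fn(np.tanh,
                        mobius_add(mobius_add(mobius_matvec(W @ np.diag(r_t), h_prev, c),
                                              mobius_matvec(U, x_t, c), c), b, c), c)  # Eq. (30)
    delta = mobius_add(-h_prev, h_tilde, c)
    return mobius_add(h_prev, mobius_matvec(np.diag(z_t), delta, c), c)        # Eq. (31)
```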
4 Experiments

SNLI task and dataset. We evaluate our method on two tasks. The first is natural language inference, or textual entailment. Given two sentences, a premise (e.g. "Little kids A. and B. are playing soccer.") and a hypothesis (e.g. "Two children are playing outdoors."), the binary classification task is to predict whether the second sentence can be inferred from the first one. This defines a partial order in the sentence space. We test hyperbolic networks on the biggest real dataset for this task, SNLI [7]. It consists of 570K training, 10K validation and 10K test sentence pairs. Following [28], we merge the "contradiction" and "neutral" classes into a single class of negative sentence pairs, while the "entailment" class gives the positive pairs.

| Model | SNLI | PREFIX-10% | PREFIX-30% | PREFIX-50% |
|---|---|---|---|---|
| Fully Euclidean RNN | 79.34% | 89.62% | 81.71% | 72.10% |
| Hyp RNN+FFNN, Eucl MLR | 79.18% | 96.36% | 87.83% | 76.50% |
| Fully hyperbolic RNN | 78.21% | 96.91% | 87.25% | 62.94% |
| Fully Euclidean GRU | 81.52% | 95.96% | 86.47% | 75.04% |
| Hyp GRU+FFNN, Eucl MLR | 79.76% | 97.36% | 88.47% | 76.87% |
| Fully hyperbolic GRU | 81.19% | 97.14% | 88.26% | 76.44% |

Table 1: Test accuracies for various models and four datasets. Eucl denotes Euclidean, Hyp denotes hyperbolic. All word and sentence embeddings have dimension 5. We highlight the best baseline (or baselines, if the difference is less than 0.5%).

PREFIX task and datasets. We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree. To test this, we design a proof-of-concept task of detection of noisy prefixes: given two sentences, one has to decide whether the second sentence is a noisy prefix of the first, or a random sentence. We thus build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: for each random first sentence of random length at most 20 and one random prefix of it, a second positive sentence is generated by randomly replacing Z% of the words of the prefix, and a second negative sentence of the same length is randomly generated. The word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs. Experimental details are presented in appendix G.

Models architecture. Our neural network layers can be used in a plug-and-play manner exactly like standard Euclidean layers, and they can also be combined with Euclidean layers. However, optimization w.r.t. hyperbolic parameters is different (see below) and based on Riemannian gradients, which are just rescaled Euclidean gradients when working in the conformal Poincaré model [21]. Thus, back-propagation can be applied in the standard way. In our setting, we embed the two sentences using two distinct hyperbolic RNNs or GRUs. The sentence embeddings are then fed, together with their squared distance (hyperbolic or Euclidean, depending on their geometry), to a FFNN (Euclidean or hyperbolic, see Sec. 3.2), which is further fed to an MLR (Euclidean or hyperbolic, see Sec. 3.1) that gives probabilities of the two classes (entailment vs. neutral). We use a cross-entropy loss on top. Note that hyperbolic and Euclidean layers can be mixed, e.g. the full network can be hyperbolic and only the last layer Euclidean, in which case one has to use the log_0 and exp_0 maps to move between the two manifolds in a correct manner, as explained for Eq. (24).

For the results shown in Tab. 1, we run each model (baseline or ours) exactly 3 times and report the test result corresponding to the best validation result from these 3 runs. We do this because the highly non-convex spectrum of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important.
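For concreteness, here is a hypothetical forward pass matching the architecture described above: two hyperbolic GRUs encode the sentences, the embeddings and their squared hyperbolic distance are combined with the real-feature concatenation trick of Sec. 3.2, and a hyperbolic FFNN layer feeds a hyperbolic MLR. All function and parameter names are ours, it reuses the earlier sketches, and it omits training, initialization and Riemannian optimization entirely.

```python
import numpy as np

def embed_sentence(word_vecs, params, c):
    """Encode a sequence of Euclidean word vectors into a hyperbolic sentence
    embedding by running the hyperbolic GRU from the previous sketch."""
    h = np.zeros(params["W"].shape[0])            # start at the origin of the ball
    for w in word_vecs:
        h = hyp_gru_step(h, exp0(w, c), params, c)
    return h

def entailment_probs(sent1, sent2, gru1, gru2, M, b, mlr_ps, mlr_As, c):
    """Hypothetical end-to-end forward pass: two hyperbolic GRUs, a hyperbolic
    FFNN layer, then hyperbolic MLR over the two classes."""
    u, v = embed_sentence(sent1, gru1, c), embed_sentence(sent2, gru2, c)
    d2 = dist_c(u, v, c) ** 2                      # squared hyperbolic distance feature
    # "Concatenate" the two embeddings, then add the real feature d2 (Sec. 3.2).
    feat = mobius_add(mobius_matvec(M[:, :len(u)], u, c),
                      mobius_matvec(M[:, len(u):], v, c), c)
    feat = mobius_add(feat, mobius_scalar_mul(d2, b, c), c)
    feat = mobius_fn(np.tanh, feat, c)             # hyperbolic FFNN non-linearity
    return hyperbolic_mlr_probs(feat, mlr_ps, mlr_As, c)
```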
Results. Results are shown in Tab. 1. Note that the fully Euclidean baseline models might have an advantage over the hyperbolic models because more sophisticated optimization algorithms such as Adam do not have a hyperbolic analogue at the moment. We first observe that all GRU models outperform their RNN variants. Hyperbolic RNNs and GRUs show the most significant improvement over their Euclidean variants when the underlying data structure is more tree-like: e.g. on PREFIX-10%, for which the tree relation between sentences and their prefixes is most prominent, we reduce the error by a factor of 3.35 for the hyperbolic vs. Euclidean RNN, and by a factor of 1.5 for the hyperbolic vs. Euclidean GRU. As the underlying structure diverges more and more from a tree, the accuracy gap decreases: for example, on PREFIX-50% the noise heavily affects the representational power of hyperbolic networks. Also, note that on SNLI our methods perform similarly to their Euclidean variants. Moreover, hyperbolic and Euclidean MLR are on par when used in conjunction with hyperbolic sentence embeddings, suggesting that further empirical investigation of this direction is needed (see below).

MLR classification experiments. For the sentence entailment classification task we do not see a clear advantage of hyperbolic MLR compared to its Euclidean variant. A possible reason is that, when trained end-to-end, the model might decide to place positive and negative embeddings in a manner that is already well separated with a classic MLR. As a consequence, we further investigate MLR for the task of subtree classification. Using an open-source implementation of [21] (https://github.com/dalab/hyperbolic_cones), we pre-trained Poincaré embeddings of the WordNet noun hierarchy (82,115 nodes). We then choose one node in this tree (see Table 2) and classify all other nodes (solely based on their embeddings) as being part of the subtree rooted at this node. All nodes in such a subtree are divided into positive training nodes (80%) and positive test nodes (20%). The same splitting procedure is applied for the remaining WordNet nodes, which are divided into a negative training and a negative test set, respectively.

| WordNet subtree | Model | D = 2 | D = 3 | D = 5 | D = 10 |
|---|---|---|---|---|---|
| ANIMAL.N.01 (3218 / 798) | Hyp | 47.43 ± 1.07 | 91.92 ± 0.61 | 98.07 ± 0.55 | 99.26 ± 0.59 |
| | Eucl | 41.69 ± 0.19 | 68.43 ± 3.90 | 95.59 ± 1.18 | 99.36 ± 0.18 |
| | log0 | 38.89 ± 0.01 | 62.57 ± 0.61 | 89.21 ± 1.34 | 98.27 ± 0.70 |
| GROUP.N.01 (6649 / 1727) | Hyp | 81.72 ± 0.17 | 89.87 ± 2.73 | 87.89 ± 0.80 | 91.91 ± 3.07 |
| | Eucl | 61.13 ± 0.42 | 63.56 ± 1.22 | 67.82 ± 0.81 | 91.38 ± 1.19 |
| | log0 | 60.75 ± 0.24 | 61.98 ± 0.57 | 67.92 ± 0.74 | 91.41 ± 0.18 |
| WORKER.N.01 (861 / 254) | Hyp | 12.68 ± 0.82 | 24.09 ± 1.49 | 55.46 ± 5.49 | 66.83 ± 11.38 |
| | Eucl | 10.86 ± 0.01 | 22.39 ± 0.04 | 35.23 ± 3.16 | 47.29 ± 3.93 |
| | log0 | 9.04 ± 0.06 | 22.57 ± 0.20 | 26.47 ± 0.78 | 36.66 ± 2.74 |
| MAMMAL.N.01 (953 / 228) | Hyp | 32.01 ± 17.14 | 87.54 ± 4.55 | 88.73 ± 3.22 | 91.37 ± 6.09 |
| | Eucl | 15.58 ± 0.04 | 44.68 ± 1.87 | 59.35 ± 1.31 | 77.76 ± 5.08 |
| | log0 | 13.10 ± 0.13 | 44.89 ± 1.18 | 52.51 ± 0.85 | 56.11 ± 2.21 |

Table 2: Test F1 classification scores (%) for four different subtrees of the WordNet noun tree. 95% confidence intervals over 3 different runs are shown for each method and each dimension D. Hyp denotes our hyperbolic MLR, Eucl denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with the log_0 map.

Figure 2: Hyperbolic (left) vs. direct Euclidean (right) binary MLR used to classify nodes as being part of the GROUP.N.01 subtree of the WordNet noun hierarchy, solely based on their Poincaré embeddings. The positive points (from the subtree) are in red, the negative points (the rest) are in yellow, and the trained positive separation hyperplane is depicted in green.
Three variants of MLR are then trained on top of the pre-trained Poincaré embeddings [21] to solve this binary classification task: hyperbolic MLR, Euclidean MLR applied directly to the hyperbolic embeddings (even though mathematically this does not respect the hyperbolic geometry), and Euclidean MLR applied after mapping all embeddings to the tangent space at 0 using the log_0 map. More experimental details are given in appendix G.2.

Quantitative results are presented in Table 2. We can see that hyperbolic MLR outperforms its Euclidean variants in almost all settings, sometimes by a large margin. Moreover, to provide further understanding, we plot the 2-dimensional embeddings and the trained separation hyperplanes (geodesics in this case) in Figure 2.

5 Conclusion

We showed how classic Euclidean deep learning tools such as MLR, FFNNs, RNNs or GRUs can be generalized in a principled manner to all spaces of constant negative curvature by combining Riemannian geometry with the elegant theory of gyrovector spaces. Empirically we found that our models outperform or are on par with corresponding Euclidean architectures on sequential data with implicit hierarchical structure. We hope to trigger exciting future research related to a better understanding of the hyperbolic non-convexity spectrum and to the development of other non-Euclidean deep learning methods. Our data and TensorFlow [1] code are publicly available at https://github.com/dalab/hyperbolic_nn.

Acknowledgements

We thank Igor Petrovski for useful pointers regarding the implementation. This research is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176. Gary Bécigneul is also funded by the Max Planck ETH Center for Learning Systems.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. 2016.
[2] Abraham Albert Ungar. Analytic hyperbolic geometry and Albert Einstein's special theory of relativity. World Scientific, 2008.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[4] Graciela S. Birman and Abraham A. Ungar. The hyperbolic derivative in the Poincaré ball model of hyperbolic geometry. Journal of Mathematical Analysis and Applications, 254(1):321-333, 2001.
[5] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, Sept 2013.
[6] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (NIPS), pages 2787-2795, 2013.
[7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632-642. Association for Computational Linguistics, 2015.
[8] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.
[9] James W. Cannon, William J. Floyd, Richard Kenyon, Walter R. Parry, et al. Hyperbolic geometry. Flavors of Geometry, 31:59-115, 1997.
[10] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.
[11] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the Thirty-Fifth International Conference on Machine Learning (ICML), 2018.
[12] Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pages 75-263. Springer, 1987.
[13] Matthias Hamann. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pages 1-17, 2017.
[14] Christopher Hopper and Ben Andrews. The Ricci Flow in Riemannian Geometry. Springer, 2010.
[15] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751. Association for Computational Linguistics, 2014.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[17] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
[18] John Lamping, Ramana Rao, and Peter Pirolli. A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 401-408. ACM Press/Addison-Wesley Publishing Co., 1995.
[19] Guy Lebanon and John Lafferty. Hyperplane margin classifiers on the multinomial manifold. In Proceedings of the International Conference on Machine Learning (ICML), page 66. ACM, 2004.
[20] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the International Conference on Machine Learning (ICML), volume 11, pages 809-816, 2011.
[21] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NIPS), pages 6341-6350, 2017.
[22] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[23] Michael Spivak. A Comprehensive Introduction to Differential Geometry. Publish or Perish, 1979.
[24] Corentin Tallec and Yann Ollivier. Can recurrent neural networks warp time? In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[25] Abraham A. Ungar. Hyperbolic trigonometry and its application in the Poincaré ball model of hyperbolic geometry. Computers & Mathematics with Applications, 41(1-2):135-147, 2001.
[26] Abraham Albert Ungar. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1-194, 2008.
[27] Abraham Albert Ungar. Analytic Hyperbolic Geometry in N Dimensions: An Introduction. CRC Press, 2014.
[28] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[29] J. Vermeer. A geometric interpretation of Ungar's addition and of gyration in the hyperbolic plane. Topology and its Applications, 152(3):226-242, 2005.