# Constant Curvature Graph Convolutional Networks

Gregor Bachmann*¹, Gary Bécigneul*¹, Octavian-Eugen Ganea²

*Equal contribution. ¹Department of Computer Science, ETH Zürich. ²Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Correspondence to: Gregor Bachmann, Gary Bécigneul, Octavian-Eugen Ganea.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Interest has been rising lately towards methods representing data in non-Euclidean spaces, e.g. hyperbolic or spherical, that provide specific inductive biases useful for certain real-world data properties, e.g. scale-free, hierarchical or cyclical. However, popular graph neural networks are currently limited to modeling data only via Euclidean geometry and the associated vector space operations. Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. We do this by i) introducing a unified formalism permitting a differentiable interpolation between all geometries of constant curvature, and ii) leveraging gyrobarycentric coordinates that generalize the classic Euclidean concept of the center of mass. Our class of models smoothly recovers its Euclidean counterparts when the curvature goes to zero from either side. Empirically, we outperform Euclidean GCNs on the tasks of node classification and distortion minimization for symbolic data exhibiting non-Euclidean behavior, according to their discrete curvature.

1. Introduction

Graph Convolutional Networks. The success of convolutional networks and deep learning for image data has inspired generalizations for graphs, for which parameter sharing is consistent with the graph geometry. Bruna et al. (2014) and Henaff et al. (2015) pioneered spectral graph convolutional neural networks in the graph Fourier space, using localized spectral filters on graphs. However, in order to remove the dependency on the eigenmodes of the graph Laplacian, Defferrard et al. (2016) approximate the convolutional filters using Chebyshev polynomials, leveraging a result of Hammond et al. (2011). The resulting method (discussed in Appendix A) is computationally efficient and superior in terms of accuracy and complexity. Further, Kipf & Welling (2017) simplify this approach by considering first-order approximations, obtaining high scalability. The resulting graph convolutional network (GCN) interpolates node embeddings via a symmetrically normalized adjacency matrix, and this weight sharing can be understood as an efficient diffusion-like regularizer. Recent works extend GCNs to achieve state-of-the-art results for link prediction (Zhang & Chen, 2018), graph classification (Hamilton et al., 2017; Xu et al., 2018) and node classification (Klicpera et al., 2019; Veličković et al., 2018).

Euclidean geometry in ML. In machine learning (ML), data is most often represented in a Euclidean space, for various reasons. First, some data is intrinsically Euclidean, such as positions in 3D space in classical mechanics. Second, intuition is easier in such spaces, as they possess an appealing vectorial structure allowing basic arithmetic and a rich theory of linear algebra. Finally, many quantities of interest, such as distances and inner products, are known in closed form and can be computed very efficiently on existing hardware. These operations are the basic building blocks of most of today's popular machine learning models.
Thus, the powerful simplicity and efficiency of Euclidean geometry has led to numerous methods achieving state-of-the-art results on tasks as diverse as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), speech recognition (Graves et al., 2013), image classification (He et al., 2016) and recommender systems (He et al., 2017).

Riemannian ML. In spite of this success, certain types of data (e.g. hierarchical, scale-free or spherical data) have been shown to be better represented by non-Euclidean geometries (Defferrard et al., 2019; Bronstein et al., 2017; Nickel & Kiela, 2017; Gu et al., 2019), leading in particular to the rich theories of manifold learning (Roweis & Saul, 2000; Tenenbaum et al., 2000) and information geometry (Amari & Nagaoka, 2007). The mathematical framework in vigor for manipulating non-Euclidean geometries is Riemannian geometry (Spivak, 1979). Although its theory leads to many strong and elegant results, some of its basic quantities, such as the distance function d(·, ·), are in general not available in closed form, which can be prohibitive for many computational methods.

Figure 1. Euclidean embeddings of trees of different depths. All four innermost circles are identical. Ideal node embeddings should match the graph metric in distance, e.g. the distance between the pink and green nodes should equal their shortest path length. Notice how we quickly run out of space, e.g. the pink and green nodes get closer instead of farther. This issue is resolved when embedding trees in hyperbolic spaces.

Representational Advantages of Geometries of Constant Curvature. An interesting trade-off between general Riemannian manifolds and the Euclidean space is given by manifolds of constant sectional curvature. Together, they define what are called hyperbolic (negative curvature), elliptic (positive curvature) and Euclidean (zero curvature) geometries.
As discussed below and in Appendix B, Euclidean spaces have limitations and suffer from large distortion when embedding certain types of data, such as trees. In those cases, hyperbolic and spherical spaces have representational advantages, providing a better inductive bias for the respective data. The hyperbolic space can be intuitively understood as a continuous tree: the volume of a ball grows exponentially with its radius, just as the number of nodes in a binary tree grows exponentially with its depth (see fig. 1). Its tree-likeness has long been studied mathematically (Gromov, 1987; Hamann, 2017; Ungar, 2008), and it was proven to embed complex networks (Krioukov et al., 2010), scale-free graphs and hierarchical data better than Euclidean geometry does (Cho et al., 2019; Sala et al., 2018; Ganea et al., 2018b; Gu et al., 2019; Nickel & Kiela, 2018; 2017; Tifrea et al., 2019). Several important tools and methods have found hyperbolic counterparts, such as variational autoencoders (Mathieu et al., 2019; Ovinnikov, 2019), attention mechanisms (Gulcehre et al., 2018), matrix multiplications, recurrent units and multinomial logistic regression (Ganea et al., 2018a). Similarly, spherical geometry provides benefits for modeling spherical or cyclical data (Defferrard et al., 2019; Matousek, 2013; Davidson et al., 2018; Xu & Durrett, 2018; Gu et al., 2019; Grattarola et al., 2018; Wilson et al., 2014).

Computational Efficiency of Constant Curvature Spaces (CCS). CCS are among the few Riemannian manifolds that possess closed-form formulae for the geometric quantities of interest in computational methods, i.e. distance, geodesics, exponential map, parallel transport and their gradients. We also leverage here the closed-form expressions for weighted centroids.

Linear Algebra of CCS: Gyrovector Spaces.
In order to study the geometry of constant negative curvature in analogy with Euclidean geometry, Ungar (1999; 2005; 2008; 2016) proposed the elegant non-associative algebraic formalism of gyrovector spaces. Recently, Ganea et al. (2018a) linked this framework to the Riemannian geometry of the space, also generalizing the building blocks of non-Euclidean deep learning models operating with hyperbolic data representations. However, it remained unclear how to extend, in a principled manner, this connection between Riemannian geometry and gyrovector space operations to spaces of constant positive curvature (spherical). By leveraging Euler's formula and complex analysis, we present, to our knowledge, the first unified gyrovector framework that allows for a differentiable interpolation between geometries of constant curvature irrespective of their signs. This is possible when working with the Poincaré ball and the stereographic spherical projection as models of hyperbolic and spherical space, respectively.

GCNs in Constant Curvature Spaces. In this work, we introduce an extension of graph convolutional networks that allows learning representations residing in (products of) constant curvature spaces with any curvature sign. We achieve this by combining the derived unified gyrovector framework with the effectiveness of GCNs (Kipf & Welling, 2017). Concurrent to our work, Chami et al. (2019) and Liu et al. (2019) consider graph neural networks that learn embeddings in hyperbolic space via tangent space aggregation. Their approach is analyzed more closely in section 3.4. Our model is more general, as it produces representations in a strict super-set containing the hyperbolic space.

Figure 2. Geodesics in the Poincaré disk (left) and the stereographic projection of the sphere (right).

2. The Geometry of Constant Curvature Spaces

Riemannian Geometry.
A manifold M of dimension d is a generalization to higher dimensions of the notion of a surface: a space that locally looks like $\mathbb{R}^d$. At each point $x \in M$, one can associate a tangent space $T_x M$, a vector space of dimension d that can be understood as a first-order approximation of M around x. A Riemannian metric g is given by an inner product $g_x(\cdot, \cdot)$ on each tangent space $T_x M$, varying smoothly with x. A given g defines the geometry of M, because it can be used to define the distance between x and y as the infimum of the lengths of smooth paths $\gamma : [0, 1] \to M$ from x to y, where the length is defined as $\ell(\gamma) := \int_0^1 \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\,\mathrm{d}t$. Under certain assumptions, a given g also defines a curvature at each point.

Unifying all curvatures κ. There exist several models of constant positive and constant negative curvature. For positive curvature, we choose the stereographic projection of the sphere, while for negative curvature we choose the Poincaré model, which is the stereographic projection of the Lorentz model. As explained below, this choice allows us to generalize the gyrovector space framework and unify spaces of both positive and negative curvature κ into a single model, which we call the κ-stereographic model.

The κ-stereographic model. For a curvature $\kappa \in \mathbb{R}$ and a dimension $d \geq 2$, we study the model $\mathrm{st}^d_\kappa$ defined as $\mathrm{st}^d_\kappa = \{x \in \mathbb{R}^d \mid -\kappa\|x\|_2^2 < 1\}$, equipped with its Riemannian metric $g^\kappa_x = \frac{4}{(1+\kappa\|x\|^2)^2}\,\mathbf{I} =: (\lambda^\kappa_x)^2\,\mathbf{I}$. Note in particular that when $\kappa \geq 0$, $\mathrm{st}^d_\kappa$ is all of $\mathbb{R}^d$, while when $\kappa < 0$ it is the open ball of radius $1/\sqrt{-\kappa}$.

Gyrovector spaces & Riemannian geometry. As discussed in section 1, the gyrovector space formalism is used to generalize vector spaces to the Poincaré model of hyperbolic geometry (Ungar, 2005; 2008). In addition, important quantities from Riemannian geometry can be rewritten in terms of the Möbius vector addition and scalar-vector multiplication (Ganea et al., 2018a).
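As a small numeric sketch (with a hypothetical helper name, not from the paper), the conformal factor $\lambda^\kappa_x$ can be implemented directly from its definition; at κ = 0 it is the constant 2, so the metric $(\lambda^\kappa_x)^2\,\mathbf{I}$ becomes the rescaled Euclidean metric $4\mathbf{I}$, and for κ < 0 it diverges at the boundary of the ball of radius $1/\sqrt{-\kappa}$:

```python
import math

def lam(x_norm_sq, kappa):
    """Conformal factor λ^κ_x = 2 / (1 + κ‖x‖²) of the κ-stereographic metric."""
    return 2.0 / (1.0 + kappa * x_norm_sq)

# κ = 0: the metric (λ^κ_x)²·I is the constant 4·I, a rescaled Euclidean metric.
assert lam(0.7, 0.0) == 2.0

# κ < 0: the factor blows up as ‖x‖ approaches the ball radius 1/√(−κ).
kappa = -1.0
boundary = 1.0 / math.sqrt(-kappa)
print(lam((0.999 * boundary) ** 2, kappa))  # very large value near the boundary
```

This matches the limit behaviour used throughout the paper: all κ-quantities reduce to (rescaled) Euclidean ones as κ → 0.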
We here extend gyrovector spaces to the κ-stereographic model, i.e. we also allow positive curvature. For $\kappa > 0$ and any point $x \in \mathrm{st}^d_\kappa$, we will denote by $\tilde{x}$ the unique point of the sphere of radius $\kappa^{-\frac12}$ in $\mathbb{R}^{d+1}$ whose stereographic projection is x. As detailed in Appendix C, it is given by

$$\tilde{x} := \left(\lambda^\kappa_x x,\; \kappa^{-\frac12}(\lambda^\kappa_x - 1)\right). \quad (1)$$

For $x, y \in \mathrm{st}^d_\kappa$, we define the κ-addition in the κ-stereographic model by:

$$x \oplus_\kappa y = \frac{(1 - 2\kappa x^T y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2} \in \mathrm{st}^d_\kappa. \quad (2)$$

The κ-addition is defined in all cases except, in spherical geometry, for $x = -y/(\kappa\|y\|^2)$, as stated by the following theorem, proved in Appendix C.2.1.

Theorem 1 (Definiteness of κ-addition). We have $1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2 = 0$ if and only if $\kappa > 0$ and $x = -y/(\kappa\|y\|^2)$.

For $s \in \mathbb{R}$ and $x \in \mathrm{st}^d_\kappa$ (with $|s \tan_\kappa^{-1}\|x\|| < \kappa^{-\frac12}\pi/2$ if $\kappa > 0$), the κ-scaling in the κ-stereographic model is given by:

$$s \otimes_\kappa x = \tan_\kappa\!\left(s \tan_\kappa^{-1}\|x\|\right)\frac{x}{\|x\|} \in \mathrm{st}^d_\kappa, \quad (3)$$

where $\tan_\kappa$ equals $\kappa^{-1/2}\tan$ if $\kappa > 0$ and $(-\kappa)^{-1/2}\tanh$ if $\kappa < 0$. This formalism yields simple closed forms for various quantities, including the distance function (see fig. 3) inherited from the Riemannian manifold $(\mathrm{st}^d_\kappa, g^\kappa)$, the exp and log maps, and geodesics (see fig. 2), as shown by the following theorem.

Theorem 2 (Extending gyrovector spaces to positive curvature). For $x, y \in \mathrm{st}^d_\kappa$, $x \neq y$, $v \neq 0$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$), the distance function is given by(a):

$$d_\kappa(x, y) = 2|\kappa|^{-1/2}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|, \quad (4)$$

the unit-speed geodesic from x to y is unique and given by

$$\gamma_{x\to y}(t) = x \oplus_\kappa \left(t \otimes_\kappa ({-x} \oplus_\kappa y)\right), \quad (5)$$

and finally the exponential and logarithmic maps are described as:

$$\exp^\kappa_x(v) = x \oplus_\kappa \left(\tan_\kappa\!\left(|\kappa|^{\frac12}\frac{\lambda^\kappa_x \|v\|}{2}\right)\frac{v}{\|v\|}\right), \quad (6)$$

$$\log^\kappa_x(y) = \frac{2|\kappa|^{-\frac12}}{\lambda^\kappa_x}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|\,\frac{-x \oplus_\kappa y}{\|{-x} \oplus_\kappa y\|}. \quad (7)$$

(a) We write $-x \oplus_\kappa y$ for $(-x) \oplus_\kappa y$ and not $-(x \oplus_\kappa y)$.

Figure 3. Heatmap of the distance function $d_\kappa(x, \cdot)$ in $\mathrm{st}^2_\kappa$ for κ = −0.254 (left) and κ = 0.248 (right).

Proof sketch: The case κ ≤ 0 was already taken care of by Ganea et al. (2018a).
For κ > 0, we provide a detailed proof in Appendix C.2.2. The exponential map and unit-speed geodesics are obtained using the Theorema Egregium and the known formulas in the standard spherical model. The distance then follows from the formula $d_\kappa(x, y) = \|\log^\kappa_x(y)\|_x$, which holds in any Riemannian manifold.

Around κ = 0. One notably observes that choosing κ = 0 yields all the corresponding Euclidean quantities, which guarantees a continuous interpolation between κ-stereographic models of different curvatures, via Euler's formula $\tan(x) = -i\tanh(ix)$, where $i := \sqrt{-1}$. But is this interpolation differentiable with respect to κ? It is, as shown by the following theorem, proved in Appendix C.2.3.

Theorem 3 (Differentiability of $\mathrm{st}^d_\kappa$ w.r.t. κ around 0). Let $v \neq 0$ and $x, y \in \mathbb{R}^d$ such that $x \neq y$ (and $x \neq -y/(\kappa\|y\|^2)$ if κ > 0). The quantities in Eqs. (4, 5, 6, 7) are well-defined for $|\kappa| < 1/\min(\|x\|^2, \|y\|^2)$, i.e. for κ small enough. Their first-order derivatives at 0⁻ and 0⁺ exist and are equal. Moreover, for the distance we have, up to quadratic terms in κ:

$$d_\kappa(x, y) = 2\|x - y\| - 2\kappa\left(\frac{\|x - y\|^3}{3} + (x^T y)\|x - y\|\right) + O(\kappa^2). \quad (8)$$

Note that for $x^T y \geq 0$, this tells us that an infinitesimal change of curvature from zero to small negative, i.e. towards 0⁻, while keeping x, y fixed, has the effect of increasing their distance. As a consequence, we have a unified formalism that allows for a differentiable interpolation between all three geometries of constant curvature.

3. κ-GCNs

We start by introducing the methods upon which we build. We present our models for spaces of constant sectional curvature in the κ-stereographic model; however, the generalization to cartesian products of such spaces (Gu et al., 2019) follows naturally from these tools.

3.1. Graph Convolutional Networks

The problem of node classification on a graph has long been tackled with explicit regularization using the graph Laplacian (Weston et al., 2012).
Namely, for a directed graph with adjacency matrix A, one adds the following term to the loss: $\sum_{i,j} A_{ij}\|f(x_i) - f(x_j)\|^2 = f(X)^T L f(X)$, where $L = D - A$ is the (unnormalized) graph Laplacian, $D_{ii} := \sum_k A_{ik}$ defines the (diagonal) degree matrix, f contains the trainable parameters of the model and $X = (x^j_i)_{ij}$ holds the node features of the model. Such a regularization is expected to improve generalization if connected nodes in the graph tend to share labels; node i with feature vector $x_i$ is represented as $f(x_i)$ in a Euclidean space.

With the aim of obtaining more scalable models, Defferrard et al. (2016) and Kipf & Welling (2017) propose to make this regularization implicit by incorporating it into what they call graph convolutional networks (GCN), which they motivate as a first-order approximation of spectral graph convolutions, yielding the following scalable layer architecture (detailed in Appendix A):

$$H^{(t+1)} = \sigma\left(\tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12} H^{(t)} W^{(t)}\right), \quad (9)$$

where $\tilde{A} = A + I$ has added self-connections, $\tilde{D}_{ii} = \sum_k \tilde{A}_{ik}$ defines its diagonal degree matrix, σ is a non-linearity such as sigmoid, tanh or $\mathrm{ReLU} = \max(0, \cdot)$, and $W^{(t)}$ and $H^{(t)}$ are the parameter and activation matrices of layer t respectively, with $H^{(0)} = X$ the input feature matrix.

Figure 4. Left: Spherical gyromidpoint of four points. Right: Möbius gyromidpoint in the Poincaré model, as defined by Ungar (2008) and, alternatively, here in eq. (12).

3.2. Tools for a κ-GCN

Learning a parametrized function $f_\theta$ that respects hyperbolic geometry has been studied in Ganea et al. (2018a): neural layers and hyperbolic softmax. We generalize their definitions to the κ-stereographic model, unifying operations in positive and negative curvature. We explain how curvature introduces a fundamental difference between left and right matrix multiplications, depicting the Möbius matrix multiplication of Ganea et al. (2018a) as a right multiplication, independent for each embedding.
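For reference, the Euclidean GCN layer of eq. (9) can be sketched in a few lines of NumPy (a minimal dense-matrix sketch with hypothetical names, no sparsity; ReLU is chosen as σ):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One Euclidean GCN layer: σ(D̃^{-1/2} Ã D̃^{-1/2} H W), with σ = ReLU."""
    A_tilde = A + np.eye(A.shape[0])            # Ã = A + I: add self-connections
    d = A_tilde.sum(axis=1)                     # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D̃^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)       # ReLU non-linearity

# Tiny 3-node path graph, 2-d features, 2 output channels:
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3)[:, :2]
W = np.eye(2)
print(gcn_layer(A, H, W).shape)  # (3, 2)
```

The κ-GCN of section 3.5 replaces the two matrix products in this layer by their κ-left- and κ-right-multiplication counterparts.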
We then introduce a left multiplication by an extension of gyromidpoints, which ties the embeddings together; this is essential for graph neural networks.

3.3. κ-Right-Matrix-Multiplication

Let $X \in \mathbb{R}^{n\times d}$ denote a matrix whose n rows are d-dimensional embeddings in $\mathrm{st}^d_\kappa$, and let $W \in \mathbb{R}^{d\times e}$ denote a weight matrix. Let us first understand what a right matrix multiplication is in Euclidean space: the Euclidean right multiplication can be written row-wise as $(XW)_{i\bullet} = X_{i\bullet}W$. Hence each d-dimensional Euclidean embedding is modified independently by a right matrix multiplication. A natural adaptation of this operation to the κ-stereographic model yields the following definition.

Definition 1. Given a matrix $X \in \mathbb{R}^{n\times d}$ holding κ-stereographic embeddings in its rows and weights $W \in \mathbb{R}^{d\times e}$, the κ-right-matrix-multiplication is defined row-wise as

$$(X \otimes_\kappa W)_{i\bullet} = \exp^\kappa_0\left((\log^\kappa_0(X)W)_{i\bullet}\right) = \tan_\kappa\left(\alpha_i \tan_\kappa^{-1}\|X_{i\bullet}\|\right)\frac{(XW)_{i\bullet}}{\|(XW)_{i\bullet}\|}, \quad (10)$$

where $\alpha_i = \frac{\|(XW)_{i\bullet}\|}{\|X_{i\bullet}\|}$, and $\exp^\kappa_0$ and $\log^\kappa_0$ denote the exponential and logarithmic maps in the κ-stereographic model.

Figure 5. Weighted Euclidean midpoint αx + βy.

This definition is in perfect agreement with the hyperbolic scalar multiplication for κ < 0, which can also be written as $r \otimes_\kappa x = \exp^\kappa_0(r \log^\kappa_0(x))$. This operation is known to have desirable properties such as associativity (Ganea et al., 2018a).

3.4. κ-Left-Matrix-Multiplication as a Midpoint Extension

For graph neural networks we also need the notion of message passing among neighboring nodes, i.e. an operation that combines / aggregates the respective embeddings. In Euclidean space such an operation is given by left multiplication of the embedding matrix with the (preprocessed) adjacency $\hat{A}$: $H^{(l+1)} = \sigma(\hat{A}Z^{(l)})$, where $Z^{(l)} = H^{(l)}W^{(l)}$. Let us consider this left multiplication.
For $A \in \mathbb{R}^{n\times n}$, the matrix product is given row-wise by $(AX)_{i\bullet} = A_{i1}X_{1\bullet} + \cdots + A_{in}X_{n\bullet}$. This means that the new representation of node i is obtained as the linear combination of all the other node embeddings, weighted by the i-th row of A. An adaptation to the κ-stereographic model hence requires a notion of weighted linear combination.

We propose such an operation in $\mathrm{st}^d_\kappa$ by performing a κ-scaling of a gyromidpoint, whose definition is recalled below. Indeed, in Euclidean space the weighted linear combination αx + βy can be re-written as $(\alpha + \beta)\,m_E(x, y; \alpha, \beta)$ with Euclidean midpoint $m_E(x, y; \alpha, \beta) := \frac{\alpha}{\alpha+\beta}x + \frac{\beta}{\alpha+\beta}y$. See fig. 5 for a geometric illustration. This motivates generalizing the above operation to $\mathrm{st}^d_\kappa$ as follows.

Definition 2. Given a matrix $X \in \mathbb{R}^{n\times d}$ holding κ-stereographic embeddings in its rows and weights $A \in \mathbb{R}^{n\times n}$, the κ-left-matrix-multiplication is defined row-wise as

$$(A \boxtimes_\kappa X)_{i\bullet} := \Big(\sum_j A_{ij}\Big) \otimes_\kappa m_\kappa(X_{1\bullet}, \dots, X_{n\bullet}; A_{i\bullet}). \quad (11)$$

The κ-scaling is motivated by the fact that $d_\kappa(0, r \otimes_\kappa x) = |r|\,d_\kappa(0, x)$ for all $r \in \mathbb{R}$, $x \in \mathrm{st}^d_\kappa$. We recall that the gyromidpoint is defined, when κ ≤ 0, in the κ-stereographic model as (Ungar, 2010):

$$m_\kappa(x_1, \dots, x_n; \alpha) = \frac12 \otimes_\kappa \left(\sum_{i=1}^n \frac{\alpha_i \lambda^\kappa_{x_i}}{\sum_{j=1}^n \alpha_j(\lambda^\kappa_{x_j} - 1)}\,x_i\right), \quad (12)$$

with $\lambda^\kappa_x = 2/(1 + \kappa\|x\|^2)$. Whenever κ > 0, we have to further require the following condition:

$$\sum_j \alpha_j(\lambda^\kappa_{x_j} - 1) \neq 0. \quad (13)$$

For two points, one can calculate that $(\lambda^\kappa_x - 1) + (\lambda^\kappa_y - 1) = 0$ is equivalent to $\kappa\|x\|\|y\| = 1$, which holds in particular whenever $x = -y/(\kappa\|y\|^2)$. See fig. 4 for illustrations of gyromidpoints.

Our operation $\boxtimes_\kappa$ satisfies interesting properties, proved in Appendix C.2.4:

Theorem 4 (Neutral element & κ-scalar-associativity). We have $I_n \boxtimes_\kappa X = X$, and for $r \in \mathbb{R}$, $r \otimes_\kappa (A \boxtimes_\kappa X) = (rA) \boxtimes_\kappa X$.

Figure 6. Heatmap of the distance from a $\mathrm{st}^2_\kappa$-hyperplane to $x \in \mathrm{st}^2_\kappa$ for κ = −0.254 (left) and κ = 0.248 (right).

The matrix A. In most GNNs, the matrix A is intended to be a preprocessed adjacency matrix, i.e. renormalized by the diagonal degree matrix $D_{ii} = \sum_k A_{ik}$.
This normalization is often taken either (i) to the left: $D^{-1}A$, (ii) symmetric: $D^{-\frac12}AD^{-\frac12}$, or (iii) to the right: $AD^{-1}$. Note that the latter case makes the matrix right-stochastic¹, a property that is preserved by matrix products and exponentiation. For this case, we prove the following result in Appendix C.2.5.

¹M is right-stochastic if for all i, $\sum_j M_{ij} = 1$.

Theorem 5 (κ-left-multiplication by right-stochastic matrices is intrinsic). If A, B are right-stochastic, φ is an isometry of $\mathrm{st}^d_\kappa$, and X, Y are two matrices holding κ-stereographic embeddings, then for all i:

$$d_\kappa\left((A \boxtimes_\kappa \varphi(X))_{i\bullet}, (B \boxtimes_\kappa \varphi(Y))_{i\bullet}\right) = d_\kappa\left((A \boxtimes_\kappa X)_{i\bullet}, (B \boxtimes_\kappa Y)_{i\bullet}\right). \quad (14)$$

The above result means that A can easily be preprocessed so as to make its κ-left-multiplication intrinsic to the metric space $(\mathrm{st}^d_\kappa, d_\kappa)$. At this point, one could wonder: do there exist other ways to take weighted centroids on a Riemannian manifold? We comment on two plausible alternatives.

Fréchet/Karcher means. They are obtained as $\arg\min_x \sum_i \alpha_i d_\kappa(x, x_i)^2$. Note that although they are also intrinsic, they usually require solving an optimization problem, which can be prohibitively expensive, especially when one requires gradients to flow through the solution; moreover, for the space $\mathrm{st}^d_\kappa$, it is known that the minimizer is unique if and only if κ ≤ 0.

Tangential aggregations. The linear combination is here lifted to the tangent space by means of the exponential and logarithmic maps; this was in particular used in the recent works of Chami et al. (2019) and Liu et al. (2019).

Definition 3. The tangential aggregation of $x_1, \dots, x_n \in \mathrm{st}^d_\kappa$ w.r.t. weights $\{\alpha_i\}_{1\le i\le n}$, at point $x \in \mathrm{st}^d_\kappa$ (for $x_i \neq -x/(\kappa\|x\|^2)$ if κ > 0), is defined by:

$$\mathrm{tg}^\kappa_x(x_1, \dots, x_n; \alpha_1, \dots, \alpha_n) := \exp^\kappa_x\left(\sum_{i=1}^n \alpha_i \log^\kappa_x(x_i)\right). \quad (15)$$

The theorem below states that, for the κ-stereographic model, this operation is also intrinsic. We prove it in Appendix C.2.6.

Theorem 6 (Tangential aggregation is intrinsic).
For any isometry φ of $\mathrm{st}^d_\kappa$, we have

$$\mathrm{tg}_{\varphi(x)}(\{\varphi(x_i)\}; \{\alpha_i\}) = \varphi\left(\mathrm{tg}_x(\{x_i\}; \{\alpha_i\})\right). \quad (16)$$

3.5. Logits

Finally, we need the logit and softmax layer, a necessity for any classification task. We here use the model of Ganea et al. (2018a), which was obtained in a principled manner for the case of negative curvature. Their derivation rests upon the closed-form formula for the distance to a hyperbolic hyperplane. We naturally extend this formula to $\mathrm{st}^d_\kappa$, hence also allowing κ > 0, but leave the adaptation of their theoretical analysis for future work:

$$p(y = k \mid x) = S\left(\frac{\|a_k\|}{\sqrt{|\kappa|}}\,\sin_\kappa^{-1}\left(\frac{2\sqrt{|\kappa|}\,\langle z_k, a_k\rangle}{(1 + \kappa\|z_k\|^2)\|a_k\|}\right)\right), \quad (17)$$

where $a_k \in T_0\mathrm{st}^d_\kappa = \mathbb{R}^d$ and $p_k \in \mathrm{st}^d_\kappa$ are trainable parameters, $x \in \mathrm{st}^d_\kappa$ is the input, $z_k = -p_k \oplus_\kappa x$, $\sin_\kappa$ is defined analogously to $\tan_\kappa$, and $S(\cdot)$ is the softmax function. We refer the reader to Appendix D for further details and to fig. 6 for an illustration of eq. (17).

We are now ready to introduce our κ-stereographic GCN (Kipf & Welling, 2017), denoted κ-GCN². Assume we are given a graph with node-level features $G = (V, A, X)$, where $X \in \mathbb{R}^{n\times d}$ with each row $X_{i\bullet} \in \mathrm{st}^d_\kappa$, and adjacency $A \in \mathbb{R}^{n\times n}$. We first perform a preprocessing step, mapping the Euclidean features to $\mathrm{st}^d_\kappa$ via the projection $X \mapsto X/(2\sqrt{|\kappa|}\,\|X\|_{\max})$, where $\|X\|_{\max}$ denotes the maximal Euclidean norm among all stereographic embeddings in X. For $l \in \{0, \dots, L-2\}$, the (l+1)-th layer of κ-GCN is given by:

$$H^{(l+1)} = \sigma^{\otimes_\kappa}\left(\hat{A} \boxtimes_\kappa \left(H^{(l)} \otimes_\kappa W^{(l)}\right)\right), \quad (18)$$

where $H^{(0)} = X$, $\sigma^{\otimes_\kappa}(x) := \exp^\kappa_0(\sigma(\log^\kappa_0(x)))$ is the Möbius version (Ganea et al., 2018a) of a pointwise non-linearity σ, and $\hat{A} = \tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12}$. The final layer is a κ-logit layer (Appendix D):

$$H^{(L)} = \mathrm{softmax}\left(\hat{A}\,\mathrm{logit}_\kappa\left(H^{(L-1)}, W^{(L-1)}\right)\right), \quad (19)$$

where $W^{(L-1)}$ contains the parameters $a_k$ and $p_k$ of the κ-logits layer. A very important property of κ-GCN is that its architecture recovers the Euclidean GCN when we let the curvature go to zero:

²To be pronounced "kappa-GCN"; the greek letter κ is commonly used to denote sectional curvature.

Table 1. Minimum achieved average distortion of the different models.
H and S denote hyperbolic and spherical models respectively.

| MODEL | TREE | TOROIDAL | SPHERICAL |
|---|---|---|---|
| E10 (linear) | 0.045 | 0.0607 | 0.0415 |
| E10 (ReLU) | 0.0502 | 0.0603 | 0.0409 |
| H10 (κ-GCN) | 0.0029 | 0.272 | 0.267 |
| S10 (κ-GCN) | 0.473 | 0.0485 | 0.0337 |
| H5 × H5 (κ-GCN) | 0.0048 | 0.112 | 0.152 |
| S5 × S5 (κ-GCN) | 0.51 | 0.0464 | 0.0359 |
| (H2)⁴ (κ-GCN) | 0.025 | 0.084 | 0.062 |
| (S2)⁴ (κ-GCN) | 0.312 | 0.0481 | 0.0378 |

$$\kappa\text{-GCN} \xrightarrow{\;\kappa \to 0\;} \text{GCN}.$$

4. Experiments

We evaluate the architectures introduced in the previous sections on the tasks of node classification and minimizing embedding distortion, for several synthetic as well as real datasets. We defer the training setup and model architecture choices to Appendix F.

Minimizing Distortion. Our first goal is to evaluate the graph embeddings learned by our GCN models on the representation task of fitting the graph metric in the embedding space. We seek to minimize the average distortion, defined similarly as in Gu et al. (2019): $\frac{1}{n^2}\sum_{i,j}\left(\left(\frac{d(x_i, x_j)}{d_G(i,j)}\right)^2 - 1\right)^2$, where $d(x_i, x_j)$ is the distance between the embeddings of nodes i and j, while $d_G(i, j)$ is their graph distance (shortest path length).

We create three synthetic datasets that best reflect the different geometries of interest: i) Tree: a balanced tree of depth 5 and branching factor 4, consisting of 1365 nodes and 1364 edges. ii) Torus: we sample points (nodes) from the (planar) torus, i.e. from the unit connected square; two nodes are connected by an edge if their toroidal distance (the warped distance) is smaller than a fixed R = 0.01; this gives 1000 nodes and 30626 edges. iii) Spherical Graph: we sample points (nodes) from S², connecting nodes if their distance is smaller than 0.2, leading to 1000 nodes and 17640 edges.

For the GCN models, we use 1-hot initial node features. We use two GCN layers with dimensions 16 and 10. The non-Euclidean models do not use additional non-linearities between layers. All Euclidean parameters are updated using the ADAM optimizer with learning rate 0.01.
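The average distortion objective above is straightforward to express; a minimal NumPy sketch (hypothetical helper name; dense precomputed matrices of pairwise embedding distances and graph distances are assumed):

```python
import numpy as np

def average_distortion(d_emb, d_graph):
    """Average of ((d(x_i, x_j) / d_G(i, j))^2 - 1)^2 over all pairs i != j
    (the diagonal is excluded, since d_G(i, i) = 0)."""
    n = d_emb.shape[0]
    mask = ~np.eye(n, dtype=bool)       # exclude i == j
    ratio = d_emb[mask] / d_graph[mask]
    return np.mean((ratio ** 2 - 1.0) ** 2)

# A perfect embedding (embedding distances equal graph distances) has zero distortion:
d_graph = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(average_distortion(d_graph, d_graph))  # 0.0
```

Scaling all embedding distances by a constant c != 1 yields the constant distortion (c² − 1)², which is why isometric (rather than merely conformal) fits score best.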
Curvatures are learned using gradient descent with a learning rate of 0.0001. All models are trained for 10000 epochs and we report the minimal achieved distortion.

Table 2. Node classification: average accuracy across 5 splits, with uncertainties estimated at the 95 percent confidence level via bootstrapping on our data splits. H and S denote hyperbolic and spherical models respectively.

| MODEL | CITESEER | CORA | PUBMED | AIRPORT |
|---|---|---|---|---|
| E16 (Kipf & Welling, 2017) | 0.729 ± 0.0054 | 0.814 ± 0.004 | 0.792 ± 0.0039 | 0.814 ± 0.0029 |
| H16 (Chami et al., 2019) | 0.71 ± 0.0049 | 0.803 ± 0.0046 | 0.798 ± 0.0043 | 0.844 ± 0.0041 |
| H16 (κ-GCN) | 0.732 ± 0.0051 | 0.812 ± 0.005 | 0.785 ± 0.0036 | 0.819 ± 0.0033 |
| S16 (κ-GCN) | 0.721 ± 0.0045 | 0.819 ± 0.0045 | 0.788 ± 0.0049 | 0.809 ± 0.0058 |
| Prod-GCN (κ-GCN) | 0.711 ± 0.0059 | 0.808 ± 0.0041 | 0.781 ± 0.006 | 0.817 ± 0.0044 |

Distortion results. The distortion scores shown in table 1 reveal the benefit of our models: the best performing architecture is the one that matches the underlying geometry of the graph.

Figure 7. Histogram of curvatures estimated from the deviation from the parallelogram law.

4.1. Node Classification

We consider the popular node classification datasets Citeseer (Sen et al., 2008), Cora-ML (McCallum et al., 2000) and Pubmed (Namata et al., 2012). Node labels correspond to the particular subfield the published document is associated with. Dataset statistics and splitting details are deferred to Appendix F due to lack of space. We compare against the Euclidean model (Kipf & Welling, 2017) and the recently proposed hyperbolic variant (Chami et al., 2019).

Curvature Estimation for the Datasets. To understand how far the real graphs of the above datasets are from Euclidean geometry, we first estimate the graph curvature of the four studied datasets using the deviation from the parallelogram law (Gu et al., 2019), as detailed in Appendix G. Curvature histograms are shown in fig. 7.
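A sketch of the parallelogram-law deviation used as a discrete curvature proxy (the exact normalization follows the Gu et al. (2019) estimator as we understand it, and the helper name is hypothetical; the paper's own variant is detailed in its Appendix G): for nodes b, c with midpoint m on a shortest b-c path, and a reference node a, the deviation vanishes in flat geometry and its sign suggests the curvature sign:

```python
def parallelogram_deviation(d_am, d_bc, d_ab, d_ac):
    """Deviation from the Euclidean parallelogram (median) law for a triangle
    a, b, c with m the midpoint of b and c. Zero in flat geometry; non-zero
    values indicate curvature."""
    return (d_am**2 + d_bc**2 / 4.0 - (d_ab**2 + d_ac**2) / 2.0) / (2.0 * d_am)

# Euclidean sanity check with exact (3-4-5) distances:
# a = (0, 3), b = (-4, 0), c = (4, 0), m = (0, 0):
print(parallelogram_deviation(d_am=3.0, d_bc=8.0, d_ab=5.0, d_ac=5.0))  # 0.0
```

On a graph, one would plug in shortest-path distances for sampled triples and histogram the resulting values, as in fig. 7.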
It can be noticed that the datasets are mostly non-Euclidean, thus offering good motivation to apply our constant-curvature GCN architectures.

Training Details. We trained the baseline models in the same setting as Chami et al. (2019). Namely, for GCN we use one hidden layer of size 16, dropout of rate 0.5 on the embeddings and the adjacency, as well as L2-regularization for the weights of the first layer. We used ReLU as the non-linear activation function.

For the non-Euclidean architectures, we used a combination of dropout and dropconnect, as reported in Chami et al. (2019), as well as L2-regularization for the first layer. All models have the same number of parameters and, for fairness, are compared in the same setting, without attention. We use one hidden layer of dimension 16. For the product models we consider two-component spaces (e.g. H8 × S8) and we split the embedding space into equal dimensions of size 8. We also distribute the input features equally among the components. Non-Euclidean models use the Möbius version of ReLU. Euclidean parameters use a learning rate of 0.01 for all models, using ADAM. The curvatures are learned using gradient descent with a learning rate of 0.01. We show the learnt values in Appendix F. We use early stopping: we first train for 2000 epochs, then check every 200 epochs for an improvement in the validation cross-entropy loss; if none is observed, we stop.

Node classification results. These are shown in table 2. Our models are competitive with the Euclidean GCN and outperform Chami et al. (2019) on Citeseer and Cora, showcasing the benefit of the proposed architecture.

5. Conclusion

In this paper, we introduced a natural extension of GCNs to the stereographic models of both positive and negative curvature in a unified manner.
We showed how this choice of models permits a differentiable interpolation between positive and negative curvature, allowing the curvature to be trained independently of an initial sign choice. We hope that our models will open new exciting directions for non-Euclidean graph neural networks.

6. Acknowledgements

We thank Prof. Thomas Hofmann, Andreas Bloch, Calin Cruceru and Ondrej Skopek for useful discussions, and the anonymous reviewers for their suggestions. Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems.

References

Abu-El-Haija, S., Kapoor, A., Perozzi, B., and Lee, J. N-GCN: Multi-scale graph convolution for semi-supervised node classification. International Workshop on Mining and Learning with Graphs (MLG), 2018.

Amari, S.-i. and Nagaoka, H. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations, 2015.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

Chami, I., Ying, R., Ré, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 2019.

Chen, J., Ma, T., and Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. ICLR, 2018.

Cho, H., DeMeo, B., Peng, J., and Berger, B. Large-margin classification in hyperbolic space. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1832-1840, 2019.

Davidson, T.
R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical Variational Auto-Encoders. Uncertainty in Artificial Intelligence (UAI), 856865, 2018. Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844 3852, 2016. Defferrard, M., Perraudin, N., Kacprzak, T., and Sgier, R. Deepsphere: towards an equivariant graph-based spherical cnn. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. URL https: //arxiv.org/abs/1904.05146. Deza, M. and Laurent, M. Geometry of Cuts and Metrics. Springer, Vol. 15, 1996. Ganea, O., B ecigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pp. 5345 5355, 2018a. Ganea, O.-E., Becigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, pp. 1632 1641, 2018b. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural Message Passing for Quantum Chemistry. Proceedings of the International Conference on Machine Learning, 2017. Grattarola, D., Zambon, D., Alippi, C., and Livi, L. Learning graph embeddings on constant-curvature manifolds for change detection in graph streams. stat, 1050:16, 2018. Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645 6649. IEEE, 2013. Gromov, M. Hyperbolic groups. In Essays in group theory, pp. 75 263. Springer, 1987. Gu, A., Sala, F., Gunel, B., and R e, C. Learning mixedcurvature representations in product spaces. Proceedings of the International Conference on Learning Representations, 2019. Gulcehre, C., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., et al. 
Hyperbolic attention networks. Proceedings of the International Conference on Learning Representations, 2018. Hamann, M. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 1 17, 2017. doi: 10.1017/ S0305004117000238. Hamann, M. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 117, 2017. Hamilton, W. L., Ying, R., and Leskovec, J. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems, 2017. Constant Curvature Graph Convolutional Networks Hammond, D. K., Vandergheynst, P., and Gribonval, R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129 150, 2011. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173 182. International World Wide Web Conferences Steering Committee, 2017. Henaff, M., Bruna, J., and Le Cun, Y. Deep convolutional networks on graph-structured data. ar Xiv preprint ar Xiv:1506.05163, 2015. Kingma, D. P. and Ba, J. ADAM: A method for stochastic optimization. ICLR, 2015. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017. Klicpera, J., Bojchevski, A., and G unnemann, S. Predict then propagate: graph neural networks meet personalized pagerank. International Conference on Learning Representations, 2019. Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Bogun a, M. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010. Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. 
Advances in Neural Information processing systems, 2019. Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with Poincar e variational auto-encoders. Advances in Neural Information Processing Systems, 2019. Matousek, J. Lecture notes on metric embeddings. 2013. Mc Callum, A., Nigam, K., Rennie, J., and Seymore, K. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127 163, 2000. Namata, G., London, B., Getoor, L., and Huang, B. Querydriven Active Surveying for Collective Classification. International Workshop on Mining and Learning with Graphs (MLG), 2012. Nickel, M. and Kiela, D. Poincar e embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341 6350, 2017. Nickel, M. and Kiela, D. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018. Ovinnikov, I. Poincar e Wasserstein autoencoder. ar Xiv preprint ar Xiv:1901.01427, 2019. Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500): 2323 2326, 2000. Sala, F., De Sa, C., Gu, A., and Re, C. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pp. 4457 4466, 2018. Sarkar, R. Low distortion delaunay embedding of trees in hyperbolic plane. International Symposium on Graph Drawing, pp. 355 366. Springer,, 2011. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective Classification in Network Data. AI Magazine, 29(3):93 106, 2008. Spivak, M. A comprehensive introduction to differential geometry. volume four. 1979. Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319 2323, 2000. Tifrea, A., B ecigneul, G., and Ganea, O.-E. 
Poincar e glove: Hyperbolic word embeddings. Proceedings of the International Conference on Learning Representations, 2019. Ungar, A. Barycentric Calculus in Euclidean and Hyperbolic Geometry. World Scientific, ISBN 9789814304931, 2010. Ungar, A. A. The hyperbolic pythagorean theorem in the Poincar e disc model of hyperbolic geometry. The American mathematical monthly, 106(8):759 763, 1999. Ungar, A. A. Analytic hyperbolic geometry: Mathematical foundations and applications. World Scientific, 2005. Ungar, A. A. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1 194, 2008. Ungar, A. A. Analytic Hyperbolic Geometry in N Dimensions: An Introduction. CRC Press, 2014. Ungar, A. A. Novel tools to determine hyperbolic triangle centers. In Essays in Mathematics and its Applications, pp. 563 663. Springer, 2016. Constant Curvature Graph Convolutional Networks Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2017. Veliˇckovi c, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations, 2018. Weston, J., Ratle, F., Mobahi, H., and Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639 655. Springer, 2012. Wilson, R. C., Hancock, E. R., Pekalska, E., and Duin, R. P. Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence, 36(11):2255 2269, 2014. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. ar Xiv preprint ar Xiv:1901.00596, 2019. Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 
4503 4513, 2018. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations, 2018. Zhang, M. and Chen, Y. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018.