# Constant Curvature Graph Convolutional Networks

Gregor Bachmann*¹, Gary Bécigneul*¹, Octavian-Eugen Ganea²

*Equal contribution. ¹Department of Computer Science, ETH Zürich. ²Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Correspondence to: Gregor Bachmann, Gary Bécigneul, Octavian-Eugen Ganea.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Interest has been rising lately towards methods representing data in non-Euclidean spaces, e.g. hyperbolic or spherical, that provide specific inductive biases useful for certain real-world data properties, e.g. scale-free, hierarchical or cyclical. However, popular graph neural networks are currently limited to modeling data only via Euclidean geometry and the associated vector space operations. Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. We do this by i) introducing a unified formalism permitting a differentiable interpolation between all geometries of constant curvature, and ii) leveraging gyrobarycentric coordinates that generalize the classic Euclidean concept of the center of mass. Our class of models smoothly recovers its Euclidean counterparts when the curvature goes to zero from either side. Empirically, we outperform Euclidean GCNs on the tasks of node classification and distortion minimization for symbolic data exhibiting non-Euclidean behavior, according to their discrete curvature.

1. Introduction

Graph Convolutional Networks. The success of convolutional networks and deep learning for image data has inspired generalizations for graphs, for which parameter sharing is consistent with the graph geometry. Bruna et al. (2014) and Henaff et al. (2015) pioneered spectral graph convolutional neural networks in the graph Fourier space, using localized spectral filters on graphs. However, in order to remove the dependency on the eigenmodes of the graph Laplacian, Defferrard et al. (2016) approximate the convolutional filters using Chebyshev polynomials, leveraging a result of Hammond et al. (2011). The resulting method (discussed in Appendix A) is computationally efficient and superior in terms of accuracy and complexity. Further, Kipf & Welling (2017) simplify this approach by considering first-order approximations, obtaining high scalability. The resulting graph convolutional network (GCN) interpolates node embeddings via a symmetrically normalized adjacency matrix, and this weight sharing can be understood as an efficient diffusion-like regularizer. Recent works extend GCNs to achieve state-of-the-art results for link prediction (Zhang & Chen, 2018), graph classification (Hamilton et al., 2017; Xu et al., 2018) and node classification (Klicpera et al., 2019; Veličković et al., 2018).

Euclidean geometry in ML. In machine learning (ML), data is most often represented in a Euclidean space, for various reasons. First, some data is intrinsically Euclidean, such as positions in 3D space in classical mechanics. Second, intuition is easier in such spaces, as they possess an appealing vectorial structure allowing basic arithmetic and a rich theory of linear algebra. Finally, many quantities of interest, such as distances and inner products, are known in closed form and can be computed very efficiently on existing hardware. These operations are the basic building blocks of most of today's popular machine learning models.
Thus, the powerful simplicity and efficiency of Euclidean geometry has led to numerous methods achieving state-of-the-art results on tasks as diverse as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), speech recognition (Graves et al., 2013), image classification (He et al., 2016) and recommender systems (He et al., 2017).

Riemannian ML. In spite of this success, certain types of data (e.g. hierarchical, scale-free or spherical data) have been shown to be better represented by non-Euclidean geometries (Defferrard et al., 2019; Bronstein et al., 2017; Nickel & Kiela, 2017; Gu et al., 2019), leading in particular to the rich theories of manifold learning (Roweis & Saul, 2000; Tenenbaum et al., 2000) and information geometry (Amari & Nagaoka, 2007). The mathematical framework in vigor for manipulating non-Euclidean geometries is Riemannian geometry (Spivak, 1979). Although its theory leads to many strong and elegant results, some of its basic quantities, such as the distance function d(·, ·), are in general not available in closed form, which can be prohibitive for many computational methods.

Figure 1. Euclidean embeddings of trees of different depths. All four innermost circles are identical. Ideal node embeddings should match the graph metric in distance, e.g. the distance between the pink and green nodes should equal their shortest path length. Notice how we quickly run out of space, e.g. the pink and green nodes get closer instead of farther. This issue is resolved when embedding trees in hyperbolic spaces.

Representational Advantages of Geometries of Constant Curvature. An interesting trade-off between general Riemannian manifolds and the Euclidean space is given by manifolds of constant sectional curvature. Together, they define what are called hyperbolic (negative curvature), elliptic (positive curvature) and Euclidean (zero curvature) geometries.
As discussed below and in Appendix B, Euclidean spaces have limitations and suffer from large distortion when embedding certain types of data, such as trees. In those cases, hyperbolic and spherical spaces have representational advantages, providing a better inductive bias for the respective data. The hyperbolic space can be intuitively understood as a continuous tree: the volume of a ball grows exponentially with its radius, just as the number of nodes in a binary tree grows exponentially with its depth (see fig. 1). Its tree-likeness has long been studied mathematically (Gromov, 1987; Hamann, 2017; Ungar, 2008), and it was proven to embed complex networks (Krioukov et al., 2010), scale-free graphs and hierarchical data better than Euclidean geometry does (Cho et al., 2019; Sala et al., 2018; Ganea et al., 2018b; Gu et al., 2019; Nickel & Kiela, 2018; 2017; Tifrea et al., 2019). Several important tools and methods have found hyperbolic counterparts, such as variational autoencoders (Mathieu et al., 2019; Ovinnikov, 2019), attention mechanisms (Gulcehre et al., 2018), matrix multiplications, recurrent units and multinomial logistic regression (Ganea et al., 2018a). Similarly, spherical geometry provides benefits for modeling spherical or cyclical data (Defferrard et al., 2019; Matousek, 2013; Davidson et al., 2018; Xu & Durrett, 2018; Gu et al., 2019; Grattarola et al., 2018; Wilson et al., 2014).

Computational Efficiency of Constant Curvature Spaces (CCS). CCS are among the few Riemannian manifolds that possess closed-form formulae for the geometric quantities of interest in computational methods, i.e. distance, geodesics, exponential map, parallel transport and their gradients. We also leverage here the closed-form expressions for weighted centroids.

Linear Algebra of CCS: Gyrovector Spaces.
In order to study the geometry of constant negative curvature in analogy with Euclidean geometry, Ungar (1999; 2005; 2008; 2016) proposed the elegant non-associative algebraic formalism of gyrovector spaces. Recently, Ganea et al. (2018a) linked this framework to the Riemannian geometry of the space, also generalizing the building blocks of non-Euclidean deep learning models operating with hyperbolic data representations. However, it remained unclear how to extend, in a principled manner, this connection between Riemannian geometry and gyrovector space operations to spaces of constant positive curvature (spherical). By leveraging Euler's formula and complex analysis, we present, to our knowledge, the first unified gyrovector framework that allows for a differentiable interpolation between geometries of constant curvature irrespective of their signs. This is possible when working with the Poincaré ball and the stereographic spherical projection as models of hyperbolic and spherical space, respectively.

GCNs in Constant Curvature Spaces. In this work, we introduce an extension of graph convolutional networks that allows learning representations residing in (products of) constant curvature spaces with any curvature sign. We achieve this by combining the derived unified gyrovector framework with the effectiveness of GCNs (Kipf & Welling, 2017). Concurrent to our work, Chami et al. (2019) and Liu et al. (2019) consider graph neural networks that learn embeddings in hyperbolic space via tangent space aggregation. Their approach is analyzed more closely in section 3.4. Our model is more general, as it produces representations in a strict super-set containing the hyperbolic space.

Figure 2. Geodesics in the Poincaré disk (left) and the stereographic projection of the sphere (right).

2. The Geometry of Constant Curvature Spaces

Riemannian Geometry.
A manifold M of dimension d is a generalization to higher dimensions of the notion of a surface: a space that locally looks like $\mathbb{R}^d$. At each point $x \in M$, one can associate a tangent space $T_x M$, a vector space of dimension d that can be understood as a first-order approximation of M around x. A Riemannian metric g is given by an inner product $g_x(\cdot, \cdot)$ on each tangent space $T_x M$, varying smoothly with x. A given g defines the geometry of M, because it can be used to define the distance between x and y as the infimum of the lengths of smooth paths $\gamma : [0, 1] \to M$ from x to y, where the length is defined as $\ell(\gamma) := \int_0^1 \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\,\mathrm{d}t$. Under certain assumptions, a given g also defines a curvature at each point.

Unifying all curvatures κ. There exist several models of constant positive and constant negative curvature. For positive curvature, we choose the stereographic projection of the sphere, while for negative curvature we choose the Poincaré model, which is the stereographic projection of the Lorentz model. As explained below, this choice allows us to generalize the gyrovector space framework and unify spaces of both positive and negative curvature κ into a single model, which we call the κ-stereographic model.

The κ-stereographic model. For a curvature $\kappa \in \mathbb{R}$ and a dimension $d \geq 2$, we study the model $\mathrm{st}^d_\kappa$ defined as $\mathrm{st}^d_\kappa = \{x \in \mathbb{R}^d \mid -\kappa\|x\|_2^2 < 1\}$, equipped with its Riemannian metric $g^\kappa_x = \frac{4}{(1+\kappa\|x\|^2)^2}\,\mathbf{I} =: (\lambda^\kappa_x)^2\,\mathbf{I}$. Note in particular that when $\kappa \geq 0$, $\mathrm{st}^d_\kappa$ is all of $\mathbb{R}^d$, while when $\kappa < 0$ it is the open ball of radius $1/\sqrt{-\kappa}$.

Gyrovector spaces & Riemannian geometry. As discussed in section 1, the gyrovector space formalism is used to generalize vector spaces to the Poincaré model of hyperbolic geometry (Ungar, 2005; 2008). In addition, important quantities from Riemannian geometry can be rewritten in terms of the Möbius vector addition and scalar-vector multiplication (Ganea et al., 2018a).
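As a small numeric sketch (with a hypothetical helper name, not from the paper), the conformal factor $\lambda^\kappa_x$ can be implemented directly from its definition; at κ = 0 it is the constant 2, so the metric $(\lambda^\kappa_x)^2\,\mathbf{I}$ becomes the rescaled Euclidean metric $4\mathbf{I}$, and for κ < 0 it diverges at the boundary of the ball of radius $1/\sqrt{-\kappa}$:

```python
import math

def lam(x_norm_sq, kappa):
    """Conformal factor λ^κ_x = 2 / (1 + κ‖x‖²) of the κ-stereographic metric."""
    return 2.0 / (1.0 + kappa * x_norm_sq)

# κ = 0: the metric (λ^κ_x)²·I is the constant 4·I, a rescaled Euclidean metric.
assert lam(0.7, 0.0) == 2.0

# κ < 0: the factor blows up as ‖x‖ approaches the ball radius 1/√(−κ).
kappa = -1.0
boundary = 1.0 / math.sqrt(-kappa)
print(lam((0.999 * boundary) ** 2, kappa))  # very large value near the boundary
```

This matches the limit behaviour used throughout the paper: all κ-quantities reduce to (rescaled) Euclidean ones as κ → 0.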
We here extend gyrovector spaces to the κ-stereographic model, i.e. we also allow positive curvature. For $\kappa > 0$ and any point $x \in \mathrm{st}^d_\kappa$, we will denote by $\tilde{x}$ the unique point of the sphere of radius $\kappa^{-\frac12}$ in $\mathbb{R}^{d+1}$ whose stereographic projection is x. As detailed in Appendix C, it is given by

$$\tilde{x} := \left(\lambda^\kappa_x x,\; \kappa^{-\frac12}(\lambda^\kappa_x - 1)\right). \quad (1)$$

For $x, y \in \mathrm{st}^d_\kappa$, we define the κ-addition in the κ-stereographic model by:

$$x \oplus_\kappa y = \frac{(1 - 2\kappa x^T y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2} \in \mathrm{st}^d_\kappa. \quad (2)$$

The κ-addition is defined in all cases except, in spherical geometry, for $x = -y/(\kappa\|y\|^2)$, as stated by the following theorem, proved in Appendix C.2.1.

Theorem 1 (Definiteness of κ-addition). We have $1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2 = 0$ if and only if $\kappa > 0$ and $x = -y/(\kappa\|y\|^2)$.

For $s \in \mathbb{R}$ and $x \in \mathrm{st}^d_\kappa$ (with $|s \tan_\kappa^{-1}\|x\|| < \kappa^{-\frac12}\pi/2$ if $\kappa > 0$), the κ-scaling in the κ-stereographic model is given by:

$$s \otimes_\kappa x = \tan_\kappa\!\left(s \tan_\kappa^{-1}\|x\|\right)\frac{x}{\|x\|} \in \mathrm{st}^d_\kappa, \quad (3)$$

where $\tan_\kappa$ equals $\kappa^{-1/2}\tan$ if $\kappa > 0$ and $(-\kappa)^{-1/2}\tanh$ if $\kappa < 0$. This formalism yields simple closed forms for various quantities, including the distance function (see fig. 3) inherited from the Riemannian manifold $(\mathrm{st}^d_\kappa, g^\kappa)$, the exp and log maps, and geodesics (see fig. 2), as shown by the following theorem.

Theorem 2 (Extending gyrovector spaces to positive curvature). For $x, y \in \mathrm{st}^d_\kappa$, $x \neq y$, $v \neq 0$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$), the distance function is given by(a):

$$d_\kappa(x, y) = 2|\kappa|^{-1/2}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|, \quad (4)$$

the unit-speed geodesic from x to y is unique and given by

$$\gamma_{x\to y}(t) = x \oplus_\kappa \left(t \otimes_\kappa ({-x} \oplus_\kappa y)\right), \quad (5)$$

and finally the exponential and logarithmic maps are described as:

$$\exp^\kappa_x(v) = x \oplus_\kappa \left(\tan_\kappa\!\left(|\kappa|^{\frac12}\frac{\lambda^\kappa_x \|v\|}{2}\right)\frac{v}{\|v\|}\right), \quad (6)$$

$$\log^\kappa_x(y) = \frac{2|\kappa|^{-\frac12}}{\lambda^\kappa_x}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|\,\frac{-x \oplus_\kappa y}{\|{-x} \oplus_\kappa y\|}. \quad (7)$$

(a) We write $-x \oplus_\kappa y$ for $(-x) \oplus_\kappa y$ and not $-(x \oplus_\kappa y)$.

Figure 3. Heatmap of the distance function $d_\kappa(x, \cdot)$ in $\mathrm{st}^2_\kappa$ for κ = −0.254 (left) and κ = 0.248 (right).

Proof sketch: The case κ ≤ 0 was already taken care of by Ganea et al. (2018a).
For κ > 0, we provide a detailed proof in Appendix C.2.2. The exponential map and unit-speed geodesics are obtained using the Theorema Egregium and the known formulas in the standard spherical model. The distance then follows from the formula $d_\kappa(x, y) = \|\log^\kappa_x(y)\|_x$, which holds in any Riemannian manifold.

Around κ = 0. One notably observes that choosing κ = 0 yields all the corresponding Euclidean quantities, which guarantees a continuous interpolation between κ-stereographic models of different curvatures, via Euler's formula $\tan(x) = -i\tanh(ix)$, where $i := \sqrt{-1}$. But is this interpolation differentiable with respect to κ? It is, as shown by the following theorem, proved in Appendix C.2.3.

Theorem 3 (Differentiability of $\mathrm{st}^d_\kappa$ w.r.t. κ around 0). Let $v \neq 0$ and $x, y \in \mathbb{R}^d$ such that $x \neq y$ (and $x \neq -y/(\kappa\|y\|^2)$ if κ > 0). The quantities in Eqs. (4, 5, 6, 7) are well-defined for $|\kappa| < 1/\min(\|x\|^2, \|y\|^2)$, i.e. for κ small enough. Their first-order derivatives at 0⁻ and 0⁺ exist and are equal. Moreover, for the distance we have, up to quadratic terms in κ:

$$d_\kappa(x, y) = 2\|x - y\| - 2\kappa\left(\frac{\|x - y\|^3}{3} + (x^T y)\|x - y\|\right) + O(\kappa^2). \quad (8)$$

Note that for $x^T y \geq 0$, this tells us that an infinitesimal change of curvature from zero to small negative, i.e. towards 0⁻, while keeping x, y fixed, has the effect of increasing their distance. As a consequence, we have a unified formalism that allows for a differentiable interpolation between all three geometries of constant curvature.

3. κ-GCNs

We start by introducing the methods upon which we build. We present our models for spaces of constant sectional curvature in the κ-stereographic model; however, the generalization to cartesian products of such spaces (Gu et al., 2019) follows naturally from these tools.

3.1. Graph Convolutional Networks

The problem of node classification on a graph has long been tackled with explicit regularization using the graph Laplacian (Weston et al., 2012).
Namely, for a directed graph with adjacency matrix A, one adds the following term to the loss: $\sum_{i,j} A_{ij}\|f(x_i) - f(x_j)\|^2 = f(X)^T L f(X)$, where $L = D - A$ is the (unnormalized) graph Laplacian, $D_{ii} := \sum_k A_{ik}$ defines the (diagonal) degree matrix, f contains the trainable parameters of the model and $X = (x^j_i)_{ij}$ holds the node features of the model. Such a regularization is expected to improve generalization if connected nodes in the graph tend to share labels; node i with feature vector $x_i$ is represented as $f(x_i)$ in a Euclidean space.

With the aim of obtaining more scalable models, Defferrard et al. (2016) and Kipf & Welling (2017) propose to make this regularization implicit by incorporating it into what they call graph convolutional networks (GCN), which they motivate as a first-order approximation of spectral graph convolutions, yielding the following scalable layer architecture (detailed in Appendix A):

$$H^{(t+1)} = \sigma\left(\tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12} H^{(t)} W^{(t)}\right), \quad (9)$$

where $\tilde{A} = A + I$ has added self-connections, $\tilde{D}_{ii} = \sum_k \tilde{A}_{ik}$ defines its diagonal degree matrix, σ is a non-linearity such as sigmoid, tanh or $\mathrm{ReLU} = \max(0, \cdot)$, and $W^{(t)}$ and $H^{(t)}$ are the parameter and activation matrices of layer t respectively, with $H^{(0)} = X$ the input feature matrix.

Figure 4. Left: Spherical gyromidpoint of four points. Right: Möbius gyromidpoint in the Poincaré model, as defined by Ungar (2008) and, alternatively, here in eq. (12).

3.2. Tools for a κ-GCN

Learning a parametrized function $f_\theta$ that respects hyperbolic geometry has been studied in Ganea et al. (2018a): neural layers and hyperbolic softmax. We generalize their definitions to the κ-stereographic model, unifying operations in positive and negative curvature. We explain how curvature introduces a fundamental difference between left and right matrix multiplications, depicting the Möbius matrix multiplication of Ganea et al. (2018a) as a right multiplication, independent for each embedding.
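For reference, the Euclidean GCN layer of eq. (9) can be sketched in a few lines of NumPy (a minimal dense-matrix sketch with hypothetical names, no sparsity; ReLU is chosen as σ):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One Euclidean GCN layer: σ(D̃^{-1/2} Ã D̃^{-1/2} H W), with σ = ReLU."""
    A_tilde = A + np.eye(A.shape[0])            # Ã = A + I: add self-connections
    d = A_tilde.sum(axis=1)                     # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D̃^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)       # ReLU non-linearity

# Tiny 3-node path graph, 2-d features, 2 output channels:
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3)[:, :2]
W = np.eye(2)
print(gcn_layer(A, H, W).shape)  # (3, 2)
```

The κ-GCN of section 3.5 replaces the two matrix products in this layer by their κ-left- and κ-right-multiplication counterparts.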
We then introduce a left multiplication by an extension of gyromidpoints, which ties the embeddings together; this is essential for graph neural networks.

3.3. κ-Right-Matrix-Multiplication

Let $X \in \mathbb{R}^{n\times d}$ denote a matrix whose n rows are d-dimensional embeddings in $\mathrm{st}^d_\kappa$, and let $W \in \mathbb{R}^{d\times e}$ denote a weight matrix. Let us first understand what a right matrix multiplication is in Euclidean space: the Euclidean right multiplication can be written row-wise as $(XW)_{i\bullet} = X_{i\bullet}W$. Hence each d-dimensional Euclidean embedding is modified independently by a right matrix multiplication. A natural adaptation of this operation to the κ-stereographic model yields the following definition.

Definition 1. Given a matrix $X \in \mathbb{R}^{n\times d}$ holding κ-stereographic embeddings in its rows and weights $W \in \mathbb{R}^{d\times e}$, the κ-right-matrix-multiplication is defined row-wise as

$$(X \otimes_\kappa W)_{i\bullet} = \exp^\kappa_0\left((\log^\kappa_0(X)W)_{i\bullet}\right) = \tan_\kappa\left(\alpha_i \tan_\kappa^{-1}\|X_{i\bullet}\|\right)\frac{(XW)_{i\bullet}}{\|(XW)_{i\bullet}\|}, \quad (10)$$

where $\alpha_i = \frac{\|(XW)_{i\bullet}\|}{\|X_{i\bullet}\|}$, and $\exp^\kappa_0$ and $\log^\kappa_0$ denote the exponential and logarithmic maps in the κ-stereographic model.

Figure 5. Weighted Euclidean midpoint αx + βy.

This definition is in perfect agreement with the hyperbolic scalar multiplication for κ < 0, which can also be written as $r \otimes_\kappa x = \exp^\kappa_0(r \log^\kappa_0(x))$. This operation is known to have desirable properties such as associativity (Ganea et al., 2018a).

3.4. κ-Left-Matrix-Multiplication as a Midpoint Extension

For graph neural networks we also need the notion of message passing among neighboring nodes, i.e. an operation that combines / aggregates the respective embeddings. In Euclidean space such an operation is given by left multiplication of the embedding matrix with the (preprocessed) adjacency $\hat{A}$: $H^{(l+1)} = \sigma(\hat{A}Z^{(l)})$, where $Z^{(l)} = H^{(l)}W^{(l)}$. Let us consider this left multiplication.
For $A \in \mathbb{R}^{n\times n}$, the matrix product is given row-wise by $(AX)_{i\bullet} = A_{i1}X_{1\bullet} + \cdots + A_{in}X_{n\bullet}$. This means that the new representation of node i is obtained as the linear combination of all the other node embeddings, weighted by the i-th row of A. An adaptation to the κ-stereographic model hence requires a notion of weighted linear combination.

We propose such an operation in $\mathrm{st}^d_\kappa$ by performing a κ-scaling of a gyromidpoint, whose definition is recalled below. Indeed, in Euclidean space the weighted linear combination αx + βy can be re-written as $(\alpha + \beta)\,m_E(x, y; \alpha, \beta)$ with Euclidean midpoint $m_E(x, y; \alpha, \beta) := \frac{\alpha}{\alpha+\beta}x + \frac{\beta}{\alpha+\beta}y$. See fig. 5 for a geometric illustration. This motivates generalizing the above operation to $\mathrm{st}^d_\kappa$ as follows.

Definition 2. Given a matrix $X \in \mathbb{R}^{n\times d}$ holding κ-stereographic embeddings in its rows and weights $A \in \mathbb{R}^{n\times n}$, the κ-left-matrix-multiplication is defined row-wise as

$$(A \boxtimes_\kappa X)_{i\bullet} := \Big(\sum_j A_{ij}\Big) \otimes_\kappa m_\kappa(X_{1\bullet}, \dots, X_{n\bullet}; A_{i\bullet}). \quad (11)$$

The κ-scaling is motivated by the fact that $d_\kappa(0, r \otimes_\kappa x) = |r|\,d_\kappa(0, x)$ for all $r \in \mathbb{R}$, $x \in \mathrm{st}^d_\kappa$. We recall that the gyromidpoint is defined, when κ ≤ 0, in the κ-stereographic model as (Ungar, 2010):

$$m_\kappa(x_1, \dots, x_n; \alpha) = \frac12 \otimes_\kappa \left(\sum_{i=1}^n \frac{\alpha_i \lambda^\kappa_{x_i}}{\sum_{j=1}^n \alpha_j(\lambda^\kappa_{x_j} - 1)}\,x_i\right), \quad (12)$$

with $\lambda^\kappa_x = 2/(1 + \kappa\|x\|^2)$. Whenever κ > 0, we have to further require the following condition:

$$\sum_j \alpha_j(\lambda^\kappa_{x_j} - 1) \neq 0. \quad (13)$$

For two points, one can calculate that $(\lambda^\kappa_x - 1) + (\lambda^\kappa_y - 1) = 0$ is equivalent to $\kappa\|x\|\|y\| = 1$, which holds in particular whenever $x = -y/(\kappa\|y\|^2)$. See fig. 4 for illustrations of gyromidpoints.

Our operation $\boxtimes_\kappa$ satisfies interesting properties, proved in Appendix C.2.4:

Theorem 4 (Neutral element & κ-scalar-associativity). We have $I_n \boxtimes_\kappa X = X$, and for $r \in \mathbb{R}$, $r \otimes_\kappa (A \boxtimes_\kappa X) = (rA) \boxtimes_\kappa X$.

Figure 6. Heatmap of the distance from a $\mathrm{st}^2_\kappa$-hyperplane to $x \in \mathrm{st}^2_\kappa$ for κ = −0.254 (left) and κ = 0.248 (right).

The matrix A. In most GNNs, the matrix A is intended to be a preprocessed adjacency matrix, i.e. renormalized by the diagonal degree matrix $D_{ii} = \sum_k A_{ik}$.
This normalization is often taken either (i) to the left: $D^{-1}A$, (ii) symmetric: $D^{-\frac12}AD^{-\frac12}$, or (iii) to the right: $AD^{-1}$. Note that the latter case makes the matrix right-stochastic¹, a property that is preserved by matrix products and exponentiation. For this case, we prove the following result in Appendix C.2.5.

¹M is right-stochastic if for all i, $\sum_j M_{ij} = 1$.

Theorem 5 (κ-left-multiplication by right-stochastic matrices is intrinsic). If A, B are right-stochastic, φ is an isometry of $\mathrm{st}^d_\kappa$, and X, Y are two matrices holding κ-stereographic embeddings, then for all i:

$$d_\kappa\left((A \boxtimes_\kappa \varphi(X))_{i\bullet}, (B \boxtimes_\kappa \varphi(Y))_{i\bullet}\right) = d_\kappa\left((A \boxtimes_\kappa X)_{i\bullet}, (B \boxtimes_\kappa Y)_{i\bullet}\right). \quad (14)$$

The above result means that A can easily be preprocessed so as to make its κ-left-multiplication intrinsic to the metric space $(\mathrm{st}^d_\kappa, d_\kappa)$. At this point, one could wonder: do there exist other ways to take weighted centroids on a Riemannian manifold? We comment on two plausible alternatives.

Fréchet/Karcher means. They are obtained as $\arg\min_x \sum_i \alpha_i d_\kappa(x, x_i)^2$. Note that although they are also intrinsic, they usually require solving an optimization problem, which can be prohibitively expensive, especially when one requires gradients to flow through the solution; moreover, for the space $\mathrm{st}^d_\kappa$, it is known that the minimizer is unique if and only if κ ≤ 0.

Tangential aggregations. The linear combination is here lifted to the tangent space by means of the exponential and logarithmic maps; this was in particular used in the recent works of Chami et al. (2019) and Liu et al. (2019).

Definition 3. The tangential aggregation of $x_1, \dots, x_n \in \mathrm{st}^d_\kappa$ w.r.t. weights $\{\alpha_i\}_{1\le i\le n}$, at point $x \in \mathrm{st}^d_\kappa$ (for $x_i \neq -x/(\kappa\|x\|^2)$ if κ > 0), is defined by:

$$\mathrm{tg}^\kappa_x(x_1, \dots, x_n; \alpha_1, \dots, \alpha_n) := \exp^\kappa_x\left(\sum_{i=1}^n \alpha_i \log^\kappa_x(x_i)\right). \quad (15)$$

The theorem below states that, for the κ-stereographic model, this operation is also intrinsic. We prove it in Appendix C.2.6.

Theorem 6 (Tangential aggregation is intrinsic).
For any isometry φ of $\mathrm{st}^d_\kappa$, we have

$$\mathrm{tg}_{\varphi(x)}(\{\varphi(x_i)\}; \{\alpha_i\}) = \varphi\left(\mathrm{tg}_x(\{x_i\}; \{\alpha_i\})\right). \quad (16)$$

3.5. Logits

Finally, we need the logit and softmax layer, a necessity for any classification task. We here use the model of Ganea et al. (2018a), which was obtained in a principled manner for the case of negative curvature. Their derivation rests upon the closed-form formula for the distance to a hyperbolic hyperplane. We naturally extend this formula to $\mathrm{st}^d_\kappa$, hence also allowing κ > 0, but leave the adaptation of their theoretical analysis for future work:

$$p(y = k \mid x) = S\left(\frac{\|a_k\|}{\sqrt{|\kappa|}}\,\sin_\kappa^{-1}\left(\frac{2\sqrt{|\kappa|}\,\langle z_k, a_k\rangle}{(1 + \kappa\|z_k\|^2)\|a_k\|}\right)\right), \quad (17)$$

where $a_k \in T_0\mathrm{st}^d_\kappa = \mathbb{R}^d$ and $p_k \in \mathrm{st}^d_\kappa$ are trainable parameters, $x \in \mathrm{st}^d_\kappa$ is the input, $z_k = -p_k \oplus_\kappa x$, $\sin_\kappa$ is defined analogously to $\tan_\kappa$, and $S(\cdot)$ is the softmax function. We refer the reader to Appendix D for further details and to fig. 6 for an illustration of eq. (17).

We are now ready to introduce our κ-stereographic GCN (Kipf & Welling, 2017), denoted κ-GCN². Assume we are given a graph with node-level features $G = (V, A, X)$, where $X \in \mathbb{R}^{n\times d}$ with each row $X_{i\bullet} \in \mathrm{st}^d_\kappa$, and adjacency $A \in \mathbb{R}^{n\times n}$. We first perform a preprocessing step, mapping the Euclidean features to $\mathrm{st}^d_\kappa$ via the projection $X \mapsto X/(2\sqrt{|\kappa|}\,\|X\|_{\max})$, where $\|X\|_{\max}$ denotes the maximal Euclidean norm among all stereographic embeddings in X. For $l \in \{0, \dots, L-2\}$, the (l+1)-th layer of κ-GCN is given by:

$$H^{(l+1)} = \sigma^{\otimes_\kappa}\left(\hat{A} \boxtimes_\kappa \left(H^{(l)} \otimes_\kappa W^{(l)}\right)\right), \quad (18)$$

where $H^{(0)} = X$, $\sigma^{\otimes_\kappa}(x) := \exp^\kappa_0(\sigma(\log^\kappa_0(x)))$ is the Möbius version (Ganea et al., 2018a) of a pointwise non-linearity σ, and $\hat{A} = \tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12}$. The final layer is a κ-logit layer (Appendix D):

$$H^{(L)} = \mathrm{softmax}\left(\hat{A}\,\mathrm{logit}_\kappa\left(H^{(L-1)}, W^{(L-1)}\right)\right), \quad (19)$$

where $W^{(L-1)}$ contains the parameters $a_k$ and $p_k$ of the κ-logits layer. A very important property of κ-GCN is that its architecture recovers the Euclidean GCN when we let the curvature go to zero:

²To be pronounced "kappa-GCN"; the greek letter κ is commonly used to denote sectional curvature.

Table 1. Minimum achieved average distortion of the different models.
H and S denote hyperbolic and spherical models respectively.

| MODEL | TREE | TOROIDAL | SPHERICAL |
|---|---|---|---|
| E10 (linear) | 0.045 | 0.0607 | 0.0415 |
| E10 (ReLU) | 0.0502 | 0.0603 | 0.0409 |
| H10 (κ-GCN) | 0.0029 | 0.272 | 0.267 |
| S10 (κ-GCN) | 0.473 | 0.0485 | 0.0337 |
| H5 × H5 (κ-GCN) | 0.0048 | 0.112 | 0.152 |
| S5 × S5 (κ-GCN) | 0.51 | 0.0464 | 0.0359 |
| (H2)⁴ (κ-GCN) | 0.025 | 0.084 | 0.062 |
| (S2)⁴ (κ-GCN) | 0.312 | 0.0481 | 0.0378 |

$$\kappa\text{-GCN} \xrightarrow{\;\kappa \to 0\;} \text{GCN}.$$

4. Experiments

We evaluate the architectures introduced in the previous sections on the tasks of node classification and minimizing embedding distortion, for several synthetic as well as real datasets. We defer the training setup and model architecture choices to Appendix F.

Minimizing Distortion. Our first goal is to evaluate the graph embeddings learned by our GCN models on the representation task of fitting the graph metric in the embedding space. We seek to minimize the average distortion, defined similarly as in Gu et al. (2019): $\frac{1}{n^2}\sum_{i,j}\left(\left(\frac{d(x_i, x_j)}{d_G(i,j)}\right)^2 - 1\right)^2$, where $d(x_i, x_j)$ is the distance between the embeddings of nodes i and j, while $d_G(i, j)$ is their graph distance (shortest path length).

We create three synthetic datasets that best reflect the different geometries of interest: i) Tree: a balanced tree of depth 5 and branching factor 4, consisting of 1365 nodes and 1364 edges. ii) Torus: we sample points (nodes) from the (planar) torus, i.e. from the unit connected square; two nodes are connected by an edge if their toroidal distance (the warped distance) is smaller than a fixed R = 0.01; this gives 1000 nodes and 30626 edges. iii) Spherical Graph: we sample points (nodes) from S², connecting nodes if their distance is smaller than 0.2, leading to 1000 nodes and 17640 edges.

For the GCN models, we use 1-hot initial node features. We use two GCN layers with dimensions 16 and 10. The non-Euclidean models do not use additional non-linearities between layers. All Euclidean parameters are updated using the ADAM optimizer with learning rate 0.01.
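The average distortion objective above is straightforward to express; a minimal NumPy sketch (hypothetical helper name; dense precomputed matrices of pairwise embedding distances and graph distances are assumed):

```python
import numpy as np

def average_distortion(d_emb, d_graph):
    """Average of ((d(x_i, x_j) / d_G(i, j))^2 - 1)^2 over all pairs i != j
    (the diagonal is excluded, since d_G(i, i) = 0)."""
    n = d_emb.shape[0]
    mask = ~np.eye(n, dtype=bool)       # exclude i == j
    ratio = d_emb[mask] / d_graph[mask]
    return np.mean((ratio ** 2 - 1.0) ** 2)

# A perfect embedding (embedding distances equal graph distances) has zero distortion:
d_graph = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(average_distortion(d_graph, d_graph))  # 0.0
```

Scaling all embedding distances by a constant c != 1 yields the constant distortion (c² − 1)², which is why isometric (rather than merely conformal) fits score best.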
Curvatures are learned using gradient descent with a learning rate of 0.0001. All models are trained for 10000 epochs and we report the minimal achieved distortion.

Table 2. Node classification: average accuracy across 5 splits, with uncertainties estimated at the 95 percent confidence level via bootstrapping on our data splits. H and S denote hyperbolic and spherical models respectively.

| MODEL | CITESEER | CORA | PUBMED | AIRPORT |
|---|---|---|---|---|
| E16 (Kipf & Welling, 2017) | 0.729 ± 0.0054 | 0.814 ± 0.004 | 0.792 ± 0.0039 | 0.814 ± 0.0029 |
| H16 (Chami et al., 2019) | 0.71 ± 0.0049 | 0.803 ± 0.0046 | 0.798 ± 0.0043 | 0.844 ± 0.0041 |
| H16 (κ-GCN) | 0.732 ± 0.0051 | 0.812 ± 0.005 | 0.785 ± 0.0036 | 0.819 ± 0.0033 |
| S16 (κ-GCN) | 0.721 ± 0.0045 | 0.819 ± 0.0045 | 0.788 ± 0.0049 | 0.809 ± 0.0058 |
| Prod-GCN (κ-GCN) | 0.711 ± 0.0059 | 0.808 ± 0.0041 | 0.781 ± 0.006 | 0.817 ± 0.0044 |

Distortion results. The distortion scores shown in table 1 reveal the benefit of our models: the best performing architecture is the one that matches the underlying geometry of the graph.

Figure 7. Histogram of curvatures estimated from the deviation from the parallelogram law.

4.1. Node Classification

We consider the popular node classification datasets Citeseer (Sen et al., 2008), Cora-ML (McCallum et al., 2000) and Pubmed (Namata et al., 2012). Node labels correspond to the particular subfield the published document is associated with. Dataset statistics and splitting details are deferred to Appendix F due to lack of space. We compare against the Euclidean model (Kipf & Welling, 2017) and the recently proposed hyperbolic variant (Chami et al., 2019).

Curvature Estimation for the Datasets. To understand how far the real graphs of the above datasets are from Euclidean geometry, we first estimate the graph curvature of the four studied datasets using the deviation from the parallelogram law (Gu et al., 2019), as detailed in Appendix G. Curvature histograms are shown in fig. 7.
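A sketch of the parallelogram-law deviation used as a discrete curvature proxy (the exact normalization follows the Gu et al. (2019) estimator as we understand it, and the helper name is hypothetical; the paper's own variant is detailed in its Appendix G): for nodes b, c with midpoint m on a shortest b-c path, and a reference node a, the deviation vanishes in flat geometry and its sign suggests the curvature sign:

```python
def parallelogram_deviation(d_am, d_bc, d_ab, d_ac):
    """Deviation from the Euclidean parallelogram (median) law for a triangle
    a, b, c with m the midpoint of b and c. Zero in flat geometry; non-zero
    values indicate curvature."""
    return (d_am**2 + d_bc**2 / 4.0 - (d_ab**2 + d_ac**2) / 2.0) / (2.0 * d_am)

# Euclidean sanity check with exact (3-4-5) distances:
# a = (0, 3), b = (-4, 0), c = (4, 0), m = (0, 0):
print(parallelogram_deviation(d_am=3.0, d_bc=8.0, d_ab=5.0, d_ac=5.0))  # 0.0
```

On a graph, one would plug in shortest-path distances for sampled triples and histogram the resulting values, as in fig. 7.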
It can be noticed that the datasets are mostly non-Euclidean, thus offering good motivation to apply our constant-curvature GCN architectures.

Training Details. We trained the baseline models in the same setting as Chami et al. (2019). Namely, for GCN we use one hidden layer of size 16, dropout of rate 0.5 on the embeddings and the adjacency, as well as L2-regularization for the weights of the first layer. We used ReLU as the non-linear activation function.

For the non-Euclidean architectures, we used a combination of dropout and dropconnect, as reported in Chami et al. (2019), as well as L2-regularization for the first layer. All models have the same number of parameters and, for fairness, are compared in the same setting, without attention. We use one hidden layer of dimension 16. For the product models we consider two-component spaces (e.g. H8 × S8) and we split the embedding space into equal dimensions of size 8. We also distribute the input features equally among the components. Non-Euclidean models use the Möbius version of ReLU. Euclidean parameters use a learning rate of 0.01 for all models, using ADAM. The curvatures are learned using gradient descent with a learning rate of 0.01. We show the learnt values in Appendix F. We use early stopping: we first train for 2000 epochs, then check every 200 epochs for an improvement in the validation cross-entropy loss; if none is observed, we stop.

Node classification results. These are shown in table 2. Our models are competitive with the Euclidean GCN and outperform Chami et al. (2019) on Citeseer and Cora, showcasing the benefit of the proposed architecture.

5. Conclusion

In this paper, we introduced a natural extension of GCNs to the stereographic models of both positive and negative curvature in a unified manner.
We showed how this choice of models permits a differentiable interpolation between positive and negative curvature, allowing the curvature to be trained independently of an initial sign choice. We hope that our models will open new exciting directions for non-Euclidean graph neural networks.

6. Acknowledgements

We thank Prof. Thomas Hofmann, Andreas Bloch, Calin Cruceru and Ondrej Skopek for useful discussions, and the anonymous reviewers for their suggestions. Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems.

References

Abu-El-Haija, S., Kapoor, A., Perozzi, B., and Lee, J. N-GCN: Multi-scale graph convolution for semi-supervised node classification. International Workshop on Mining and Learning with Graphs (MLG), 2018.

Amari, S.-i. and Nagaoka, H. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations, 2015.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

Chami, I., Ying, R., Ré, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 2019.

Chen, J., Ma, T., and Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. ICLR, 2018.

Cho, H., DeMeo, B., Peng, J., and Berger, B. Large-margin classification in hyperbolic space. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1832-1840, 2019.

Davidson, T.
R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical Variational Auto-Encoders. Uncertainty in Artificial Intelligence (UAI), 856865, 2018. Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844 3852, 2016. Defferrard, M., Perraudin, N., Kacprzak, T., and Sgier, R. Deepsphere: towards an equivariant graph-based spherical cnn. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. URL https: //arxiv.org/abs/1904.05146. Deza, M. and Laurent, M. Geometry of Cuts and Metrics. Springer, Vol. 15, 1996. Ganea, O., B ecigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pp. 5345 5355, 2018a. Ganea, O.-E., Becigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, pp. 1632 1641, 2018b. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural Message Passing for Quantum Chemistry. Proceedings of the International Conference on Machine Learning, 2017. Grattarola, D., Zambon, D., Alippi, C., and Livi, L. Learning graph embeddings on constant-curvature manifolds for change detection in graph streams. stat, 1050:16, 2018. Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645 6649. IEEE, 2013. Gromov, M. Hyperbolic groups. In Essays in group theory, pp. 75 263. Springer, 1987. Gu, A., Sala, F., Gunel, B., and R e, C. Learning mixedcurvature representations in product spaces. Proceedings of the International Conference on Learning Representations, 2019. Gulcehre, C., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., et al. 
Hyperbolic attention networks. Proceedings of the International Conference on Learning Representations, 2018. Hamann, M. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 1 17, 2017. doi: 10.1017/ S0305004117000238. Hamann, M. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 117, 2017. Hamilton, W. L., Ying, R., and Leskovec, J. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems, 2017. Constant Curvature Graph Convolutional Networks Hammond, D. K., Vandergheynst, P., and Gribonval, R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129 150, 2011. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173 182. International World Wide Web Conferences Steering Committee, 2017. Henaff, M., Bruna, J., and Le Cun, Y. Deep convolutional networks on graph-structured data. ar Xiv preprint ar Xiv:1506.05163, 2015. Kingma, D. P. and Ba, J. ADAM: A method for stochastic optimization. ICLR, 2015. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017. Klicpera, J., Bojchevski, A., and G unnemann, S. Predict then propagate: graph neural networks meet personalized pagerank. International Conference on Learning Representations, 2019. Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Bogun a, M. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010. Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. 
Advances in Neural Information processing systems, 2019. Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with Poincar e variational auto-encoders. Advances in Neural Information Processing Systems, 2019. Matousek, J. Lecture notes on metric embeddings. 2013. Mc Callum, A., Nigam, K., Rennie, J., and Seymore, K. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127 163, 2000. Namata, G., London, B., Getoor, L., and Huang, B. Querydriven Active Surveying for Collective Classification. International Workshop on Mining and Learning with Graphs (MLG), 2012. Nickel, M. and Kiela, D. Poincar e embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341 6350, 2017. Nickel, M. and Kiela, D. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018. Ovinnikov, I. Poincar e Wasserstein autoencoder. ar Xiv preprint ar Xiv:1901.01427, 2019. Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500): 2323 2326, 2000. Sala, F., De Sa, C., Gu, A., and Re, C. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pp. 4457 4466, 2018. Sarkar, R. Low distortion delaunay embedding of trees in hyperbolic plane. International Symposium on Graph Drawing, pp. 355 366. Springer,, 2011. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective Classification in Network Data. AI Magazine, 29(3):93 106, 2008. Spivak, M. A comprehensive introduction to differential geometry. volume four. 1979. Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319 2323, 2000. Tifrea, A., B ecigneul, G., and Ganea, O.-E. 
Poincar e glove: Hyperbolic word embeddings. Proceedings of the International Conference on Learning Representations, 2019. Ungar, A. Barycentric Calculus in Euclidean and Hyperbolic Geometry. World Scientific, ISBN 9789814304931, 2010. Ungar, A. A. The hyperbolic pythagorean theorem in the Poincar e disc model of hyperbolic geometry. The American mathematical monthly, 106(8):759 763, 1999. Ungar, A. A. Analytic hyperbolic geometry: Mathematical foundations and applications. World Scientific, 2005. Ungar, A. A. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1 194, 2008. Ungar, A. A. Analytic Hyperbolic Geometry in N Dimensions: An Introduction. CRC Press, 2014. Ungar, A. A. Novel tools to determine hyperbolic triangle centers. In Essays in Mathematics and its Applications, pp. 563 663. Springer, 2016. Constant Curvature Graph Convolutional Networks Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2017. Veliˇckovi c, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations, 2018. Weston, J., Ratle, F., Mobahi, H., and Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639 655. Springer, 2012. Wilson, R. C., Hancock, E. R., Pekalska, E., and Duin, R. P. Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence, 36(11):2255 2269, 2014. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. ar Xiv preprint ar Xiv:1901.00596, 2019. Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 
4503 4513, 2018. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations, 2018. Zhang, M. and Chen, Y. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018.