Residual Hyperbolic Graph Convolution Networks

Yangkai Xue1, Jindou Dai1, Zhipeng Lu2*, Yuwei Wu1,2*, Yunde Jia2,1
1Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China
2Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China
xue.yangkai@bit.edu.cn, daijindou@foxmail.com, zhipeng.lu@hotmail.com, wuyuwei@bit.edu.cn, jiayunde@smbu.edu.cn
*Corresponding authors.

Abstract

Hyperbolic graph convolutional networks (HGCNs) have demonstrated strong representational capabilities for modeling hierarchically structured graphs. However, as in general GCNs, over-smoothing may occur as the number of layers increases, limiting the representation capability of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome over-smoothing and theoretically prove its effectiveness. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from becoming indistinguishable. (2) Product manifolds in R-HGCNs are set up with different origin points in different components, which facilitates extracting feature information from a wider range of perspectives and enhances the representational capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise to hyperbolic representations, so that perturbations can be injected to alleviate over-fitting without destroying the hyperbolic geometry. Experimental results demonstrate the effectiveness of R-HGCNs under various numbers of graph convolution layers and different structures of product manifolds.

Introduction

Hyperbolic graph convolutional networks (HGCNs) have been an emerging research topic due to their superior capability of modeling hierarchical graphs [Chami et al. 2019; Dai et al. 2021]. Unlike Euclidean spaces, whose volume grows polynomially with radius, hyperbolic spaces have exponentially growing volume, which is well suited to the geometry of hierarchical data. Benefiting from this property, great progress has been made by generalizing Euclidean methods to hyperbolic spaces, such as hyperbolic graph convolutional networks [Chami et al. 2019; Dai et al. 2021], hyperbolic image embeddings [Khrulkov et al. 2020], and hyperbolic word embeddings [Nickel and Kiela 2017, 2018]. However, the over-smoothing issue impedes the development of deep HGCNs: over-smoothing means that node features become indistinguishable after passing through a large number of graph convolution layers. It has been proved that graph convolution is a special form of Laplacian smoothing [Li, Han, and Wu 2018]. Smoothing over nodes can reduce intra-class differences, while over-smoothing makes the model less discriminative because node features become indistinguishable. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem.
Specifically, we introduce a hyperbolic residual connection function and incorporate product manifolds and HyperDrop into HGCNs. The details are as follows: (1) The hyperbolic residual connection transmits the initial node information to each layer to prevent node features from becoming indistinguishable, and the hyperbolic identity mapping prevents the performance degradation caused by deepening the model. (2) Product manifolds pick different origin points in different components, so that the same input has different embedding results in different components, giving them the ability to view the graph structure from different perspectives. This enhances the representational ability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise to hyperbolic neurons to alleviate the over-fitting issue, inheriting the insight of training with noise while preserving the hyperbolic geometry. Extensive experiments demonstrate the effectiveness of R-HGCNs under various numbers of graph convolution layers and different structures of product manifolds.

The contributions of this paper are summarized as follows:
- We propose R-HGCN, a product-manifold-based deep hyperbolic graph convolutional network that makes up for the deficiency of existing HGCNs in capturing long-range relationships in graphs.
- We design a hyperbolic residual connection that addresses the over-smoothing issue while deepening HGCNs, and theoretically prove its effectiveness.
- We propose to use product manifolds with different origin points in different components, which enables R-HGCNs to extract more comprehensive features from the data.
- We develop HyperDrop, a regularization method tailored for hyperbolic representations. It improves the generalization ability of R-HGCNs and alleviates the over-fitting issue.

Related Work

Graph convolutional networks typically embed graphs in Euclidean spaces, since Euclidean spaces are the most commonly used and are easy to compute in. Many researchers have noticed that Euclidean spaces have limitations when modeling data with hierarchical structure. De Sa et al. [De Sa et al. 2018] showed that it is not possible to embed trees into Euclidean spaces with arbitrarily low distortion, even in an infinite-dimensional Euclidean space. Meanwhile, trees can be embedded into a two-dimensional hyperbolic space with arbitrarily low distortion. This surprising fact stems from an appealing property of hyperbolic spaces: unlike the volume of a ball in Euclidean space, which expands polynomially with radius, the volume in hyperbolic space grows exponentially with radius [Liu, Nickel, and Kiela 2019]. Thus, hyperbolic spaces are commonly viewed as a continuous analogue of trees and are more suitable for modeling hierarchical data. Several works discovered that graphs such as biological networks and social networks exhibit a highly hierarchical structure [Krioukov et al. 2010; Papadopoulos et al. 2012]. Krioukov et al. [Krioukov et al. 2010] proved that typical properties of such graphs, such as power-law degree distributions and strong clustering, are closely related to the curvature of hyperbolic spaces. Based on the above observations, generalizing GCNs from Euclidean spaces to hyperbolic spaces has become an emerging research topic. Liu et al. [Liu, Nickel, and Kiela 2019] and Chami et al. [Chami et al. 2019] first bridged this gap and concurrently proposed HGCNs.
Following the above works, many advanced techniques have been proposed to improve HGCNs. Dai et al. [Dai et al. 2021] discovered that performing graph convolution in tangent spaces distorts the global structure of hyperbolic spaces, because a tangent space is only a local approximation of a hyperbolic manifold. Yao et al. [Yao, Pi, and Chen 2022] designed a hyperbolic skipped knowledge graph convolutional network to capture network structure characteristics in hyperbolic knowledge embeddings. Liu et al. [Liu and Lang 2023] proposed a multi-curvature hyperbolic heterogeneous graph convolutional network (McH-HGCN) based on type triplets for heterogeneous graphs.

Preliminaries

Hyperbolic Manifold. A Riemannian manifold or Riemannian space $(\mathcal{M}, g)$ is a real, smooth manifold $\mathcal{M}$ equipped with a positive-definite metric tensor $g$. It is a topological space that is locally homeomorphic to a Euclidean space at each point $\mathbf{x} \in \mathcal{M}$, and the local Euclidean space is termed the tangent space $\mathcal{T}_{\mathbf{x}}\mathcal{M}$.

Lorentz Model. A $d$-dimensional Lorentz model $(\mathbb{L}^d, g)$ is defined by the manifold
$$\mathbb{L}^d = \{\mathbf{x} = [x_0, x_1, \dots, x_d]^\top \in \mathbb{R}^{d+1} : \langle \mathbf{x}, \mathbf{x}\rangle_{\mathcal{L}} = -1,\ x_0 > 0\},$$
where the Lorentz inner product is defined as
$$\langle \mathbf{x}, \mathbf{y}\rangle_{\mathcal{L}} = \mathbf{x}^\top g\, \mathbf{y} = -x_0 y_0 + \sum_{i=1}^{d} x_i y_i, \tag{1}$$
and the metric tensor is $g = \mathrm{diag}([-1, 1, \dots, 1])$, where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix.

Exponential and logarithmic maps. Mappings between a Riemannian manifold and its tangent spaces are termed exponential and logarithmic maps. Let $\mathbf{x}$ be a point on the Lorentz manifold $\mathbb{L}$, $\mathcal{T}_{\mathbf{x}}\mathbb{L}$ be the tangent space at $\mathbf{x}$, and $\mathbf{v}$ be a vector in the tangent space $\mathcal{T}_{\mathbf{x}}\mathbb{L}$. The exponential map $\exp_{\mathbf{x}}(\mathbf{v})$ that projects $\mathbf{v}$ onto the manifold $\mathbb{L}$ is defined as
$$\exp_{\mathbf{x}}(\mathbf{v}) = \cosh(\|\mathbf{v}\|_{\mathcal{L}})\,\mathbf{x} + \sinh(\|\mathbf{v}\|_{\mathcal{L}})\,\frac{\mathbf{v}}{\|\mathbf{v}\|_{\mathcal{L}}}, \tag{2}$$
where $\|\mathbf{v}\|_{\mathcal{L}} = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle_{\mathcal{L}}}$ is the norm of $\mathbf{v}$. The logarithmic map, the inverse of the exponential map at $\mathbf{x}$, is given by
$$\log_{\mathbf{x}}(\mathbf{y}) = \frac{\operatorname{arcosh}(-\langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}})}{\sqrt{\langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}}^2 - 1}}\,(\mathbf{y} + \langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}}\,\mathbf{x}). \tag{3}$$

Parallel Transport. The generalization of parallel translation to non-Euclidean geometry is termed parallel transport. For two points $\mathbf{x}, \mathbf{y} \in \mathbb{L}$ on the Lorentz model, the parallel transport of a tangent vector $\mathbf{v} \in \mathcal{T}_{\mathbf{x}}\mathbb{L}$ at $\mathbf{x}$ to the tangent space $\mathcal{T}_{\mathbf{y}}\mathbb{L}$ at $\mathbf{y}$, along a smooth curve on the Lorentz model, is defined as
$$P_{\mathbf{x}\to\mathbf{y}}(\mathbf{v}) = \mathbf{v} - \frac{\langle\log_{\mathbf{x}}(\mathbf{y}), \mathbf{v}\rangle_{\mathcal{L}}}{d_{\mathcal{L}}(\mathbf{x},\mathbf{y})^2}\left(\log_{\mathbf{x}}(\mathbf{y}) + \log_{\mathbf{y}}(\mathbf{x})\right). \tag{4}$$

Method

We propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem and enhance the representational ability of HGCNs. Let $G = (V, E)$ denote a graph with vertex set $V$ and edge set $E$. $\mathbf{X} \in \mathbb{R}^{n\times d}$ denotes the node features, which typically lie in Euclidean spaces; $n$ is the number of nodes, and $d$ is the dimension of the node features.

Residual Hyperbolic Graph Convolution

Lorentz Operations. Due to the strict manifold constraint of the Lorentz model, basic operations (matrix multiplication, vector addition, etc.) are non-trivial to generalize to Lorentz representations. Based on the maps above, we define the following operations.

Definition 1 (Lorentz matrix-vector multiplication). Let $\mathbf{W}$ be a $(d+1)\times(d+1)$ real matrix and $\mathbf{x} \in \mathbb{L}$ be an input on the Lorentz model. The Lorentz matrix-vector multiplication is defined as
$$\mathbf{W} \otimes \mathbf{x} := \exp_{\mathbf{o}}\!\left(\mathbf{W}\log_{\mathbf{o}}(\mathbf{x})\right), \tag{5}$$
where $\exp_{\mathbf{o}}(\cdot)$ and $\log_{\mathbf{o}}(\cdot)$ are the maps defined in Eqs. (2) and (3), taken at the origin $\mathbf{o}$.

Definition 2 (Lorentz scalar multiplication). The Lorentz scalar multiplication of a scalar $\xi$ and $\mathbf{x} \in \mathbb{L}$ on the Lorentz model is defined as
$$\xi \odot \mathbf{x} := \exp_{\mathbf{o}}\!\left(\xi\log_{\mathbf{o}}(\mathbf{x})\right). \tag{6}$$
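To make these Lorentz operations concrete, the following is a minimal NumPy sketch of the origin-based exponential and logarithmic maps (Eqs. 2-3) and the Lorentz matrix-vector and scalar multiplications of Definitions 1 and 2. The function names (`exp_o`, `log_o`, `lorentz_matvec`, `lorentz_scalar_mul`) are ours for illustration, not the authors' released code; the sketch assumes unit negative curvature and keeps the image of the weight matrix in the tangent space at the origin so that the maps stay well defined.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentz inner product <x, y>_L = -x0*y0 + sum_i xi*yi (Eq. 1)."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_o(v):
    """Exponential map at the origin o = [1, 0, ..., 0] (Eq. 2); v is tangent at o, v[0] = 0."""
    norm = np.sqrt(max(lorentz_inner(v, v), 1e-15))
    o = np.zeros_like(v)
    o[0] = 1.0
    return np.cosh(norm) * o + np.sinh(norm) * v / norm

def log_o(x):
    """Logarithmic map at the origin (Eq. 3)."""
    o = np.zeros_like(x)
    o[0] = 1.0
    a = -lorentz_inner(o, x)  # equals x[0] >= 1 for points on the manifold
    coef = np.arccosh(np.clip(a, 1.0, None)) / np.sqrt(max(a ** 2 - 1.0, 1e-15))
    return coef * (x - a * o)

def lorentz_matvec(W, x):
    """Definition 1: W (x) x := exp_o(W log_o(x))."""
    return exp_o(W @ log_o(x))

def lorentz_scalar_mul(xi, x):
    """Definition 2: xi (.) x := exp_o(xi * log_o(x))."""
    return exp_o(xi * log_o(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = exp_o(np.concatenate([[0.0], rng.normal(size=4)]))  # a point on L^4
    W = 0.1 * rng.normal(size=(5, 5))
    W[0, :] = 0.0  # keep W log_o(x) in the tangent space at o (time coordinate stays zero)
    y = lorentz_matvec(W, x)
    z = lorentz_scalar_mul(0.5, x)
    print(np.isclose(lorentz_inner(y, y), -1.0), np.isclose(lorentz_inner(z, z), -1.0))
```

The prints confirm that both operations return points satisfying the Lorentz constraint of Eq. (1).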
Definition 3 (Lorentz vector addition). The Lorentz vector addition of $\mathbf{x}, \mathbf{y} \in \mathbb{L}$ on the Lorentz model is defined as
$$\mathbf{x} \oplus \mathbf{y} := \exp_{\mathbf{x}}\!\left(P_{\mathbf{o}\to\mathbf{x}}(\log_{\mathbf{o}}(\mathbf{y}))\right), \tag{7}$$
where $P_{\mathbf{o}\to\mathbf{x}}(\cdot)$ is the parallel transport operator defined in Eq. (4).

Definition 4 (Lorentz activation function). For $\mathbf{x} \in \mathbb{L}$, the Lorentz activation function on the Lorentz model is defined as
$$\sigma^{\mathcal{L}}(\mathbf{x}) := \exp_{\mathbf{o}}\!\left(\sigma(\log_{\mathbf{o}}(\mathbf{x}))\right), \tag{8}$$
where $\sigma(\cdot)$ can be any activation function, such as $\mathrm{ReLU}(\cdot)$.

Residual Hyperbolic Graph Convolution. Since the input of a hyperbolic graph convolutional network is required to be hyperbolic, we construct the initial Lorentz node features $\mathbf{H}^{(0)} \in \mathbb{R}^{n\times(d+1)}$, whose $i$-th row $\mathbf{H}^{(0)}_i$, the Lorentz feature of the $i$-th node, is generated by
$$\mathbf{H}^{(0)}_i = \exp_{\mathbf{o}}\!\left([0, \mathbf{X}_i]\right) = \left[\cosh(\|\mathbf{X}_i\|_2),\ \sinh(\|\mathbf{X}_i\|_2)\frac{\mathbf{X}_i}{\|\mathbf{X}_i\|_2}\right], \tag{9}$$
where $\mathbf{X}_i$ denotes the $i$-th row of $\mathbf{X}$. Such a construction is based on the fact that $[0, \mathbf{X}_i] \in \mathbb{R}^{d+1}$ can be viewed as a tangent vector in the tangent space at the origin, satisfying $\langle\mathbf{o}, [0, \mathbf{X}_i]\rangle_{\mathcal{L}} = 0$, where $\mathbf{o} = [1, 0, \dots, 0]$ is the origin of the Lorentz model.

The performance of a hyperbolic graph convolutional network declines as the number of graph convolution layers increases, which is called the over-smoothing issue [Li, Han, and Wu 2018]. This is because graph convolution has been proved to be a special form of Laplacian smoothing, making node features tend to be indistinguishable after extensive graph convolutions [Li, Han, and Wu 2018]. Inspired by [Chen et al. 2020], we design a hyperbolic residual connection and a hyperbolic identity mapping to tackle this issue. The residual hyperbolic graph convolution operator $\mathrm{hgc}(\cdot)$ is defined as
$$\mathrm{hgc}(\mathbf{H}) = \sigma^{\mathcal{L}}\!\left(\left((1-\beta)\mathbf{I} + \beta\mathbf{W}\right)\otimes\tilde{\mathbf{H}}\right), \quad \tilde{\mathbf{H}} = \left((1-\alpha)\odot(\hat{\mathbf{A}}\otimes\mathbf{H})\right)\oplus\left(\alpha\odot\mathbf{H}^{(0)}\right), \tag{10}$$
where $\otimes$, $\odot$, and $\oplus$ are the Lorentz matrix-vector multiplication (Definition 1), the Lorentz scalar multiplication (Definition 2), and the Lorentz vector addition (Definition 3), and $\sigma^{\mathcal{L}}(\cdot)$ is the Lorentz activation function (Definition 4). $\alpha$ and $\beta$ are hyper-parameters that control the weights of the hyperbolic residual connection and the hyperbolic identity mapping. Formally, at the $\ell$-th layer, the residual hyperbolic graph convolution performs
$$\mathbf{H}^{(\ell)} = \mathrm{hgc}(\mathbf{H}^{(\ell-1)}) = \sigma^{\mathcal{L}}\!\left(\left((1-\beta_\ell)\mathbf{I} + \beta_\ell\mathbf{W}^{(\ell)}\right)\otimes\tilde{\mathbf{H}}\right), \quad \tilde{\mathbf{H}} = \left((1-\alpha_\ell)\odot(\hat{\mathbf{A}}\otimes\mathbf{H}^{(\ell-1)})\right)\oplus\left(\alpha_\ell\odot\mathbf{H}^{(0)}\right), \tag{11}$$
where $\mathrm{hgc}(\cdot)$ takes the node features $\mathbf{H}^{(\ell-1)}$ from the previous layer and outputs the node features $\mathbf{H}^{(\ell)}$ at the $\ell$-th layer.

Compared to the convolution operator in vanilla GCNs, i.e., $\mathbf{H}^{(\ell)} = \sigma(\hat{\mathbf{A}}\mathbf{H}^{(\ell-1)}\mathbf{W})$, $\mathrm{hgc}(\cdot)$ relieves the over-smoothing issue through two modifications: (1) the hyperbolic residual connection adds information paths from the initial node features to each graph convolution layer, so that no matter how deep a hyperbolic graph convolutional network is, the node features at the top layer still incorporate the initial node features and avoid becoming indistinguishable; (2) the hyperbolic identity mapping ensures that a deep hyperbolic graph convolutional network is not worse than a shallow one. In the extreme case where the values of $\beta$ are set to zero after the $i$-th layer, the model degenerates to an $i$-layer GCN no matter how deep it is.
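As a rough, self-contained illustration of Eq. (11), the sketch below lifts Euclidean features with Eq. (9) and applies one hgc(·) layer to a toy graph, implementing the Lorentz maps row-wise. It is a sketch under our own assumptions, not the authors' implementation: operator compositions are collapsed into a single tangent-space round trip at the origin, the time coordinate is projected out before the Lorentz ReLU so that the maps remain well defined, and all names (`hgc_layer`, `transport_o_to_x`, ...) are illustrative.

```python
import numpy as np

def linner(x, y):
    """Row-wise Lorentz inner product."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def expmap(x, v):
    """Row-wise exponential map exp_x(v) on the Lorentz model (Eq. 2)."""
    n = np.sqrt(np.clip(linner(v, v), 1e-15, None))[..., None]
    return np.cosh(n) * x + np.sinh(n) * v / n

def logmap(x, y):
    """Row-wise logarithmic map log_x(y) (Eq. 3)."""
    a = -linner(x, y)[..., None]
    coef = np.arccosh(np.clip(a, 1.0, None)) / np.sqrt(np.clip(a ** 2 - 1.0, 1e-15, None))
    return coef * (y - a * x)

def transport_o_to_x(x, v, o):
    """Row-wise parallel transport P_{o -> x}(v) (Eq. 4)."""
    lox, lxo = logmap(o, x), logmap(x, o)
    dist2 = np.clip(linner(lox, lox), 1e-15, None)[..., None]  # d_L(o, x)^2
    return v - (linner(lox, v)[..., None] / dist2) * (lox + lxo)

def hgc_layer(A_hat, H, H0, W, alpha, beta, o):
    """One residual hyperbolic graph convolution, Eq. (11), acting on the rows of H."""
    # (1 - alpha) (.) (A_hat (x) H): neighborhood aggregation in the tangent space at o
    agg = expmap(o, (1.0 - alpha) * (A_hat @ logmap(o, H)))
    # ... (+) alpha (.) H0: hyperbolic residual connection via parallel transport (Definition 3)
    res = expmap(agg, transport_o_to_x(agg, alpha * logmap(o, H0), o))
    # ((1 - beta) I + beta W) (x) . followed by the Lorentz ReLU (Definitions 1 and 4)
    M = (1.0 - beta) * np.eye(W.shape[0]) + beta * W
    out = np.maximum(logmap(o, res) @ M.T, 0.0)
    out[:, 0] = 0.0  # project back onto the tangent space at o before mapping to the manifold
    return expmap(o, out)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 3
    o = np.zeros(d + 1)
    o[0] = 1.0
    X = rng.normal(size=(n, d))
    H0 = expmap(o, np.concatenate([np.zeros((n, 1)), X], axis=1))  # Eq. (9) feature lift
    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float) + np.eye(n)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt  # symmetrically normalized adjacency with self-loops
    W = 0.1 * rng.normal(size=(d + 1, d + 1))
    H1 = hgc_layer(A_hat, H0, H0, W, alpha=0.1, beta=0.5, o=o)
    print("rows stay on the Lorentz manifold:", np.allclose(linner(H1, H1), -1.0, atol=1e-6))
```

Stacking such layers, with H0 fed to every call, is what keeps the initial node information available at arbitrary depth.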
Effectiveness of Hyperbolic Residual Connection

This section theoretically explains the effectiveness of our network architecture. Inspired by [Cai and Wang 2020], we first define a hyperbolic version of the Dirichlet energy for tracking node embeddings. The Dirichlet energy measures the smoothness of a unit-norm function; indistinguishable node features, which cause the over-smoothing issue, correspond to a small Dirichlet energy. For details of the formulas and proofs in this section, see the supplementary material.

Definition 5. The Dirichlet energy $E(f)$ of a function $f$ on the graph $G$ taking values in $\mathbb{L}^d \subset \mathbb{R}^{d+1}$ is defined as
$$E(f) = \log_{\mathbf{o}}(f)^\top \tilde{\Delta}\, \log_{\mathbf{o}}(f),$$
where $\tilde{\Delta} = \mathbf{I} - \tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, $\tilde{\mathbf{D}} = \mathbf{D} + \mathbf{I}$, and $\mathbf{A}$ and $\mathbf{D}$ are the adjacency and degree matrices of $G$. For a vector field $\mathbf{F} = (f_1, \dots, f_c)$, its Dirichlet energy is $E(\mathbf{F}) = \mathrm{tr}\!\left(\log_{\mathbf{o}}(\mathbf{F})^\top \tilde{\Delta}\, \log_{\mathbf{o}}(\mathbf{F})\right)$.

Note that the defined Dirichlet energy always pulls the node embeddings in the Lorentz space back to the tangent space at the origin. If the hyperbolic graph convolution operator $\mathrm{hgc}(\cdot)$ in (10) has no initial input, i.e., $\tilde{\mathbf{H}} = \hat{\mathbf{A}}\otimes\mathbf{H}$ therein, then
$$E(\mathbf{H}^{(l)}) \le (1-\lambda)^2\, \big\|(1-\beta_l)\mathbf{I} + \beta_l\mathbf{W}^{(l)}\big\|_2^2\, E(\mathbf{H}^{(l-1)}), \tag{12}$$
where $0 < \lambda < 2$ is the smallest non-zero eigenvalue of $\tilde{\Delta}$ and $\|\mathbf{X}\|_2$ denotes the maximal singular value of $\mathbf{X}$. Note that $\|\mathbf{X}\|_2 = \max_{\|\mathbf{u}\|_2=1}\|\mathbf{X}\mathbf{u}\|_2$ for any matrix $\mathbf{X}$. Hence, by the triangle inequality,
$$\big\|\left((1-\beta_l)\mathbf{I} + \beta_l\mathbf{W}^{(l)}\right)\mathbf{u}\big\|_2 \le 1-\beta_l + \beta_l\big\|\mathbf{W}^{(l)}\mathbf{u}\big\|_2. \tag{13}$$
In fact, $\|\mathbf{X}\mathbf{u}\|_2$ for any weight matrix $\mathbf{X}$ may be estimated by the following lemma.

Lemma 1. If $\mathbf{X} = (X_{ij})$ is an $n\times n$ weight matrix, i.e., $\sum_{j=1}^{n} X_{ij} = 1$ and $X_{ij} \ge 0$, then for any $\mathbf{u} \in \mathbb{R}^n$ with $\|\mathbf{u}\|_2 = 1$, $\|\mathbf{X}\mathbf{u}\|_2 \le \sqrt{n}$.

Hence, combining (12) and (13), we easily prove that in HGCNs without the initial input,
$$E(\mathbf{H}^{(l)}) \le d\,(1-\lambda)^2 E(\mathbf{H}^{(l-1)}).$$
Thus, if the graph $G$ does not have enough expansion, say $(1-\lambda) < \frac{1}{\sqrt{d}}$, then HGCNs become exponentially over-smoothing as the number of layers increases. This result suggests adding an initial input as in $\mathrm{hgc}(\cdot)$,
$$\tilde{\mathbf{H}} = \left((1-\alpha_l)\odot(\hat{\mathbf{A}}\otimes\mathbf{H}^{(l-1)})\right)\oplus\left(\alpha_l\odot\mathbf{H}^{(0)}\right),$$
seeking to interfere with the decrease of the Dirichlet energy, which is the motivation of R-HGCNs.

We now investigate the interference of the initial input in detail. For simplicity, we assume that the features in the process all have positive entries, so that ReLU does not affect the evaluation of the Dirichlet energy. Using the same argument as in the proof of (12), in the case with the initial input we have
$$E(\mathbf{H}^{(l)}) = \left(\mathbf{y}_l \log_{\mathbf{o}}(\mathbf{z})\right)^\top \tilde{\Delta} \left(\mathbf{y}_l \log_{\mathbf{o}}(\mathbf{z})\right), \tag{14}$$
with $\mathbf{y}_l = (1-\beta_l)\mathbf{I} + \beta_l\mathbf{W}^{(l)}$ and $\mathbf{z} = \exp_{\mathbf{z}_1}\!\left(\alpha_l P_{\mathbf{o}\to\mathbf{z}_1}(\log_{\mathbf{o}}(\mathbf{H}^{(0)}))\right)$, where the last equality is due to the linearity of parallel transport, and $\mathbf{z}_1 = \exp_{\mathbf{o}}\!\left((1-\alpha_l)\hat{\mathbf{A}}\log_{\mathbf{o}}(\mathbf{H}^{(l-1)})\right) = \exp_{\mathbf{o}}(\mathbf{z}_2)$. By the definition of parallel transport,
$$P_{\mathbf{o}\to\mathbf{z}_1}\!\left(\log_{\mathbf{o}}(\mathbf{H}^{(0)})\right) = \log_{\mathbf{o}}(\mathbf{H}^{(0)}) - \frac{\langle\log_{\mathbf{o}}(\mathbf{z}_1), \log_{\mathbf{o}}(\mathbf{H}^{(0)})\rangle_{\mathcal{L}}}{d_{\mathcal{L}}(\mathbf{o},\mathbf{z}_1)^2}\left(\mathbf{z}_2 + \log_{\mathbf{z}_1}(\mathbf{o})\right). \tag{15}$$
Further, by definition,
$$\mathbf{z}_1 = \cosh\!\left(\|\mathbf{z}_2\|_{\mathcal{L}}\right)\mathbf{o} + \sinh\!\left(\|\mathbf{z}_2\|_{\mathcal{L}}\right)\frac{\mathbf{z}_2}{\|\mathbf{z}_2\|_{\mathcal{L}}}, \quad \mathbf{z}_2 = (1-\alpha_l)\hat{\mathbf{A}}\log_{\mathbf{o}}(\mathbf{H}^{(l-1)}), \tag{16}$$
and $\exp_{\mathbf{z}_1}(\mathbf{x}) = \cosh(\|\mathbf{x}\|_{\mathcal{L}})\mathbf{z}_1 + \sinh(\|\mathbf{x}\|_{\mathcal{L}})\frac{\mathbf{x}}{\|\mathbf{x}\|_{\mathcal{L}}}$. Then, again by the property of parallel transport, we have
$$\mathbf{z} = \theta_l \log_{\mathbf{o}}(\mathbf{H}^{(0)}) + \varphi_l \hat{\mathbf{A}}\mathbf{H}^{(l-1)} + \psi_l\,\mathbf{o}, \tag{17}$$
where $\theta_l$, $\varphi_l$, $\psi_l$ are coefficients depending on $\alpha_l$, $\beta_l$, and the Lorentzian norms of the above three vectors. Since $\theta_l$ only depends on $\alpha_l$ and $\mathbf{H}^{(0)}$, the effect of $\log_{\mathbf{o}}(\mathbf{H}^{(0)})$ is never negligible. Also, writing $\mathbf{z} = (z_0, \dots)$,
$$\log_{\mathbf{o}}(\mathbf{z}) = \frac{\operatorname{arcosh}(z_0)}{\sqrt{z_0^2-1}}\left(\mathbf{z} - z_0\,\mathbf{o}\right),$$
which removes $z_0$ from $\mathbf{z}$ and re-scales it by a large factor. Altogether, we have
$$E(\mathbf{H}^{(l)}) = \left(\tilde{\theta}_l \log_{\mathbf{o}}(\mathbf{H}^{(0)}) + \tilde{\varphi}_l \hat{\mathbf{A}}\mathbf{H}^{(l-1)} + \tilde{\psi}_l\,\mathbf{o}\right)^\top \tilde{\Delta}\left(\cdots\right) = \tilde{\theta}_l^2\, E(\mathbf{H}^{(0)}) + \cdots, \tag{18}$$
where $\tilde{\theta}_l$ is, similarly, not negligible. Thus we prove that in R-HGCNs with the initial input, the Dirichlet energy $E(\mathbf{H}^{(l)})$ is bounded away from zero, even when the Dirichlet energy of the corresponding HGCNs without the initial input decreases to zero.
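The following small NumPy sketch (our illustration, not the authors' code) computes the hyperbolic Dirichlet energy of Definition 5 and tracks it under repeated tangent-space propagation on a random graph, once without and once with a simplified initial-residual term. To keep the example short, the residual is mixed directly in the tangent space at the origin rather than through the parallel-transport addition of Definition 3, so the decay and the bounded-away-from-zero behaviour are only meant to mirror the inequalities (12) and (18) qualitatively.

```python
import numpy as np

def log_o(X):
    """Row-wise logarithmic map at the origin o = [1, 0, ..., 0] of the Lorentz model."""
    a = X[:, :1]  # a = -<o, x>_L = x0 >= 1
    coef = np.arccosh(np.clip(a, 1.0, None)) / np.sqrt(np.clip(a ** 2 - 1.0, 1e-15, None))
    out = coef * X
    out[:, 0] = 0.0  # coef * (x - a * o): the time coordinate vanishes
    return out

def exp_o(V):
    """Row-wise exponential map at the origin for tangent vectors V with V[:, 0] = 0."""
    nrm = np.clip(np.linalg.norm(V[:, 1:], axis=1, keepdims=True), 1e-15, None)
    return np.concatenate([np.cosh(nrm), np.sinh(nrm) * V[:, 1:] / nrm], axis=1)

def dirichlet_energy(H, A):
    """Hyperbolic Dirichlet energy E(H) = tr(log_o(H)^T Delta_tilde log_o(H)) (Definition 5)."""
    n = A.shape[0]
    A_t = A + np.eye(n)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(1)))
    delta = np.eye(n) - D_inv_sqrt @ A_t @ D_inv_sqrt
    V = log_o(H)
    return float(np.trace(V.T @ delta @ V))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, alpha = 8, 4, 0.1
    A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
    A = A + A.T
    P = np.eye(n) + A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(P.sum(1)))
    P = D_inv_sqrt @ P @ D_inv_sqrt  # normalized adjacency with self-loops (propagation matrix)
    H0 = exp_o(np.concatenate([np.zeros((n, 1)), rng.normal(size=(n, d))], axis=1))  # Eq. (9)
    H_plain, H_res = H0.copy(), H0.copy()
    for layer in range(1, 33):
        H_plain = exp_o(P @ log_o(H_plain))                                   # no initial input
        H_res = exp_o((1 - alpha) * (P @ log_o(H_res)) + alpha * log_o(H0))   # with initial input
        if layer in (1, 8, 32):
            print(layer, dirichlet_energy(H_plain, A), dirichlet_energy(H_res, A))
```

In this toy run the energy of the plain propagation shrinks toward zero with depth, while the variant that re-injects the initial features retains a non-vanishing energy, which is the qualitative behaviour the analysis above predicts.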
Product Manifold

We use a product manifold of Lorentz models as the embedding space. The Lorentz components of the product manifold are independent of each other. The product manifold is the Cartesian product of a sequence of Riemannian manifolds, each of which is called a component. Given a sequence of Lorentz models $\mathbb{L}^{d_1}_1, \dots, \mathbb{L}^{d_j}_j, \dots, \mathbb{L}^{d_k}_k$, where $d_j$ denotes the dimension of the $j$-th component, the product manifold is defined as $\mathbb{L} = \mathbb{L}^{d_1}_1 \times \cdots \times \mathbb{L}^{d_j}_j \times \cdots \times \mathbb{L}^{d_k}_k$. The coordinates of a point $\mathbf{x}$ on $\mathbb{L}$ are written as $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_j, \dots, \mathbf{x}_k]$, where $\mathbf{x}_j \in \mathbb{L}^{d_j}_j$. Similarly, the coordinates of a tangent vector $\mathbf{v} \in \mathcal{T}_{\mathbf{x}}\mathbb{L}$ are written as $\mathbf{v} = [\mathbf{v}_1, \dots, \mathbf{v}_j, \dots, \mathbf{v}_k]$, where $\mathbf{v}_j \in \mathcal{T}_{\mathbf{x}_j}\mathbb{L}^{d_j}_j$. For $\mathbf{x}, \mathbf{y} \in \mathbb{L}$ and $\mathbf{v} \in \mathcal{T}_{\mathbf{x}}\mathbb{L}$, the exponential and logarithmic maps on $\mathbb{L}$ are defined as
$$\exp_{\mathbf{x}}(\mathbf{v}) = [\exp_{\mathbf{x}_1}(\mathbf{v}_1), \dots, \exp_{\mathbf{x}_j}(\mathbf{v}_j), \dots, \exp_{\mathbf{x}_k}(\mathbf{v}_k)], \tag{19}$$
$$\log_{\mathbf{x}}(\mathbf{y}) = [\log_{\mathbf{x}_1}(\mathbf{y}_1), \dots, \log_{\mathbf{x}_j}(\mathbf{y}_j), \dots, \log_{\mathbf{x}_k}(\mathbf{y}_k)]. \tag{20}$$

Different from ordinary product manifolds, we use $\mathbb{L} = (\mathbb{L}^{d})_{\mathbf{o}_1} \times (\mathbb{L}^{d})_{\mathbf{o}_2} \times \cdots \times (\mathbb{L}^{d})_{\mathbf{o}_k}$, where the $(\mathbb{L}^{d})_{\mathbf{o}_i}$ are copies of $d$-dimensional Lorentz spaces with randomly prescribed origin points $\mathbf{o}_i \in \mathbb{L}^d$. This gives R-HGCNs the ability to extract node features from a wider range of perspectives. Mathematically, such a construction of product manifolds is inspired by the general construction of manifolds using Euclidean strata with different coordinates.

Hyperbolic Dropout

Hyperbolic dropout (HyperDrop) adds multiplicative Gaussian noise to the Lorentz components to regularize HGCNs and alleviate the over-fitting issue. Concretely, let $\mathbb{L} = \mathbb{L}^{d_1}_1 \times \cdots \times \mathbb{L}^{d_j}_j \times \cdots \times \mathbb{L}^{d_k}_k$ denote a product manifold of $k$ Lorentz models, where $d_j$ denotes the dimension of the $j$-th Lorentz component. Given an input $\boldsymbol{l} = [\boldsymbol{l}_1, \dots, \boldsymbol{l}_j, \dots, \boldsymbol{l}_k] \in \mathbb{L}$ on the product manifold, where $\boldsymbol{l}_j \in \mathbb{L}^{d_j}_j$, HyperDrop is formulated as
$$\mathbf{y} = [\mathbf{y}_1, \dots, \mathbf{y}_j, \dots, \mathbf{y}_k], \quad \mathbf{y}_j = \xi_j \odot f_{\theta_j}(\boldsymbol{l}_j), \quad \xi_j \sim \mathcal{N}(1, \sigma^2), \tag{21}$$
where $\xi_j$ is multiplicative Gaussian noise drawn from the Gaussian distribution $\mathcal{N}(1, \sigma^2)$. Following [Srivastava et al. 2014], we set $\sigma^2 = \eta/(1-\eta)$, where $\eta$ denotes the drop rate. $\odot$ denotes the Lorentz scalar multiplication, the generalization of scalar multiplication to Lorentz representations defined in Definition 2. $f_{\theta_j}(\cdot)$ can be any realization of a desirable function, such as a neural network with parameters $\theta_j$.

Note that we sample $\xi_j$ from a Gaussian distribution instead of the Bernoulli distribution used in standard dropout for the following reason. If $\xi_j$ were drawn from the Bernoulli distribution and happened to be 0 (with probability $\eta$) at the $\ell$-th layer, the information flow of the $j$-th Lorentz component would be interrupted, deactivating the $j$-th Lorentz component after the $\ell$-th layer. In contrast, for $\xi_j$ drawn from a Gaussian distribution with mean 1, being exactly 0 is a zero-probability event, so the $j$-th Lorentz component always works.

We may also interpret HyperDrop from a Bayesian perspective. For convenience, we take a single Lorentz model and the Lorentz linear transformation as an example, i.e., $\mathbf{y} = \xi \odot f_{\boldsymbol{\theta}}(\boldsymbol{l})$ with $f_{\boldsymbol{\theta}}(\boldsymbol{l}) = \boldsymbol{\theta} \otimes \boldsymbol{l}$. We have
$$\mathbf{y} = \xi \odot (\boldsymbol{\theta} \otimes \boldsymbol{l}), \quad \xi \sim \mathcal{N}(1, \sigma^2), \tag{22}$$
$$\mathbf{y} = \mathbf{M} \otimes \boldsymbol{l}, \quad m_{r,c} = \xi\,\theta_{r,c}, \quad \xi \sim \mathcal{N}(1, \sigma^2), \tag{23}$$
where $\otimes$ denotes the Lorentz matrix-vector multiplication defined in Definition 1, $\mathbf{M}$ is the matrix with entries $m_{r,c}$, and $\boldsymbol{\theta}$ is the matrix with entries $\theta_{r,c}$. Eq. (23) can be interpreted as a Bayesian treatment in which the posterior distribution of the weights is given by a Gaussian distribution, i.e., $q_{\phi}(m_{r,c}) = \mathcal{N}(\theta_{r,c}, \sigma^2\theta_{r,c}^2)$. The HyperDrop sampling procedure in Eq. (22) can then be interpreted as arising from a reparameterization of the posterior over the parameters $\mathbf{M}$, as shown in Eq. (23).
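Below is a minimal sketch of HyperDrop as we read Eq. (21): one multiplicative Gaussian noise sample per Lorentz component, applied through the Lorentz scalar multiplication so that every component stays on its manifold. For brevity the transformation f is taken as the identity and a single noise scalar is drawn per component (the network architecture later uses per-node noise); all names are illustrative rather than the authors' API.

```python
import numpy as np

def exp_o(V):
    """Row-wise exponential map at the origin for tangent vectors V with V[:, 0] = 0."""
    nrm = np.clip(np.linalg.norm(V[:, 1:], axis=1, keepdims=True), 1e-15, None)
    return np.concatenate([np.cosh(nrm), np.sinh(nrm) * V[:, 1:] / nrm], axis=1)

def log_o(X):
    """Row-wise logarithmic map at the origin."""
    a = X[:, :1]
    coef = np.arccosh(np.clip(a, 1.0, None)) / np.sqrt(np.clip(a ** 2 - 1.0, 1e-15, None))
    out = coef * X
    out[:, 0] = 0.0
    return out

def hyper_drop(components, drop_rate, rng, training=True):
    """HyperDrop (Eq. 21): multiplicative Gaussian noise xi ~ N(1, eta / (1 - eta)) per component,
    applied via the Lorentz scalar multiplication xi (.) x = exp_o(xi * log_o(x))."""
    if not training:
        return components
    sigma = np.sqrt(drop_rate / (1.0 - drop_rate))
    out = []
    for H_j in components:              # one (n, d_j + 1) array per Lorentz component
        xi = rng.normal(1.0, sigma)     # sampled per component; almost surely never exactly zero
        out.append(exp_o(xi * log_o(H_j)))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, eta = 5, 2, 8, 0.5         # k copies of a 2-dimensional Lorentz model
    comps = [exp_o(np.concatenate([np.zeros((n, 1)), rng.normal(size=(n, d))], axis=1))
             for _ in range(k)]
    noisy = hyper_drop(comps, drop_rate=eta, rng=rng)
    on_manifold = all(np.allclose(-H[:, 0] ** 2 + (H[:, 1:] ** 2).sum(1), -1.0, atol=1e-6)
                      for H in noisy)
    print("all components stay on their Lorentz manifolds:", on_manifold)
```

Because the perturbation is applied in the tangent space and mapped back with exp_o, the Lorentz constraint of Eq. (1) is preserved exactly, which is the point of injecting noise this way instead of perturbing coordinates directly.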
Residual Hyperbolic Graph Convolutional Network

We now describe the architecture of R-HGCN and its effect in preventing over-smoothing. In fact, we always combine R-HGCN with the initial input. Our model is of the form $\mathbb{L} = (\mathbb{L}^{d})_{\mathbf{o}_1} \times (\mathbb{L}^{d})_{\mathbf{o}_2} \times \cdots \times (\mathbb{L}^{d})_{\mathbf{o}_k}$, where the $(\mathbb{L}^{d})_{\mathbf{o}_i}$ are copies of $d$-dimensional Lorentz spaces with randomly prescribed origin points $\mathbf{o}_i \in \mathbb{L}^d$. For any $\mathbf{x} = (\mathbf{x}_1, \dots, \mathbf{x}_k) \in \mathbb{L}$, its Dirichlet energy is defined as
$$E(\mathbf{x}) = \max_{1\le i\le k}\left(\log_{\mathbf{o}_i}(\mathbf{x}_i)^\top \tilde{\Delta}\, \log_{\mathbf{o}_i}(\mathbf{x}_i)\right) := \max_{1\le i\le k} E_i(\mathbf{x}_i). \tag{24}$$
By an argument similar to the proof of (12), we can estimate each $E_i(\mathbf{H}^{(l)}_i)$ separately and then take the maximum over the components, which may be better behaved than any single component due to possible fluctuations.

Network Architecture. Let $\mathbb{L} = (\mathbb{L}^{d})_{\mathbf{o}_1} \times (\mathbb{L}^{d})_{\mathbf{o}_2} \times \cdots \times (\mathbb{L}^{d})_{\mathbf{o}_k}$ denote the product manifold of $k$ Lorentz models, where $d_j$ is the dimension of the $j$-th Lorentz model. The initial node features $\mathbf{H}^{(0)}$ on the product manifold of Lorentz models are given by (with $\mathbf{o} = [\mathbf{o}_1, \dots, \mathbf{o}_j, \dots, \mathbf{o}_k]$)
$$\mathbf{H}^{(0)} = [\mathbf{H}^{1,(0)}, \dots, \mathbf{H}^{j,(0)}, \dots, \mathbf{H}^{k,(0)}], \quad \mathbf{H}^{j,(0)}_i = \exp_{\mathbf{o}_j}\!\left([0, \mathbf{X}_i]\right), \tag{25}$$
where $\mathbf{H}^{j,(0)}_i \in \mathbb{R}^{d_j+1}$ denotes the $i$-th row of the node features $\mathbf{H}^{j,(0)} \in \mathbb{R}^{n\times(d_j+1)}$ on the $j$-th Lorentz component. The graph convolution on the product manifold of Lorentz models combined with HyperDrop is realized by instantiating $f(\cdot)$ in Eq. (21) as the hyperbolic graph convolution operator $\mathrm{hgc}(\cdot)$, i.e., Eq. (11) becomes
$$\mathbf{H}^{(\ell)} = [\mathbf{H}^{1,(\ell)}, \dots, \mathbf{H}^{j,(\ell)}, \dots, \mathbf{H}^{k,(\ell)}], \quad \mathbf{H}^{j,(\ell)}_i = \xi^j_i \odot \mathrm{hgc}(\mathbf{H}^{j,(\ell-1)})_i, \quad \xi^j_i \sim \mathcal{N}(1, \sigma^2),$$
where $\mathbf{H}^{j,(\ell)} \in \mathbb{R}^{n\times(d_j+1)}$ denotes the node features on the $j$-th Lorentz component at the $\ell$-th layer, $\mathbf{H}^{j,(\ell)}_i \in \mathbb{R}^{d_j+1}$ is the $i$-th row (i.e., the $i$-th node) of $\mathbf{H}^{j,(\ell)}$, $\odot$ denotes the Lorentz scalar multiplication defined in Definition 2, and $\xi^j_i$ is random multiplicative noise drawn from the Gaussian distribution $\mathcal{N}(1, \sigma^2)$. We set $\sigma^2 = \eta/(1-\eta)$, where $\eta$ denotes the drop rate.

The node features at the last layer can be used for downstream tasks. Taking the node classification task as an example, we map node features to the tangent spaces of the product manifold and send the tangent representations to a fully-connected layer followed by a softmax for classification.
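The readout just described can be sketched as below (our reading, not the released code): the last-layer features of each Lorentz component are pulled back to the tangent space at that component's own origin via Eq. (20), the tangent vectors are concatenated, and a fully-connected layer with softmax produces class probabilities. The random per-component origins and the stand-in last-layer features are generated here only to make the snippet runnable.

```python
import numpy as np

def linner(x, y):
    """Row-wise Lorentz inner product."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def expmap(x, v):
    """Row-wise exponential map exp_x(v)."""
    n = np.sqrt(np.clip(linner(v, v), 1e-15, None))[..., None]
    return np.cosh(n) * x + np.sinh(n) * v / n

def logmap(x, y):
    """Row-wise logarithmic map log_x(y)."""
    a = -linner(x, y)[..., None]
    coef = np.arccosh(np.clip(a, 1.0, None)) / np.sqrt(np.clip(a ** 2 - 1.0, 1e-15, None))
    return coef * (y - a * x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def readout(components, origins, W_fc, b_fc):
    """Classification head: per-component log map at its own origin, concatenation, FC + softmax."""
    tangent = [logmap(o_j, H_j) for H_j, o_j in zip(components, origins)]
    Z = np.concatenate(tangent, axis=1)  # shape (n, k * (d + 1))
    return softmax(Z @ W_fc + b_fc)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, n_classes = 6, 2, 8, 3
    o_bar = np.zeros(d + 1)
    o_bar[0] = 1.0
    # randomly prescribed per-component origins o_j, each a point of L^d
    origins = [expmap(o_bar, np.concatenate([np.zeros((1, 1)), 0.5 * rng.normal(size=(1, d))], 1))
               for _ in range(k)]
    # stand-in for the last-layer node features H^{j,(L)} of each Lorentz component
    components = [expmap(o_bar, np.concatenate([np.zeros((n, 1)), rng.normal(size=(n, d))], 1))
                  for _ in range(k)]
    W_fc = 0.1 * rng.normal(size=(k * (d + 1), n_classes))
    b_fc = np.zeros(n_classes)
    probs = readout(components, origins, W_fc, b_fc)
    print(probs.shape, bool(np.allclose(probs.sum(1), 1.0)))
```

Because each component is pulled back at its own origin, the concatenated tangent representation reflects the different perspectives provided by the different origin points before the Euclidean classifier is applied.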
Experiments

Experiments are performed on the semi-supervised node classification task. We first evaluate the performance of R-HGCN under different model configurations, including various numbers of graph convolution layers and different structures of product manifolds. Then, we compare with several state-of-the-art Euclidean GCNs and HGCNs, showing that R-HGCN achieves competitive results. Further, we compare with DropConnect [Wan et al. 2013], a related regularization method for deep GCNs.

Datasets and Baselines

We use four standard, commonly used graph datasets: PUBMED, CITESEER, CORA, and AIRPORT [Sen et al. 2008]. Dataset statistics are summarized in Table 1. Experimental details are provided in the supplementary material.

| | PUBMED | CITESEER | CORA | AIRPORT |
|---|---|---|---|---|
| Classes | 3 | 6 | 7 | 4 |
| Nodes | 19,717 | 3,327 | 2,708 | 3,188 |
| Edges | 44,338 | 4,732 | 5,429 | 18,631 |
| Features | 500 | 3,703 | 1,433 | 4 |

Table 1: Dataset statistics.

Validation Experiments

Here we demonstrate the effectiveness of R-HGCN and our regularization method under different model configurations. For R-HGCN, increasing the number of hyperbolic graph convolution layers almost always brings improvements on the three datasets. The criterion for judging the effectiveness of HyperDrop is whether the performance of R-HGCN is improved with its help. In Table 2, we report the performance of HyperDrop with various numbers of graph convolution layers and structures of product manifolds on PUBMED, CITESEER, CORA, and AIRPORT. We observe that in most experiments, HyperDrop improves the performance of R-HGCN. For example, on CORA, R-HGCN[2×8] obtains 1.7%, 0.7%, 0.3%, and 0.2% gains with 4, 8, 16, and 32 layers, respectively. The stable improvements demonstrate that HyperDrop can effectively improve the generalization ability of R-HGCN.

Table 2: Comparisons on various numbers of graph convolution layers and different structures of 16-dimensional product manifolds with and without HyperDrop ("+HD"). We also compare with the related work GCNII using a 16-dimensional embedding space. Mean accuracy (%) and standard deviation are reported. P-HGCN[d×m] denotes the P-HGCN with a product manifold of m d-dimensional Lorentz models.

| Dataset | Method | 4 layers | 4 +HD | 8 layers | 8 +HD | 16 layers | 16 +HD | 32 layers | 32 +HD |
|---|---|---|---|---|---|---|---|---|---|
| PUBMED | GCNII | 79.3±0.3 | - | 79.9±0.3 | - | 79.9±1.7 | - | 80.0±1.9 | - |
| PUBMED | P-HGCN[2×8] | 79.1±0.2 | 79.8±0.3 | 80.0±0.1 | 80.3±0.1 | 79.1±0.2 | 79.2±0.3 | 79.2±0.3 | 79.5±0.4 |
| PUBMED | P-HGCN[4×4] | 79.0±0.2 | 79.5±0.2 | 79.8±0.1 | 80.1±0.3 | 79.9±0.3 | 80.0±0.2 | 79.8±0.4 | 80.3±0.3 |
| PUBMED | P-HGCN[8×2] | 79.0±0.2 | 79.4±0.2 | 79.7±0.2 | 79.9±0.2 | 80.0±0.3 | 80.1±0.3 | 80.1±0.2 | 80.3±0.4 |
| PUBMED | P-HGCN[16×1] | 78.7±0.3 | 79.2±0.3 | 79.4±0.1 | 79.9±0.2 | 79.3±0.3 | 79.5±0.4 | 79.2±0.2 | 80.1±0.4 |
| CITESEER | GCNII | 69.3±2.0 | - | 70.6±1.4 | - | 70.5±1.3 | - | 70.8±1.6 | - |
| CITESEER | P-HGCN[2×8] | 71.4±0.2 | 72.0±0.4 | 72.1±0.2 | 72.3±0.3 | 71.9±0.5 | 71.9±0.7 | 72.1±0.6 | 72.3±0.7 |
| CITESEER | P-HGCN[4×4] | 71.4±0.2 | 72.0±0.4 | 71.9±0.3 | 72.2±0.3 | 71.5±0.5 | 71.7±1.0 | 71.4±0.2 | 71.9±0.7 |
| CITESEER | P-HGCN[8×2] | 71.2±0.8 | 70.4±0.9 | 68.6±0.8 | 71.1±0.7 | 72.0±0.2 | 71.9±0.9 | 72.0±0.1 | 72.0±0.5 |
| CITESEER | P-HGCN[16×1] | 71.3±0.3 | 72.4±0.3 | 71.9±0.3 | 72.5±0.5 | 72.2±0.4 | 72.3±0.4 | 72.3±0.6 | 72.5±0.9 |
| CORA | GCNII | 76.6±2.4 | - | 79.4±1.4 | - | 81.3±1.0 | - | 81.5±1.4 | - |
| CORA | P-HGCN[2×8] | 80.3±0.7 | 82.0±0.5 | 81.5±0.2 | 82.2±0.4 | 81.8±0.2 | 82.1±0.2 | 81.9±0.3 | 82.1±0.1 |
| CORA | P-HGCN[4×4] | 80.3±0.8 | 81.7±0.5 | 81.3±0.2 | 81.9±0.4 | 81.7±0.2 | 81.9±0.2 | 81.6±0.5 | 82.1±0.7 |
| CORA | P-HGCN[8×2] | 77.9±0.8 | 80.8±0.8 | 80.2±0.9 | 81.5±0.3 | 81.7±0.2 | 81.8±0.2 | 81.9±0.5 | 82.3±0.8 |
| CORA | P-HGCN[16×1] | 79.4±0.8 | 81.9±0.7 | 80.4±0.4 | 82.5±0.4 | 81.6±0.6 | 81.5±0.5 | 81.6±0.7 | 81.4±0.6 |
| AIRPORT | GCNII | 88.9±0.5 | - | 89.1±0.3 | - | 89.2±0.4 | - | 89.6±0.7 | - |
| AIRPORT | P-HGCN[2×8] | 88.9±0.4 | 89.1±0.2 | 89.2±1.0 | 89.4±0.9 | 89.1±0.3 | 89.3±0.5 | 89.7±0.2 | 90.0±0.4 |
| AIRPORT | P-HGCN[4×4] | 88.6±0.6 | 89.0±0.2 | 89.3±0.4 | 89.4±0.3 | 89.6±0.1 | 89.6±0.2 | 89.7±0.1 | 90.2±0.7 |
| AIRPORT | P-HGCN[8×2] | 88.7±0.2 | 88.7±0.4 | 89.3±0.5 | 89.6±0.7 | 89.4±0.2 | 89.8±0.5 | 89.6±0.4 | 89.7±0.7 |
| AIRPORT | P-HGCN[16×1] | 88.5±0.7 | 88.6±0.4 | 88.6±0.3 | 88.5±0.5 | 88.7±0.2 | 88.9±0.5 | 89.0±0.3 | 89.2±0.5 |

Table 3: Mean accuracy (%) and standard deviation on PUBMED, CITESEER, and CORA. We set the dimension of the embedding spaces to 16 for all methods and the number of graph convolution layers to 8 (number in parentheses) for the deep models, i.e., GCNII and P-HGCN. P-HGCN[d×m] denotes the P-HGCN with a product manifold of m d-dimensional Lorentz models.

| Method | PUBMED | CITESEER | CORA |
|---|---|---|---|
| GCN [Kipf and Welling 2017] | 79.1±0.3 | 71.2±0.6 | 81.3±0.5 |
| GAT [Veličković et al. 2017] | 77.7±0.2 | 70.9±0.4 | 82.4±0.6 |
| GraphSAGE [Hamilton, Ying, and Leskovec 2017] | 77.3±0.3 | 67.8±1.1 | 77.3±0.8 |
| SGC [Wu et al. 2019] | 69.3±0.0 | 78.9±0.0 | 80.9±0.0 |
| APPNP [Klicpera, Bojchevski, and Günnemann 2019] | 80.1±0.2 | 71.6±0.4 | 83.7±0.5 |
| GCNII(8) [Chen et al. 2020] | 79.9±0.3 | 72.4±0.9 | 83.5±0.7 |
| HGCN [Chami et al. 2019] | 80.3±0.3 | - | 79.9±0.2 |
| H2H-GCN [Dai et al. 2021] | 79.9±0.5 | - | 82.8±0.4 |
| κ-GCN [Bachmann, Bécigneul, and Ganea 2020] | 78.3±0.6 | 70.7±0.5 | 80.0±0.6 |
| LGCN [Zhang et al. 2021] | 78.6±0.7 | 71.9±0.7 | 83.3±0.7 |
| P-HGCN[16×1](8) | 79.4±0.1 | 71.9±0.3 | 80.4±0.4 |
| P-HGCN[16×1](8)+HyperDrop | 79.9±0.2 | 72.5±0.5 | 82.5±0.4 |
| P-HGCN[2×8](8) | 80.0±0.1 | 72.1±0.2 | 81.5±0.2 |
| P-HGCN[2×8](8)+HyperDrop | 80.3±0.1 | 72.3±0.3 | 82.2±0.4 |
We also compare R-HGCN with a deep Euclidean method, GCNII. The hyperbolic residual connection and hyperbolic identity mapping in R-HGCN are inspired by GCNII. The main difference between R-HGCN and GCNII is that R-HGCN performs graph representation learning in hyperbolic spaces, while GCNII works in Euclidean spaces. As shown in Table 2, R-HGCN shows superiority compared to GCNII. Note that R-HGCN is only a baseline model we developed for evaluating the effectiveness of HyperDrop, and we forgo extra training tricks for a clear evaluation. For example, as shown in the comparisons with DropConnect (Table 5), R-HGCN obtains the same mean accuracy of 83.5% as GCNII with 8 layers on CORA when using HyperDrop and DropConnect [Wan et al. 2013] together. DropConnect is used in other hyperbolic graph convolutional networks, such as HGCN [Chami et al. 2019] and LGCN [Zhang et al. 2021]. We attribute the superior performance of R-HGCN compared to GCNII to the representational capability of hyperbolic spaces for hierarchically structured data. It confirms the significance of hyperbolic representation learning.

Ablation Experiments

We conducted ablation experiments on the PUBMED dataset to observe the effect of the proposed hyperbolic residual connection on R-HGCN performance under different dimension-selection approaches. As can be seen from Table 4, without the hyperbolic residual connection, the performance of the model decreases to varying degrees. Moreover, as the number of layers increases, the decline in performance becomes more pronounced. The experimental results confirm that the hyperbolic residual connection contributes substantially to model performance.

Table 4: Test accuracy (%) comparisons for different numbers of layers and model structures with and without the hyperbolic residual connection (HRC). P-HGCN[d×m] denotes the P-HGCN with a product manifold of m d-dimensional Lorentz models.

| Layers | P-HGCN[2×8] with HRC | P-HGCN[2×8] w/o HRC | P-HGCN[4×4] with HRC | P-HGCN[4×4] w/o HRC | P-HGCN[8×2] with HRC | P-HGCN[8×2] w/o HRC | P-HGCN[16×1] with HRC | P-HGCN[16×1] w/o HRC |
|---|---|---|---|---|---|---|---|---|
| 2 | 78.9±0.4 | 78.9±0.2 | 79.0±0.3 | 78.8±0.3 | 78.8±0.3 | 78.9±0.4 | 78.7±0.2 | 78.6±0.4 |
| 4 | 79.1±0.4 | 79.3±0.2 | 79.0±0.2 | 79.0±0.6 | 79.2±0.3 | 78.7±0.4 | 79.2±0.3 | 78.6±0.4 |
| 8 | 80.3±0.0 | 78.6±0.2 | 80.2±0.2 | 79.1±0.2 | 80.1±0.1 | 78.5±0.6 | 79.8±0.2 | 76.1±3.5 |
| 16 | 79.2±0.3 | 29.8±11.1 | 80.0±0.2 | 60.1±7.8 | 80.1±0.3 | 60.1±7.8 | 79.5±0.4 | 49.9±2.1 |

Performance Comparisons

The comparisons with several state-of-the-art Euclidean GCNs and HGCNs are shown in Table 3. We have three observations. First and most importantly, compared with other HGCNs, which are typically shallow models, R-HGCN shows better results on PUBMED and CITESEER, though on PUBMED, HGCN also achieves the best accuracy of 80.3%. Note that HGCN uses an extra link prediction task for pre-training, while R-HGCN does not use this training trick, for a clear evaluation of HyperDrop; the performance of HGCN decreases when the link prediction pre-training is not used. On CORA, LGCN achieves the highest mean accuracy of 83.3% among HGCNs. Note that both HGCN and LGCN utilize the DropConnect [Wan et al. 2013] technique for training. As shown in Table 5, R-HGCN obtains 83.5% mean accuracy on CORA when also using DropConnect, which is 0.2% higher than that of LGCN. Second, both R-HGCN[16×1] and R-HGCN[2×8] benefit from HyperDrop on the three datasets.
This shows that HyperDrop alleviates the over-fitting issue in hyperbolic graph convolutional networks and improves the generalization ability of R-HGCN on the test set. Third, compared with Euclidean GCNs, R-HGCN combined with HyperDrop achieves the best results on PUBMED and CITESEER. This confirms the superiority of hyperbolic representation learning for modeling graph data.

Comparisons with DropConnect

Table 5 shows the performance of HyperDrop and DropConnect [Wan et al. 2013] on PUBMED, CITESEER, and CORA using a 16-dimensional R-HGCN. Since there was no dropout method tailored for hyperbolic representations before HyperDrop, some works [Chami et al. 2019; Zhang et al. 2021] use DropConnect as a regularization. DropConnect is a variant of dropout that randomly zeroes out elements of the Euclidean parameters of a model; it can be used in hyperbolic graph convolutional networks because their parameters are Euclidean. For DropConnect, we search the drop rate from 0.1 to 0.9 and report the best results. DropConnect obtains improvements on PUBMED and CORA but not on CITESEER. In contrast, HyperDrop achieves stable improvements on the three datasets, and higher mean accuracy on PUBMED and CITESEER compared to DropConnect. HyperDrop and DropConnect are two dropout methods, the former working on hyperbolic representations and the latter on Euclidean parameters. They can work together effectively for better generalization of R-HGCN. As the results on CITESEER and CORA show, using HyperDrop and DropConnect together performs better than using either individually.

Table 5: Comparisons of HyperDrop and DropConnect.

| Method | PUBMED | CITESEER | CORA |
|---|---|---|---|
| w/o dropout | 79.4±0.1 | 71.9±0.3 | 80.3±0.4 |
| DropConnect | 79.7±0.3 | 71.9±0.3 | 83.0±0.6 |
| HyperDrop | 79.9±0.2 | 72.5±0.5 | 82.5±0.4 |
| Both | 79.9±0.3 | 72.7±0.5 | 83.5±0.9 |

Conclusion

In this paper, we have proposed R-HGCN, a product-manifold-based residual hyperbolic graph convolutional network for overcoming the over-smoothing problem. The residual connections prevent node representations from becoming indistinguishable through the hyperbolic residual connection and the hyperbolic identity mapping. The product manifold with different origin points also provides a wider range of perspectives on the data. A novel hyperbolic dropout method, HyperDrop, is proposed to alleviate the over-fitting issue while deepening models. Experiments have demonstrated the effectiveness of R-HGCNs under various numbers of graph convolution layers and different structures of product manifolds.

Acknowledgements

This work was supported by the Natural Science Foundation of China (NSFC) under Grants No. 62172041 and No. 62176021, the Natural Science Foundation of Shenzhen under Grant No. JCYJ20230807142703006, and the 2023 Shenzhen National Science Foundation (No. 20231128220938001).

References

Bachmann, G.; Bécigneul, G.; and Ganea, O. 2020. Constant curvature graph convolutional networks. In International Conference on Machine Learning (ICML), 486-496.
Cai, C.; and Wang, Y. 2020. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318.
Chami, I.; Ying, Z.; Ré, C.; and Leskovec, J. 2019. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 4868-4879.
Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; and Li, Y. 2020. Simple and deep graph convolutional networks. In International Conference on Machine Learning (ICML), 1725-1735.
Dai, J.; Wu, Y.; Gao, Z.; and Jia, Y. 2021. A hyperbolic-to-hyperbolic graph convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 154-163.
De Sa, C.; Gu, A.; Ré, C.; and Sala, F. 2018. Representation tradeoffs for hyperbolic embeddings. Proceedings of Machine Learning Research, 80: 4460.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), 1024-1034.
Khrulkov, V.; Mirvakhabova, L.; Ustinova, E.; Oseledets, I.; and Lempitsky, V. 2020. Hyperbolic image embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6418-6428.
Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
Klicpera, J.; Bojchevski, A.; and Günnemann, S. 2019. Predict then propagate: Graph neural networks meet personalized PageRank. In International Conference on Learning Representations (ICLR).
Krioukov, D.; Papadopoulos, F.; Kitsak, M.; Vahdat, A.; and Boguñá, M. 2010. Hyperbolic geometry of complex networks. Physical Review E, 82(3): 036106.
Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 3538-3545.
Liu, Q.; Nickel, M.; and Kiela, D. 2019. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 8230-8241.
Liu, Y.; and Lang, B. 2023. McH-HGCN: multi-curvature hyperbolic heterogeneous graph convolutional network with type triplets. Neural Computing and Applications, 35(20): 15033-15049.
Nickel, M.; and Kiela, D. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NeurIPS), 6338-6347.
Nickel, M.; and Kiela, D. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning (ICML), 3776-3785.
Papadopoulos, F.; Kitsak, M.; Serrano, M. Á.; Boguñá, M.; and Krioukov, D. 2012. Popularity versus similarity in growing networks. Nature, 489(7417): 537-540.
Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine, 29(3): 93.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research (JMLR), 15(1): 1929-1958.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. In International Conference on Learning Representations (ICLR).
Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning (ICML), 1058-1066.
Wu, F.; Zhang, T.; Souza Jr., A. H. d.; Fifty, C.; Yu, T.; and Weinberger, K. Q. 2019. Simplifying graph convolutional networks. In International Conference on Machine Learning (ICML), 6861-6871.
Yao, S.; Pi, D.; and Chen, J. 2022. Knowledge embedding via hyperbolic skipped graph convolutional networks. Neurocomputing, 480: 119-130.
Zhang, Y.; Wang, X.; Shi, C.; Liu, N.; and Song, G. 2021. Lorentzian graph convolutional networks. In Proceedings of the Web Conference (WWW), 1249-1261.