# Orthogonal Graph Neural Networks

Kai Guo¹, Kaixiong Zhou², Xia Hu², Yu Li³, Yi Chang¹,⁴, Xin Wang¹*

¹School of Artificial Intelligence, Jilin University, Changchun, China
²Department of Computer Science, Rice University, USA
³College of Computer Science and Technology, Jilin University, Changchun, China
⁴International Center of Future Science, Jilin University
guokai20@mails.jlu.edu.cn, Kaixiong.Zhou@rice.edu, xia.hu@rice.edu, liyu18@mails.jlu.edu.cn, yichang@jlu.edu.cn, xinwang@jlu.edu.cn

*Corresponding author. Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Graph neural networks (GNNs) have received tremendous attention due to their superiority in learning node representations. These models rely on message passing and feature transformation functions to encode the structural and feature information from neighbors. However, stacking more convolutional layers significantly decreases the performance of GNNs. Most recent studies attribute this limitation to the over-smoothing issue, where node embeddings converge to indistinguishable vectors. Through a number of experimental observations, we argue that the main factor degrading the performance is the unstable forward normalization and backward gradient resulting from the improper design of the feature transformation, especially for shallow GNNs where over-smoothing has not yet happened. Therefore, we propose a novel orthogonal feature transformation, named Ortho-GConv, which can generally augment existing GNN backbones to stabilize model training and improve the model's generalization performance. Specifically, we maintain the orthogonality of the feature transformation comprehensively from three perspectives, namely hybrid weight initialization, orthogonal transformation, and orthogonal regularization. By equipping existing GNNs (e.g., GCN, JKNet, GCNII) with Ortho-GConv, we demonstrate the generality of the orthogonal feature transformation in enabling stable training, and show its effectiveness on node and graph classification tasks.

## Introduction

Graph neural networks (GNNs) (Kipf and Welling 2017) and their variants have been widely applied to analyze graph-structured data, such as social networks (Tian et al. 2019), molecular networks (Zhou et al. 2021b) and citation networks (Zhou et al. 2021a, 2019). Based upon the input node features and graph topology, GNNs apply neighbor aggregation and feature transformation schemes to update the representation of each node recursively. While the neighbor aggregation passes neighborhood messages along the adjacency edges, the feature transformation projects node embeddings to improve the model's learning ability. Despite their superior effectiveness, a key limitation of GNNs is that their performance decreases significantly with layer stacking. Most of the previous studies (Chen et al. 2020a; Li, Han, and Wu 2018; Oono and Suzuki 2019; Zhou et al. 2021a) attribute this limitation to the over-smoothing issue, which indicates that the node representations become indistinguishable due to the recursive neighbor aggregation upon the graph structure. A number of models have been recently proposed to alleviate the over-smoothing issue, including skip connection (Chen et al. 2020c,b; Li et al. 2019) and graph augmentation (Rong et al. 2020).
Their main idea is to avoid the overwhelming amount of neighborhood information and to strengthen each node's own features at every graph convolutional layer.

In contrast to the previous extensive studies of over-smoothing in extremely deep GNNs, we shift the research attention to the primary factor compromising the performance of shallow GNNs. Our research is motivated by an inconsistent observation in Figure 1(d): the node classification accuracy of GNNs drops rapidly once the model depth is only slightly increased (e.g., up to 8 layers), where the over-smoothing status should be far from being reached. By simply removing the feature transformation module from GNNs, it is further observed that GNNs surprisingly perform steadily even with tens of graph convolutional layers. This motivates us to pose the following research question: does the stability of the feature transformation play the dominating role in the model performance of shallow GNNs?

**Forward and backward signaling analysis.** To answer this question, we first systematically analyze the steadiness of feature transformation from both the forward inference and backward gradient directions. We apply two corresponding steadiness metrics: one measuring the signal magnifications of forward node embeddings and the other evaluating the norms of backward gradients. As shown in Figure 1(a) and (b), it is empirically demonstrated that vanilla GNNs suffer from forward embedding explosion and backward gradient vanishing. While the forward explosion greatly shifts the internal embedding distributions among layers and makes model training inefficient (Ioffe and Szegedy 2015), the gradient vanishing hampers the tuning of the feature transformation module to adapt to the downstream tasks. Therefore, we conclude and argue that the vanilla feature transformation damages steady model signaling in both the forward and backward directions, which in turn degrades the performance especially for shallow GNNs.

**Orthogonal graph convolutions.** To overcome the unstable training, we propose orthogonal graph convolutions to ensure the orthogonality of the feature transformation. Orthogonal weights (Trockman and Kolter 2021; Wang et al. 2020; Vorontsov et al. 2017) have been explored in convolutional and recurrent neural networks (CNNs and RNNs) to maintain the forward activation norm and to avoid gradient vanishing, which can accelerate training and improve adversarial robustness. To tailor the idea to graph data, we optimize the orthogonality of the feature transformation from three perspectives: (i) a hybrid weight initialization to trade off graph representation learning ability against the model's orthogonality; (ii) an orthogonal weight transformation to ensure orthogonality in the forward inference; and (iii) an orthogonal regularization to constrain orthogonality during the backward update. Our contributions are as follows:

- We propose two metrics to measure the steadiness of the forward inference as well as the backward gradient, and provide a systematic analysis to theoretically and empirically study the influence of unstable feature transformation on shallow GNNs.
- We propose the orthogonal graph convolutions, named Ortho-GConv, to achieve the orthogonality of feature transformation and stabilize the forward and backward signaling in GNNs.
- We conduct comprehensive experiments on both node and graph classification tasks to demonstrate the general effectiveness of our Ortho-GConv on various GNN backbones, including GCN, JKNet and GCNII.

## Related Work

**GNNs.** (Bruna et al. 2014) is the first remarkable work on GNNs, which develops graph convolution based on spectral graph theory for graph application tasks. Later, a series of GNN variants were proposed (Kipf and Welling 2017; Defferrard, Bresson, and Vandergheynst 2016; Henaff, Bruna, and LeCun 2015; Li et al. 2018; Zhou et al. 2020b). Although these models achieve good performance with two layers, stacking more layers hampers their performance. Recently, several studies argue that stacking more layers causes the over-smoothing issue, and methods such as APPNP (Klicpera, Bojchevski, and Günnemann 2019), JKNet (Xu et al. 2018), DropEdge (Rong et al. 2020) and GCNII (Chen et al. 2020c) have been proposed to solve it. However, (Liu, Gao, and Ji 2020) claim that the over-smoothing issue only happens when node representations are propagated repeatedly for a large number of iterations, especially on sparse graphs. Therefore, the few propagation iterations in shallow GNN models are insufficient for over-smoothing to occur. In contrast, we argue that unstable forward and backward signaling results in the poor performance of shallow GNN models.

**Orthogonal initialization and transformation.** The advantages of orthogonal weight initialization, i.e., ensuring that signals propagate through deep networks and preserving gradient norms, are explained by (Saxe, McClelland, and Ganguli 2014). Recently, several studies have explored orthogonal initialization in CNNs. (Xiao et al. 2018) propose a method for orthogonal convolutions and demonstrate that it allows CNN models to effectively explore large receptive fields without batch normalization or residual connections. (Trockman and Kolter 2021) propose the Cayley transformation to constrain the parameters in convolutional layers to make them orthogonal. (Huang et al. 2020) propose an efficient and stable orthogonalization method to learn a layer-wise orthogonal weight matrix in CNNs. Furthermore, the gradient-norm-preserving property also benefits remembering long-term dependencies in RNNs: (Vorontsov et al. 2017) constrain the transformation matrices to be orthogonal and address gradient vanishing and exploding in RNNs.

## Forward and Backward Signaling Analysis

In this section, we first introduce the notations and the GNN model. We then analyze the forward and backward signaling process, and validate our argument with empirical studies.

### Notations and GNNs

We denote matrices with boldface capital letters (e.g., $X$), vectors with boldface lowercase letters (e.g., $x$) and scalars with lowercase letters (e.g., $x$). An undirected graph is represented by $G = (V, E)$, where $V = \{v_i\}$ and $E = \{(v_i, v_j)\}$ denote the node and edge sets, respectively. Let $X \in \mathbb{R}^{n \times d}$ denote the node feature matrix, where the $i$-th row is the corresponding $d$-dimensional feature vector of node $v_i$. The adjacency matrix is defined as $A \in \mathbb{R}^{n \times n}$, which associates each edge $(v_i, v_j)$ with its element $A_{ij}$; and $D$ is the degree matrix. Let $\tilde{A} := A + I_n$ and $\tilde{D} := D + I_n$ be the adjacency and degree matrices of the graph augmented with self-loops. The normalized adjacency matrix is given by $\hat{A} := \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, which is widely used for spatial neighborhood aggregation in GNN models.
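To make the notation concrete, the short sketch below builds the self-loop-augmented, symmetrically normalized adjacency matrix $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ for a dense toy graph; the NumPy helper and the 4-node example are our own illustration, not code from the paper.

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                   # add self-loops
    deg = A_tilde.sum(axis=1)                 # degrees of the augmented graph
    D_inv_sqrt = np.diag(np.power(deg, -0.5))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization

# Toy 4-node path graph (hypothetical example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(normalized_adjacency(A).round(3))
```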
We use graph convolutional networks (GCNs) (Kipf and Welling 2017) as a typical example to illustrate how traditional GNNs conduct node representation learning and to explain the model stability problem in the following subsection. The forward inference at the $l$-th layer of GCNs is formally defined as:

$$H^{(l)} = \sigma\big(\hat{A} H^{(l-1)} W^{(l)}\big), \quad (1)$$

where $H^{(l)}$ denotes the node embedding matrix at the $l$-th layer; $H^{(0)}$ is given by $X$; $\sigma(\cdot)$ is the nonlinear activation function, such as ReLU; and $W^{(l)} \in \mathbb{R}^{d \times d}$ is the linear transformation matrix. It is observed that graph convolutions consist of two key steps: the spatial neighbor aggregation based upon matrix $\hat{A}$ and the feature transformation with matrix $W^{(l)}$. Let $L$ denote the model depth. The output embedding of node $v_i$, i.e., $h_i^{(L)}$ at the $i$-th row of $H^{(L)}$, can be used for the node or graph classification task.

### Forward and Backward Steadiness Analysis

Figure 1: (a) The signal magnifications $M_{sig}$ for GCNs with and without the orthogonal graph convolutions. (b) The gradient norms of vanilla GCN. (c) The gradient norms of GCN augmented with orthogonal graph convolutions. (d) Test accuracy and node embedding smoothness under different GCN model depths.

For the spatial GNN models based on Eq. (1), it is commonly known that stacking more layers tends to degrade the downstream task performance (e.g., node classification accuracy) significantly. Such a phenomenon is usually attributed to the over-smoothing issue (Li, Han, and Wu 2018), which states that the node embeddings become similar due to the recursive neighbor aggregation. By simplifying the feature transformation and non-linear activation, many theoretical studies have explained the recursive neighbor aggregation with low-pass filtering or Markov chains, proving that the node embeddings converge to a unique equilibrium as the model depth keeps increasing (Liu, Gao, and Ji 2020; Nt and Maehara 2019). Following this analysis, a variety of heuristic models have been proposed to improve the neighbor aggregation and relieve the over-smoothing issue. For example, skip connections combine node embeddings from the previous layer to preserve the initial node features (Chen et al. 2020c), and edge dropping randomly removes edges to avoid an overwhelming amount of neighborhood information (Rong et al. 2020).

We rethink the performance-degrading problem of GNNs with an empirical observation: vanilla GNNs even with a few layers suffer from an issue similar to that of deep GNNs. Specifically, we investigate GCNs of different model depths in terms of node classification accuracy and graph smoothness. The graph smoothness is defined as the average distance of node pairs (Liu, Gao, and Ji 2020), i.e., $D = \frac{1}{|V|^2} \sum_{v_i, v_j \in V} \| h_i^{(L)} - h_j^{(L)} \|$. As shown in Figure 1(d), on the Cora dataset (Sen et al. 2008), the accuracy and graph smoothness drop quickly when the model depth $L$ is only slightly increased to 8. Even worse, metric $D$ approaches zero once $L > 20$, where GCNs fall into random prediction. Such observations are in contrast to the previous theoretical analysis of the over-smoothing issue: node embeddings become indistinguishable only if $L \to \infty$.
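As a concrete reference for Eq. (1) and the smoothness metric $D$ used in Figure 1(d), the following minimal PyTorch sketch implements a single vanilla GCN layer and the average pairwise distance; the class and function names, and the random stand-in for $\hat{A}$, are our own assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One vanilla GCN layer: H_out = sigma(A_hat @ H_in @ W), as in Eq. (1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(A_hat @ self.weight(H))

def graph_smoothness(H: torch.Tensor) -> torch.Tensor:
    """Average pairwise distance D = (1/|V|^2) * sum_{i,j} ||h_i - h_j||."""
    dists = torch.cdist(H, H)   # |V| x |V| matrix of pairwise Euclidean distances
    return dists.mean()         # mean over all |V|^2 ordered pairs

# Hypothetical usage: stack L layers and track how smoothness evolves.
n, d, L = 100, 16, 8
A_hat = torch.rand(n, n)
A_hat = (A_hat + A_hat.T) / 2
A_hat = A_hat / A_hat.sum(dim=1, keepdim=True)   # crude stand-in for a normalized adjacency
H = torch.randn(n, d)
for layer in [GCNLayer(d) for _ in range(L)]:
    H = layer(A_hat, H)
print("smoothness D:", graph_smoothness(H).item())
```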
By simply removing the feature transformation module in GCNs, it is surprising to observe that the accuracy and graph smoothness are maintained steadily until $L = 100$. This makes us question the application of over-smoothing theory to explain the performance deterioration of shallow GNNs, since the node embeddings averaged by the standalone neighbor aggregation are still well separated when $L$ is small. Thus, we shift the research attention to the feature transformation, which has been ignored before, and argue that unstable feature transformation is the primary factor compromising GNNs. The stability of feature transformation is defined by both the forward inference steadiness and the backward gradient steadiness, introduced in the following.

**Forward inference steadiness.** Keeping the forward inference steady is a ubiquitous technique to constrain the magnitudes of propagated signals and train deep neural networks stably (Trockman and Kolter 2021). Recalling the vanilla graph convolutions in Eq. (1), the feature transformation may amplify the magnitudes of node embeddings without a proper constraint on matrix $W^{(l)}$. Such magnitude amplification accumulates exponentially with layers (Xie, Xiong, and Pu 2017), and gives rise to the indefiniteness and randomness of forward node embeddings. The dramatic shift of internal signal distributions among the different layers prevents the underlying model from training efficiently (Ioffe and Szegedy 2015). To quantify the magnitude amplification of node embeddings across the whole GNN model, we define the signal magnification as:

$$M_{sig} = \frac{1}{|V|} \sum_{v_i \in V} \frac{\|h_i^{(L)}\|_2}{\|h_i^{(0)}\|_2}. \quad (2)$$

Specifically, metric $M_{sig}$ averages the ratios of node embedding norms at the last layer over those at the initial layer. A larger value of $M_{sig}$ indicates that the node embedding magnitudes are amplified excessively during the forward inference. Based on the common assumption that the initial data are whitened and decorrelated, the ideal $M_{sig}$ should be 1 to ensure identical embedding magnitudes and signal distributions among layers.

**Backward gradient steadiness.** Besides the forward inference, an alternative direction to stabilize the training process is to maintain the backward gradient steadiness. GNNs are trained by gradient descent with back-propagation to update the weight $W^{(l)}$ involved in the feature transformation. While the existing studies focus on the forward message passing in GNNs, understanding the backward gradient trajectory that optimizes the feature transformation has remained limited. Therefore, we make an initial step to analyze the gradients with respect to parameter $W^{(l)}$. To facilitate the gradient analysis, we simplify the non-linear activation function in Eq. (1) and obtain $H^{(l)} = \hat{A} H^{(l-1)} W^{(l)}$. It has been widely recognized that GNNs with and without non-linear activation have comparable node classification performance and learning curves (Wu et al. 2019). The model simplification helps us understand the gradient dynamics intuitively.

**Theorem 1** Given the linear GNN model with $L$ layers and the specific training loss $\mathcal{L}$, the gradient with respect to parameter $W^{(l)}$ at the $l$-th layer is given by:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \big(H^{(l-1)}\big)^{\top} \big(\hat{A}^{\top}\big)^{L-l+1} \frac{\partial \mathcal{L}}{\partial H^{(L)}} \big(W^{(l+1)} \cdots W^{(L)}\big)^{\top} \in \mathbb{R}^{d \times d}. \quad (3)$$

We give the detailed proof in Appendix 1. $\mathcal{L}$ can be the cross-entropy loss in the node or graph classification task.
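To make Theorem 1 concrete, the sketch below numerically checks the closed form of Eq. (3) against PyTorch autograd for a small linear GNN; the squared-error loss, tensor sizes and variable names are our own choices, used only because the check works for any differentiable loss on $H^{(L)}$.

```python
import torch

torch.manual_seed(0)
n, d, L = 6, 4, 3
A_hat = torch.rand(n, n); A_hat = (A_hat + A_hat.T) / 2        # stand-in normalized adjacency
X = torch.randn(n, d)
Y = torch.randn(n, d)                                           # dummy regression targets
Ws = [torch.randn(d, d, requires_grad=True) for _ in range(L)]  # W^(1), ..., W^(L)

# Forward pass of the linear GNN: H^(l) = A_hat @ H^(l-1) @ W^(l); keep intermediates.
Hs = [X]
for W in Ws:
    Hs.append(A_hat @ Hs[-1] @ W)
H_L = Hs[-1]
loss = 0.5 * ((H_L - Y) ** 2).sum()
loss.backward()

# Closed form of Eq. (3) for layer l (1-indexed):
# (H^(l-1))^T (A_hat^T)^(L-l+1) dL/dH^(L) (W^(l+1) ... W^(L))^T
l = 1
dL_dHL = (H_L - Y).detach()                         # gradient of the squared loss w.r.t. H^(L)
A_pow = torch.matrix_power(A_hat.T, L - l + 1)
W_tail = torch.eye(d)
for W in Ws[l:]:                                    # product W^(l+1) ... W^(L)
    W_tail = W_tail @ W.detach()
grad_closed = Hs[l - 1].detach().T @ A_pow @ dL_dHL @ W_tail.T

print(torch.allclose(grad_closed, Ws[l - 1].grad, rtol=1e-4, atol=1e-4))  # expected: True
```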
According to Eq. (3), it is observed that the backward gradients are back-propagated through the neighborhood aggregation and feature transformation, mirroring the forward inference process. To update parameter $W^{(l)}$ at layer $l$, the initial gradient $\frac{\partial \mathcal{L}}{\partial H^{(L)}}$ is smoothed through the posterior $L - l$ layers, and transformed with $(\hat{A} H^{(l-1)})^{\top}$. Since parameters $\{W^{(l+1)}, \ldots, W^{(L)}\}$ are penalized during training, such smoothing and transformation will make most of the gradient entries approach zero. In other words, the backward gradients may vanish at the initial few layers, which hampers the effective training of GNNs. To study the influence of gradient vanishing, we propose to apply the gradient norms, i.e., $\|\frac{\partial \mathcal{L}}{\partial W^{(l)}}\|_F$, $l = 1, \ldots, L$, to quantify the gradient steadiness. Appropriate strengths of the gradient norms are preferred to stabilize the model training.

**Illustration of model steadiness metrics.** To validate our argument that the unstable feature transformation is the primary factor compromising the performance of shallow models, we empirically analyze the two proposed steadiness metrics, i.e., the forward signal magnification $M_{sig}$ and the backward gradient norm $\|\frac{\partial \mathcal{L}}{\partial W^{(l)}}\|_F$. Specifically, we develop a series of GCNs with different depths $L$, and show their signal magnifications in Figure 1(a). It is observed that metric $M_{sig}$ rises quickly with depth $L$, which indicates that the magnitudes of forward node embeddings are amplified exponentially across the model. The resulting butterfly effect of internal distribution shift prevents effective training, which helps explain the performance degradation of shallow GNNs. To illustrate the gradient dynamics from the final to the initial layer, we plot the gradient norms for an 8-layer GNN in Figure 1(b). Consistent with our theoretical analysis, $\|\frac{\partial \mathcal{L}}{\partial W^{(l)}}\|_F$ shows a decreasing tendency during back-propagation at the initial training phase. After hundreds of epochs, the gradients at all layers have vanished, which stops GNNs from moving toward the global minimum of the loss landscape. It should be noted that the unstable forward and backward signaling appears in a shallow GCN with 8 layers, while GCNs without the feature transformation modules deliver steady performance until $L = 100$, as illustrated in Figure 1(d). Therefore, we conclude that the unstable feature transformation is responsible for the performance deterioration of shallow GNNs.
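Both steadiness metrics are cheap to monitor in practice. The helpers below (our own naming, assuming a generic PyTorch model whose 2-D parameters are the transformation weights $W^{(l)}$) compute $M_{sig}$ from Eq. (2) and collect the per-layer gradient norms $\|\partial\mathcal{L}/\partial W^{(l)}\|_F$ after a backward pass.

```python
import torch

def signal_magnification(H0: torch.Tensor, HL: torch.Tensor, eps: float = 1e-12) -> float:
    """M_sig of Eq. (2): average ratio of final- to initial-layer embedding norms per node."""
    ratios = HL.norm(dim=1) / (H0.norm(dim=1) + eps)
    return ratios.mean().item()

def layer_gradient_norms(model: torch.nn.Module) -> dict:
    """Frobenius norms of the gradients of all weight matrices, collected after loss.backward()."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None and p.dim() == 2}

# Hypothetical usage inside a training loop:
#   out = model(A_hat, X)                 # forward pass; keep H^(0) = X and H^(L) = out
#   loss = criterion(out[train_mask], y[train_mask])
#   loss.backward()
#   print("M_sig:", signal_magnification(X, out))
#   print("grad norms:", layer_gradient_norms(model))
```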
## Orthogonal Graph Convolutions

To overcome the issues of unstable forward inference and backward gradients, we explore the use of orthogonality in the feature transformation. Although many other methods have been applied to constrain the forward direction in GNNs, such as pair or group normalization (Zhao and Akoglu 2019; Zhou et al. 2020a), they fail to guarantee the backward steadiness due to the complexity variations of signals (Saxe, McClelland, and Ganguli 2013). In this section, we first review the orthogonal matrix and its theoretical properties in stabilizing forward and backward signaling. We then discuss the challenges of applying orthogonal constraints in GNNs, and optimize the orthogonal graph convolutions to tailor them to graph data.

### Orthogonality

A matrix $W \in \mathbb{R}^{d \times d}$ is orthogonal if $W^{\top} W = W W^{\top} = I$. Encouraging orthogonality in deep neural networks has proven to yield several benefits, such as stable and quick training (Xiao et al. 2018), better generalization (Bansal, Chen, and Wang 2018), and improved robustness against adversarial attacks (Tsuzuku, Sato, and Sugiyama 2018). In GNNs, we are concerned with the properties of an orthogonal feature transformation that stabilize the signaling processes in both the forward and backward directions. We thus ignore the neighbor aggregation and non-linear activation at each layer to simplify the theoretical analysis, and consider their impacts empirically in the following model design. Mathematically, for the $l$-th layer, the graph convolutions can be simplified as $H^{(l)} = \hat{H}^{(l)} W^{(l)}$, where $\hat{H}^{(l)} = \hat{A} H^{(l-1)}$ denotes the node embeddings after neighbor aggregation. The following theorem, proved in (Huang et al. 2018), shows that an orthogonal weight $W^{(l)}$ preserves the norms of forward embeddings and backward gradients for signals passing through the feature transformation module.

**Theorem 2** Let $W^{(l)} \in \mathbb{R}^{d \times d}$ denote the orthogonal matrix adopted by the feature transformation at the $l$-th layer. Let $\hat{h}^{(l)}$ and $h^{(l)}$ denote the node embeddings given by the corresponding rows of matrices $\hat{H}^{(l)}$ and $H^{(l)}$, respectively. (1) Assume the mean of $\hat{h}^{(l)}$ is $\mathbb{E}[\hat{h}^{(l)}] = 0$ and the covariance matrix of $\hat{h}^{(l)}$ is $\mathrm{cov}(\hat{h}^{(l)}) = \sigma^2 I$. Then $\mathbb{E}[h^{(l)}] = 0$ and $\mathrm{cov}(h^{(l)}) = \sigma^2 I$. (2) We have $\|H^{(l)}\|_F = \|\hat{H}^{(l)}\|_F$. (3) Given the back-propagated gradient $\frac{\partial \mathcal{L}}{\partial H^{(l)}}$, we have $\|\frac{\partial \mathcal{L}}{\partial H^{(l)}}\|_F = \|\frac{\partial \mathcal{L}}{\partial \hat{H}^{(l)}}\|_F$.

We provide the detailed proof in Appendix 2, which is inspired by (Huang et al. 2018). Theorem 2 shows the benefits of an orthogonal feature transformation in stabilizing simplified GNNs: (1) the Frobenius norms of node embeddings $H^{(l)}$ and $\hat{H}^{(l)}$ are kept identical, which helps constrain the embedding magnitudes across the model and approach the desired signal magnification $M_{sig}$ of 1; (2) the norms of backward gradients are maintained when passing through the feature transformation layer, which relieves the gradient vanishing issue studied in Theorem 1.
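Properties (2) and (3) of Theorem 2 can be sanity-checked numerically in a few lines; the toy example below uses a random orthogonal matrix from a QR decomposition and is our own illustration rather than the paper's proof.

```python
import torch

torch.manual_seed(0)
d, n = 16, 32
W, _ = torch.linalg.qr(torch.randn(d, d))       # random orthogonal matrix: W^T W = I
H_hat = torch.randn(n, d, requires_grad=True)   # embeddings after neighbor aggregation
H = H_hat @ W                                   # simplified feature transformation

# (2) Forward Frobenius norms are preserved: ||H||_F == ||H_hat||_F.
print(torch.allclose(H.norm(), H_hat.norm()))

# (3) Backward gradient norms are preserved when passing through the transformation.
G = torch.randn(n, d)                           # stand-in for dL/dH^(l)
H.backward(G)                                   # H_hat.grad = G @ W^T
print(torch.allclose(H_hat.grad.norm(), G.norm()))
```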
### Orthogonal Graph Convolutions

To ensure orthogonality, most prior efforts either insert an extra orthogonal layer to transform matrix $W^{(l)}$ (Trockman and Kolter 2021), or exploit orthogonal weight initialization to provide the model with a good start (Xiao et al. 2018). However, it is non-trivial to directly apply the existing orthogonal methods due to two challenges in GNNs. First, since the node features usually contain the key information for the downstream task, a naive orthogonal initialization will accelerate the training process toward a local minimum and damage the model's learning ability. Second, even with strict orthogonality of $W^{(l)}$, the impacts brought by the neighbor aggregation and non-linear activation make it fail to preserve the embedding norms across successive layers. According to Theorem 2, orthogonality only yields $\|H^{(l)}\|_F = \|\hat{H}^{(l)}\|_F$ within the same layer in simplified GNNs, instead of strictly ensuring $\|H^{(l)}\|_F = \|H^{(l-1)}\|_F$ across successive layers in nonlinear GNNs. To bridge these gaps, we propose the orthogonal graph convolutions, named Ortho-GConv, by optimizing the orthogonality designs comprehensively from three architectural perspectives: hybrid weight initialization, orthogonal transformation, and orthogonal regularization. The details are introduced in the following.

**Hybrid weight initialization.** It is widely demonstrated that GNNs tend to overfit on large, attributed graph data (Kong et al. 2020). Although orthogonal initialization allows training deep vanilla neural networks efficiently, the quick convergence may iterate to a local optimum and intensify the overfitting problem. To trade off orthogonality against the model's learning ability, we propose the hybrid weight initialization, which sets the weight at the $l$-th layer as:

$$Q^{(l)} = \beta P^{(l)} + (1 - \beta) I \in \mathbb{R}^{d \times d}. \quad (4)$$

$P^{(l)}$ is initialized by traditional random approaches (e.g., Glorot initialization (Glorot and Bengio 2010)), while we adopt the identity initialization (Le, Jaitly, and Hinton 2015), the simplest orthogonal method, to obtain the initial orthogonality $I$. $\beta$ is a hyperparameter controlling the trade-off.

**Orthogonal transformation.** Given the initialized weight $Q^{(l)}$, we adopt an extra orthogonal transformation layer to transform it and improve its orthogonality before its application to the feature transformation. We use Newton's iteration to illustrate our method due to its simplicity (Huang et al. 2020). Specifically, the orthogonal transformation based on Newton's iteration is divided into two steps: spectral bounding and orthogonal projection. First, the spectral bounding normalizes the weight $Q^{(l)}$ as $\hat{Q}^{(l)} = Q^{(l)} / \|Q^{(l)}\|_F$. The orthogonal projection then maps matrix $\hat{Q}^{(l)}$ to obtain the orthogonal weight $W^{(l)}$. Mathematically, the orthogonal projection is given by $W^{(l)} = M^{-\frac{1}{2}} \hat{Q}^{(l)}$, where $M = \hat{Q}^{(l)} (\hat{Q}^{(l)})^{\top}$ is the covariance matrix. Due to the high cost of computing the square root of the covariance matrix exactly, $M^{-\frac{1}{2}}$ is instead approximated by Newton's iteration with the following iterative formulas:

$$B_0 = I, \qquad B_t = \tfrac{1}{2}\big(3 B_{t-1} - B_{t-1}^{3} M\big), \quad t = 1, 2, \ldots, T, \quad (5)$$

where $T$ is the number of iterations. Under the condition $\|I - M\|_2 < 1$, it has been proved that $B_T$ converges to $M^{-\frac{1}{2}}$. Therefore, we obtain the orthogonal weight via $W^{(l)} = B_T \hat{Q}^{(l)}$. Weight $W^{(l)}$ is applied to the feature transformation as shown in Eq. (1).

**Orthogonal regularization.** Even with an orthogonal matrix $W^{(l)}$ in the feature transformation, the norms of forward node embeddings still fail to be preserved due to the neighbor aggregation and non-linear activation in GNNs. To be specific, recalling the graph convolutions in Eq. (1), we have:

$$\|H^{(l)}\|_F = \|\sigma(\hat{A} H^{(l-1)} W^{(l)})\|_F \le \|\hat{A} H^{(l-1)} W^{(l)}\|_F = \|\hat{A} H^{(l-1)}\|_F \le \|\hat{A}\|_F \|H^{(l-1)}\|_F \le \|H^{(l-1)}\|_F. \quad (6)$$

The first inequality holds since the ReLU activation maps negative entries to zeros. The following equality is obtained by the norm-preserving property of the orthogonal weight $W^{(l)}$. Since the entries in the adjacency matrix $\hat{A}$ are normalized within the range $[0, 1]$, we have $\|\hat{A}\|_F \le 1$ to get the final inequality. In contrast to the forward embedding explosion in vanilla GNNs, such norm vanishing will also shift the internal embedding distributions and result in inefficient training. To maintain the node embedding norms during the forward inference, we propose a simple orthogonal regularization that constrains weight $W^{(l)}$ as:

$$\mathcal{L}_{\mathrm{auxiliary}} = \lambda \sum_{l} \big\| W^{(l)} (W^{(l)})^{\top} - c^{(l)} I \big\|_F, \quad (7)$$

where $\lambda$ is a hyperparameter and $c^{(l)}$ is a trainable scalar controlling the norm of weight $W^{(l)}$. We initialize $c^{(l)}$ to 1, and let the model automatically learn how to preserve the embedding norms in the forward inference. A larger $c^{(l)}$ compensates for the norm vanishing brought by the neighbor aggregation and non-linear activation. Our Ortho-GConv is a general module capable of augmenting existing GNNs.
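To summarize how the three components fit together, the sketch below is a minimal PyTorch rendering of an Ortho-GConv layer under our own naming (not the authors' released implementation): hybrid initialization as in Eq. (4), Newton-iteration orthogonalization as in Eq. (5) with Frobenius-norm spectral bounding, and the per-layer regularization term of Eq. (7). Backbone-specific details (e.g., GCNII's initial residual) are omitted.

```python
import torch
import torch.nn as nn

def hybrid_init(d: int, beta: float = 0.4) -> torch.Tensor:
    """Eq. (4): Q = beta * P + (1 - beta) * I, with P drawn from Glorot initialization."""
    P = torch.empty(d, d)
    nn.init.xavier_uniform_(P)
    return beta * P + (1.0 - beta) * torch.eye(d)

def newton_orthogonalize(Q: torch.Tensor, T: int = 4) -> torch.Tensor:
    """Eq. (5): spectral bounding, then Newton's iteration approximating W = M^{-1/2} Q_hat."""
    Q_hat = Q / Q.norm()               # spectral bounding by the Frobenius norm
    M = Q_hat @ Q_hat.T                # covariance matrix
    B = torch.eye(Q.shape[0], device=Q.device)
    for _ in range(T):
        B = 0.5 * (3.0 * B - B @ B @ B @ M)
    return B @ Q_hat                   # approximately orthogonal weight

class OrthoGConvLayer(nn.Module):
    """One graph convolution whose weight is orthogonalized before use in Eq. (1)."""
    def __init__(self, d: int, beta: float = 0.4, T: int = 4):
        super().__init__()
        self.Q = nn.Parameter(hybrid_init(d, beta))   # hybrid-initialized raw weight, Eq. (4)
        self.c = nn.Parameter(torch.tensor(1.0))      # trainable scalar c^(l) of Eq. (7)
        self.T = T

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        W = newton_orthogonalize(self.Q, self.T)
        return torch.relu(A_hat @ H @ W)

    def ortho_regularizer(self) -> torch.Tensor:
        """One term of Eq. (7): ||W W^T - c * I||_F for this layer."""
        W = newton_orthogonalize(self.Q, self.T)
        I = torch.eye(W.shape[0], device=W.device)
        return torch.norm(W @ W.T - self.c * I)

# Hypothetical training objective: task loss plus lambda * sum of per-layer regularizers.
# loss = criterion(out, y) + lam * sum(layer.ortho_regularizer() for layer in model.layers)
```

The defaults $T = 4$ and $\beta = 0.4$ mirror the values the paper fixes for its graph classification experiments.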
Without loss of generality, we adopt the simple identity weight and Newton's iteration to provide the orthogonal initialization and transformation, respectively. Other orthogonal methods could be applied to further improve the model's performance in the future.

### Steadiness Study of Our Model

To validate the effectiveness of our Ortho-GConv in stabilizing the forward and backward signaling, we implement it upon vanilla GCNs. The signal magnifications $M_{sig}$ are constrained around the value 1 for models of different depths in Figure 1(a), and the gradient norms $\|\frac{\partial \mathcal{L}}{\partial W^{(l)}}\|_F$ are comparable across the different layers within an 8-layer model in Figure 1(c). In other words, our Ortho-GConv constrains the magnitudes of node embeddings to stabilize the forward inference, while maintaining the gradients in the backward direction to update the model effectively.

## Experiments

In this section, we conduct extensive experiments to evaluate our model, aimed at answering the following questions. Q1: How effective is the proposed Ortho-GConv when applied to current popular GNNs on the full-supervised node classification and graph classification tasks? Q2: How much do the hybrid weight initialization, orthogonal transformation and orthogonal regularization influence Ortho-GConv? We provide the results on semi-supervised node classification in Appendix 5. Our code is publicly available.¹

**Benchmark Datasets.** For the full-supervised node classification task, we use Cora (Sen et al. 2008), CiteSeer (Sen et al. 2008), PubMed (Sen et al. 2008), and three subsets of WebKB (Pei et al. 2020): Cornell, Texas and Wisconsin, to evaluate the performance. For the graph classification task, we use the protein datasets D&D (Dobson and Doig 2003) and PROTEINS (Borgwardt et al. 2005). In addition, we conduct experiments on ogbn-arxiv (Hu et al. 2020a) to evaluate the scalability and performance of Ortho-GConv on a large graph. The statistics of the datasets and the parameter settings are provided in Appendix 3 and 4, respectively.

### Full-supervised Node Classification

We construct two experiments to evaluate the performance of our proposed Ortho-GConv, namely the comparison with different layers and the comparison with SOTA. We apply the same full-supervised experimental settings and training settings as GCNII (Chen et al. 2020c), and use six datasets, including Cora, Citeseer, Pubmed, Cornell, Texas and Wisconsin, to evaluate the performance. For each dataset, we randomly split the nodes of each class into 60%, 20% and 20% for training, validation and testing, respectively. In addition, we conduct experiments on ogbn-arxiv to further evaluate the performance of our proposed Ortho-GConv on a large graph. For this dataset, we split the nodes of each class into 54%, 18% and 28% for training, validation and testing, following the previous effort (Hu et al. 2020b).

**(1) Comparison with Different Layers.** We take GCN, JKNet and GCNII as the three backbones and compare our proposed Ortho-GConv with their original models at 2/4/8 layers, respectively. The experimental results are shown in Table 2. We make the following conclusions: i) Ortho-GConv generally improves the performance of all the backbones on each dataset under different numbers of layers. For example, Ortho-GConv delivers average improvements of 1.9% and 2.1% over the backbones with 2 layers on Cora and Citeseer, respectively, while achieving remarkable gains of 2.8% and 12.4% in the case of 8 layers.
ii) As the number of layers increases, the performance of the original GNN models decreases significantly. However, our proposed Ortho-GConv achieves comparable performance at 2 and 8 layers, which relieves the performance degradation of shallow GNNs as analyzed before. This phenomenon is attributed to the advantage of our Ortho-GConv in solving the problem of gradient vanishing and making the model inference stable.

¹https://github.com/KaiGuo20/Ortho-GConv

| Method | Cora | Cite. | Pubm. | Corn. | Texa. | Wisc. |
|---|---|---|---|---|---|---|
| GCN | 85.77 | 73.68 | 87.91 | 57.84 | 55.68 | 49.02 |
| GAT | 86.37 | 74.32 | 87.62 | 54.32 | 58.38 | 49.41 |
| Geom-GCN | 85.19 | **77.99** | 90.05 | 56.76 | 57.58 | 58.24 |
| APPNP | 87.87 | 76.53 | 89.40 | 73.51 | 65.41 | 69.02 |
| JKNet | 86.20 | 75.89 | 89.55 | 62.16 | 62.70 | 64.31 |
| Incep (DropEdge) | 86.86 | 76.83 | 89.18 | 61.62 | 57.84 | 50.20 |
| GCNII* | 88.01 | 77.13 | 90.30 | 76.49 | 77.84 | 81.57 |
| Ortho-GCNII* | **88.69** | 77.21 | **90.42** | **77.84** | **80.54** | **83.53** |

Table 1: Accuracy (%) comparisons with SOTA on full-supervised tasks. The best performance on each dataset is highlighted in bold.

iii) Ortho-GConv also generally achieves better performance than the backbones on ogbn-arxiv. The results demonstrate that Ortho-GConv is applicable to large graphs. Therefore, our proposed Ortho-GConv can tackle the problems of forward inference explosion and backward gradient vanishing in vanilla shallow GNNs, thus enabling the model to be trained stably.

**(2) Comparison with SOTA.** In order to verify the overall performance of Ortho-GConv, we select the best result of each backbone equipped with Ortho-GConv and compare it with the current popular SOTA methods. We repeat the experiments 5 times following the same experimental settings as GCNII and report the average results in Table 1. The experimental results reveal that our proposed method outperforms all the baselines. Ortho-GCNII* obtains a 1.2% average improvement across all datasets. In particular, Ortho-GConv delivers an improvement of 3.5% over GCNII* on Texas. In addition, the few-layer GCNII* model equipped with Ortho-GConv outperforms deep GNN models, which demonstrates the superiority of Ortho-GConv. More detailed results, including layer numbers, are reported in Appendix 6.

### Graph Classification

For the graph classification task, we use Graph-U-Nets (Gao and Ji 2019) as the backbone, take PSCN (Niepert, Ahmed, and Kutzkov 2016), DGCNN (Zhang et al. 2018) and DiffPool (Ying et al. 2018) as the comparison baselines, and conduct experiments on the D&D and PROTEINS datasets to evaluate our model. We follow the same experimental settings as Graph-U-Nets for a fair comparison, and fix the parameters $T = 4$ and $\beta = 0.4$. The experimental results are reported in Table 3. We can see that applying Ortho-GConv to Graph-U-Nets (g-U-Nets) achieves new state-of-the-art performance on the D&D and PROTEINS datasets, which demonstrates the effectiveness of Ortho-GConv on the graph classification task. In summary, our proposed Ortho-GConv gains a significant performance improvement over representative baseline methods on the node classification and graph classification tasks, which answers the first question asked at the beginning of this section.
| Dataset | Backbone | 2 layers (original) | 2 layers (Ortho-GConv) | 4 layers (original) | 4 layers (Ortho-GConv) | 8 layers (original) | 8 layers (Ortho-GConv) |
|---|---|---|---|---|---|---|---|
| Cora | GCN | 85.77±1.73 | 87.28±1.68 | 82.37±2.47 | 86.20±1.75 | 81.13±2.78 | 85.27±1.64 |
| Cora | JKNet | 85.96±1.54 | 87.36±1.74 | 86.20±1.45 | 87.12±2.25 | 85.84±1.64 | 87.24±2.09 |
| Cora | GCNII | 86.28±0.79 | 88.49±1.59 | 85.70±2.10 | 88.81±1.69 | 86.80±2.10 | 88.41±1.43 |
| Citeseer | GCN | 73.68±1.69 | 75.59±1.78 | 68.03±5.90 | 74.80±1.11 | 53.10±6.34 | 71.61±2.46 |
| Citeseer | JKNet | 75.89±1.54 | 76.89±1.64 | 74.97±1.76 | 76.11±1.76 | 74.85±1.69 | 75.60±1.95 |
| Citeseer | GCNII | 75.31±2.36 | 77.26±1.83 | 75.60±1.70 | 76.94±2.10 | 76.10±2.10 | 77.11±2.20 |
| Pubmed | GCN | 87.91±0.44 | 86.04±0.61 | 77.00±7.55 | 84.68±0.55 | 69.49±0.98 | 83.75±0.50 |
| Pubmed | JKNet | 89.40±0.30 | 89.46±0.28 | 89.47±0.44 | 89.54±0.38 | 89.55±0.47 | 89.57±0.21 |
| Pubmed | GCNII | 89.51±0.69 | 90.30±0.30 | 89.50±0.40 | 90.04±0.32 | 89.78±0.33 | 89.80±0.43 |
| Corn. | GCN | 52.70±5.05 | 58.38±3.62 | 57.84±3.00 | 57.84±3.08 | 57.84±3.00 | 57.84±3.08 |
| Corn. | JKNet | 62.16±5.05 | 63.24±4.90 | 52.97±11.6 | 58.92±4.44 | 56.22±7.97 | 58.92±5.53 |
| Corn. | GCNII | 58.92±4.44 | 74.05±4.09 | 66.00±6.20 | 75.14±6.72 | 74.10±5.60 | 75.14±7.13 |
| Texa. | GCN | 55.14±7.78 | 61.08±9.08 | 55.68±5.60 | 58.92±7.49 | 54.59±7.00 | 58.38±6.22 |
| Texa. | JKNet | 58.38±6.22 | 61.08±8.01 | 62.70±4.00 | 62.70±5.85 | 62.16±5.05 | 60.54±3.52 |
| Texa. | GCNII | 69.73±13.30 | 77.84±6.72 | 71.40±8.00 | 77.30±5.60 | 70.80±5.20 | 75.14±3.52 |
| Wisc. | GCN | 49.02±3.66 | 50.59±8.40 | 46.67±7.76 | 48.24±9.46 | 40.00±10.6 | 46.67±9.44 |
| Wisc. | JKNet | 64.31±6.41 | 69.41±5.64 | 59.61±4.06 | 68.63±4.80 | 56.86±3.10 | 65.88±5.11 |
| Wisc. | GCNII | 72.94±9.23 | 77.25±3.54 | 74.50±7.80 | 77.25±5.11 | 75.30±8.10 | 75.29±6.29 |
| ogbn-arxiv | GCN | 71.28±0.28 | 71.33±0.26 | 72.30±0.17 | 72.30±0.13 | 71.84±0.27 | 71.87±0.12 |
| ogbn-arxiv | GCNII | 71.24±0.19 | 71.35±0.21 | 71.21±0.19 | 71.38±0.14 | 72.51±0.28 | 72.44±0.46 |

Table 2: Mean classification accuracy (%) of different backbones with/without Ortho-GConv on full-supervised node classification tasks.

| Dataset | PSCN | DGCNN | DiffPool | g-U-Nets | Ortho-g-U-Nets |
|---|---|---|---|---|---|
| D&D | 76.27 | 79.37 | 80.64 | 83.00 | 83.87 |
| PROTEINS | 75.00 | 76.26 | 76.25 | 77.68 | 78.78 |

Table 3: Accuracy (%) comparisons with SOTA on graph classification tasks.

### Ablation Study

To study the importance of Ortho-GConv's three perspectives, we construct several variants of our model with respect to the hybrid weight initialization, orthogonal transformation and orthogonal regularization, defined as follows: 1) Ortho-GCN w/o I disregards the hybrid weight initialization module of Ortho-GCN; 2) Ortho-GCN w/o T omits the orthogonal transformation of Ortho-GCN; 3) Ortho-GCN w/o R ignores the orthogonal regularization of Ortho-GCN. To shed light on the contributions of the three perspectives, we report the semi-supervised node classification results of Ortho-GCN and its variants on Cora, as shown in Figure 2(a). We have the following observations: i) Ortho-GCN achieves better performance than the three ablated variants and the GCN model, which further demonstrates the importance of the three orthogonal techniques for stabilizing the model. ii) Removing the orthogonal initialization from the proposed model has a considerable impact on the performance, which suggests that this component plays an important role in the node classification task. The experimental results further demonstrate the importance of the three key components of our model, which answers the second question accordingly. Besides, we provide an ablation study to investigate how the iteration number $T$ affects the performance and training time of Ortho-GConv.
We conduct experiments on the Cora dataset using 2-layer and 8-layer Ortho-GCN models.

Figure 2: (a) Depth vs. accuracy for GCN, our model and the three ablated models on Cora. (b) Effects of the iteration number T of the 2-layer Ortho-GCN on the Cora dataset. (c) Effects of the iteration number T of the 8-layer Ortho-GCN on the Cora dataset.

The results are reported in Figure 2(b) and (c), respectively. We find that the time consumption keeps growing as $T$ increases. From Figure 2(b) and (c), we can also observe that both larger and smaller numbers of iterations degrade the performance of our proposed model; the best performance is obtained when the number of iterations is 4. In conclusion, choosing an appropriate $T$ achieves high accuracy with acceptable time complexity.

## Conclusion

In this paper, we first conduct a series of analytical experiments to probe the reason for the degrading performance of GNNs when stacked with more convolutional layers. We argue that the primary factor is the unstable forward and backward signaling in GNNs. We then propose orthogonal graph convolutions to augment GNN backbones, stabilize model training and improve the generalization performance of the model. The experiments show that our Ortho-GConv achieves better performance than SOTA methods on node and graph classification tasks.

## Acknowledgments

This work is supported by the Foundation of the Major Project of Science and Technology Innovation 2030 - New Generation of Artificial Intelligence (No. 2021ZD0112500), the National Natural Science Foundation of China (No. 61872161, No. 61976102, No. U19A2065), Jilin Educational Committee (No. JJKH20191257KJ), and the Nature Science Foundation of Jilin Province (20200201297JC). Yu Li is also supported in part by the China Scholarship Council (No. 202006170168).

## References

Bansal, N.; Chen, X.; and Wang, Z. 2018. Can we gain more from orthogonality regularizations in training deep CNNs? arXiv preprint arXiv:1810.09102.

Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S. V. N.; Smola, A. J.; and Kriegel, H. 2005. Protein function prediction via graph kernels. In Proceedings of the Thirteenth International Conference on Intelligent Systems for Molecular Biology, 47–56.

Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral Networks and Locally Connected Networks on Graphs. In 2nd International Conference on Learning Representations.

Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; and Sun, X. 2020a. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3438–3445.

Chen, L.; Wu, L.; Hong, R.; Zhang, K.; and Wang, M. 2020b. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 27–34.

Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; and Li, Y. 2020c. Simple and deep graph convolutional networks. In International Conference on Machine Learning, 1725–1735. PMLR.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29: 3844–3852.

Dobson, P. D.; and Doig, A. J. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4): 771–783.
Gao, H.; and Ji, S. 2019. Graph U-Nets. In International Conference on Machine Learning, 2083–2092. PMLR.

Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. JMLR Workshop and Conference Proceedings.

Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020a. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.

Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020b. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems.

Huang, L.; Liu, L.; Zhu, F.; Wan, D.; Yuan, Z.; Li, B.; and Shao, L. 2020. Controllable Orthogonalization in Training DNNs. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6428–6437. IEEE.

Huang, L.; Liu, X.; Lang, B.; Yu, A. W.; Wang, Y.; and Li, B. 2018. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456. PMLR.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations. OpenReview.net.

Klicpera, J.; Bojchevski, A.; and Günnemann, S. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In 7th International Conference on Learning Representations. OpenReview.net.

Kong, K.; Li, G.; Ding, M.; Wu, Z.; Zhu, C.; Ghanem, B.; Taylor, G.; and Goldstein, T. 2020. FLAG: Adversarial data augmentation for graph neural networks. arXiv preprint arXiv:2010.09891.

Le, Q. V.; Jaitly, N.; and Hinton, G. E. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Li, G.; Muller, M.; Thabet, A.; and Ghanem, B. 2019. DeepGCNs: Can GCNs Go As Deep As CNNs? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Li, R.; Wang, S.; Zhu, F.; and Huang, J. 2018. Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

Liu, M.; Gao, H.; and Ji, S. 2020. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 338–348.

Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023. PMLR.

Nt, H.; and Maehara, T. 2019. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550.

Oono, K.; and Suzuki, T. 2019. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947.

Pei, H.; Wei, B.; Chang, K. C.; Lei, Y.; and Yang, B. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In 8th International Conference on Learning Representations. OpenReview.net.
Rong, Y.; Huang, W.; Xu, T.; and Huang, J. 2020. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In 8th International Conference on Learning Representations. OpenReview.net.

Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations.

Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine, 29(3): 93–93.

Tian, Y.; Niu, Y.; Yan, J.; and Tian, F. 2019. Inferring private attributes based on graph convolutional neural network in social networks. In 2019 International Conference on Networking and Network Applications (NaNA), 186–190. IEEE.

Trockman, A.; and Kolter, J. Z. 2021. Orthogonalizing Convolutional Layers with the Cayley Transform. In 9th International Conference on Learning Representations. OpenReview.net.

Tsuzuku, Y.; Sato, I.; and Sugiyama, M. 2018. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. arXiv preprint arXiv:1802.04034.

Vorontsov, E.; Trabelsi, C.; Kadoury, S.; and Pal, C. 2017. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, 3570–3578. PMLR.

Wang, J.; Chen, Y.; Chakraborty, R.; and Yu, S. X. 2020. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11505–11515.

Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In International Conference on Machine Learning, 6861–6871.

Xiao, L.; Bahri, Y.; Sohl-Dickstein, J.; Schoenholz, S.; and Pennington, J. 2018. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 5393–5402. PMLR.

Xie, D.; Xiong, J.; and Pu, S. 2017. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6176–6185.

Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 5453–5462. PMLR.

Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. Hierarchical Graph Representation Learning with Differentiable Pooling. In Annual Conference on Neural Information Processing Systems 2018, 4805–4815.

Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhao, L.; and Akoglu, L. 2019. PairNorm: Tackling oversmoothing in GNNs. arXiv preprint arXiv:1909.12223.

Zhou, K.; Huang, X.; Li, Y.; Zha, D.; Chen, R.; and Hu, X. 2020a. Towards deeper graph neural networks with differentiable group normalization. arXiv preprint arXiv:2006.06972.
Zhou, K.; Huang, X.; Zha, D.; Chen, R.; Li, L.; Choi, S.-H.; and Hu, X. 2021a. Dirichlet Energy Constrained Learning for Deep Graph Neural Networks. Advances in Neural Information Processing Systems, 34.

Zhou, K.; Liu, N.; Yang, F.; Liu, Z.; Chen, R.; Li, L.; Choi, S.-H.; and Hu, X. 2021b. Adaptive Label Smoothing To Regularize Large-Scale Graph Training. arXiv preprint arXiv:2108.13555.

Zhou, K.; Song, Q.; Huang, X.; and Hu, X. 2019. Auto-GNN: Neural architecture search of graph neural networks. arXiv preprint arXiv:1909.03184.

Zhou, K.; Song, Q.; Huang, X.; Zha, D.; Zou, N.; Hu, X.; et al. 2020b. Multi-channel graph neural networks. In International Joint Conference on Artificial Intelligence.