# Digraph Inception Convolutional Networks

Zekun Tong¹, Yuxuan Liang², Changsheng Sun², Xinke Li¹, David S. Rosenblum²,³, Andrew Lim¹*
¹Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore
²Department of Computer Science, National University of Singapore, Singapore
³Department of Computer Science, George Mason University, VA, USA
{zekuntong, liangyuxuan, cssun, xinke.li}@u.nus.edu, dsr@gmu.edu, isealim@nus.edu.sg

Abstract

Graph Convolutional Networks (GCNs) have shown promising results in modeling graph-structured data. However, they have difficulty processing digraphs for two reasons: 1) transforming a directed graph into an undirected one to guarantee the symmetry of the graph Laplacian is unreasonable, since it not only misleads the message-passing scheme into aggregating features with incorrect weights but also discards the unique characteristics of the digraph structure; 2) due to the fixed receptive field in each layer, GCNs fail to obtain the multi-scale features that could boost their performance. In this paper, we theoretically extend spectral-based graph convolution to digraphs and derive a simplified form using personalized PageRank. Specifically, we present the Digraph Inception Convolutional Networks (DiGCN), which utilizes digraph convolution and kth-order proximity to achieve larger receptive fields and learn multi-scale features in digraphs. We empirically show that DiGCN can encode more structural information from digraphs than GCNs and helps achieve better performance when generalized to other models. Moreover, experiments on various benchmarks demonstrate its superiority over state-of-the-art methods.

1 Introduction

Learning from digraph (directed graph) data to solve practical problems, such as traffic prediction [25, 36], knowledge discovery [12] and time-series problems [4, 9], has attracted increasing attention. There are two general categories of GCNs: spatial-based [16, 40] and spectral-based [19, 23, 11]. The spatial-based approaches achieve digraph convolution by using self-defined neighbor-traversal methods to aggregate features, which usually adds significant computational overhead [44, 45, 48]. Correspondingly, spectral-based GCNs [19, 44] use adjacency matrices based on spectral analysis to explore the neighborhood instead of traversal search, which greatly reduces the training time. However, they are limited to undirected graphs as input because, by definition, the graph Laplacian needs to be symmetric [45]. How to extend spectral-based GCNs to digraphs remains to be explored.

A majority of spectral-based GCNs transform digraphs into undirected graphs by relaxing the direction structure [19, 44], i.e., trivially adding edges to symmetrize the adjacency matrix. This not only misleads the message-passing scheme into aggregating features with incorrect weights but also discards the distinctive direction structure [43], such as irreversible time-series relationships. Besides, there are several works that learn specific structures by defining motifs [28], inheritance relationships [18] and second-order proximity [39]. However, these methods have to stipulate learning templates or rules in advance and are not capable of dealing with complex structures beyond their definitions.

*Corresponding author

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Besides, most existing spectral-based GCNs enhance their feature-extraction capability by stacking a number of graph convolutional layers [22, 13]. However, this often leads to feature dilution as well as overfitting when models become deep [18, 22]. Inspired by the Inception network for image classification [37], some works [32, 46, 1] widen their layers to obtain larger receptive fields and increase learning ability. However, they use a fixed adjacency matrix in each layer, which makes it difficult to capture multi-scale features. A scalable neighborhood would be desirable to provide more scale information, especially for nodes belonging to communities of different sizes. Moreover, choosing a proper receptive-field scheme to fuse multi-scale features together can help handle complex structures in digraphs.

To address these issues, we first extend spectral-based graph convolution to digraphs by leveraging the inherent connections between the graph Laplacian and the stationary distribution of PageRank [29]. Since the random walk on the original digraph is not necessarily irreducible and aperiodic, the corresponding Markov chain does not have a unique stationary distribution. To solve this problem, we add a chance of teleporting back to every node, following PageRank. However, the derived digraph Laplacian is too dense, and performing convolution with it is extremely time-consuming. Thus, referring to personalized PageRank [3], we introduce an extra auxiliary node, connected with every node, as the teleport to simplify the fully-connected teleport links of PageRank. The simplified digraph Laplacian dramatically reduces the number of edges without changing the required properties (irreducibility and aperiodicity). In addition, we theoretically analyze its properties and find that our Laplacian is an intermediate form between the undirected and the directed graph Laplacian, with the degree of conversion determined by the teleport probability α.

Moreover, inspired by the Inception network [37], we exploit kth-order proximity between two nodes in a digraph, which is determined through the shared kth-order neighborhood structures of these two nodes and does not require direct kth-hop paths between them. Using this notion, we design scalable receptive fields, which not only allow us to learn features of different scales within one convolutional layer but also yield larger receptive fields. This notion of proximity also appears in network analysis (HITS [20, 51]), psychology [30] and daily life: people who have a lot of common friends are more likely to be friends. In this way, we avoid the unbalanced receptive fields caused by asymmetric paths in digraphs. Besides, to obtain an r-range receptive field, our model only requires stacking $\log_k r$ layers instead of the r GCN layers of conventional approaches. Through experiments, we empirically show that Digraph Inception Convolutional Networks (DiGCN) outperforms competitive baselines. Additionally, our digraph convolution is superior to GCN's convolution on mainstream directed-graph benchmarks, with over 20% higher accuracy on the CORA-ML dataset. Our implementation is available at https://github.com/flyingtango/DiGCN.

2 Digraph Convolution

In this section, we first give the definition of the digraph Laplacian based on PageRank, which is too dense to perform convolution efficiently. We then simplify it using personalized PageRank and analyze its properties. Finally, we give the definition of digraph convolution based on the above operations.
2.1 Digraph Laplacian based on PageRank

Formally, given a digraph (directed graph) $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, its adjacency matrix can be denoted as $\mathbf{A} \in \{0,1\}^{n \times n}$, where $n = |\mathcal{V}|$. The nodes are described by the feature matrix $\mathbf{X} \in \mathbb{R}^{n \times c}$, with $c$ features per node. GCN [19] proposes the spectral graph convolution $\mathbf{Z}_u = \hat{\mathbf{A}}_u \mathbf{X} \Theta$, where $\mathbf{Z}_u \in \mathbb{R}^{n \times d}$ is the convolved result with output dimension $d$, $\Theta \in \mathbb{R}^{c \times d}$ is the trainable weight and $\hat{\mathbf{A}}_u$ is the normalized self-looped version of the undirected adjacency matrix $\mathbf{A}_u$ (see [19]). GCN and its variants need the undirected symmetric adjacency matrix $\mathbf{A}_u$ as input; therefore, they transform the asymmetric $\mathbf{A}$ into a symmetric form by relaxing the direction structure of digraphs, e.g., letting $\mathbf{A}_u = (\mathbf{A} + \mathbf{A}^T)/2$ in their experiments.¹

Noticing the inherent connections between the graph Laplacian and the stationary distribution of PageRank [29], we can use the properties of Markov chains to help solve the problem in digraphs. Given a digraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, a random walk on $\mathcal{G}$ is a Markov process with transition matrix $\mathbf{P}_{rw} = \mathbf{D}^{-1}\mathbf{A}$, where the diagonal degree matrix satisfies $\mathbf{D}(i,i) = \sum_j \mathbf{A}(i,j)$. $\mathcal{G}$ may contain isolated nodes in the periphery or may form a bipartite graph. Thus, $\mathbf{P}_{rw}$ is not necessarily irreducible and aperiodic, and we cannot guarantee that this random walk has a unique stationary distribution. In order to relax this constraint, we slightly modify the random walk into PageRank, which adds a small chance of teleporting back to every node. The PageRank transition matrix is defined as $\mathbf{P}_{pr} = (1-\alpha)\mathbf{P}_{rw} + \frac{\alpha}{n}\mathbf{1}_{n \times n}$, where $\alpha \in (0,1)$ is the teleport probability [8], controlled so that the probability $\frac{\alpha}{n}$ stays in a small range. It is easy to prove that $\mathbf{P}_{pr}$ is irreducible and aperiodic; thus, it has a unique left eigenvector $\pi_{pr}$ (also called the Perron vector), which is strictly positive with eigenvalue 1 according to Perron-Frobenius theory [5]. The row vector $\pi_{pr}$ corresponds to the stationary distribution of $\mathbf{P}_{pr}$ and satisfies $\pi_{pr}(i) = \sum_{j:(j,i)\in\mathcal{E}} \pi_{pr}(j)\,\mathbf{P}_{pr}(j,i)$. That is, the probability of finding the walk at vertex $i$ is the sum of all the incoming probabilities from vertices $j$ that have a directed edge to $i$. Thus, $\pi_{pr}$ plays a role analogous to the node degree matrix $\mathbf{D}_u$ of an undirected graph, reflecting the connectivity between nodes [15]. Using this property, we define the digraph Laplacian based on PageRank, $\mathbf{L}_{pr}$, in symmetric normalized form [10] as follows:

$$\mathbf{L}_{pr} = \mathbf{I} - \frac{1}{2}\left(\Pi_{pr}^{\frac{1}{2}} \mathbf{P}_{pr} \Pi_{pr}^{-\frac{1}{2}} + \Pi_{pr}^{-\frac{1}{2}} \mathbf{P}_{pr}^{T} \Pi_{pr}^{\frac{1}{2}}\right), \qquad (1)$$

where we use $\Pi_{pr} = \frac{1}{\|\pi_{pr}\|_1}\mathrm{Diag}(\pi_{pr})$ to replace $\mathbf{D}_u$ in the undirected graph Laplacian [19]. In contrast to $\mathbf{P}_{pr}$, this matrix is symmetric. Likewise, another work [26] also employs this idea to solve the digraph problem. However, it is defined on strongly connected digraphs and hence is not universally applicable to arbitrary digraphs; our method easily generalizes to it as $\alpha \to 0$.

Adding a chance of teleporting back to every node guarantees that $\pi_{pr}$ exists, but at the same time makes $\mathbf{L}_{pr}$ a dense $n \times n$ matrix. Using this Laplacian greatly increases the computational overhead of the convolution operation and requires $\mathcal{O}(n^2)$ memory for training (see the time usage in Section 6). To deal with this, we propose a simplified sparse Laplacian based on personalized PageRank.

¹There are various ways to construct an undirected graph from a digraph. In this paper, we consider one of the most commonly used methods, which averages the edge weights when combining them.

2.2 Approximate Digraph Laplacian based on Personalized PageRank

To solve this issue, reconsider the PageRank equation $\mathbf{P}_{pr} = (1-\alpha)\mathbf{P}_{rw} + \frac{\alpha}{n}\mathbf{1}_{n \times n}$.
Instead of viewing it as a combination of the random walk $\mathbf{P}_{rw}$ with a fully-connected teleport transition matrix, we can also view it as a variant of the personalized PageRank matrix that uses all nodes as teleports. To retain these properties while keeping the Laplacian sparse, we design an auxiliary-node scheme.

Using an Auxiliary Node as Teleport. More precisely, we introduce an auxiliary node $\xi \notin \mathcal{V}$ as the personalized PageRank teleport set $\mathcal{T} = \{\xi\}$. Based on it, we define the transition matrix of personalized PageRank $\tilde{\mathbf{P}}_{ppr}$ in the graph $\tilde{\mathcal{G}}_{ppr}$ as follows:

$$\tilde{\mathbf{P}}_{ppr} = \begin{bmatrix} (1-\alpha)\tilde{\mathbf{P}} & \alpha \mathbf{1}_{n \times 1} \\ \frac{1}{n}\mathbf{1}_{1 \times n} & 0 \end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}, \qquad (2)$$

where $\tilde{\mathbf{P}} = \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_{n \times n}$ denotes the adjacency matrix with added self-loops and $\tilde{\mathbf{D}}(i,i) = \sum_j \tilde{\mathbf{A}}(i,j)$. Adding self-loops makes $\tilde{\mathbf{P}}_{ppr}$ aperiodic, since the greatest common divisor of the lengths of its cycles is one. Meanwhile, each node in $\mathcal{V}$ has probability $\alpha$ of linking to $\xi$ and $\xi$ has probability $1/n$ of teleporting back to every node in $\mathcal{V}$, which guarantees that $\tilde{\mathbf{P}}_{ppr}$ is irreducible. Thus, $\tilde{\mathbf{P}}_{ppr}$ has a unique left eigenvector $\pi_{ppr} \in \mathbb{R}^{n+1}$ that is strictly positive with eigenvalue 1.

Approximate Digraph Laplacian. Our target is to find the Laplacian of $\tilde{\mathbf{P}}$ for spectral analysis; however, $\tilde{\mathbf{P}}$ is not necessarily irreducible, which means the eigenvector $\tilde{\pi} \in \mathbb{R}^n$ associated with the largest eigenvalue of $\tilde{\mathbf{P}}$ is not unique. Thus, we use the stationary distribution of $\tilde{\mathbf{P}}_{ppr}$ to approximate the stationary distribution of $\tilde{\mathbf{P}}$. We can split $\pi_{ppr}$ into two parts: $\pi_{ppr} = (\pi_{appr}, \pi_{\xi})$, where $\pi_{appr} \in \mathbb{R}^n$ is the unique stationary distribution over the first $n$ nodes and $\pi_{\xi} \in \mathbb{R}^1$ is the unique stationary distribution of the auxiliary node $\xi$. $\pi_{appr}$ converges to the stationary distribution of $\tilde{\mathbf{P}}$ according to THEOREM 1.

THEOREM 1 Based on the above definitions, when the teleport probability $\alpha \to 0$, $\|\pi_{appr}\tilde{\mathbf{P}} - \pi_{appr}\|_2 \to 0$.

We can control $\alpha$ in a small range and then simplify Equation 1 to:

$$\mathbf{L}_{appr} = \mathbf{I} - \frac{1}{2}\left(\tilde{\Pi}^{\frac{1}{2}} \tilde{\mathbf{P}} \tilde{\Pi}^{-\frac{1}{2}} + \tilde{\Pi}^{-\frac{1}{2}} \tilde{\mathbf{P}}^{T} \tilde{\Pi}^{\frac{1}{2}}\right) \approx \mathbf{I} - \frac{1}{2}\left(\Pi_{appr}^{\frac{1}{2}} \tilde{\mathbf{P}} \Pi_{appr}^{-\frac{1}{2}} + \Pi_{appr}^{-\frac{1}{2}} \tilde{\mathbf{P}}^{T} \Pi_{appr}^{\frac{1}{2}}\right), \qquad (3)$$

where $\tilde{\Pi} = \frac{1}{\|\tilde{\pi}\|_1}\mathrm{Diag}(\tilde{\pi})$ and $\Pi_{appr} = \frac{1}{\|\pi_{appr}\|_1}\mathrm{Diag}(\pi_{appr})$. Note that $\mathbf{L}_{appr}$ retains the graph's sparsity of $\mathcal{O}(|\mathcal{E}|)$. Next, we give the conditions for convergence to other forms in THEOREM 2.

THEOREM 2 Based on the above definitions, given an input graph $\mathcal{G}$, when the teleport probability $\alpha \to 1$, $\Pi_{appr} \to \frac{1}{n}\mathbf{I}_{n \times n}$ and the approximate digraph Laplacian converges as $\mathbf{L}_{appr} \to \mathbf{I} - \frac{1}{2}\left(\tilde{\mathbf{P}} + \tilde{\mathbf{P}}^T\right)$, which is a trivial-symmetric Laplacian matrix. Specially, if $\mathcal{G}$ is undirected, then the approximate digraph Laplacian converges as $\mathbf{L}_{appr} \to \mathbf{I} - \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, which is the random-walk normalized Laplacian.

Generalization. We show in THEOREM 2 that two commonly used undirected Laplacian matrices are special cases of our method under certain conditions: the trivial-symmetric one mentioned in Section 2.1 and the random-walk normalized one. When $\alpha$ tends to 1, the form of our method becomes closer to the form of the undirected graph Laplacian. That is to say, $\alpha$ controls the degree of conversion from a directed form to an undirected form: a smaller $\alpha$ retains more directed properties, and vice versa. Proofs of THEOREMS 1 and 2 are provided in the Supplementary Material.

2.3 Digraph Convolution

Since the digraph Laplacian defined in Equation 3 is symmetric, we can follow the spectral analysis in GCN [19] to derive the definition of the digraph convolution as:

$$\mathbf{Z} = \frac{1}{2}\left(\Pi_{appr}^{\frac{1}{2}} \tilde{\mathbf{P}} \Pi_{appr}^{-\frac{1}{2}} + \Pi_{appr}^{-\frac{1}{2}} \tilde{\mathbf{P}}^{T} \Pi_{appr}^{\frac{1}{2}}\right)\mathbf{X}\Theta, \qquad (4)$$

where $\mathbf{Z} \in \mathbb{R}^{n \times d}$ is the convolved result with output dimension $d$, $\mathbf{X} \in \mathbb{R}^{n \times c}$ is the node feature matrix and $\Theta \in \mathbb{R}^{c \times d}$ is the trainable weight. Note that we carry out row normalization on the input weighted adjacency matrix.
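For concreteness, the following NumPy sketch builds the propagation matrix of Equation 4 from a dense adjacency matrix, following the construction above. It is an illustrative dense implementation (the stationary distribution is obtained here by a full eigendecomposition), not the released sparse code.

```python
import numpy as np

def digraph_propagation(A, alpha=0.1):
    """Approximate digraph propagation matrix of Eq. (4) (dense sketch)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                              # add self-loops
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)     # row-normalized transition matrix P~

    # Personalized PageRank transition matrix with one auxiliary teleport node, Eq. (2).
    P_ppr = np.zeros((n + 1, n + 1))
    P_ppr[:n, :n] = (1 - alpha) * P
    P_ppr[:n, n] = alpha                                 # every node -> auxiliary node
    P_ppr[n, :n] = 1.0 / n                               # auxiliary node -> every node

    # Left Perron eigenvector of P_ppr (eigenvalue 1) gives the stationary distribution.
    vals, vecs = np.linalg.eig(P_ppr.T)
    pi = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    pi_appr = pi[:n] / pi[:n].sum()                      # drop the auxiliary node, renormalize

    Pi_pos = np.diag(pi_appr ** 0.5)                     # Pi_appr^{1/2}
    Pi_neg = np.diag(pi_appr ** -0.5)                    # Pi_appr^{-1/2}
    return 0.5 * (Pi_pos @ P @ Pi_neg + Pi_neg @ P.T @ Pi_pos)

# One digraph convolution (Eq. (4)):  Z = digraph_propagation(A) @ X @ Theta
```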
This propagation scheme has complexity $\mathcal{O}(|\mathcal{E}|cd)$, the same as GCN [19], since the digraph Laplacian is sparse and can be calculated during preprocessing.

3 Digraph Inception Network

In this section, we first introduce kth-order proximity as a scalable receptive field, and then we present DiGCN, a multi-scale Inception network, to learn features of different scales in digraphs.

3.1 Scalable Receptive Field for Digraphs based on kth-order Proximity

We start by explaining how features spread in GCNs. Xu et al. [47] have shown that the information of node $i$ spreads to node $j$ in an analogous random-walk manner, which means that paths are the medium of feature transmission and the size of the receptive field is determined by the length of paths in a graph. However, in a digraph, long paths only exist between a few nodes and are often not bidirectional, which is not conducive to obtaining global features. Meanwhile, different communities have various in- and out-degrees, which may cause unbalanced receptive fields (paths) in digraphs. To solve this problem, we propose kth-order proximity in digraphs, which not only obtains a node's features from its directly adjacent nodes but also extracts hidden information from kth-order neighbor nodes. That is, if two nodes share common neighbors, they tend to be similar.

DEFINITION 1 kth-order Proximity. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, for $k \geq 2$, if there exists a node $e \in \mathcal{V}$ and a path between nodes $i$ and $j$ ($i, j \in \mathcal{V}$) of the form $v_i \to \cdots \to v_e \leftarrow \cdots \leftarrow v_j$, with $k-1$ edges on each side, we call this path a kth-order meeting path $\mathcal{M}^{(k)}_{i,j}$. Similarly, the kth-order diffusion path $\mathcal{D}^{(k)}_{i,j}$ is $v_i \leftarrow \cdots \leftarrow v_e \to \cdots \to v_j$, again with $k-1$ edges on each side. If both $\mathcal{M}^{(k)}_{i,j}$ and $\mathcal{D}^{(k)}_{i,j}$ exist between nodes $i$ and $j$, we say they are kth-order proximate and $e$ is their kth-order common neighbor. Note that a node is 0th-order proximate with itself and 1st-order proximate with its directly connected neighbors.

Figure 1: Illustration of the DiGCN model. For a digraph $\mathcal{G}$, we use kth-order proximity to generate $k$ receptive fields based on the input adjacency matrix $\mathbf{A}$, as shown in Subfigure (a). We then convolve them with the input feature matrix $\mathbf{Z}^{(l-1)}_I$ and obtain the output $\mathbf{Z}^{(l)}_I$ after fusion. We encapsulate this process as an Inception block, shown in Subfigure (b), where $l \in \mathbb{Z}^+$ is the layer index and the initial input is $\mathbf{Z}^{(0)}_I = \mathbf{X}$. Multi-layer networks can be implemented by stacking Inception blocks.

The schematic of kth-order proximity is shown in Figure 1(a). Based on DEFINITION 1, we build a kth-order proximity matrix to connect similar nodes together.

DEFINITION 2 kth-order Proximity Matrix. In order to model the kth-order proximity, we define the kth-order proximity matrix $\mathbf{P}^{(k)}$ ($k \in \mathbb{Z}$, $k \geq 0$) in the graph $\mathcal{G}$:

$$\mathbf{P}^{(k)} = \begin{cases} \mathbf{I} & k = 0 \\ \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}} & k = 1 \\ \mathrm{Intersect}\!\left(\left(\mathbf{P}^{(1)}\right)^{k-1}\left(\mathbf{P}^{(1)T}\right)^{k-1},\ \left(\mathbf{P}^{(1)T}\right)^{k-1}\left(\mathbf{P}^{(1)}\right)^{k-1}\right)/2 & k \geq 2 \end{cases} \qquad (5)$$

$\tilde{\mathbf{A}}$ is the adjacency matrix of $\mathcal{G}$ with self-loops and $\tilde{\mathbf{D}}$ is the corresponding diagonal degree matrix. $\mathrm{Intersect}(\cdot)$ denotes the element-wise intersection of matrices: the sum is taken only at positions where both meeting and diffusion paths exist; otherwise the entry is 0. The kth-order proximity matrix $\mathbf{P}^{(k)}$ is symmetric for $k \geq 2$ because of the intersection operation. Here $k$ represents the distance between two similar nodes, that is, the size of the receptive field. We can obtain scalable receptive fields by setting different $k$.
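The following is a minimal dense sketch of Equation 5, reusing the row-normalized transition matrix defined above; the released implementation works on sparse matrices, and the Intersect operation is realized here with an element-wise mask.

```python
import numpy as np

def kth_order_proximity(A, k):
    """k-th order proximity matrix P^(k) of Eq. (5) (dense sketch)."""
    n = A.shape[0]
    if k == 0:
        return np.eye(n)
    A_tilde = A + np.eye(n)
    P1 = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # first-order transition matrix
    if k == 1:
        return P1
    F = np.linalg.matrix_power(P1, k - 1)               # (k-1)-step forward transitions
    meeting = F @ F.T        # i -> e <- j : k-th order meeting paths
    diffusion = F.T @ F      # i <- e -> j : k-th order diffusion paths
    both = (meeting > 0) & (diffusion > 0)               # element-wise intersection
    return np.where(both, (meeting + diffusion) / 2.0, 0.0)
```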
3.2 Multi-scale Inception Network Structure

Based on the proposed kth-order proximity matrix, we define the multi-scale digraph convolution as:

$$\mathbf{Z}^{(k)} = \begin{cases} \mathbf{X}\Theta^{(0)} & k = 0 \\ \frac{1}{2}\left(\Pi^{(1)\frac{1}{2}} \mathbf{P}^{(1)} \Pi^{(1)-\frac{1}{2}} + \Pi^{(1)-\frac{1}{2}} \mathbf{P}^{(1)T} \Pi^{(1)\frac{1}{2}}\right)\mathbf{X}\Theta^{(1)} & k = 1 \\ \mathbf{W}^{(k)-\frac{1}{2}} \mathbf{P}^{(k)} \mathbf{W}^{(k)-\frac{1}{2}} \mathbf{X}\Theta^{(k)} & k \geq 2 \end{cases} \qquad (6)$$

$\mathbf{Z}^{(k)} \in \mathbb{R}^{n \times d}$ are the convolved results with output dimension $d$, $\mathbf{X} \in \mathbb{R}^{n \times c}$ is the node feature matrix, $\mathbf{W}^{(k)}$ is the diagonalized weight matrix of $\mathbf{P}^{(k)}$ and $\Theta^{(k)} \in \mathbb{R}^{c \times d}$ are trainable weights. Note that when $k = 1$, $\mathbf{Z}^{(1)}$ is calculated by the digraph convolution with $\mathbf{P}^{(1)}$ from Section 2.3, and $\Pi^{(1)}$ is the corresponding approximate diagonalized eigenvector.

Inspired by the Inception module proposed in [37], we build the multi-scale digraph Inception network. We can compare $\mathbf{P}^{(k=0)}$ to a 1×1 convolution kernel and treat $\mathbf{Z}^{(k=0)}$ as a skip-connection term carrying non-smoothed features. Moreover, $\mathbf{Z}^{(k \geq 1)}$ is designed to encode multi-scale directed structural features. Finally, we use a fusion operation $\Gamma$ to fuse the multi-scale features together into an Inception block $\mathbf{Z}_I$:

$$\mathbf{Z}_I = \sigma\!\left(\Gamma\!\left(\mathbf{Z}^{(0)}, \mathbf{Z}^{(1)}, \ldots, \mathbf{Z}^{(k)}\right)\right), \qquad (7)$$

where $\sigma$ is the activation function. The fusion function $\Gamma$ can take various forms, such as normalization, summation and concatenation. In practice, we choose $\Gamma$ so that the feature dimension stays unchanged, i.e., $\mathbf{Z}_I \in \mathbb{R}^{n \times d}$, so that the same block can be stacked. The schematic of the Inception block is shown in Figure 1(b). We note that a recent work, SIGN, uses a similar Inception structure to handle large-scale graph learning [32]. Differently, it uses SGC [44] as the basic block, which is not applicable to digraphs, and concatenates blocks of different scales into an FC layer. However, concatenation is only one feature-fusion method, and in some cases its effect is not as good as summation; we illustrate this in Section 6.

Generalization to other Models. Our method, which uses kth-order proximity to enlarge the convolution receptive field, has strong generalization ability. In most spectral-based models, we can use our Inception block to replace the original layer (see Table 1).

Table 1: Our Inception block can generalize to other models with only a few parameter modifications. Here, $\Lambda$ is the Laplacian eigenvalue matrix defined in ChebNet [11] and $\mathbf{A}_u$ is the symmetric form of $\mathbf{A}$.

| Method | Undigraph | Digraph | Adj | Scale Range | Weights | Fusion Method |
|---|---|---|---|---|---|---|
| ChebNet [11] | ✓ | | Λ | [0, 1, ..., k] | Θ₀, ..., Θₖ | Sum |
| GCN [19] | ✓ | | Aᵤ | 1 | Θ | None |
| SGC [44] | ✓ | | Aᵤ | k | Θ | None |
| SIGN [32] | ✓ | | Aᵤ | [0, 1, ..., k] | Θ₀, ..., Θₖ | Concat |
| Ours (DiGCN) | ✓ | ✓ | A | [0, 1, ..., k] | Θ₀, ..., Θₖ | Any Γ |

Taking SGC [44] as an example, we can generalize our method to SGC by replacing the original kth power of the adjacency matrix with the Inception block. Experimental results in Section 6 show that integrating our method helps improve accuracy.

Time and Space Complexity. For the digraph convolution defined in Equation 4, we can use a sparse matrix to store the Laplacian. Since we use full-batch training, the memory cost for one adjacency matrix is $\mathcal{O}(|\mathcal{E}|)$.

Figure 2: Number of edges per Inception block.

For the Inception block defined in Equation 7, due to the asymmetry of digraphs mentioned in Section 3.1, long paths normally exist between only a few nodes and are not bidirectional. Thus, a naive use of kth-order paths would yield unbalanced receptive fields and introduce extra computational complexity. Both the intersection and the union of meeting and diffusion paths can handle this unbalancing problem. We compare the number of edges per Inception block for different $k$ on CORA-ML [6] and CITESEER [33], shown in Figure 2, and find that the number of edges in the kth-order matrix does not increase exponentially and that the intersection does help to reduce memory consumption.
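To make the block structure concrete, the sketch below assembles one Inception block from Equations 6 and 7, reusing the helpers defined earlier. It fixes the fusion Γ to summation and σ to ReLU for illustration, and it takes $\mathbf{W}^{(k)}$ to be the diagonal matrix of row sums of $\mathbf{P}^{(k)}$, which is our reading of "diagonalized weight matrix"; these choices are assumptions of the sketch, not the released implementation.

```python
import numpy as np

def inception_block(X, A, thetas, alpha=0.1):
    """One DiGCN Inception block (Eqs. (6)-(7)) with summation fusion and ReLU.

    thetas: list [Theta_0, ..., Theta_k] of weight matrices, one per scale;
    digraph_propagation and kth_order_proximity are the sketches given earlier.
    """
    outputs = [X @ thetas[0]]                                           # k = 0: non-smoothed skip term
    if len(thetas) > 1:
        outputs.append(digraph_propagation(A, alpha) @ X @ thetas[1])   # k = 1: digraph convolution
    for k in range(2, len(thetas)):
        Pk = kth_order_proximity(A, k)
        w = Pk.sum(axis=1)                                              # diagonal of W^(k) (row weights)
        w_inv_sqrt = np.divide(1.0, np.sqrt(w), out=np.zeros_like(w), where=w > 0)
        Pk_norm = w_inv_sqrt[:, None] * Pk * w_inv_sqrt[None, :]        # W^(-1/2) P^(k) W^(-1/2)
        outputs.append(Pk_norm @ X @ thetas[k])
    Z = sum(outputs)                                                    # fusion Gamma = summation
    return np.maximum(Z, 0.0)                                           # activation sigma = ReLU
```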
In practice, the memory cost for one Inception block is therefore $\mathcal{O}(k|\mathcal{E}|)$. However, the worst case does exist when the input graph is undirected and strongly connected. Although such graphs are not the target of our model, which mainly treats reducible digraphs, the worst-case space complexity is $\mathcal{O}(k|\mathcal{V}|^2)$. We can calculate the eigenvalue decomposition in Equation 3 during preprocessing and store the results, so this preprocessing has computational complexity $\mathcal{O}(|\mathcal{V}|^3)$. At the same time, we use sparse matrix multiplication. Thus, the complexity of the convolution is $\mathcal{O}(k|\mathcal{E}|cd)$.

4 Related Work

Digraph Convolution. Several works have tried to adapt GCNs to digraphs by looking for structural patterns and reformulating the graph [28, 39, 18]. However, these methods have the limitation that they rely on pre-defined structural examples and cannot handle complex structures that do not appear in the patterns. An alternative method [26] redefines the propagation scheme from a Markov-process view, but it only applies to strongly connected digraphs according to its definition. Our approach is universally applicable to any digraph, which is the biggest difference from these works.

PageRank in GCNs. Klicpera et al. [21] propose the PPNP model, which uses a propagation scheme based on personalized PageRank to handle the feature over-smoothing problem. Besides, PPRGo [7] increases the efficiency of PPNP by incorporating multi-hop neighborhood information in a single step. Note that for both PPNP and PPRGo, the basic form of the propagation matrix is $\mathbf{A}_{ppnp} = \alpha\left(\mathbf{I} - (1-\alpha)\mathbf{A}_u\right)^{-1}$, which is quite different from our digraph Laplacian in Equation 3. Furthermore, they use the symmetric matrix $\mathbf{A}_u$ in the propagation, which means they are not adapted to digraphs.

kth-order Proximity and Multi-scale Receptive Fields. Previous works have found powerful information-extraction capabilities in kth-order proximity [38, 51, 39]. However, they only consider first- and second-order proximity to find hidden information, whereas our method can adjust the proximity range as needed. There are several ways to achieve a multi-scale receptive field, such as k-hop methods [2, 11, 1, 44] and graph Inception [32, 46, 31]. In our model, we use kth-order proximity instead of the k-hop method because the latter may create unbalanced receptive fields in digraphs (see Section 3.1), and the existing graph Inception models are not suitable for digraphs (see Table 1).

5 Experimental Settings

We conduct extensive experiments to evaluate the effectiveness of our model. The Supplementary Material reports further details on the experiments and reproducibility.

Experimental Task. Node classification is a common task used to measure graph models. In this paper, we adopt the task of semi-supervised node classification in digraphs to verify the learning ability of the models. Compared with the common experiments for undirected graphs [19, 44, 21, 42], the challenge is that the given adjacency matrix $\mathbf{A}$ is asymmetric, which means message passing has direction. Based on the method proposed above, we build several simple models to deal with this problem. The task definition and specific model structures, including schematics, loss functions and configurations, are included in the Supplementary Material.

Baselines.
We compare our model to nine state-of-the-art models that can be divided into four main categories: 1) spectral-based GNNs, including ChebNet [11], GCN [19], SGC [44], APPNP [21] and InfoMax [41]; 2) spatial-based GNNs, containing GraphSage [16] and GAT [40]; 3) digraph GNNs, including DGCN [39] (we do not use [26] because it only applies to strongly connected graphs, which requires cropping the original datasets to meet its settings); and 4) graph Inception, represented by SIGN [32]. Their descriptions and settings are introduced in the Supplementary Material.

Datasets and Splitting. We use several digraph datasets, including the citation networks CORA-ML [6] and CITESEER [33], and the Amazon co-purchase networks AM-PHOTO and AM-COMPUTER [34]. The split of the datasets greatly affects the performance of the models [34, 21]. In particular, a single fixed split not only causes overfitting during training but also easily yields misleading results. Thus, in our experiments, we randomly split the datasets and perform multiple runs to obtain stable and reliable results. For the train/validation/test split, following the rules in GCN [19], we choose 20 labels per class for the training set, 500 labels for the validation set and the rest for the test set. Detailed descriptions are summarized in the Supplementary Material.

6 Experimental Results

Overall accuracy. The performance comparisons between our model and the baselines on the four datasets are reported in Table 2. We train all models for a maximum of 1000 epochs, stop early if the validation accuracy does not increase for 200 consecutive epochs, and report the mean test accuracy with standard deviation in percent (%) averaged over 20 random dataset splits with random weight initialization. Our methods achieve state-of-the-art results on all datasets. Notice that the spectral-based models, including ChebNet, GCN, SGC and InfoMax, do not perform well on digraph datasets compared to their good performance on undirected graphs. This is mainly because these models have limited ability to obtain features from the surroundings using asymmetric adjacency matrices. However, APPNP is an exception: it allows features to propagate randomly with a certain teleport probability, which breaks through the path limitation and achieves good results in digraphs. The spatial-based methods and ours have similar results, which shows that both kinds of methods suit digraphs well. Moreover, DGCN performs well on most datasets; however, it uses both in- and out-degree proximity matrices to obtain structural features in digraphs, which leads to out-of-memory on AM-COMPUTER. Meanwhile, SIGN uses SGC as its basic module; thus, even with the Inception method, SIGN does not perform well in digraphs (see the analysis in Section 3.2).

Table 2: Overall accuracy and training time. "w/ pr" means using $\mathbf{L}_{pr}$; "w/ appr" means using $\mathbf{L}_{appr}$; "w/o IB" means using digraph convolution only; "w/ IB" means using the Inception block. The best results are highlighted in bold and the second best in italics.

| Model | CORA-ML (acc / time) | CITESEER (acc / time) | AM-PHOTO (acc / time) | AM-COMPUTER (acc / time) |
|---|---|---|---|---|
| ChebNet [11] | 64.02±1.5 / 7.23 | 56.46±1.4 / 7.45 | 80.91±1.0 / 10.52 | 73.25±0.8 / 16.96 |
| GCN [19] | 53.11±0.8 / 4.48 | 54.36±0.5 / 4.80 | 53.20±0.4 / 4.86 | 60.50±1.6 / 5.04 |
| SGC [44] | 51.14±0.6 / 1.92 | 44.07±3.5 / 3.58 | 71.25±1.3 / 2.31 | 76.17±0.1 / 3.68 |
| APPNP [21] | 70.07±1.1 / 6.84 | 65.39±0.9 / 6.94 | 79.37±0.9 / 6.72 | 63.16±1.4 / 6.47 |
| InfoMax [41] | 58.00±2.4 / 4.11 | 60.51±1.7 / 4.85 | 74.40±1.2 / 31.80 | 47.32±0.7 / 41.96 |
| GraphSage [16] | 72.06±0.9 / 6.22 | 63.19±0.7 / 6.21 | 87.57±0.9 / 8.52 | 79.29±1.3 / 14.49 |
| GAT [40] | 71.91±0.9 / 6.02 | 63.03±0.6 / 6.12 | *89.10±0.7* / 8.83 | 79.45±1.5 / 14.66 |
| DGCN [39] | 75.02±0.5 / 6.53 | *66.00±0.4* / 6.84 | 83.66±0.8 / 36.29 | OOM / - |
| SIGN [32] | 66.47±0.9 / 2.81 | 60.69±0.4 / 2.96 | 74.13±1.0 / 5.33 | 69.40±4.8 / 4.97 |
| Ours (w/ pr, w/o IB) | *77.11±0.5* / 39.13 | 64.77±0.6 / 47.19 | OOM / - | OOM / - |
| Ours (w/ appr, w/o IB) | 77.01±0.4 / 2.71 | 64.92±0.3 / 2.69 | 88.72±0.3 / 2.95 | *85.55±0.4* / 4.23 |
| Ours (w/ appr, w/ IB) | **80.28±0.5** / 6.38 | **66.11±0.7** / 6.42 | **90.02±0.5** / 11.77 | **85.94±0.5** / 26.63 |

* OOM stands for out of memory (see the efficiency analysis in Section 2.1).

Training time. With the same training settings, we measure the convergence speed of the models by the average training time per run in seconds (s) in Table 2. Our models obtain similar accuracy with both Laplacians, but $\mathbf{L}_{pr}$ can only be applied to moderately sized graphs, while $\mathbf{L}_{appr}$ scales to large graphs. Compared with the spectral-based methods, the overall speed of our model without the Inception block is similar to SGC, since the Laplacian is precomputed. On AM-PHOTO and AM-COMPUTER, which have larger scales, our model is on average 30% faster than GCN. The accuracy of our model improves significantly while the speed decreases after using the Inception block, which is consistent with the complexity analysis in Section 3.2.

Ablation study. We validate the effectiveness of the components; the resulting accuracies are shown in Table 2. Comparing the model with $\mathbf{L}_{appr}$ and the model with $\mathbf{L}_{pr}$, we find that the approximate method not only achieves similar accuracy but also saves training time and memory. Meanwhile, we find that the combination of $\mathbf{L}_{appr}$ and the Inception block brings a significant improvement in accuracy. This substantially validates that scalable receptive fields do help to learn features from the neighborhood.

Effect of teleport probability α. Figure 3(a) shows the effect of the hyperparameter $\alpha$ on test accuracy and structural retention. Referring to the assumption in Section 2.2, we define structural retention as $S = \mathrm{KL}(\pi_{appr}, \pi_{appr}\tilde{\mathbf{P}})^{-1}$, where KL denotes the KL divergence. We use $S$ to measure how much directed structural information is retained after the approximation: the smaller $S$, the less information remains. According to THEOREM 1, $\alpha$ needs to be close to 0 to retain digraph structural information in the Laplacian; however, we find that a higher $\alpha$ improves accuracy slightly. In view of this, we choose $\alpha \in [0.05, 0.2]$ to balance structural retention and accuracy. $\alpha$ should be adjusted for different datasets [21], but in order to maintain consistency, we set $\alpha = 0.1$ in all experiments.

Link prediction in digraphs. We use the link prediction task in digraphs to show that our model is able to obtain more structural information. In this task, we split the edges of a digraph into positive and negative train/validation/test edges and compare the results with GCN over 20 runs for a maximum of 500 epochs. Figure 3(b) shows that our model (w/ appr and w/o IB) outperforms GCN for link prediction on all datasets. This is mainly because we take the direction of the edges into account when calculating $\mathbf{L}_{appr}$, which allows us to obtain more accurate structural information in digraphs.

Training time at different graph scales. We report the mean training time per epoch in milliseconds (ms) over 200 epochs on simulated random graphs using the digraph convolution. We construct a simple random graph with $N$ nodes and assign $2N$ edges uniformly at random. We take the node index matrix as the input feature matrix and give every node the same label.
Figure 3(c) summarizes the results and shows that our model can handle about 10 million nodes on one GPU (11GB).

Figure 3: (a) effect of α on accuracy and structural retention S; (b) link prediction results on different datasets; (c) training time per epoch (ms) on random digraphs of different sizes (number of nodes).

Figure 4: (a) effect of model width k; (b) fusion and activation functions on AM-PHOTO with different numbers of layers (* indicates failure to converge); (c) generalization to SGC using different k.

Depth and width of Inception block. How to balance model depth and width is a vital issue for the Inception block. Figure 4(a) shows that for a single-layer model, the improvement in validation accuracy is not significant for $k > 2$, which means a larger receptive field cannot obtain more effective information on the small-scale dataset CITESEER. We therefore keep $k = 2$, carry out a grid search on model depth and choose 3 layers. To intuitively compare the impact of depth under the same receptive range, we choose GAT as the baseline, which can obtain various ranges by adjusting the head size without stacking layers. From Table 3, we observe that larger receptive fields help our model perform better. These results are also consistent with the observation that a moderate number of layers is enough to learn features effectively.

Table 3: Results under various depths. Our model sets $k = 2$, uses $\mathbf{L}_{appr}$ as the Laplacian and Sum as the fusion operation. "Range" means the range of the receptive field. The best results are highlighted in bold.

| Method | Range | Settings | CORA-ML | CITESEER | AM-PHOTO | AM-COMPUTER |
|---|---|---|---|---|---|---|
| GAT [40] | 2 | head=2 | 71.33±1.4 | 63.33±0.7 | 81.12±1.5 | 75.12±3.2 |
| Ours | 2 | layer=1 | 72.14±1.0 | 62.89±0.4 | 75.43±0.5 | 64.17±0.5 |
| GAT [40] | 4 | head=4 | 71.65±1.0 | 63.30±0.7 | 86.03±1.0 | 77.57±1.5 |
| Ours | 4 | layer=2 | 76.62±0.5 | 63.98±0.5 | 87.71±0.9 | 82.36±0.7 |
| GAT [40] | 8 | head=8 | 71.62±0.8 | 63.17±0.6 | 87.04±1.0 | 78.22±1.7 |
| Ours | 8 | layer=3 | **80.28±0.5** | **66.11±0.7** | **90.02±0.5** | **85.94±0.5** |
| GAT [40] | 16 | head=16 | 71.91±0.9 | 63.03±0.6 | 89.10±0.7 | 79.45±1.5 |
| Ours | 16 | layer=4 | 79.95±0.8 | 64.00±1.0 | 89.81±0.9 | 83.36±0.7 |

Fusion operation and activation function. We show the results in Figure 4(b), using Sum and Concat to denote summation and concatenation, respectively. We find that the choice of fusion and activation functions needs to match the complexity of the model. When the model is shallow, Concat performs better because it provides more parameters; Sum achieves better results in deep models because it requires fewer parameters, which helps prevent overfitting. Since we use the kth-order proximity Inception block, linear combinations can achieve stable results on smaller datasets; using a non-linear activation function (ReLU) may result in models that are too complex to learn features effectively.

Generalization to other models (SGC). To test the generalization ability of our Inception block, we use it to replace the original layer in SGC [44]. The generalized SGC model is denoted SGC+DiGCN. In addition, we test the generalized model with $k = 1$ and $k = 2$; the results are shown in Figure 4(c). Whether $k = 1$ or $k = 2$, our generalized model outperforms the original model on all datasets, which shows that the multi-scale receptive field helps the model obtain more surrounding information. Meanwhile, our method has good generalization ability because of its simple structure, which can be plugged into existing models easily.
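As a toy illustration of this plug-in use (not the released SGC+DiGCN code), the snippet below replaces SGC's kth-power propagation with the Inception block sketched in Section 3.2 and applies a single SGC-style linear classifier; all shapes, random graphs and weights here are hypothetical and untrained.

```python
import numpy as np

# Reuses inception_block (and hence digraph_propagation, kth_order_proximity) from above.
rng = np.random.default_rng(0)
n, c, d, num_classes = 200, 16, 32, 7
A = (rng.random((n, n)) < 0.05).astype(float)                    # random digraph adjacency
X = rng.standard_normal((n, c))                                  # node features

thetas = [0.1 * rng.standard_normal((c, d)) for _ in range(3)]   # scales k = 0, 1, 2
features = inception_block(X, A, thetas)                         # multi-scale digraph features
W = 0.1 * rng.standard_normal((d, num_classes))                  # linear classifier, as in SGC
logits = features @ W                                            # train W and thetas with cross-entropy
```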
7 Conclusion and Future Work

In this paper, we present the novel Digraph Inception Convolutional Networks (DiGCN), which can effectively learn digraph representations. We theoretically extend spectral-based graph convolution to digraphs and further simplify it. Besides, we define kth-order proximity and design the digraph Inception network to learn multi-scale features. This simple and scalable model can not only learn digraph structure but also extract hidden information through kth-order proximity relationships. Finally, we use several tasks on various real-world datasets to validate the effectiveness and generalization capability of our model. The results show that our model outperforms several state-of-the-art methods. Due to full-batch training, our model cannot be applied to very large-scale graphs, and we will consider adapting it to mini-batch training in the future. In addition, we will study how to combine our model with existing models to solve more complex tasks, e.g., in computer vision and NLP.

Broader Impact

GCNs can be applied to a wide range of applications, including image segmentation [27], speech recognition [14], recommender systems [17], point clouds [50, 24], traffic prediction [25] and many more [45]. Our method can help expand the graph types in the above application scenarios from undirected to directed and obtain multi-scale features from high-order hidden directed structure. For traffic prediction, our method can be used in map applications to obtain more fine-grained and accurate predictions. This requires users to provide location information, which carries a risk of privacy leakage. The same concerns also arise in social network analysis [38], person re-ID [35] and NLP [49], which use graph convolutional networks as their feature-extraction methods. Another potential risk is that our model may be adversarially attacked by adding new nodes or deleting existing edges. For example, in a graph-based recommender system, our model may produce completely different recommendation results when attacked. We see opportunities for research applying DiGCN to beneficial purposes, such as investigating the ability of DiGCN to discover hidden complex directed structure, the limitations of the approximation based on personalized PageRank, and the feature over-smoothing problem in digraphs. We also encourage follow-up research to design derivative methods for different tasks based on our method.

Acknowledgments and Disclosure of Funding

The authors would like to thank Jinyang Wang, Miqi Wu and Yongxing Dai for their great help in discussion and proofreading. This research is supported in part by the following grants: MOE2019-T3-1-010, MOE2017-T2-2-153, NRF-RSS2016-004.

References

[1] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, N-GCN: Multi-scale graph convolution for semi-supervised node classification, arXiv preprint arXiv:1802.08888, 2018.
[2] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. V. Steeg, and A. Galstyan, MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing, arXiv preprint arXiv:1905.00067, 2019.
[3] B. Bahmani, A. Chowdhury, and A. Goel, Fast incremental and personalized PageRank, Proceedings of the VLDB Endowment, vol. 4, no. 3, pp. 173-184, 2010.
[4] M. Barigozzi and C. Brownlees, NETS: Network estimation for time series, Journal of Applied Econometrics, vol. 34, no. 3, pp. 347-364, 2019.
[5] G. Barker and H. Schneider, Algebraic Perron-Frobenius theory, Linear Algebra and its Applications, vol. 11, no. 3, pp. 219-233, 1975.
[6] A. Bojchevski and S. Günnemann, Deep Gaussian embedding of attributed graphs: Unsupervised inductive learning via ranking, arXiv preprint arXiv:1707.03815, 2017.
[7] A. Bojchevski, J. Klicpera, B. Perozzi, M. Blais, A. Kapoor, M. Lukasik, and S. Günnemann, Is PageRank all you need for scalable graph neural networks? in KDD, MLG Workshop, 2019.
[8] S. Brin and L. Page, Reprint of: The anatomy of a large-scale hypertextual web search engine, Computer Networks, vol. 56, no. 18, pp. 3825-3833, 2012.
[9] Y. Cai, L. Ge, J. Liu, J. Cai, T.-J. Cham, J. Yuan, and N. M. Thalmann, Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2272-2281.
[10] F. Chung, Laplacians and the Cheeger inequality for directed graphs, Annals of Combinatorics, vol. 9, no. 1, pp. 1-19, 2005.
[11] M. Defferrard, X. Bresson, and P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in NIPS, 2016, pp. 3844-3852.
[12] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, Convolutional 2D knowledge graph embeddings, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[13] M. Fey, Just jump: Dynamic neighborhood aggregation in graph neural networks, arXiv preprint arXiv:1904.04849, 2019.
[14] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, Convolutional sequence to sequence learning, in ICML. JMLR.org, 2017, pp. 1243-1252.
[15] D. Gleich, Hierarchical directed spectral graph partitioning, Information Networks, 2006.
[16] W. Hamilton, Z. Ying, and J. Leskovec, Inductive representation learning on large graphs, in NIPS, 2017, pp. 1024-1034.
[17] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, arXiv preprint arXiv:2002.02126, 2020.
[18] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing, Rethinking knowledge graph propagation for zero-shot learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11487-11496.
[19] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, 2016.
[20] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM (JACM), vol. 46, no. 5, pp. 604-632, 1999.
[21] J. Klicpera, A. Bojchevski, and S. Günnemann, Predict then propagate: Graph neural networks meet personalized PageRank, arXiv preprint arXiv:1810.05997, 2018.
[22] G. Li, M. Müller, A. Thabet, and B. Ghanem, DeepGCNs: Can GCNs go as deep as CNNs? arXiv preprint arXiv:1904.03751, 2019.
[23] R. Li, S. Wang, F. Zhu, and J. Huang, Adaptive graph convolutional neural networks, in AAAI, 2018.
[24] X. Li, C. Li, Z. Tong, A. Lim, J. Yuan, Y. Wu, J. Tang, and R. Huang, Campus3D: A photogrammetry point cloud benchmark for hierarchical understanding of outdoor scene, in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 238-246.
[25] Y. Liang, S. Ke, J. Zhang, X. Yi, and Y. Zheng, GeoMAN: Multi-level attention networks for geo-sensory time series prediction, in IJCAI, 2018, pp. 3428-3434.
[26] Y. Ma, J. Hao, Y. Yang, H. Li, J. Jin, and G. Chen, Spectral-based graph convolutional network for directed graphs, arXiv preprint arXiv:1907.08990, 2019.
[27] F. Milletari, N. Navab, and S.-A. Ahmadi, V-Net: Fully convolutional neural networks for volumetric medical image segmentation, in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565-571.
[28] F. Monti, K. Otness, and M. M. Bronstein, MotifNet: A motif-based graph convolutional network for directed graphs, in 2018 IEEE Data Science Workshop (DSW). IEEE, 2018, pp. 225-228.
[29] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, Tech. Rep., 1999.
[30] S. M. Reich, K. Subrahmanyam, and G. Espinoza, Friending, IMing, and hanging out face-to-face: Overlap in adolescents' online and offline social networks, Developmental Psychology, vol. 48, no. 2, p. 356, 2012.
[31] Y. Rong, W. Huang, T. Xu, and J. Huang, DropEdge: Towards deep graph convolutional networks on node classification, in ICLR, 2019.
[32] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti, SIGN: Scalable inception graph neural networks, arXiv preprint arXiv:2004.11198, 2020.
[33] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, Collective classification in network data, AI Magazine, vol. 29, no. 3, pp. 93-93, 2008.
[34] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, Pitfalls of graph neural network evaluation, arXiv preprint arXiv:1811.05868, 2018.
[35] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang, Person re-identification with deep similarity-guided graph neural network, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 486-504.
[36] J. Sun, J. Zhang, Q. Li, X. Yi, Y. Liang, and Y. Zheng, Predicting citywide crowd flows in irregular regions using multi-view graph convolutional networks, IEEE Transactions on Knowledge and Data Engineering, 2020.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception architecture for computer vision, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[38] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, LINE: Large-scale information network embedding, in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067-1077.
[39] Z. Tong, Y. Liang, C. Sun, D. S. Rosenblum, and A. Lim, Directed graph convolutional network, arXiv preprint arXiv:2004.13970, 2020.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903, 2017.
[41] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, Deep graph infomax, arXiv preprint arXiv:1809.10341, 2018.
[42] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, GraphCrop: Subgraph cropping for graph classification, arXiv preprint arXiv:2009.10564, 2020.
[43] Y. Wang, W. Wang, Y. Liang, Y. Cai, J. Liu, and B. Hooi, NodeAug: Semi-supervised node classification with data augmentation, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 207-217.
[44] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, Simplifying graph convolutional networks, arXiv preprint arXiv:1902.07153, 2019.
[45] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, A comprehensive survey on graph neural networks, arXiv preprint arXiv:1901.00596, 2019.
[46] Y. Xiong, Y. Zhang, X. Kong, H. Chen, and Y. Zhu, GraphInception: Convolutional neural networks for collective classification in heterogeneous information networks, IEEE Transactions on Knowledge and Data Engineering, 2019.
[47] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, Representation learning on graphs with jumping knowledge networks, arXiv preprint arXiv:1806.03536, 2018.
[48] Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, GNNExplainer: Generating explanations for graph neural networks, in NIPS, 2019, pp. 9240-9251.
[49] N. Zhang, S. Deng, Z. Sun, G. Wang, X. Chen, W. Zhang, and H. Chen, Long-tail relation extraction via knowledge graph embeddings and graph convolution networks, arXiv preprint arXiv:1903.01306, 2019.
[50] Y. Zhao, Y. Wu, C. Chen, and A. Lim, On isometry robustness of deep 3D point cloud models under adversarial attacks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1201-1210.
[51] D. Zhou, T. Hofmann, and B. Schölkopf, Semi-supervised learning on directed graphs, in NIPS, 2005, pp. 1633-1640.