The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Gaussian-Induced Convolution for Graphs

Jiatao Jiang, Zhen Cui, Chunyan Xu, Jian Yang
PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
{jiatao, zhen.cui, cyx, csjyang}@njust.edu.cn

Abstract

Learning representation on graphs plays a crucial role in numerous tasks of pattern recognition. However, different from grid-shaped images/videos, on which local convolution kernels can be lattices, graphs are fully coordinate-free on vertices and edges. In this work, we propose a Gaussian-induced convolution (GIC) framework to conduct local convolution filtering on irregular graphs. Specifically, an edge-induced Gaussian mixture model is designed to encode the variations of a subgraph region by integrating edge information into weighted Gaussian models, each of which implicitly characterizes one component of subgraph variations. In order to coarsen a graph, we derive a vertex-induced Gaussian mixture model to cluster vertices dynamically according to the connection of edges, which is approximately equivalent to the weighted graph cut. We conduct our multi-layer graph convolution network on several public datasets of graph classification. The extensive experiments demonstrate that our GIC is effective and can achieve state-of-the-art results.

Introduction

As witnessed by its widespread applications, the graph is one of the most successful models for structured and semi-structured data, ranging from text (Defferrard, Bresson, and Vandergheynst 2016), bioinformatics (Yanardag and Vishwanathan 2015; Niepert, Ahmed, and Kutzkov 2016; Song et al. 2018) and social networks (Gomez, Chiem, and Delvenne 2017; Orsini, Baracchi, and Frasconi 2017) to images/videos (Marino, Salakhutdinov, and Gupta 2016; Cui et al. 2018; Cui, Yang, and others 2017). Among these applications, learning robust representations from structured graphs is the main topic. To this end, various methods have come forth in recent years; graph kernels (Yanardag and Vishwanathan 2015) and recurrent neural networks (RNNs) (Scarselli et al. 2009) are the most representative ones. Graph kernels usually take the classic R-convolution strategy (Haussler 1999) to recursively decompose graphs into atomic sub-structures and then define local similarities between them. RNN-based methods sequentially traverse neighbors with tied parameters in depth. With the increase of graph size, graph kernels suffer from the diagonal dominance of kernels (Schölkopf et al. 2002), while RNNs face an explosive number of combinatorial paths in the recursive stage.

Recently, convolutional neural networks (CNNs) (LeCun, Bengio, and Hinton 2015) have achieved breakthrough progress on representing grid-shaped image/video data. In contrast, graphs have irregular structures and are fully coordinate-free on vertices and edges. The vertices/edges are not strictly ordered and cannot be explicitly matched between two graphs.
To generalize the idea of CNNs onto graphs, we need to solve the problem that homomorphic graphs/subgraphs should produce the same responses when performing convolutional filtering. To this end, recent graph convolution methods (Defferrard, Bresson, and Vandergheynst 2016; Atwood and Towsley 2016; Hamilton, Ying, and Leskovec 2017) attempt to aggregate neighbor vertices, as shown in Fig. 1e. This kind of method actually employs a fuzzy filtering (i.e., a tied/shared filter) on neighbor vertices, because only first-order statistics (the mean) are used. Two examples are shown in Fig. 1a and Fig. 1b: although they have different structures, the responses on them are exactly equal. Oppositely, Niepert et al. (Niepert, Ahmed, and Kutzkov 2016) ranked neighbor vertices according to the weights of edges and then used different filters on the sorted vertices, as shown in Fig. 1f. However, this rigid ranking suffers several limitations: i) possibly identical responses to different structures (e.g., Fig. 1b and Fig. 1c), because the weights of edges are out of consideration after ranking; ii) information loss from node pruning for a fixed-size receptive field, as shown in Fig. 1b; iii) ranking ambiguity for equal connections, as shown in Fig. 1d; and iv) ranking sensitivity to (slight) changes of edge weights/connections.

Figure 1: Different filter operations on graph vertices. Examples of one-hop subgraphs are given in (a)-(d), where $v_0$ is the reference vertex and each vertex is assigned a signal. Panels: (e) tied filtering (weighted summation, $f(\sum_{i=1}^{5}\tilde{w}_{0i}x_i)=\sum_{i=1}^{5}\tilde{w}_{0i}f(x_i)$); (f) ranking filtering (rank and select vertices, $f_1(x_2)+f_2(x_1)+f_3(x_4)+f_4(x_5)$); (g) Gaussian-induced filtering (encode edge-induced Gaussian vectors $\mathcal{N}(\mu_c,\frac{1}{\tilde{w}_{0i}}\Sigma_c)$, $f_1(g_1(x))+f_2(g_2(x))+f_3(g_3(x))$); (h) $3\times 3$ regular filtering ($f_0(x_0)+\cdots+f_8(x_8)$). The tied filtering (e) summarizes all neighbor vertices and generates the same response to (a) and (b) under the filter f, i.e., $f(\sum_i \tilde{w}_{0i}x_i)=f(1.9)$, although the two graphs are completely different in structure. The ranking filtering (f) sorts/prunes neighbor vertices and then performs different filtering on them. It may produce the same response $f_1(1)+f_2(3)+f_3(1)+f_4(4)$ for different graphs such as (b) and (c), where the digits in red boxes denote the ranked indices and the vertex in the dashed box in (b) is pruned. Moreover, the vertex ranking is uncertain/non-unique for equal connections in (d). To address these problems, we derive the edge-induced GMM to coordinate subgraphs as shown in (g). Each Gaussian model can be viewed as one variation component (or direction) of the subgraph. Like the standard convolution (h), the Gaussian encoding is sensitive to different subgraphs, e.g., (a)-(d) will have different responses. Note that f and the $f_i$ are linear filters; the non-linear activation functions are applied to their responses.

In this paper we propose a Gaussian-induced graph convolution framework to learn graph representation. For a coordinate-free subgraph region, we design an edge-induced Gaussian mixture model (EI-GMM) to implicitly coordinate the vertices therein. Specifically, the edges are used to regularize the Gaussian models such that the variations of the subgraph can be well encoded. In analogy to the standard convolutional kernel shown in Fig. 1h, EI-GMM can be viewed as a coordinate normalization that projects the variations of a subgraph onto several Gaussian components. For example, the four subgraphs in Fig. 1a-1d will have different representations¹ through our Gaussian encoding in Fig. 1g.
To make the network inference feed-forward, we transform the Gaussian components of each subgraph into the gradient space of the multivariate Gaussian parameters, instead of employing the sophisticated EM algorithm. Then filters (or transform functions) are applied to the different Gaussian components, like latticed kernels on different directions in Fig. 1h. Further, we derive a vertex-induced Gaussian mixture model (VI-GMM) to enable dynamic coarsening of the graph. We also theoretically analyze the approximate equivalence of VI-GMM to the weighted graph cut (Dhillon, Guan, and Kulis 2007). Finally, EI-GMM and VI-GMM can be alternately stacked into an end-to-end optimization network.

In summary, our main contributions are fourfold: i) we propose an end-to-end Gaussian-induced convolutional neural network for graph representation; ii) we propose an edge-induced GMM to encode the variations of different subgraphs; iii) we derive a vertex-induced GMM to perform dynamic coarsening of graphs, which is an approximation to the weighted graph cut; iv) we verify the effectiveness of our method and report state-of-the-art results on several graph datasets.

¹ Suppose the three Gaussian models are N(0, 1), N(0, 2) and N(0, 3); then the responses on (a)-(d) are, respectively, f_1([0.49, 0.93]) + f_2([0.17, 0.65]) + f_3([0.07, 0.44]), f_1([0.35, 0.73]) + f_2([0.15, 0.58]) + f_3([0.10, 0.64]), f_1([0.35, 0.71]) + f_2([0.15, 0.39]) + f_3([0.10, 0.43]), and f_1([0.46, 0.99]) + f_2([0.18, 0.62]) + f_3([0.08, 0.42]). Please refer to the following section for details.

Related Work

Graph CNNs mainly fall into two categories: spectral and spatial methods. Spectral methods (Bruna et al. 2014; Scarselli et al. 2009; Henaff, Bruna, and LeCun 2015; Such et al. 2017; Li et al. 2018a; 2018b) construct a series of spectral filters by decomposing the graph Laplacian, which often incurs a high computational burden. To address this problem, the fast localized spectral filtering method (Defferrard, Bresson, and Vandergheynst 2016) parameterizes the frequency responses as a Chebyshev polynomial approximation. However, as shown in Fig. 1b, after summarizing all nodes, this method discards the topology structure of a local receptive field. This kind of method usually requires graphs of equal size, analogous to the fixed image sizes required by CNNs (Kipf and Welling 2017). Spatial methods attempt to define spatial structures of adjacent vertices and then perform filtering on the structured graphs. Diffusion CNN (Atwood and Towsley 2016) scans a diffusion process across each node. PATCHY-SAN (Niepert, Ahmed, and Kutzkov 2016) linearizes neighbors by sorting the weights of edges and then derives convolutional filtering on graphs, as shown in Fig. 1c. As an alternative, random-walk-based approaches are also used to define the neighborhoods (Perozzi, Al-Rfou, and Skiena 2014). For the linearized neighbors, RNNs (Li et al. 2016) can be used to model the structured sequences. Similarly, Ngram CNN (Luo et al. 2017) serializes each graph by introducing the concept of the n-gram block. GAT (Velickovic et al. 2018) weights edges through the attention mechanism.
WSC (Jiang et al. 2018) aggregates walk fields defined by random walks into Gaussian mixture models. Zhao et al. (2018) define a standard network with different graph convolutions. Besides, some variants (Hamilton, Ying, and Leskovec 2017; Duran and Niepert 2017; Zhang et al. 2018) employ the aggregation or propagation of local neighbor nodes. Different from these tied-filtering or ranking-filtering methods, we use Gaussian models to encode local variations of the graph. Also different from the recent mixture model method (Monti et al. 2017), which uses a GMM only to learn the importance of adjacent nodes, our method uses a weighted GMM to encode the distributions of local graph structures.

Figure 2: The GIC network architecture. GIC mainly contains two modules: the convolution layer (EI-GMM) and the coarsening layer (VI-GMM). GIC stacks several convolution and coarsening layers alternately and iteratively. More details can be found in the following section.

The GIC Network

Attribute Graph

Here we consider an undirected attribute graph $\mathcal{G} = (\mathcal{V}, A, X)$ of m vertices (or nodes), where $\mathcal{V} = \{v_i\}_{i=1}^{m}$ is the set of vertices, A is a (weighted) adjacency matrix, and X is a matrix of graph attributes (or signals). The adjacency matrix $A \in \mathbb{R}^{m\times m}$ records the connections between vertices: if $v_i, v_j$ are not connected, then $A(v_i, v_j) = 0$; otherwise $A(v_i, v_j) \neq 0$. We sometimes abbreviate $A(v_i, v_j)$ as $A_{ij}$. The attribute matrix $X \in \mathbb{R}^{m\times d}$ is associated with the vertex set $\mathcal{V}$, whose i-th row $X_i$ (or $X_{v_i}$) denotes a d-dimensional attribute of the i-th node (i.e., $v_i$). The graph Laplacian matrix L is defined as $L = D - A$, where $D \in \mathbb{R}^{m\times m}$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The normalized version is written as $L_{\text{norm}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$, where I is the identity matrix. Unless otherwise specified, we use the latter. We give the definition of subgraph used in the following.

Definition 1. Given an attribute graph $\mathcal{G} = (\mathcal{V}, A, X)$, the attribute graph $\mathcal{G}' = (\mathcal{V}', A', X')$ is a subgraph of $\mathcal{G}$, denoted $\mathcal{G}' \subseteq \mathcal{G}$, if (i) $\mathcal{V}' \subseteq \mathcal{V}$, (ii) $A'$ is the submatrix of A w.r.t. the subset $\mathcal{V}'$, and (iii) $X' = X_{\mathcal{V}'}$.

Overview

The GIC network architecture is shown in Fig. 2. Given an attribute graph $\mathcal{G}^{(0)} = (\mathcal{V}^{(0)}, A^{(0)}, X^{(0)})$, where the superscript denotes the layer number, we construct multi-scale receptive fields for each vertex based on the adjacency matrix $A^{(0)}$. Each receptive field records the k-hop neighborhood relationships around the reference vertex and forms a local centralized subgraph. To encode the centralized subgraph, we project it onto edge-induced Gaussian models, each of which defines one variation direction of the subgraph. We perform different filtering operations on the different Gaussian components and aggregate all responses as the convolutional output. After the convolutional filtering, the input graph $\mathcal{G}^{(0)}$ is transformed into a new graph $\mathcal{G}^{(1)} = (\mathcal{V}^{(1)}, A^{(1)}, X^{(1)})$, where $\mathcal{V}^{(1)} = \mathcal{V}^{(0)}$ and $A^{(1)} = A^{(0)}$. To further abstract graphs, we next stack a coarsening layer on the graph $\mathcal{G}^{(1)}$. The proposed vertex-induced GMM is used to downsample the graph $\mathcal{G}^{(1)}$ into the low-resolution graph $\mathcal{G}^{(2)} = (\mathcal{V}^{(2)}, A^{(2)}, X^{(2)})$. Using the convolution and coarsening modules, we can alternately stack them into a multi-layer GIC network. With the increase of layers, the receptive field of the filters becomes larger, so higher layers extract more global graph information. In the supervised case of graph classification, we finally append a fully connected layer followed by a softmax loss layer.

Multi-Scale Receptive Fields

In the standard CNN, receptive fields can be conveniently defined as latticed spatial regions.
Thus convolution kernels on grid-shaped structures are readily available. However, the construction of convolutional kernels on graphs is intractable because graphs are coordinate-free, e.g., the vertices are unordered and the number of adjacent edges/vertices is not fixed. To address this problem, we resort to the adjacency matrix A, which expresses the connections between vertices. Since $A^k$ exactly records the k-step reachable vertices, we may construct a k-neighbor receptive field by using a k-order polynomial of A, denoted $\psi_k(A)$. In the simplest case, $\psi_k(A) = A^k$ reflects the k-hop neighborhood relationships. In order to remove the scale effect, we may normalize $\psi_k(A)$ as $\psi_k(A)\,\mathrm{diag}(\psi_k(A)\mathbf{1})^{-1}$, which describes the reachable possibility of a k-hop walk. Formally, we define the k-th scale receptive field as a subgraph.

Definition 2. The k-th scale receptive field around a reference vertex $v_i$ is a subgraph $\mathcal{G}^k_{v_i} = (\mathcal{V}', A', X')$ of the k-order graph $(\mathcal{V}, \tilde{A} = \psi_k(A), X)$, where $\mathcal{V}' = \{v_j \mid \tilde{A}_{ij} \neq 0\} \cup \{v_i\}$, $A'$ is the submatrix of $\tilde{A}$ w.r.t. $\mathcal{V}'$, and $X' = X_{\mathcal{V}'}$.

Convolution: Edge-Induced GMM

Given a reference vertex $v_i$, we can construct the centralized subgraph $\mathcal{G}^k_{v_i}$ of the k-th scale. To coordinate the subgraph, we introduce Gaussian mixture models (GMMs), each component of which may be understood as one principal direction of its variations. To encode the variations accurately, we jointly formulate the attributes of vertices and the connections of edges into the Gaussian models. The edge weight $A'(v_i, v_j)$ indicates the relevance of $v_j$ to the central vertex $v_i$: the higher the weight, the stronger the impact on $v_i$. So the weight can be incorporated into a Gaussian model by observing $v_j$ $A'(v_i, v_j)$ times. As a likelihood, this is equivalent to raising the Gaussian density to the power $A'(v_i, v_j)$, which is proportional to $\mathcal{N}(X'_{v_j}; \mu, \frac{1}{A'(v_i, v_j)}\Sigma)$. Formally, we estimate the probability density of the subgraph $\mathcal{G}^k_{v_i}$ from a $C_1$-component GMM,

$$p_{v_i}(X'_{v_j}; \Theta_1, A'_{ij}) = \sum_{c=1}^{C_1} \pi_c\, \mathcal{N}\Big(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\Sigma_c\Big), \quad \text{s.t. } \pi_c > 0, \; \sum_{c=1}^{C_1}\pi_c = 1, \qquad (1)$$

where $\Theta_1 = \{\pi_1, \dots, \pi_{C_1}, \mu_1, \dots, \mu_{C_1}, \Sigma_1, \dots, \Sigma_{C_1}\}$ are the mixture parameters, $\{\pi_c\}$ are the mixture coefficients, $\{\mu_c, \Sigma_c\}$ are the parameters of the c-th component, and $A'_{ij} > 0$.² Intuitively, the larger the edge weight $A'_{ij}$ is, the stronger the impact of the node $v_j$ w.r.t. the reference vertex $v_i$ is. We refer to the model in Eqn. (1) as the edge-induced Gaussian mixture model (EI-GMM). In what follows, we assume all attributes of nodes are independent of each other, as is often done in signal processing. That means the covariance matrix $\Sigma_c$ is diagonal, so we denote it as $\mathrm{diag}(\sigma^2_c)$. To avoid the explicit constraints on $\pi_c$ in Eqn. (1), we adopt a softmax normalization with the re-parameterization variable $\alpha_c$, i.e., $\pi_c = \exp(\alpha_c)/\sum_{k=1}^{C_1}\exp(\alpha_k)$. Thus the entire subgraph log-likelihood can be written as

$$\zeta(\mathcal{G}^k_{v_i}) = \sum_{j=1}^{|\mathcal{V}'|} \ln p_{v_i}(X'_{v_j}; \Theta_1, A'_{ij}) = \sum_{j=1}^{|\mathcal{V}'|} \ln \sum_{c=1}^{C_1} \pi_c\, \mathcal{N}\Big(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\Sigma_c\Big). \qquad (2)$$

To infer forward, instead of using the expectation-maximization (EM) algorithm, we use the gradients of the subgraph log-likelihood with respect to the parameters $\Theta_1$ of the EI-GMM, motivated by the recent Fisher vector work (Sanchez et al. 2013), which has been proven effective for representation. For convenience, we write $N_{jc} = \mathcal{N}(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\sigma^2_c)$ and $Q_{jc} = \frac{\pi_c N_{jc}}{\sum_{k=1}^{C_1}\pi_k N_{jk}}$; then we can derive the gradients of the model parameters from Eqn. (2) as

$$\frac{\partial \zeta(\mathcal{G}^k_{v_i})}{\partial \mu_c} = \sum_{j} \frac{A'_{ij}\, Q_{jc}\,(X'_{v_j} - \mu_c)}{\sigma^2_c}, \qquad \frac{\partial \zeta(\mathcal{G}^k_{v_i})}{\partial \sigma_c} = \sum_{j} \frac{Q_{jc}\big(A'_{ij}(X'_{v_j} - \mu_c)^2 - \sigma^2_c\big)}{\sigma^3_c}, \qquad (3)$$

where the division of vectors is taken element-wise. Note that we do not use $\partial\zeta(\mathcal{G}^k_{v_i})/\partial\alpha_c$, since it brought no improvement in our experience.

² In practice, we normalize A into a non-negative matrix.
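As a concrete illustration of how one receptive field is encoded by Eqns. (2)-(3), the following NumPy sketch computes the edge-induced Gaussian gradient features of a single subgraph under the diagonal-covariance assumption stated above. It is only a minimal sketch: the function name `ei_gmm_encode` and its interface are our own illustration, not the authors' released implementation.

```python
import numpy as np

def ei_gmm_encode(X_sub, a_ref, pi, mu, sigma2):
    """Encode one centralized subgraph with an edge-induced GMM (sketch of Eqns. (2)-(3)).

    X_sub  : (n, d) attributes X' of the n vertices in the receptive field
    a_ref  : (n,)   positive edge weights A'_{ij} from the reference vertex to each vertex
    pi     : (C,)   mixture coefficients pi_c
    mu     : (C, d) Gaussian means mu_c
    sigma2 : (C, d) diagonal variances sigma_c^2
    Returns a (C, 2*d) array whose c-th row stacks the gradients w.r.t. mu_c and sigma_c.
    """
    diff = X_sub[:, None, :] - mu[None, :, :]                 # (n, C, d)
    var = sigma2[None, :, :] / a_ref[:, None, None]           # per-vertex variance sigma_c^2 / A'_{ij}
    log_gauss = -0.5 * (np.log(2.0 * np.pi * var) + diff ** 2 / var).sum(-1)   # (n, C)

    # Soft assignments Q_{jc} = pi_c N_{jc} / sum_k pi_k N_{jk}, computed in log space.
    log_w = np.log(pi)[None, :] + log_gauss
    Q = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)                         # (n, C)

    # Gradients of the subgraph log-likelihood, as in Eqn. (3).
    g_mu = np.einsum('nc,ncd->cd', Q * a_ref[:, None], diff) / sigma2
    g_sigma = np.einsum('nc,ncd->cd', Q,
                        a_ref[:, None, None] * diff ** 2 - sigma2[None, :, :]) / sigma2 ** 1.5
    return np.concatenate([g_mu, g_sigma], axis=1)
```

In the network, one such $(C_1, 2d)$ encoding would be produced per reference vertex and per scale k, and the linear filters $f_c$ of Eqn. (4) below are then applied to the rows before the ReLU.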
The gradients describe the contributions of the corresponding parameters to the generative process, and the subgraph variations are adaptively allocated to the $C_1$ Gaussian models. Finally, we ensemble all gradients w.r.t. the Gaussian models (i.e., the directions of the graph), in analogy to collecting a local square receptive field on an image. Formally, for the k-scale receptive field $\mathcal{G}^k_{v_i}$ around the vertex $v_i$, the attributes produced from the Gaussian models are filtered respectively and then aggregated,

$$F(\mathcal{G}^k_{v_i}, \Theta_1, f) = \mathrm{ReLU}\Big(\sum_{c=1}^{C_1} f_c\Big(\mathrm{Cat}\Big[\frac{\partial\zeta(\mathcal{G}^k_{v_i})}{\partial\mu_c}, \frac{\partial\zeta(\mathcal{G}^k_{v_i})}{\partial\sigma_c}\Big]\Big)\Big), \qquad (4)$$

where $\mathrm{Cat}[\cdot,\cdot]$ is a concatenation operator, $f_c$ is a linear filtering function (i.e., a convolution function), and ReLU is the rectified linear unit. Therefore, for different subgraphs we produce feature vectors of the same dimensionality, which depends only on the number of Gaussian models. If the soft assignment distribution $Q_{jc}$ is sharply peaked on one certain Gaussian for the vertex $v_j$, the vertex is projected onto only that Gaussian direction.

Coarsening: Vertex-Induced GMM

Like the standard pooling in CNNs, we need to downsample graphs so as to abstract them as well as reduce the computational cost. However, pooling on images is tailored for latticed structures and cannot be used for irregular graphs. One solution is to use a clustering algorithm to partition the vertices into several clusters and then produce a new vertex from each cluster. However, we expect that two vertices should be less likely to fall into the same cluster if there is a high transfer difficulty between them. To this end, we derive the vertex-induced Gaussian mixture model (VI-GMM) to weight each vertex. To utilize the edge information, we construct a latent observation $\phi(v_i)$ w.r.t. each vertex $v_i$ from the graph Laplacian (or the adjacency matrix if it is positive semi-definite), i.e., the kernel calculation $\langle\phi(v_i), \phi(v_j)\rangle = L_{ij}$. Moreover, for each vertex $v_i$, we define an influence factor $w_i$ on the Gaussian models. Formally, given $C_2$ Gaussian models, VI-GMM is written as

$$p(\phi(v_i); \Theta_2, w_i) = \sum_{c=1}^{C_2} \pi_c\, \mathcal{N}\Big(\phi(v_i); \mu_c, \frac{1}{w_i}\Sigma_c\Big), \quad \text{s.t. } w_i = h(X_{v_i}) > 0, \qquad (5)$$

where h is a mapping function to be learnt. To reduce the computational cost of the matrix inverse on $\Sigma$, we specify it as an identity matrix. Then we have

$$p(\phi(v_i); \Theta_2, w_i) = \sum_{c=1}^{C_2} \pi_c \Big(\frac{w_i}{2\pi}\Big)^{d/2} \exp\Big(-\frac{w_i}{2}\,\|\phi(v_i) - \mu_c\|^2\Big). \qquad (6)$$

Given a graph with m vertices, the objective is to maximize the following log-likelihood:

$$\arg\max_{\Theta_2}\; \zeta(\Theta_2) = \sum_{i=1}^{m} \ln\Big(\sum_{c=1}^{C_2} \pi_c\, \mathcal{N}\Big(\phi(v_i); \mu_c, \frac{1}{w_i}I\Big)\Big). \qquad (7)$$

To solve the model in Eqn. (7), we use the iterative expectation-maximization (EM) algorithm, which has a closed-form solution at each step and automatically satisfies the required constraints. The graphical clustering process is summarized as follows:

(1) E-Step: the posteriors, i.e., of the i-th vertex for the c-th cluster, are updated as $p_{ic} = \frac{\pi_c\, p(\phi(v_i);\theta_c, w_i)}{\sum_{k=1}^{C_2}\pi_k\, p(\phi(v_i);\theta_k, w_i)}$, where $\theta_c$ denotes the c-th Gaussian parameters and $\Theta_2 = \{\theta_1, \dots, \theta_{C_2}\}$.

(2) M-Step: we optimize the Gaussian parameters $\pi, \mu$. The parameter estimation is given by $\pi_c = \frac{1}{m}\sum_{i=1}^{m} r_{ic}$ and $\mu_c = \frac{\sum_{v_i\in\mathcal{G}_c} w_i\,\phi(v_i)}{\sum_{v_i\in\mathcal{G}_c} w_i}$. Here $\pi_c$ indicates the energy summation of all vertices assigned to cluster c, and $\mu_c$ may be understood as a doubly weighted ($w_i$, $r_{ic}$) average over cluster c.

After several iterations of the two steps, we perform a hard quantification: the i-th vertex is assigned to the class with the maximum posterior, formally, $r_{ic} = 1$ if $c = \arg\max_k p_{ik}$, and 0 otherwise. Thus we obtain the cluster matrix $P \in \{0, 1\}^{m\times C_2}$, where $P_{ic} = 1$ if the i-th vertex falls into cluster c. During coarsening, we take the maximal responses within each cluster as the attributes of the new vertex and derive the new adjacency matrix as $P^{\top} A P$.
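To make the E/M updates above concrete, here is a minimal NumPy sketch of the vertex-induced weighted EM with identity covariances. For clarity it assumes the latent observations $\phi(v_i)$ are given explicitly as vectors (the kernelized computation that avoids forming $\phi$ is discussed next), it keeps the soft posteriors $p_{ic}$ in the doubly weighted mean update (playing the role of $r_{ic}$), and the name `vi_gmm_coarsen` is our own, not an official implementation.

```python
import numpy as np

def vi_gmm_coarsen(phi, w, C, n_iter=10, seed=0):
    """Vertex-induced GMM clustering for graph coarsening (sketch of the EM steps above).

    phi : (m, d) latent vertex observations phi(v_i) (assumed explicit here)
    w   : (m,)   positive influence factors w_i = h(X_{v_i})
    C   : number of clusters, i.e., vertices of the coarsened graph
    Returns the hard assignment matrix P in {0,1}^{m x C}.
    """
    rng = np.random.default_rng(seed)
    m, d = phi.shape
    pi = np.full(C, 1.0 / C)
    mu = phi[rng.choice(m, size=C, replace=False)]       # initialize means from random vertices

    for _ in range(n_iter):
        # E-step: posteriors p_{ic} under N(phi(v_i); mu_c, (1/w_i) I), in log space.
        # The (w_i/2pi)^{d/2} factor is constant across clusters and cancels in the normalization.
        dist2 = ((phi[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (m, C)
        log_p = np.log(pi)[None, :] - 0.5 * w[:, None] * dist2
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)

        # M-step: mixture weights and doubly (p_{ic}, w_i) weighted means.
        pi = p.mean(axis=0)
        wp = p * w[:, None]                                          # (m, C)
        mu = wp.T @ phi / (wp.sum(axis=0)[:, None] + 1e-12)

    # Hard quantification: each vertex goes to its most probable cluster.
    P = np.zeros((m, C), dtype=int)
    P[np.arange(m), p.argmax(axis=1)] = 1
    return P
```

The coarsened attributes can then be obtained by max-pooling within each cluster and the new adjacency matrix as $P^{\top} A P$, as stated above.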
It is worth noting that we need not compute the concrete $\phi$ during the clustering process. The main calculation $\|\phi(v_i) - \mu_c\|^2$ in EM can be reduced to the kernel version

$$\|\phi(v_i) - \mu_c\|^2 = K_{ii} - \frac{2\sum_{v_j\in\mathcal{G}_c} w_j K_{ij}}{\sum_{v_j\in\mathcal{G}_c} w_j} + \frac{\sum_{v_j, v_k\in\mathcal{G}_c} w_j w_k K_{jk}}{(\sum_{v_j\in\mathcal{G}_c} w_j)^2},$$

where $K_{ij} = \langle\phi(v_i), \phi(v_j)\rangle$. In practice, we can use the graph Laplacian L as the kernel. In this case, we can easily reach the following proposition, which is relevant to graph cut (Dhillon, Guan, and Kulis 2007).

Proposition 1. In EM, if the kernel matrix takes the weight-regularized graph Laplacian, i.e., $K = \mathrm{diag}(w)\, L\, \mathrm{diag}(w)$, then VI-GMM is equal to an approximate optimization of the graph cut, i.e., $\min \sum_{c=1}^{C_2} \frac{\mathrm{links}(\mathcal{V}_c, \mathcal{V}\setminus\mathcal{V}_c)}{w(\mathcal{V}_c)}$, where $\mathrm{links}(\mathcal{A}, \mathcal{B}) = \sum_{v_i\in\mathcal{A}, v_j\in\mathcal{B}} A_{ij}$ and $w(\mathcal{V}_c) = \sum_{v_i\in\mathcal{V}_c} w_i$.

Experiments

Graph Classification

For graph classification, each graph is annotated with one label. We use two types of datasets: bioinformatics and social network datasets. The former contains MUTAG (Debnath et al. 1991), PTC (Toivonen et al. 2003), NCI1 and NCI109 (Wale, Watson, and Karypis 2008), ENZYMES (Borgwardt et al. 2005) and PROTEINS (Borgwardt et al. 2005). The latter contains COLLAB (Leskovec, Kleinberg, and Faloutsos 2005), REDDIT-BINARY, REDDIT-MULTI-5K, REDDIT-MULTI-12K, IMDB-BINARY and IMDB-MULTI.

Experiment Settings We verify our GIC on the above bioinformatics and social network datasets. By default, GIC consists of three graph convolution layers, each followed by a graph coarsening layer, and one fully connected layer with a final softmax layer, as shown in Fig. 2. Its configuration can simply be set as C(64)-P(0.25)-C(128)-P(0.25)-C(256)-P-FC(256), where C, P and FC denote the convolution, coarsening and fully connected layers respectively. The choices of hyperparameters are mainly inspired by the classic VGG net; for example, the coarsening factor is 0.25 (vs. 0.5 × 0.5 in VGG), and the attribute dimensions of the three convolution layers are 64-128-256 (vs. the channel numbers of conv1-3 in VGG). The scale of the receptive field and the number of Gaussian components are both set to 7. We train the GIC network with stochastic gradient descent for roughly 300 epochs with a batch size of 100, a learning rate of 0.1 and a momentum of 0.95. On the bioinformatics datasets, we exploit the labels and degrees of the vertices to generate the initial attributes of each vertex; on the social network datasets, we use the degrees of the vertices. We closely follow the experimental setup in PSCN (Niepert, Ahmed, and Kutzkov 2016): we perform 10-fold cross-validation with 9 folds for training and 1 fold for testing, repeat the experiments 10 times, and report the average accuracies.
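For reference, the default layer layout and training hyperparameters described above can be written down as a small, purely illustrative Python spec; the tuple tags 'C', 'P' and 'FC' mirror the C(64)-P(0.25)-... notation used in the text and are not an actual API.

```python
# Illustrative spec of the default GIC configuration used in the experiments.
# ('C', out_dim) = EI-GMM convolution layer, ('P', ratio) = VI-GMM coarsening layer,
# ('FC', units)  = fully connected layer followed by softmax for graph classification.
GIC_CONFIG = [
    ('C', 64),  ('P', 0.25),
    ('C', 128), ('P', 0.25),
    ('C', 256), ('P', None),   # final coarsening layer; ratio left unspecified in the text
    ('FC', 256),
]

K_SCALES = 7          # scales of the receptive field
NUM_GAUSSIANS = 7     # Gaussian components C1 per convolution layer
EPOCHS, BATCH_SIZE = 300, 100
LEARNING_RATE, MOMENTUM = 0.1, 0.95
```

Such a spec could be consumed by a builder that alternates the EI-GMM convolution and VI-GMM coarsening modules sketched earlier.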
Comparisons with the State-of-the-arts We compare our GIC with several state-of-the-art methods, including graph convolution networks (PSCN (Niepert, Ahmed, and Kutzkov 2016), DCNN (Atwood and Towsley 2016), Ngram CNN (Luo et al. 2017)), neural networks (SAEN (Orsini, Baracchi, and Frasconi 2017)), feature-based algorithms (DyF (Gomez, Chiem, and Delvenne 2017), FB (Bruna et al. 2014)), random-walk-based methods (RW (Gärtner, Flach, and Wrobel 2003)), and graph kernel approaches (GK (Shervashidze et al. 2009), DGK (Yanardag and Vishwanathan 2015), WL (Morris, Kersting, and Mutzel 2017)). The comparisons are shown in Table 1; all results come from the related literature. We have the following observations.

Deep learning based methods on graphs (including DCNN, PSCN, Ngram CNN, SAEN and ours) are superior to the conventional methods in most cases. The conventional kernel methods usually require the calculation of graph kernels with high computational complexity. In contrast, graph neural networks learn more abstract high-level features through a forward inference, which needs relatively low computation cost.

Compared with recent graph convolution methods, ours achieves better performance on most datasets, such as PTC, NCI1, NCI109, ENZYMES and PROTEINS. The main reason should be that the local variations of subgraphs are accurately described by the Gaussian component analysis.

The proposed GIC achieves state-of-the-art results on most datasets. The best performance is gained on several bioinformatics and social network datasets, including PTC, NCI1, ENZYMES, PROTEINS, COLLAB, IMDB-BINARY and IMDB-MULTI. Although the Ngram CNN, DyF, WL and SAEN approaches obtain the best performance on MUTAG, REDDIT-BINARY, NCI109, REDDIT-MULTI-5K and REDDIT-MULTI-12K respectively, our method is fully comparable to them.

Table 1: Comparisons with state-of-the-art methods (mean accuracies and standard deviations; methods: PSCN, DCNN, NGRAMCNN, FB, DYF, WL, GK, DGK, RW, SAEN, GIC).
MUTAG       means: 92.63 66.98 94.99 84.66 88.00 78.3 81.66 82.66 83.72 84.99 94.44; stds: 4.21 5.63 2.01 2.37 1.9 2.11 1.45 1.50 1.82 4.30
PTC         means: 60.00 56.60 68.57 55.58 57.15 57.26 57.32 57.85 57.04 77.64; stds: 4.82 1.72 2.30 1.47 1.41 1.13 1.30 1.30 6.98
NCI1        means: 78.59 62.61 62.90 68.27 83.1 62.28 62.48 48.15 77.80 84.08; stds: 1.89 0.96 0.34 0.2 0.29 0.25 0.50 0.42 1.77
NCI109      means: 62.86 62.43 66.72 85.2 62.60 62.69 49.75 82.86; stds: 1.13 0.20 0.2 0.19 0.23 0.60 2.37
ENZYMES     means: 18.10 29.00 33.21 53.4 26.61 27.08 24.16 62.50; stds: 1.16 1.20 1.4 0.99 0.79 1.64 5.12
PROTEINS    means: 75.89 75.96 69.97 75.04 73.7 71.67 71.68 74.22 75.31 77.65; stds: 2.76 2.98 1.34 0.65 0.5 0.55 0.50 0.42 0.70 3.21
COLLAB      means: 72.60 76.35 80.61 72.84 73.09 69.01 75.63 81.24; stds: 2.15 1.64 1.60 0.28 0.25 0.09 0.31 1.44
REDDIT-B    means: 86.30 88.98 89.51 75.3 77.34 78.04 67.63 86.08 88.45; stds: 1.58 2.26 1.96 0.3 0.18 0.39 1.01 0.53 1.60
REDDIT-5K   means: 49.10 50.83 50.31 41.01 41.27 52.24 51.58; stds: 0.70 1.83 1.92 0.17 0.18 0.38 1.68
REDDIT-12K  means: 41.32 42.37 40.30 31.82 32.22 46.72 42.98; stds: 0.42 1.27 1.41 0.08 0.10 0.23 0.87
IMDB-B      means: 71.00 71.66 72.02 72.87 72.4 65.87 66.96 64.54 71.26 76.70; stds: 2.29 2.71 4.71 4.05 0.5 0.98 0.56 1.22 0.74 3.25
IMDB-M      means: 45.23 50.66 47.34 48.12 43.89 44.55 34.54 49.11 51.66; stds: 2.84 4.10 3.56 3.56 0.38 0.52 0.76 0.64 3.40

Table 2: Node label prediction on Reddit and PPI data (micro-averaged F1 score).
METHOD                  REDDIT  PPI
RANDOM                  0.042   0.396
RAW FEATURES            0.585   0.422
DEEPWALK                0.324   -
DEEPWALK + FEATURES     0.691   -
NODE2VEC + REGRESSION   0.934   -
GRAPHSAGE-GCN           0.930   0.500
GRAPHSAGE-MEAN          0.950   0.598
GRAPHSAGE-LSTM          0.954   0.612
GIC                     0.952   0.661

Node Classification

For node classification, each node is assigned one or multiple labels, which is challenging when the label set is large. During training, we only use a fraction of the nodes and their labels, and the task is to predict the labels of the remaining nodes. Following the setting in (Hamilton, Ying, and Leskovec 2017), we conduct the experiments on the Reddit and PPI data. For a fair comparison to GraphSAGE (Hamilton, Ying, and Leskovec 2017), we use the same initial graph data, minibatch iterators, supervised loss function and neighborhood sampler. The other network parameters are the same as in graph classification, except that the coarsening layer is removed.
Table 2 summarizes the comparison results. Our GIC obtains the best performance of 0.661 on the PPI data and a comparable result of 0.952 on the Reddit data. The raw features provide important initial information for node multi-label classification: based on the raw features, DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) improves by about 0.36 (micro-F1) on the Reddit data. Meanwhile, we conduct an experiment with node2vec (Grover and Leskovec 2016) followed by a regression model for classification; our method gains better performance than node2vec. Compared with the different aggregation schemes, i.e., GCN (Kipf and Welling 2017), mean and LSTM aggregation, our GIC gives a significant improvement of about 0.16 on the PPI data and a competitive performance on the Reddit data. The results demonstrate that our approach can robustly infer the unknown labels of partially labeled graphs.

Model Analysis

EI-GMM and VI-GMM: To directly analyze the convolution filtering with EI-GMM, we compare our method with the ChebNet (Defferrard, Bresson, and Vandergheynst 2016) and GCN (Kipf and Welling 2017) approaches under the same coarsening mechanism VI-GMM. As shown in Table 3, under the same coarsening operation, our GIC is superior to ChebNet+VI-GMM and GCN+VI-GMM, which indicates that EI-GMM can indeed encode the variations of subgraphs more effectively. On the other hand, we remove the coarsening layer from our GIC; for graphs of different sizes, we pad new zero vertices to a fixed size and then concatenate the attributes of all vertices for classification. As shown in the same table, GIC still outperforms GIC without VI-GMM coarsening, which verifies the effectiveness of the coarsening layer VI-GMM.

Table 3: The verification of our convolution and coarsening.
DATASET    CHEBNET w/ VI-GMM  GCN w/ VI-GMM  GIC w/o VI-GMM  GIC
MUTAG      89.44 ± 6.30       92.22 ± 5.66   93.33 ± 4.84    94.44 ± 4.30
PTC        68.23 ± 6.28       71.47 ± 4.75   68.23 ± 4.11    77.64 ± 6.98
NCI1       73.96 ± 1.87       76.39 ± 1.08   79.17 ± 1.63    84.08 ± 1.77
NCI109     72.88 ± 1.85       74.92 ± 1.70   77.81 ± 1.88    82.86 ± 2.37
ENZYMES    52.83 ± 7.34       51.50 ± 5.50   52.00 ± 4.76    62.50 ± 5.12
PROTEINS   78.10 ± 3.37       80.09 ± 3.20   78.19 ± 2.04    77.65 ± 3.21

Table 4: Comparisons on K and C1.
DATASET    K, C1 = 1      K, C1 = 3     K, C1 = 5     K, C1 = 7
MUTAG      67.77 ± 11.05  83.88 ± 5.80  90.55 ± 6.11  94.44 ± 4.30
PTC        72.05 ± 8.02   77.05 ± 4.11  76.47 ± 5.58  77.64 ± 6.98
NCI1       71.21 ± 1.94   83.26 ± 1.17  84.47 ± 1.64  84.08 ± 1.77
NCI109     70.02 ± 1.57   81.74 ± 1.56  83.39 ± 1.65  82.86 ± 2.37
ENZYMES    33.83 ± 4.21   64.00 ± 4.42  63.66 ± 3.85  62.50 ± 5.12
PROTEINS   75.49 ± 4.00   77.47 ± 3.37  78.10 ± 2.96  77.65 ± 3.21

Table 5: Comparisons on the layer number.
DATASET    N = 2         N = 4         N = 6         N = 8
MUTAG      86.66 ± 8.31  91.11 ± 5.09  93.88 ± 5.80  94.44 ± 4.30
PTC        64.11 ± 6.55  74.41 ± 6.45  75.29 ± 6.05  77.64 ± 6.98
NCI1       71.82 ± 1.85  81.36 ± 1.07  83.01 ± 1.54  84.08 ± 1.77
NCI109     71.09 ± 2.41  80.02 ± 1.67  81.60 ± 1.83  82.86 ± 2.37
ENZYMES    42.33 ± 4.22  61.83 ± 5.55  64.83 ± 6.43  62.50 ± 5.12
PROTEINS   77.38 ± 2.97  79.81 ± 3.84  78.37 ± 4.00  77.65 ± 3.21

K and C1: The kernel size K and the number of Gaussian components C1 are the most crucial parameters. Generally, C1 should be proportional to K, because a larger receptive field usually contains more vertices (i.e., a relatively large subgraph). Thus we simply take equal values for them, K = C1 = {1, 3, 5, 7}. The experimental results are shown in Table 4. With the increase of K and C1, the performance improves in most cases. The reasons are twofold: i) with an increasing receptive field size, the convolution covers farther-hopping neighbors; ii) with the increase of C1, the variations of subgraphs are encoded more accurately.
However, larger values of K and C1 increase the computational burden, and overfitting might occur with the increase of model complexity. Take NCI109 as an example: in the first convolution layer, the encoded attributes (in Eqn. (4)) have 2 × 39 × 7 = 546 dimensions for each scale of receptive field, where 39 is the attribute dimension (w.r.t. the number of node labels) and 7 is the number of Gaussian components. Thus, for 7 scales of receptive field, the final encoded attributes have 546 × 7 = 3822 dimensions, which are mapped to 64 dimensions by the functions f = [f_1, ..., f_{C_1}] in Eqn. (4). The model therefore has 3822 × 64 = 244608 parameters in the first layer. Similarly, if the number of node labels is 2, the number of parameters sharply decreases to 18816. Besides, the parameter complexity is related to the number of classes and nodes. The comparison results in Table 4 demonstrate the trend w.r.t. the parameters K and C1 in our GIC framework.

Number of stacked layers: Here we test the number of stacked network layers with N = 2, 4, 6, 8. When N = 2, only one fully connected layer and one softmax layer are preserved. When N = 4, we add two layers: the convolution layer and the coarsening layer. Continuing to stack both, the depth of the network becomes 6 and 8. The results are shown in Table 5. Deeper networks gain better performance in most cases, because a larger receptive field is observed and more abstract structures are extracted in the top layers. Of course, there is an extra risk of overfitting due to the increase of model complexity.

An analysis of computation complexity: In the convolution layer, the computational costs of the receptive fields and the Gaussian encoding are about $O(Km^2)$ and $O(C_1 m d^2)$ respectively, where m and d are the number of nodes and the feature dimensionality. Generally, $K = C_1 \le d < m$. In the coarsening layer, the time complexity is about $O(pm^2 + md)$, where p is the iteration number of the EM algorithm. In all, supposing the whole GIC alternately stacks n convolution and coarsening layers, the entire time complexity is $O(n(K + p)m^2 + n C_1 m d^2 + nmd)$.

Conclusion

In this paper, we proposed a novel Gaussian-induced convolution network to handle general irregular graph data. Considering that previous spectral and spatial methods do not well characterize the local variations of a graph, we derived an edge-induced GMM to adaptively encode subgraph structures by projecting them onto several Gaussian components and then performing different filtering operations on each Gaussian direction, like the standard CNN filters on images. Meanwhile, we formulated graph coarsening as a vertex-induced GMM to dynamically partition a graph, which was also shown to be approximately equivalent to the weighted graph cut. Extensive experiments on two graph tasks (i.e., graph and node classification) demonstrated the effectiveness and superiority of our GIC compared with baselines and state-of-the-art methods. In the future, we would like to extend our method to more applications with irregular data.

Acknowledgments

The authors would like to thank the Chairs and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Science Fund of China under Grant Nos. 61602244, 61772276, U1713208 and 61472187, and the Program for Changjiang Scholars.

References

Atwood, J., and Towsley, D. 2016. Diffusion-convolutional neural networks. In NIPS, 1993-2001.
Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S.; Smola, A. J.; and Kriegel, H.-P. 2005. Protein function prediction via graph kernels. Bioinformatics 21(suppl 1):i47-i56.
Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and locally connected networks on graphs. ICLR.
Cui, Z.; Xu, C.; Zheng, W.; and Yang, J. 2018. Context-dependent diffusion network for visual relationship detection. In MM, 1475-1482. ACM.
Cui, Z.; Yang, J.; et al. 2017. Spectral filter tracking. arXiv preprint arXiv:1707.05553.
Debnath, A. K.; Lopez de Compadre, R. L.; Debnath, G.; Shusterman, A. J.; and Hansch, C. 1991. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34(2):786-797.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844-3852.
Dhillon, I. S.; Guan, Y.; and Kulis, B. 2007. Weighted graph cuts without eigenvectors: a multilevel approach. TPAMI 29(11).
Duran, A. G., and Niepert, M. 2017. Learning graph representations with embedding propagation. In NIPS, 5119-5130.
Gärtner, T.; Flach, P.; and Wrobel, S. 2003. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines 129-143.
Gomez, L. G.; Chiem, B.; and Delvenne, J.-C. 2017. Dynamics based features for graph classification. arXiv preprint arXiv:1705.10817.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In SIGKDD, 855-864. ACM.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS, 1025-1035.
Haussler, D. 1999. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
Jiang, J.; Cui, Z.; Xu, C.; Li, C.; and Yang, J. 2018. Walk-steered convolution for graph classification. arXiv preprint arXiv:1804.05837.
Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. ICLR.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436-444.
Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In SIGKDD, 177-187. ACM.
Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. ICLR.
Li, C.; Cui, Z.; Zheng, W.; Xu, C.; Ji, R.; and Yang, J. 2018a. Action-attending graphic neural network. TIP 27(7):3657-3670.
Li, C.; Cui, Z.; Zheng, W.; Xu, C.; and Yang, J. 2018b. Spatio-temporal graph convolution for skeleton based action recognition. AAAI.
Luo, Z.; Liu, L.; Yin, J.; Li, Y.; and Wu, Z. 2017. Deep learning of graphs with ngram convolutional neural networks. TKDE 29(10):2125-2139.
Marino, K.; Salakhutdinov, R.; and Gupta, A. 2016. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844.
Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2017. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR.
Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In ICDM, 327-336. IEEE.
Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In ICML, 2014-2023.
Orsini, F.; Baracchi, D.; and Frasconi, P. 2017. Shift aggregate extract networks. arXiv preprint arXiv:1703.05537.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In SIGKDD, 701-710. ACM.
Sanchez, J.; Perronnin, F.; Mensink, T.; and Verbeek, J. 2013. Image classification with the Fisher vector: Theory and practice. IJCV 105(3):222-245.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. TNN 20(1):61-80.
Schölkopf, B.; Weston, J.; Eskin, E.; Leslie, C.; and Noble, W. S. 2002. A kernel approach for learning from almost orthogonal patterns. In ECML, 511-528. Springer.
Shervashidze, N.; Vishwanathan, S.; Petri, T.; Mehlhorn, K.; and Borgwardt, K. 2009. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, 488-495.
Song, T.; Zheng, W.; Song, P.; and Cui, Z. 2018. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing.
Such, F. P.; Sah, S.; Dominguez, M. A.; Pillai, S.; Zhang, C.; Michael, A.; Cahill, N. D.; and Ptucha, R. 2017. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 11(6):884-896.
Toivonen, H.; Srinivasan, A.; King, R. D.; Kramer, S.; and Helma, C. 2003. Statistical evaluation of the predictive toxicology challenge 2000-2001. Bioinformatics 19(10):1183-1193.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. ICLR.
Wale, N.; Watson, I. A.; and Karypis, G. 2008. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems 14(3):347-375.
Yanardag, P., and Vishwanathan, S. 2015. Deep graph kernels. In SIGKDD, 1365-1374. ACM.
Zhang, T.; Zheng, W.; Cui, Z.; and Li, Y. 2018. Tensor graph convolutional neural network. arXiv preprint arXiv:1803.10071.
Zhao, W.; Xu, C.; Cui, Z.; Zhang, T.; Jiang, J.; Zhang, Z.; and Yang, J. 2018. When work matters: Transforming classical network structures to graph CNN. arXiv preprint arXiv:1807.02653.