# Deformable Graph Convolutional Networks

Jinyoung Park, Sungdong Yoo, Jihwan Park, Hyunwoo J. Kim*
Department of Computer Science and Engineering, Korea University
{lpmn678, ysd424, jseven7071, hyunwoojkim}@korea.ac.kr

\* Corresponding author.
Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Graph neural networks (GNNs) have significantly improved the representation power for graph-structured data. Despite the recent success of GNNs, the graph convolution in most GNNs has two limitations. Since graph convolution is performed in a small local neighborhood on the input graph, it is inherently incapable of capturing long-range dependencies between distant nodes. In addition, when a node has neighbors that belong to different classes, i.e., under heterophily, the aggregated messages from them often negatively affect representation learning. To address these two common problems of graph convolution, in this paper we propose Deformable Graph Convolutional Networks (Deformable GCNs), which adaptively perform convolution in multiple latent spaces and capture short- and long-range dependencies between nodes. Separately from node representations (features), our framework simultaneously learns node positional embeddings (coordinates) to determine the relations between nodes in an end-to-end fashion. Depending on node positions, the convolution kernels are deformed by deformation vectors and apply different transformations to neighbor nodes. Our extensive experiments demonstrate that Deformable GCNs flexibly handle heterophily and achieve the best performance in node classification on six heterophilic graph datasets. Our code is publicly available at https://github.com/mlvlab/DeformableGCN.

## Introduction

Graphs are flexible representations for modeling relations in data analysis problems and are widely used in domains such as social network analysis (Wang, Cui, and Zhu 2016), recommender systems (Berg, Kipf, and Welling 2017), chemistry (Gilmer et al. 2017), natural language processing (Erkan and Radev 2004), and computer vision (Johnson et al. 2015). In recent years, Graph Neural Networks (GNNs) have achieved great success in many graph-related applications such as node classification (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2018), link prediction (Zhang and Chen 2018; Schlichtkrull et al. 2018), and graph classification (Errica et al. 2019; Ying et al. 2018). Most existing GNNs learn node representations via message-passing schemes, which iteratively update the hidden representation of each node by aggregating messages from its local neighborhood. For example, Graph Convolutional Networks (GCNs) (Kipf and Welling 2017) operate convolutions on input graphs inspired by a first-order approximation of spectral graph convolutions (Hammond, Vandergheynst, and Gribonval 2011). However, most graph convolutions that aggregate messages from local neighborhoods implicitly assume that input graphs are homophilic, i.e., that connected nodes have similar features or belong to the same class. When this assumption holds, smoothing over the input graph effectively removes noise in the input features and significantly improves representational power.
However, on heterophilic graphs, where connected nodes tend to have dissimilar features and different labels, conventional graph convolutional networks often underperform simple methods such as a multi-layer perceptron (MLP) that completely ignores the graph structure. In addition, since conventional graph convolution receives messages only from local neighbors, it has limited capability to capture long-range dependencies between distant yet relevant nodes for the target task. To address these limitations, we propose a Deformable Graph Convolutional Network (Deformable GCN) that softly changes the receptive field of each node by adaptively aggregating the outputs of deformable graph convolution in multiple latent spaces. Starting from a general definition of the discrete convolution with finite support, we extend the deformable 2D convolution (Dai et al. 2017) to a latent space for graph-structured data. Similar to the convolution defined on a grid space for images, our convolution kernel generates different transformations for different relations. Our framework models useful relations between nodes, represented by the difference of learned node positional embeddings. Our contributions are as follows:

- We propose a Deformable Graph Convolution (Deformable GConv) that performs convolution in a latent space and adaptively deforms the convolution kernels to handle heterophily and variable-range dependencies between nodes.
- We propose a novel architecture, Deformable Graph Convolutional Networks (Deformable GCN), that simultaneously learns node representations (features) and node positional embeddings (coordinates) and efficiently performs Deformable GConv in multiple latent spaces using latent neighborhood graphs.
- Our experiments demonstrate the effectiveness of Deformable GCN on the node classification task on homophilic and heterophilic graphs. Moreover, the interpretable attention scores in our framework provide insight into which relation (or latent space) is beneficial to the target task.

## Related Works

**Graph Neural Networks.** Graph neural networks have been studied extensively for representing graph-structured data in recent years. Based on spectral graph theory, ChebyNet (Defferrard, Bresson, and Vandergheynst 2016) designed fast localized convolutional filters with reduced computational cost. Motivated by it, (Kipf and Welling 2017) proposed GCN, a simplified convolutional network based on a first-order approximation of spectral convolutions. Several studies improve performance via message-passing processes (Hamilton, Ying, and Leskovec 2018; Xu et al. 2018) and attention-based models (Veličković et al. 2017; Yun et al. 2019). However, these methods have been developed under the homophily assumption that connected nodes have similar characteristics, which often leads to performance degradation on heterophilic graphs. Moreover, most existing GNNs cannot capture long-range dependencies between distant nodes because they aggregate messages only in a local neighborhood. Recently, several studies have attempted to address these problems (Bo et al. 2021; Liu, Wang, and Ji 2020; Zhu et al. 2020, 2021). Geom-GCN (Pei et al. 2020) proposed a geometric aggregation scheme in a latent space that captures long-range dependencies through structural information.
However, Geom-GCN has limitations: it defines convolution on a grid, i.e., a manually divided latent space, and it requires pre-trained embedding methods in the geometric aggregation step. Unlike Geom-GCN, our model applies a deformable convolution kernel with relation vectors defined in a continuous latent space and utilizes multiple learnable embeddings to softly assign relations in an end-to-end fashion.

**Deformable Convolution.** Convolutional neural networks (CNNs) have achieved great success in various fields (He et al. 2016, 2017). However, convolution kernels are limited in modeling large and unknown transformations since they are defined on a fixed structure. To address this limitation, (Dai et al. 2017; Zhu et al. 2019) proposed deformable convolution networks that adaptively change the shape of convolution kernels by learning offsets for deformation. Owing to their flexibility and effectiveness, deformable convolutions have been used in various fields, such as point clouds (Thomas et al. 2019), image generation (Siarohin et al. 2018), and video tasks (Wang et al. 2019). Inspired by this line of work, we study a deformable graph convolution that adaptively handles diverse relations between an ego node and its neighbors.

## Method

The goal of our framework is to address the limitations of existing graph convolutions, which learn node representations from a small neighborhood under the homophily assumption. In other words, existing GNNs often perform poorly when neighbors belong to different classes and have dissimilar features. Also, most GNNs with a small number of layers cannot model long-range dependencies between distant nodes. To address these limitations, we propose Deformable Graph Convolutional Networks, which perform deformable convolution on latent graphs. Our framework softly changes the shape (or size) of the receptive field for each node. This allows it to adaptively handle homophilic/heterophilic graphs as well as short/long-range dependencies. Before introducing our framework, we briefly summarize notations and the basic concepts of graph neural networks.

### Preliminaries

**Notations.** Let $G = (V, E)$ denote an input graph with a set of nodes $V$ and a set of edges $E \subseteq V \times V$. Each node $v \in V$ has a feature vector $\mathbf{x}_v \in \mathbb{R}^{d_x}$, and the edge between node $u$ and node $v$ is denoted by $(u, v) \in E$. The neighborhood of node $v$ on the input graph is denoted by $N(v) = \{u \in V \mid (u, v) \in E\}$.

**Graph Neural Networks.** To learn representations of graph-structured data, most existing GNNs follow a message-passing framework (Gilmer et al. 2017; Xu et al. 2018):

$$\mathbf{h}^{(l)}_v = \sigma\left(\mathbf{W}^{(l)} \cdot \mathrm{AGGREGATE}\left(\left\{\mathbf{h}^{(l-1)}_u : u \in \tilde{N}(v)\right\}\right)\right), \tag{1}$$

where $\mathbf{h}^{(0)}_v = \mathbf{x}_v$, $\mathbf{h}^{(l)}_v \in \mathbb{R}^{d_h^{(l)}}$ is the hidden representation of node $v$ at the $l$-th layer, $\tilde{N}(v) = \{v\} \cup \{u \in V \mid (u, v) \in E\}$ denotes the neighbors of node $v$ including a self-loop, $\mathbf{W}^{(l)}$ is a learnable weight matrix at the $l$-th layer, AGGREGATE is an aggregation function characterized by the particular model, and $\sigma$ is a non-linear activation function. $\tilde{N}(v)$ determines the receptive field of the graph convolution and is usually a one-hop ego-graph. For example, GCN (Kipf and Welling 2017) is a specific instance of (1) and can be written as

$$\mathbf{h}^{(l)}_v = \sigma\left(\sum_{u \in \tilde{N}(v)} \left(\deg(v)\deg(u)\right)^{-1/2}\, \mathbf{W}^{(l)} \mathbf{h}^{(l-1)}_u\right), \tag{2}$$

where $\deg(v)$ is the degree of node $v$.
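As a concrete illustration of (2), the following is a minimal PyTorch-style sketch of one GCN layer with symmetric degree normalization. The dense-matrix formulation and the helper name `gcn_layer` are ours for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gcn_layer(adj, h, weight):
    """One GCN layer as in Eq. (2): add self-loops, normalize by node degrees,
    aggregate neighbor features, and apply the shared transform W^(l).

    adj:    (N, N) binary adjacency matrix
    h:      (N, d_in) node features from the previous layer
    weight: (d_in, d_out) learnable matrix W^(l)
    """
    adj_hat = adj + torch.eye(adj.size(0))          # adjacency with self-loops
    deg = adj_hat.sum(dim=1)                        # degrees deg(v)
    d_inv_sqrt = deg.pow(-0.5)                      # deg(v)^{-1/2}
    norm_adj = d_inv_sqrt.unsqueeze(1) * adj_hat * d_inv_sqrt.unsqueeze(0)
    return F.relu(norm_adj @ h @ weight)            # sigma(sum_u norm * W h_u)
```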
Although GNNs such as GCN (Kipf and Welling 2017) and GAT (Veličković et al. 2017) have been successfully applied to various graph-based tasks, their success is limited to homophilic graphs (Pei et al. 2020; Zhu et al. 2020), i.e., graphs in which linked nodes often have similar properties (McPherson, Smith-Lovin, and Cook 2001). Recent works (Li, Han, and Wu 2018; Wu et al. 2019) showed that GCN makes node embeddings smoother within their neighborhoods, since GCN can be viewed as a specific form of Laplacian smoothing (Li, Han, and Wu 2018). For this reason, graph neural networks have difficulty adapting to graphs in which linked nodes often have different properties, called heterophilic graphs. In our work, we consider both heterophilic and homophilic graphs, unlike standard graph neural networks that have mainly focused on homophilic graphs. To gain sufficient representation power on heterophilic graphs, we generate latent graphs that link distant nodes with similar properties according to their latent embeddings.

Figure 1: Overall structure of Deformable GCN and Deformable GConv. (a) In Deformable GCN, latent neighborhood graphs $\{G^{(l)}\}_{l=0}^{L+1}$ are constructed at the beginning of training to define neighborhoods for the Deformable Graph Convolution (Deformable GConv). Then, Deformable GConv is applied on each latent neighborhood graph, and the outputs of the convolution $\{\mathbf{y}^{(l)}_v\}_l$ are adaptively aggregated into $\tilde{\mathbf{h}}_v$ using an attention mechanism. (b) Deformable GConv performs graph convolution in a latent (position) space. For more flexible graph convolution, Deformable GConv adaptively deforms the convolution kernels $g_{\text{deform}}(\cdot, \cdot)$ for each center node $v$ by the kernel vector deformation $\Delta_k(\mathbf{e}_v)$.

### Deformable Graph Convolution

We now introduce Deformable Graph Convolution (Deformable GConv), which softly changes the receptive fields and adaptively aggregates messages from neighbors on multiple latent graphs. The overall structure of Deformable GConv is illustrated in Figure 1. Our Deformable Graph Convolution can be derived from a general definition of the discrete convolution with a kernel of finite support. The convolution of a feature map $H$ by a kernel $g$ at a node $v$ is given by

$$\mathbf{y}_v = (H * g)(v) = \sum_{u \in \tilde{N}(v)} g(\mathbf{r}_{u,v})\, \mathbf{h}_u, \tag{3}$$

where $\mathbf{y}_v \in \mathbb{R}^{d_y}$ is the output of the convolution, $\mathbf{h}_u \in \mathbb{R}^{d_h}$ is the feature at $u$, the relation vector $\mathbf{r}_{u,v}$ represents the relation between $u$ and $v$, and $\tilde{N}(v)$ is the neighborhood of $v$, which coincides with the finite support of $g$ centered at $v$. $g(\mathbf{r}_{u,v})$ is a linear function that transforms $\mathbf{h}_u$ and varies depending on the relation vector. For example, in a 2D convolution with a $3 \times 3$ kernel, $\mathbf{r}_{u,v} = \phi_u - \phi_v$, where $\phi_u, \phi_v$ are the coordinates of $u$ and $v$, and a linear function $g(\mathbf{r}_{u,v}) \in \mathbb{R}^{d_y \times d_h}$ is applied for each relative position. In the graph domain, a GCN layer defined in (2) (without the activation function) can be viewed as a specific instantiation of (3) with $g(\mathbf{r}_{u,v}) = (\deg(v)\deg(u))^{-1/2}\mathbf{W}$. Thus, in GCN, the relations are determined by the degrees of nodes $u$ and $v$. Also, apart from the normalization, GCN performs the same linear transformation for all relations, unlike a standard 2D convolution.

Our framework extends graph convolution to a more flexible, deformable graph convolution by defining a relation vector $\mathbf{r}_{u,v}$, a kernel function $g(\cdot)$, and its support (a neighborhood). To extend node relations beyond adjacency on the input graph $G$, we first embed nodes in a position space, a latent space that determines the coordinates of nodes via a node positional embedding. Then, we compute the relation vector of two nodes as a function of their node positional embeddings.
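Before each component is specified, the sketch below illustrates the generic template of (3): every neighbor feature is transformed by a relation-dependent linear map and the results are summed. All names here (`generalized_conv`, `kernel_fn`, the dense feature and position arrays) are hypothetical placeholders for exposition.

```python
import torch

def generalized_conv(v, neighbors, h, phi, kernel_fn):
    """Eq. (3) at node v: sum over neighbors u of g(r_{u,v}) h_u, where the
    relation vector r_{u,v} is the relative position of u with respect to v.

    neighbors: dict {node index: list of neighbor indices}, the support N~(v)
    h:         (N, d_h) node features
    phi:       (N, d_phi) node coordinates (grid positions or positional embeddings)
    kernel_fn: callable mapping a relation vector to a (d_y, d_h) matrix, i.e., g(.)
    """
    y_v = 0.0
    for u in neighbors[v]:
        r_uv = phi[u] - phi[v]                 # relation vector (relative position)
        y_v = y_v + kernel_fn(r_uv) @ h[u]     # relation-dependent transform, summed
    return y_v
```

A 2D convolution and a GCN layer are both recovered from this template by particular choices of `kernel_fn` and `neighbors`, as discussed above.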
**Node positional embedding.** In our framework, each node is embedded in a latent space, called the position space, using a node embedding method. Since the node embeddings are used only for the coordinates of nodes and their relations, we call them node positional embeddings. For each node, our framework learns both node positional embeddings (coordinates) and node representations (features). The node positional embedding is computed by a simple procedure. Given a node $v$, its node positional embedding $\phi^{(l)}_v$ is a projection of the input features after smoothing them $l$ times on the original input graph $G$:

$$\phi^{(l)}_v = \mathbf{W}^{(l)}_\phi \mathbf{e}^{(l)}_v, \quad \text{where } \mathbf{e}^{(0)}_v = \mathbf{x}_v, \;\; \mathbf{e}^{(l)}_v = \frac{1}{\widetilde{\deg}(v)} \sum_{u \in \tilde{N}(v)} \mathbf{e}^{(l-1)}_u, \tag{4}$$

where $\mathbf{W}^{(l)}_\phi$ is a learnable projection matrix, $\widetilde{\deg}(v)$ is the degree of node $v$ including the self-loop, and $\mathbf{e}^{(l)}_v$ is the input feature of node $v$ smoothed $l$ times. Each node has $L+1$ node positional embeddings. Alternatively, other node embedding methods can be used for the positional embedding $\phi_v$, such as LINE (Tang et al. 2015), node2vec (Grover and Leskovec 2016), distance encoding (Li et al. 2020), or Poincaré embeddings in hyperbolic geometry (Nickel and Kiela 2017). However, our preliminary experiments showed that our simple node positional embedding is sufficient for our model, so we do not use any external node embedding method in this paper.

**Relation vector.** The relation between nodes is represented by a relation vector $\mathbf{r}_{u,v}$. One natural choice is the relative position of a neighbor node $u$ with respect to node $v$ in the position space. In our framework, we use normalized relation vectors with an extra dimension to encode the relation of nodes with identical positional embeddings. The relation vector for a neighbor node $u$ of node $v$ with node positional vectors $\phi_u, \phi_v \in \mathbb{R}^{d_\phi}$ is given by

$$\mathbf{r}_{u,v} = \begin{cases} [\mathbf{r}'_{u,v} \,\|\, 0] \in \mathbb{R}^{d_\phi+1}, & \text{if } \phi_u \neq \phi_v, \\ [0, 0, \ldots, 1] \in \mathbb{R}^{d_\phi+1}, & \text{otherwise,} \end{cases} \tag{5}$$

where $\mathbf{r}'_{u,v} = \frac{\phi_u - \phi_v}{\|\phi_u - \phi_v\|_2}$ and $[\cdot\|\cdot]$ is the concatenation operator.

**Kernel function.** As discussed with (3), a kernel function $g$ yields linear functions that transform the hidden representations of neighbor nodes. Our kernel function $g$ on $\mathbf{r}_{u,v}$ is defined as

$$g(\mathbf{r}_{u,v}) = \sum_{k=1}^{K} a_{u,v,k}\, \mathbf{W}_k, \quad \text{where } a_{u,v,k} = \exp\left(\mathbf{r}_{u,v}^\top \tilde{\phi}_k\right)/Z. \tag{6}$$

Here $\tilde{\phi}_k \in \mathbb{R}^{d_\phi+1}$ is a kernel vector and $\mathbf{W}_k \in \mathbb{R}^{d_y \times d_h}$ is the corresponding transformation matrix; both are learnable parameters. $Z \in \mathbb{R}$ is a normalization factor. The value of $g$, which is a linear transformation, varies depending on the relation vector $\mathbf{r}_{u,v}$, but for the same relation the same linear transformation is returned when $Z$ is a constant. This matches a standard 2D convolution, which applies an identical linear transformation at the same relative position from the center of the kernel. One interesting difference is that, since a standard 2D convolution kernel has one transformation matrix per relative position, the number of its transformation matrices grows with the kernel size, whereas our kernel $g$ differentially combines a fixed number of matrices $\{\mathbf{W}_k\}_{k=1}^{K}$ depending on the relation vector $\mathbf{r}_{u,v}$. Thus, the number of parameters of our convolution no longer depends on the kernel size.

**Deformable Graph Convolution.** The kernel function can be further extended for a more flexible, deformable convolution. We first normalize the weight $a_{u,v,k}$ for a transformation matrix $\mathbf{W}_k$ by $Z = \sum_{u'} \exp\left(\mathbf{r}_{u',v}^\top \tilde{\phi}_k\right)$. Then the kernel becomes adaptive. This is similar to the dot-product attention in (Vaswani et al. 2017); we found that dot-product attention can be viewed as a variant of convolution. For more details, see the supplement.
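A compact sketch of the (not yet deformed) kernel in (6) might look as follows. Here a softmax over the K kernels plays the role of the normalizer Z, which is one simple constant-style choice; the adaptive variant just described, and the deformable one below, normalize over neighbors instead. The class name and shapes are our assumptions, not the released code.

```python
import torch

class RelationKernel(torch.nn.Module):
    """Kernel g of Eq. (6): K learnable kernel vectors phi_k and matrices W_k;
    g(r) is a weighted sum of the W_k with weights proportional to exp(r^T phi_k)."""

    def __init__(self, num_kernels, d_phi, d_in, d_out):
        super().__init__()
        self.kernel_vecs = torch.nn.Parameter(torch.randn(num_kernels, d_phi + 1))
        self.transforms = torch.nn.Parameter(torch.randn(num_kernels, d_out, d_in))

    def forward(self, r):
        # r: (d_phi + 1,) relation vector of Eq. (5)
        logits = self.kernel_vecs @ r                         # r^T phi_k for each k
        a = torch.softmax(logits, dim=0)                      # weights a_{u,v,k}
        return torch.einsum('k,koi->oi', a, self.transforms)  # sum_k a_k W_k
```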
Now the kernel yields a transformation matrix for each neighbor, considering not only the relation between a center node $v$ and a neighbor node $u$ but also the relations among the neighbor nodes. In addition, the kernel vector $\tilde{\phi}_k$ is dynamically translated by a deformation vector $\Delta_k(\mathbf{e}_v) \in \mathbb{R}^{d_\phi+1}$ that depends on the smoothed input features $\mathbf{e}_v \in \mathbb{R}^{d_x}$ of the center node $v$. Putting these pieces together, we define our Deformable Graph Convolution as

$$\mathbf{y}_v = \sum_{u \in \tilde{N}(v)} g_{\text{deform}}\left(\mathbf{r}_{u,v}, \Delta_k(\mathbf{e}_v)\right) \mathbf{h}_u, \tag{7}$$

where $g_{\text{deform}}\left(\mathbf{r}_{u,v}, \Delta_k(\mathbf{e}_v)\right) = \sum_{k=1}^{K} \hat{a}_{u,v,k}\, \mathbf{W}_k$ and $\hat{a}_{u,v,k} = \frac{\exp\left(\mathbf{r}_{u,v}^\top \left(\tilde{\phi}_k + \Delta_k(\mathbf{e}_v)\right)\right)}{\sum_{u'} \exp\left(\mathbf{r}_{u',v}^\top \left(\tilde{\phi}_k + \Delta_k(\mathbf{e}_v)\right)\right)}$. In our experiments, the deformation vector is generated by a simple MLP with one hidden layer.

**Latent Neighborhood Graph.** The Deformable Graph Convolution is computed within a neighborhood $\tilde{N}(v)$. To compute the neighborhoods efficiently, the $k$-nearest neighbors of each node are found with respect to the $\ell_2$ distance between smoothed input features, i.e., $\|\mathbf{e}_u - \mathbf{e}_v\|_2$. Since the input feature smoothing in our framework has no learnable parameters, the smoothed features $\{\mathbf{e}_u\}$ do not change during training, so the kNN graph generation is performed only once at the beginning of training. Because of these kNN graphs, our Deformable Graph Convolution can be viewed as a graph convolution on the kNN graphs. Computing the neighborhood from the node positional embeddings $\phi_v$ may seem more natural, following (Dai et al. 2017), but it requires a large computational cost for the pairwise distances between all nodes at every iteration. In practice, the drastic changes of neighborhoods caused by positional embedding learning also often lead to unstable numerical optimization. Therefore, in this work, we use the smoothed input features $\{\mathbf{e}_v\}_v$ for the neighborhood computation.

### Deformable Graph Convolutional Networks

With our Deformable Graph Convolution, we design the Deformable Graph Convolutional Network (Deformable GCN). The overall structure is depicted in Figure 1. Deformable GCN utilizes multiple node positional embeddings; in other words, it generates multiple neighborhood graphs. Let $\mathbf{e}^{(l)}_v$ denote the input features at node $v$ smoothed $l$ times on the original graph $G$. The latent neighborhood graph, i.e., the kNN graph, constructed from $\{\mathbf{e}^{(l)}_v\}_{v \in V}$ is denoted by $G^{(l)}$. Deformable GCN generates $L+1$ kNN graphs $\{G^{(l)}\}_{l=0}^{L}$. Together with the original input graph $G^{(L+1)} = G$, Deformable GCN performs Deformable GConv on each neighborhood graph. The outputs of the convolution on $G^{(l)}$, denoted by $\mathbf{y}^{(l)}_v$, are adaptively aggregated at each node by a simple attention mechanism:

$$\tilde{\mathbf{h}}_v = \sum_{l=0}^{L+1} s^{(l)}_v\, \tilde{\mathbf{y}}^{(l)}_v, \quad s^{(l)}_v = \frac{\exp\left(\mathbf{z}^\top \tilde{\mathbf{y}}^{(l)}_v\right)}{\sum_{l'=0}^{L+1} \exp\left(\mathbf{z}^\top \tilde{\mathbf{y}}^{(l')}_v\right)}, \tag{8}$$

where $\tilde{\mathbf{y}}^{(l)}_v = \mathbf{y}^{(l)}_v / \|\mathbf{y}^{(l)}_v\|_2$ and $\mathbf{z} \in \mathbb{R}^{d_y}$ is a learnable parameter. The score $s^{(l)}_v$ indicates which neighborhood (with its Deformable GConv) is suitable for node $v$. For each node, Deformable GCN softly chooses a suitable neighborhood via the node positional embeddings, and Deformable GConv performs convolution with a deformed convolution kernel.
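The sketch below pulls these pieces together for one latent neighborhood graph: a kNN graph built once from smoothed features, the deformable convolution of (7), and the attention-based aggregation of (8). Shapes, the `deform_mlp` interface, and all helper names are our assumptions; the relation vectors are simplified (the extra dimension of (5) for identical positional embeddings is omitted).

```python
import torch
import torch.nn.functional as F

def knn_graph(e, k):
    """Latent neighborhood graph: for each node, the k nearest nodes under the
    l2 distance between smoothed input features e (N, d_x); built once."""
    dist = torch.cdist(e, e)                                    # (N, N) pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, 1:]       # drop the node itself

def deformable_gconv(h, phi, e, nbrs, kernel_vecs, W, deform_mlp):
    """Deformable GConv of Eq. (7) on one latent graph.
    h (N, d_h) features, phi (N, d_phi+1) positional embeddings, e (N, d_x)
    smoothed inputs, nbrs (N, k) neighbor indices, kernel_vecs (K, d_phi+1),
    W (K, d_y, d_h); deform_mlp maps e_v to K deformation vectors Delta_k(e_v)."""
    N, k = nbrs.shape
    r = F.normalize(phi[nbrs] - phi.unsqueeze(1), dim=-1)       # (N, k, d_phi+1) relations
    delta = deform_mlp(e).view(N, -1, kernel_vecs.size(-1))     # (N, K, d_phi+1)
    logits = torch.einsum('nkd,nmd->nkm', r, kernel_vecs + delta)  # r^T(phi_m + Delta_m)
    a = torch.softmax(logits, dim=1)                            # normalize over neighbors
    return torch.einsum('moi,nki,nkm->no', W, h[nbrs], a)       # sum_u sum_m a W_m h_u

def aggregate_latent_graphs(ys, z):
    """Attention aggregation of Eq. (8) over per-graph outputs.
    ys: (L+2, N, d_y) outputs on the L+1 kNN graphs and the input graph; z: (d_y,)."""
    ys = F.normalize(ys, dim=-1)                                # y / ||y||_2
    s = torch.softmax(ys @ z, dim=0)                            # scores s^(l)_v
    return (s.unsqueeze(-1) * ys).sum(dim=0)                    # (N, d_y) node representations
```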
**Loss functions.** To learn an effective deformable graph convolution in a latent space without collapsed kernel vectors, we impose two regularizers: a separating regularizer and a focusing regularizer. The separating regularization loss maximizes the distance between the kernel vectors $\{\tilde{\phi}_k\}_{k=1}^{K}$ so that our kernel yields different transformation matrices for diverse relations. It is formulated as

$$\mathcal{L}_{\text{sep.}} = -\sum_{k_1 \neq k_2} \left\|\tilde{\phi}_{k_2} - \tilde{\phi}_{k_1}\right\|_2. \tag{9}$$

In addition, to avoid extreme changes of the deformable kernel $g_{\text{deform}}$ and the collapse of kernel vectors after deformation, we use a focusing regularizer that penalizes the $\ell_2$-norm of the deformation vectors $\Delta_k(\mathbf{e}_v)$:

$$\mathcal{L}_{\text{focus}} = \frac{1}{K|V|} \sum_{v \in V} \sum_{k=1}^{K} \left\|\Delta_k(\mathbf{e}_v)\right\|_2^2. \tag{10}$$

With these two regularization losses, our Deformable Graph Convolutional Network properly generates the kernel vectors and deformation vectors. Since we mainly conduct experiments on node classification tasks, we use the standard cross-entropy loss $\mathcal{L}_{\text{cls}}$ together with the two regularizers. The total loss function is

$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \alpha\, \mathcal{L}_{\text{sep.}} + \beta\, \mathcal{L}_{\text{focus}}, \tag{11}$$

where $\alpha$ and $\beta$ are hyperparameters controlling the strength of the regularizers $\mathcal{L}_{\text{sep.}}$ and $\mathcal{L}_{\text{focus}}$.
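A small sketch of the training objective in (9)-(11) could look like the following; the averaging over kernel pairs in the separating loss is our reading of the (partially garbled) Eq. (9), and all function names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def separating_loss(kernel_vecs):
    """Separating regularizer of Eq. (9): push the K kernel vectors apart by
    penalizing the negative of their pairwise l2 distances (normalization assumed)."""
    dist = torch.cdist(kernel_vecs, kernel_vecs)        # (K, K) pairwise distances
    num_kernels = kernel_vecs.size(0)
    return -dist.sum() / (num_kernels * (num_kernels - 1))

def focusing_loss(deformations):
    """Focusing regularizer of Eq. (10): mean squared l2-norm of the deformation
    vectors Delta_k(e_v) over all K kernels and |V| nodes; deformations: (N, K, d_phi+1)."""
    return deformations.pow(2).sum(dim=-1).mean()

def total_loss(logits, labels, kernel_vecs, deformations, alpha, beta):
    """Total objective of Eq. (11): cross-entropy plus the two weighted regularizers."""
    return (F.cross_entropy(logits, labels)
            + alpha * separating_loss(kernel_vecs)
            + beta * focusing_loss(deformations))
```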
## Experiments

In this section, we validate the effectiveness of our framework, Deformable GCN, on heterophilic and homophilic graph datasets.

### Datasets

We use six heterophilic graph datasets and three homophilic graph datasets, which can be distinguished by the homophily ratio (Zhu et al. 2020) $h = |\{(u,v) : (u,v) \in E \wedge y_u = y_v\}| / |E|$, where $y_v$ is the label of node $v$. Statistics of each dataset are given in Table 1; more details on the datasets are in the supplementary material.

Table 1: Dataset statistics. The first six datasets are heterophilic graphs and the last three are homophilic graphs.

| | Texas | Wisconsin | Actor | Squirrel | Chameleon | Cornell | Citeseer | Pubmed | Cora |
|---|---|---|---|---|---|---|---|---|---|
| # Classes | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 3 | 7 |
| # Nodes | 183 | 251 | 7600 | 5201 | 2277 | 183 | 3327 | 19717 | 2708 |
| # Edges | 279 | 450 | 26659 | 198353 | 31371 | 277 | 4552 | 44324 | 5278 |
| # Features | 1703 | 1703 | 932 | 2089 | 2325 | 1703 | 3703 | 500 | 1433 |
| Avg. deg. | 3.05 | 3.59 | 7.02 | 76.28 | 27.55 | 3.03 | 3.03 | 4.50 | 3.90 |
| Hom. ratio $h$ | 0.11 | 0.21 | 0.22 | 0.22 | 0.23 | 0.30 | 0.74 | 0.80 | 0.81 |

### Baselines and Implementation Details

**Baselines.** As baseline models, we include widely used GNN-based methods: GCN (Kipf and Welling 2017), GAT (Veličković et al. 2017), and ChebyNet (Defferrard, Bresson, and Vandergheynst 2016). We also include a 2-layer MLP as a baseline, since MLP models show comparable performance under heterophily when existing methods for graph-structured data do not use the graph topology effectively. To compare our model with state-of-the-art models on heterophilic graphs, we include JKNet (Xu et al. 2018), MixHop (Abu-El-Haija et al. 2019), H2GCN (Zhu et al. 2020), and Geom-GCN (Pei et al. 2020). For Geom-GCN, we report the best of its three variants from the paper (Pei et al. 2020).

**Implementation details.** We use the Adam optimizer (Kingma and Ba 2014) with $\ell_2$-regularization and 500 training epochs for our model and the baselines. The performance is reported with the best model on the validation set. For all datasets, we apply the (48%/32%/20%) splits of nodes per class for (train/validation/test) provided by (Pei et al. 2020) at https://github.com/graphdml-uiuc-jlu/geom-gcn, for a fair comparison as in (Zhu et al. 2020). All experiments are repeated 10 times as in (Pei et al. 2020; Zhu et al. 2020), and accuracy is used as the evaluation metric. More implementation details are in the supplementary material.

### Results on Node Classification

Table 2 shows the results of Deformable GCN and the baselines on the node classification task.

Table 2: Evaluation results on the node classification task (mean accuracy (%) ± 95% confidence interval). The best-performing models are highlighted in boldface.

| Dataset | Texas | Wisconsin | Actor | Squirrel | Chameleon | Cornell | Citeseer | Pubmed | Cora |
|---|---|---|---|---|---|---|---|---|---|
| Hom. ratio $h$ | 0.11 | 0.21 | 0.22 | 0.22 | 0.23 | 0.30 | 0.74 | 0.80 | 0.81 |
| MLP | 82.16 ± 2.64 | 84.90 ± 1.82 | 36.78 ± 0.56 | 30.77 ± 1.68 | 47.87 ± 1.31 | 81.08 ± 3.62 | 72.86 ± 1.42 | 87.62 ± 0.19 | 75.09 ± 1.45 |
| GCN | 64.32 ± 2.60 | 62.94 ± 4.19 | 30.47 ± 0.64 | 46.65 ± 0.99 | 64.98 ± 0.69 | 60.27 ± 2.37 | 76.66 ± 1.12 | 87.59 ± 0.36 | 87.44 ± 0.83 |
| GAT | 60.81 ± 4.62 | 64.31 ± 3.48 | 29.92 ± 0.43 | 45.47 ± 1.38 | 66.56 ± 0.99 | 59.46 ± 1.58 | 76.54 ± 1.00 | 86.55 ± 0.34 | 87.46 ± 0.77 |
| ChebyNet | 76.49 ± 3.45 | 77.84 ± 3.29 | 35.03 ± 0.73 | 45.42 ± 0.06 | 61.80 ± 1.19 | 72.97 ± 4.61 | 76.20 ± 1.12 | 89.16 ± 0.30 | 83.44 ± 2.04 |
| JKNet | 62.97 ± 5.18 | 60.78 ± 3.29 | 30.78 ± 0.60 | 53.78 ± 1.25 | 68.53 ± 1.59 | 59.19 ± 2.16 | 76.05 ± 1.00 | 88.64 ± 0.34 | 87.12 ± 0.81 |
| MixHop | 83.78 ± 3.35 | 85.10 ± 2.57 | 33.80 ± 0.89 | 39.32 ± 2.16 | 63.09 ± 1.07 | 81.08 ± 4.61 | 75.86 ± 1.20 | 89.02 ± 0.26 | 87.20 ± 0.85 |
| Geom-GCN | 68.11 ± 3.04 | 64.51 ± 2.53 | 31.48 ± 0.64 | 38.00 ± 0.60 | 60.99 ± 1.74 | 60.54 ± 3.55 | **77.48 ± 0.87** | **89.51 ± 0.33** | 85.51 ± 0.92 |
| H2GCN | 82.16 ± 4.12 | 86.67 ± 2.18 | 36.96 ± 0.55 | 54.51 ± 0.94 | 65.42 ± 1.58 | 80.54 ± 3.77 | 77.05 ± 0.90 | 89.38 ± 0.26 | **87.48 ± 0.93** |
| Deformable GCN | **84.32 ± 3.42** | **87.06 ± 2.16** | **37.07 ± 0.79** | **62.56 ± 1.31** | **70.90 ± 1.12** | **85.95 ± 2.71** | 76.83 ± 1.15 | 89.49 ± 0.29 | **87.48 ± 0.82** |

Overall, Deformable GCN achieves the highest performance on all heterophilic graphs compared to all baselines, including H2GCN, which was specifically proposed for heterophilic graphs. Note that on some heterophilic graph datasets such as Texas, Wisconsin, Actor, and Cornell, MLP outperforms various GNNs such as GCN, GAT, and Geom-GCN by significant margins without utilizing any graph structure information. This might suggest that graph structure information is harmful to representation learning on heterophilic graphs for most existing GNN models, but on other heterophilic graph datasets such as Squirrel and Chameleon, MLP underperforms most GNN models. In contrast to the existing GNN models, our Deformable Graph Convolution in multiple position spaces with node positional embeddings allows Deformable GCN to consistently achieve the best performance on all the heterophilic graph datasets. Similar to our method, Geom-GCN also performs convolution in a latent space, but it significantly underperforms our method, by 17.4% on average over the heterophilic graph datasets and by 25.4% on Cornell. We believe that our node positional embeddings, which are learned simultaneously with node representations for the target task in an end-to-end fashion, are more effective than the node embeddings in Geom-GCN, which are obtained from pre-trained external embedding methods. On homophilic graphs, Deformable GCN shows comparable performance; since most existing GNNs have been proposed under the homophily assumption, the performance gap between GNN-based models is small.

### Ablation Study and Analysis

We conduct additional experiments to verify the contributions of our node positional embedding, deformable graph convolution layer, deformation, and regularization for Deformable GCN, and we analyze the attention scores $s^{(l)}_v$.

**Node positional embedding.** To verify the efficacy of our node positional embedding, we compare it with other node embedding methods, such as node2vec (Grover and Leskovec 2016) and Poincaré embedding (Nickel and Kiela 2017), by using them for $\phi_v$ in (4) on four datasets in Table 3.

Table 3: Comparison of our node positional embeddings with other node embedding methods on four datasets.

| Positional embedding | Wisconsin | Actor | Squirrel | Pubmed |
|---|---|---|---|---|
| node2vec | 67.57 | 35.04 | 45.27 | 88.35 |
| Poincaré | 68.1 | 35.15 | 46.31 | 87.26 |
| Ours | 87.06 | 37.07 | 62.56 | 89.49 |
The results show that our proposed embedding method outperforms the two other embedding methods on every dataset. Specifically, it improves over node2vec and Poincaré embedding by 9.9% and 9.8% on average, respectively. Unlike node2vec and Poincaré embeddings, which require separate pre-training, our positional embedding can be trained simultaneously with the GNN in an end-to-end fashion. We believe this makes our embedding scheme much easier to train and yields positional embeddings better optimized for the target task.

**Deformable graph convolution.** In Table 4, we examine the contribution of the deformable graph convolution (Deformable GConv) by substituting it with the GAT layer from Graph Attention Networks (Veličković et al. 2017) on four datasets. Deformable GConv consistently performs better than the GAT layer; in particular, on the Wisconsin dataset, Deformable GConv improves over the GAT layer by 16.52%. This shows that Deformable GConv contributes to the performance improvements of Deformable GCN.

Table 4: Comparison of our Deformable GConv with the GAT layer on four datasets.

| Layer | Wisconsin | Actor | Squirrel | Pubmed |
|---|---|---|---|---|
| GAT layer | 70.54 | 36.26 | 61.62 | 88.04 |
| Deformable GConv | 87.06 | 37.07 | 62.56 | 89.49 |

**Deformation.** To show the effectiveness of deformation, we compare Deformable GCN with the model without deformation (w/o Deformation) on node classification with the Wisconsin and Pubmed datasets while varying the number of kernel vectors. Figure 2 shows that Deformable GCN consistently achieves superior performance (2 to 6% on average) over the model without deformation on both a heterophilic graph (Wisconsin) and a homophilic graph (Pubmed) across the various settings with different numbers of kernel vectors.

Figure 2: Ablations for the deformation of deformable graph convolution. We compare Deformable GCN and the model without deformation (w/o Deformation) on (a) Wisconsin and (b) Pubmed according to the number of kernel vectors.

**Regularizers.** To verify the contribution of the regularizers, we conduct an ablation study on $\mathcal{L}_{\text{sep.}}$ and $\mathcal{L}_{\text{focus}}$ on four datasets. Table 5 summarizes the results. Each regularizer contributes to improving the performance of Deformable GCN, and training with both regularizers is the most effective. On average, Deformable GCN trained with both regularizers outperforms the model without regularizers by 1.5%.

Table 5: Ablations for the two regularization losses ($\mathcal{L}_{\text{sep.}}$, $\mathcal{L}_{\text{focus}}$) on four datasets.

| $\mathcal{L}_{\text{sep.}}$ | $\mathcal{L}_{\text{focus}}$ | Wisconsin | Actor | Squirrel | Pubmed |
|---|---|---|---|---|---|
| | | 84.12 | 36.67 | 60.31 | 89.02 |
| ✓ | | 86.08 | 36.70 | 60.43 | 88.86 |
| | ✓ | 86.08 | 36.67 | 61.83 | 88.79 |
| ✓ | ✓ | 87.06 | 37.07 | 62.56 | 89.49 |

Figure 3: The attention score $s^{(l)}$ of each latent neighborhood graph $G^{(l)}$. Datasets above the dashed line are heterophilic graphs and the others are homophilic graph datasets.

**Attention score $s^{(l)}_v$.** The attention score $s^{(l)}_v$ in (8) can be used to understand the datasets. The averaged attention score $s^{(l)} = \frac{1}{|V|}\sum_{v \in V} s^{(l)}_v$ indicates the overall importance of a latent neighborhood graph $G^{(l)}$. Figure 3 shows the attention scores of a Deformable GCN with $L = 5$, which utilizes six latent neighborhood graphs $\{G^{(l)}\}_{l=0}^{5}$ and the original input graph $G^{(\text{Input})}$. On most heterophilic graph datasets, except for Chameleon and Squirrel, Deformable GCN has a large attention score for the latent graph $G^{(0)}$ and relatively small attention scores for $G^{(\text{Input})}$.
Recall that $G^{(0)}$ is the kNN graph constructed from the similarity between input features. This indicates that, rather than focusing on the neighborhood in the input graph, smoothing over nodes with similar input features is more helpful for predictions on heterophilic graphs. Indeed, this coincides with the results in Table 2: on heterophilic graphs such as Texas, Wisconsin, Actor, and Cornell, an MLP, which does not use any graph structure information, outperforms most existing GNNs. On homophilic graphs, on the other hand, $G^{(\text{Input})}$ plays a more important role than it does on heterophilic graphs.

## Conclusion

We proposed Deformable Graph Convolutional Networks for learning representations on both heterophilic and homophilic graphs. Our approach learns node positional embeddings that map nodes into latent spaces and applies deformable graph convolution at each node. The deformable graph convolution aggregates messages in latent neighborhoods with weights determined by kernel vectors shifted by deformation vectors. Our experiments show that Deformable GCNs are effective for learning representations on both heterophilic and homophilic graphs.

## Acknowledgments

This work was partly supported by MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience program (IITP-2022-2020-0-01819) supervised by the IITP, and by the Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1701-51.

## References

Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Ver Steeg, G.; and Galstyan, A. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML.

Berg, R. v. d.; Kipf, T. N.; and Welling, M. 2017. Graph convolutional matrix completion. arXiv:1706.02263.

Bo, D.; Wang, X.; Shi, C.; and Shen, H. 2021. Beyond Low-frequency Information in Graph Convolutional Networks. In AAAI.

Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable Convolutional Networks. In ICCV.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS.

Erkan, G.; and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, 22: 457-479.

Errica, F.; Podda, M.; Bacciu, D.; and Micheli, A. 2019. A fair comparison of graph neural networks for graph classification. In ICLR.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML.

Grover, A.; and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In KDD.

Hamilton, W. L.; Ying, R.; and Leskovec, J. 2018. Inductive Representation Learning on Large Graphs. In NeurIPS.

Hammond, D. K.; Vandergheynst, P.; and Gribonval, R. 2011. Wavelets on graphs via spectral graph theory. ACHA, 30(2): 129-150.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.; Bernstein, M.; and Fei-Fei, L. 2015. Image retrieval using scene graphs. In CVPR.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.

Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Li, P.; Wang, Y.; Wang, H.; and Leskovec, J. 2020. Distance Encoding: Design Provably More Powerful GNNs for Structural Representation Learning. In NeurIPS.

Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.

Liu, M.; Wang, Z.; and Ji, S. 2020. Non-local graph neural networks. arXiv:2005.14612.

McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annu. Rev. Sociol., 27(1): 415-444.

Nickel, M.; and Kiela, D. 2017. Poincaré embeddings for learning hierarchical representations. In NeurIPS.

Pei, H.; Wei, B.; Chang, K. C.-C.; Lei, Y.; and Yang, B. 2020. Geom-GCN: Geometric graph convolutional networks. In ICLR.

Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In ESWC, 593-607.

Siarohin, A.; Sangineto, E.; Lathuilière, S.; and Sebe, N. 2018. Deformable GANs for Pose-Based Human Image Generation. In CVPR.

Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In WWW.

Thomas, H.; Qi, C. R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; and Guibas, L. J. 2019. KPConv: Flexible and deformable convolution for point clouds. In ICCV.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS.

Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. In ICLR.

Wang, D.; Cui, P.; and Zhu, W. 2016. Structural deep network embedding. In KDD.

Wang, X.; Chan, K. C. K.; Yu, K.; Dong, C.; and Loy, C. C. 2019. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In CVPRW.

Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In ICML.

Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In ICML.

Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. In NeurIPS.

Yun, S.; Jeong, M.; Kim, R.; Kang, J.; and Kim, H. J. 2019. Graph Transformer Networks. In NeurIPS.

Zhang, M.; and Chen, Y. 2018. Link prediction based on graph neural networks. In NeurIPS.

Zhu, J.; Rossi, R. A.; Rao, A.; Mai, T.; Lipka, N.; Ahmed, N. K.; and Koutra, D. 2021. Graph Neural Networks with Heterophily. In AAAI.

Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; and Koutra, D. 2020. Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. In NeurIPS.

Zhu, X.; Hu, H.; Lin, S.; and Dai, J. 2019. Deformable ConvNets v2: More deformable, better results. In CVPR.