# DFG-NAS: Deep and Flexible Graph Neural Architecture Search

Wentao Zhang 1, Zheyu Lin 1, Yu Shen 1, Yang Li 1, Zhi Yang 1, Bin Cui 1 2

*Equal contribution. 1 School of CS & Key Laboratory of High Confidence Software Technologies, Peking University. 2 Institute of Computational Social Science, Peking University (Qingdao), China. Correspondence to: Zhi Yang, Bin Cui. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Graph neural networks (GNNs) have been intensively applied to various graph-based applications. Despite their success, manually designing well-behaved GNNs requires immense human expertise, so discovering the potentially optimal data-specific GNN architecture is inefficient. This paper proposes DFG-NAS, a new neural architecture search (NAS) method that enables the automatic search of very deep and flexible GNN architectures. Unlike most existing methods that focus on micro-architectures, DFG-NAS highlights another level of design: the search for macro-architectures, i.e., how atomic propagation (P) and transformation (T) operations are integrated and organized into a GNN. To this end, DFG-NAS proposes a novel search space for P-T permutations and combinations based on message-passing disaggregation, defines four custom-designed macro-architecture mutations, and employs an evolutionary algorithm to conduct an efficient and effective search. Empirical studies on four node classification tasks demonstrate that DFG-NAS outperforms state-of-the-art manual designs and NAS methods for GNNs.

## 1. Introduction

Graph Neural Networks (GNNs) are a family of message-passing algorithms whose intuition is to smooth node embeddings across the edges of a graph. By stacking multiple GNN layers, each node can enhance its embedding with information from distant neighborhood nodes. Recently, GNNs have been applied in various domains such as social network analysis (Zhang et al., 2020; Huang et al., 2021), chemistry and biology (Dai et al., 2019; Bradshaw et al., 2019), recommendation (Jiang et al., 2022; Wu et al., 2020), natural language processing (Wu et al., 2021; Vashishth et al., 2020), and computer vision (Shi et al., 2019; Sarlin et al., 2020).

Despite this broad spectrum of applications, designing GNN architectures manually is a knowledge-intensive and labor-intensive process. Existing literature (Huan et al., 2021) suggests that handcrafted architectures (e.g., GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GAT (Velickovic et al., 2018)) cannot behave well in all scenarios. Therefore, there is an ever-increasing demand for automated architecture exploration to obtain the optimal data-specific GNN architectures.

Motivated by the success of neural architecture search (NAS) in other established areas, e.g., computer vision, several graph neural architecture search (G-NAS) methods have recently been proposed to tackle the architecture challenge in GNNs, including GraphNAS (Gao et al., 2020a), Auto-GNN (Zhou et al., 2019), and GraphGym (You et al., 2020). These G-NAS methods assume that GNNs consist of several repetitive message-passing layers and focus mainly on intra-layer designs such as the aggregation function and the non-linear activation function. Despite their effectiveness, they suffer from two limitations.

**Fixed Pipeline Pattern.**
To generate GNNs, existing methods adopt a fixed message-passing pipeline to organize two types of atomic operations: propagating (P) the representations of a node's neighbors and applying a transformation (T) to the representations. In particular, most G-NAS methods adopt a tight entanglement in which a transformation follows a propagation in every layer (e.g., P-T-P-T). Several handcrafted architectures relax this entanglement to a certain degree by retaining only the first transformation or propagation, e.g., T-P-P-P (Klicpera et al., 2019; Liu et al., 2020) or P-P-P-T (Wu et al., 2019; Rossi et al., 2020; Zhang et al., 2021c). However, these specific P-T permutations and combinations are still fixed pipeline designs, which limits the expressive power of the macro-architecture search space. It remains to be seen whether more general and flexible pipelines can further improve performance.

**Restricted Pipeline Depth.** A common practice to increase the expressive power is to directly stack multiple GNN layers (Kipf & Welling, 2017). However, as the layers become deeper, performance decreases because the node embeddings become indistinguishable after too many P operations, which we refer to as the over-smoothing issue (Li et al., 2018; Miao et al., 2021b). Therefore, existing G-NAS methods fix the number of layers to a small constant. For example, both Auto-GNN and GraphNAS pre-define a very restricted number of GNN layers (e.g., 3). How to increase the pipeline depth is another critical problem for effective G-NAS.

This paper proposes DFG-NAS, a new method for the automatic search of deep and flexible GNN architectures. Our key insight is that the propagation (P) and transformation (T) operations correspond to enforcing and mitigating the effect of smoothing, respectively. Inspired by this insight, we propose to search for the best P-T permutations and combinations (i.e., the pipeline) to tune a suitable smoothness level and thus obtain well-behaved data-specific GNN architectures. Instead of the micro-architecture of intra-layer design, we explore another level of design: pipeline search in the GNN macro-architecture.

Specifically, DFG-NAS greatly increases the capability of GNN macro-architecture search along two dimensions: the pipeline pattern and the depth of GNN generation. We accomplish this by disaggregating the P and T operations in our search space and defining four effective mutation designs to explore P-T permutations and combinations. To address the over-smoothing problem, we propose a gating mechanism between different P operations so that node-adaptive propagation can be achieved. Furthermore, a skip-connection mechanism is applied to T operations to avoid model degradation (He et al., 2016; Zhang et al., 2021a). On top of the pipeline and mutation designs, an evolutionary algorithm is applied to search for well-behaved GNN architectures.
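
To make the disaggregated search space concrete, the sketch below encodes a GNN macro-architecture as a plain sequence of `'P'`/`'T'` tokens and applies a simple add/remove mutation. The `random_pipeline` and `mutate` helpers are our own illustrative assumptions: they only convey the general flavor of evolving P-T pipelines, not the four mutation designs actually defined by DFG-NAS.

```python
import random

def random_pipeline(max_len=8):
    """Sample a random P/T pipeline, e.g. ['P', 'P', 'T'] (illustrative only)."""
    length = random.randint(2, max_len)
    return [random.choice(['P', 'T']) for _ in range(length)]

def mutate(pipeline):
    """Hypothetical mutation: insert or delete a single P/T operation.
    DFG-NAS defines four custom macro-architecture mutations; this sketch
    only illustrates the add/remove flavor of such edits."""
    child = list(pipeline)
    if len(child) > 2 and random.random() < 0.5:
        child.pop(random.randrange(len(child)))            # remove one operation
    else:
        child.insert(random.randrange(len(child) + 1),     # add one operation
                     random.choice(['P', 'T']))
    return child

if __name__ == "__main__":
    parent = ['P', 'T', 'P', 'T']   # the classic entangled GCN-style pattern
    print("parent:", parent, "-> child:", mutate(parent))
    print("random pipeline:", random_pipeline())
```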
The contributions of this paper are summarized as follows: (1) By decoupling the P and T operations, DFG-NAS suggests a transition from studying specific fixed GNN pipelines to studying the GNN pipeline design space. (2) By further adding gating and skip-connection mechanisms, DFG-NAS supports both deep propagation and deep transformation, giving it the ability to explore the best architecture design and push forward the GNN performance boundary. (3) We search for the flexible pipeline using a custom-designed genetic algorithm, so that the final searched GNN architecture represents the result of joint optimization over the pattern and depth of pipelines. (4) Empirical results demonstrate that DFG-NAS achieves an accuracy improvement of up to 0.9% over state-of-the-art manual designs and brings up to 15.96x speedups over existing G-NAS methods.

## 2. Preliminary

### 2.1. Problem Formulation

Given a graph $G = (V, E)$ with $|V| = N$ nodes and $|E| = M$ edges, and a feature matrix $X = \{x_1, x_2, \ldots, x_N\}$ in which $x_i \in \mathbb{R}^d$ is the feature vector of node $v_i$, the node set $V$ is partitioned into a training set $V_{train}$ (including both the labeled set $V_l$ and the unlabeled set $V_u$), a validation set $V_{val}$, and a test set $V_{test}$. Suppose $c$ is the number of label classes; the one-hot vector $y_i \in \mathbb{R}^c$ is the ground-truth label of node $v_i$, and $\mathcal{M}$ is the performance evaluation metric of a design on a given graph analysis task, e.g., F1 score or accuracy in the node classification task. Specifically, let $\mathcal{F}$ be the search space (finite or infinite) of graph neural architectures. Graph neural architecture search (G-NAS) aims to find the optimal design $f^* \in \mathcal{F}$, so that the model can be trained to achieve the best performance in terms of the evaluation metric $\mathcal{M}$ on the validation set. Formally, it can be defined as the following bi-level optimization problem:

$$
\arg\max_{f \in \mathcal{F}} \; \mathbb{E}_{v_i \in V_{val}} \left[ \mathcal{M}\big(y_i, P(\hat{y}_i \mid f(\theta^*))\big) \right],
\quad \text{s.t.}\;\; \theta^* = \arg\min_{\theta} \ell\big(f(\theta), V_{train}\big),
\tag{1}
$$

where $P(\hat{y}_i \mid f(\theta^*))$ is the predicted label distribution of node $v_i$, $\ell$ is the loss function, and $\theta^*$ denotes the optimized weights of model design $f$. For each design $f$, G-NAS first trains the corresponding model weights $\theta^*$ on the training set $V_{train}$. It then evaluates the trained model $f(\theta^*)$ on the validation set $V_{val}$ to obtain the final evaluation result.

### 2.2. Graph Neural Networks

Based on the intuitive assumption that locally connected nodes are likely to have the same label, most GNNs iteratively propagate the information of each node to its adjacent nodes, and then transform the information with a non-linear transformation. We refer to the propagation and transformation operations as P and T, respectively. At timestep $t$, a message vector $m^t_v$ for node $v \in V$ is computed from the representations of its neighbors $N_v$ using the P operation, i.e., $m^t_v = P\big(\{h^{t-1}_u \mid u \in N_v\}\big)$. Then $m^t_v$ is updated as $h^t_v = T(m^t_v)$ via the T operation, where T is usually a dense layer. Taking the vanilla GCN (Kipf & Welling, 2017) as an example, a message-passing layer can be formulated as:

$$
P: \; M^t = \hat{A} H^{t-1}, \qquad
T: \; H^t = \delta(M^t W^t),
\tag{2}
$$

where $\hat{A}$ is the normalized adjacency matrix, $W^t$ contains the training parameters of the $t$-th layer, and $M^t$ and $H^t$ are the matrices formed by $m^t_v$ and $h^t_v$, respectively. By stacking $k$ layers, each node in GCN can utilize the information from its $k$-hop neighborhood. Therefore, the model performance is expected to improve when more distant neighborhood nodes get involved in the training process.
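
As a concrete reference for Eq. (2), here is a minimal PyTorch sketch of one GCN message-passing step with P and T written as separate functions. The class name and the dense `adj_norm` placeholder are our own illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PropagateThenTransform(nn.Module):
    """One GCN layer from Eq. (2), with the P and T steps kept separate."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # W^t of the T operation

    def propagate(self, adj_norm, h):
        # P: M^t = A_hat @ H^{t-1} (dense normalized adjacency for brevity)
        return adj_norm @ h

    def transform(self, m):
        # T: H^t = delta(M^t W^t), with ReLU as the non-linearity delta
        return torch.relu(self.linear(m))

    def forward(self, adj_norm, h):
        return self.transform(self.propagate(adj_norm, h))

# Usage: 5 nodes with 16-dim features; the identity matrix stands in for A_hat.
adj_norm = torch.eye(5)                # substitute the real normalized adjacency
h0 = torch.randn(5, 16)
h1 = PropagateThenTransform(16, 32)(adj_norm, h0)   # shape: (5, 32)
```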
### 2.3. Graph Neural Architecture Search

As P and T are the basic operations of GNNs, the search for P and T operations has been widely discussed. Specifically, existing G-NAS methods can be classified into micro-architecture search and macro-architecture search.

**Micro-architecture search.** Micro-architecture corresponds to the design of graph convolution layers, especially the details of the P and T operations. Specifically, the P operation determines how to exchange messages among different nodes in each graph convolution layer. Many GNNs (Kipf & Welling, 2017; Wu et al., 2019; Klicpera et al., 2019) adopt the normalized adjacency matrix for neighbor propagation, and several attention mechanisms (Velickovic et al., 2018; Wang et al., 2021) have also been proposed for more effective propagation. In addition, how to combine the propagated node embeddings (e.g., MEAN, MAX, etc.) is also important in P. Unlike P, which is closely related to the graph structure, T focuses on the non-linear transformation in the neural network, including the choices of hidden size and activation function. Several G-NAS frameworks (Gao et al., 2020b; Zhou et al., 2019; Cai et al., 2021; Li et al., 2021b) have been proposed to search for the best P and T operations from a pool of various implementations. Different from these works, we focus on the macro-architecture.

**Macro-architecture search.** Different from the micro-architecture, which highlights the details in each GNN layer, macro-architecture search focuses on the interaction between different layers. Concretely, macro-architecture search involves the design of the network topology, including the choices of layer depth and inter-layer skip connections. Similar to residual connections and dense connections in CNNs (He et al., 2016; Huang et al., 2017), node representations in one GNN layer do not necessarily depend solely on the immediately previous layer (Xu et al., 2018; Li et al., 2019; Miao et al., 2021a). Taking the representative GraphGym (You et al., 2020) as an example, its macro-architecture search space includes the choice of graph convolutional layer depth, the pre-processing and post-processing layer depth, and the skip connections. The main difference between our method and previous work is that we open up new design opportunities for macro-architecture search in GNNs. Specifically, we disentangle the P and T operations in GNN layers and thus allow exploring P-T permutations and combinations when designing GNN architectures.

## 3. How do P and T influence GNNs?

### 3.1. Entanglement of GNNs

**Entangled GNNs.** The pattern of entangled propagation and transformation is widely adopted by mainstream GNNs, e.g., GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2018), and GraphSAINT (Zeng et al., 2020).

*Figure 1. Entanglement of GNNs (the entangled GCN-style P-T pipeline versus disentangled pipelines such as SGC and APPNP).*

Similar to the GCN shown in Figure 1, entangled GNNs pass the input signals through a set of filters to propagate the information, which is then followed by a non-linear transformation. The propagation operation P and the transformation operation T are intertwined and executed alternately in entangled GNNs. Therefore, entangled GNNs share the strict restriction that $D_p = D_t$, where $D_p$ and $D_t$ are the numbers of propagation and transformation operations, respectively.

**Disentangled GNNs.** Recently, several studies have shown that the entanglement of P and T can compromise performance on a range of benchmark tasks (Wu et al., 2019; He et al., 2020; Liu et al., 2020; Frasca et al., 2020; Klicpera et al., 2019). They also argue that the true effectiveness of GNNs lies in the propagation operation P rather than the T operation inside the graph convolution. Accordingly, several disentangled GNNs have been proposed to separate P and T.
Following SGC (Wu et al., 2019), as shown in Figure 1, several methods (He et al., 2020; Frasca et al., 2020) execute the P operations in advance and then feed the propagated features into multiple T operations. On the contrary, several other methods first transform the node features and then propagate the node information to distant neighbors, e.g., APPNP (Klicpera et al., 2019) and DAGNN (Liu et al., 2020).

### 3.2. Pipeline Pattern and Depth

The number of P and T operations plays a very important role in GNN learning. As introduced in previous studies (Wu et al., 2019; Liu et al., 2020; Miao et al., 2021a), more P operations are required to enhance information propagation when the edges, labels, or features are sparse. In addition, it is widely recognized that more training parameters and T operations should be used to increase the expressive power of neural networks when the size of the dataset (i.e., the number of nodes in the graph) is large.

Besides the number of P and T operations, the pipeline pattern (i.e., the permutation and combination) of P and T operations also matters. To verify this, we fix the depths of P and T operations to 2 and only change the pipeline pattern. We evaluate the three architectures in Figure 1 and report their test accuracy on three citation networks.

*Table 1. Test accuracy of GNNs with different P-T orders.*

| Methods | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| PPTT | 83.4 ± 0.3 | 72.2 ± 0.4 | 78.5 ± 0.5 |
| TTPP | 82.8 ± 0.2 | 71.8 ± 0.3 | 79.8 ± 0.3 |
| PTPT | 81.2 ± 0.6 | 71.2 ± 0.4 | 79.1 ± 0.2 |

The experimental results in Table 1 show that the test accuracy varies considerably with the pipeline pattern. Concretely, the TTPP architecture achieves the best performance on PubMed, while it is less competitive than the PPTT architecture on Cora and Citeseer. Since the pipeline pattern highly influences the architecture performance and the optimal pipeline pattern varies across graph datasets, it is necessary to consider the pipeline pattern for a specific dataset/task.

### 3.3. Influence on Smoothness

It is widely recognized that the propagation operation in GNNs is a form of Laplacian smoothing, which may lead to indistinguishable node representations (Li et al., 2018). In other words, the representations of all nodes converge to the same value when GNNs go deeper, which is called over-smoothing. In fact, most entangled GNNs (e.g., GCN and GAT) face the over-smoothing problem and suffer from performance degradation when too many GNN layers are stacked.

To better measure the influence of the P and T operations on smoothness, we analyze the change of smoothness when adding a single P or T operation at a time. We measure smoothness by calculating the average similarity (Liu et al., 2020) between different node embeddings after a P or T operation. Concretely, the node smoothness is defined as follows:

$$
S^t_i = 1 - \frac{1}{2N - 2} \sum_{j \in V} \left\| \frac{e^t_i}{\|e^t_i\|} - \frac{e^t_j}{\|e^t_j\|} \right\|,
\tag{3}
$$

where $e^t_i$ is the embedding of the $i$-th node at the $t$-th GNN layer, and $S^t_i$ measures the similarity of $e^t_i$ to the entire graph. A larger $S^t_i$ means node $i$ faces a higher risk of over-smoothing, and $S^t_i$ ranges in $[0, 1]$ since we adopt the normalized node embeddings to compute their Euclidean distance. Based on the node smoothness $S^t_i$, we further measure the average similarity over all node pairs and define the smoothness of the whole graph $G$ as:

$$
S^t = \frac{1}{N} \sum_{i \in V} S^t_i,
\tag{4}
$$

where $S^t$ is the graph smoothness at the $t$-th GNN layer.
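
For reference, here is a minimal NumPy sketch of the smoothness metric in Eqs. (3)-(4), assuming `E` holds the node embeddings $e^t_i$ after a given P or T operation; the function names are our own illustrative choices.

```python
import numpy as np

def node_smoothness(E):
    """Eq. (3): S^t_i = 1 - (1/(2N-2)) * sum_j || e_i/||e_i|| - e_j/||e_j|| ||."""
    N = E.shape[0]
    U = E / np.linalg.norm(E, axis=1, keepdims=True)       # row-normalize embeddings
    # Pairwise Euclidean distances between unit vectors (each lies in [0, 2];
    # the j = i term is zero, so summing over all j is equivalent to j != i).
    dists = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    return 1.0 - dists.sum(axis=1) / (2 * N - 2)

def graph_smoothness(E):
    """Eq. (4): S^t = (1/N) * sum_i S^t_i, the smoothness of the whole graph."""
    return node_smoothness(E).mean()

# Example: random 100-node, 16-dim embeddings. Values near 1 mean the embeddings
# are nearly indistinguishable (over-smoothing); values near 0 mean under-smoothing.
E = np.random.randn(100, 16)
print(graph_smoothness(E))
```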
Unlike previous G-NAS methods that treat the combination of P and T as one GNN layer, here we treat each P or T as an individual layer. We vary the permutations and combinations of P and T, and report the output smoothness of each layer in SGC and GCN on Cora.

*Figure 2. Smoothness of different P-T orders (layer-wise smoothness level over layers 1-4).*

As shown in Figure 2, the results show that 1) the smoothness increases, i.e., the node embeddings become more similar, when applying the P operation; and 2) the smoothness decreases when applying the T operation, which implies that the T operation has the ability to alleviate the over-smoothing issue. While over-smoothing leads to indistinguishable node embeddings, under-smoothing cannot unleash the full potential of the graph structure. Therefore, how to carefully control smoothness with P and T operations in GNNs is also an open question.

## 4. The Proposed Method

Based on message-passing disaggregation, we propose a general design space for GNN pipeline search, which covers the crucial aspects of the number, permutations, and combinations of P and T operations. We then provide a custom-designed genetic algorithm to search this space of pipeline patterns and depths.

### 4.1. Pipeline Search Space

Our general search space of GNN pipelines includes the P-T permutations and combinations as well as the number of P and T operations. In addition, we add connections among each type of operation to enable deep propagation and transformation in DFG-NAS. Suppose a single P or T operation constitutes one GNN layer in DFG-NAS, $o^{(l)}_v$ is the output of node $v$ at the $l$-th layer, and $L_P$ and $L_T$ are the two sets containing the layer indices of all P and T operations, respectively. The connections between different layers are as follows.

**Propagation Connection.** As introduced in Section 3.3, GNNs may suffer from the over-smoothing or under-smoothing issue if we execute too many or too few propagation operations. In addition, a previous study (Zhang et al., 2021b) shows that nodes with different local structures have different smoothing speeds. Therefore, controlling the smoothness of different nodes in a node-adaptive way is essential in GNN architecture design.

*Figure 3. A GNN pipeline example in the search space of DFG-NAS (gating over P operations for deep propagation; skip-connections over T operations for deep transformation).*

To allow deep propagation and provide suitable smoothness for different nodes, we adopt a gating mechanism upon the P operations. The output of the $l$-th P operation is the propagated node embedding of $o^{(l-1)}_v$ if its next operation is P. As shown in Figure 3, if the next operation is T, we assign node-adaptive combination weights to the node embeddings propagated by all previous P operations. The above process can be formulated as:

$$
z^{(l)}_v = P\big(o^{(l-1)}_v\big), \qquad
o^{(l)}_v =
\begin{cases}
z^{(l)}_v, & \text{followed by P}, \\
\sum_{i \in L_P,\, i \le l} \mathrm{Softmax}(\alpha_i)\, z^{(i)}_v, & \text{followed by T},
\end{cases}
\tag{5}
$$

where $\alpha_i = \sigma(s \cdot o^{(i)}_v)$ is the weight for the $i$-th layer output of node $v$, $s$ is a trainable vector shared by all nodes, and $\sigma$ denotes the Sigmoid function. We adopt the Softmax function to scale the sum of the gating scores to 1.
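
The following is a simplified sketch of the gated propagation in Eq. (5), assuming a dense normalized adjacency matrix and a shared trainable gate vector `s`; it illustrates the mechanism rather than reproducing the authors' implementation (the gate here is computed from the propagated embeddings of each P step).

```python
import torch
import torch.nn as nn

class GatedPropagation(nn.Module):
    """Run `num_p` consecutive P operations and combine their outputs with a
    node-adaptive gate (Eq. (5)) before handing the result to the next T."""
    def __init__(self, dim, num_p):
        super().__init__()
        self.num_p = num_p
        self.s = nn.Parameter(torch.randn(dim))   # gate vector shared by all nodes

    def forward(self, adj_norm, h):
        outs = []                                  # z^(i)_v for every P step
        z = h
        for _ in range(self.num_p):
            z = adj_norm @ z                       # P: propagate along the edges
            outs.append(z)
        # alpha_i = sigmoid(s . z^(i)_v), then Softmax over the P steps per node.
        alphas = torch.stack([torch.sigmoid(z_i @ self.s) for z_i in outs], dim=0)
        weights = torch.softmax(alphas, dim=0)     # shape: (num_p, N)
        # Node-adaptive weighted sum of all propagated embeddings.
        return sum(w.unsqueeze(-1) * z_i for w, z_i in zip(weights, outs))

# Usage: combine three propagation steps for 5 nodes with 8-dim features.
adj_norm = torch.eye(5)                            # placeholder normalized adjacency
h = torch.randn(5, 8)
out = GatedPropagation(dim=8, num_p=3)(adj_norm, h)   # shape: (5, 8)
```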
**Transformation Connection.** GNNs require more training parameters and non-linear transformation operations to enhance their expressive power on larger graphs. However, it is widely acknowledged that too many transformation operations lead to the model degradation issue (He et al., 2016), i.e., both the training and test accuracies decrease as the number of transformation steps increases.

To allow deeper transformation while alleviating the model degradation issue, we introduce a skip-connection mechanism into the T operations. As shown in Figure 3, the input of each T operation is the sum of the output of the last layer and the outputs of all previous T operations before the last layer. Concretely, the input and output of the $l$-th T operation can be defined as follows:

$$
z^{(l)}_v = o^{(l-1)}_v + \sum_{i \in L_T,\, i < l-1} o^{(i)}_v, \qquad
o^{(l)}_v = T\big(z^{(l)}_v\big),
$$
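
Below is a simplified sketch of the skip-connected T stack described above, assuming each T operation is a dense layer with ReLU; the class and variable names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SkipConnectedTransform(nn.Module):
    """Stack of T operations where each T receives the previous output plus the
    outputs of all earlier T operations, mitigating model degradation."""
    def __init__(self, dim, num_t):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_t)])

    def forward(self, h):
        t_outputs = []                       # o^(i)_v of earlier T operations
        out = h
        for layer in self.layers:
            # Skip term: all T outputs before the last layer (empty for the first two T's).
            skip = sum(t_outputs[:-1]) if len(t_outputs) > 1 else 0
            z = out + skip                   # input of the current T operation
            out = torch.relu(layer(z))       # T: non-linear dense transformation
            t_outputs.append(out)
        return out

# Usage: four stacked T operations on 5 nodes with 8-dim embeddings.
h = torch.randn(5, 8)
out = SkipConnectedTransform(dim=8, num_t=4)(h)   # shape: (5, 8)
```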