# Graph Differentiable Architecture Search with Structure Learning

Yijian Qin, Xin Wang, Zeyang Zhang, Wenwu Zhu
Tsinghua University
qinyj19@mails.tsinghua.edu.cn, xin_wang@tsinghua.edu.cn, zy-zhang20@mails.tsinghua.edu.cn, wwzhu@tsinghua.edu.cn

Corresponding Authors. Our code will be released at https://github.com/THUMNLab/AutoGL.
35th Conference on Neural Information Processing Systems (NeurIPS 2021), virtual.

## Abstract

Discovering ideal Graph Neural Network (GNN) architectures for different tasks is labor intensive and time consuming. To save human effort, Neural Architecture Search (NAS) has recently been used to automatically discover adequate GNN architectures for certain tasks, achieving competitive or even better performance than manually designed architectures. However, existing works that utilize NAS to search GNN structures fail to answer the question of how NAS is able to select the desired GNN architectures. In this paper, we investigate this question for the first time. Through theoretical analysis and a measurement study with experiments, we find that gradient-based NAS methods tend to select architectures according to how useful different types of information are with respect to the target task. Our explorations further show that gradient-based NAS also suffers from noise hidden in the graph, which leads to searching suboptimal GNN architectures. Based on these findings, we propose the Graph differentiable Architecture Search model with Structure Optimization (GASSO), which allows differentiable search of the architecture with gradient descent and is able to discover graph neural architectures with better performance by employing graph structure learning as a denoising process within the search procedure. Extensive experiments on real-world graph datasets demonstrate that the proposed GASSO model achieves state-of-the-art performance compared with existing baselines.

## 1 Introduction

In real-world applications including social networks, e-commerce, traffic, and biochemistry, a large variety of relational data can be represented as graphs. This has motivated graph neural network (GNN) based models such as GCN [1], GAT [2] and GIN [3], which are designed to learn and extract knowledge from relational graph-structured data. To exploit the information in both the graph structure and the node features, GNNs follow a recursive message passing scheme in which every node aggregates information from its neighbors in each layer, and they have achieved great success in various graph-related tasks.

Given that discovering ideal GNN architectures for different tasks is labor intensive and time consuming, graph architecture search [4, 5] employs the ideas of neural architecture search (NAS) [6-8] to facilitate the automatic design of optimal GNN architectures, so that a large amount of human effort can be saved. These automatically generated GNNs can achieve competitive or even better performance than manually designed GNNs on graph-related tasks.

Nevertheless, existing works on graph architecture search mainly focus on designing the search space and search strategy, ignoring the fundamental mechanism behind adopting NAS for automatic GNN structure search and failing to answer the following questions: (i) How does graph architecture search select its desired architectures? (ii) How optimal are the architectures selected by graph neural architecture search?
Answers to these questions can help us understand how NAS and GNNs gather messages, guiding us toward designing better GNN structures for specific tasks. The inability of existing works to answer these questions significantly limits their capability of designing powerful GNN architectures.

In this paper, we answer the above questions by exploring how graph neural architecture search selects the desired GNN architectures. Neural architecture search approaches in the literature can be mainly categorized into gradient-based (e.g., DARTS [8]), reinforcement-learning-based, evolution-algorithm-based, and Bayesian-optimization-based methods; we focus on gradient-based architecture search in this work and leave investigations of the other groups as future work. We theoretically analyze the behavior of DARTS and find that DARTS prefers operations that help correct the predictions on hard data. Further analysis on graph data shows that different operations fit graphs with different amounts of information in the node features and graph structure. A measurement study with synthetic experiments corroborates our theory: DARTS on GNNs evaluates the usefulness of the information hidden in node features and graph structure with respect to the target task, and then automatically selects appropriate operations based on the evaluated usefulness. On the other hand, we also discover that the performance of gradient-based graph architecture search (e.g., DARTS) can be deteriorated by noise inside the graphs, leading to suboptimal architectures.

To solve this problem, we propose the Graph differentiable Architecture Search model with Structure Optimization (GASSO), which allows differentiable search of the architecture with gradient descent and is capable of searching for the optimal architecture while adjusting the graph structure adaptively through a joint optimization scheme. The graph structure adjustment in GASSO serves as adaptive noise reduction that enhances architecture search performance. The framework of GASSO is shown in Figure 1. We make the graph structure differentiable so that it can be optimized with gradient-based methods during training, and we use hidden feature smoothness to evaluate edge importance and constrain edge weights. Overall, we jointly optimize the parameters of the neural architecture and the graph structure in an iterative updating manner.

Figure 1: Overview framework of the proposed GASSO model. The original graph and the super network architecture are jointly optimized: the hidden feature smoothness loss and the training loss are used to optimize the graph structure and the neural architecture. The learned architecture is then applied to the denoised graph to produce predictions for the target nodes.
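To make the iterative updating scheme concrete, the following is a minimal, self-contained PyTorch sketch of one way such an alternating optimization could look. All names and choices here (the two toy candidate operations, the random toy graph, applying a sigmoid to edge logits, the 0.1 smoothness coefficient, and using the output logits as the "hidden" features) are illustrative assumptions for exposition only, not the authors' implementation or exact objectives, which are specified later in the paper.

```python
# Illustrative sketch only: alternating updates of operation weights,
# architecture parameters, and differentiable edge weights.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, C = 100, 16, 3                              # nodes, feature dim, classes
x = torch.randn(N, D)                             # toy node features
y = torch.randint(0, C, (N,))                     # toy labels
edge_index = torch.randint(0, N, (2, 400))        # toy graph (random edges)
train_mask = torch.rand(N) < 0.6
val_mask = ~train_mask

class MeanAggOp(torch.nn.Module):
    """GCN-like candidate op: weighted mean over neighbors, then a linear map."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = torch.nn.Linear(d_in, d_out)
    def forward(self, h, edge_index, edge_weight):
        src, dst = edge_index
        msg = h[src] * edge_weight.unsqueeze(-1)
        agg = torch.zeros_like(h).index_add_(0, dst, msg)
        deg = torch.zeros(h.size(0)).index_add_(0, dst, edge_weight)
        return self.lin(agg / deg.clamp(min=1e-6).unsqueeze(-1))

class LinearOp(torch.nn.Module):
    """Feature-only candidate op: ignores the graph structure entirely."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = torch.nn.Linear(d_in, d_out)
    def forward(self, h, edge_index, edge_weight):
        return self.lin(h)

ops = torch.nn.ModuleList([MeanAggOp(D, C), LinearOp(D, C)])
alpha = torch.zeros(len(ops), requires_grad=True)                   # architecture parameters
edge_logits = torch.zeros(edge_index.size(1), requires_grad=True)   # differentiable structure

def forward():
    w = torch.sigmoid(edge_logits)                 # learnable edge weights in (0, 1)
    mix = F.softmax(alpha, dim=0)                  # DARTS-style mixture over candidate ops
    h = sum(m * op(x, edge_index, w) for m, op in zip(mix, ops))
    return h, w

def smoothness(h, w):
    """Penalize large feature differences across strongly weighted edges."""
    src, dst = edge_index
    return (w * (h[src] - h[dst]).pow(2).sum(-1)).mean()

weight_opt = torch.optim.Adam(ops.parameters(), lr=1e-2)
arch_opt = torch.optim.Adam([alpha, edge_logits], lr=1e-2)

for step in range(50):
    # (1) update operation weights on the training loss
    h, w = forward()
    loss_w = F.cross_entropy(h[train_mask], y[train_mask])
    weight_opt.zero_grad()
    loss_w.backward()
    weight_opt.step()
    # (2) update architecture parameters and graph structure on a held-out loss
    #     plus a feature-smoothness regularizer
    h, w = forward()
    loss_a = F.cross_entropy(h[val_mask], y[val_mask]) + 0.1 * smoothness(h, w)
    arch_opt.zero_grad()
    loss_a.backward()
    arch_opt.step()
```

The key design point illustrated is that both the architecture parameters and the edge weights stay continuous, so gradient-based optimizers can adjust the structure and the architecture jointly in alternating steps.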
We evaluate GASSO on several widely used graph benchmark datasets, including Citeseer, Cora and Pubmed. The experimental results show that our proposed GASSO model outperforms state-of-the-art methods and demonstrates better denoising ability than the existing graph architecture search model based on DARTS and other GNN models.

To summarize, this paper makes the following contributions:

(1) We theoretically explore how graph neural architecture search is able to select the desired GNN architectures, to the best of our knowledge for the first time, showing that i) gradient-based graph architecture search prefers operations that help correct the predictions on hard data, and ii) different operations fit graphs with different amounts of information in the node features and graph structure.

(2) We design a measurement study with experiments showing that i) gradient-based graph architecture search selects operations based on the usefulness of the information in graphs, and ii) noise in graph features and structure can deteriorate architecture search performance.

(3) We propose the Graph differentiable Architecture Search model with Structure Optimization (GASSO), which searches for the optimal architecture while adjusting the graph structure adaptively through a joint optimization scheme. Experimental results on several graph benchmark datasets demonstrate the superiority of GASSO over state-of-the-art methods.

## 2 Related Work

### 2.1 Graph Neural Network

As an effective framework for graph representation learning, GNNs [1-3, 9-13] follow a neighborhood aggregation scheme: at each iteration, the representation of a node is generated by aggregating the representations of its neighbors. For instance, the graph convolutional network (GCN) [1] takes the average of all neighbor node features as the aggregation. In contrast, the Graph Attention Network (GAT) [2] treats neighbors unequally by computing an attention coefficient for each neighbor, so the aggregation can emphasize the representations of more important neighbors with larger attention. After k iterations, the vectorized representation of a node captures both the structural (topological) and semantic (attributed) information within its k-hop neighborhood. GNNs can further characterize the whole graph by aggregating node representations [1, 14], achieving state-of-the-art results for tasks including node and graph classification.

Like other types of deep neural networks, GNNs are vulnerable to noise and adversarial attacks. Graph attackers usually perturb the data by changing node features or the graph structure [15, 16], and how to denoise graphs and defend against graph adversarial attacks is itself an active research question [17-19]. Graph structure learning has been introduced in recent GNN methods [20-22]. Since the graph structure plays a central role in the GNN scheme, noise in the original graph is extremely harmful to GNNs. Graph structure learning aims to recover a clean graph during the training procedure; to achieve this, prior constraints such as low rank and smoothness are utilized, as formalized below. Learning a new graph structure improves a GNN's denoising ability as well as its robustness against adversarial attacks.
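For concreteness, these priors are commonly instantiated as follows. This is a generic formulation from the graph structure learning literature under assumed notation ($S$ a learned adjacency matrix, $X$ the node feature matrix, $L_S = D_S - S$ its Laplacian with degree matrix $D_S$), not the exact objective of any single cited method.

```latex
% Illustrative prior terms for graph structure learning (assumed notation):
% S : learned adjacency matrix,  X : node feature matrix,  L_S = D_S - S.
\begin{aligned}
  % feature smoothness: strongly connected nodes should have similar features
  \mathcal{L}_{\mathrm{smooth}}(S) &= \tfrac{1}{2}\sum_{i,j} S_{ij}\,\lVert x_i - x_j\rVert_2^2
                                    = \operatorname{tr}\!\bigl(X^{\top} L_S X\bigr), \\
  % low-rank and sparsity priors via the nuclear norm and the \ell_1 norm
  \mathcal{L}_{\mathrm{rank}}(S)   &= \lVert S \rVert_{*}, \qquad
  \mathcal{L}_{\mathrm{sparse}}(S) = \lVert S \rVert_{1}.
\end{aligned}
```

Minimizing such terms jointly with the task loss is what lets a learned structure filter out noisy edges while staying close to a plausible graph.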
### 2.2 (Graph) Neural Architecture Search

Recent years have witnessed a significant surge of research on automated machine learning, including hyper-parameter optimization [23-26] and Neural Architecture Search (NAS) methods [6-8, 27-30], which aim to design neural architectures automatically for certain tasks. Since the architecture search space is discrete, reinforcement learning [6, 7] and evolutionary algorithms [31, 32] are often used in NAS methods. Alternatively, transferring the discrete architecture search space into a differentiable space is another strategy for solving the NAS problem: DARTS [8] and SNAS [33] construct a super network in which each operation is a mixture of all candidate operations, making it possible to update the architecture and the weights simultaneously through classical gradient descent. NAS has also been applied to graph tasks, achieving performance competitive with the state of the art. Existing works [4, 5, 34-36] mainly focus on defining a proper search space and search strategy for GNNs. However, the mechanism by which NAS methods select architectures, and the relationship between NAS methods and the denoising ability of GNNs, have received little attention.

## 3 Exploring Architecture Search for Graph

### 3.1 Preliminaries: Differentiable Architecture Search

We use DARTS [8], a representative NAS method that combines conciseness and effectiveness, as the object of exploration in this paper. In DARTS, a neural architecture is represented as a directed acyclic graph (DAG) in which the input data is the source node, the output is the sink node, and the edges represent operations (e.g., a GCN layer or a linear layer) applied to the data. Following a micro search space setting, DARTS only searches which operation should be selected and how the nodes in the DAG should be connected; the calculation inside each operation is predefined. The DARTS procedure has two phases: the search phase and the evaluation phase. In the search phase, a super network (shown in Figure 1) is constructed in which an edge exists between every pair of nodes, i.e., $e_{i,j}$ exists for $0 \le i < j \le N-1$. Each edge is a mixed operation computed as

$$\bar{e}_{i,j}(x_i) = \sum_{o \in \mathcal{O}} \frac{\exp\{\alpha^{o}_{i,j}\}}{\sum_{o' \in \mathcal{O}} \exp\{\alpha^{o'}_{i,j}\}}\, o(x_i),$$

where $x_i$ is the output of node $i$, $\mathcal{O}$ is the candidate operation set, and $\alpha$ denotes the learnable architecture parameters. After computing this weighted mixture, each node aggregates all of its input edges by

$$x_j = \sum_{i < j} \bar{e}_{i,j}(x_i).$$
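To illustrate the mixed operation defined above, here is a minimal PyTorch sketch of a single DARTS-style mixed edge. The candidate operations, dimensions, and variable names are toy assumptions for exposition and do not correspond to the search space used later in the paper.

```python
# Illustrative sketch of one DARTS-style mixed edge (toy candidate ops).
import torch
import torch.nn.functional as F

class MixedEdge(torch.nn.Module):
    """One edge of the super network: a softmax-weighted mixture of candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = torch.nn.ModuleList(candidate_ops)
        # one architecture parameter alpha^o per candidate operation o
        self.alpha = torch.nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        # softmax over architecture parameters: exp(alpha^o) / sum_o' exp(alpha^o')
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Toy candidate set, standing in for e.g. a GCN layer or a linear layer.
dim = 8
edge = MixedEdge([torch.nn.Linear(dim, dim), torch.nn.Tanh(), torch.nn.Identity()])

x_i = torch.randn(4, dim)        # output of node i in the DAG
contribution = edge(x_i)         # \bar{e}_{i,j}(x_i); node j sums these over all i < j
# After the search phase, the discrete architecture keeps the argmax operation.
chosen_op = edge.ops[int(edge.alpha.argmax())]
```

Because the mixture weights are differentiable, the architecture parameters can be updated by the same gradient descent that trains the operation weights, which is the property GASSO later exploits to co-optimize the graph structure.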