# Neural Architecture Optimization

Renqian Luo¹, Fei Tian², Tao Qin², Enhong Chen¹, Tie-Yan Liu²
¹University of Science and Technology of China, Hefei, China
²Microsoft Research, Beijing, China
¹lrq@mail.ustc.edu.cn, cheneh@ustc.edu.cn  ²{fetia, taoqin, tie-yan.liu}@microsoft.com

The work was done when the first author was an intern at Microsoft Research Asia. The first two authors contributed equally to this work.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning (RL) or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) an encoder embeds/maps neural network architectures into a continuous space; (2) a predictor takes the continuous representation of a network as input and predicts its accuracy; (3) a decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architectures discovered by our method are very competitive on the CIFAR-10 image classification task and the PTB language modeling task, outperforming or on par with the best results of previous architecture search methods while requiring significantly fewer computational resources. Specifically, we obtain a 2.11% test set error rate on CIFAR-10 image classification and 56.0 test set perplexity on PTB language modeling. The best discovered architectures on both tasks are successfully transferred to other tasks such as CIFAR-100 and WikiText-2. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate 3.53%) and on PTB (with test set perplexity 56.6) with very limited computational resources (less than 10 GPU hours) for both tasks.

1 Introduction

Automatic design of neural network architectures without human intervention has been of interest to the community from decades ago [12, 22] to very recent work [45, 46, 27, 36, 7]. The latest algorithms for automatic architecture design usually fall into two categories: reinforcement learning (RL) based methods [45, 46, 34, 3] and evolutionary algorithm (EA) based methods [40, 32, 36, 27, 35]. In RL based methods, the choice of a component of the architecture is regarded as an action. A sequence of actions defines an architecture of a neural network, whose dev set accuracy is used as the reward. In EA based methods, search is performed through mutations and re-combinations of architectural components, where the architectures with better performance are picked to continue the evolution.

It can be easily observed that both RL and EA based methods essentially perform search within a discrete architecture space. This is natural since the choices of neural network architectures are
typically discrete, such as the filter size in a CNN and the connection topology in an RNN cell. However, directly searching for the best architecture within the discrete space is inefficient, given that the search space grows exponentially with the number of choices. In this work, we instead propose to optimize network architectures by mapping them into a continuous vector space (i.e., network embeddings) and conducting optimization in this continuous space via a gradient based method. On one hand, similar to the distributed representation of natural language [33, 25], the continuous representation of an architecture is more compact and efficient in representing its topological information; on the other hand, optimizing in a continuous space is much easier than directly searching within a discrete space due to better smoothness.

We call this optimization based approach Neural Architecture Optimization (NAO), which is briefly shown in Fig. 1. The core of NAO is an encoder model responsible for mapping a neural network architecture into a continuous representation (the blue arrow in the left part of Fig. 1). On top of the continuous representation we build a regression model to approximate the final performance (e.g., classification accuracy on the dev set) of an architecture (the yellow surface in the middle part of Fig. 1). It is noteworthy here that the regression model is similar to the performance predictors in previous works [4, 26, 10]. What distinguishes our method is how we leverage the performance predictor: different from previous work [26], which uses the performance predictor as a heuristic to select already generated architectures to speed up the searching process, we directly optimize this module by gradient descent to obtain the continuous representation of a better network (the green arrow in the middle and bottom parts of Fig. 1). The optimized representation is then leveraged to produce a new neural network architecture that is predicted to perform better. To achieve that, another key module of NAO is designed to act as the decoder, recovering the discrete architecture from the continuous representation (the red arrow in the right part of Fig. 1). The decoder is an LSTM model equipped with an attention mechanism that makes exact recovery easy. The three components (i.e., encoder, performance predictor and decoder) are jointly trained in a multi-task setting, which is beneficial to the continuous representation: the decoder objective of recovering the architecture further improves the quality of the architecture embedding, making it more effective in predicting the performance.

We conduct thorough experiments to verify the effectiveness of NAO, on both image classification and language modeling tasks. Using the same architecture space commonly used in previous works [45, 46, 34, 26], the architecture found via NAO achieves a 2.11% test set error rate (with cutout [11]) on CIFAR-10. Furthermore, on the PTB dataset we achieve 56.0 perplexity, also surpassing the best performance found via previous neural architecture search methods. We further show that, equipped with the recently proposed weight sharing mechanism in ENAS [34] to reduce the large complexity of the parameter space of child models, we can achieve improved efficiency in discovering powerful convolutional and recurrent architectures, each taking less than 10 hours on 1 GPU. Our code and model checkpoints are available at https://github.com/renqianluo/NAO.
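As a concrete illustration of the optimization step sketched above, the following is a minimal PyTorch sketch of gradient ascent on an architecture embedding, assuming already trained, differentiable encoder and performance predictor modules. The function name, arguments, step size and number of steps are illustrative assumptions, not the released implementation.

```python
import torch

def optimize_embedding(encoder, predictor, arch_tokens, eta=0.01, steps=10):
    """Move an architecture embedding uphill on the predicted accuracy.

    `encoder` and `predictor` are assumed to be trained torch.nn.Module
    objects and `arch_tokens` an encoded token tensor; names are hypothetical.
    """
    e_x = encoder(arch_tokens).detach()      # continuous representation e_x of x
    e_x.requires_grad_(True)
    for _ in range(steps):
        score = predictor(e_x)               # predicted dev-set accuracy f(e_x)
        grad, = torch.autograd.grad(score.sum(), e_x)
        # gradient ascent step toward higher predicted accuracy (green arrow in Fig. 1)
        e_x = (e_x + eta * grad).detach().requires_grad_(True)
    return e_x.detach()                      # pass this to the decoder to recover an architecture
```

The optimized embedding is then fed to the decoder to recover a discrete architecture, as detailed in Section 3.2.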
Figure 1: The general framework of NAO. Better viewed in color mode. The original architecture x is mapped to a continuous representation e_x via the encoder network. Then e_x is optimized into e_x' by maximizing the output of the performance predictor f using gradient ascent (the green arrow). Afterwards e_x' is transformed into a new architecture x' using the decoder network.

2 Related Work

Recently the design of neural network architectures has largely shifted from leveraging human knowledge to automatic methods, sometimes referred to as Neural Architecture Search (NAS) [40, 45, 46, 26, 34, 6, 36, 35, 27, 7, 21]. As mentioned above, most of these methods are built upon one of two basic algorithms: reinforcement learning (RL) [45, 46, 7, 3, 34, 8] and evolutionary algorithms (EA) [40, 36, 32, 35, 27]. For example, [45, 46, 34] use policy networks to guide the choice of the next architecture component. The evolution processes in [36, 27] guide the mutation and recombination of candidate architectures. Some recent works [17, 18, 26] try to improve the efficiency of architecture search by exploring the search space incrementally and sequentially, typically from shallow to deep. Among them, [26] additionally utilizes a performance predictor to select the promising candidates. Similar performance predictors have been specifically studied in parallel works such as [10, 4]. Although different in terms of search algorithms, all these works aim at improving the quality of the discrete decisions made in the process of searching architectures.

The most recent work parallel to ours is DARTS [28], which relaxes the discrete architecture space to a continuous one by a mixture model and utilizes gradient based optimization to derive the best architecture. On one hand, both NAO and DARTS conduct continuous optimization via a gradient based method; on the other hand, the continuous spaces in the two works are different: in DARTS it is the mixture weights, and in NAO it is the embedding of neural architectures. The difference in optimization space leads to a difference in how to derive the best architecture from the continuous space: DARTS simply assumes the best decision (among different choices of architectures) is the argmax of the mixture weights, while NAO uses a decoder to exactly recover the discrete architecture.

Another line of work with a similar motivation to ours uses Bayesian optimization (BO) to perform automatic architecture design [37, 21]. In BO, an architecture's performance is typically modeled as a sample from a Gaussian process (GP). The induced posterior of the GP, a.k.a. the acquisition function a : X → R+, where X represents the architecture space, is tractable to minimize. However, the effectiveness of the GP heavily relies on the choice of the covariance function K(x, x'), which essentially models the similarity between two architectures x and x'. One needs to spend extra effort on designing a good K(x, x') in the context of architecture design, bringing additional manual work, and the performance might still be unsatisfactory [21]. In comparison, we do not build our method on the complicated GP setup, and we empirically find that our simpler and more intuitive model works much better in practice.

3 Neural Architecture Optimization

We introduce the details of neural architecture optimization (NAO) in this section.
3.1 Architecture Space

Firstly we introduce the design space for neural network architectures, denoted as X. For a fair comparison with previous NAS algorithms, we adopt the same architecture space commonly used in previous works [45, 46, 34, 26, 36, 35].

For searching CNN architectures, we assume that the CNN architecture is hierarchical, in that a cell is stacked a certain number of times (denoted as N) to form the final CNN architecture. The goal is to design the topology of the cell. A cell is a convolutional neural network containing B nodes. Each of the nodes contains two branches, with each branch taking the output of one of the former nodes as input and applying an operation to it. The operation set includes 11 operators, listed in the Appendix. The node adds the outputs of its two branches as its output. The inputs of the cell are the outputs of the two previous cells, respectively denoted as node −2 and node −1. Finally, the outputs of all the nodes that are not used by any other node are concatenated to form the final output of the cell. Therefore, for each of the B nodes we need to: 1) decide which two previous nodes are used as the inputs to its two branches; 2) decide the operation to apply to each of its two branches. We set B = 5 in our experiments, as in [46, 34, 26, 35].

For searching RNN architectures, we use the same architecture space as in [34]. The architecture space is imposed on the topology of an RNN cell, which computes the hidden state h_t using the input i_t and the previous hidden state h_{t-1}. The cell contains B nodes, and we have to make two decisions for each node, similar to the CNN cell: 1) a previous node as its input; 2) the activation function to apply. For example, if we sample node index 2 and ReLU for node 3, the output of the node will be o_3 = ReLU(o_2 · W^h_3). An exception is the first node, where we only decide its activation function a_1 and its output is o_1 = a_1(i_t · W^i + h_{t-1} · W^h_1). Note that all W matrices are the weights associated with each node. The available activation functions are tanh, ReLU, identity and sigmoid. Finally, the output of the cell is the average of the outputs of all the nodes. In our experiments we set B = 12, as in [34].

We use a sequence of discrete string tokens to describe a CNN or RNN architecture. Taking the description of a CNN cell as an example, each branch of a node is represented via three tokens, including the node index it selects as input, the operation type and the operation size. For example, the sequence "node-2 conv 3x3 node1 max-pooling 3x3" means that the two branches of one node respectively take the outputs of node −2 and node 1 as inputs, and respectively apply 3×3 convolution and 3×3 max pooling. For ease of introduction, we use the same notation x = {x_1, ..., x_T} to denote such a string sequence of an architecture x, where x_t is the token at the t-th position and all architectures x ∈ X share the same sequence length, denoted as T. T is determined by the number of nodes B in each cell in our experiments.

3.2 Components of Neural Architecture Optimization

The overall framework of NAO is shown in Fig. 1. To be concrete, there are three major parts that constitute NAO: the encoder, the performance predictor and the decoder.

Encoder. The encoder of NAO takes the string sequence describing an architecture as input and maps it into a continuous space E. Specifically, the encoder is denoted as E : X → E. For an architecture x, we have its continuous representation (a.k.a. embedding) e_x = E(x).
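To make the string sequences the encoder consumes more concrete, here is a hedged Python sketch of how a CNN cell could be flattened into such a token sequence. The operation vocabulary and the random sampler are illustrative assumptions (the actual space has 11 operators and its own search procedure), and operation type and size are merged into a single token here for brevity.

```python
import random

# Illustrative operation vocabulary; the real space contains 11 operators (see the Appendix).
OPERATIONS = ["conv 3x3", "conv 5x5", "max-pooling 3x3", "avg-pooling 3x3", "identity"]

def sample_cnn_cell_tokens(num_nodes=5):
    """Sample a random cell and flatten it into a string-token sequence.

    node-2 and node-1 denote the outputs of the two previous cells; node i
    (i >= 1) denotes the i-th node inside the current cell.
    """
    tokens = []
    for node_id in range(1, num_nodes + 1):
        candidates = [-2, -1] + list(range(1, node_id))  # any earlier node is a legal input
        for _branch in range(2):                         # every node has two branches
            inp = random.choice(candidates)
            op = random.choice(OPERATIONS)
            tokens += [f"node{inp}", op]
    return tokens

# e.g. ['node-2', 'conv 3x3', 'node1', 'max-pooling 3x3', ...]
print(sample_cnn_cell_tokens())
```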
We use a single layer LSTM as the basic model of the encoder, and the hidden states of the LSTM are used as the continuous representation of x. Therefore we have e_x = {h_1, h_2, ..., h_T} ∈ R^{T×d}, where h_t ∈ R^d is the LSTM hidden state at the t-th timestep, with dimension d.

Performance predictor. The performance predictor f : E → R+ is another important module accompanying the encoder. It maps the continuous representation e_x of an architecture x into its performance s_x, measured by dev set accuracy. Specifically, f first conducts mean pooling on e_x = {h_1, ..., h_T} to obtain ē_x = (1/T) Σ_{t=1}^{T} h_t, and then maps ē_x to a scalar value using a feed-forward network as the predicted performance. For an architecture x and its performance s_x as training data, the optimization of f aims at minimizing the least-squares regression loss (s_x − f(E(x)))². Considering the objective of performance prediction, an important requirement for the encoder is to guarantee the permutation invariance of the architecture embedding: if two architectures x_1 and x_2 are symmetric (e.g., x_2 is formed by swapping the two branches within a node of x_1), their embeddings should be close and produce the same performance prediction score. To achieve that, we adopt a simple data augmentation approach inspired by data augmentation in computer vision (e.g., image rotation and flipping): for each pair (x_1, s_x), we add an additional pair (x_2, s_x), where x_2 is symmetric to x_1, and use both pairs (i.e., (x_1, s_x) and (x_2, s_x)) to train the encoder and performance predictor. Empirically we found that acting in this way brings a non-trivial gain: on CIFAR-10, about a 2% improvement when we measure the quality of the performance predictor via the pairwise accuracy over all the architectures (and their performances).

Decoder. Similar to the decoder in neural sequence-to-sequence models [38, 9], the decoder in NAO is responsible for decoding the string tokens of x, taking e_x as input and operating in an autoregressive manner. Mathematically, the decoder is denoted as D : E → X, which decodes the string tokens x from the continuous representation: x = D(e_x). We set D as an LSTM model with the initial hidden state s_0 = h_T(x). Furthermore, an attention mechanism [2] is leveraged to make decoding easier, which outputs a context vector ctx_r combining all encoder outputs {h_t}_{t=1}^{T} at each timestep r. The decoder D then induces a factorized distribution P_D(x|e_x) = ∏_{r=1}^{T} P_D(x_r | e_x, x_{<r}).
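To ground the encoder and performance predictor described above, here is a minimal PyTorch sketch: a single-layer LSTM over the token sequence, mean pooling of the hidden states, and a small feed-forward regressor trained with the squared loss. The module names, dimensions and the training snippet in the comments are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class EncoderPredictor(nn.Module):
    """Sketch of the encoder E and performance predictor f (illustrative sizes)."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=96):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # single-layer LSTM encoder
        self.regressor = nn.Sequential(                              # feed-forward predictor f
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, tokens):
        # tokens: (batch, T) integer ids of the architecture string sequence
        h, _ = self.lstm(self.embed(tokens))        # e_x = {h_1, ..., h_T}, shape (batch, T, d)
        pooled = h.mean(dim=1)                      # mean pooling over the T timesteps
        score = self.regressor(pooled).squeeze(-1)  # predicted dev-set accuracy s_x
        return h, score

# Predictor training term: the squared regression loss (s_x - f(E(x)))^2, e.g.
#   _, pred = model(tokens); loss_pred = ((pred - dev_accuracy) ** 2).mean()
# In NAO this would be combined with the decoder's reconstruction loss for joint training.
```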