Published as a conference paper at ICLR 2019

# SNAS: STOCHASTIC NEURAL ARCHITECTURE SEARCH

Sirui Xie, Hehui Zheng, Chunxiao Liu, Liang Lin
SenseTime
{xiesirui, zhenghehui, liuchunxiao}@sensetime.com, linliang@ieee.org

ABSTRACT

We propose Stochastic Neural Architecture Search (SNAS), an economical end-to-end solution to Neural Architecture Search (NAS) that trains neural operation parameters and architecture distribution parameters in the same round of back-propagation, while maintaining the completeness and differentiability of the NAS pipeline. In this work, NAS is reformulated as an optimization problem on the parameters of a joint distribution for the search space in a cell. To leverage the gradient information in a generic differentiable loss for architecture search, a novel search gradient is proposed. We prove that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently. This credit assignment is further augmented with a locally decomposable reward to enforce a resource-efficient constraint. In experiments on CIFAR-10, SNAS takes fewer epochs to find a cell architecture with state-of-the-art accuracy than non-differentiable evolution-based and reinforcement-learning-based NAS, and the discovered architecture is also transferable to ImageNet. It is also shown that child networks of SNAS can maintain the validation accuracy obtained during the search, with which attention-based NAS requires parameter retraining to compete, exhibiting the potential to stride towards efficient NAS on big datasets.

1 INTRODUCTION

The trend of automatically seeking state-of-the-art neural network architectures has been growing since Zoph & Le (2016), given the enormous effort such design otherwise demands from researchers. Normally, a Neural Architecture Search (NAS) pipeline comprises architecture sampling, parameter learning, architecture validation, credit assignment and search direction update. There are basically three existing frameworks for neural architecture search. Evolution-based NAS like NEAT (Stanley & Miikkulainen, 2002) employs evolutionary algorithms to optimize topology and parameters simultaneously. However, it takes enormous computational power and cannot leverage the efficient gradient back-propagation of deep learning. To match the performance of human-designed architectures, Real et al. (2018) takes 3150 GPU days for the whole evolution. Reinforcement-learning-based NAS is end-to-end for gradient back-propagation; the most efficient variant, ENAS (Pham et al., 2018), learns optimal parameters and architectures together, just like NEAT. However, as NAS is modeled as a Markov Decision Process, credits are assigned to structural decisions with temporal-difference (TD) learning (Sutton et al., 1998), whose efficiency and interpretability suffer from delayed rewards (Arjona-Medina et al., 2018). To get rid of the architecture sampling process, DARTS (Liu et al., 2019) proposes deterministic attention on operations to analytically calculate the expectation at each layer. After the convergence of the parent network, it removes operations with relatively weak attention. Due to the pervasive non-linearity in neural operations, this introduces intractable bias into the loss function. The bias causes inconsistency between the performance of derived child networks and converged parent networks, so parameter retraining becomes necessary.
A more efficient, more interpretable and less biased framework is desired, especially for future full-fledged NAS solutions on large datasets.

In this work, we propose a novel, efficient and highly automated framework, Stochastic Neural Architecture Search (SNAS), that trains neural operation parameters and architecture distribution parameters in the same round of back-propagation, while maintaining the completeness and differentiability of the NAS pipeline. One of the key motivations of SNAS is to replace the feedback mechanism triggered by constant rewards in reinforcement-learning-based NAS with more efficient gradient feedback from a generic loss. We reformulate NAS with a new stochastic modeling to bypass the MDP assumption in reinforcement learning. To combine architecture sampling with the computational graph of an arbitrary differentiable loss, the search space is represented with a set of one-hot random variables from a fully factorizable joint distribution, multiplied as a mask to select operations in the graph. Sampling from this search space is made differentiable by relaxing the architecture distribution with the concrete distribution (Maddison et al., 2016); a minimal code sketch of this relaxation is given after Figure 1. We name the gradients w.r.t. their parameters the search gradient. From a global view, we prove that SNAS optimizes the same objective as reinforcement-learning-based NAS, except that the training loss is used as the reward. Zooming in, we provide a policy-gradient equivalent of this search gradient, showing how gradients from the loss of each sample are used to assign credits to structural decisions. By interpreting this credit assignment as Taylor Decomposition (Montavon et al., 2017a), we prove SNAS's efficiency over reinforcement-learning-based NAS. Additionally, seeing that existing methods (Liu et al., 2019) manually design the topology of child networks to avoid overly complex architectures, we propose a global resource constraint to automate it, augmenting the objective with feasibility concerns. This global constraint can be linearly decomposed over structural decisions, hence the proof of SNAS's efficiency still applies.

In our experiments, SNAS shows strong performance compared with DARTS and all other existing NAS methods in terms of test error, model complexity and search cost. Specifically, SNAS discovers novel convolutional cells achieving 2.85±0.02% test error on CIFAR-10 with only 2.8M parameters, which is better than 3.00±0.14% with 3.3M parameters from 1st-order DARTS and 2.89% with 4.6M parameters from ENAS. It is also on par with 2.76±0.09% with 3.3M parameters from 2nd-order DARTS, using fewer parameters. With a more aggressive resource constraint, SNAS discovers an even smaller model achieving 3.10±0.04% test error on CIFAR-10 with 2.3M parameters. During the architecture search process, SNAS obtains a validation accuracy of 88%, compared to around 70% for ENAS, in fewer epochs. When validating the derived child network on CIFAR-10 without fine-tuning, SNAS maintains the search validation accuracy, significantly outperforming the 54.66% of DARTS. These results validate our theory that SNAS is less biased than DARTS. The discovered cell achieves 27.3% top-1 error when transferred to ImageNet (mobile setting), which is comparable to the 26.9% of 2nd-order DARTS.

Figure 1: A conceptual visualization of a forward pass within SNAS. Sampled from p(Z), Z is a matrix whose rows Z_{i,j} are one-hot random variable vectors indicating masks multiplied to edges (i, j) in the DAG. Columns of this matrix correspond to operations O_k. In this example, there are 4 candidate operations, among which the last one is the zero operation, i.e. removing that edge. The objective is the expectation of the generic loss L over all child graphs.
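To make the sampling and relaxation described above concrete, here is a minimal PyTorch-style sketch (our own illustration, not the released SNAS code): a single edge with three candidate operations whose selector Z is drawn from a concrete (Gumbel-Softmax) distribution, so that one backward pass delivers gradients to both the operation parameters and the architecture parameters. The names `TinyEdge`, `alpha` and `tau` are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEdge(nn.Module):
    """One edge (i, j) of the cell with three candidate operations O_{i,j}."""
    def __init__(self, channels=8):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # 5x5 conv
            nn.Identity(),                                 # skip connection
        ])
        # architecture distribution parameters (alpha) for this edge
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, tau=1.0):
        # Z ~ concrete(alpha, tau): a relaxed one-hot mask, differentiable w.r.t. alpha
        z = F.gumbel_softmax(self.alpha, tau=tau, hard=False)
        # masked sum over candidate operations, i.e. Z^T O(x)
        return sum(z[k] * op(x) for k, op in enumerate(self.ops))

edge = TinyEdge()
theta = [p for name, p in edge.named_parameters() if name != "alpha"]
opt_theta = torch.optim.SGD(theta, lr=0.05, momentum=0.9)
opt_alpha = torch.optim.Adam([edge.alpha], lr=3e-4)

x, target = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)

# one round of back-propagation updates operation weights AND architecture parameters
loss = F.mse_loss(edge(x), target)
opt_theta.zero_grad()
opt_alpha.zero_grad()
loss.backward()   # gradients w.r.t. theta and w.r.t. alpha (the "search gradient") in one pass
opt_theta.step()
opt_alpha.step()
```

With `hard=True`, the same call returns a discrete one-hot mask in the forward pass while keeping the relaxed gradient, which is closer to sampling an actual child graph; in a real search the loss would, of course, come from training data rather than a random regression target.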
2 METHODOLOGY

The main initiative of SNAS is to build an efficient and economical end-to-end learning system that compromises the NAS pipeline as little as possible. In this section, we first describe how to sample from the search space for NAS in a cell, and how this motivates a stochastic reformulation for SNAS (Section 2.1). A new optimization objective is provided and the inconsistency of attention-based NAS is discussed. Then in Section 2.2, we introduce how this discrete search space is relaxed to be continuous so that gradients can back-propagate through it. In Section 2.3, the search gradient of SNAS is connected to the policy gradient in reinforcement-learning-based NAS (Zoph & Le, 2016; Pham et al., 2018), interpreting SNAS's credit assignment with contribution analysis. Finally, in Section 2.4 we introduce how SNAS automates topology search to reduce the complexity of the child network, as well as how it decomposes this global constraint in the context of credit assignment.

2.1 SEARCH SPACE AND ARCHITECTURE SAMPLING

Searching for the structure of a cell that is later stacked as a building block for a deep architecture is an ad hoc solution to trade off search efficiency and result optimality (Zoph et al., 2017; Liu et al., 2017a; Real et al., 2018; Pham et al., 2018; Liu et al., 2019). As shown on the left of Figure 1, the search space, i.e. a cell, is represented using a directed acyclic graph (DAG), called the parent graph. Nodes x_i in this DAG represent latent representations, whose dimensions are omitted to avoid abuse of notation. In convolutional networks, they are feature maps. Edges (i, j) represent information flows and the possible operations O_{i,j} to be selected between two nodes x_i and x_j. To include the skip operation, nodes are enforced to be ordered, while edges only point from lower-indexed nodes to higher-indexed ones. Thus we have intermediate nodes

$$x_j = \sum_{i<j} Z_{i,j}^{T} O_{i,j}(x_i).$$

The objective of SNAS is the expectation of the generic loss over child graphs sampled from $p_\alpha(Z)$; written node-wise, it involves terms of the form

$$\mathbb{E}_{p_\alpha(Z)}\Big[\sum_{m>j} Z_{j,m}^{T} O_{j,m}\Big(\sum_{i<j} Z_{i,j}^{T} O_{i,j}(x_i)\Big)\Big]. \tag{18}$$

DARTS relaxed this objective to

$$\sum_{m>j} \mathbb{E}_{p_{\alpha_{j,m}}}\Big[Z_{j,m}^{T} O_{j,m}\Big(\sum_{i<j}\mathbb{E}_{p_{\alpha_{i,j}}}\big[Z_{i,j}^{T} O_{i,j}(x_i)\big]\Big)\Big]. \tag{19}$$

Considering that the O(x) are ReLU-Conv-BN stacks as in ENAS (Pham et al., 2018), which are non-linear, this transformation introduces unbounded bias (see the numerical sketch at the end of this section). Though the bias is not perceivable in training, where the complete graph, consistent with this loss, is used for accuracy validation, the derived graph is never validated during training. Hence the training is inconsistent with the true objective of maximizing the expected performance of derived architectures. After the architecture derivation introduced in DARTS, the performance falls enormously and the parameters need to be retrained.

Figure 6: A comparison of gradients in DARTS and SNAS. (a) Deterministic gradients in DARTS; (b) stochastic gradients in SNAS. Solid lines denote deterministic nodes, while dashed lines denote stochastic nodes. Black dotted lines denote compounded gradients, purple lines denote parameter gradients in SNAS, and red lines denote search gradients.

C GRADIENTS IN SNAS

Figure 6(b) gives an illustration of a base three-intermediate-node unit in SNAS, where each edge has three operations (indexed by k) to choose from. In the search space of SNAS, intermediate nodes take input from all previous nodes. We have h
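As a purely illustrative check of the inconsistency argued around Eq. (18) and Eq. (19), the toy computation below (our own construction, not from the paper) enumerates the expectation over a two-edge chain exactly. When the second edge contains a nonlinear operation, the expected output of sampled architectures differs from the DARTS-style value obtained by pushing the expectation inside each edge. All operation choices and probabilities here are made-up stand-ins.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Two edges in series, each with two candidate operations (toy stand-ins).
ops1 = [lambda x: 2.0 * x, lambda x: -2.0 * x]    # edge 1: linear candidates
ops2 = [relu, lambda h: h + 1.0]                  # edge 2: one nonlinear candidate

alpha1 = np.array([0.5, 0.5])   # p(Z_1): factorized distribution over edge-1 ops
alpha2 = np.array([0.9, 0.1])   # p(Z_2): factorized distribution over edge-2 ops
x = 1.0

# Eq. (18) flavour: expectation of the composed computation over sampled one-hot Z
snas_value = sum(alpha1[k] * alpha2[m] * ops2[m](ops1[k](x))
                 for k in range(2) for m in range(2))

# Eq. (19) flavour: expectations pushed inside each edge (DARTS-style relaxation)
h = sum(alpha1[k] * ops1[k](x) for k in range(2))        # analytic mixture on edge 1
darts_value = sum(alpha2[m] * ops2[m](h) for m in range(2))

print(f"expected output over sampled child graphs : {snas_value:.2f}")   # 1.00
print(f"DARTS-relaxed (analytic) output           : {darts_value:.2f}")  # 0.10
```

In this toy setting, the gap between the two numbers plays the role of the bias the text attributes to the nonlinearity of the ReLU-Conv-BN stacks: a parent graph optimized under the relaxed value is not evaluating the quantity on which the derived child graphs are ultimately judged.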