# DATA: Differentiable ArchiTecture Approximation

Jianlong Chang¹²³, Xinbang Zhang¹², Yiwen Guo⁴⁵, Gaofeng Meng¹, Shiming Xiang¹², Chunhong Pan¹

¹NLPR, Institute of Automation, Chinese Academy of Sciences
²School of Artificial Intelligence, University of Chinese Academy of Sciences
³Samsung Research China - Beijing, ⁴Intel Labs China, ⁵Bytedance AI Lab

{jianlong.chang, xinbang.zhang, gfmeng, smxiang, chpan}@nlpr.ia.ac.cn, guoyiwen.ai@bytedance.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

## Abstract

Neural architecture search (NAS) is inherently subject to the gap between architectures during searching and validating. To bridge this gap, we develop Differentiable ArchiTecture Approximation (DATA) with an Ensemble Gumbel-Softmax (EGS) estimator to automatically approximate architectures during searching and validating in a differentiable manner. Technically, the EGS estimator consists of a group of Gumbel-Softmax estimators, which is capable of converting probability vectors to binary codes and passing gradients from binary codes to probability vectors. Benefiting from such modeling, during searching, architecture parameters and network weights in the NAS model can be jointly optimized with standard back-propagation, yielding an end-to-end learning mechanism for searching deep models in a large enough search space. Consequently, during validating, a high-performance architecture that closely approaches the one learned during searching can be readily built. Extensive experiments on a variety of popular datasets demonstrate that our method is capable of discovering high-performance architectures for image classification, language modeling and semantic segmentation, while guaranteeing the requisite efficiency during searching.

## 1 Introduction

In the era of deep learning, how to design proper network architectures for specific problems is a crucial but challenging task, and designing architectures with state-of-the-art performance typically requires substantial effort from human experts. In order to eliminate such exhausting engineering, many neural architecture search (NAS) methods have been devoted to accomplishing the task automatically [14, 27, 55], namely evolution-based NAS [13, 18, 26, 41, 43, 44, 45, 47], reinforcement-learning-based NAS [2, 3, 21, 42, 56, 59, 60], and gradient-based NAS [11, 34, 35, 46, 53]. These methods have achieved significant successes in a multitude of fields, including image classification [4, 12, 21, 30, 31, 34, 44, 53, 60], semantic segmentation [8, 32] and object detection [9, 15, 50, 52, 60].

Although the achievements in the literature are brilliant, these methods still struggle to effectively bridge the gap between architectures during searching and validating. That is, feasible paths in a learned architecture depend on each other and become deeply coupled during searching. In validating, however, the architectures inherited from searching crudely decouple these dependent paths; for example, DARTS [34] and SNAS [53] keep only one path in validating. As a result, the effectiveness of the searched architectures is unclear, even though they may surpass random ones.

To eliminate this limitation, Differentiable ArchiTecture Approximation (DATA) is proposed to elegantly minimize the gap between architectures during searching and validating. For this purpose, we develop the Ensemble Gumbel-Softmax (EGS) estimator, an ensemble of a group of Gumbel-Softmax estimators, which is able to sample an architecture that approaches the one used during searching as closely as possible, while maintaining the differentiability of a promising NAS pipeline for requisite efficiency. That is, our EGS estimator suffices not only to decouple the relationship between different paths in learned architectures but also to pass gradients seamlessly, yielding an end-to-end mechanism for searching deep models in a large enough search space.
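To make this mechanism concrete, below is a minimal JAX sketch of a straight-through Gumbel-Softmax estimator that converts a logit (probability) vector into a hard one-hot code while letting gradients flow through the soft relaxation, together with a hypothetical ensemble that merges several such samples into a multi-hot binary code. The function names, the OR-style aggregation, and the temperature are illustrative assumptions, not the paper's exact EGS definition.

```python
import jax
import jax.numpy as jnp

def gumbel_softmax_st(key, logits, tau=1.0):
    """Single Gumbel-Softmax estimator with a straight-through trick:
    the forward value is a hard one-hot code, while the gradient is that
    of the soft relaxation, so it can flow back to the logits."""
    gumbel = jax.random.gumbel(key, logits.shape)
    soft = jax.nn.softmax((logits + gumbel) / tau)
    hard = jax.nn.one_hot(jnp.argmax(soft), logits.shape[-1])
    return hard + soft - jax.lax.stop_gradient(soft)

def ensemble_gumbel_softmax(key, logits, num_estimators=3, tau=1.0):
    """Hypothetical ensemble: draw several one-hot samples and combine them
    with a soft OR, yielding a multi-hot {0,1} code over the K candidates."""
    keys = jax.random.split(key, num_estimators)
    samples = jnp.stack([gumbel_softmax_st(k, logits, tau) for k in keys])
    return 1.0 - jnp.prod(1.0 - samples, axis=0)

# Toy usage: differentiate a loss defined on the binary code w.r.t. the logits.
logits = jnp.array([0.5, 1.5, -0.5, 0.0])   # K = 4 candidate operations
code = ensemble_gumbel_softmax(jax.random.PRNGKey(0), logits)
grad = jax.grad(lambda a: ensemble_gumbel_softmax(jax.random.PRNGKey(0), a).sum())(logits)
```

The straight-through return value behaves like the hard code in the forward pass but like the soft relaxation under differentiation, which is what allows architecture logits and network weights to be updated jointly by back-propagation.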
To sum up, the main contributions of this work are:

- By generalizing the Gumbel-Softmax estimator, we develop the EGS estimator, which makes structural decisions in the spirit of the policy gradient in reinforcement learning, but with higher efficiency.
- With the EGS estimator, the DATA model can seamlessly bridge the gap between architectures during searching and validating, and can be learned with standard back-propagation, yielding an end-to-end mechanism for searching deep models in a large enough search space.
- Extensive experiments strongly demonstrate that our DATA model consistently outperforms current NAS models in searching high-performance convolutional and recurrent architectures for image classification, semantic segmentation, and language modeling.

## 2 Differentiable architecture search

Before introducing our approach, we first briefly review NAS. Without loss of generality, the architecture search space $\mathcal{A}$ can be naturally represented by directed acyclic graphs (DAGs), each consisting of an ordered sequence of nodes. A specific architecture corresponds to a graph $\alpha \in \mathcal{A}$ and is represented as $\mathcal{N}(\alpha, w)$ with network weights $w$. Intrinsically, the goal of NAS is to find a graph $\alpha^{*} \in \mathcal{A}$ that minimizes the validation loss $\mathcal{L}_{\mathrm{val}}(\mathcal{N}(\alpha^{*}, w^{*}))$, where the network weights $w^{*}$ associated with the architecture $\alpha^{*}$ are obtained by minimizing the training loss $w^{*} = \arg\min_{w} \mathcal{L}_{\mathrm{train}}(\mathcal{N}(\alpha^{*}, w))$, i.e.,

$$
\min_{\alpha \in \mathcal{A}} \; \mathcal{L}_{\mathrm{val}}(\mathcal{N}(\alpha, w^{*})), \quad \text{s.t.} \quad w^{*} = \arg\min_{w} \mathcal{L}_{\mathrm{train}}(\mathcal{N}(\alpha, w)). \tag{1}
$$

This implies that the essence of NAS is to solve a bi-level optimization problem, which is hard to optimize because of the nested relationship between the architecture parameters $\alpha$ and the network weights $w$. To handle this issue, we parameterize architectures with binary codes and devote ourselves to jointly learning architectures and network weights in a differentiable way.

### 2.1 Parameterizing architectures with binary codes

For simplicity, we denote all DAGs with $n$ ordered nodes as $\mathcal{A} = \{e^{(i,j)} \mid 1 \le i < j \le n\}$, where $e^{(i,j)}$ indicates a directed edge from the $i$-th node to the $j$-th node. Corresponding to each directed edge $e^{(i,j)}$, there is a set of candidate primitive operations $\mathcal{O} = \{o_1, \dots, o_K\}$, such as convolution, pooling, identity, and zero. With these operations, the output at the $j$-th node can be formulated as a combination of the candidate operations applied along its incoming edges, gated by the binary codes that parameterize the architecture.
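As a rough picture of how such binary codes could gate a cell's computation, the sketch below assumes a DARTS-style formulation in which the $j$-th node sums every candidate operation applied to every predecessor node, weighted by the corresponding entry of the edge's {0,1} code; the helper names and the toy operations are assumptions for illustration, not the paper's exact formulation.

```python
import jax.numpy as jnp

def node_output(predecessors, edge_codes, candidate_ops):
    """Assumed DARTS-style node update: sum each candidate operation o_k
    applied to each predecessor x_i, gated by the binary code z^{(i,j)}_k
    of the edge e^(i,j) (as sampled, e.g., by an EGS-like estimator)."""
    out = jnp.zeros_like(predecessors[0])
    for x, code in zip(predecessors, edge_codes):
        for z_k, op in zip(code, candidate_ops):
            out = out + z_k * op(x)
    return out

# Toy usage with stand-in "primitive operations" (identity, zero, scaling).
ops = [lambda x: x, lambda x: jnp.zeros_like(x), lambda x: 2.0 * x]
xs = [jnp.ones(3), jnp.arange(3.0)]
codes = [jnp.array([1.0, 0.0, 0.0]),   # edge from node 1: keep the identity path
         jnp.array([0.0, 0.0, 1.0])]   # edge from node 2: keep the scaling path
print(node_output(xs, codes, ops))     # -> x_1 + 2 * x_2 = [1. 3. 5.]
```

Because the codes are {0,1}-valued in the forward pass, a node built this way during validating uses exactly the operations selected during searching, which is the intuition behind reducing the gap between the two phases.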