# XNAS: Neural Architecture Search with Expert Advice

Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong Jin, Lihi Zelnik-Manor
Machine Intelligence Technology, Alibaba Group
{niv.nayman,asaf.noy,tal.ridnik,itamar.friedman,jinrong.jr,lihi.zelnik}@alibaba-inc.com
*These authors contributed equally.*

## Abstract

This paper introduces a novel optimization method for differentiable neural architecture search, based on the theory of prediction with expert advice. Its optimization criterion is well suited to architecture selection, i.e., it minimizes the regret incurred by a sub-optimal selection of operations. Unlike previous search relaxations, which require hard pruning of architectures, our method is designed to dynamically wipe out inferior architectures and enhance superior ones. It achieves an optimal worst-case regret bound and suggests the use of multiple learning rates, based on the amount of information carried by the backward gradients. Experiments show that our algorithm achieves strong performance on several image classification datasets. Specifically, it obtains an error rate of 1.6% on CIFAR-10 and 23.9% on ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.

## 1 Introduction

In recent years tremendous efforts have been put into the manual design of high-performance neural networks [22, 16, 40, 39]. An emerging alternative is to replace the manual design with automated Neural Architecture Search (NAS). NAS excels at finding architectures which yield state-of-the-art results. Earlier NAS works were based on reinforcement learning [55, 56], sequential optimization [24], and evolutionary algorithms [33], and required immense computational resources, sometimes demanding years of GPU compute time to output an architecture. More recent NAS methods reduce the search time significantly, e.g. via weight sharing [30] or a continuous relaxation of the search space [25], making the search affordable and applicable to real problems.

While current NAS methods provide encouraging results, they still suffer from several shortcomings: a large number of hyper-parameters that are not easy to tune, hard pruning decisions that are performed sub-optimally, all at once at the end of the search, and a weak theoretical understanding. This cultivates skepticism and criticism of the utility of NAS in general. Some recent works even suggest that current search methods are only slightly better than random search, and further imply that some selection methods are not well principled and are essentially random [23, 35].

To provide a more principled method, we view NAS as an online selection task and rely on Prediction with Expert Advice (PEA) theory [4] for the selection. Our key contribution is the introduction of XNAS (eXperts Neural Architecture Search), an optimization method (Section 2.2) that is well suited for optimizing the inner architecture weights over a differentiable architecture search space (Section 2.1). We propose a setup in which the experts represent inner neural operations and connections, whose dominance is specified by architecture weights.

Our proposed method addresses the aforementioned shortcomings of current NAS methods.
For the mitigation of the hard pruning, we leverage the Exponentiated-Gradient (EG) algorithm [21], which favors sparse weight vectors to begin with, enhanced by a wipeout mechanism that dynamically prunes inferior experts during the search process. In addition, the algorithm requires fewer hyper-parameters to be tuned (Section 3.2.2), and the theory behind it further provides guidance for the choice of learning rates. In particular, the algorithm avoids decay of the architecture weights [12], which is shown to promote the selection of arbitrary architectures. XNAS features several additional desirable properties, such as an optimal worst-case regret bound (Section 3.1) and the assignment of different learning rates to different groups of experts. With an appropriate reward term, the algorithm is more robust to the initialization of the architecture weights and inherently enables the recovery of late bloomers, i.e., experts that become effective only after a warm-up period (Section 3.2.1). The wipeout mechanism spares every expert that still has a chance of being selected at the end of the process, allowing such experts to recover.

We compare XNAS to previous methods and demonstrate its properties and effectiveness on statistical and deterministic setups, as well as on 7 public datasets (Section 4). It achieves state-of-the-art performance on 3 of the datasets and the best results among NAS methods on the rest, with significant improvements. For example, XNAS reaches a 1.60% error rate on CIFAR-10, a relative improvement of more than 20% over existing NAS methods.

## 2 Proposed Approach

To lay out our approach, we first reformulate the differentiable architecture search space of DARTS [25] in a way that enables direct optimization over the architecture weights. We then propose a novel optimizer that views NAS as an online selection task and relies on PEA theory for the selection.

### 2.1 Neural Architecture Space

We start with a brief review of the PEA setting and then describe our view of the search space as separable PEA sub-spaces. This enables us to leverage PEA theory for NAS.

**PEA Settings.** PEA [4] refers to a sequential decision-making framework in which a decision maker, i.e. a forecaster, aims to predict an unknown outcome sequence $\{y_t\}_{t=1}^{T} \subseteq \mathcal{Y}$ while having access to the advice, i.e. predictions, of a set of $N$ experts. Denote the experts' predictions at time $t$ by $f_{t,1}, \dots, f_{t,N} \in \mathcal{D}$, where $\mathcal{D}$ is the decision space, assumed to be a convex subset of a vector space. Denote the forecaster's predictions by $\{p_t\}_{t=1}^{T} \subseteq \mathcal{D}$, and let $\ell: \mathcal{D} \times \mathcal{Y} \rightarrow \mathbb{R}$ be a non-negative loss function. At each time step $t = 1, \dots, T$, the forecaster observes $f_{t,1}, \dots, f_{t,N}$ and predicts $p_t$. The forecaster and the experts then suffer the losses $\ell_t(p_t) := \ell(p_t, y_t)$ and $\ell_t(f_{t,i}) := \ell(f_{t,i}, y_t)$, respectively.

**The Search Space Viewed as Separable PEA Sub-spaces.** We view the search space suggested by DARTS [25] as multiple separable sub-spaces of experts, as illustrated in Figure 1 and described next. An architecture is built from replications of normal and reduction cells, each represented as a directed acyclic graph. Every node $x^{(j)}$ in this super-graph represents a feature map, and each directed edge $(j, k)$ is associated with a forecaster that predicts a feature map $p^{(j,k)} := p^{(j,k)}(x^{(j)})$ given the input $x^{(j)}$. Intermediate nodes are computed based on all of their predecessors:

$$x^{(k)} = \sum_{j < k} p^{(j,k)}\big(x^{(j)}\big).$$
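To make the expert view concrete, the sketch below implements a single edge forecaster over $N$ candidate operations, combining a multiplicative (exponentiated-gradient) weight update with a simplified wipeout rule. It is an illustration only, not the paper's exact algorithm: the reward signal, the per-group learning rate `lr`, and the fixed threshold `wipeout_thresh` are hypothetical placeholders, whereas XNAS derives its rewards from the backward gradients and wipes out only experts that can no longer be selected by the end of the search.

```python
import numpy as np


class EGEdgeForecaster:
    """One edge of the cell graph: a weight per candidate operation ("expert").
    The forecaster outputs a convex combination of the experts' feature maps
    and updates the weights multiplicatively (exponentiated gradient)."""

    def __init__(self, num_experts, lr=0.1, wipeout_thresh=1e-3):
        self.w = np.full(num_experts, 1.0 / num_experts)   # uniform start
        self.lr = lr                                        # learning rate for this expert group
        self.wipeout_thresh = wipeout_thresh                # hypothetical fixed threshold
        self.alive = np.ones(num_experts, dtype=bool)       # experts not yet wiped out

    def predict(self, expert_outputs):
        """p^(j,k): weighted mixture of the surviving experts' outputs."""
        outputs = np.asarray(expert_outputs)                # shape (num_experts, ...)
        w = np.where(self.alive, self.w, 0.0)
        w = w / w.sum()
        return np.tensordot(w, outputs, axes=1)

    def update(self, rewards):
        """EG step: experts with larger reward (e.g. a signal derived from the
        backward gradients) gain weight exponentially."""
        rewards = np.asarray(rewards, dtype=float)
        self.w = self.w * np.exp(self.lr * rewards)
        self.w[~self.alive] = 0.0
        self.w = self.w / self.w.sum()
        # Simplified wipeout: drop experts whose weight fell below a fixed
        # threshold. (XNAS instead wipes out experts that have no chance of
        # being selected at the end of the search.)
        self.alive = self.alive & (self.w > self.wipeout_thresh)


# Toy usage: 8 candidate operations, each producing a 16x16 feature map.
edge = EGEdgeForecaster(num_experts=8)
feature_maps = np.random.randn(8, 16, 16)
mixed = edge.predict(feature_maps)         # forecaster's feature-map prediction
edge.update(np.random.randn(8))            # per-expert rewards for one step
```

The multiplicative form is what gives EG its bias toward sparse weight vectors: experts that are consistently rewarded quickly dominate the mixture, while the weights of inferior experts decay toward zero until the wipeout removes them.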