# Stronger NAS with Weaker Predictors

Junru Wu¹, Xiyang Dai², Dongdong Chen², Yinpeng Chen², Mengchen Liu², Ye Yu², Zhangyang Wang³, Zicheng Liu², Mei Chen², Lu Yuan²
¹Texas A&M University, ²Microsoft Corporation, ³University of Texas at Austin
sandboxmaster@tamu.edu, {xidai,dochen,yilche,mengcliu}@microsoft.com, atlaswang@utexas.edu, {yeyu1,zliu,mei.chen,luyuan}@microsoft.com

Neural Architecture Search (NAS) often trains and evaluates a large number of architectures. Recent predictor-based NAS approaches attempt to alleviate such heavy computation costs with two key steps: sampling some architecture-performance pairs and fitting a proxy accuracy predictor. Given limited samples, these predictors, however, are far from accurate enough to locate top architectures due to the difficulty of fitting the huge search space. This paper reflects on a simple yet crucial question: if our final goal is to find the best architecture, do we really need to model the whole space well? We propose a paradigm shift from fitting the whole architecture space using one strong predictor, to progressively fitting a search path towards the high-performance sub-space through a set of weaker predictors. As a key property of the weak predictors, their probabilities of sampling better architectures keep increasing. Hence we only sample a few well-performed architectures guided by the previously learned predictor and estimate a new, better weak predictor. This embarrassingly easy framework, dubbed WeakNAS, produces coarse-to-fine iterations that gradually refine the ranking of the sampling space. Extensive experiments demonstrate that WeakNAS costs fewer samples to find top-performance architectures on NAS-Bench-101 and NAS-Bench-201. Compared to state-of-the-art (SOTA) predictor-based NAS methods, WeakNAS outperforms all of them with notable margins, e.g., requiring at least 7.5× fewer samples to find the global optimum on NAS-Bench-101. WeakNAS can also absorb their ideas to further boost performance. Further, WeakNAS sets a new SOTA result of 81.3% top-1 accuracy in the ImageNet MobileNet search space. The code is available at: https://github.com/VITA-Group/WeakNAS.

## 1 Introduction

Figure 1: Comparison between our method using a set of weak predictors (iterative sampling) and a single strong predictor (random sampling) on NAS-Bench-201. For fair comparison, the NAS predictor in both methods adopts the same type of MLP described in Section 2.4. Solid lines and shadows denote the mean and standard deviation (std), respectively.

Neural Architecture Search (NAS) [1–12] methods aim to find the best network architecture by exploring the architecture-to-performance manifold, using reinforcement-learning-based [13], evolution-based [14, 15], or gradient-based [1, 16] approaches. In order to cover the entire search space, they often train and evaluate a large number of architectures, leading to tremendous computation cost. Recently, predictor-based NAS methods alleviate this problem with two key steps: a sampling step to sample some architecture-performance pairs, and a performance modeling step to fit the performance distribution by training a proxy accuracy predictor. An in-depth analysis of existing methods [2] found that most of those methods [5, 6, 17, 7–9, 18] consider these two steps independently and attempt to model the performance distribution over the whole architecture space using a strong¹ predictor.
Figure 2: An illustration of WeakNAS's progressive approximation. Previous predictor-based NAS uniformly samples the whole search space to fit a strong predictor. Instead, our method progressively shrinks the sample space based on predictions from previous weak predictors, and updates new weak predictors towards the subspace of better architectures, hence focusing on fitting the search path.

However, since the architecture space is often exponentially large and highly non-convex, even a very strong predictor model has difficulty fitting the whole space given limited samples. Meanwhile, different types of predictors often demand hand-crafted design of the architecture representations to improve their performance.

This paper reflects on a fundamental question for predictor-based NAS: if our final goal is to find the best architecture, do we really need to model the whole space well? We investigate the alternative of utilizing a few weak¹ predictors to fit small local spaces, and of progressively moving the search space towards the subspace where good architectures reside. Intuitively, we assume the whole space can be divided into different sub-spaces, some of which are relatively good while some are relatively bad. We tend to choose the good ones while discarding the bad ones, ensuring that more samples are focused on modeling only the good subspaces, where the best architecture can then be found. This greatly simplifies the learning task of each predictor. Eventually, a line of progressively evolving weak predictors connects a path to the best architecture.

We present a novel, general framework that only requires estimating a series of weak predictors progressively along the search path; we denote it as WeakNAS in the rest of the paper. To ensure moving towards the best architecture along the path, at each iteration, the sampling probability of better architectures keeps increasing through the guidance of the previous weak predictor. Then, the next weak predictor is trained with these better samples in the following iteration. We iterate until we arrive at an embedding subspace where the best architectures reside and can be accurately assessed by the final weak predictor.

Compared to existing predictor-based NAS, our proposal represents a new line of attack and has several merits. First, since only weak predictors are required, it yields better sample efficiency. As shown in Figure 1, it costs significantly fewer samples to find the top-performance architecture than using one strong predictor, and yields much lower variance in performance over multiple runs. Second, it is flexible to the choices of architecture representation (e.g., different architecture embeddings) and predictor formulation (e.g., multilayer perceptron (MLP), gradient boosting regression tree, or random forest). Experiments show our framework performs well in all their combinations. Third, it is highly generalizable to other open search spaces; e.g., given a limited sample budget, we achieve state-of-the-art ImageNet performance on the NASNet and MobileNet search spaces. Detailed comparison with state-of-the-art predictor-based NAS [19–21, 8] is presented in Section 4.

¹ Strong vs. weak predictors: we call a predictor weak if it only fits a local subspace of the search space and can thus be associated with our iterative sampling scheme; such predictors usually do not demand heavily parameterized models. On the contrary, strong predictors fit the global search space and are often associated with uniform sampling. The terminology of strong versus weak predictors does not refer to their number of parameters or the type of NAS predictor used: an over-parameterized NAS predictor used with our iterative sampling scheme may still be considered a weak predictor.
## 2 Our Framework

### 2.1 Reformulating Predictor-based NAS as Bi-Level Optimization

Given a search space of network architectures $\mathcal{X}$ and an architecture-to-performance mapping function $f : \mathcal{X} \to \mathcal{P}$ from the architecture set $\mathcal{X}$ to the performance set $\mathcal{P}$, the objective is to find the best neural architecture $x^*$ with the highest performance $f(x)$ in the search space $\mathcal{X}$:

$$x^* = \arg\max_{x \in \mathcal{X}} f(x) \tag{1}$$

A naïve solution is to estimate the performance mapping $f(x)$ over the full search space. However, this is prohibitively expensive since all architectures have to be exhaustively trained from scratch. To address this problem, predictor-based NAS learns a proxy predictor $\tilde{f}(x)$ to approximate $f(x)$ using some architecture-performance pairs, which significantly reduces the training cost. In general, predictor-based NAS can be re-cast as a bi-level optimization problem:

$$x^* = \arg\max_{x \in \mathcal{X}} \tilde{f}(x \mid S), \quad \text{s.t.} \quad \tilde{f} = \arg\min_{S \in \mathcal{S},\, \tilde{f} \in \tilde{\mathcal{F}}} \sum_{s \in S} \mathcal{L}\big(\tilde{f}(s), f(s)\big) \tag{2}$$

where $\mathcal{L}$ is the loss function for the predictor $\tilde{f}$, $\tilde{\mathcal{F}}$ is the set of all possible approximations to $f$, and $\mathcal{S} := \{S \subseteq \mathcal{X} \mid |S| \le C\}$ contains all sample sets satisfying the sampling budget $C$. $C$ is directly related to the total training cost, e.g., the total number of queries. Our objective is to minimize the loss $\mathcal{L}$ based on some sampled architectures $S$.

Previous predictor-based NAS methods attempt to solve Equation 2 with two sequential steps: (1) sampling some architecture-performance pairs and (2) learning a proxy accuracy predictor. For the first step, a common practice is to sample training pairs $S$ uniformly from the search space $\mathcal{X}$ to fit the predictor. Such sampling is, however, inefficient, considering that the goal of NAS is only to find well-performed architectures without caring about the bad ones.

### 2.2 Progressive Weak Predictors Emerge Naturally as a Solution to the Optimization

Optimization Insight: Instead of first (uniformly) sampling the whole space and then fitting the predictor, we propose to jointly evolve the sampling $S$ and fit the predictor $\tilde{f}$, which helps achieve better sample efficiency by focusing on only the relevant sample subspaces. This can be mathematically formulated as solving Equation 2 in a coordinate descent manner, iterating between the architecture sampling and predictor fitting subproblems:

$$\text{(Sampling)} \quad \tilde{P}_k = \{\tilde{f}_k(s) \mid s \in \mathcal{X} \setminus S_k\}, \quad S_M \sim \mathrm{Top}N(\tilde{P}_k), \quad S_{k+1} = S_M \cup S_k, \tag{3}$$

where $\mathrm{Top}N(\tilde{P}_k)$ denotes the set of top-$N$ architectures in $\tilde{P}_k$;

$$\text{(Predictor Fitting)} \quad x^* = \arg\max_{x \in \mathcal{X}} \tilde{f}(x \mid S_{k+1}), \quad \text{s.t.} \quad \tilde{f}_{k+1} = \arg\min_{\tilde{f} \in \tilde{\mathcal{F}}} \sum_{s \in S_{k+1}} \mathcal{L}\big(\tilde{f}(s), f(s)\big) \tag{4}$$

In comparison, existing predictor-based NAS methods can be viewed as running the above coordinate descent for just one iteration, a special case of our general framework. As is well known in optimization, many iterative algorithms only need to solve (subsets of) their subproblems inexactly [22–24] to properly ensure convergence, either theoretically or empirically. Here, using a strong predictor to fit the whole space can be treated as solving the predictor fitting subproblem relatively precisely, while adopting a weak predictor solves it only imprecisely. Previous methods solving Equation 2 truncate their solutions to "one shot" and hinge on solving the subproblem with higher precision.
Since we now take a joint optimization view and allow for multiple iterations, we can afford to use only weaker predictors for the fitting subproblem in each iteration.

Implementation Outline: The above coordinate descent solution has a clear interpretation and is straightforward to implement. Suppose our iterative method has $K$ iterations. We initialize $S_1$ by randomly sampling a few architectures from $\mathcal{X}$ and train an initial predictor $\tilde{f}_1$. Then, at iterations $k = 2, \dots, K$, we jointly optimize the sampling set $S_k$ and the predictor $\tilde{f}_k$ in an alternating manner.

Subproblem 1: Architecture Sampling. At iteration $k+1$, we first sort all architectures² in the search space $\mathcal{X}$ (excluding all the samples already in $S_k$) according to their predicted performance $\tilde{P}_k$ at every iteration $k$. We then randomly sample $M$ new architectures from the top-$N$ ranked architectures in $\tilde{P}_k$. Note this step both reduces the sample budget and controls the exploitation-exploration trade-off (see Section 3.1). The newly sampled architectures together with $S_k$ become $S_{k+1}$.

Subproblem 2: (Weak) Predictor Fitting. We learn a predictor $\tilde{f}_{k+1}$ by minimizing the loss $\mathcal{L}$ of $\tilde{f}_{k+1}$ on the sampled architectures $S_{k+1}$. We then evaluate architectures using the learned predictor $\tilde{f}_{k+1}$ to get the predicted performance $\tilde{P}_{k+1}$.

² The only exception is the open-domain experiments in Section 3.2: we sub-sample all architectures in the search space before sorting. More details can be found in Appendix Section H.

Figure 3: Visualization of the search dynamics in the NAS-Bench-201 search space (best viewed in color). (a) The trajectory of the predicted best architecture and the global optimum throughout 5 iterations; (b) error empirical distribution function (EDF) of the predicted top-200 architectures throughout 5 iterations; (c) triangle marker: probability of sampling top-50 architectures throughout 5 iterations; star marker: Kendall's Tau ranking of the NAS predictor on the top-50 architectures throughout 5 iterations.

Figure 4: Visualization of the search dynamics in the NAS-Bench-201 search space via t-SNE. At the $i$-th iteration, we randomly sample $M = 40$ new architectures from the top $N = 400$ ranked architectures in $\tilde{P}_k$. The top row, (a)-(d), shows the sampling space $\mathrm{Top}N(\tilde{P}_k)$, and the bottom row, (e)-(h), shows the sampled architectures $S_k$. The performance ranking of architectures is encoded by color, and non-sampled architectures are colored in grey.

As illustrated in Figure 2, through alternating iterations, we progressively evolve weak predictors to focus on sampling along the search path, thus simplifying the learning workload of each predictor. With these coarse-to-fine iterations, the predictor $\tilde{f}_k$ guides the sampling process to gradually zoom into the promising architecture samples. In addition, the promising samples $S_{k+1}$ in turn improve the performance of the updated predictor $\tilde{f}_{k+1}$ among the well-performed architectures, hence the ranking of the sampling space is also refined gradually. In other words, the solution quality of subproblem 2 gradually increases as a natural consequence of the guided zoom-in. To derive the final architecture, we simply choose the best architecture predicted by the final weak predictor. This idea is related to classical ensembling [25], yet represents a new regime for NAS.
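To make the above outline concrete, the following is a minimal sketch of the alternation in Equations 3 and 4 (not the released implementation); `fit_predictor` and `query_accuracy` are hypothetical stand-ins for fitting a weak predictor on architecture-accuracy pairs and for querying the benchmarked accuracy of an architecture, respectively.

```python
import random

def weak_nas(search_space, query_accuracy, fit_predictor,
             num_init=100, M=10, N=100, K=10):
    """Minimal sketch of the WeakNAS loop (Eqs. 3-4); not the official code.

    search_space:   list of architecture encodings (e.g., one-hot vectors)
    query_accuracy: oracle returning the true accuracy of one architecture
    fit_predictor:  fits a weak predictor on (arch, acc) pairs and returns a
                    callable mapping an architecture to a predicted accuracy
    """
    # Initialize S_1 with a few random architectures and their true accuracies.
    S = random.sample(search_space, num_init)
    labels = {tuple(a): query_accuracy(a) for a in S}

    for _ in range(K - 1):
        # Subproblem 2: fit a weak predictor on all samples gathered so far.
        predictor = fit_predictor([(a, labels[tuple(a)]) for a in S])

        # Subproblem 1: rank the remaining space by predicted accuracy ...
        remaining = [a for a in search_space if tuple(a) not in labels]
        ranked = sorted(remaining, key=predictor, reverse=True)

        # ... then uniformly sample M new architectures from the top-N,
        # query their true accuracies, and grow the sample set S_{k+1}.
        new_samples = random.sample(ranked[:N], M)
        for a in new_samples:
            labels[tuple(a)] = query_accuracy(a)
        S = S + new_samples

    # Final weak predictor: return its predicted-best architecture.
    predictor = fit_predictor([(a, labels[tuple(a)]) for a in S])
    return max(search_space, key=predictor)
```

In the benchmark experiments, each call to `query_accuracy` counts as one query towards the sample budget $C$, so the total cost of this sketch is `num_init + (K - 1) * M` queries.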
Proof-of-Concept Experiment. Figure 3 (a) shows the progressive procedure of finding the optimal architecture $x^*$ and learning the predicted best architecture $\tilde{x}^*_k$ over 5 iterations. As we can see from Figure 3 (a), the optimal architecture and the predicted best one move closer and closer to each other, which indicates that the predictor's accuracy around the optimal architecture(s) keeps improving. In Figure 3 (b), we use the error empirical distribution function (EDF) [26] to visualize the performance distribution of architectures in the subspace. We plot the EDF of the top-200 models based on the predicted performance over 5 iterations. As shown, the subspace of top-performing architectures consistently evolves towards more promising architecture samples over the 5 iterations. Then, in Figure 3 (c), we validate that the probability of sampling better architectures within the top-$N$ predictions keeps increasing. Based on this property, we can sample just a few well-performing architectures guided by the predictive model to estimate another, better weak predictor. The same plot also suggests that the NAS predictor's ranking among the top-performing models is gradually refined, since more and more architectures in the top region are sampled.

In Figure 4, we also show the t-SNE visualization of the search dynamics in the NAS-Bench-201 search space. We can observe that: (1) the NAS-Bench-201 search space is highly structured; (2) the sampling space $\mathrm{Top}N(\tilde{P}_k)$ and the sampled architectures $S_k$ both consistently evolve towards more promising regions, as can be noticed from the increasingly warmer color trend.

### 2.3 Relationship to Bayesian Optimization: A Simplification and Why It Works

Our method can alternatively be regarded as a vastly simplified variant of Bayesian Optimization (BO). It does not rely on any explicit uncertainty-based modeling such as Gaussian Processes (which are often difficult to scale up); instead, it adopts a very simple step function as its acquisition function. For a sample $x$ in the search space $\mathcal{X}$, our special "acquisition function" can be written as:

$$\mathrm{acq}(x) = u(x \ge \theta) \cdot \epsilon \tag{5}$$

where the step function $u(x \ge \theta)$ is 1 if $x \ge \theta$ and 0 otherwise; $\epsilon$ is a random variable drawn from the uniform distribution $U(0, 1)$; and $\theta$ is the threshold that splits the top-$N$ architectures from the rest according to their predicted performance $\tilde{P}_k(x)$. We then choose the samples with the $M$ largest acquisition values:

$$S_M = \operatorname*{arg\,TopM}_{x} \ \mathrm{acq}(x) \tag{6}$$

Why can such an "oversimplified BO" be effective in our framework? We consider the reason to be the inherently structured NAS search space. Specifically, existing NAS spaces are created either by varying operators from a pre-defined operator set (the DARTS/NAS-Bench-101/201 search spaces) or by varying kernel size, width, or depth (the MobileNet search space). Therefore, as shown in Figure 4, the search spaces are often highly structured, and the best performers gather close to each other. This is our underlying prior assumption: we can progressively connect a piecewise search path from the initialization to the finest subspace where the best architecture resides. At the beginning, since the weak predictor only roughly fits the whole space, the sampling operation will be "noisier", inducing more exploration. At a later stage, the weak predictors fit the current well-performing clusters better, thus performing more exploitation locally. Therefore, our progressive weak predictor framework provides a natural evolution between exploration and exploitation, without explicit uncertainty modeling, thanks to the prior of the special NAS space structure.
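For concreteness, here is a minimal NumPy sketch of the step-function acquisition in Equations 5 and 6 (function and argument names are hypothetical). It makes explicit that picking the $M$ largest acquisition values is equivalent to uniformly sampling $M$ architectures from the top-$N$ predictions.

```python
import numpy as np

def step_acquisition_select(pred_scores, N, M, rng=np.random.default_rng(0)):
    """Sketch of the step-function 'acquisition' in Eqs. (5)-(6).

    pred_scores: predicted performance of every not-yet-sampled architecture.
    Returns the indices of the M selected architectures.
    """
    pred_scores = np.asarray(pred_scores, dtype=float)

    # theta: threshold separating the top-N predictions from the rest.
    theta = np.sort(pred_scores)[-N]

    # acq(x) = u(x >= theta) * eps with eps ~ U(0, 1): candidates below the
    # threshold get acquisition 0, candidates at or above it get a random
    # value, so the M largest acquisitions are a uniform draw from the top-N.
    eps = rng.uniform(0.0, 1.0, size=len(pred_scores))
    acq = (pred_scores >= theta).astype(float) * eps

    return np.argsort(acq)[-M:]
```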
Another exploration-exploitation trade-off is implicitly built into the adaptive sampling step of our solution to subproblem 1. To recall, at each iteration, instead of choosing all top-$N$ models ranked by the latest predictor, we randomly sample $M$ models from the top-$N$ to explore new architectures in a stochastic manner. By varying the ratio $\epsilon = M/N$ and the sampling strategy (e.g., uniform, linear-decay, or exponential-decay), we can balance exploitation and exploration per step, in a similar flavor to the $\epsilon$-greedy [27] approach in reinforcement learning.

### 2.4 Our Framework is General to Predictor Models and Architecture Representations

Our framework is designed to be generalizable to various predictors and features. In predictor-based NAS, the objective of fitting the predictor $\tilde{f}$ is often cast as a regression [7] or ranking [5] problem. The choice of predictors is diverse and usually critical to the final performance [5, 6, 2, 7–9]. To illustrate that our framework is generalizable and robust to the specific choice of predictor, we compare the following predictor variants.

Figure 5: Evaluation of robustness across different predictors on NAS-Bench-201: (a) CIFAR10, (b) CIFAR100, (c) ImageNet16-120. Solid lines and shaded regions denote the mean and std, respectively.

Multilayer perceptron (MLP): MLP is a common baseline in predictor-based NAS [5] due to its simplicity. For our weak predictor, we use a 4-layer MLP with hidden layer dimensions of (1000, 1000, 1000, 1000).

Regression Tree: tree-based methods are also popular [9, 28] since they are suitable for categorical architecture representations. As our weak predictor, we use Gradient Boosting Regression Trees (GBRT) based on XGBoost [29], consisting of 1000 trees.

Random Forest: random forests differ from GBRT in that they combine decisions only at the end rather than along the hierarchy, and are often more robust to noise. For each weak predictor, we use a random forest consisting of 1000 trees.

The feature representations used to encode the architectures are also instrumental. Previous methods hand-craft various features for the best performance, e.g., raw architecture encoding [6], supernet statistics [30], and graph convolutional network encoding [7, 5, 8, 19]. Our framework is also agnostic to various architecture representations, and we compare the following:

One-hot vector: In NAS-Bench-201 [31], the DARTS-style search space has fixed graph connectivity, hence a one-hot vector is commonly used to encode the choice of operator.

Adjacency matrix: In NAS-Bench-101, we use the same encoding scheme as in [32, 6], where a 7 × 7 adjacency matrix represents the graph connectivity and a 7-dimensional vector represents the choice of operator on every node.

As shown in Figure 5, all predictor models perform similarly across different datasets. Comparing performance on NAS-Bench-101 and NAS-Bench-201, although they use different architecture encoding methods, our method still performs similarly well across different predictors. This demonstrates that our framework is robust to various predictor and feature choices.
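As an illustration only, the three predictor families above can be instantiated with off-the-shelf scikit-learn/XGBoost regressors, and a NAS-Bench-201-style one-hot encoding can be built as below. These stand-ins are assumptions for the sketch and are not claimed to match the hyperparameters or training procedure of the released implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def make_weak_predictor(kind):
    """Instantiate one of the three weak-predictor families discussed above."""
    if kind == "mlp":    # 4-layer MLP with hidden sizes (1000, 1000, 1000, 1000)
        return MLPRegressor(hidden_layer_sizes=(1000, 1000, 1000, 1000))
    if kind == "gbrt":   # gradient boosting regression trees (XGBoost), 1000 trees
        return XGBRegressor(n_estimators=1000)
    if kind == "rf":     # random forest, 1000 trees
        return RandomForestRegressor(n_estimators=1000)
    raise ValueError(f"unknown predictor kind: {kind}")

# NAS-Bench-201-style encoding: a one-hot choice among 5 operators for each
# of the 6 edges of the fixed cell DAG, flattened to a 30-dim vector.
NUM_EDGES, NUM_OPS = 6, 5

def encode_onehot(op_indices):
    """op_indices: list of 6 integers in [0, 5), one operator index per edge."""
    x = np.zeros((NUM_EDGES, NUM_OPS))
    x[np.arange(NUM_EDGES), op_indices] = 1.0
    return x.reshape(-1)
```

All three regressors expose the same `fit(X, y)` / `predict(X)` interface, so they can be swapped into the iterative loop sketched earlier without further changes.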
## 3 Experiments

Setup: For all experiments, we use an Intel Xeon E5-2650v4 CPU and a single Tesla P100 GPU, and we use the multilayer perceptron (MLP) as our default NAS predictor unless otherwise specified.

NAS-Bench-101 [32] provides a Directed Acyclic Graph (DAG) based cell structure. The connectivity of the DAG can be arbitrary, with a maximum of 7 nodes and 9 edges. Each node on the DAG can choose an operator from 1×1 convolution, 3×3 convolution, or 3×3 max-pooling. After removing duplicates, the dataset consists of 423,624 diverse architectures trained on CIFAR10 [33].

NAS-Bench-201 [31] is a more recent benchmark with a reduced, DARTS-like search space. The DAG of each cell is fixed, and one can choose from 5 different operations (1×1 convolution, 3×3 convolution, 3×3 avg-pooling, skip, no connection) on each of the 6 edges, totaling 15,625 architectures. It is trained on 3 different datasets: CIFAR10, CIFAR100, and ImageNet16-120 [34]. For experiments on both NAS-Benches, we follow the same setting as [8].

Open Domain Search: We follow the same NASNet search space used in [35] and MobileNet search space used in [36] to directly search for the best architectures on ImageNet [37]. Due to the huge computational cost of evaluating sampled architectures on ImageNet, we leverage a weight-sharing supernet approach. On the NASNet search space, we use the Single-Path One-Shot [38] approach to train our supernet, while on the MobileNet search space we reuse the pre-trained supernet from OFA [36]. We then use the supernet accuracy as the performance proxy to train weak predictors. We clarify that, despite using a supernet, our method is more accurate than existing differentiable weight-sharing methods while requiring fewer samples than evolution-based weight-sharing methods, as shown in Tables 6 and 7. We adopt PyTorch and the PyTorch Image Models library (timm) [39] to implement our models, and conduct all ImageNet experiments using 8 Tesla V100 GPUs. For the derived architectures, we follow training-from-scratch strategies similar to those used in LaNAS [21].

### 3.1 Ablation Studies

We conduct a series of ablation studies on the effectiveness of the proposed method on NAS-Bench-101. To validate the effectiveness of our iterative scheme, in Table 1 we initialize the first weak predictor $\tilde{f}_1$ with 100 random samples and set $M = 10$; after progressively adding more weak predictors (from 1 to 191), we find that the performance keeps growing. This demonstrates the key property of our method that the probability of sampling better architectures keeps increasing as more iterations proceed. It is worth noting that the quality of the random initial samples $M_0$ may also impact the performance of WeakNAS, but if $|M_0|$ is sufficiently large, the chance of hitting good samples (or their neighborhood) is high; empirically we found $|M_0| = 100$ already ensures highly stable performance on NAS-Bench-101. A more detailed ablation can be found in the Appendix.

| Sampling | #Predictor | #Queries | Test Acc. (%) | SD (%) | Test Regret (%) | Avg. Rank |
|---|---|---|---|---|---|---|
| Uniform | 1 Strong Predictor | 2000 | 93.92 | 0.08 | 0.40 | 135.0 |
| Iterative | 1 Weak Predictor | 100 | 93.42 | 0.37 | 0.90 | 6652.1 |
| Iterative | 11 Weak Predictors | 200 | 94.18 | 0.14 | 0.14 | 5.6 |
| Iterative | 91 Weak Predictors | 1000 | 94.25 | 0.04 | 0.07 | 1.7 |
| Iterative | 191 Weak Predictors | 2000 | 94.26 | 0.04 | 0.06 | 1.6 |
| Optimal | - | - | 94.32 | - | 0.00 | 1 |

Table 1: Ablation on the effectiveness of our iterative scheme on NAS-Bench-101.

| Sampling (M from Top N) | M | Top N | #Queries | Test Acc. (%) | SD (%) | Test Regret (%) | Avg. Rank |
|---|---|---|---|---|---|---|---|
| Exponential-decay | 10 | 100 | 1000 | 93.96 | 0.10 | 0.36 | 85.0 |
| Linear-decay | 10 | 100 | 1000 | 94.06 | 0.08 | 0.26 | 26.1 |
| Uniform | 10 | 100 | 1000 | 94.25 | 0.04 | 0.07 | 1.7 |
| Uniform | 10 | 1000 | 1000 | 94.10 | 0.19 | 0.22 | 14.1 |
| Uniform | 10 | 100 | 1000 | 94.25 | 0.04 | 0.07 | 1.7 |
| Uniform | 10 | 10 | 1000 | 94.24 | 0.04 | 0.08 | 1.9 |

Table 2: Ablation on the exploitation-exploration trade-off on NAS-Bench-101.
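As a minimal sketch of the sampling strategies ablated in Table 2 (the exact decay rates are not specified in the paper, so the weights below are assumptions), uniform, linear-decay, and exponential-decay sampling of $M$ architectures from the top-$N$ ranking can be written as:

```python
import numpy as np

def sample_m_from_top_n(ranked_indices, M, N, strategy="uniform",
                        rng=np.random.default_rng(0)):
    """Draw M architectures from the top-N of a predicted ranking.

    ranked_indices: architecture indices sorted from best to worst predicted
                    accuracy; strategy is 'uniform', 'linear-decay', or
                    'exponential-decay' (the decays bias towards top ranks).
    """
    top = np.asarray(ranked_indices[:N])
    if strategy == "uniform":
        weights = np.ones(N)
    elif strategy == "linear-decay":        # weights N, N-1, ..., 1
        weights = np.arange(N, 0, -1, dtype=float)
    elif strategy == "exponential-decay":   # assumed decay constant N/10
        weights = np.exp(-np.arange(N) / (N / 10.0))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    weights /= weights.sum()
    return rng.choice(top, size=M, replace=False, p=weights)
```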
| Method | #Queries | Test Acc. (%) | SD (%) | Test Regret (%) | Avg. Rank |
|---|---|---|---|---|---|
| WeakNAS | 1000 | 94.25 | 0.04 | 0.07 | 1.7 |
| WeakNAS (BO Variant) | 1000 | 94.12 | 0.15 | 0.20 | 8.7 |
| Optimal | - | 94.32 | - | 0.00 | 1.0 |

Table 3: Comparison to the BO variant of WeakNAS on NAS-Bench-101.

Figure 6: Comparison with SOTA methods on NAS-Bench-101. Solid lines and shaded regions denote the mean and std, respectively.

We then study the exploitation-exploration trade-off in Table 2 on NAS-Bench-101 (a similar ablation on the MobileNet search space on ImageNet is included in Appendix Table 6) by investigating two settings: (a) we gradually increase $N$ to allow for more exploration, similar to controlling $\epsilon$ in the epsilon-greedy [27] approach in the RL context; (b) we vary the sampling strategy among uniform, linear-decay, and exponential-decay (top models get higher probabilities following either a linear-decay or exponential-decay distribution). We empirically observe that: (a) the performance drops more (test regret 0.22% vs. 0.08%) when more exploration (Top N = 1000 vs. Top N = 10) is used, indicating that extensive exploration is not optimal for NAS-Bench-101; (b) uniform sampling yields better performance than sampling strategies biased towards top-performing models (e.g., linear-decay, exponential-decay), indicating that good architectures are evenly distributed within the top-100 predictions of WeakNAS, so a simple uniform sampling strategy for exploration is preferable on NAS-Bench-101. To conclude, our weak NAS predictor strikes a good balance between exploration and exploitation.

Apart from the above exploitation-exploration trade-off of WeakNAS, we also explore the possibility of integrating other meta-sampling methods. We found that a local search algorithm could achieve comparable performance, while using Semi-NAS [20] as a meta-sampling method could further boost the performance of WeakNAS; more details are in Appendix Section G.

### 3.2 Comparison to State-of-the-art (SOTA) Methods

NAS-Bench-101: On the NAS-Bench-101 benchmark, we compare our method with several popular methods [14, 40, 21, 2, 7, 20, 19, 41–44]. Table 5 shows that our method significantly outperforms the baselines in terms of sample efficiency. Specifically, our method costs 964×, 447×, 378×, 245×, 58×, and 7.5× fewer samples to reach the optimal architecture, compared to Random Search, Regularized Evolution [14], MCTS [40], Semi-NAS [20], LaNAS [21], and BONAS [19], respectively. We then plot the best accuracy against the number of samples in Table 4 and Figure 6 to show the sample efficiency on NAS-Bench-101, from which we can see that our method consistently costs fewer samples to reach higher accuracy.

| Method | #Queries | Test Acc. (%) | SD (%) | Test Regret (%) | Avg. Rank |
|---|---|---|---|---|---|
| Random Search | 2000 | 93.64 | 0.25 | 0.68 | 1750.0 |
| NAO [2] | 2000 | 93.90 | 0.03 | 0.42 | 168.1 |
| Reg Evolution [14] | 2000 | 93.96 | 0.05 | 0.36 | 85.0 |
| Semi-NAS [20] | 2000 | 94.02 | 0.05 | 0.30 | 42.1 |
| Neural Predictor [7] | 2000 | 94.04 | 0.05 | 0.28 | 33.5 |
| WeakNAS | 2000 | 94.26 | 0.04 | 0.06 | 1.6 |
| Semi-Assessor [42] | 1000 | 94.01 | - | 0.31 | 47.1 |
| LaNAS [21] | 1000 | 94.10 | - | 0.22 | 14.1 |
| BONAS [19] | 1000 | 94.22 | - | 0.10 | 3.0 |
| WeakNAS | 1000 | 94.25 | 0.04 | 0.07 | 1.7 |
| Arch2vec [41] | 400 | 94.10 | - | 0.22 | 14.1 |
| WeakNAS | 400 | 94.24 | 0.04 | 0.08 | 1.9 |
| LaNAS [21] | 200 | 93.90 | - | 0.42 | 168.1 |
| BONAS [19] | 200 | 94.09 | - | 0.23 | 18.0 |
| WeakNAS | 200 | 94.18 | 0.14 | 0.14 | 5.6 |
| NASBOWLr [45] | 150 | 94.09 | - | 0.23 | 18.0 |
| CATE (cate-DNGO-LS) [43] | 150 | 94.10 | - | 0.22 | 12.3 |
| WeakNAS | 150 | 94.10 | 0.19 | 0.22 | 12.3 |
| Optimal | - | 94.32 | - | 0.00 | 1.0 |

Table 4: Comparing search efficiency by limiting the total number of queries on NAS-Bench-101.
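For reference, the test regret and rank metrics reported in Tables 1–5 can be computed from the benchmark's ground-truth accuracies roughly as follows. This is a sketch under the assumption, consistent with the numbers in the tables, that test regret is the gap to the global optimum and that "Avg. Rank" averages the rank of the returned architecture over repeated runs.

```python
import numpy as np

def regret_and_rank(found_acc, all_test_accs):
    """Test regret and rank of a single search run.

    found_acc:     test accuracy (%) of the architecture returned by the search
    all_test_accs: ground-truth test accuracies (%) of every architecture in
                   the benchmark (e.g., all 423,624 NAS-Bench-101 cells)
    """
    accs = np.asarray(all_test_accs)
    test_regret = accs.max() - found_acc        # gap to the global optimum
    rank = 1 + int(np.sum(accs > found_acc))    # rank 1 = global optimum
    return test_regret, rank

# Example: a run returning a 94.18% architecture on NAS-Bench-101 (optimum
# 94.32%) has a test regret of 0.14%, matching the WeakNAS@200 row in Table 4.
```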
| Method | NAS-Bench-101 (CIFAR10) | NAS-Bench-201 (CIFAR10) | NAS-Bench-201 (CIFAR100) | NAS-Bench-201 (ImageNet16-120) |
|---|---|---|---|---|
| Random Search | 188139.8 | 7782.1 | 7621.2 | 7726.1 |
| Reg Evolution [14] | 87402.7 | 563.2 | 438.2 | 715.1 |
| MCTS [40] | 73977.2 | 528.3 | 405.4 | 578.2 |
| Semi-NAS [20] | 47932.3 | - | - | - |
| LaNAS [21] | 11390.7 | 247.1 | 187.5 | 292.4 |
| BONAS [19] | 1465.4 | - | - | - |
| WeakNAS | 195.2 | 182.1 | 78.4 | 268.4 |

Table 5: Comparison of the number of samples required to find the global optimum on NAS-Bench-101 and NAS-Bench-201. Some baseline results are reproduced using adapted code.

NAS-Bench-201: We further evaluate on NAS-Bench-201 and compare with Random Search, Regularized Evolution [14], Semi-NAS [20], LaNAS [21], and BONAS [19]. As shown in Table 5, we conduct searches on all three subsets (CIFAR10, CIFAR100, ImageNet16-120) and report the average number of samples needed to reach the global optimum on the test set over 100 runs. Our method has the smallest sample cost in all settings.

Open Domain Search: We further apply our method to open-domain search without ground truth, and compare with several popular methods [35, 14, 46, 2, 47, 48, 21]. As shown in Tables 6 and 7, using the fewest samples (and only a fraction of the GPU hours) among all methods, our method achieves state-of-the-art ImageNet top-1 accuracies with comparable parameters and FLOPs. Our searched architecture is also competitive with expert-designed networks. On the NASNet search space, compared with the SOTA predictor-based NAS method LaNAS (one-shot) [21], our method reduces top-1 error by 0.6% while using fewer GPU hours. On the MobileNet search space, we improve over the previous SOTA LaNAS [21], reaching 81.3% top-1 accuracy on ImageNet with fewer FLOPs.

### 3.3 Discussion: Further Comparison with SOTA Predictor-based NAS Methods

BO-based NAS methods [19, 45]: BO-based methods in general treat NAS as a black-box optimization problem; for example, BONAS [19] customizes the classical BO framework for NAS with a GCN embedding extractor and Bayesian Sigmoid Regression to acquire and select candidate architectures. The latest BO-based NAS approach, NASBOWL [45], combines the Weisfeiler-Lehman graph kernel with BO to capture the topological structures of candidate architectures. Compared with those BO-based methods, our WeakNAS is an "oversimplified" version of BO, as explained in Section 2.3. Interestingly, the results in Table 4 suggest that WeakNAS is able to outperform BONAS [19] and is comparable to NASBOWLr [45] on NAS-Bench-101, showcasing that the simplification does not compromise NAS performance. We hypothesize that the following factors might be relevant: (1) the posterior modeling and uncertainty estimation in BO might be noisy; (2) the inherently structured NAS search space (shown in Figure 4) could enable a "shortcut" simplification to explore and exploit. In addition, the conventional uncertainty modeling in BO, such as the Gaussian Process used by [45], does not scale as well when the number of queries is large. In comparison, the complexity of WeakNAS scales almost linearly, as verified in Appendix Table 1. In our experiments, we observe that WeakNAS empirically performs more competitively than current BO-based NAS methods at larger query numbers, besides being far more efficient.

To further show that WeakNAS is indeed an effective simplification compared to the explicit posterior modeling in BO, we report an apples-to-apples comparison: we use the same weak predictor from WeakNAS, obtain its uncertainty estimate by computing the variance of a deep ensemble of five models [49], and then use the classic Expected Improvement (EI) [50] acquisition function. Table 3 confirms that this BO variant of WeakNAS is inferior to our proposed formulation.
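A minimal sketch of such a BO variant is given below, with hypothetical names; the ensemble size follows the five-model setting above, while the EI exploration constant `xi` is an assumption.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Classic EI acquisition from an ensemble's mean/std predictions."""
    sigma = np.maximum(sigma, 1e-9)                  # avoid division by zero
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_variant_select(ensemble, candidates, best_so_far, M):
    """Pick M candidates by EI, using a deep ensemble for uncertainty.

    ensemble:   list of independently trained weak predictors (here, five)
    candidates: array of encoded, not-yet-evaluated architectures
    """
    preds = np.stack([p.predict(candidates) for p in ensemble])  # (5, n_cand)
    mu, sigma = preds.mean(axis=0), preds.std(axis=0)
    ei = expected_improvement(mu, sigma, best_so_far)
    return np.argsort(ei)[-M:]                       # indices of the top-M by EI
```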
Advanced Architecture Encoding [41, 43]: We also compare WeakNAS with NAS methods using custom architecture representations, learned either in an unsupervised way, such as arch2vec [41], or in a supervised way, such as CATE [43]. We show that WeakNAS achieves comparable performance to both methods. Further, those architecture embeddings are essentially complementary to our method and can further boost the performance of WeakNAS, as shown in Appendix Section C.

LaNAS [21]: LaNAS and our framework both follow the divide-and-conquer idea, yet with two methodological differences. (a) How to split the search space: LaNAS learns a classifier to perform a binary hard partition of the search space (no ranking information utilized) and splits it into two equally-sized subspaces. Ours uses a regressor to regress the performance of sampled architectures, and utilizes the ranking information to sample a percentage of the top samples ("soft" partition), with the sample size $N$ being controllable. (b) How to explore: LaNAS uses the Upper Confidence Bound (UCB) to explore the search space by not always choosing the best subspace (left-most node) for sampling, while ours always chooses the best subspace and explores new architectures by adaptive sampling within it, via adjusting the ratio $\epsilon = M/N$ to randomly sample $M$ models from the top $N$. Tables 4 and 5 show the superior sample efficiency of WeakNAS over LaNAS on NAS-Bench-101/201.

Semi-NAS [20] and Semi-Assessor [42]: Both our method and Semi-NAS/Semi-Assessor use an iterative algorithm containing prediction and sampling. The main difference lies in the use of pseudo labels: Semi-NAS and Semi-Assessor use pseudo labels as noisy labels to augment the training set, and are therefore able to leverage "unlabeled samples" (e.g., architectures without true accuracies, only accuracies generated by the predictors) to update their predictors. Our method explores an orthogonal, innovative direction, where the "pseudo labels" generated by the current predictor guide our sampling procedure but are never used for training the next predictor. That said, we point out that our method can be complementary to those semi-supervised methods [20, 42], so they can be integrated as one. For example, Semi-NAS can be used as a meta-sampling method, where at each iteration we further train a Semi-NAS predictor with a pseudo-labeling strategy to augment the training set of our weak predictors. We show in Appendix Table 5 that the combination of our method with Semi-NAS can further boost the performance of WeakNAS.

BRP-NAS [8]: BRP-NAS uses a stronger GCN-based binary relation predictor that utilizes an extra topological prior, and leverages a different scheme to control the exploitation-exploration trade-off compared to our WeakNAS. Further, BRP-NAS uses a somewhat unique setting, i.e., evaluating the top-40 predictions of the NAS predictor instead of the more common top-1 setting [2, 19, 21, 20]. Therefore, we include our comparison to BRP-NAS and more details in Appendix Section F.
| Model | Queries (#) | Top-1 Err. (%) | Top-5 Err. (%) | Params (M) | FLOPs (M) | GPU Days |
|---|---|---|---|---|---|---|
| MobileNetV2 | - | 25.3 | - | 6.9 | 585 | - |
| ShuffleNetV2 | - | 25.1 | - | 5.0 | 591 | - |
| SNAS [51] | - | 27.3 | 9.2 | 4.3 | 522 | 1.5 |
| DARTS [1] | - | 26.9 | 9.0 | 4.9 | 595 | 4.0 |
| P-DARTS [52] | - | 24.4 | 7.4 | 4.9 | 557 | 0.3 |
| PC-DARTS [53] | - | 24.2 | 7.3 | 5.3 | 597 | 3.8 |
| DS-NAS [53] | - | 24.2 | 7.3 | 5.3 | 597 | 10.4 |
| NASNet-A [35] | 20000 | 26.0 | 8.4 | 5.3 | 564 | 2000 |
| AmoebaNet-A [14] | 10000 | 25.5 | 8.0 | 5.1 | 555 | 3150 |
| PNAS [46] | 1160 | 25.8 | 8.1 | 5.1 | 588 | 200 |
| NAO [2] | 1000 | 24.5 | 7.8 | 6.5 | 590 | 200 |
| LaNAS [21] (one-shot) | 800 | 24.1 | - | 5.4 | 567 | 3 |
| LaNAS [21] | 800 | 23.5 | - | 5.1 | 570 | 150 |
| WeakNAS | 800 | 23.5 | 6.8 | 5.5 | 591 | 2.5 |

Table 6: Comparison to SOTA results on ImageNet using the NASNet search space.

| Model | Queries (#) | Top-1 Acc. (%) | Top-5 Acc. (%) | FLOPs (M) | GPU Days |
|---|---|---|---|---|---|
| ProxylessNAS [54] | - | 75.1 | 92.9 | - | - |
| Semi-NAS [20] | 300 | 76.5 | 93.2 | 599 | - |
| BigNAS [47] | - | 76.5 | - | 586 | - |
| FBNetV3 [48] | 20000 | 80.5 | 95.1 | 557 | - |
| OFA [36] | 16000 | 80.0 | - | 595 | 1.6 |
| LaNAS [21] | 800 | 80.8 | - | 598 | 0.3 |
| WeakNAS | 1000 | 81.3 | 95.1 | 560 | 0.16 |
| WeakNAS | 800 | 81.2 | 95.2 | 593 | 0.13 |

Table 7: Comparison to SOTA results on ImageNet using the MobileNet search space. Reported GPU days do not include the supernet training cost.

## 4 Conclusions and Discussions of Broad Impact

In this paper, we present a novel predictor-based NAS framework named WeakNAS that progressively shrinks the sampling space by learning a series of weak predictors that connect towards the best architectures. By co-evolving the sampling stage and the learning stage, our weak predictors progressively evolve to sample towards the subspace of the best architectures, thus greatly simplifying the learning task of each predictor. Extensive experiments on popular NAS benchmarks show that the proposed method is both sample-efficient and robust to various combinations of predictors and architecture encodings. However, WeakNAS is still limited by the human-designed encoding of neural architectures, and in future work we plan to investigate how to jointly learn the predictor and the encoding in our framework. For broader impact, the excellent sample efficiency of WeakNAS reduces the resource and energy consumption needed to search for efficient models while still maintaining SOTA performance. This can effectively serve the goal of Green AI, from model search to model deployment. It might, meanwhile, be subject to potential abuse by searching for models serving malicious purposes.

## Acknowledgment

Z.W. is in part supported by an NSF CCRI project (#2016727).

## References

[1] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[2] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7816–7827, 2018.
[3] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
[4] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[5] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. A generic graph-based neural architecture encoding scheme for predictor-based nas. arXiv preprint arXiv:2004.01899, 2020.
[6] Chen Wei, Chuang Niu, Yiping Tang, and Jimin Liang. Npenas: Neural predictor guided evolution for neural architecture search. arXiv preprint arXiv:2003.12857, 2020.
[7] Wei Wen, Hanxiao Liu, Hai Li, Yiran Chen, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. arXiv preprint arXiv:1912.00848, 2019.
[8] Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane. Brp-nas: Prediction-based nas using gcns. Advances in Neural Information Processing Systems, 33, 2020.
[9] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture search with gbdt. arXiv preprint arXiv:2007.04785, 2020.
[10] Yunhe Wang, Yixing Xu, and Dacheng Tao. Dc-nas: Divide-and-conquer neural architecture search. arXiv preprint arXiv:2005.14456, 2020.
[11] Xiyang Dai, Dongdong Chen, Mengchen Liu, Yinpeng Chen, and Lu Yuan. Da-nas: Data adapted pruning for efficient neural architecture search. ECCV 2020, 2020.
[12] Zhaohui Yang, Yunhe Wang, Xinghao Chen, Jianyuan Guo, Wei Zhang, Chao Xu, Chunjing Xu, Dacheng Tao, and Chang Xu. Hournas: Extremely fast neural architecture search through an hourglass lens. arXiv preprint arXiv:2005.14446, 2020.
[13] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[14] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019.
[15] Zhaohui Yang, Yunhe Wang, Xinghao Chen, Boxin Shi, Chao Xu, Chunjing Xu, Qi Tian, and Chang Xu. Cars: Continuous evolution for efficient neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1829–1838, 2020.
[16] Weijun Hong, Guilin Li, Weinan Zhang, Ruiming Tang, Yunhe Wang, Zhenguo Li, and Yong Yu. Dropnas: Grouped operation dropout for differentiable architecture search. In International Joint Conference on Artificial Intelligence, 2020.
[17] Yixing Xu, Yunhe Wang, Kai Han, Shangling Jui, Chunjing Xu, Qi Tian, and Chang Xu. Renas: Relativistic evaluation of neural architecture search. arXiv preprint arXiv:1910.01523, 2019.
[18] Yanxi Li, Minjing Dong, Yunhe Wang, and Chang Xu. Neural architecture search in a proxy validation loss landscape. In International Conference on Machine Learning, pages 5853–5862. PMLR, 2020.
[19] Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James Kwok, and Tong Zhang. Bridging the gap between sample-based and one-shot neural architecture search with bonas. Advances in Neural Information Processing Systems, 33, 2020.
[20] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. Semi-supervised neural architecture search. arXiv preprint arXiv:2002.10389, 2020.
[21] Linnan Wang, Saining Xie, Teng Li, Rodrigo Fonseca, and Yuandong Tian. Sample-efficient neural architecture search by learning actions for monte carlo tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[22] Rachael Tappenden, Peter Richtárik, and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. Journal of Optimization Theory and Applications, 170(1):144–176, 2016.
[23] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv preprint arXiv:1109.2415, 2011.
[24] William W Hager and Hongchao Zhang. Convergence rates for an inexact ADMM applied to separable convex optimization. Computational Optimization and Applications, 2020.
[25] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC Press, 2012.
[26] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
[27] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[28] Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. Nas-bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv preprint arXiv:2008.09777, 2020.
[29] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[30] Yiming Hu, Yuding Liang, Zichao Guo, Ruosi Wan, Xiangyu Zhang, Yichen Wei, Qingyi Gu, and Jian Sun. Angle-based search space shrinking for neural architecture search. arXiv preprint arXiv:2004.13431, 2020.
[31] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations, 2020.
[32] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning, pages 7105–7114, 2019.
[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[34] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
[35] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
[36] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[38] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[39] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[40] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. Alphax: exploring neural architectures with deep neural networks and monte carlo tree search. arXiv preprint arXiv:1903.11059, 2019.
[41] Shen Yan, Yu Zheng, Wei Ao, Xiao Zeng, and Mi Zhang. Does unsupervised architecture representation learning help neural architecture search? Advances in Neural Information Processing Systems, 33, 2020.
[42] Yehui Tang, Yunhe Wang, Yixing Xu, Hanting Chen, Boxin Shi, Chao Xu, Chunjing Xu, Qi Tian, and Chang Xu. A semi-supervised assessor of neural architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1810–1819, 2020.
[43] Shen Yan, Kaiqiang Song, Fei Liu, and Mi Zhang. Cate: Computation-aware neural architecture encoding with transformers. arXiv preprint arXiv:2102.07108, 2021.
[44] Colin White, Willie Neiswanger, and Yash Savani. Bananas: Bayesian optimization with neural architectures for neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10293–10301, 2021.
[45] Binxin Ru, Xingchen Wan, Xiaowen Dong, and Michael Osborne. Interpretable neural architecture search via bayesian optimisation with weisfeiler-lehman kernels. In International Conference on Learning Representations, 2021.
[46] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[47] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neural architecture search with big single-stage models. In European Conference on Computer Vision, pages 702–717. Springer, 2020.
[48] Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. Fbnetv3: Joint architecture-recipe search using neural acquisition function. arXiv preprint arXiv:2006.02049, 2020.
[49] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.
[50] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[51] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
[52] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1294–1303, 2019.
[53] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. Pc-darts: Partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737, 2019.
[54] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 4.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 4.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code and instructions are released at https://github.com/VITA-Group/WeakNAS.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 2.4 for the predictor setup, Section 3 for the detailed setup on each dataset, and the supplemental material for ImageNet training details.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Experiments on NAS-Bench-101/201 were all run multiple times; see the confidence intervals in Figures 1, 5, and 6 and the standard deviations in Tables 1, 2, and 4. The ImageNet experiments in Tables 6 and 7 were run only once due to the high computational cost (∼600 GPU hours) of re-training each searched model; however, we will continue running more ImageNet experiments and report confidence intervals in future versions.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 3.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We used publicly available data and cited the corresponding papers, including CIFAR10 [33], CIFAR100 [33], ImageNet16-120 [34], ImageNet [37], NAS-Bench-101 [32], NAS-Bench-201 [31], OFA [36], and the timm library [39].
   (b) Did you mention the license of the assets? [Yes] We double-checked that the publications for the data and code we use include the corresponding licenses; see the Appendix for more details.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]