# Efficient Neural Architecture Search via Proximal Iterations

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Quanming Yao,1 Ju Xu,3 Wei-Wei Tu,1 Zhanxing Zhu2,3,4
1 4Paradigm Inc., 2 School of Mathematical Sciences, Peking University, 3 Center for Data Science, Peking University, 4 Beijing Institute of Big Data Research (BIBDR)
{yaoquanming, tuweiwei}@4paradigm.com, {xuju, zhanxing.zhu}@pku.edu.cn

Q. Yao and J. Xu contribute equally, and the work was performed when J. Xu was an intern at 4Paradigm. Zhanxing Zhu is the corresponding author.

Abstract

Neural architecture search (NAS) attracts much research attention because of its ability to identify better architectures than handcrafted ones. Recently, differentiable search methods have become the state of the art for NAS, as they can obtain high-performance architectures within several days. However, they still suffer from high computation costs and inferior performance due to the construction of a supernet. In this paper, we propose an efficient NAS method based on proximal iterations (denoted NASP). Different from previous works, NASP reformulates the search process as an optimization problem with a discrete constraint on architectures and a regularizer on model complexity. As the new objective is hard to solve, we further propose an efficient algorithm inspired by proximal iterations for its optimization. In this way, NASP is not only much faster than existing differentiable search methods, but also finds better architectures and balances model complexity. Extensive experiments on various tasks demonstrate that NASP obtains high-performance architectures with more than ten times speedup over the state of the art.

1 Introduction

Deep networks have been applied to many applications, where proper architectures are extremely important to ensure good performance. Recently, neural architecture search (NAS) (Zoph and Le 2017; Baker et al. 2017) has been developed as a promising approach to replace human experts in designing architectures; it can find networks with fewer parameters and better performance (Yao et al. 2018; Hutter, Kotthoff, and Vanschoren 2018). NASNet (Zoph and Le 2017) is the pioneering work along this direction: it models the design of convolutional neural networks (CNNs) as a multi-step decision problem and solves it with reinforcement learning (Sutton and Barto 1998). However, since the search space is discrete and extremely large, NASNet requires a month with hundreds of GPUs to obtain a satisfying architecture. Later, observing the good transferability of networks from small to large ones, NASNet-A (Zoph et al. 2017) proposed to cut networks into blocks so that the search only needs to be carried out within such a block, or cell. The identified cell is then used as a building block to assemble large networks. This two-stage search strategy dramatically reduces the size of the search space and subsequently leads to significant speedups of various previous search algorithms (e.g., evolutionary algorithms (Real et al. 2018), greedy search (Liu et al. 2018), and reinforcement learning (Zhong et al. 2018)). However, although the size of the search space is reduced, the space is still discrete, which is generally hard to search efficiently (Parikh and Boyd 2013).
More recent endeavors have focused on changing the landscape of the search space from a discrete one to a differentiable one (Luo et al. 2018; Liu, Simonyan, and Yang 2019; Xie et al. 2019). The benefit of this idea is that a differentiable space enables the computation of gradient information, which can speed up the convergence of the underlying optimization algorithm. Various techniques have been proposed: DARTS (Liu, Simonyan, and Yang 2019) smooths design choices with a softmax and trains an ensemble of networks; SNAS (Xie et al. 2019) enhances reinforcement learning with a smooth sampling scheme; NAO (Luo et al. 2018) maps the search space into a new differentiable space with an auto-encoder. Among all these works (Tab. 1), the state of the art is DARTS (Liu, Simonyan, and Yang 2019), as it combines the best of both worlds, i.e., fast gradient descent (differentiable search space) within a cell (small search space). However, its search efficiency and the performance of the identified architectures are still not satisfying. From the computational perspective, since DARTS maintains a supernet during the search, all operations need to be forward- and backward-propagated during gradient descent even though only one operation will eventually be selected. From the performance perspective, operations typically correlate with each other (Xie et al. 2019); e.g., a 7x7 convolution filter covers a 3x3 one as a special case. When updating the network's weights, the ensemble constructed by DARTS during the search may therefore lead to inferior architectures being discovered. Moreover, as mentioned in (Xie et al. 2019), DARTS is not complete (Tab. 1), i.e., the final structure needs to be re-identified after the search. This causes a bias between the searched architecture and the final architecture, and may degrade the performance of the final architecture.

Table 1: Comparison of the proposed NASP with other state-of-the-art NAS methods on four perspectives of searching: the search space (differentiable, denoted "diff", and cell-based), completeness, complexity control, and the constraint of discrete architectures.

| Architecture | diff space | cell | complete | complexity control | discrete architectures | search algorithm |
| --- | --- | --- | --- | --- | --- | --- |
| NASNet-A (Zoph et al. 2017) | ✗ | ✓ | ✓ | ✗ | ✓ | reinforcement learning |
| AmoebaNet (Real et al. 2018) | ✗ | ✓ | ✓ | ✗ | ✓ | evolution algorithm |
| SNAS (Xie et al. 2019) | ✓ | ✓ | ✓ | ✓ | ✗ | reinforcement learning |
| DARTS (Liu, Simonyan, and Yang 2019) | ✓ | ✓ | ✗ | ✗ | ✗ | gradient descent |
| NASP (proposed) | ✓ | ✓ | ✓ | ✓ | ✓ | proximal algorithm |

In this work, we propose NAS with proximal iterations (NASP) to improve the efficiency and performance of existing differentiable search methods. We give a new formulation of the NAS problem and a new optimization algorithm, which allow searching in a differentiable space while keeping architectures discrete. In this way, NASP removes the need to train a supernet, which speeds up the search and leads to better architectures. Our contributions are:

- Besides the popularly discussed perspectives of NAS, i.e., search space, completeness, and model complexity, we identify a new and important one, i.e., the constraint on architectures ("discrete architectures" in Tab. 1). We formulate NAS as a constrained optimization problem that keeps the space differentiable but enforces discrete architectures during the search, i.e., only one of all possible operations is actually employed during forward and backward propagation. This improves search efficiency and decouples different operations during training. A regularizer is also introduced into the new objective, which allows control of the architecture's size.
- Since such a discrete constraint is hard to optimize and a simple adaptation of DARTS cannot be applied, we propose a new algorithm derived from proximal iterations (Parikh and Boyd 2013) for optimization. The closed-form solution to the proximal step with the proposed discrete constraint is new to the optimization literature, and it removes the expensive second-order approximation required by DARTS. We further provide a theoretical analysis to guarantee the convergence of the proposed algorithm.

- Finally, experiments are performed on various benchmark data sets for designing CNN and RNN architectures. Compared with state-of-the-art methods, the proposed NASP is not only much faster (more than ten times speedup over DARTS) but also discovers better architectures. These results demonstrate that NASP obtains state-of-the-art performance on both test accuracy and computational efficiency.

The implementation of NASP is available at: https://github.com/xujinfan/NASP-codes.

2 Related Works

In the sequel, vectors are denoted by lowercase boldface letters and matrices by uppercase boldface letters.

2.1 Proximal Algorithm (PA)

The proximal algorithm (PA) (Parikh and Boyd 2013) is a popular optimization technique in machine learning for handling constrained problems of the form $\min_x f(x)$, s.t. $x \in S$, where $f$ is a smooth objective and $S$ is a constraint set. The crux of PA is the proximal step:

$$x = \text{prox}_S(z) = \arg\min_y \tfrac{1}{2}\|y - z\|_2^2, \quad \text{s.t. } y \in S. \tag{1}$$

Closed-form solutions to (1) exist for many constraint sets, such as the $\ell_1$- and $\ell_2$-norm balls (Parikh and Boyd 2013). PA then generates a sequence $\{x_t\}$ by

$$x_{t+1} = \text{prox}_S\big(x_t - \varepsilon \nabla f(x_t)\big), \tag{2}$$

where $\varepsilon$ is the learning rate. PA is guaranteed to obtain critical points of $f$ when $S$ is a convex constraint, and produces limit points when the proximal step can be computed exactly (Yao et al. 2017). Due to its nice theoretical guarantees and good empirical performance, it has been applied to many deep learning problems, e.g., network binarization (Bai, Wang, and Liberty 2018). Another variant of PA with a lazy proximal step (Xiao 2010) maintains two copies of $x$, i.e.,

$$\bar{x}_t = \text{prox}_S(x_t), \quad x_{t+1} = x_t - \varepsilon \nabla f(\bar{x}_t), \tag{3}$$

which is also popularly used in deep learning for network quantization (Courbariaux, Bengio, and David 2015; Hou, Yao, and Kwok 2017). It has no convergence guarantee in the nonconvex case, but empirically performs well on network quantization tasks. Finally, neither (2) nor (3) has previously been introduced into NAS.

2.2 Differentiable Architecture Search (DARTS)

DARTS (Liu, Simonyan, and Yang 2019) searches architectures by cells (Fig. 1(a)): a cell is a directed acyclic graph consisting of an ordered sequence of N nodes, with two input nodes and a single output node (Zoph et al. 2017). Within a cell, each node x(i) is a latent representation and each directed edge (i, j) is associated with an operation O(i,j) that transforms x(i) into x(j). Thus, each intermediate node is computed from all of its predecessors, i.e., $x^{(j)} = \sum_{i < j} O^{(i,j)}(x^{(i)})$. To make the search space continuous, DARTS relaxes the categorical choice of an operation to a softmax over all possible operations:

$$O^{(i,j)}(x^{(i)}) = \sum_{k=1}^{|\mathcal{O}|} \frac{\exp(a^{(i,j)}_k)}{\sum_{k'=1}^{|\mathcal{O}|} \exp(a^{(i,j)}_{k'})}\, O_k(x^{(i)}), \tag{4}$$

where the vector a(i,j) parameterizes the mixing weights of the operations on edge (i, j) and the architecture is represented by A = {a(i,j)}. The search is then formulated as the bi-level problem

$$\min_A \mathcal{L}_{\text{val}}\big(w^*(A), A\big), \quad \text{s.t. } w^*(A) = \arg\min_w \mathcal{L}_{\text{train}}(w, A), \tag{5}$$

where L_val and L_train denote the validation and training losses, respectively. To avoid solving the inner problem exactly, DARTS approximates w*(A) by one step of gradient descent and updates the architecture by

$$A_{t+1} = A_t - \varepsilon \nabla_A \mathcal{L}_{\text{val}}\big(w_t - \xi \nabla_w \mathcal{L}_{\text{train}}(w_t, A_t),\, A_t\big), \tag{6}$$

where $\xi > 0$ is the step-size and a second-order derivative, i.e., $\nabla^2_{A,w} \mathcal{L}_{\text{train}}(w_t, A_t)$, is involved. The evaluation of this second-order term is extremely expensive, as it requires two extra computations of the gradient w.r.t. w and two extra forward passes for A. Finally, a discrete architecture Ā needs to be derived from the relaxed A after the search (see Alg. 1).
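To make the cost of the relaxation (4) concrete, below is a minimal PyTorch-style sketch of a softmax mixed operation on a single edge. This is our illustration rather than the DARTS code, and the class and attribute names are ours; it shows that every candidate operation is executed in the forward pass and receives gradients in the backward pass, which is exactly the supernet overhead discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxMixedOp(nn.Module):
    """Softmax-relaxed mixed operation on one edge, in the spirit of Eq. (4)."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                         # candidate operations O_k
        self.a = nn.Parameter(1e-3 * torch.randn(len(ops)))   # a^{(i,j)} for this edge

    def forward(self, x):
        weights = F.softmax(self.a, dim=0)   # mixing weights over all |O| operations
        # Every candidate operation is evaluated and blended, so all of them
        # participate in forward and backward propagation (the "supernet").
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```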
Algorithm 1 Differentiable architecture search (DARTS) (Liu, Simonyan, and Yang 2019).

1: Create a mixed operation O(i,j) parametrized by (4);
2: while not converged do
3:   Update $A_{t+1}$ by (6); // with second-order approximation
4:   Update $w_{t+1}$ by descending $\nabla_{w_t} \mathcal{L}_{\text{train}}(w_t, A_{t+1})$ using back-propagation; // with all operations
5: end while
6: Derive the discrete architecture Ā from the continuous A; // not complete
7: return final architecture Ā.

Due to the differentiable relaxation in (4), an ensemble of operations (i.e., a supernet) is maintained, and all operations in the search space need to be forward- and backward-propagated when updating w; in addition, the evaluation of the second-order term in (6) is very expensive and is known as a computational bottleneck of DARTS (Xie et al. 2019; Noy et al. 2019). Besides, the performance obtained by DARTS is not as good as desired, due to the possible correlations among operations and the need to derive a new architecture after the search (i.e., the lack of completeness) (Xie et al. 2019). Finally, the objective (5) of DARTS does not consider model complexity, which means DARTS cannot control the model size of the final architecture.

3 Our Approach: NASP

As introduced in Sec. 2.2, DARTS is a state-of-the-art NAS method, but it has three significant limitations: (a) search efficiency: the supernet resulting from the softmax trick in (4) is expensive to train; (b) architecture performance: correlations exist among operations, which can lead to inferior architectures; (c) model complexity: depending on the application, we may want to trade accuracy for smaller models, which is not possible in DARTS. Recall that in earlier works on NAS (see Tab. 1), e.g., NASNet (Baker et al. 2017; Zoph and Le 2017) and GeNet (Xie and Yuille 2017), architectures are discrete when network parameters are updated. Such discretization naturally avoids the problems of completeness and correlations among operations faced by DARTS. Thus, can we search in a differentiable space but keep architectures discrete when updating the network's parameters? This question motivates us to formulate a new search objective for NAS (Sec. 3.1) and to propose a new algorithm for its optimization (Sec. 3.2).

3.1 Search Objective

Since NAS can be seen as a black-box optimization problem (Yao et al. 2018; Hutter, Kotthoff, and Vanschoren 2018), we bring the wisdom of constrained optimization to bear on it.

Discrete constraint. Specifically, we keep A continuous, which allows the use of gradient descent, but constrain its values to be discrete. Thus, instead of (4), we propose the following relaxation of architectures:

$$O^{(i,j)}(x^{(i)}) = \sum_{k=1}^{|\mathcal{O}|} a^{(i,j)}_k\, O_k(x^{(i)}), \quad \text{s.t. } a^{(i,j)} \in \mathcal{C}, \tag{7}$$

where the constraint set is defined as $\mathcal{C} = \{a \mid \|a\|_0 = 1 \text{ and } 0 \le a_k \le 1\}$. While a(i,j) is continuous, the constraint C keeps its choices discrete, so exactly one operation is actually activated on each edge while training the network parameters w, as illustrated in Fig. 1(c).
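In contrast to the softmax mixed operation sketched earlier, an edge under the discrete relaxation (7) might look as follows. This is again our illustration with hypothetical names, not the released NASP code: only the single operation selected by the one-hot constraint is executed when the network is propagated.

```python
import torch
import torch.nn as nn

class DiscreteMixedOp(nn.Module):
    """One edge under Eq. (7): a^{(i,j)} stays continuous in [0, 1], but the
    l0 constraint means only one operation is active during propagation."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                # candidate operations O_k
        self.a = nn.Parameter(torch.rand(len(ops)))  # architecture weights a^{(i,j)}

    def forward(self, x):
        k = int(self.a.argmax())   # index of the single active operation
        # Scaling by a[k] keeps the output differentiable w.r.t. a, while only
        # one operation is forward- and backward-propagated.
        return self.a[k] * self.ops[k](x)
```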
Regularization on model complexity. Besides, in the search space of NAS, different operations have distinct numbers of parameters; for example, the parameter count of sep_conv_7x7 is ten times that of conv_1x1. Thus, we may also want to regularize the model parameters to trade off accuracy against model complexity (Cai, Zhu, and Han 2019; Xie et al. 2019). Recall that one column of A corresponds to one possible operation (Fig. 1(d)), and whether an operation is selected depends on the values of a(i,j) (a row of A). Thus, if we suppress the values of a specific column of A, its operation becomes less likely to be selected in Alg. 2, due to the proximal step onto C1. This motivates us to introduce the regularizer

$$R(A) = \sum_{k=1}^{|\mathcal{O}|} \frac{p_k}{\bar{p}}\, \|a_k\|_2^2, \tag{8}$$

where $a_k$ is the $k$th column of $A$, $p_k$ is the number of parameters of the $k$th operation, and $\bar{p} = \sum_{i=1}^{|\mathcal{O}|} p_i$.

Search objective. Finally, with our new relaxation (7) and regularization (8), the NAS problem becomes

$$\min_A F(w^*, A), \quad \text{s.t. } w^* = \arg\min_w \mathcal{L}_{\text{train}}(w, A), \;\; a^{(i,j)} \in \mathcal{C}, \tag{9}$$

where $F(w, A) = \mathcal{L}_{\text{val}}(w, A) + \eta R(A)$, with η ≥ 0 balancing complexity and accuracy; a larger η leads to smaller architectures.

Figure 1: Comparison of the computation graph in a cell between DARTS (Fig. 1(b)) and NASP (Fig. 1(c)). Three operations are considered; DARTS needs to forward- and backward-propagate along all operations when updating w, while NASP only computes along the currently selected one. Panel (a) shows a cell in NAS, and panel (d) shows the architecture parameters a(i,j)_k arranged in matrix form (parameter A).

Remark 1. In the literature, learning with a discrete constraint has only been explored for model parameters, e.g., compressing deep networks with binary weights (Courbariaux, Bengio, and David 2015) and gradient quantization (Alistarh et al. 2017), but not for hyper-parameter or architecture optimization. Meanwhile, other constraints have been considered in NAS, e.g., memory cost and latency (Tan et al. 2018; Cai, Zhu, and Han 2019). We are the first to introduce such a constraint on architectures into NAS (Tab. 1).

3.2 Search Algorithm

Solving the new NAS objective (9) is not trivial. Due to the extra constraint and regularizer, neither a simple adaptation of DARTS nor plain PA can be applied. In the sequel, we propose a new variant of the PA algorithm for efficient optimization.

Failure of existing algorithms. A direct solution would be the PA of Sec. 2.1: the architecture can either be updated by (2), i.e.,

$$A_{t+1} = \text{prox}_{\mathcal{C}}\big(A_t - \varepsilon \nabla_A F(w(A_t), A_t)\big), \tag{10}$$

or by the lazy proximal step (3), i.e.,

$$\bar{A}_t = \text{prox}_{\mathcal{C}}(A_t), \quad A_{t+1} = A_t - \varepsilon \nabla F\big(w(\bar{A}_t), \bar{A}_t\big), \tag{11}$$

where the gradient can be evaluated by (6) and the computation of the second-order approximation is still required. Let $\mathcal{C}_1 = \{a \mid \|a\|_0 = 1\}$ and $\mathcal{C}_2 = \{a \mid 0 \le a_k \le 1\}$, so that $\mathcal{C} = \mathcal{C}_1 \cap \mathcal{C}_2$. The closed-form solution of the proximal step is given in Proposition 1 (proof in Appendix A.1).

Proposition 1. $\text{prox}_{\mathcal{C}}(a) = \text{prox}_{\mathcal{C}_1}(\text{prox}_{\mathcal{C}_2}(a))$.

However, solving (9) in this way is not easy. Due to the discrete nature of the constraint set, the proximal iteration (10) struggles to obtain a good solution (Courbariaux, Bengio, and David 2015). Besides, while (3) empirically leads to better performance than (2) for binary networks (Courbariaux, Bengio, and David 2015; Hou, Yao, and Kwok 2017; Bai, Wang, and Liberty 2018), the lazy update (11) does not succeed here either. The reason is that, as in DARTS (Liu, Simonyan, and Yang 2019), A should naturally lie in the range [0, 1], but (11) cannot guarantee this, which in turn negatively impacts the search performance.

Proposed algorithm. Instead, motivated by Proposition 1, we keep A as continuous variables constrained by C2 and optimize them directly. Similar box constraints have been explored in sparse coding and non-negative matrix factorization (Lee and Seung 1999), where they help to improve the discriminative ability of the learned factors; here, as demonstrated in our experiments, they help to identify better architectures. We then also introduce a discrete architecture Ā, constrained by C1 and derived from A during the iterations.
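To make these proximal steps concrete, here is a minimal sketch of the two projections and their composition from Proposition 1. This is our illustration, assuming the standard closed forms: clipping to [0, 1] for the box constraint C2, and keeping the single largest-magnitude entry for the l0 constraint C1; the function names are ours.

```python
import torch

def prox_c2(a: torch.Tensor) -> torch.Tensor:
    """prox for C2 = {a | 0 <= a_k <= 1}: clip every entry into [0, 1]."""
    return a.clamp(0.0, 1.0)

def prox_c1(a: torch.Tensor) -> torch.Tensor:
    """prox for C1 = {a | ||a||_0 = 1}: keep the largest-magnitude entry
    (the closest point with exactly one nonzero) and zero out the rest."""
    out = torch.zeros_like(a)
    k = a.abs().argmax()
    out[k] = a[k]
    return out

def prox_c(a: torch.Tensor) -> torch.Tensor:
    """Proposition 1: prox_C(a) = prox_C1(prox_C2(a))."""
    return prox_c1(prox_c2(a))
```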
Note that it is easy to see that Āt ∈ C is guaranteed. The proposed procedure is described in Alg. 2.

Algorithm 2 NASP: Efficient Neural Architecture Search with Proximal Iterations.

1: Create a mixed operation O(i,j) parametrized by (7);
2: while not converged do
3:   Get discrete architectures: $\bar{a}^{(i,j)}_t = \text{prox}_{\mathcal{C}_1}(a^{(i,j)}_t)$;
4:   Update $A_{t+1} = \text{prox}_{\mathcal{C}_2}\big(A_t - \varepsilon \nabla_{\bar{A}_t} F(w_t, \bar{A}_t)\big)$; // no second-order approximation
5:   Refine discrete architectures: $\bar{a}^{(i,j)}_{t+1} = \text{prox}_{\mathcal{C}_1}(a^{(i,j)}_{t+1})$;
6:   Update $w_{t+1}$ by descending $\nabla_{w_t} \mathcal{L}_{\text{train}}(w_t, \bar{A}_{t+1})$ using back-propagation; // with the selected operations
7: end while
8: return searched architecture Āt.

Like DARTS, NASP alternately updates the architecture A (step 4) and the network parameters w (step 6). However, A is discretized in steps 3 and 5. Specifically, the discretized architectures obtained in step 3 are more stable than the continuous ones in DARTS, as subsequent updates of w are less likely to change Ā. Thus, we can treat w_t in step 4 as a constant w.r.t. A, which removes the second-order approximation in (6) and significantly speeds up the architecture updates. After the refinement in step 5, the network weights in step 6 only need to be propagated through the selected operations, which reduces the training time and decouples operations when training the network. Finally, unlike DARTS, we do not need an extra step to discretize the architecture from a continuous one, since a discrete architecture is already maintained during the search. This reduces the gap between search and fine-tuning, which leads to better architectures being identified.

Table 2: Classification errors of NASP and state-of-the-art image classifiers on CIFAR-10.

| Method | Test Error (%) | Params (M) | Time (GPU days) |
| --- | --- | --- | --- |
| DenseNet-BC (Huang et al. 2017) | 3.46 | 25.6 | - |
| NASNet-A + cutout (Zoph et al. 2017) | 2.65 | 3.3 | 1800 |
| AmoebaNet + cutout (Real et al. 2018) | 2.55 ± 0.05 | 2.8 | 3150 |
| PNAS (Liu et al. 2018) | 3.41 ± 0.09 | 3.2 | 225 |
| ENAS (Pham et al. 2018) | 2.89 | 4.6 | 0.5 |
| Random search + cutout (Liu, Simonyan, and Yang 2019) | 3.29 ± 0.15 | 3.2 | 4 |
| DARTS (1st-order) + cutout (Liu, Simonyan, and Yang 2019) | 3.00 ± 0.14 | 3.3 | 1.5 |
| DARTS (2nd-order) + cutout | 2.76 ± 0.09 | 3.3 | 4 |
| SNAS (large complexity) + cutout (Xie et al. 2019) | 2.98 | 2.9 | 1.5 |
| SNAS (middle complexity) + cutout | 2.85 ± 0.02 | 2.8 | 1.5 |
| SNAS (small complexity) + cutout | 3.10 ± 0.04 | 2.3 | 1.5 |
| NASP (7 operations) + cutout | 2.83 ± 0.09 | 3.3 | 0.1 |
| NASP (12 operations) + cutout | 2.44 ± 0.04 | 7.4 | 0.2 |

Theoretical analysis. Finally, unlike DARTS and PA with lazy updates, the convergence of the proposed NASP can be guaranteed, as stated in Theorem 2. The proof is in Appendix A.2.

Theorem 2. Let $\max F(w, A) < \infty$ and let F be differentiable; then the sequence {A_t} generated by Alg. 2 has limit points.

Note that previous analyses cannot be applied here, as the algorithm's steps differ from all previous works, i.e., (Hou, Yao, and Kwok 2017; Bai, Wang, and Liberty 2018), and this is the first time PA is introduced into NAS. While two assumptions are made in Theorem 2, the smoothness of F can be satisfied by using proper loss functions, e.g., the cross-entropy used in this paper, and the other assumption holds empirically in our experiments.
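As a summary of Alg. 2, the sketch below outlines one search iteration under simplifying assumptions of ours: a single architecture matrix A with one row per edge, user-supplied callables `val_loss` and `train_loss` that build the network from a discrete architecture, a placeholder learning rate, and the regularizer ηR(A) omitted for brevity. It is an illustration of the procedure, not the released implementation.

```python
import torch

def discretize(A: torch.Tensor) -> torch.Tensor:
    """Row-wise prox_C1: keep the largest entry of each edge's row of A."""
    A_bar = torch.zeros_like(A)
    rows = torch.arange(A.size(0))
    cols = A.argmax(dim=1)
    A_bar[rows, cols] = A[rows, cols]
    return A_bar

def nasp_iteration(A, val_loss, train_loss, w_optimizer, eps=0.025):
    """One iteration of Alg. 2 for an architecture matrix A (edges x operations)."""
    # Step 3: get the discrete architecture A_bar = prox_C1(A).
    A_bar = discretize(A.detach()).requires_grad_(True)

    # Step 4: first-order update of A on the validation loss, projected onto
    # C2 = [0, 1]; w_t is treated as a constant, so no second-order term.
    grad_A = torch.autograd.grad(val_loss(A_bar), A_bar)[0]
    with torch.no_grad():
        A.copy_((A - eps * grad_A).clamp_(0.0, 1.0))

    # Step 5: refine the discrete architecture from the updated A.
    A_bar = discretize(A.detach())

    # Step 6: update the network weights w through the selected operations only.
    w_optimizer.zero_grad()
    train_loss(A_bar).backward()
    w_optimizer.step()
    return A_bar
```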
4 Experiments

Here, we perform experiments on searching CNN and RNN structures. Four data sets, i.e., CIFAR-10, ImageNet, PTB, and WT2, are utilized in our experiments (see Appendix B.1).

4.1 Architecture Search for CNN

Searching cells on CIFAR-10. As in (Zoph and Le 2017; Zoph et al. 2017; Liu, Simonyan, and Yang 2019; Xie et al. 2019; Luo et al. 2018), we search architectures on CIFAR-10 (Krizhevsky 2009). Following (Liu, Simonyan, and Yang 2019), the convolutional cell consists of N = 7 nodes and the network is obtained by stacking cells 8 times; in the search process, we train a small network stacked from 8 cells for 50 epochs (see Appendix B.2). Two different search spaces are considered. The first one is the same as in DARTS and contains 7 operations; the second one is larger and contains 12 operations (see Appendix B.3). Besides, our search spaces for the normal cell and the reduction cell differ: for the normal cell, the search space only consists of identity and convolutional operations; for the reduction cell, it only consists of identity and pooling operations. Results compared with state-of-the-art NAS methods can be found in Tab. 2, and the searched cells are shown in Appendix C.2. Note that ProxylessNAS (Cai, Zhu, and Han 2019), MnasNet (Tan et al. 2018), and Single Path One-Shot (Guo et al. 2019) are not compared, as their code is not available and they focus on NAS for mobile devices; GeNet (Xie and Yuille 2017) is not compared, as its performance is much worse than that of ResNet. Note also that, for a fair comparison, we remove ASAP's extra data augmentation except cutout. We can see that, in the same space (with 7 operations), NASP has comparable performance with DARTS (2nd-order) and is much better than DARTS (1st-order). In the larger space (with 12 operations), NASP is still much faster than DARTS, with much lower test error than the others. Note that NASP in the larger space also produces larger models; as detailed in Sec. 4.3, this is because NASP can find operations that give lower test error, while the others cannot.

Regularization on model complexity. In the above experiments, we set η = 0 in (9). Here, we vary η; the results on CIFAR-10 are shown in Fig. 2. We can see that the model size gets smaller as η increases.

Figure 2: Impact of the penalty η.

Transferring to ImageNet. To explore the transferability of the searched cells to ImageNet, we stack the searched cells 14 times. The experimental results can be seen in Tab. 4. Notably, NASP achieves test error competitive with the state-of-the-art methods.

Table 3: Comparison with state-of-the-art language models on PTB (lower perplexity is better).

| Architecture | Perplexity (valid) | Perplexity (test) | Params (M) | Time (GPU days) |
| --- | --- | --- | --- | --- |
| NAS (Zoph and Le 2017) | - | 64.0 | 25 | 10,000 |
| ENAS (Pham et al. 2018) | 68.3 | 63.1 | 24 | 0.5 |
| Random search (Liu, Simonyan, and Yang 2019) | 61.8 | 59.4 | 23 | 2 |
| DARTS (1st-order) (Liu, Simonyan, and Yang 2019) | 60.2 | 57.6 | 23 | 0.5 |
| DARTS (2nd-order) | 59.7 | 56.4 | 23 | 1 |
| NASP | 59.9 | 57.3 | 23 | 0.1 |

Table 4: Classification accuracy of NASP and state-of-the-art image classifiers on ImageNet.

| Architecture | Top-1 Test Error (%) | Top-5 Test Error (%) | Params (M) |
| --- | --- | --- | --- |
| Inception-v1 (Szegedy et al. 2015) | 30.2 | 10.1 | 6.6 |
| MobileNet (Howard et al. 2017) | 29.4 | 10.5 | 4.2 |
| ShuffleNet V2 (Ma et al. 2018) | 25.1 | 10.1 | 5 |
| NASNet-A (Zoph et al. 2017) | 26.8 | 8.4 | 5.3 |
| AmoebaNet (Real et al. 2018) | 24.3 | 7.6 | 6.4 |
| PNAS (Liu et al. 2018) | 25.8 | 8.1 | 5.1 |
| DARTS (2nd-order) | 26.9 | 9.0 | 4.9 |
| SNAS (middle complexity) | 27.3 | 9.2 | 4.3 |
| NASP (7 operations) | 27.2 | 9.1 | 4.6 |
| NASP (12 operations) | 26.3 | 8.6 | 9.5 |
4.2 Architecture Search for RNN

Searching cells on PTB. Following the setting of DARTS (Liu, Simonyan, and Yang 2019), the recurrent cell consists of N = 12 nodes; the first intermediate node is obtained by linearly transforming the two input nodes, adding up the results, and passing them through a tanh activation function; the result of the first intermediate node is then transformed by an activation function. The activation functions used are tanh, relu, sigmoid, and identity. In the search process, we train a small network with sequence length 35 for 50 epochs. To evaluate the performance of the searched cells on PTB, a single-layer recurrent network with the discovered cell is trained for at most 8000 epochs until convergence, with batch size 64. Results can be seen in Tab. 3, and the searched cells are in Appendix C.2. Again, we can see that DARTS's 2nd-order variant is much slower than its 1st-order variant, and that NASP is not only much faster than DARTS but also achieves test performance comparable with the other state-of-the-art methods.

Transferring to WikiText-2. Following (Liu, Simonyan, and Yang 2019), we test the transferability of the searched RNN cell on the WikiText-2 (WT2) dataset (Pham et al. 2018). We train a single-layer recurrent network with the cells searched on PTB for at most 8000 epochs. Results can be found in Tab. 7. Unlike the previous case with ImageNet, the performance obtained by NAS methods is not better than that of human-designed architectures. This is because WT2 is harder to transfer to, which is also observed in (Liu, Simonyan, and Yang 2019).

4.3 Ablation Study

Comparison with DARTS. In Sec. 4.1, we showed an overall comparison between DARTS and NASP. Here, we give detailed comparisons on updating the network's parameters (i.e., w) and the architecture (i.e., A). Timing results and search performance are in Tab. 5. First, NASP removes much of the computational cost, as it needs no 2nd-order approximation for A and propagates w only through the selected operations; this clearly justifies our motivation in Sec. 3.1. Second, the discretized Ā helps to decouple operations when updating w, which helps NASP find better operations in the larger search space. We also compare search time against validation accuracy in Fig. 3(a)-(b): with the same search time, NASP obtains higher accuracy, and it needs less time to reach the same accuracy. This further verifies the efficiency of NASP over DARTS.

Figure 3: Comparison of NASP and DARTS on convergence. (a) #ops = 12; (b) #ops = 7.

Finally, we illustrate why the second-order approximation is needed for DARTS but not for NASP. Recall from Sec. 2.2 that, since A changes continuously during the iterations, the second-order approximation serves to better capture w's impact on A.

Figure 4: Comparison of the changes of architecture parameters between DARTS and NASP.

Table 5: Detailed comparison of computation time per epoch (in seconds) between DARTS and the proposed NASP on CIFAR-10; architecture updates (A) use the validation set and weight updates (w) use the training set. Note that DARTS needs to search four times to obtain a good architecture (Liu, Simonyan, and Yang 2019), while the NASP performance comes from a single search; thus the speedup here is smaller than that in Tab. 2.
| # ops | Method | update A: 1st-order | update A: 2nd-order | update w: forward | update w: backward | total | error (%) | params (M) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | DARTS | 270 | 1315 | 103 | 162 | 1693 | 2.76 | 3.3 |
| 7 | NASP | 176 | - | 25 | 31 | 343 | 2.83 | 3.3 |
| 12 | DARTS | 489 | 2381 | 187 | 293 | 3060 | 3.0 | 8.4 |
| 12 | NASP | 303 | - | 32 | 15 | 483 | 2.5 | 7.4 |

Table 6: Classification errors of NASP and concurrent works on CIFAR-10. "Ops" denotes the number of operations in the search space; "Nodes" denotes the number of nodes in a cell.

| Architecture | Test Error (%) | Params (M) | Ops | Nodes | Time (GPU days) |
| --- | --- | --- | --- | --- | --- |
| ASAP (Noy et al. 2019) | 3.06 | 2.6 | 7 | 4 | 0.2 |
| ASNG (Akimoto et al. 2019) | 2.83 ± 0.14 | 3.9 | 5 | 5 | 0.1 |
| BayesNAS (Zhou et al. 2019) | 2.81 ± 0.04 | 3.40 ± 0.62 | 7 | 4 | 0.2 |
| GDAS (Dong and Yang 2019) | 2.82 | 2.5 | 8 | 4 | 0.3 |
| NASP (7 operations) + cutout | 2.83 ± 0.09 | 3.3 | 7 | 4 | 0.1 |
| NASP (12 operations) + cutout | 2.44 ± 0.04 | 7.4 | 12 | 4 | 0.2 |

Table 7: Comparison with state-of-the-art language models on WT2. SNAS does not provide code for RNNs, so its results are not reported.

| Architecture | Perplexity (valid) | Perplexity (test) | Params (M) |
| --- | --- | --- | --- |
| LSTM (Yang et al. 2018) | 66.0 | 63.3 | 33 |
| ENAS (Pham et al. 2018) | 72.4 | 70.4 | 33 |
| DARTS (2nd-order) | 71.2 | 69.6 | 33 |
| NASP | 70.4 | 69.5 | 33 |

Then, in Sec. 3.2, we argued that, since Ā is discrete, w's impact will not lead to frequent changes in the architecture, which removes the need to capture future dynamics using the second-order approximation. We plot A for DARTS and Ā for NASP in Fig. 4, where the x-axis represents the training epochs and the y-axis represents the operations (five operations are shown in the figure). There are 14 connections between nodes, so there are 14 subfigures in both Fig. 4(a) and Fig. 4(b). Indeed, Ā in NASP is more stable than A in DARTS, which verifies the correctness of our motivation.

Comparison with standard PA. Finally, we demonstrate the need for our designs in Sec. 3.2. CIFAR-10 with the small search space is used here. Three algorithms are compared: (1) PA (standard), given by (10); (2) PA (lazy-update), given by (11); and (3) NASP. Results are in Fig. 5(a) and Fig. 5(b). First, good performance cannot be obtained from a direct proximal step, due to the discrete constraint; the same observation has previously been made for binary networks (Courbariaux, Bengio, and David 2015). Second, PA (lazy-update) is much better than PA (standard) but still worse than NASP. This verifies the need to keep the elements of the matrix A in [0, 1], as it encourages better operations.

Figure 5: Comparison of NASP and direct usages of PA (i.e., simple adaptations of DARTS) on convergence. (a) #ops = 12; (b) #ops = 7.

4.4 Comparison with Concurrent Works

Several concurrent works appeared while we conducted this work. ASAP (Noy et al. 2019) and BayesNAS (Zhou et al. 2019) treat NAS as a network pruning problem: they remove operations that are not promising during the search. ASNG (Akimoto et al. 2019) and GDAS (Dong and Yang 2019) both apply a stochastic relaxation to the search space; the difference is that ASNG uses natural gradient descent (Amari 1998) for optimization, while GDAS uses the Gumbel-Max trick (Jang, Gu, and Poole 2016) with gradient descent. We compare NASP with them in Tab. 6 and Tab. 8. Note that ASAP, ASNG, and BayesNAS cannot be used for searching RNN architectures. We can see that NASP is more efficient than these works and offers better performance on the CNN task; besides, NASP can also be applied to RNNs.

Table 8: Comparison of NASP with concurrent works on PTB. ASAP, ASNG, and BayesNAS are not compared as they cannot be used for searching RNNs.

| Method | Perplexity (valid) | Perplexity (test) | Params (M) | Search Cost (GPU days) |
| --- | --- | --- | --- | --- |
| GDAS | 59.8 | 57.5 | 23 | 0.4 |
| NASP | 59.9 | 57.3 | 23 | 0.1 |

5 Conclusion

We introduce NASP, a fast and differentiable neural architecture search method based on proximal iterations. Compared with DARTS, NASP is more efficient and performs better.
The key contribution of NASP is the use of proximal iterations in the search process. With this approach, only one operation is active for each update, which saves much time and makes it possible to utilize a larger search space. Besides, NASP eliminates the correlation among operations. Experiments demonstrate that NASP is faster and obtains better performance than the baselines.

Acknowledgments

Q. Yao would like to thank Xiangning Chen, Yongqi Zhang, and Huan Zhao for their helpful feedback. Z. Zhu is supported in part by the National Natural Science Foundation of China (No. 61806009), the Beijing Natural Science Foundation (No. 4184090), and the Beijing Academy of Artificial Intelligence (BAAI).

References

Akimoto, Y.; Shirakawa, S.; Yoshinari, N.; Uchida, K.; Saito, S.; and Nishida, K. 2019. Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML, 171-180.
Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NeurIPS, 1709-1720.
Amari, S. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251-276.
Bai, Y.; Wang, Y.-X.; and Liberty, E. 2018. ProxQuant: Quantized neural networks via proximal operators. In ICLR.
Baker, B.; Gupta, O.; Naik, N.; and Raskar, R. 2017. Designing neural network architectures using reinforcement learning. In ICLR.
Cai, H.; Zhu, L.; and Han, S. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR.
Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 3123-3131.
Dong, X., and Yang, Y. 2019. Searching for a robust neural architecture in four GPU hours. In CVPR, 1761-1770.
Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; and Sun, J. 2019. Single path one-shot neural architecture search with uniform sampling. Technical report, arXiv.
Hou, L.; Yao, Q.; and Kwok, J. 2017. Loss-aware binarization of deep networks. In ICLR.
Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. CVPR.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. 2017. Densely connected convolutional networks. In CVPR, 4700-4708.
Hutter, F.; Kotthoff, L.; and Vanschoren, J., eds. 2018. Automated Machine Learning: Methods, Systems, Challenges. Springer.
Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-Softmax. In ICLR.
Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
Lee, D., and Seung, S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401:788-791.
Liu, C.; Zoph, B.; Shlens, J.; Hua, W.; Li, L.; Li, F.-F.; Yuille, A.; Huang, J.; and Murphy, K. 2018. Progressive neural architecture search. In ECCV.
Liu, H.; Simonyan, K.; and Yang, Y. 2019. DARTS: Differentiable architecture search. In ICLR.
Luo, R.; Tian, F.; Qin, T.; Chen, E.; and Liu, T.-Y. 2018. Neural architecture optimization. In NeurIPS.
Ma, N.; Zhang, X.; Zheng, H.; and Sun, J. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 122-138.
Noy, A.; Nayman, N.; Ridnik, T.; Zamir, N.; Doveh, S.; Friedman, I.; Giryes, R.; and Zelnik-Manor, L. 2019. ASAP: Architecture search, anneal and prune. Technical report, arXiv preprint arXiv:1904.04123.
Parikh, N., and Boyd, S. 2013. Proximal algorithms. Foundations and Trends in Optimization 1(3):123-231.
Pham, H.; Guan, M.; Zoph, B.; Le, Q.; and Dean, J. 2018. Efficient neural architecture search via parameter sharing. Technical report, arXiv preprint.
Real, E.; Aggarwal, A.; Huang, T.; and Le, Q. 2018. Regularized evolution for image classifier architecture search. Technical report, arXiv preprint.
Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction. MIT Press.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1-9.
Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; and Le, Q. 2018. MnasNet: Platform-aware neural architecture search for mobile. Technical report, arXiv preprint.
Xiao, L. 2010. Dual averaging methods for regularized stochastic learning and online optimization. JMLR 11(Oct):2543-2596.
Xie, L., and Yuille, A. 2017. Genetic CNN. In ICCV.
Xie, S.; Zheng, H.; Liu, C.; and Lin, L. 2019. SNAS: Stochastic neural architecture search. In ICLR.
Yang, Z.; Dai, Z.; Salakhutdinov, R.; and Cohen, W. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR.
Yao, Q.; Kwok, J.; Gao, F.; Chen, W.; and Liu, T.-Y. 2017. Efficient inexact proximal gradient algorithm for nonconvex problems. In IJCAI, 3308-3314. AAAI Press.
Yao, Q.; Wang, M.; Chen, Y.; Dai, W.; Hu, Y.; Li, Y.; Tu, W.-W.; Yang, Q.; and Yu, Y. 2018. Taking human out of learning applications: A survey on automated machine learning. Technical report, arXiv preprint.
Zhong, Z.; Yan, J.; Wu, W.; Shao, J.; and Liu, C.-L. 2018. Practical block-wise neural network architecture generation. In CVPR.
Zhou, H.; Yang, M.; Wang, J.; and Pan, W. 2019. BayesNAS: A Bayesian approach for neural architecture search. In ICML, 7603-7613.
Zoph, B., and Le, Q. 2017. Neural architecture search with reinforcement learning. In ICLR.
Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. 2017. Learning transferable architectures for scalable image recognition. In CVPR.