Published as a conference paper at ICLR 2020

ATOMNAS: FINE-GRAINED END-TO-END NEURAL ARCHITECTURE SEARCH

Jieru Mei1, Yingwei Li1, Xiaochen Lian2, Xiaojie Jin2, Linjie Yang2, Alan Yuille1 & Jianchao Yang2
1Johns Hopkins University, 2ByteDance AI Lab
meijieru@gmail.com, yingwei.li@jhu.edu, {xiaochen.lian, jinxiaojie, linjie.yang}@bytedance.com, alan.l.yuille@gmail.com, yangjianchao@bytedance.com
(This work was done during the internship program at ByteDance.)

ABSTRACT

Search space design is critical to neural architecture search (NAS) algorithms. We propose a fine-grained search space built from atomic blocks, a minimal search unit that is much smaller than the ones used in recent NAS algorithms. This search space allows a mix of operations by composing different types of atomic blocks, while the search spaces of previous methods only allow homogeneous operations. Based on this search space, we propose a resource-aware architecture search framework which automatically assigns the computational resources (e.g., output channel numbers) of each operation by jointly considering the performance and the computational cost. In addition, to accelerate the search process, we propose a dynamic network shrinkage technique which prunes atomic blocks with negligible influence on the outputs on the fly. Instead of a search-and-retrain two-stage paradigm, our method simultaneously searches and trains the target architecture. Our method achieves state-of-the-art performance under several FLOPs configurations on ImageNet with a small search cost. We open-source our entire codebase at https://github.com/meijieru/AtomNAS.

1 INTRODUCTION

Human-designed neural networks have already been surpassed by machine-designed ones. Neural Architecture Search (NAS) has become the mainstream approach to discover efficient and powerful network structures (Zoph & Le, 2017; Pham et al., 2018; Tan et al., 2019; Liu et al., 2019a). Although the tedious search process is conducted by machines, humans are still involved extensively in the design of NAS algorithms.

The design of search spaces is critical for NAS algorithms, and different choices have been explored. Cai et al. (2019) and Wu et al. (2019) utilize supernets with multiple choices in each layer to accommodate a sampled network on the GPU. Chen et al. (2019b) progressively grow the depth of the supernet and remove unnecessary blocks during the search. Tan & Le (2019a) propose to search the scaling factors of image resolution, channel multiplier and layer number in scenarios with different computation budgets. Stamoulis et al. (2019a) propose to use different kernel sizes in each layer of the supernet and reuse the weights of larger kernels for smaller kernels. Howard et al. (2019); Tan & Le (2019b) adopt Inverted Residuals with Linear Bottlenecks (the MobileNetV2 block) (Sandler et al., 2018), a building block with lightweight depth-wise convolutions for highly efficient networks in mobile scenarios.

However, the proposed search spaces generally have only a small set of choices for each block. DARTS and related methods (Liu et al., 2019a; Chen et al., 2019b; Liang et al., 2019) use around 10 different operations between two network nodes. Howard et al. (2019); Cai et al. (2019); Wu et al. (2019); Stamoulis et al. (2019a) search the expansion ratios of the MobileNetV2 block but still limit them to a few discrete values.
We argue that a search space of finer granularity is critical for finding optimal neural architectures. Specifically, the searched building block in a supernet should be as small as possible to generate the most diversified model structures.

We revisit the architectures of state-of-the-art networks (Howard et al., 2019; Tan & Le, 2019b; He et al., 2016) and discover a commonly used building structure: convolution - channel-wise operation - convolution. We reinterpret this structure as an ensemble of computationally independent blocks, which we call atomic blocks. With the atomic block as the minimum search unit, we obtain a much larger and more fine-grained search space, within which we are able to search for mixed operations (e.g., convolutions with different kernel sizes) and their channel numbers.

For efficient exploration of the new search space, we propose a NAS framework named AtomNAS which applies network pruning techniques to architecture search. Specifically, we start from a large initial supernet and rewrite every convolution - channel-wise operation - convolution structure as a weighted sum of atomic blocks; the weights reflect the contribution of the atomic blocks to the network capacity and are called importance factors. For each atomic block, a penalty term proportional to its FLOPs is enforced on its importance factor; effectively, the penalty makes AtomNAS favor atomic blocks with fewer FLOPs. By minimizing the combination of the original network loss and the total penalty on the weights, AtomNAS learns both the parameters of the network and the weights of the atomic blocks. At the end of learning, atomic blocks with very small weights (e.g., < 0.001) are removed from the network and we obtain the final network, which has fewer FLOPs. Since the pruned atomic blocks contribute little to the network output due to their negligible weights, the final network does not need to be retrained or finetuned.

Training the large supernet is computationally demanding. We observe that for many pruned atomic blocks, their weights diminish at an early stage of learning and never revive throughout the rest of training. We propose a dynamic network shrinkage technique which removes those atomic blocks on the fly and greatly reduces the run time of AtomNAS.

In our experiments, our method achieves 75.9% top-1 accuracy on the ImageNet dataset at around 360M FLOPs, which is 0.9% higher than the state-of-the-art model (Stamoulis et al., 2019a). By further incorporating additional modules, our method achieves 77.6% top-1 accuracy, outperforming MixNet by 0.6% with 363M FLOPs, a new state-of-the-art under the mobile scenario. In summary, the major contributions of our work are:

1. We design a fine-grained search space which includes the exact number of channels and mixed operations (e.g., combinations of different convolution kernels).
2. We propose a NAS framework, AtomNAS. Within this framework, an efficient end-to-end NAS algorithm simultaneously searches the network architecture and trains the final model. No finetuning is needed after the algorithm finishes.
3. With the proposed search space and AtomNAS, we achieve state-of-the-art performance on the ImageNet dataset under the mobile setting.

2 RELATED WORK

2.1 NEURAL ARCHITECTURE SEARCH

Recently, there has been growing interest in automated neural architecture design.
Reinforcement learning based NAS methods (Zoph & Le, 2017; Tan et al., 2019; Tan & Le, 2019b;a) are usually computationally intensive, which hampers their usage under limited computational budgets. To accelerate the search procedure, ENAS (Pham et al., 2018) represents the search space as a directed acyclic graph and aims to search for the optimal subgraph within the large supergraph; a parameter-sharing strategy among subgraphs is proposed to significantly increase search efficiency. The similar idea of optimizing subgraphs within a supergraph is also adopted by Liu et al. (2019a); Jin et al. (2019); Xu et al. (2020); Wu et al. (2019); Guo et al. (2019); Cai et al. (2019). Stamoulis et al. (2019a); Yu et al. (2020) further share the parameters of different paths within a block using a super-kernel representation.

A prominent disadvantage of the above methods is that their coarse search spaces only support selecting one out of a set of choices (e.g., selecting one kernel size from {3, 5, 7}). MixNet tries to benefit from mixed operations by using a predefined set of kernel-size combinations {{3}, {3, 5}, {3, 5, 7}, {3, 5, 7, 9}}, where the channels are equally distributed among the different kernel sizes. Due to this limitation, it is difficult to learn optimal architectures under computational resource constraints. On the contrary, our method takes advantage of the fine-grained search space and is able to search for more flexible network architectures satisfying various resource constraints. The fine-grained search space proposed in this paper is exponentially larger than previous search spaces: the total number of possible structures in our experiments is around 10^162, compared with 10^21 for FBNet. Recently, to improve the final performance of the searched architectures, Yu et al. (2020) utilize knowledge distillation, which is orthogonal to our method and could easily be integrated via Eq. (5) thanks to our end-to-end learning paradigm.

Figure 1: Illustration of the ensemble perspective. Arrows denote operators. The structure of two convolutions joined by a channel-wise operation is mathematically equivalent to an ensemble of multiple atomic blocks, according to Eq. (2). Colored rectangles represent tensors, with the numbers inside indicating their channel numbers; the shaded path on the right is one example of an atomic block.

2.2 NETWORK PRUNING

Assuming that many parameters in a network are unnecessary, network pruning methods start from a computation-intensive model, identify the unimportant connections and remove them to obtain a compact and efficient network. An early method (Han et al., 2016) simultaneously learns the important connections and their weights. However, the irregular removal of connections in such works makes it hard to achieve the theoretical speedup ratio on real hardware due to the extra overhead of caching and indexing. To tackle this problem, structured network pruning methods (He et al., 2017b; Liu et al., 2017; Luo et al., 2017; Ye et al., 2018; Gordon et al., 2018) prune structured components of networks, e.g., entire channels or kernels. In this way, empirical acceleration can be achieved on modern computing devices. Liu et al. (2017); Ye et al. (2018); Gordon et al. (2018) encourage channel-level sparsity by imposing an L1 regularizer on the channel dimension, which is also used by our method.
Recently, Liu et al. (2019b) show that in structured network pruning, the learned weights are unimportant. This suggests that structured network pruning is actually a form of neural architecture search focusing on channel numbers. Our method jointly searches the channel numbers and a mix of operations, which is a much larger search space.

3 METHOD

We formulate our neural architecture search method in a fine-grained search space with the atomic block as the basic search unit. An atomic block is comprised of two convolutions connected by a channel-wise operation. By stacking atomic blocks, we obtain the larger building blocks (e.g., the residual block and the MobileNetV2 block) used in a variety of state-of-the-art models, including ResNet and MobileNetV2/V3 (He et al., 2016; Sandler et al., 2018; Howard et al., 2019). In Section 3.1, we first show that larger network building blocks (e.g., the MobileNetV2 block) can be represented as ensembles of atomic blocks. Based on this view, we propose a fine-grained search space using atomic blocks. In Section 3.2, we propose a resource-aware atomic block selection method for end-to-end architecture search. Finally, we propose a dynamic network shrinkage technique in Section 3.3, which greatly reduces the search cost.

3.1 FINE-GRAINED SEARCH SPACE

Under the typical block-wise NAS paradigm (Tan et al., 2019; Tan & Le, 2019b), the search space of each block in a neural network is represented as the Cartesian product $C = \prod_i P_i$, where each $P_i$ is the set of all choices of the $i$-th configuration, such as kernel size, number of channels and type of operation. For example, $C = \{\text{conv}, \text{depth-wise conv}, \text{dilated conv}\} \times \{3, 5\} \times \{24, 32, 64, 128\}$ represents a search space of three types of convolutions by two kernel sizes and four options of channel number. A block in the resulting model can only pick one convolution type from the three and one output channel number from the four values. This paradigm greatly limits the search space due to the few choices of each configuration.

Here we present a more fine-grained search space by decomposing the network into smaller and more basic building blocks. We denote by $f^{c,c'}(X)$ a convolution operator, where $X$ is the input tensor and $c$, $c'$ are the input and output channel numbers respectively. A wide range of manually-designed and NAS architectures share a structure that joins two convolutions by a channel-wise operation:

$$Y = f_1^{c',c''}\Big(g\big(f_0^{c,c'}(X)\big)\Big), \qquad (1)$$

where $g$ is a channel-wise operator. For example, in VGG (Simonyan & Zisserman, 2015) and the residual block (He et al., 2016), $f_0$ and $f_1$ are convolutions and $g$ is one of MaxPool, ReLU and BN-ReLU; in a MobileNetV2 block (Sandler et al., 2018), $f_0$ and $f_1$ are point-wise convolutions and $g$ is a depth-wise convolution with BN-ReLU. Eq. (1) can be reformulated as follows:

$$Y = \sum_{i=1}^{c'} f_1^{1,c''}[i,:]\Big(g[i,:]\big(f_0^{c,1}[:,i](X)\big)\Big), \qquad (2)$$

where $f_0^{c,1}[:,i]$ is the $i$-th convolution kernel of $f_0$, $g[i,:]$ is the operator acting on the $i$-th channel of $g$, and $\{f_1^{1,c''}[i,:]\}_{i=1}^{c'}$ are obtained by splitting the kernel tensor of $f_1$ along the input channel dimension. Each term in the summation can be seen as a computationally independent block, which we call an atomic block. Fig. 1 illustrates this reformulation. By deciding individually whether to keep each atomic block in the final model, the search over the intermediate channel number $c'$ is enabled through channel selection, which greatly enlarges the search space.
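To make the ensemble view of Eq. (2) concrete, the following is a minimal numerical check written by us in PyTorch (it is not code from the released AtomNAS repository; all sizes and names are illustrative). It builds a 1x1 conv -> depth-wise conv -> 1x1 conv structure and verifies that its output equals the sum of its per-channel atomic blocks:

```python
# Sketch (not from the paper's codebase): verify the ensemble view of Eq. (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

c_in, c_mid, c_out = 8, 16, 8
f0 = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)      # f_0: c -> c'
g = nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=1,
              groups=c_mid, bias=False)                      # channel-wise operator g
f1 = nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False)      # f_1: c' -> c''

x = torch.randn(2, c_in, 14, 14)
y_full = f1(g(f0(x)))                                        # Eq. (1)

# Eq. (2): sum over atomic blocks, each using one intermediate channel i.
y_sum = torch.zeros_like(y_full)
for i in range(c_mid):
    t = F.conv2d(x, f0.weight[i:i + 1])                      # f_0[:, i]
    t = F.conv2d(t, g.weight[i:i + 1], padding=1)            # g[i, :]
    y_sum += F.conv2d(t, f1.weight[:, i:i + 1])              # f_1[i, :]

print(torch.allclose(y_full, y_sum, atol=1e-5))              # expected: True
```

Because the intermediate channels are computationally independent, any subset of the $c'$ atomic blocks can be dropped without affecting the remaining ones, which is exactly what enables fine-grained channel search.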
This formulation also naturally includes the selection of operators. To gain a better understanding, we first generalize Eq. (2) as:

$$Y = \sum_{i=1}^{c'} f_{1i}^{1,c''}\Big(g_i\big(f_{0i}^{c,1}(X)\big)\Big). \qquad (3)$$

Note that the array indices $i$ are moved to subscripts. In this formulation, we can use different types of operators for $f_{0i}$, $f_{1i}$ and $g_i$; in other words, $f_0$, $f_1$ and $g$ can each be a combination of different operators, and each atomic block can use different operators such as convolutions with different kernel sizes. Formally, the search space is formulated as a supernet built from the structure in Eq. (1); such a structure satisfies Eq. (3) and thus can be represented by atomic blocks, where each of $f_0$, $f_1$ and $g$ is a combination of operators.

The new search space includes some state-of-the-art network architectures. For example, by allowing $g$ to be a combination of convolutions with different kernel sizes, the MixConv block in MixNet (Tan & Le, 2019b) becomes a special case of our search space. In addition, our search space facilitates discarding any number of channels in $g$, resulting in a more fine-grained channel configuration. In comparison, the channel numbers are determined heuristically in Tan & Le (2019b).

3.2 RESOURCE-AWARE ATOMIC BLOCK SEARCH

In this work, we adopt a differentiable neural architecture search paradigm where the model structure is discovered in a full pass of model training. With the supernet defined above, the final model can be produced by discarding part of the atomic blocks during training. Following DARTS (Liu et al., 2019a), we introduce an importance factor $\alpha$ to scale the output of each atomic block in the supernet. Eq. (3) then becomes

$$Y = \sum_{i=1}^{c'} \alpha_i\, f_{1i}^{1,c''}\Big(g_i\big(f_{0i}^{c,1}(X)\big)\Big). \qquad (4)$$

Here, each $\alpha_i$ is tied to an atomic block comprised of the three operators $f_{0i}^{c,1}$, $g_i$ and $f_{1i}^{1,c''}$. The importance factors are learned jointly with the network weights. Once training finishes, the atomic blocks that have negligible effect on the network output (i.e., those with factors smaller than a threshold) are discarded.

We still need to address two issues related to the importance factors $\alpha_i$. The first issue is where in the supernet to place $\alpha$. Let's first consider the case where $g$ only contains linear operations, e.g., convolution, batch normalization and ReLU-like activations. If $g$ contains at least one BN layer, the scaling parameters of the BN layers can be directly used as the importance factors (Liu et al., 2017). If $g$ has no BN layers, which is rare, we can place $\alpha$ anywhere between $f_0$ and $f_1$; however, we need to apply regularization terms to the weights of $f_0$ and $f_1$ (e.g., weight decay) to prevent them from growing too large and canceling the effect of $\alpha$. When $g$ contains non-linear operations, e.g., the Swish or Sigmoid activation, we can only put $\alpha$ after $f_1$.

The second issue is how to avoid performance deterioration after discarding some of the atomic blocks. For example, DARTS discards operations with small scale factors after iterative training of model parameters and scale factors. Since the scale factors of the discarded operations are not small enough, the performance of the network is affected and re-training is needed to adjust the weights again. In order to maintain the performance of the supernet after dropping some atomic blocks, the importance factors $\alpha$ of those atomic blocks should be sufficiently small. Inspired by the channel pruning work of Liu et al. (2017), we add an L1-norm penalty on $\alpha$, which effectively pushes many importance factors to near-zero values.
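As an illustration of the first point, when $g$ ends with a BN layer its per-channel scale $\gamma$ can play the role of $\alpha$ in Eq. (4): the L1 penalty is applied to $\gamma$, and near-zero entries mark atomic blocks that can be dropped. The sketch below is our own simplification under these assumptions (module and attribute names such as `expand`, `dw`, `project` are not from the released code):

```python
# Sketch: treating BN scales as the importance factors alpha of Eq. (4).
import torch
import torch.nn as nn

class SearchableUnit(nn.Module):
    """f0 (1x1 conv) -> g (depth-wise conv + BN + ReLU) -> f1 (1x1 conv)."""
    def __init__(self, c_in, c_mid, c_out, k=3):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.dw = nn.Conv2d(c_mid, c_mid, k, padding=k // 2, groups=c_mid, bias=False)
        self.bn = nn.BatchNorm2d(c_mid)        # bn.weight (gamma) plays the role of alpha
        self.project = nn.Conv2d(c_mid, c_out, 1, bias=False)

    def forward(self, x):
        return self.project(torch.relu(self.bn(self.dw(self.expand(x)))))

    def importance(self):
        return self.bn.weight.detach().abs()   # one alpha per atomic block

unit = SearchableUnit(16, 96, 16)
l1_penalty = unit.bn.weight.abs().sum()        # added to the training loss
keep = unit.importance() > 1e-3                # atomic blocks surviving the threshold
```

Since each $\gamma$ multiplies exactly one intermediate channel, pushing it to zero silences one atomic block without affecting the others.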
At the end of learning, atomic blocks with $\alpha$ close to zero are removed from the supernet. Note that since the BN scales change more dramatically during training due to the regularization term, the running statistics of the BNs might be inaccurate and need to be recalculated using the training set.

With the added regularization term, the training loss is

$$L = E + \lambda \sum_{i \in S} c_i |\alpha_i|, \qquad (5)$$

$$c_i = \hat{c}_i \Big/ \sum_{k \in S} \hat{c}_k, \qquad (6)$$

where $\lambda$ is the coefficient of the L1 penalty term, $S$ is the index set of all atomic blocks, and $E$ is the conventional training loss (e.g., the cross-entropy loss combined with regularization terms such as weight decay, and optionally a distillation loss). Each $|\alpha_i|$ is weighted by a coefficient $c_i$ proportional to the computation cost $\hat{c}_i$ of the $i$-th atomic block. By using computation-cost-aware regularization, we encourage the model to learn network structures that strike a good balance between accuracy and efficiency. In this paper, we use FLOPs as the criterion of computation cost; other metrics such as latency and energy consumption can be used similarly. As a result, the whole loss function $L$ trades off accuracy against FLOPs.

3.3 DYNAMIC NETWORK SHRINKAGE

Usually, the supernet is much larger than the final search result. We observe that many atomic blocks become "dead" starting from an early stage of the search, i.e., their importance factors $\alpha$ are close to zero and stay so until the end of the search. To utilize computational resources more efficiently and speed up the search process, we propose a dynamic network shrinkage algorithm which cuts down the network architecture by removing atomic blocks once they are deemed dead. We adopt a conservative strategy to decide whether an atomic block is dead: for each importance factor $\alpha$, we maintain its momentum (an exponential moving average) $\hat{\alpha}$, which is updated as

$$\hat{\alpha} \leftarrow \beta \hat{\alpha} + (1 - \beta)\,\alpha_t, \qquad (7)$$

where $\alpha_t$ is the importance factor at the $t$-th iteration and $\beta$ is the decay term. An atomic block is considered dead if both $\hat{\alpha}$ and $\alpha_t$ are smaller than a threshold, which is set to 0.001 throughout the experiments. Once the total FLOPs of the dead blocks reach a predefined threshold, we remove those blocks from the supernet. As discussed above, we recalculate the BN running statistics before deploying the network. The whole training process is presented in Algorithm 1.

Algorithm 1: Dynamic network shrinkage
    Initialize the supernet and the exponential moving average;
    while epoch <= max_epoch do
        Update the network weights and importance factors α by minimizing the loss function L;
        Update α̂ by Eq. (7);
        if the total FLOPs of dead blocks exceeds the predefined threshold then
            Remove the dead blocks from the supernet;
        end
        Recalculate the BN statistics by forwarding some training examples;
        Validate the performance of the current supernet;
    end

Figure 2: FLOPs of the supernet during searching and training for AtomNAS-C (x-axis: epoch). The crossed-out region corresponds to the computation saved compared to training the supernet without dynamic shrinkage. The region in yellow corresponds to the extra cost compared with training the final model from scratch, whose cost is the region below the red dashed line.

We show the FLOPs of a sample network during the search process in Fig. 2. We start from a supernet with 1521M FLOPs and dynamically discard dead atomic blocks to reduce the search cost. The overall search-and-train cost only increases by 17.2% compared to training the searched model from scratch.
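For concreteness, the following sketch (ours, not the released code; the per-block tensors `alphas` and `flops` and the class names are assumptions) shows the FLOPs-weighted penalty of Eqs. (5)-(6) and the dead-block test of Eq. (7) used inside Algorithm 1:

```python
# Sketch of the resource-aware penalty (Eqs. 5-6) and the dead-block test (Eq. 7).
import torch

def penalty(alphas: torch.Tensor, flops: torch.Tensor, lam: float) -> torch.Tensor:
    """L1 penalty weighted by normalized FLOPs: lam * sum_i c_i * |alpha_i|."""
    c = flops / flops.sum()                          # Eq. (6)
    return lam * (c * alphas.abs()).sum()            # Eq. (5), penalty part

class DeadBlockTracker:
    """Maintains the EMA alpha_hat of Eq. (7) and flags 'dead' atomic blocks."""
    def __init__(self, num_blocks: int, beta: float = 0.9999, thresh: float = 1e-3):
        self.ema = torch.zeros(num_blocks)           # initialization is illustrative
        self.beta, self.thresh = beta, thresh

    def update(self, alphas: torch.Tensor) -> torch.Tensor:
        a = alphas.detach().abs()
        self.ema = self.beta * self.ema + (1 - self.beta) * a
        # dead if both the EMA and the current value fall below the threshold
        return (self.ema < self.thresh) & (a < self.thresh)

# Toy usage: total loss = task_loss + penalty(alphas, flops, lam=1.0e-4)
alphas = torch.rand(96, requires_grad=True)
flops = torch.randint(1, 100, (96,)).float()
dead = DeadBlockTracker(96).update(alphas)           # boolean mask of removable blocks
```

Once the FLOPs summed over the `dead` mask exceed the predefined threshold, the corresponding channels are physically removed from the supernet, shrinking both memory use and step time.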
4 EXPERIMENT

We first describe the implementation details in Section 4.1 and then compare AtomNAS with previous state-of-the-art methods under various FLOPs constraints in Section 4.2. In Section 4.3, we provide a more detailed analysis of AtomNAS. Finally, in Section 4.4, we demonstrate the transferability of AtomNAS networks by evaluating them on detection and instance segmentation tasks.

4.1 IMPLEMENTATION DETAILS

The architecture of the supernet used in our experiments is shown in the table on the right of Fig. 3. The supernet contains 21 AtomNAS blocks, the searchable blocks of our supernet; the diagram on the left of Fig. 3 illustrates the structure of an AtomNAS block, where $f_0$ is a 1×1 pointwise convolution that expands the input channel number from $C$ to $3 \times 6C$, $g$ is a mix of three depth-wise convolutions with kernel sizes 3×3, 5×5 and 7×7, and $f_1$ is another 1×1 pointwise convolution that projects the channel number to the output channel number. Similar to MobileNetV2 (Sandler et al., 2018), if the output dimension stays the same as the input dimension, we use a skip connection to add the input to the output. An AtomNAS block is effectively an ensemble of $3 \times 6C$ atomic blocks, whose underlying search space covers the MobileNetV2 block (Sandler et al., 2018) and its multi-kernel variant MixConv (Tan & Le, 2019b). Within an AtomNAS block, we are able to optimize the distribution of computation resources (i.e., channel numbers) among the three types of depth-wise convolution.

Figure 3: (Left) The searchable block of the supernet. $f_0$ and $f_1$ are fixed to 1×1 pointwise convolutions; $g$ is a mix of three convolutions with kernel sizes 3×3, 5×5 and 7×7. $f_0$ expands the input channel number from $C$ to $18C$ and $f_1$ projects to the output channel number. If the output dimension stays the same as the input dimension, a skip connection adds the input to the output. (Right) Architecture of the supernet. Column "Block" denotes the block type; "MB" denotes the MobileNetV2 block; "searchable" denotes the searchable block shown on the left. Column "f" denotes the output channel number of a block, column "n" the number of blocks, and column "stride" the stride of the first block in a stage. The output channel number of the first convolution is 16 for AtomNAS-A and 32 for AtomNAS-B and AtomNAS-C.

| Input shape | Block | f | n | stride |
|---|---|---|---|---|
| 224² × 3 | 3×3 conv | 32 (16) | 1 | 2 |
| 112² × 32 (16) | 3×3 MB | 16 | 1 | 1 |
| 112² × 16 | searchable | 24 | 4 | 2 |
| 56² × 24 | searchable | 40 | 4 | 2 |
| 28² × 40 | searchable | 80 | 4 | 2 |
| 14² × 80 | searchable | 96 | 4 | 1 |
| 14² × 96 | searchable | 192 | 4 | 2 |
| 7² × 192 | searchable | 320 | 1 | 1 |
| 7² × 320 | avgpool | - | 1 | 1 |
| 1280 | fc | 1000 | 1 | - |
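A minimal PyTorch sketch of this searchable block, written by us under the assumptions above (the released code may structure it differently, e.g., attaching BN-scale importance factors per channel):

```python
# Sketch of the searchable block in Fig. 3 (left): f0 expands C -> 18C, g splits the
# 18C channels evenly over 3x3 / 5x5 / 7x7 depth-wise convolutions, f1 projects back.
import torch
import torch.nn as nn

class AtomNASBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1, expand=6, kernels=(3, 5, 7)):
        super().__init__()
        c_mid = expand * c_in                          # channels per kernel size
        self.f0 = nn.Sequential(
            nn.Conv2d(c_in, c_mid * len(kernels), 1, bias=False),
            nn.BatchNorm2d(c_mid * len(kernels)), nn.ReLU(inplace=True))
        self.g = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_mid, c_mid, k, stride, k // 2, groups=c_mid, bias=False),
                nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
            for k in kernels)
        self.f1 = nn.Sequential(
            nn.Conv2d(c_mid * len(kernels), c_out, 1, bias=False),
            nn.BatchNorm2d(c_out))
        self.skip = stride == 1 and c_in == c_out

    def forward(self, x):
        t = self.f0(x)
        chunks = torch.chunk(t, len(self.g), dim=1)    # one chunk per kernel size
        t = torch.cat([op(c) for op, c in zip(self.g, chunks)], dim=1)
        out = self.f1(t)
        return out + x if self.skip else out

y = AtomNASBlock(16, 24, stride=2)(torch.randn(1, 16, 112, 112))  # -> (1, 24, 56, 56)
```

In the full supernet, the L1-penalized importance factors then prune individual channels inside each of the three depth-wise branches, yielding a per-block mix of 3×3/5×5/7×7 channels rather than a single kernel size.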
We use the same training configuration (e.g., RMSProp optimizer, EMA on the weights and exponential learning rate decay) as Tan et al. (2019); Stamoulis et al. (2019a) and do not use extra data augmentation such as MixUp (Zhang et al., 2018) or AutoAugment (Cubuk et al., 2018); we find this configuration sufficient for our method to achieve good performance. Our results are shown in Table 1 and Table 3. When training the supernet, we use a total batch size of 2048 on 32 Tesla V100 GPUs and train for 350 epochs. For the dynamic network shrinkage algorithm, we set the momentum factor $\beta$ in Eq. (7) to 0.9999.

At the beginning of training, all weights are randomly initialized. To avoid prematurely removing atomic blocks with high penalties (i.e., high FLOPs), the weight of the penalty term in Eq. (5) is increased from 0 to the target $\lambda$ by a linear scheduler during the first 25 epochs. By setting the weight of the L1 penalty term $\lambda$ to 1.8 × 10⁻⁴, 1.2 × 10⁻⁴ and 1.0 × 10⁻⁴ respectively, we obtain networks of three different sizes: AtomNAS-A, AtomNAS-B and AtomNAS-C. They have similar FLOPs to previous state-of-the-art networks under 400M FLOPs: MixNet-S (Tan & Le, 2019b), MixNet-M (Tan & Le, 2019b) and SinglePath (Stamoulis et al., 2019a). In Appendix A, we visualize the architecture of AtomNAS-C.

4.2 EXPERIMENTS ON IMAGENET

We apply AtomNAS to search for high-performance light-weight models on the ImageNet 2012 classification task (Deng et al., 2009). Table 1 compares our method with previous state-of-the-art models, either manually designed or searched. With models directly produced by AtomNAS, our method achieves a new state-of-the-art under all FLOPs constraints. In particular, AtomNAS-C achieves 75.9% top-1 accuracy with only 360M FLOPs and surpasses all other models, including models like PDARTS and DenseNAS which have much higher FLOPs.

Figure 4: FLOPs versus ImageNet top-1 accuracy for AtomNAS, AtomNAS+, MobileNetV2, ShuffleNetV2, FBNet, Proxyless, DARTS, SinglePath, MnasNet and EfficientNet. Marked methods use extra techniques such as the Swish activation and the Squeeze-and-Excitation module.

Techniques like the Swish activation function (Ramachandran et al., 2018) and the Squeeze-and-Excitation (SE) module (Hu et al., 2018) consistently improve accuracy at marginal FLOPs cost. For a fair comparison with methods that use these techniques, we directly modify the searched networks by replacing all ReLU activations with Swish and adding an SE module with ratio 0.5 to every block, and then retrain the networks from scratch. Note that unlike other methods, we do not search the configuration of Swish and SE, so the performance might not be optimal. Extra data augmentations such as MixUp and AutoAugment are still not used. We train these models from scratch with a total batch size of 4096 on 32 Tesla V100 GPUs for 250 epochs.

Simply adding these techniques improves the results further. AtomNAS-A+ achieves 76.3% top-1 accuracy with 260M FLOPs, which outperforms many heavier models including MnasNet-A2. Without extra data augmentation, it performs as well as EfficientNet-B0 (Tan & Le, 2019a) while using 130M fewer FLOPs. It also outperforms the previous state-of-the-art MixNet-S by 0.5%. In addition, AtomNAS-C+ improves the top-1 accuracy on ImageNet to 77.6%, surpassing the previous state-of-the-art MixNet-M by 0.6% and becoming the overall best-performing model under 400M FLOPs. Fig. 4 visualizes top-1 accuracy on ImageNet versus FLOPs for different models. It is clear that our fine-grained search space and the end-to-end resource-aware search method boost performance significantly.
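The "+" variants above only retrofit two standard modules onto the searched networks. A hedged sketch of what such a retrofit can look like (module names, the SE placement and the exact handling of the 0.5 ratio are our assumptions, not the paper's released code):

```python
# Sketch of the extra modules used by the AtomNAS-* "+" variants: Swish activation
# and a Squeeze-and-Excitation block with ratio 0.5.
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

class SqueezeExcite(nn.Module):
    def __init__(self, channels, ratio=0.5):
        super().__init__()
        hidden = max(1, int(channels * ratio))
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # squeeze
            nn.Conv2d(channels, hidden, 1), Swish(), # excite
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)                      # channel-wise re-weighting

def swap_relu_for_swish(model: nn.Module):
    """Recursively replace every ReLU in `model` with Swish (in place)."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, Swish())
        else:
            swap_relu_for_swish(child)
```

A typical usage would be `swap_relu_for_swish(model)` followed by wrapping each block's output with a `SqueezeExcite` module, after which the modified network is retrained from scratch as described above.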
4.3 ANALYSIS

4.3.1 RESOURCE-AWARE REGULARIZATION

To demonstrate the effectiveness of the resource-aware regularization in Section 3.2, we compare it with a baseline without the FLOPs-related coefficients $c_i$, which is widely used in network pruning (Liu et al., 2017; He et al., 2017b). Table 2 shows the results. First, using the same L1 penalty coefficient $\lambda$ = 1.0 × 10⁻⁴, the baseline yields a network with similar performance but much higher FLOPs; then, increasing $\lambda$ to 1.5 × 10⁻⁴, the baseline obtains a network with similar FLOPs but inferior performance (about 1.0% lower). In Fig. 6b we visualize the ratio of different types of atomic blocks in the baseline network obtained with $\lambda$ = 1.5 × 10⁻⁴. The baseline network keeps more atomic blocks in the earlier blocks, which have higher computation cost due to their higher input resolution. On the contrary, AtomNAS is aware of the resource constraint, thus keeping more atomic blocks in the later blocks and achieving much better performance.

4.3.2 BN RECALIBRATION

As the BN running statistics might be inaccurate, as explained in Section 3.2 and Section 3.3, we recalculate the running statistics of BN before inference by forwarding 131k randomly sampled training images through the network. Table 3 shows the impact of BN recalibration. The top-1 accuracies of AtomNAS-A, AtomNAS-B and AtomNAS-C on ImageNet improve by 1.4%, 1.7% and 1.2% respectively, which clearly shows the benefit of BN recalibration.

Table 1: Comparison with the state of the art on ImageNet under the mobile setting. (In the original table, separate markers denote methods using extra network modules such as the Swish activation and the Squeeze-and-Excitation module, methods using extra data augmentation such as MixUp and AutoAugment, and models searched and trained simultaneously.)

| Model | Parameters | FLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|
| MobileNetV1 (Howard et al., 2017) | 4.2M | 575M | 70.6 | 89.5 |
| MobileNetV2 (Sandler et al., 2018) | 3.4M | 300M | 72.0 | 91.0 |
| MobileNetV2 (our impl.) | 3.4M | 301M | 73.6 | 91.5 |
| MobileNetV2 (1.4) | 6.9M | 585M | 74.7 | 92.5 |
| ShuffleNetV2 (Ma et al., 2018) | 3.5M | 299M | 72.6 | - |
| ShuffleNetV2 2× | 7.4M | 591M | 74.9 | - |
| FBNet-A (Wu et al., 2019) | 4.3M | 249M | 73.0 | - |
| FBNet-C | 5.5M | 375M | 74.9 | - |
| Proxyless (mobile) (Cai et al., 2019) | 4.1M | 320M | 74.6 | 92.2 |
| SinglePath (Stamoulis et al., 2019a) | 4.4M | 334M | 75.0 | 92.2 |
| NASNet-A (Zoph & Le, 2017) | 5.3M | 564M | 74.0 | 91.6 |
| DARTS (second order) (Liu et al., 2019a) | 4.9M | 595M | 73.1 | - |
| PDARTS (cifar 10) (Chen et al., 2019b) | 4.9M | 557M | 75.6 | 92.6 |
| DenseNAS-A (Fang et al., 2019) | 7.9M | 501M | 75.9 | 92.6 |
| FairNAS-A (Chu et al., 2019b) | 4.6M | 388M | 75.3 | 92.4 |
| AtomNAS-A | 3.9M | 258M | 74.6 | 92.1 |
| AtomNAS-B | 4.4M | 326M | 75.5 | 92.6 |
| AtomNAS-C | 4.7M | 360M | 75.9 | 92.7 |
| SCARLET-A (Chu et al., 2019a) | 6.7M | 365M | 76.9 | 93.4 |
| MnasNet-A1 (Tan et al., 2019) | 3.9M | 312M | 75.2 | 92.5 |
| MnasNet-A2 | 4.8M | 340M | 75.6 | 92.7 |
| MixNet-S (Tan & Le, 2019b) | 4.1M | 256M | 75.8 | 92.8 |
| MixNet-M | 5.0M | 360M | 77.0 | 93.3 |
| EfficientNet-B0 (Tan & Le, 2019a) | 5.3M | 390M | 76.3 | 93.2 |
| SE-DARTS+ (Liang et al., 2019) | 6.1M | 594M | 77.5 | 93.6 |
| AtomNAS-A+ | 4.7M | 260M | 76.3 | 93.0 |
| AtomNAS-B+ | 5.5M | 329M | 77.2 | 93.5 |
| AtomNAS-C+ | 5.9M | 363M | 77.6 | 93.6 |

Table 2: Influence of awareness of the resource metric. The upper two rows use equal penalties for all atomic blocks; the last row uses our resource-aware atomic block selection.

| Penalty | λ | FLOPs | Top-1 (%) |
|---|---|---|---|
| equal | 1.0 × 10⁻⁴ | 445M | 76.1 |
| equal | 1.5 × 10⁻⁴ | 370M | 74.9 |
| resource-aware | 1.0 × 10⁻⁴ | 360M | 75.9 |

4.3.3 COST OF DYNAMIC NETWORK SHRINKAGE

Our dynamic network shrinkage algorithm speeds up the search-and-train process significantly. For AtomNAS-C, the total time for search and training is 25.5 hours; for reference, training the final architecture from scratch takes 22 hours. Note that as the supernet shrinks, both the GPU memory consumption and the forward-backward time are significantly reduced. It is thus possible to dynamically change the batch size once sufficient GPU memory becomes available, which would further speed up the whole procedure.
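Before the BN recalibration results in Table 3 below, here is a minimal sketch of the recalibration step from Section 4.3.2 (our own code, assuming a PyTorch model and data loader; the exact sampling procedure and image count in the released code may differ):

```python
# Sketch of BN recalibration: reset the running statistics and re-estimate them with
# forward passes over randomly sampled training images (the paper uses roughly 131k).
import torch
import torch.nn as nn

@torch.no_grad()
def recalibrate_bn(model: nn.Module, loader, num_images=131_000, device="cuda"):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
            m.momentum = None          # use a cumulative moving average
    model.train()                      # BN updates running stats only in train mode
    seen = 0
    for images, _ in loader:
        model(images.to(device))
        seen += images.size(0)
        if seen >= num_images:
            break
    model.eval()
```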
Table 3: Influence of BN recalibration (ImageNet top-1 accuracy, %).

| Model | w/o Recalibration | w/ Recalibration |
|---|---|---|
| AtomNAS-A | 73.2 | 74.6 (+1.4) |
| AtomNAS-B | 73.8 | 75.5 (+1.7) |
| AtomNAS-C | 74.7 | 75.9 (+1.2) |

4.4 EXPERIMENTS ON COCO DETECTION AND INSTANCE SEGMENTATION

In this section, we assess the performance of AtomNAS models as feature extractors for object detection and instance segmentation on the COCO dataset (Lin et al., 2014). We first pretrain AtomNAS models (without the Swish activation function (Ramachandran et al., 2018) and the Squeeze-and-Excitation (SE) module (Hu et al., 2018)) on ImageNet, use them as drop-in replacements for the backbone in the Mask R-CNN model (He et al., 2017a) by building the detection head on top of the last feature map, and finetune the models on COCO. We use the open-source MMDetection toolbox (Chen et al., 2019a). All models are trained on COCO train2017 with batch size 16 and evaluated on COCO val2017. Following the schedule used in the open-source implementation of TPU-trained Mask R-CNN (https://github.com/tensorflow/tpu/tree/master/models/official/mask_rcnn), the learning rate starts at 0.02 and is decreased by a factor of 10 at the 15th and 20th epochs; the models are trained for 23 epochs in total.

Table 4 compares the results with other baseline backbone models; the detection results of the baseline models are from Stamoulis et al. (2019b). All three AtomNAS models outperform the baselines on the object detection task. The results demonstrate that our models transfer better than the baselines, which may be because mixed operations (i.e., multi-scale kernels) are particularly important for object detection and instance segmentation.

Table 4: Comparison with baseline backbones on COCO object detection and instance segmentation. "Cls" denotes ImageNet top-1 accuracy; "detect-mAP" and "seg-mAP" denote mean average precision for detection and instance segmentation on COCO. The results of baseline models are from Stamoulis et al. (2019b). SinglePath+ (Stamoulis et al., 2019b) contains the SE module.

| Model | FLOPs | Cls (%) | detect-mAP (%) | seg-mAP (%) |
|---|---|---|---|---|
| MobileNetV2 (Sandler et al., 2018) | 301M | 73.6 | 30.5 | - |
| Proxyless (mobile) (Cai et al., 2019) | 320M | 74.6 | 32.9 | - |
| Proxyless (mobile) (our impl.) | 320M | 74.9 | 32.7 | 30.0 |
| SinglePath+ (Stamoulis et al., 2019b) | 353M | 75.6 | 33.0 | - |
| SinglePath (our impl.) | 334M | 75.0 | 32.0 | 29.7 |
| AtomNAS-A | 258M | 74.6 | 32.7 | 30.1 |
| AtomNAS-B | 326M | 75.5 | 33.6 | 30.8 |
| AtomNAS-C | 360M | 75.9 | 34.1 | 31.4 |

5 CONCLUSION

In this paper, we revisit a common structure, two convolutions joined by a channel-wise operation, and reformulate it as an ensemble of atomic blocks. This perspective enables a much larger and more fine-grained search space. To explore the huge fine-grained search space efficiently, we propose an end-to-end framework named AtomNAS, which conducts architecture search and network training jointly. The searched networks achieve significantly better accuracy than previous state-of-the-art methods at a small extra cost.

REFERENCES

Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In ICLR, 2019.
Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019a.

Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. CoRR, abs/1904.12760, 2019b.

Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. Scarletnas: Bridging the gap between scalability and fairness in neural architecture search. CoRR, abs/1908.06022, 2019a.

Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. CoRR, abs/1907.01845, 2019b.

Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.

Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. CoRR, abs/1906.09607, 2019.

Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In CVPR, pp. 1586–1595, 2018.

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. CoRR, abs/1904.00420, 2019.

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, pp. 2980–2988, 2017a.

Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1398–1406, 2017b.

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. CoRR, abs/1905.02244, 2019.

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pp. 7132–7141, 2018.

Xiaojie Jin, Jiang Wang, Joshua Slocum, Ming-Hsuan Yang, Shengyang Dai, Shuicheng Yan, and Jiashi Feng. Rc-darts: Resource constrained differentiable architecture search. arXiv preprint arXiv:1912.12814, 2019.

Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. Darts+: Improved differentiable architecture search with early stopping, 2019.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019a.

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pp. 2755–2763, 2017.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR, 2019b.

Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, pp. 5068–5076, 2017.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet V2: Practical guidelines for efficient CNN architecture design. In ECCV, pp. 122–138, 2018.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101, 2018.

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In ICLR Workshop Track, 2018.

Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. CoRR, abs/1904.02877, 2019a.

Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path mobile automl: Efficient convnet design and NAS hyperparameter optimization. CoRR, abs/1907.00959, 2019b.

Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114, 2019a.

Mingxing Tan and Quoc V. Le. Mixconv: Mixed depthwise convolutional kernels. CoRR, abs/1907.09595, 2019b.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, pp. 2820–2828, 2019.

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, pp. 10734–10742, 2019.

Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. PC-DARTS: Partial channel connections for memory-efficient architecture search. In ICLR, 2020.

Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In ICLR, 2018.

Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, and Quoc Le. Scaling up neural architecture search with big single-stage models, 2020. URL https://openreview.net/forum?id=HJe7unNFDH.

Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

A VISUALIZATION

Figure 5: The architecture of AtomNAS-C. Blue, orange and cyan blocks denote atomic blocks with kernel sizes 3, 5 and 7 respectively; the heights of the blocks are proportional to their expansion ratios.

We plot the structure of the searched architecture AtomNAS-C in Fig. 5, from which we see more flexibility in channel number selection, not only among different operators within each block but also across the network. In Fig. 6a, we visualize the ratio between atomic blocks with different kernel sizes in all 21 searchable blocks. First, we notice that all searchable blocks keep convolutions of all three kernel sizes, showing that AtomNAS learns the importance of using multiple kernel sizes in the network architecture. Another observation is that AtomNAS tends to keep more atomic blocks in the later stages of the network. This is because in earlier stages a convolution of the same kernel size costs more FLOPs; AtomNAS is aware of this (thanks to its resource-aware regularization) and tries to keep as few computationally costly atomic blocks as possible.

Figure 6: Ratio of different types of atomic blocks in all 21 searchable blocks for (a) AtomNAS-C and (b) the baseline (i.e., without the FLOPs-related coefficients c_i). The number above each pie gives the total number of atomic blocks of the corresponding block in the original supernet (288, 432, 432, 432, 432, 720, 720, 720, 720, 1440, 1440, 1440, 1440, 1728, 1728, 1728, 1728, 3456, 3456, 3456, 3456). Grey denotes dead atomic blocks; blue, orange and cyan represent atomic blocks using depth-wise convolutions with kernel sizes 3, 5 and 7 respectively. Blocks without a skip connection are highlighted by bold text.