# Binarized Neural Architecture Search

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Hanlin Chen,1 Li'an Zhuo,1 Baochang Zhang,1 Xiawu Zheng,2 Jianzhuang Liu,4 David Doermann,3 Rongrong Ji2

1Beihang University, 2Xiamen University, 3University at Buffalo, 4Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences

{hlchen, bczhang}@buaa.edu.cn (Baochang Zhang is the corresponding author.)

## Abstract

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by the optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy used to abandon operations with less potential. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves a performance comparable to NAS on both the CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a 40% faster search than the state-of-the-art PC-DARTS.

## Introduction

Neural architecture search (NAS) has attracted great attention with remarkable performance in various deep learning tasks. Impressive results have been shown for reinforcement learning (RL) based methods (Zoph et al. 2018; Zoph and Le 2016), for example, which train and evaluate more than 20,000 neural networks across 500 GPUs over 4 days. Recent methods like differentiable architecture search (DARTS) reduce the search time by formulating the task in a differentiable manner (Liu, Simonyan, and Yang 2018). DARTS relaxes the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent, which provides a fast solution for effective network architecture search. To reduce the redundancy in the network space, partially-connected DARTS (PC-DARTS) was recently introduced to perform a more efficient search without compromising the performance of DARTS (Xu et al. 2019).

Although the network optimized by DARTS or its variants has a smaller model size than traditional light models, the searched network still suffers from an inefficient inference process due to the complicated architectures generated by multiple stacked full-precision convolution operations. Consequently, the adaptation of the searched network to an embedded device is still computationally expensive and inefficient. Clearly the problem requires further exploration. One way to address these challenges is to transfer NAS to a binarized neural architecture search (BNAS), exploring the advantages of binarized neural networks (BNNs) in memory saving and computational cost reduction.
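To make the continuous relaxation behind DARTS-style search concrete, the following is a minimal PyTorch-style sketch of a single mixed edge, not the authors' code: `MixedOp`, the toy operation list, and the per-edge `alpha` parameter are illustrative assumptions (in DARTS proper the architecture parameters are shared across cells and trained on a validation split).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedOp(nn.Module):
    """One edge of a DARTS-style cell: a softmax-weighted mix of candidate ops.

    A generic illustration of the continuous relaxation, not the paper's code.
    """

    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Weighted sum over all candidate operations on this edge; gradient
        # descent on alpha then ranks the operations.
        return sum(w * op(x) for w, op in zip(weights, self.ops))


# Toy usage: three candidate operations on a single edge.
ops = [nn.Conv2d(16, 16, 3, padding=1), nn.MaxPool2d(3, stride=1, padding=1), nn.Identity()]
edge = MixedOp(ops)
out = edge(torch.randn(2, 16, 32, 32))
```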
Binarized filters have been used in traditional convolutional neural networks (CNNs) to compress deep models (Rastegari et al. 2016a; Courbariaux et al. 2016; Courbariaux, Bengio, and David 2015; Xu, Boddeti, and Savvides 2016), showing up to a 58× speedup and 32× memory saving. In (Rastegari et al. 2016a), the XNOR network is presented, where both the weights and the inputs of the convolution are approximated with binary values. This results in an efficient implementation of convolutional operations by reconstructing the unbinarized filters with a single scaling factor. In (Gu et al. 2019), a projection convolutional neural network (PCNN) is proposed to realize BNNs based on a simple back-propagation algorithm. In our BNAS framework, we re-implement XNOR and PCNN to validate its effectiveness. We show that the BNNs obtained by BNAS can outperform conventional models by a large margin. This is a significant contribution to the field of BNNs, considering that the performance of conventional BNNs is not yet comparable with that of their corresponding full-precision models in terms of accuracy.

The search process of our BNAS consists of two steps. The first is operation potential ordering based on partially-connected DARTS (PC-DARTS) (Xu et al. 2019), which serves as a baseline for our BNAS. It is further sped up with a second operation reduction step guided by a performance-based strategy. In the operation reduction step, we prune one operation at each iteration from the half of the operations with less potential as calculated by PC-DARTS. As such, the optimization of the two steps becomes faster and faster because the search space is reduced by the operation pruning. We can take advantage of the differentiable framework of DARTS, where the search and performance evaluation are in the same setting. We also enrich the search strategy of DARTS. Not only is the gradient used to determine which operation is better, but the proposed performance evaluation is included to further reduce the search space. In this way, BNAS is fast and well built.

(Figure 1 depicts the pipeline with panels for selecting operations, updating likelihoods, and reducing the search space, and distinguishes feature maps in memory from those not in memory.)

Figure 1: The main steps of our BNAS: (1) Search an architecture based on $O^{(i,j)}$ using PC-DARTS. (2) Select the half of the operations with less potential from $O^{(i,j)}$ for each edge, resulting in $O^{(i,j)}_{smaller}$. (3) Select an architecture by sampling (without replacement) one operation from $O^{(i,j)}_{smaller}$ for every edge, and then train the selected architecture. (4) Update the operation selection likelihood $s(o^{(i,j)}_k)$ based on the accuracy obtained from the selected architecture on the validation data. (5) Abandon the operation with the minimal selection likelihood from the search space $\{O^{(i,j)}\}$ for every edge.

The contributions of our paper include:

- BNAS is developed based on a new search algorithm which solves the BNN optimization and the architecture search in a unified framework.
- The search space is greatly reduced through a performance-based strategy used to abandon operations with less potential, which improves the search efficiency by 40%.
- Extensive experiments demonstrate that the proposed algorithm achieves much better performance than other light models on CIFAR-10 and ImageNet.
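To make the XNOR-style binarization referred to above concrete, here is a minimal sketch that approximates each full-precision filter by its sign tensor and a single scaling factor, following the general recipe of Rastegari et al. (2016a) rather than the authors' implementation; `binarize_filters` is a hypothetical helper name.

```python
import torch


def binarize_filters(weight: torch.Tensor):
    """Approximate full-precision filters W by alpha * sign(W).

    weight has shape (out_channels, in_channels, kH, kW); alpha is the mean
    absolute value of each filter, i.e. one scaling factor per output filter
    (an assumption matching the XNOR-Net recipe, not the paper's exact code).
    """
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)
    w_bin = torch.sign(weight)
    return alpha * w_bin, alpha


# Example: a 3x3 convolution with 8 output filters and 4 input channels.
w = torch.randn(8, 4, 3, 3)
w_hat, alpha = binarize_filters(w)
# Each filter in w_hat takes only the two values -alpha and +alpha.
```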
## Related Work

Thanks to the rapid development of deep learning, significant gains in performance have been realized in a wide range of computer vision tasks, most of which rely on manually designed network architectures (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2014; He et al. 2016; Huang et al. 2017). Recently, the new approach called neural architecture search (NAS) has been attracting increased attention. The goal is to find automatic ways of designing neural architectures to replace conventional hand-crafted ones. Existing NAS approaches need to explore a very large search space and can be roughly divided into three types: evolution-based, reinforcement-learning-based, and one-shot-based.

In order to implement the architecture search within a short period of time, researchers try to reduce the cost of evaluating each searched candidate. Early efforts include sharing weights between searched and newly generated networks (Cai et al. 2018a). Later, this method was generalized into a more elegant framework named one-shot architecture search (Brock et al. 2017; Cai, Zhu, and Han 2018; Liu, Simonyan, and Yang 2018; Pham et al. 2018; Xie et al. 2018). In these approaches, an over-parameterized network or super network covering all candidate operations is trained only once, and the final architecture is obtained by sampling from this super network. For example, Brock et al. (2017) trained the over-parameterized network using a HyperNet (Ha, Dai, and Le 2016), and Pham et al. (2018) proposed to share parameters among child models to avoid retraining each candidate from scratch. Our method is based on DARTS (Liu, Simonyan, and Yang 2018), which introduces a differentiable framework and thus combines the search and evaluation stages into one. Despite its simplicity, researchers have found some of its drawbacks and have proposed a few approaches that improve over DARTS (Cai, Zhu, and Han 2018; Xie et al. 2018; Chen et al. 2019).

Unlike previous methods, we study BNAS based on efficient operation reduction. We prune one operation at each iteration from the half of the operations with smaller weights calculated by PC-DARTS, and the search becomes faster and faster during the optimization.

## Binarized Neural Architecture Search

In this section, we first describe the search space in a general form, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then review the baseline PC-DARTS (Xu et al. 2019), which improves the memory efficiency but is not enough for BNAS. Finally, an operation sampling and a performance-based search strategy are proposed to effectively reduce the search space. Our BNAS framework is shown in Fig. 1, and additional details are described in the rest of this section.

### Search Space

Following Zoph et al. (2018), Real et al. (2018), and Liu et al. (2018a;b), we search for a computation cell as the building block of the final architecture. A network consists of a predefined number of cells (Zoph and Le 2016), which can be either normal cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of M nodes, i.e., $\{B_1, B_2, ..., B_M\}$, as illustrated in Fig. 2(a). Each node $B_i$ takes its dependent nodes as input and generates an output through a sum operation, $B_j = \sum_{i<j} o^{(i,j)}(B_i)$.
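The following sketch shows how such a cell can be wired up under the usual DARTS convention (an illustration, not the paper's code): `Cell`, `make_op`, and the edge bookkeeping are hypothetical names, and the cell inputs are treated as nodes -2 and -1.

```python
import torch
import torch.nn as nn


class Cell(nn.Module):
    """Schematic fully-connected DAG cell with M intermediate nodes.

    Every ordered pair (i, j) with i < j carries one operation, and node j
    sums the transformed outputs of all of its predecessors.
    """

    def __init__(self, num_nodes: int, channels: int, make_op):
        super().__init__()
        self.num_nodes = num_nodes
        self.edges = nn.ModuleDict()
        for j in range(num_nodes):
            for i in range(-2, j):   # predecessors: the two inputs and earlier nodes
                self.edges[f"{i}->{j}"] = make_op(channels)

    def forward(self, s0, s1):
        states = [s0, s1]            # outputs of the two previous cells
        for j in range(self.num_nodes):
            # B_j = sum over i < j of o^(i,j)(B_i); state index p maps to node p - 2.
            b_j = sum(self.edges[f"{p - 2}->{j}"](states[p]) for p in range(len(states)))
            states.append(b_j)
        # The cell output concatenates the intermediate nodes along the channel axis.
        return torch.cat(states[2:], dim=1)


# Toy usage with a single 3x3 convolution as the operation on every edge.
cell = Cell(num_nodes=4, channels=16, make_op=lambda c: nn.Conv2d(c, c, 3, padding=1))
out = cell(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))  # -> (1, 64, 32, 32)
```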
Algorithm 1:

```
 1: Search the architecture for L epochs based on O^(i,j) using PC-DARTS;
 2: while K > 1 do
 3:     Select O^(i,j)_smaller, consisting of the K/2 operations with the smallest α_{o^(i,j)_k}, from O^(i,j) for every edge;
 4:     for t = 1, ..., T do
 5:         O'^(i,j)_smaller ← O^(i,j)_smaller;
 6:         for e = 1, ..., K/2 do
 7:             Select an architecture by sampling (without replacement) one operation from O'^(i,j)_smaller for every edge;
 8:             Train the selected architecture and get the accuracy on the validation data;
 9:             Assign this accuracy to all the sampled operations;
10:         end for
11:     end for
12:     Update s(o^(i,j)_k) using Eq. 4;
13:     Update the search space {O^(i,j)} using Eq. 5;
14:     Search the architecture for V epochs based on O^(i,j) using PC-DARTS;
15:     K = K − 1;
```

The complete loss function $L$ for BNAS is defined as

$$L = L_S + L_A, \qquad (7)$$

where $L_S$ is the conventional loss function, e.g., cross-entropy, and $L_A$ is the additional loss introduced by the binarization, which is handled differently in different BNNs, such as (Gu et al. 2019) and (Rastegari et al. 2016b).
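Before turning to the experiments, a schematic Python rendering of the reduction loop in Algorithm 1 may help. All names here (`edges`, `alphas`, `train_and_eval`, `search_pcdarts`) are placeholders standing in for the paper's PC-DARTS machinery, and the selection-likelihood update of Eq. 4 is approximated by simply accumulating validation accuracy per operation; this is a sketch under those assumptions, not the authors' implementation.

```python
import random


def reduce_search_space(edges, alphas, epochs_T, epochs_V, train_and_eval, search_pcdarts):
    """Schematic sketch of the performance-based reduction loop (Algorithm 1).

    Hypothetical placeholders, not the paper's API:
      edges[(i, j)]          -> list of candidate operation names still in O^(i,j)
      alphas(edge, op)       -> current PC-DARTS architecture weight for that op
      train_and_eval(choice) -> trains the sampled architecture (one op per edge)
                                and returns its validation accuracy
      search_pcdarts(epochs) -> runs PC-DARTS on the current space to refresh alphas
    """
    K = len(next(iter(edges.values())))                   # operations left per edge
    while K > 1:                                          # line 2
        # Line 3: keep the K/2 operations with the smallest alpha on every edge.
        smaller = {e: sorted(ops, key=lambda op: alphas(e, op))[:K // 2]
                   for e, ops in edges.items()}
        score = {e: {op: 0.0 for op in ops} for e, ops in smaller.items()}
        for t in range(epochs_T):                         # line 4
            pools = {e: list(ops) for e, ops in smaller.items()}    # line 5
            for _ in range(K // 2):                       # line 6
                # Line 7: sample one operation per edge without replacement.
                choice = {e: pool.pop(random.randrange(len(pool)))
                          for e, pool in pools.items()}
                acc = train_and_eval(choice)              # line 8
                for e, op in choice.items():              # line 9 (stand-in for Eq. 4)
                    score[e][op] += acc
        # Line 13: abandon the operation with the minimal selection likelihood.
        for e in edges:
            worst = min(smaller[e], key=lambda op: score[e][op])
            edges[e].remove(worst)
        search_pcdarts(epochs_V)                          # line 14
        K -= 1                                            # line 15
    return edges
```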
## Experiments

In this section, we compare our BNAS with state-of-the-art NAS methods, and also compare the BNNs obtained by our BNAS based on XNOR (Rastegari et al. 2016b) and PCNN (Gu et al. 2019).

### Experiment Protocol

In these experiments, we first search neural architectures on an over-parameterized network on CIFAR-10, and then evaluate the best architecture with a stacked deeper network on the same dataset. We then perform further experiments to search architectures directly on ImageNet. We run the experiment multiple times and find that the resulting architectures show only slight variation in performance, which demonstrates the stability of the proposed method.

We use the same datasets and evaluation metrics as existing NAS works (Liu, Simonyan, and Yang 2018; Cai et al. 2018b; Zoph et al. 2018; Liu et al. 2018). First, most experiments are conducted on CIFAR-10 (Krizhevsky, Hinton, and others 2009), which has 50K training images and 10K testing images of resolution 32×32 from 10 classes. The color intensities of all images are normalized to [−1, +1]. During architecture search, the 50K training samples of CIFAR-10 are divided into two subsets of equal size, one for training the network weights and the other for finding the architecture hyper-parameters. When reducing the search space, we randomly select 5K images from the training set as a validation set (used in line 8 of Algorithm 1). To further evaluate the generalization capability, we stack the optimal cells discovered on CIFAR-10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 ImageNet (Russakovsky et al. 2015), which consists of 1,000 classes with 1.28M training images and 50K validation images.

In the search process, we consider a total of 6 cells in the network, where the reduction cells are inserted in the second and fourth layers and the others are normal cells. There are M = 4 intermediate nodes in each cell. Our experiments follow PC-DARTS. We set the hyper-parameter C in PC-DARTS to 2 for CIFAR-10, so only 1/2 of the features are sampled for each edge. The batch size is set to 128 during the search of an architecture for L = 5 epochs based on $O^{(i,j)}$ (line 1 in Algorithm 1). Note that for 5 ≤ L ≤ 10, a larger L has little effect on the final performance but costs more search time. We freeze the network hyper-parameters such as α, and only allow the network parameters such as filter weights to be tuned in the first 3 epochs. Then in the next 2 epochs, we train both the network hyper-parameters and the network parameters. This provides an initialization for the network parameters and thus alleviates the drawback of parameterized operations compared with parameter-free operations. We also set T = 3 (line 4 in Algorithm 1) and V = 1 (line 14), so the network is trained for fewer than 60 epochs, with a larger batch size of 400 (due to the few operation samplings) while reducing the search space. The initial number of channels is 16. We use SGD with momentum to optimize the network weights, with an initial learning rate of 0.025 (annealed down to zero following a cosine schedule), a momentum of 0.9, and a weight decay of $5\times10^{-4}$. The learning rate for finding the hyper-parameters is set to 0.01.

After the search, in the architecture evaluation step, our experimental setting is similar to (Liu, Simonyan, and Yang 2018; Zoph et al. 2018; Pham et al. 2018). A larger network of 20 cells (18 normal cells and 2 reduction cells) is trained on CIFAR-10 for 600 epochs with a batch size of 96 and an additional regularization, Cutout (DeVries and Taylor 2017). The initial number of channels is 36. We use the SGD optimizer with an initial learning rate of 0.025 (annealed down to zero following a cosine schedule without restart), a momentum of 0.9, a weight decay of $3\times10^{-4}$, and gradient clipping at 5.

When stacking the cells to evaluate on ImageNet, the evaluation stage follows that of DARTS, which starts with three convolution layers of stride 2 to reduce the input image resolution from 224×224 to 28×28. 14 cells (12 normal cells and 2 reduction cells) are stacked after these three layers, with the initial channel number being 64. The network is trained from scratch for 250 epochs using a batch size of 512. We use the SGD optimizer with a momentum of 0.9, an initial learning rate of 0.05 (decayed down to zero following a cosine schedule), and a weight decay of $3\times10^{-5}$. Additional enhancements are adopted, including label smoothing and an auxiliary loss tower during training. All the experiments and models are implemented in PyTorch (Paszke et al. 2017).
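As a rough illustration (not the authors' training script), the CIFAR-10 evaluation settings above map onto a standard PyTorch setup along the following lines; `model` and the empty data loader are placeholders, and "gradient clipping at 5" is assumed to mean clipping the gradient norm.

```python
from torch import nn, optim

# Stand-in for the 20-cell evaluation network described above (hypothetical).
model = nn.Sequential(nn.Conv2d(3, 36, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(36, 10))

# SGD with lr 0.025, momentum 0.9, and weight decay 3e-4, as quoted above.
optimizer = optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4)
# Cosine annealing of the learning rate down to zero over 600 epochs, no restart.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600, eta_min=0.0)
criterion = nn.CrossEntropyLoss()

train_loader = []  # stand-in for a CIFAR-10 DataLoader with batch size 96 and Cutout

for epoch in range(600):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # Clip the gradient norm at 5 before the update step.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step()
```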
### Results on CIFAR-10

We compare our method with both manually designed networks and networks searched by NAS. The manually designed networks include ResNet (He et al. 2016), Wide ResNet (WRN) (Zagoruyko and Komodakis 2016), DenseNet (Huang et al. 2017), and SENet (Hu, Shen, and Sun 2018). For the networks obtained by NAS, we classify them according to the search method: RL (NASNet (Zoph et al. 2018), ENAS (Pham et al. 2018), and Path-level NAS (Cai et al. 2018b)), evolutionary algorithms (AmoebaNet (Real et al. 2018)), sequential model-based optimization (SMBO) (PNAS (Liu et al. 2018)), and gradient-based methods (DARTS (Liu, Simonyan, and Yang 2018) and PC-DARTS (Xu et al. 2019)). The results for the different architectures on CIFAR-10 are summarized in Tab. 1.

| Architecture | Test Error (%) | # Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|
| ResNet-18 (He et al. 2016) | 3.53 | 11.1 (32 bits) | - | Manual |
| WRN-22 (Zagoruyko and Komodakis 2016) | 4.25 | 4.33 (32 bits) | - | Manual |
| DenseNet (Huang et al. 2017) | 4.77 | 1.0 (32 bits) | - | Manual |
| SENet (Hu, Shen, and Sun 2018) | 4.05 | 11.2 (32 bits) | - | Manual |
| ResNet-18 (XNOR) | 6.69 | 11.17 (1 bit) | - | Manual |
| ResNet-18 (PCNN) | 5.63 | 11.17 (1 bit) | - | Manual |
| WRN-22 (PCNN) (Gu et al. 2019) | 5.69 | 4.29 (1 bit) | - | Manual |
| Network in (McDonnell 2018) | 6.13 | 4.30 (1 bit) | - | Manual |
| NASNet-A (Zoph et al. 2018) | 2.65 | 3.3 (32 bits) | 1800 | RL |
| AmoebaNet-A (Real et al. 2018) | 3.34 | 3.2 (32 bits) | 3150 | Evolution |
| PNAS (Liu et al. 2018) | 3.41 | 3.2 (32 bits) | 225 | SMBO |
| ENAS (Pham et al. 2018) | 2.89 | 4.6 (32 bits) | 0.5 | RL |
| Path-level NAS (Cai et al. 2018b) | 3.64 | 3.2 (32 bits) | 8.3 | RL |
| DARTS (first order) (Liu, Simonyan, and Yang 2018) | 2.94 | 3.1 (32 bits) | 1.5 | Gradient-based |
| DARTS (second order) (Liu, Simonyan, and Yang 2018) | 2.83 | 3.4 (32 bits) | 4 | Gradient-based |
| PC-DARTS | 2.78 | 3.5 (32 bits) | 0.15 | Gradient-based |
| BNAS (full-precision) | 2.84 | 3.3 (32 bits) | 0.08 | Performance-based |
| BNAS (XNOR) | 5.71 | 2.3 (1 bit) | 0.104 | Performance-based |
| BNAS (XNOR, larger) | 4.88 | 3.5 (1 bit) | 0.104 | Performance-based |
| BNAS (PCNN) | 3.94 | 2.6 (1 bit) | 0.09375 | Performance-based |
| BNAS (PCNN, larger) | 3.47 | 4.6 (1 bit) | 0.09375 | Performance-based |

Table 1: Test error rates for human-designed full-precision networks, human-designed binarized networks, full-precision networks obtained by NAS, and networks obtained by our BNAS on CIFAR-10. Note that the parameters are 1 bit in the binarized networks and 32 bits in the full-precision networks. For a fair comparison, we select the architectures found by NAS with similar numbers of parameters (< 5M). In addition, we also train an optimal architecture in a larger setting, i.e., with more initial channels (44 in XNOR or 48 in PCNN).

Using BNAS, we search for two binarized networks based on XNOR (Rastegari et al. 2016b) and PCNN (Gu et al. 2019). In addition, we also train a larger XNOR variant with 44 initial channels and a larger PCNN variant with 48 initial channels. We can see that the test errors of the binarized networks obtained by our BNAS are comparable to or smaller than those of the full-precision human-designed networks, and are significantly smaller than those of the other binarized networks. Compared with the full-precision networks obtained by other NAS methods, the binarized networks found by our BNAS have comparable test errors but much more compressed models. Note that the numbers of parameters of all these searched networks are less than 5M, but the binarized networks only need 1 bit to store each parameter, while the full-precision networks need 32 bits. In terms of search efficiency, compared with the previously fastest method, PC-DARTS, our BNAS is 40% faster (tested on our platform, an NVIDIA GTX TITAN Xp). We attribute our superior results to the proposed way of solving the problem with the novel scheme of search space reduction.

Our BNAS method can also be used to search full-precision networks. In Tab. 1, BNAS (full-precision) and PC-DARTS perform equally well, but BNAS is 47% faster. Both binarization methods, XNOR and PCNN, perform well in our BNAS, which shows the generalization ability of BNAS. Fig. 3 and Fig. 4 show the best cells searched by BNAS based on XNOR and PCNN, respectively.

Figure 3: Detailed structures of the best cells discovered on CIFAR-10 using BNAS based on XNOR: (a) normal cell; (b) reduction cell. In the normal cell, the stride of the operations on the 2 input nodes is 1, and in the reduction cell, the stride is 2.

Figure 4: Detailed structures of the best cells discovered on CIFAR-10 using BNAS based on PCNN: (a) normal cell; (b) reduction cell. In the normal cell, the stride of the operations on the 2 input nodes is 1, and in the reduction cell, the stride is 2.

We also use PC-DARTS to perform a binarized architecture search based on PCNN on CIFAR-10, resulting in a network denoted as PC-DARTS (PCNN). Compared with PC-DARTS (PCNN), BNAS (PCNN) achieves better performance (96.06% vs. 95.12% in test accuracy) with less search time (0.09375 vs. 0.18 GPU days). The reason for this may be that the performance-based strategy helps find better operations for recognition.
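As a back-of-the-envelope illustration of the 1-bit vs. 32-bit storage gap discussed above (weights only, ignoring scaling factors and activations; the helper name is hypothetical):

```python
def model_size_mb(num_params_millions: float, bits_per_param: int) -> float:
    """Approximate weight storage, ignoring scaling factors and other bookkeeping."""
    return num_params_millions * 1e6 * bits_per_param / 8 / 2**20


print(f"BNAS (PCNN), 2.6M params at 1 bit:  {model_size_mb(2.6, 1):.2f} MB")   # ~0.31 MB
print(f"PC-DARTS, 3.5M params at 32 bits:   {model_size_mb(3.5, 32):.2f} MB")  # ~13.35 MB
```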
### Results on ImageNet

We further compare with the state-of-the-art image classification methods on ImageNet. All the searched networks are obtained directly by NAS and BNAS on ImageNet by stacking the cells. Our binarized network is based on PCNNs. The results are summarized in Tab. 2.

| Architecture | Top-1 Acc. (%) | Top-5 Acc. (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| ResNet-18 (Gu et al. 2019) | 69.3 | 89.2 | 11.17 (32 bits) | - | Manual |
| MobileNetV1 (Howard et al. 2017) | 70.6 | 89.5 | 4.2 (32 bits) | - | Manual |
| ResNet-18 (PCNN) (Gu et al. 2019) | 63.5 | 85.1 | 11.17 (1 bit) | - | Manual |
| NASNet-A (Zoph et al. 2018) | 74.0 | 91.6 | 5.3 (32 bits) | 1800 | RL |
| AmoebaNet-A (Real et al. 2018) | 74.5 | 92.0 | 5.1 (32 bits) | 3150 | Evolution |
| AmoebaNet-C (Real et al. 2018) | 75.7 | 92.4 | 6.4 (32 bits) | 3150 | Evolution |
| PNAS (Liu et al. 2018) | 74.2 | 91.9 | 5.1 (32 bits) | 225 | SMBO |
| DARTS (Liu, Simonyan, and Yang 2018) | 73.1 | 91.0 | 4.9 (32 bits) | 4 | Gradient-based |
| PC-DARTS (Xu et al. 2019) | 75.8 | 92.7 | 5.3 (32 bits) | 3.8 | Gradient-based |
| BNAS (PCNN) | 71.3 | 90.3 | 6.2 (1 bit) | 2.6 | Performance-based |

Table 2: Comparison with the state-of-the-art image classification methods on ImageNet. BNAS and PC-DARTS are obtained directly by NAS and BNAS on ImageNet; the others are searched on CIFAR-10 and then directly transferred to ImageNet.

From the results in Tab. 2, we have the following observations: (1) BNAS (PCNN) performs better than the human-designed binarized networks (71.3% vs. 63.5%) and has far fewer parameters (6.1M vs. 11.17M). (2) BNAS (PCNN) has a performance similar to the human-designed full-precision networks (71.3% vs. 70.6%), with a much more highly compressed model. (3) Compared with the full-precision networks obtained by other NAS methods, BNAS (PCNN) has only a small performance drop, but is the fastest in terms of search efficiency (0.09375 vs. 0.15 GPU days) and yields a much more highly compressed model due to the binarization of the network. The above results show the excellent transferability of our BNAS method.

## Conclusion

In this paper, we have proposed BNAS, the first binarized neural architecture search algorithm, which effectively reduces the search time by pruning the search space in the early training stages. It is faster than the previously most efficient search method, PC-DARTS. The binarized networks searched by BNAS achieve excellent accuracies on CIFAR-10 and ImageNet. They perform comparably to the full-precision networks obtained by other NAS methods, but with much more compressed models.

## Acknowledgements

The work was supported in part by the National Natural Science Foundation of China under Grants 61672079, 61473086, 61773117, and 614730867. This work is supported by the Shenzhen Science and Technology Program KQTD2016112515134654. Baochang Zhang is also with the Shenzhen Academy of Aerospace Technology, Shenzhen 100083, China.

## References

Brock, A.; Lim, T.; Ritchie, J. M.; and Weston, N. 2017. SMASH: One-shot model architecture search through hypernetworks. arXiv.

Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018a. Efficient architecture search by network transformation. In Proc. of AAAI.

Cai, H.; Yang, J.; Zhang, W.; Han, S.; and Yu, Y. 2018b. Path-level network transformation for efficient architecture search. arXiv.
Cai, H.; Zhu, L.; and Han, S. 2018. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv.

Chen, X.; Xie, L.; Wu, J.; and Tian, Q. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. arXiv.

Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In Proc. of NIPS.

Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv.

DeVries, T., and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv.

Gu, J.; Li, C.; Zhang, B.; Han, J.; Cao, X.; Liu, J.; and Doermann, D. 2019. Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In Proc. of AAAI.

Ha, D.; Dai, A.; and Le, Q. V. 2016. HyperNetworks. arXiv.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proc. of CVPR.

Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv.

Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proc. of CVPR.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proc. of CVPR.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS.

Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; and Kavukcuoglu, K. 2017. Hierarchical representations for efficient architecture search. arXiv.

Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; and Murphy, K. 2018. Progressive neural architecture search. In Proc. of ECCV.

Liu, H.; Simonyan, K.; and Yang, Y. 2018. DARTS: Differentiable architecture search. arXiv.

McDonnell, M. D. 2018. Training wide residual networks for deployment using a single bit for each weight. arXiv.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In Proc. of NIPS.

Pham, H.; Guan, M. Y.; Zoph, B.; Le, Q. V.; and Dean, J. 2018. Efficient neural architecture search via parameter sharing. arXiv.

Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016a. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. of ECCV.

Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016b. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. of ECCV.

Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. V. 2018. Regularized evolution for image classifier architecture search. arXiv.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv.

Xie, S.; Zheng, H.; Liu, C.; and Lin, L. 2018. SNAS: Stochastic neural architecture search. arXiv.
Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.-J.; Tian, Q.; and Xiong, H. 2019. Partial channel connections for memory-efficient differentiable architecture search. arXiv.

Xu, J. F.; Boddeti, V. N.; and Savvides, M. 2016. Local binary convolutional neural networks. In Proc. of CVPR.

Ying, C.; Klein, A.; Real, E.; Christiansen, E.; Murphy, K.; and Hutter, F. 2019. NAS-Bench-101: Towards reproducible neural architecture search. arXiv.

Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In Proc. of BMVC.

Zoph, B., and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv.

Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. In Proc. of CVPR.