# Adapting Neural Architectures Between Domains

Yanxi Li 1, Zhaohui Yang 2,3, Yunhe Wang 2, Chang Xu 1
1 School of Computer Science, University of Sydney, Australia
2 Noah's Ark Lab, Huawei Technologies, China
3 Key Lab of Machine Perception (MOE), Department of Machine Intelligence, Peking University, China
yali0722@uni.sydney.edu.au, zhaohuiyang@pku.edu.cn, yunhe.wang@huawei.com, c.xu@sydney.edu.au

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Neural architecture search (NAS) has demonstrated impressive performance in automatically designing high-performance neural networks. The power of deep neural networks is to be unleashed for analyzing a large volume of data (e.g. ImageNet), but the architecture search is often executed on another, smaller dataset (e.g. CIFAR-10) to finish it in a feasible time. However, it is hard to guarantee that the optimal architecture derived on the proxy task maintains its advantages on another, more challenging dataset. This paper aims to improve the generalization of neural architectures via domain adaptation. We analyze the generalization bounds of the derived architecture and show their close relations with the validation error and the data-distribution distance on both domains. These theoretical analyses lead to AdaptNAS, a novel and principled approach to adapt neural architectures between domains in NAS. Our experimental evaluation shows that only a small part of ImageNet is sufficient for AdaptNAS to extend its architecture success to the entire ImageNet and outperform state-of-the-art comparison algorithms.

## 1 Introduction

Neural architecture search (NAS) aims to automate the design of neural network architectures. Recently, convolutional neural networks (CNNs) designed by NAS methods have already reached better performance than manually designed ones on ImageNet. However, early NAS methods are computationally intensive because they demand the training and evaluation of a large number of architectures [20, 26]. This makes it intractable to directly conduct the architecture search on large-scale benchmarks like ImageNet. As a trade-off, many NAS methods search on proxy tasks, such as CIFAR-10, and then retrain the obtained architectures on ImageNet. However, even on CIFAR-10, it is common for such methods to cost thousands of GPU days.

In the past years, great efforts have been made to significantly reduce the architecture search cost. Remarkably, the line of research on differentiable architecture search [17] has reduced the search cost dramatically, to several GPU days or even several GPU hours on the CIFAR-10 dataset. DARTS [17] relaxes the discrete NAS search space into continuous architecture parameters and constructs a super network as a weighted mixture of all candidate operations. In the super network, network weights and architecture parameters can be jointly optimized with gradient descent. The search cost of DARTS on CIFAR-10 is only 1 GPU day, but the parallel optimization of all candidate operations demands large GPU memory. GDAS [7] tackles this issue by using a differentiable sampler that samples one operation per connection in each epoch. This dramatically reduces GPU memory usage, and the search can be completed within about 4 to 5 GPU hours depending on the setting. While GDAS reduces the width of the super network, P-DARTS [6] starts with a shallow super network and progressively increases its depth.
This method slightly increases the search cost, to about 7 GPU hours, but reaches better test performance. Besides directly reducing the training cost, CARS [24] proposes an efficient continuous evolutionary approach based on historical evaluations. Similarly, PVLL-NAS [16] schedules evaluation with a performance estimator, which samples neural architectures both for architecture search and for the iterative training of the estimator itself.

However, due to the inconsistent performance of architectures on different domains, searching on proxy tasks and then reusing the obtained architectures on large-scale benchmarks has become a rut, which may result in a large generalization gap of the neural architecture between the two domains. This generalization gap could be either positive or negative; in practice, it can cause either overlooking good architectures or selecting poor ones, making the performance on the desired domain uncontrollable. A few attempts try to break out of this rut by directly executing the architecture search on ImageNet. MnasNet [21] was established within the framework of reinforcement learning and cost 288 TPU days for one search on ImageNet. By applying differentiable NAS techniques, ProxylessNAS [5] binarized architectures to boost search speed and reduced the search cost on ImageNet to 8.3 GPU days, but it is still about 28 to 52 times slower than its counterparts [7, 6, 23, 25] on CIFAR-10. NAS on CIFAR-10 is fast, but deploying the searched architecture on ImageNet incurs accuracy fluctuation; directly searching on ImageNet is slow, but the accuracy of its architectures can be guaranteed. If it is infeasible to swallow the entire ImageNet, we ask whether a smaller part of ImageNet, on top of efficient NAS on CIFAR-10, could rescue us from this dilemma.

In this paper, we consider the inconsistency in the generalization of architectures from a new perspective of adapting neural architectures between domains. The proxy task used for searching, such as CIFAR-10, is considered the source domain, and the large-scale benchmark on which the searched architectures are deployed for testing or application, such as ImageNet, is our target domain. Firstly, the relationship between the empirical source validation error and the expected target error of neural architectures is analyzed. Since NAS approaches typically optimize network weights during the training phase and then search for architectures during the validation phase, it is meaningful to derive a generalization bound based on validation. Two versions of the generalization bound are proposed: one is associated with the source validation error, while the other introduces an additional target validation error calculated on a subset of target samples. Based on them, we propose a lightweight method to explicitly minimize the cross-domain generalization gap of neural architectures during NAS. We name it Adaptable Neural Architecture Search (AdaptNAS). The generalizability and efficiency of AdaptNAS are demonstrated with extensive experiments. On three digits datasets, we show that AdaptNAS generalizes better than baselines without the generalization constraint. Then, large-scale experiments are performed on CIFAR-10 and ImageNet and compared with different state-of-the-art NAS methods that search either on proxy tasks or directly on ImageNet.
## 2 Related Work

Existing NAS methods with a proxy typically use CIFAR-10 as the proxy task and directly transfer the obtained architectures to ImageNet without any constraint. Early methods [20, 26] can easily cost thousands of GPU days to find an architecture even on CIFAR-10. The emergence of differentiable search methods, represented by DARTS [17], reduces the search cost to one or several GPU days. A variant of DARTS, GDAS [7], further reduces the search cost to several GPU hours by proposing a differentiable architecture sampler. In terms of the selection of the proxy task, P-DARTS [6] uses CIFAR-100 as one of its proxy tasks. CIFAR-100 contains finer-grained categories but has the same sample size and number of samples as CIFAR-10 [14]; thus, the search costs of P-DARTS on CIFAR-10 and CIFAR-100 are similar (0.3 GPU days). FBNet [22] and HM-NAS [23] search on a subset of ImageNet with 100 classes rather than the entire 1,000 classes; this 100-class ImageNet can be considered a new proxy task. ProxylessNAS [5] can directly search on ImageNet to avoid the generalization gap, but the search cost increases to 8.3 GPU days. MdeNAS [25] also directly searches on ImageNet, but it boosts search speed by searching with MobileNetV2 [11] as a backbone and reusing the structure found in [5] instead of searching from scratch, which limits its potential.

## 3 Generalization Analysis for AdaptNAS

Let $\mathcal{X}$ be the input data space, $\mathcal{Z}$ be a latent representation space and $\mathcal{Y}$ be the label space. A convolutional neural network (CNN) $f : \mathcal{X} \to \mathcal{Y}$ can be disassembled into a representation mapping $R : \mathcal{X} \to \mathcal{Z}$ and a classification hypothesis $h : \mathcal{Z} \to \mathcal{Y}$. In general, $h$ is usually a simple single-layer feed-forward network with weights $w_h$, and $R$ can have a complex topology described by network weights $w_R$ and the neural architecture $A$. The target of NAS is to find an optimal architecture $A \in \mathcal{A}$ that minimizes the classification loss $\mathcal{L}(w^*(A), A) = \mathbb{E}_{x_i \sim \mathcal{D}}[\ell(f(x_i; w^*(A), A), y_i)]$:

$$A^* = \arg\min_{A \in \mathcal{A}} \mathcal{L}(w^*(A), A), \quad (1)$$

where $\mathcal{A}$ is a predefined search space, $\mathcal{D}$ is a distribution over the input space $\mathcal{X}$, $\ell$ is a loss function, and $w^*(A)$ is the optimal value of the network weights $w = \{w_R, w_h\}$ depending on the current architecture $A$. We consider the bi-level optimization form of NAS:

$$\min_A \; \mathcal{L}_{valid}(w^*(A), A) \quad (2)$$
$$\text{s.t.} \quad w^*(A) = \arg\min_w \mathcal{L}_{train}(w, A), \quad (3)$$

where $\mathcal{L}_{train}$ and $\mathcal{L}_{valid}$ are losses on the training distribution $\mathcal{D}_{train}$ and the held-out validation distribution $\mathcal{D}_{valid}$, respectively. In such a bi-level form, $w$ and $A$ are optimized alternately with Eqs. (3) and (2) until convergence or until a maximum number of iterations is reached (a minimal sketch of this alternation is given below).

We define ProxyNAS as those existing NAS methods that conduct the optimization on a relatively small proxy task (e.g. CIFAR-10) and evaluate the searched architectures on the large-scale task (e.g. ImageNet). In this paper, we revisit this tradition of NAS training and evaluation from the perspective of domain adaptation and propose AdaptNAS. The smaller training data of the architecture is taken as the source domain, and we can also leverage a few data from the target domain (e.g. ImageNet) to improve the generalization of the architecture. Formally, a domain can be considered as a pair of a distribution $\mathcal{D}$ on the input space $\mathcal{X}$ and a labeling function $f : \mathcal{X} \to \mathcal{Y}$. We can thus define the source and target domains as $\langle\mathcal{D}_S, f_S\rangle$ and $\langle\mathcal{D}_T, f_T\rangle$, respectively. In this section, we first introduce a generalization bound in NAS constrained by the source-domain validation error and a domain distance. Then, we introduce the target-domain validation error into the bound to utilize any accessible target-domain information. Detailed proofs are provided in the supplementary material.
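To make the alternating scheme of Eqs. (2) and (3) concrete, the following is a minimal, first-order sketch in PyTorch in the spirit of differentiable NAS such as DARTS [17]. All names (`supernet`, the two data loaders, and both optimizers) are hypothetical placeholders, not the authors' implementation; in particular, `a_optimizer` is assumed to hold only the architecture parameters $A$, and `w_optimizer` only the network weights $w$.

```python
import torch

def search_epoch(supernet, train_loader, valid_loader,
                 w_optimizer, a_optimizer, criterion):
    """One epoch of the bi-level alternation in Eqs. (2)-(3), first-order only."""
    for (x_tr, y_tr), (x_va, y_va) in zip(train_loader, valid_loader):
        # Lower level, Eq. (3): update network weights w on the training split.
        w_optimizer.zero_grad()
        loss_train = criterion(supernet(x_tr), y_tr)
        loss_train.backward()
        w_optimizer.step()

        # Upper level, Eq. (2): update architecture parameters A on the validation split.
        a_optimizer.zero_grad()
        loss_valid = criterion(supernet(x_va), y_va)
        loss_valid.backward()
        a_optimizer.step()
```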
### 3.1 Generalization Bounds via Validation of the Source Domain

To quantify the generalization gap between domains, a domain distance measure is necessary. We use the $\mathcal{A}$-distance [13] as the measure, defined as follows:

**Definition 1 ($\mathcal{A}$-distance).** Let $\mathcal{D}$ and $\mathcal{D}'$ be distributions on $\mathcal{X}$, and $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$ such that every $A \in \mathcal{A}$ is measurable w.r.t. $\mathcal{D}$ and $\mathcal{D}'$. The $\mathcal{A}$-distance between $\mathcal{D}$ and $\mathcal{D}'$ is

$$d_{\mathcal{A}}(\mathcal{D}, \mathcal{D}') := 2 \sup_{A \in \mathcal{A}} \left| \Pr_{\mathcal{D}}[A] - \Pr_{\mathcal{D}'}[A] \right|, \quad (4)$$

where $\Pr_{\mathcal{D}}[A]$ is the probability of $A$ under $\mathcal{D}$.

The complexity of the $\mathcal{A}$-distance can be limited by the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ [4]. For simplicity, we discuss the binary classification scenario, where $\mathcal{Y} = \{0, 1\}$; the theoretical results can be easily generalized to the multi-class case. Under the binary setting, we have $\mathcal{H}\Delta\mathcal{H} = \{h(z) \oplus h'(z) \mid h, h' \in \mathcal{H}\}$, where $\oplus$ is the XOR operation and $\mathcal{H}$ is a hypothesis space. Based on this, $\mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}$ can be defined as the collection of all subsets $A$ such that $A = \{x \mid x \in \mathcal{X}, h(x) \neq h'(x)\}$ for some $h, h' \in \mathcal{H}$. Letting $\mathcal{A} = \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}$ in Eq. (4), we obtain the symmetric-difference $\mathcal{A}$-distance, denoted $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}, \mathcal{D}')$. The advantage of using $d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ is that it satisfies

$$\forall h, h' \in \mathcal{H}, \quad |\varepsilon_S(h, h') - \varepsilon_T(h, h')| \leq \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T), \quad (5)$$

where $\varepsilon_\cdot(\cdot, \cdot)$ measures the disagreement of two hypotheses. The measure in the source domain is defined as $\varepsilon_S(h, h') = \mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - h'(x)|]$, and we use a similar definition for the target domain. Analogously to $\mathcal{D}_S$ and $\mathcal{D}_T$, we denote the source and target latent distributions on $\mathcal{Z}$ as $\tilde{\mathcal{D}}_S$ and $\tilde{\mathcal{D}}_T$. The labeling functions from $\mathcal{Z}$ to $\mathcal{Y}$ are represented by $\tilde{f}_S$ and $\tilde{f}_T$, respectively. We define the expected error of $h$ in the source domain as the disagreement between $h$ and $\tilde{f}_S$, denoted $\varepsilon_S(h) := \varepsilon_S(h, \tilde{f}_S)$; a similar notation is used for the target domain. Then, Eq. (5) leads to Lemma 1.

**Lemma 1.** [4] Let $R$ be a representation function $R : \mathcal{X} \to \mathcal{Z}$, and $\tilde{\mathcal{D}}_S$ and $\tilde{\mathcal{D}}_T$ be the source and target distributions over $\mathcal{Z}$, respectively. For $h \in \mathcal{H}$:

$$\varepsilon_T(h) \leq \varepsilon_S(h) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_S, \tilde{\mathcal{D}}_T) + \lambda, \quad (6)$$

where $\lambda$ is the combined error on both domains of the optimal hypothesis $h^* = \arg\min_{h \in \mathcal{H}} \varepsilon_S(h) + \varepsilon_T(h)$, i.e. $\lambda = \varepsilon_S(h^*) + \varepsilon_T(h^*)$.

Lemma 1 reveals that the cross-domain generalization gap is bounded by the expected source error and the $\mathcal{A}$-distance of the latent distributions. This distance can be minimized by optimizing the representation function $R$. In NAS, $w = \{w_R, w_h\}$ is optimized over the training data, while in the validation phase, given the fixed $w$, the architecture $A$ is further optimized to minimize the validation error (see Eq. (2)). We therefore proceed to extend the above analysis to the validation set. Let $\tilde{U}_{S,train}$ and $\tilde{U}_{S,valid}$ be a training set and a held-out validation set of i.i.d. samples drawn from $\tilde{\mathcal{D}}_S$, respectively, such that $\tilde{U}_{S,train} \cap \tilde{U}_{S,valid} = \emptyset$. The validation is over a subset $\mathcal{H}'$ of $\mathcal{H}$ that depends on $\tilde{U}_{S,train}$ but is independent of $\tilde{U}_{S,valid}$. The following theorem analyzes the expected target error in terms of the empirical source validation error on $\tilde{U}_{S,valid}$ and an empirical $\mathcal{A}$-distance.

**Theorem 2.** Let $m$ be the size of $\tilde{U}_{S,valid}$, $d$ be the VC-dimension of $\mathcal{H}'$, and $\tilde{U}_S$ and $\tilde{U}_T$ be sets of unlabelled i.i.d. samples drawn from $\tilde{\mathcal{D}}_S$ and $\tilde{\mathcal{D}}_T$, each with size $m'$.
With probability at least $1 - \delta$, for $h \in \mathcal{H}'$:

$$\varepsilon_T(h) \leq \hat{\varepsilon}_{S,valid}(h) + \sqrt{\frac{d \log m - \log \delta}{2m}} + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{U}_S, \tilde{U}_T) + 4\sqrt{\frac{d \log(2m') + \log(4/\delta)}{m'}} + \lambda. \quad (7)$$

Theorem 2 provides an empirical estimate of the cross-domain generalizability of architectures by validation. The expected target error of an architecture $A$ depends on two terms: the validation error of the entire network (including both $R$ and $h$, although $h$ is fixed during validation) in the source domain, and the $\mathcal{A}$-distance between $\tilde{U}_S$ and $\tilde{U}_T$ generated by the neural architecture.

### 3.2 Generalization Bounds via a Hybrid Validation

In Theorem 2, $\tilde{U}_{S,valid}$ is required to compute $\hat{\varepsilon}_{S,valid}(h)$. Besides the validation set on the source domain, in practice we could further have labeled samples from the target domain for validation. A hybrid validation set of $m$ examples is therefore defined as the composition of $\beta m$ source examples and $(1-\beta)m$ target examples, where $\beta \in [0, 1]$. Validation errors on the source and target domains are combined by a weighted sum with $\alpha \in [0, 1]$:

$$\hat{\varepsilon}_{\alpha,valid}(h) = \alpha\, \hat{\varepsilon}_{S,valid}(h) + (1-\alpha)\, \hat{\varepsilon}_{T,valid}(h). \quad (8)$$

The following lemma bounds the expected target error with the expected hybrid error. This bound can be extended to the validation set as well.

**Lemma 3.** Let $\varepsilon_\alpha(h)$ be an expected hybrid error weighted by $\alpha \in [0, 1]$. For $h \in \mathcal{H}$:

$$\varepsilon_T(h) \leq \varepsilon_\alpha(h) + \alpha\left(\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_S, \tilde{\mathcal{D}}_T) + \lambda\right). \quad (9)$$

By applying Lemma 3 to Theorem 2, we obtain the following corollary.

**Corollary 4.** Let $\alpha \in [0, 1]$ be the weight of the hybrid error, and $\beta \in [0, 1]$ be the ratio of i.i.d. samples drawn from $\tilde{\mathcal{D}}_S$ and $\tilde{\mathcal{D}}_T$ in a held-out validation set. With probability at least $1 - \delta$, for $h \in \mathcal{H}'$:

$$\varepsilon_T(h) \leq \hat{\varepsilon}_{\alpha,valid}(h) + \sqrt{\left(\frac{\alpha}{\beta}+\frac{1-\alpha}{1-\beta}\right)\frac{d \log m - \log \delta}{2m}} + \alpha\left(\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{U}_S, \tilde{U}_T) + 4\sqrt{\frac{d \log(2m') + \log(4/\delta)}{m'}} + \lambda\right). \quad (10)$$

To utilize Corollary 4, $\alpha$ and $\beta$ need to be determined. When $\alpha = 1$ and the target validation error is not considered, Corollary 4 reduces to Theorem 2. With $\alpha \in (0, 1)$, both source and target samples are introduced for validation, and the generalizability of architectures could be improved (see the $\alpha$ before the $\mathcal{A}$-distance). The selection of $\beta$ is a trade-off. With a fixed source validation set, a smaller $\beta$ means more target samples and heavier computation cost. Besides, as $\beta$ approaches 0 or 1, the source and target sample numbers become highly unbalanced and the factor $\alpha/\beta + (1-\alpha)/(1-\beta)$ approaches infinity, which makes the architecture optimization unpredictable. This reminds us to carefully balance the sample sizes in the source and target domains. More empirical discussion can be found in the experiments.

## 4 AdaptNAS Algorithm

Motivated by the theorems in Section 3, we propose two versions of AdaptNAS. The former, AdaptNAS-Source (AdaptNAS-S), follows Theorem 2: it optimizes network weights with source training samples and estimates the $\mathcal{A}$-distance in the training phase; in the searching phase, the architecture $A$ is optimized to reduce both the source validation loss and the $\mathcal{A}$-distance. The latter, AdaptNAS-Combined (AdaptNAS-C), follows Corollary 4: it uses a schema similar to AdaptNAS-S, but further considers a subset of target samples to optimize both network weights and architectures by utilizing Eq. (8). It is intractable to directly compute the $\mathcal{A}$-distance, but we can approximate it with a domain discriminator [3] (a minimal sketch of this estimate is given below). With a domain discriminator $h_d \in \mathcal{H}$, we have

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{U}_S, \tilde{U}_T) = 2\left(1 - 2\min_{h_d \in \mathcal{H}} \hat{\varepsilon}_d(h_d)\right), \quad (11)$$

where $\hat{\varepsilon}_d(h_d) = \frac{1}{2m'}\sum_{i=1}^{2m'} |h_d(z_i) - y_{d,i}|$ is the empirical discrimination error on $z_i \in \tilde{U}_S \cup \tilde{U}_T$, and $y_{d,i}$ is the domain label. Although the optimal $h_d$ is normally unobtainable, the $\mathcal{A}$-distance can still be approximated arbitrarily well by optimizing the discriminator.
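As a rough illustration of Eq. (11), the sketch below trains a small binary domain classifier on latent features and converts its error into a proxy $\mathcal{A}$-distance. The feature tensors, the tiny discriminator architecture, and the training schedule are hypothetical placeholders for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

def proxy_a_distance(feats_source, feats_target, epochs=10, lr=1e-3):
    """Estimate the proxy A-distance of Eq. (11) from two batches of
    latent features (assumed tensors of shape [m', dim])."""
    z = torch.cat([feats_source, feats_target], dim=0)
    # Domain labels: 0 for source samples, 1 for target samples.
    y = torch.cat([torch.zeros(len(feats_source)), torch.ones(len(feats_target))])

    discriminator = nn.Sequential(nn.Linear(z.size(1), 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(discriminator.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(epochs):  # Approximate min over h_d by a few gradient steps.
        optimizer.zero_grad()
        loss = bce(discriminator(z).squeeze(1), y)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        err = ((discriminator(z).squeeze(1) > 0).float() != y).float().mean().item()
    return 2.0 * (1.0 - 2.0 * err)  # Eq. (11): d = 2 * (1 - 2 * min error)
```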
A useful property of Eq. (11) is that $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{U}_S, \tilde{U}_T)$ grows as $\min_{h_d \in \mathcal{H}} \hat{\varepsilon}_d(h_d)$ shrinks. With this observation, it is possible to learn a domain discriminator during NAS and use adversarial learning to minimize the $\mathcal{A}$-distance by maximizing the discrimination error. In AdaptNAS, we first learn an $h_d$ in the training phase to distinguish the latent representations produced by $R$, by minimizing the discrimination loss

$$\mathcal{L}_d(h_d; w_R, A) = \mathbb{E}_{x_i \sim \mathcal{D}_S \cup \mathcal{D}_T}[\ell(h_d(R(x_i; w_R, A)), d_i)],$$

where $d_i$ is the domain label of $x_i$. Then, the architecture $A$ of $R$ is optimized in the searching phase with an adversarial loss to maximize the discrimination loss. In AdaptNAS-S, the lower-level optimization in Eq. (3) can be reformulated as Eq. (13), where $\mathcal{L}_{S,train}(h; w_R, A) = \mathbb{E}_{x_i \sim \mathcal{D}_{S,train}}[\ell(h(R(x_i; w_R, A)), y_i)]$ is the source training loss; similar notations are used for the source validation loss and the target training and validation losses. Similarly, the upper-level optimization in Eq. (2) can be reformulated as Eq. (12):

$$\min_A \; \mathcal{L}_{S,valid}(h; w_R, A) - \mathcal{L}_d(h_d; w_R, A), \quad (12)$$
$$\text{s.t.} \quad \max_{h_d} \min_{w,h} \; \mathcal{L}_{S,train}(h; w_R, A) - \mathcal{L}_d(h_d; w_R, A). \quad (13)$$

However, the discriminator gradients at an early stage could be noisy and corrupt the entire network. To control them in back-propagation, we apply a gradient reversal technique [8] (see the sketch at the end of this section). In the training phase, with gradient reversal, $h$ and $h_d$ are still updated with their own gradients, but $w_R$ is updated with additional reversed discriminator gradients weighted by $\gamma$, as in Eq. (14). The weight term $\gamma \in [0, 1]$ can be dynamically adjusted during optimization. In the searching phase, by utilizing differentiable NAS [7, 17], architectures can be relaxed as continuous parameters and updated with adversarial learning, as in Eq. (15):

$$w_R \leftarrow w_R - \eta\left(\frac{\partial \mathcal{L}_{S,train}(h; w_R, A)}{\partial w_R} - \gamma \frac{\partial \mathcal{L}_d(h_d; w_R, A)}{\partial w_R}\right), \quad (14)$$
$$A \leftarrow A - \eta\left(\frac{\partial \mathcal{L}_{S,valid}(h; w_R, A)}{\partial A} - \gamma \frac{\partial \mathcal{L}_d(h_d; w_R, A)}{\partial A}\right). \quad (15)$$

In AdaptNAS-C, we cannot simply replace $\mathcal{L}_S(h; w, A)$ by $\mathcal{L}_\alpha(h; w, A)$, because Corollary 4 has already revealed that the $\mathcal{A}$-distance term should also be weighted by $\alpha$. We therefore rewrite the optimization problem into the following form, where the source loss and the discrimination loss are weighted together:

$$\min_A \; \alpha\left(\mathcal{L}_{S,valid}(h; w_R, A) - \mathcal{L}_d(h_d; w_R, A)\right) + (1-\alpha)\,\mathcal{L}_{T,valid}(h; w_R, A), \quad (16)$$
$$\text{s.t.} \quad \max_{h_d} \min_{h, w_R} \; \alpha\left(\mathcal{L}_{S,train}(h; w_R, A) - \mathcal{L}_d(h_d; w_R, A)\right) + (1-\alpha)\,\mathcal{L}_{T,train}(h; w_R, A). \quad (17)$$

Compared to the original bi-level optimization in Eqs. (2) and (3), AdaptNAS introduces an adversarial loss to both levels, and the AdaptNAS-C version also introduces the target loss. This ensures the generalizability of both levels. A general difficulty in the bi-level setting is that the upper-level optimization highly depends on the lower-level one and is impacted by the quality of the lower-level solution. Similarly, if the solution of the lower-level problem has a large generalization gap, it will be hard for the upper-level one to generalize well. The symmetric constraint on both levels can alleviate this issue.
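The gradient reversal step used in Eqs. (14) and (15) can be realized with a custom autograd function, following Ganin and Lempitsky [8]. The sketch below is a minimal PyTorch rendering under the assumption that the discriminator loss is back-propagated through this layer into the representation $R$; it is illustrative rather than the authors' code, and `discriminator` in the usage comment is a hypothetical module.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients by gamma in the
    backward pass, implementing the reversed term of Eqs. (14)-(15)."""

    @staticmethod
    def forward(ctx, x, gamma):
        ctx.gamma = gamma
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the representation network;
        # gamma itself receives no gradient.
        return -ctx.gamma * grad_output, None

def grad_reverse(x, gamma=1.0):
    return GradientReversal.apply(x, gamma)

# Usage sketch: features = R(x); the discriminator then sees reversed gradients,
# so minimizing its loss w.r.t. w_R actually maximizes the discrimination error.
# logits_d = discriminator(grad_reverse(features, gamma))
```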
A remaining problem is that, for the hybrid loss calculation on CIFAR-10 and ImageNet, we cannot directly use their labels. The reason is that, for the bound in Corollary 4 to hold, the hypothesis $h$ should be identical for both domains. In practice, the classifier depends on the output dimension, and there is a large gap between the categories of CIFAR-10 and ImageNet (10 versus 1,000 classes). To bridge this gap, we apply self-supervised learning. In self-supervised learning, samples are transformed and labeled based on predefined rules; the labels are no longer correlated with the objects in the samples but with the rule used to transform them. Moreover, self-supervised learning has been demonstrated to learn a feature mapping on one dataset and apply the learned mapping well to another dataset [10, 12]. Specifically, we utilize a rotation task [10] for its impressive performance. In the rotation task, each sample in the dataset is rotated by different degrees and labeled accordingly. We use four different degrees: 0°, 90°, 180° and 270°. We can therefore learn a 4-class classification task with identical categories on both CIFAR-10 and ImageNet.

## 5 Experiments

We perform extensive experiments on various domains to demonstrate the practical generalizability of AdaptNAS. Firstly, we use three relatively small digits datasets to compare our results with those of searching on the source domain only and of searching directly on the target domain. Then, we search with both versions of our method under the standard NAS setting (i.e. with CIFAR-10 as the source domain and ImageNet as the target domain) multiple times with various hyperparameters to justify our claims following Theorem 2 and Corollary 4. Finally, the architectures obtained with our optimal settings are compared with the current state of the art.

### 5.1 Search Setting

Following many previous works [6, 7, 17, 25, 26], we use the NASNet search space [26]. There are two kinds of cells, normal cells and reduction cells, and each cell has 7 nodes, including 2 input nodes, 1 output node and 4 computation nodes. We use a set of 8 different candidate operations. The source dataset, CIFAR-10, contains 50,000 samples in the training set. For target-domain samples, we construct a subset of 50,000 samples from ImageNet, containing 50 samples from each category, as the target samples we have access to during searching. This is about 3.90% of the entire ImageNet. More details of our experiment settings are available in the supplementary material.

### 5.2 Cross-Domain Generalization with AdaptNAS

We use three pairs of source and target digits domains to demonstrate the generalizability of AdaptNAS. The first pair is MNIST [15] and MNIST-M [9]. MNIST is a dataset of greyscale handwritten digits (Figure 1(a)). MNIST-M modifies MNIST by blending the greyscale images over random patches of colour photos from BSDS500 [1] (Figure 1(b)); the blending introduces extra colour and texture. In the second pair, we still use MNIST as the source domain, and the target domain is SVHN [18], which includes natural images of street house numbers (Figure 1(c)). The last pair still uses SVHN as the target domain, but the source domain is the more divergent MNIST-M. Intuitively, the first setting is the simplest, the second one is the hardest, and the last one is moderate. Figure 1(d) shows the test accuracy of all obtained architectures on the target domain.

Figure 1: The generalizability of AdaptNAS. Figures (a), (b) and (c) show sample images from each domain (MNIST, MNIST-M and SVHN, respectively). Table (d) shows the test accuracy of obtained architectures on the target domain: the first row corresponds to the ProxyNAS method without the generalization constraint, the middle row is our method, and the last row is the performance we aim for.

(d) Target-domain test accuracy (%), per source → target pair:

| Search method | MNIST → MNIST-M | MNIST → SVHN | MNIST-M → SVHN |
|---|---|---|---|
| Search on Source | 98.56 | 94.70 | 95.28 |
| AdaptNAS (ours) | 98.75 | 95.63 | 95.48 |
| Search on Target | 98.61 | 95.60 | 95.60 |
Our method consistently outperforms the source-only search. It is also competitive with directly searching on the target domain and can even occasionally outperform it. This is because we only retain the architectures after the search and retrain the network weights from scratch; an architecture whose generalizability has been explicitly optimized can therefore outperform another architecture searched on a single domain.

### 5.3 Better Generalization with the Hybrid Loss

Firstly, we test different parameters of AdaptNAS-C, including $\alpha$ and $\beta$. We use 5 different values of $\alpha$ from 0 to 1 with an interval of 0.25. When $\alpha = 1$ and only the source loss is considered, AdaptNAS-C becomes AdaptNAS-S, which is identical to the first row of Table 2. A relatively new case is $\alpha = 0$, where only the target loss is considered. We also use 3 different values of $\beta$: 0.50, 0.83 and 0.98. Table 1 shows the validation error during search and the test error after retraining on CIFAR-10 and the full ImageNet.

Table 1: Performance of various AdaptNAS-C settings.

| α | β | CIFAR-10 Valid Err. (%) | CIFAR-10 Test Err. (%) | ImageNet Valid Err. (%) | ImageNet Test Err. (%) |
|---|---|---|---|---|---|
| 0.00 | 0.50 | 49.26 | 3.00 | 42.52 | 24.5 |
| 0.25 | 0.50 | 30.00 | 2.97 | 40.13 | 24.2 |
| 0.50 | 0.50 | 25.16 | 2.50 | 40.13 | 24.5 |
| 0.75 | 0.50 | 22.78 | 2.62 | 42.41 | 25.1 |
| 1.00 | 0.50 | 23.15 | 2.53 | 55.37 | 25.4 |
| 0.00 | 0.83 | 52.19 | 3.21 | 53.65 | 25.5 |
| 0.25 | 0.83 | 38.06 | 3.17 | 51.56 | 25.0 |
| 0.50 | 0.83 | 33.82 | 2.95 | 49.86 | 24.7 |
| 0.75 | 0.83 | 28.68 | 3.00 | 54.17 | 25.5 |
| 1.00 | 0.83 | 23.89 | 2.98 | 56.39 | 25.8 |
| 0.00 | 0.98 | 74.80 | 3.91 | 69.65 | 29.5 |
| 0.25 | 0.98 | 67.31 | 3.66 | 70.90 | 26.5 |
| 0.50 | 0.98 | 51.93 | 3.56 | 64.25 | 25.8 |
| 0.75 | 0.98 | 40.68 | 3.02 | 62.75 | 25.1 |
| 1.00 | 0.98 | 30.15 | 2.93 | 61.85 | 25.7 |

In the first group, where $\beta = 0.50$, the source and target domains have the same number of samples (50,000 from each). With the extreme setting $\alpha = 0$, the performance on CIFAR-10 is the worst; the target error ties with that of $\alpha = 0.50$ and is slightly behind that of $\alpha = 0.25$ (24.5% vs. 24.2%). Although decent performance might be achieved by solely using the target loss when there are sufficient target samples, when target samples become increasingly scarce (e.g. the second group, where $\beta = 0.83$, and the third group, where $\beta = 0.98$), the effect of the domain discriminator loss becomes even more remarkable. In the second group, the number of target samples is decreased to 10,000, and in the third group it is further decreased to 1,000. With fewer target samples, using the hybrid loss can improve the target-domain performance by up to 4.4%.

Table 2: Comparison of different versions of AdaptNAS, depending on whether the hybrid loss is used in the training and/or searching phase.

| Hybrid loss (Train) | Hybrid loss (Search) | CIFAR-10 Err. (%) | ImageNet Err. (%) |
|---|---|---|---|
| N | N | 2.77 | 25.3 |
| N | Y | 2.84 | 24.8 |
| Y | Y | 2.50 | 24.5 |

We further compare AdaptNAS-S and AdaptNAS-C. When we introduce the target loss into AdaptNAS-C, we introduce it symmetrically into the training and searching phases, because the upper-level optimization highly depends on the lower-level one. Despite that, we explore one more setting, where the target loss is introduced solely into the searching phase. Table 2 shows the test error after retraining. By using the hybrid loss, the target error of the architectures decreases. Even when the hybrid loss is only used in searching, the improvement in the target domain is remarkable; by using the hybrid loss in both the training and searching phases, the lowest target error is reached.
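As a small illustration of how the $\alpha$-weighted objective of Eqs. (8) and (16)–(17) can be assembled, the sketch below combines source, target and discrimination losses. It assumes a `model` that returns a (features, rotation-logits) pair and a separate `discriminator` module; these names and the loader batches are hypothetical, and in AdaptNAS the subtracted discriminator term would be realized through the gradient-reversal layer sketched earlier rather than by a literal subtraction.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(model, discriminator, batch_src, batch_tgt, alpha):
    """One AdaptNAS-C style loss evaluation in the spirit of Eqs. (8), (16)-(17)."""
    (x_s, y_s), (x_t, y_t) = batch_src, batch_tgt

    feat_s, logits_s = model(x_s)   # assumed to return (latent features, class logits)
    feat_t, logits_t = model(x_t)

    loss_src = F.cross_entropy(logits_s, y_s)   # source rotation-classification loss
    loss_tgt = F.cross_entropy(logits_t, y_t)   # target rotation-classification loss

    # Domain discrimination loss L_d on pooled features (0 = source, 1 = target).
    feats = torch.cat([feat_s, feat_t], dim=0)
    dom = torch.cat([torch.zeros(len(x_s), dtype=torch.long),
                     torch.ones(len(x_t), dtype=torch.long)])
    loss_d = F.cross_entropy(discriminator(feats), dom)

    # alpha-weighted hybrid objective with the adversarial (subtracted) L_d term.
    return alpha * (loss_src - loss_d) + (1.0 - alpha) * loss_tgt
```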
### 5.4 Gradient Reversal Scheduler in Adversarial Learning

We compare two different schedulers for $\gamma$ in Eqs. (14) and (15). An exponential scheduler is proposed by Ganin et al. [8], which updates $\gamma$ by

$$\gamma_p = \frac{2}{1 + \exp(-10p)} - 1, \quad (18)$$

where $p \in [0, 1]$ is the training progress, calculated by dividing the current epoch by the total number of epochs. However, as shown by the blue dashed line in Figure 2(a), the exponential scheduler rises too fast. We also test a cosine-based scheduler, which rises more slowly:

$$\gamma_p = \frac{1 - \cos(p\pi)}{2}, \quad (19)$$

where the definition of $p$ is the same as above. Both schedulers are experimented with AdaptNAS-S, which does not consider the target-domain loss, to emphasize the impact of the discriminator.

Table 3: Test error of searching with different $\gamma$ schedulers.

| Scheduler | CIFAR-10 Err. (%) | ImageNet Err. (%) |
|---|---|---|
| Exponential | 2.93 | 25.1 |
| Cosine | 2.77 | 25.3 |

Figure 2 shows the accuracy curves during search. We also retrain the obtained architectures on CIFAR-10 and the entire ImageNet after the search (Table 3).

Figure 2: Search curves over 200 epochs. (a) Discrimination accuracy and the reversal factor $\gamma$; (b) source-domain training/validation accuracy; (c) target-domain training/validation accuracy; each panel compares the exponential (Exp) and cosine (Cos) schedulers.

As shown in Figure 2(a), the initial accuracies of the two discriminators are similar; the one with the exponential scheduler immediately drops, while the one with the cosine scheduler first achieves relatively high accuracy and then declines as $\gamma$ increases. It is commonly observed that a strong discriminator leads to small losses and vanishing gradients [2], which makes the network hard to train; this is verified by Figures 2(b) and 2(c). Although the cosine scheduler corresponds to better performance in the source domain, its target-domain performance is overtaken by the exponential scheduler, which means the network trained with the exponential scheduler generalizes better. The test performance after retraining in Table 3 shows the same conclusion: when the cosine scheduler is used, the error is low on CIFAR-10 but high on ImageNet, which indicates that the architecture has not been adapted successfully.
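For reference, the two $\gamma$ schedules in Eqs. (18) and (19) can be written as plain functions of the training progress $p$. This is a direct transcription of the formulas above (including the reconstructed cosine form), not code from the paper.

```python
import math

def gamma_exponential(p: float) -> float:
    """Exponential reversal-factor schedule of Eq. (18); p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def gamma_cosine(p: float) -> float:
    """Cosine reversal-factor schedule of Eq. (19); rises more slowly early on."""
    return (1.0 - math.cos(p * math.pi)) / 2.0

# Example with p = current_epoch / total_epochs:
# gamma_exponential(0.1) ~= 0.46 while gamma_cosine(0.1) ~= 0.02,
# illustrating why the exponential schedule strengthens the reversed gradient much earlier.
```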
### 5.5 Comparison with the State of the Art

Table 4 compares AdaptNAS with current state-of-the-art NAS methods, including methods that search with proxy tasks and methods that search directly on ImageNet. One drawback of using self-supervised learning is that it dramatically increases the searching time: in the rotation task, where each image is rotated to 4 different degrees, the number of samples increases 4 times. To balance the performance-efficiency trade-off, we add a simplified Rot-1 setting, where each sample is randomly rotated to only 1 of the 4 degrees in each epoch; the original task is then denoted Rot-4. Compared to methods searching on CIFAR-10, our method reaches lower ImageNet top-1 error and competitive CIFAR-10 error. The simplified AdaptNAS-S Rot-1 is our fastest setting and is even faster than several differentiable NAS methods on CIFAR-10, including DARTS, HM-NAS and Proxyless-G. Our best top-1 error on ImageNet is reached by AdaptNAS-C with Rot-4, which costs 2 GPU days for searching. Even though it is slower than many ProxyNAS methods, it is faster than most NAS methods that directly search on ImageNet, including the ones using a subset of ImageNet; only MdeNAS, which searches with acceleration, reaches a similar search cost.

Table 4: Comparison with state-of-the-art NAS methods searching on different domains. For error rates on CIFAR-10, if a paper provides results with cutout, we use that version, because cutout always yields their best performance, and we use it too. On ImageNet, cutout is normally not used.

| Domain | Method | GPU Days | CIFAR-10 Params (M) | CIFAR-10 Err. (%) | ImageNet Params (M) | ImageNet Mult-Adds (M) | ImageNet Top-1 (%) | ImageNet Top-5 (%) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | NASNet-A [26] | 2K | 3.3 | 2.65 | 5.3 | 564 | 26.0 | 8.4 |
| CIFAR-10 | ENAS (micro) [19] | 0.45 | 4.6 | 2.89 | - | - | - | - |
| CIFAR-10 | DARTS (2nd order) [17] | 1 | 3.3 | 2.76 ± 0.09 | 4.7 | 574 | 26.7 | 8.7 |
| CIFAR-10 | GDAS [7] | 0.21 | 3.4 | 2.93 | 5.3 | 581 | 26.0 | 8.5 |
| CIFAR-10 | P-DARTS [6] | 0.3 | 3.4 | 2.50 | 4.9 | 557 | 24.4 | 7.4 |
| CIFAR-10 | Proxyless-G [5] | 4.0 | 5.7 | 2.08 | - | - | - | - |
| CIFAR-10 | HM-NAS (2nd order) [23] | 1.4 | 1.8 | 2.41 ± 0.05 | - | - | - | - |
| CIFAR-10 | MdeNAS [25] | 0.16 | 3.6 | 2.55 | 6.1 | 600 | 25.5 | 7.9 |
| CIFAR-100 | P-DARTS [6] | 0.3 | 3.6 | 2.62 | 5.1 | 577 | 24.7 | 7.5 |
| ImageNet¹ | MnasNet-A3 [21] | 3.8K² | - | - | 5.2 | 403 | 23.3 | 6.7 |
| ImageNet¹ | FBNet-C³ [22] | 9 | - | - | 5.5 | 375 | 25.1 | - |
| ImageNet¹ | Proxyless-R (Mobile) [5] | 8.3 | - | - | - | - | 25.4 | 7.8 |
| ImageNet¹ | Proxyless (GPU) [5] | 8.3 | - | - | 7.1 | 465 | 24.9 | 7.5 |
| ImageNet¹ | HM-NAS³ [23] | 5 | - | - | 3.6 | 482 | 26.2 | - |
| ImageNet¹ | MdeNAS (CPU)⁴ [25] | 2 | - | - | - | 600 | 24.8 | - |
| ImageNet¹ | MdeNAS (GPU)⁴ [25] | 2 | - | - | - | 600 | 25.9 | - |
| Cross-Domain | AdaptNAS-S (Rot-1) | 0.5 | 3.6 | 2.59 | 5.2 | 575 | 24.7 | 7.6 |
| Cross-Domain | AdaptNAS-S (Rot-4) | 1.8 | 3.5 | 2.77 | 5.0 | 552 | 25.3 | 7.8 |
| Cross-Domain | AdaptNAS-C (Rot-1) | 0.7 | 3.9 | 2.72 | 5.4 | 603 | 24.3 | 7.4 |
| Cross-Domain | AdaptNAS-C (Rot-4) | 2.0 | 3.7 | 2.50 | 5.3 | 583 | 24.2 | 7.4 |

¹ Includes methods using a subset of ImageNet.
² MnasNet takes 4.5 days on 64 TPUv2 chips for one search; the GPU days are estimated by [22].
³ FBNet and HM-NAS search on a subset of ImageNet with 100 classes.
⁴ MdeNAS searches with MobileNetV2 [11] as the backbone and is accelerated by the structure found in [5].

## 6 Conclusion

In this paper, the generalization issue in ProxyNAS is studied, and two versions of a generalization bound are proposed. Motivated by these bounds, we design the AdaptNAS method to find architectures with better generalizability. We provide a new perspective on NAS: instead of directly searching on ImageNet or its subset, optimizing the generalizability of architectures by adding a domain-distance constraint during the search can reach better performance with lower computation cost. Extensive experiments on CIFAR-10 and ImageNet demonstrate that AdaptNAS is a more affordable searching method with more controllable generalizability compared to the current state-of-the-art proxy and proxyless NAS methods.

## Broader Impact

This paper provides a novel perspective of cross-domain generalization in neural architecture search towards the efficient design of neural architectures with strong generalizability. This will lead to a better understanding of the generalizability of neural architectures. The proposed method can be used to design neural architectures for computer vision tasks with affordable computation cost.

## Acknowledgement

The authors would like to thank the Area Chair and the reviewers for their constructive comments. This work was supported by the Australian Research Council under Project DE180101438.

## References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898-916, 2010.
[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2007.
[4] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129-136, 2008.
[5] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[6] X. Chen, L. Xie, J. Wu, and Q. Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1294-1303, 2019.
[7] X. Dong and Y. Yang. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1761-1770, 2019.
[8] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
[9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.
[10] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[12] L. Jing and Y. Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[13] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, volume 4, pages 180-191, Toronto, Canada, 2004.
[14] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[16] Y. Li, M. Dong, Y. Wang, and C. Xu. Neural architecture search in a proxy validation loss landscape. In International Conference on Machine Learning, 2020.
[17] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[18] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[19] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[20] S. A. R. Shah, W. Wu, Q. Lu, L. Zhang, S. Sasidharan, P. DeMar, C. Guok, J. Macauley, E. Pouyoul, J. Kim, et al. AmoebaNet: An SDN-enabled network service for big data science. Journal of Network and Computer Applications, 119:70-82, 2018.
[21] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820-2828, 2019.
[22] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734-10742, 2019.
[23] S. Yan, B. Fang, F. Zhang, Y. Zheng, X. Zeng, M. Zhang, and H. Xu. HM-NAS: Efficient neural architecture search via hierarchical masking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[24] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu. CARS: Continuous evolution for efficient neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1829-1838, 2020.
[25] X. Zheng, R. Ji, L. Tang, B. Zhang, J. Liu, and Q. Tian. Multinomial distribution learning for effective neural architecture search. In Proceedings of the IEEE International Conference on Computer Vision, pages 1304-1313, 2019.
[26] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.