# AlphaNet: Improved Training of Supernets with Alpha-Divergence

Dilin Wang 1, Chengyue Gong *2, Meng Li *1, Qiang Liu 2, Vikas Chandra 1

Weight-sharing neural architecture search (NAS) is an effective technique for automating efficient neural architecture design. Weight-sharing NAS builds a supernet that assembles all the architectures as its sub-networks and jointly trains the supernet with the sub-networks. The success of weight-sharing NAS heavily relies on distilling the knowledge of the supernet to the sub-networks. However, we find that the widely used distillation divergence, i.e., KL divergence, may lead to student sub-networks that over-estimate or under-estimate the uncertainty of the teacher supernet, leading to inferior performance of the sub-networks. In this work, we propose to improve the supernet training with a more generalized α-divergence. By adaptively selecting the α-divergence, we simultaneously prevent the over-estimation and under-estimation of the uncertainty of the teacher model. We apply the proposed α-divergence based supernet training to both slimmable neural networks and weight-sharing NAS, and demonstrate significant improvements. Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes, including BigNAS, Once-for-All networks, and AttentiveNAS. We achieve ImageNet top-1 accuracy of 80.0% with only 444M FLOPs. Our code and pretrained models are available at https://github.com/facebookresearch/AlphaNet.

*Equal contribution. 1 Facebook. 2 Department of Computer Science, The University of Texas at Austin. Correspondence to: Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, Vikas Chandra. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## 1. Introduction

Designing accurate and computationally efficient neural network architectures is an important but challenging task. Neural architecture search (NAS) automates the neural network design by exploring an enormous architecture space and achieves state-of-the-art (SOTA) performance on various applications including image classification (Zoph & Le, 2017; Zoph et al., 2018), object detection (Ghiasi et al., 2019), semantic segmentation (Zhang et al., 2019) and natural language processing (Wang et al., 2020b). Conventional NAS approaches can be prohibitively expensive, as hundreds of candidate architectures need to be trained from scratch and evaluated (e.g., Tan et al., 2019; Zoph et al., 2018).

The supernet-based approach has recently emerged as a promising direction for efficient NAS. A supernet assembles all candidate architectures into a weight-sharing network, with each architecture corresponding to one sub-network. By training the sub-networks simultaneously with the supernet, different architectures can directly inherit the weights from the supernet for evaluation and deployment, which eliminates the huge cost of training or fine-tuning each architecture individually.

Though promising, simultaneously optimizing all sub-networks with weight-sharing makes the supernet training highly challenging (e.g., Yu et al., 2020; Cai et al., 2019a). To stabilize the supernet training and improve the performance of sub-networks, one widely used approach is in-place knowledge distillation (KD) (Yu & Huang, 2019).
In-place KD leverages the soft labels predicted by the largest sub-network in the supernet to supervise all the other sub-networks. By distilling the knowledge of the teacher model, the performance of the sub-networks can be improved significantly (Yu & Huang, 2019; Yu et al., 2020).

Standard knowledge distillation uses KL divergence to measure the discrepancy between the teacher and student networks. However, KL divergence penalizes the student model much more when it fails to cover one or more local modes of the teacher model (Murphy, 2012). Hence, the student model tends to over-estimate the uncertainty of the teacher model and suffers from an inaccurate approximation of the most important mode, i.e., the correct prediction of the teacher model.

To further enhance the supernet training, we propose to replace the KL divergence with a more generalized α-divergence (Amari, 1985; Minka et al., 2005). Specifically, by adaptively controlling α in the proposed divergence metric, we can penalize both the under-estimation and over-estimation of the teacher model uncertainty to encourage a more accurate approximation by the student models. While directly optimizing the proposed adaptive α-divergence may suffer from a high variance of the gradients, we further propose a simple technique to clip the gradients of our adaptive α-divergence to stabilize the training process. We show that the clipped gradients still implicitly define a valid divergence metric and hence yield a proper optimization objective for KD.

We empirically verify the proposed adaptive α-divergence on two notable applications of supernets, slimmable networks (Yu & Huang, 2019) and weight-sharing NAS (Yu et al., 2020; Wang et al., 2020a), on ImageNet. For weight-sharing NAS, we train a supernet containing both small (200M FLOPs) and large (2G FLOPs) sub-networks following Wang et al. (2020a). With the proposed adaptive α-divergence, we are able to train high-quality sub-networks, called AlphaNets, that surpass all prior state-of-the-art models in the range of 200 to 800 MFLOPs, such as EfficientNets (Tan & Le, 2019), OFANets (Cai et al., 2019a), and BigNAS (Yu et al., 2020). Specifically, AlphaNet-A4 achieves 80.0% accuracy with only 444M FLOPs.

## 2. Background

Training high-quality supernets is fundamental for weight-sharing NAS but non-trivial (Benyahia et al., 2019). Recently, in-place KD has been shown to be an effective mechanism that significantly improves the supernet performance (Yu & Huang, 2019; Yu et al., 2020).

To formalize the supernet training and in-place KD, consider a supernet with trainable parameters θ. Let A denote the collection of all sub-networks contained in the supernet. The goal of training a supernet is to learn θ such that all the sub-networks in A can be optimized simultaneously to achieve good accuracy.

The supernet training process with in-place KD is illustrated in Figure 1. At each training step, given a mini-batch of data, the supernet as well as several sub-networks are sampled. While the supernet is trained with the real labels, all the sampled sub-networks are supervised with the soft labels predicted by the supernet. Then, the gradients from all the sampled networks are aggregated before the supernet parameters are updated.
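Concretely, one such training step can be sketched in a few lines of PyTorch-style code. This is only an illustrative sketch under assumptions, not the released implementation: `supernet.max_subnet()` and `supernet.sample_subnet()` are hypothetical helpers standing in for whatever interface the supernet exposes for selecting sub-networks, and the simple loss weighting with a single coefficient γ mirrors the update rule formalized below.

```python
import torch.nn.functional as F

def inplace_kd_step(supernet, optimizer, images, labels, num_subnets=2, gamma=1.0):
    """One supernet training step with in-place KD (illustrative sketch).

    The full supernet is trained on the ground-truth labels, while sampled
    sub-networks are trained to match the supernet's soft predictions.
    `supernet.max_subnet()` and `supernet.sample_subnet()` are assumed helpers
    that return callables sharing the supernet's weights.
    """
    optimizer.zero_grad()

    # 1) Train the supernet (largest sub-network) with the true labels.
    teacher_logits = supernet.max_subnet()(images)
    loss = F.cross_entropy(teacher_logits, labels)

    # Soft labels from the supernet; gradients through the teacher are stopped.
    soft_labels = F.softmax(teacher_logits, dim=-1).detach()

    # 2) Train sampled sub-networks to mimic the supernet via KL-based KD.
    for _ in range(num_subnets):
        student_logits = supernet.sample_subnet()(images)
        log_q = F.log_softmax(student_logits, dim=-1)
        # KL(p || q) = sum_i p_i (log p_i - log q_i), averaged over the batch.
        loss = loss + gamma * F.kl_div(log_q, soft_labels, reduction="batchmean")

    # 3) Aggregate the gradients from all sampled networks, then update.
    loss.backward()
    optimizer.step()
```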
More formally, at training step t, the supernet parameters are updated by

$$\theta_t \leftarrow \theta_{t-1} - \epsilon\, g(\theta_{t-1}), \qquad g(\theta_{t-1}) = \nabla_\theta \Big[\, L_D(\theta) + \gamma\, \mathbb{E}_s\, L_{KD}([\theta, s];\ \theta_{t-1}) \,\Big]\Big|_{\theta = \theta_{t-1}}, \qquad (1)$$

where ε is the step size. Here, $L_D(\theta)$ is the standard cross-entropy loss of the supernet on a training dataset D, γ is the weight coefficient, and $L_{KD}([\theta, s];\ \theta_t)$ is the KD loss for distilling the supernet into a randomly sampled sub-network s, for which KL divergence has been widely used (e.g., Yu et al., 2020).

Figure 1. An illustration of training supernets with KD. Sub-networks are part of the supernet with weight-sharing.

Let $p(x; \theta)$ and $q(x; \theta, s)$ denote the output probabilities of the supernet and the sub-network s given input x. Then we have

$$L_{KD}([\theta, s];\ \theta_t) = \mathbb{E}_{x \sim D}\big[\, \mathrm{KL}\big(p(x; \theta_t)\ \|\ q(x; \theta, s)\big) \,\big], \qquad (2)$$

where $\mathrm{KL}(p\,\|\,q) = \mathbb{E}_p[\log(p/q)]$. Note that, as indicated in (2), the gradient through $p(x; \theta_t)$ in the KD loss is stopped. For notational simplicity, we denote p as the teacher model and q (or $q_\theta$) as the student models in the following. Additionally, note that the way KD is used in supernet training differs from the standard setting of, e.g., Hinton et al. (2015), where the teacher network is pretrained and fixed.

## 3. Supernet training with α-divergence

In this section, we analyze the limitations of using KL divergence in KD and propose to replace KL divergence with a more generalized α-divergence. We study the impact of different choices of α in the proposed divergence metric and further propose an adaptive algorithm to select α values during the supernet training. We also show that, while directly optimizing the α-divergence is challenging due to large gradient variance, a simple clipping strategy on the α-divergence can be very effective in stabilizing the training.

### 3.1. Classic KL based KD and its limitations

KL divergence has been widely used to measure the discrepancy in output probabilities between the teacher and student models in KD. One main drawback of KL divergence is that it cannot sufficiently penalize the student model when it over-estimates the uncertainty of the teacher model.

Figure 2. (a) Example 1: uncertainty under-estimation; the student network under-estimates the uncertainty of the teacher model and misses important local modes of the teacher model. (b) Example 2: uncertainty over-estimation; the student network over-estimates the uncertainty of the teacher model and misclassifies the most dominant mode of the teacher model. (c) The corresponding α-divergences between the student model and the teacher model for Examples 1 and 2, as a function of α. Note that KL divergence is a special case of the α-divergences with α = 1. We refer to the uncertainty as the entropy of the predictions after the softmax layer of the network.

Let p and q denote the output probabilities of the teacher and student models, respectively. The KL divergence between the teacher and student models is $\mathrm{KL}(p\,\|\,q) = \mathbb{E}_p[\log(p/q)]$. When p > 0, to keep KL(p||q) finite, we must have q > 0; this is the so-called zero-avoiding property of KL. In contrast, when p = 0, having q > 0 is not penalized.
For example, as shown in Figure 2 (b) and (c), even though the student model over-estimates the uncertainty of the teacher model and predicts the wrong class (class 4), the KL divergence is still small.

The over-estimation in Example 2 would be penalized with a larger magnitude under other types of divergences, e.g., the reverse KL divergence KL(q||p). The reverse KL divergence $\mathrm{KL}(q\,\|\,p) = \mathbb{E}_q[\log(q/p)]$ is infinite if p = 0 and q > 0. Hence, if p = 0 we must ensure q = 0; this is known as the zero-forcing property (Murphy, 2012). Therefore, minimizing the reverse KL divergence encourages the student model q to avoid low-probability modes of p and to focus on the modes with high probabilities, and thus may under-estimate the uncertainty of the teacher model, as shown in Example 1 in Figure 2.

Hence, a natural question is whether it is possible to generalize the KL divergence to simultaneously suppress both the under-estimation and over-estimation of the teacher model uncertainty during the supernet training.

### 3.2. KD with adaptive α-divergence

Our observations in Figure 2 motivate us to design a new KD objective that simultaneously penalizes both over-estimation and under-estimation of the teacher model uncertainty. We first generalize the typical KL divergence with the more flexible α-divergence (Minka et al., 2005). For $\alpha \in \mathbb{R} \setminus \{0, 1\}$, the α-divergence is defined as

$$D_\alpha(p\,\|\,q) = \frac{1}{\alpha(\alpha-1)} \sum_{i=1}^{m} q_i \left[ \left(\frac{p_i}{q_i}\right)^{\!\alpha} - 1 \right], \qquad (3)$$

where $q = [q_i]_{i=1}^m$ and $p = [p_i]_{i=1}^m$ are two discrete distributions over m categories. The α-divergence includes a large spectrum of classic divergence measures. In particular, the KL divergence KL(p||q) is the limit of $D_\alpha(p\,\|\,q)$ as α → 1, while the reverse KL divergence KL(q||p) is the limit of $D_\alpha(p\,\|\,q)$ as α → 0.

A key feature of the α-divergence is that we can choose to focus on penalizing different types of discrepancies (under-estimation or over-estimation) by choosing different values of α. For example, as shown in Figure 2 (c), when α is negative, $D_\alpha(p\,\|\,q)$ is large when q is more widely spread than p (when q over-estimates the uncertainty in p), and is small when q is more concentrated than p (when q under-estimates the uncertainty in p). The trend is the opposite when α is positive: under-estimation is penalized more heavily than over-estimation.

To simultaneously alleviate the over-estimation and under-estimation problems when training the supernet, we consider a positive α₊ together with a negative α₋, and propose to use the maximum of $D_{\alpha_+}(p\,\|\,q)$ and $D_{\alpha_-}(p\,\|\,q)$ in the KD loss function:

$$D_{\alpha_+, \alpha_-}(p\,\|\,q) = \max\Big\{ \underbrace{D_{\alpha_-}(p\,\|\,q)}_{\text{penalizing over-estimation}},\ \ \underbrace{D_{\alpha_+}(p\,\|\,q)}_{\text{penalizing under-estimation}} \Big\}.$$

Our KD loss then changes from Eqn. (2) to

$$L_{KD}([\theta, s];\ \theta_t) = \mathbb{E}_{x \sim D}\big[\, D_{\alpha_+, \alpha_-}\big(p(x; \theta_t)\ \|\ q(x; \theta, s)\big) \,\big]. \qquad (4)$$

We denote this KD strategy, which always chooses the maximum of $D_{\alpha_-}$ and $D_{\alpha_+}$ to optimize, as Adaptive-KD.
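To make the effect of α concrete, the small NumPy example below evaluates $D_\alpha(p\,\|\,q)$ from Eqn. (3) for an over-confident and an over-dispersed student against the same teacher, in the spirit of Figure 2. The distributions are made up for illustration, and α₊ is taken slightly below 1 only so that the closed form (3) can be reused; α₊ = 1 itself corresponds to the KL limit.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) as in Eqn. (3), for alpha not in {0, 1}."""
    return np.sum(q * ((p / q) ** alpha - 1.0)) / (alpha * (alpha - 1.0))

def adaptive_divergence(p, q, alpha_neg=-1.0, alpha_pos=1.0 - 1e-4):
    """max(D_{alpha-}, D_{alpha+}): penalizes both over- and under-estimation."""
    return max(alpha_divergence(p, q, alpha_neg),
               alpha_divergence(p, q, alpha_pos))

teacher = np.array([0.60, 0.20, 0.10, 0.06, 0.04])
under   = np.array([0.96, 0.01, 0.01, 0.01, 0.01])  # too confident: misses modes
over    = np.array([0.25, 0.20, 0.20, 0.18, 0.17])  # too flat: over-estimates

for name, q in [("under-estimation", under), ("over-estimation", over)]:
    print(name,
          "alpha=-1: %.3f" % alpha_divergence(teacher, q, -1.0),
          "alpha~1 (KL): %.3f" % alpha_divergence(teacher, q, 1.0 - 1e-4),
          "adaptive: %.3f" % adaptive_divergence(teacher, q))
```

Running this shows the pattern described above: the KL-like value (α near 1) dominates for the over-confident student, the negative-α value dominates for the over-dispersed student, and the adaptive maximum penalizes both.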
### 3.3. Stabilizing α-divergence KD

One would prefer to set both |α₊| and |α₋| to be large, to ensure that the student model is sufficiently penalized when it either under-estimates or over-estimates the uncertainty of the teacher model. However, directly optimizing the α-divergence with large |α| is often challenging in practice. Consider the gradient of the α-divergence:

$$\nabla_\theta D_\alpha(p\,\|\,q_\theta) = -\frac{1}{\alpha}\, \mathbb{E}_{q_\theta}\!\left[ \left(\frac{p}{q_\theta}\right)^{\!\alpha} \nabla_\theta \log q_\theta \right].$$

If |α| is large, the powered term $(p/q_\theta)^{\alpha}$ can be quite significant and cause the training process to become unstable. To enhance training stability, we clamp the maximum value of $(p/q_\theta)^{\alpha}$ to β and obtain

$$\hat{\nabla}_\theta D_\alpha(p\,\|\,q_\theta) = -\frac{1}{\alpha}\, \mathbb{E}_{q_\theta}\!\left[ \mathrm{Clip}_\beta\!\left(\Big(\frac{p}{q_\theta}\Big)^{\!\alpha}\right) \nabla_\theta \log q_\theta \right], \qquad (5)$$

where $\mathrm{Clip}_\beta(t) = \min(t, \beta)$.

Eqn. (5) is a simple yet effective heuristic approximation of $\nabla_\theta D_\alpha(p\,\|\,q_\theta)$. It is important to note that Eqn. (5) equals the exact gradient of a special f-divergence between p and $q_\theta$; hence, our updates still amount to minimizing a valid divergence. Note that the clipping function $\mathrm{Clip}_\beta(\cdot)$ is only partially differentiable, so naively clipping $(p/q_\theta)^{\alpha}$ in Eqn. (3) may stop gradients from back-propagating through the density-ratio terms, yielding gradients that are not the gradients of a valid divergence.

To show that we still optimize a valid divergence with Eqn. (5), note that, for a convex function $f : [0, +\infty) \to \mathbb{R}$, the f-divergence between p and $q_\theta$ is defined as

$$D_f(p\,\|\,q_\theta) = \mathbb{E}_{q_\theta}\!\left[ f\!\left(\frac{p}{q_\theta}\right) \right],$$

and its gradient w.r.t. θ is

$$\nabla_\theta D_f(p\,\|\,q_\theta) = -\mathbb{E}_{q_\theta}\!\left[ \rho_f\!\left(\frac{p}{q_\theta}\right) \nabla_\theta \log q_\theta \right], \qquad \text{where } \rho_f(t) = f'(t)\,t - f(t)$$

(Wang et al., 2018). Note that the α-divergence is a special case of the f-divergence with $f(t) = t^{\alpha}/(\alpha(\alpha-1))$.

**Proposition 3.1.** There exists a convex function $\hat{f} : (0, +\infty) \to \mathbb{R}$ such that $\hat{\nabla}_\theta D_\alpha(p\,\|\,q_\theta)$ in (5) is the exact gradient of $D_{\hat{f}}(p\,\|\,q_\theta)$, that is, $\hat{\nabla}_\theta D_\alpha(p\,\|\,q_\theta) = \nabla_\theta D_{\hat{f}}(p\,\|\,q_\theta)$.

*Proof.* Let $\rho(t) = \frac{1}{\alpha}\mathrm{Clip}_\beta(t^{\alpha})$. We just need to find an $\hat{f}$ such that $\rho_{\hat{f}}(t) = \hat{f}'(t)\,t - \hat{f}(t) = \rho(t)$. Taking derivatives on both sides gives $\hat{f}''(t)\,t = \rho'(t)$. This gives $\hat{f}''(t) = \rho'(t)/t$ and hence $\hat{f}(t) = \iint \rho'(t)/t \; \mathrm{d}t$, where $\iint$ denotes the second-order antiderivative (indefinite integral). Because $\rho(t)$ is non-decreasing, we have $\rho'(t)/t \geq 0$ for $t > 0$, and hence $\hat{f}$ is convex on $(0, +\infty)$.

In practice, we apply Eqn. (5) to the α-divergence used in Eqn. (4). By clipping the values of the importance weights, what we optimize is still a divergence metric, but one that is more friendly to gradient-based optimization.

**Algorithm 1** Training supernets with α-divergence
1. **Input:** the adaptive α-divergence range given by α₋ and α₊, a clipping factor β, a supernet with parameters θ, and a search space A.
2. **While** not converged:
3. Sample a mini-batch of data B.
4. Train the supernet θ with the true labels from B.
5. Draw k sub-networks {s1, ..., sk} from A; train the sub-networks to mimic the supernet on the mini-batch B with the KD loss defined in Eqn. (4), using the clipped gradients in Eqn. (5).
6. **End while.**
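To make step 5 of Algorithm 1 concrete, below is a minimal PyTorch-style sketch of the KD loss, written as a surrogate objective whose gradient with respect to the student matches the clipped gradient in Eqn. (5): the importance weight $q \cdot \mathrm{Clip}_\beta((p/q)^{\alpha})$ is detached so that back-propagation flows only through log q. The defaults mirror the settings reported in Section 4 (α₋ = −1, α₊ = 1, β = 5); the epsilon smoothing and the per-sample selection of α are implementation assumptions of this sketch rather than details taken from the paper.

```python
import torch

def _alpha_div(p, q, alpha, eps=1e-8):
    """D_alpha(p || q) per sample, Eqn. (3); alpha = 1 falls back to KL(p || q)."""
    ratio = (p + eps) / (q + eps)
    if abs(alpha - 1.0) < 1e-6:                       # KL(p || q) as the alpha -> 1 limit
        return (p * ratio.log()).sum(dim=-1)
    return (q * (ratio.pow(alpha) - 1.0)).sum(dim=-1) / (alpha * (alpha - 1.0))

def _clipped_surrogate(p, log_q, alpha, beta, eps=1e-8):
    """Surrogate whose gradient w.r.t. the student matches Eqn. (5).

    The weight q * Clip_beta((p/q)^alpha) is detached, so gradients only
    flow through log_q, reproducing -(1/alpha) E_q[Clip_beta((p/q)^alpha) grad log q].
    """
    q = log_q.exp()
    ratio_pow = ((p + eps) / (q + eps)).pow(alpha).clamp(max=beta)
    weight = (q * ratio_pow).detach()
    return -(weight * log_q).sum(dim=-1) / alpha

def adaptive_kd_loss(student_logits, teacher_logits,
                     alpha_neg=-1.0, alpha_pos=1.0, beta=5.0):
    """Adaptive alpha-divergence KD loss (sketch of Algorithm 1, step 5)."""
    p = torch.softmax(teacher_logits, dim=-1).detach()   # teacher is not updated
    log_q = torch.log_softmax(student_logits, dim=-1)
    q = log_q.exp()

    # Pick, per sample, the alpha whose divergence is currently larger (Eqn. 4).
    with torch.no_grad():
        use_neg = _alpha_div(p, q, alpha_neg) > _alpha_div(p, q, alpha_pos)

    loss_neg = _clipped_surrogate(p, log_q, alpha_neg, beta)  # penalizes over-estimation
    loss_pos = _clipped_surrogate(p, log_q, alpha_pos, beta)  # penalizes under-estimation
    return torch.where(use_neg, loss_neg, loss_pos).mean()
```

In the supernet setting, `teacher_logits` come from the supernet and `student_logits` from a sampled sub-network, so `adaptive_kd_loss` plays the role of $L_{KD}$ in Eqn. (1).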
## 4. Experiments

We apply our Adaptive-KD to improve notable supernet-based applications, including slimmable neural networks (Yu & Huang, 2019) and weight-sharing NAS (e.g., Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a). We provide an overview of our algorithm for training the supernet in Algorithm 1.

**Adaptive-KD settings** In our algorithm, α₋ and α₊ control the magnitude of the penalty on over-estimation and under-estimation, respectively, and β controls the range of the density ratios between the teacher model and the student model. We find that our method performs robustly over a wide range of choices of α₋, α₊ and β, yielding consistent improvements over the KL based KD baseline. Throughout the experimental section, we set α₋ = −1, α₊ = 1 and β = 5.0 as the defaults for our method. We provide detailed ablation studies on these hyper-parameters in Section 4.4.

### 4.1. Slimmable Neural Networks

Slimmable neural networks (Yu et al., 2018; Yu & Huang, 2019) are examples of supernets that support a wide range of channel width configurations. The search space A of slimmable networks contains networks with different widths, while all the other architecture configurations (e.g., depth, convolution type, kernel size) are the same. This way, slimmable networks allow different devices or applications to adaptively adjust the model width on the fly according to on-device resource constraints, achieving the optimal accuracy vs. energy-efficiency trade-off.

Table 1. Top-1 validation accuracy on ImageNet for Slimmable MobileNetV1 networks (denoted by MbV1) and Slimmable MobileNetV2 networks (denoted by MbV2) trained with different KD strategies. Columns are channel width multipliers.

| Model | Method | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MbV1 | w/o KD | 53.9 | 55.3 | 57.1 | 59.1 | 61.1 | 62.9 | 64.0 | 65.8 | 66.9 | 67.9 | 68.8 |
| MbV1 | w/ KL-KD | 56.4 | 57.8 | 59.5 | 61.0 | 63.0 | 64.4 | 65.5 | 67.1 | 68.3 | 69.1 | 69.8 |
| MbV1 | w/ Adaptive-KD (ours) | 56.4 | 57.9 | 59.7 | 61.7 | 63.4 | 65.0 | 66.2 | 67.7 | 68.8 | 69.5 | 70.1 |
| MbV2 | w/o KD | - | - | 61.9 | 62.8 | 63.7 | 64.5 | 65.1 | 67.2 | 67.7 | 68.3 | 69.0 |
| MbV2 | w/ KL-KD | - | - | 63.2 | 64.4 | 65.1 | 66.0 | 66.5 | 68.4 | 69.2 | 69.5 | 70.1 |
| MbV2 | w/ Adaptive-KD (ours) | - | - | 63.7 | 64.6 | 65.6 | 66.3 | 66.9 | 68.7 | 69.3 | 69.9 | 70.5 |

Table 2. Comparison to KL based KD with different temperatures (T). We report top-1 validation accuracy on ImageNet for slimmable MobileNetV1 and MobileNetV2 networks, denoted by MbV1 and MbV2, respectively.

| Model | Method | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MbV1 | w/ KL-KD (T=0.5) | 55.1 | 56.0 | 57.6 | 59.1 | 61.4 | 62.5 | 64.0 | 65.6 | 66.9 | 67.9 | 68.7 |
| MbV1 | w/ KL-KD (T=2.0) | 55.4 | 57.0 | 58.8 | 60.7 | 62.6 | 64.1 | 65.3 | 66.6 | 67.9 | 68.7 | 69.5 |
| MbV1 | w/ KL-KD (T=4.0) | 48.7 | 50.7 | 53.1 | 55.9 | 58.8 | 60.9 | 62.7 | 64.6 | 66.0 | 67.4 | 68.3 |
| MbV2 | w/ KL-KD (T=0.5) | - | - | 61.7 | 62.9 | 63.8 | 64.6 | 65.0 | 67.4 | 68.4 | 68.8 | 69.8 |
| MbV2 | w/ KL-KD (T=2.0) | - | - | 62.6 | 63.9 | 64.8 | 65.6 | 66.4 | 68.1 | 68.6 | 69.1 | 70.0 |
| MbV2 | w/ KL-KD (T=4.0) | - | - | 59.3 | 60.9 | 62.2 | 63.1 | 64.0 | 66.3 | 67.1 | 67.7 | 68.8 |

**Settings** We closely follow the training recipe provided in Yu & Huang (2019), and use slimmable MobileNetV1 (Howard et al., 2017) and slimmable MobileNetV2 (Sandler et al., 2018) as our testbed. Specifically, we train slimmable MobileNetV1 to support arbitrary dynamic widths in the range [0.25, 1.0], and train slimmable MobileNetV2 to support dynamic widths in [0.35, 1.0]. We adopt the sandwich rule sampling proposed in Yu & Huang (2019) for training: at each training iteration, we sample the sub-network with the largest channel width, the sub-network with the smallest channel width, and two random sub-networks to accumulate the gradients. We train the supernet with ground-truth labels and train all sub-sampled sub-networks with KD following (1). For our baseline KD strategy, we set the KD coefficient γ to be the number of sub-networks sampled, i.e., γ = 3, as the default following Yu & Huang (2019). To evaluate the effectiveness of our method, we simply replace the baseline KL-based KD loss used in Yu & Huang (2019) with our adaptive KD loss in (4). Additionally, we train all models for 360 epochs using the SGD optimizer with momentum 0.9, weight decay 1e-5 and dropout 0.2. We use cosine learning rate decay with an initial learning rate of 0.8 and a batch size of 2048 on 16 GPUs. Following Yu & Huang (2019), we evaluate on ImageNet (Deng et al., 2009). We note that the baseline models trained with our hyper-parameter settings outperform those reported in Yu & Huang (2019).
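For concreteness, one sandwich-rule iteration can be sketched as below. The `model.set_width(...)` call is a hypothetical interface for switching the active channel width of a slimmable network, and `kd_loss_fn` stands for the distillation loss (the baseline KL loss, or the adaptive loss sketched in Section 3); this is only an illustration of the recipe above, not the exact training code.

```python
import random
import torch.nn.functional as F

def sandwich_step(model, optimizer, images, labels, kd_loss_fn,
                  min_width=0.25, max_width=1.0, num_random=2):
    """One sandwich-rule training step for a slimmable network (sketch)."""
    optimizer.zero_grad()

    # Largest width: trained with ground-truth labels, and used as the teacher.
    model.set_width(max_width)
    teacher_logits = model(images)
    loss = F.cross_entropy(teacher_logits, labels)

    # Smallest width plus `num_random` random widths: trained with KD.
    widths = [min_width] + [random.uniform(min_width, max_width)
                            for _ in range(num_random)]
    for width in widths:
        model.set_width(width)
        student_logits = model(images)
        # gamma equals the number of sampled sub-networks here, so summing the
        # per-sub-network KD losses matches gamma * E_s[L_KD] in Eqn. (1).
        loss = loss + kd_loss_fn(student_logits, teacher_logits)

    loss.backward()   # gradients from all widths accumulate in the shared weights
    optimizer.step()
```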
**Results** We summarize our results in Table 1, reporting top-1 accuracy on ImageNet. Here, "w/o KD" denotes the training strategy that excludes the effect of KD: all sub-networks are trained with ground-truth labels via cross entropy. As we can see from Table 1, both the baseline KL based KD (denoted as w/ KL-KD) and our adaptive KD (denoted as w/ Adaptive-KD) yield significant performance improvements compared to w/o KD. Our results confirm the importance of KD for training slimmable networks. Meanwhile, our Adaptive-KD further improves on KL based KD for all the channel width configurations evaluated, for both Slimmable MobileNetV1 (MbV1) and Slimmable MobileNetV2 (MbV2).

**Comparison to KD with different temperature coefficients** As discussed in Hinton et al. (2015), for standard KL based KD one can soften (or sharpen) the probabilities of the teacher and the student model by applying a temperature in their softmax layers. The best distillation performance might be achieved at a temperature other than the normally used temperature of 1. To ensure a fair comparison, we further evaluate the baseline KL based KD under different temperature (T) settings following the approach in Hinton et al. (2015); we refer the reader to Appendix C for a detailed discussion of this topic. In particular, we test temperatures of 0.5, 2 and 4. We summarize our results in Table 2. We find that all these settings systematically perform worse than the simple KD strategy without temperature scaling, i.e., T = 1. Additionally, the models trained via our method yield the best performance.

### 4.2. Weight-sharing NAS

We apply our Adaptive-KD to improve the training of the supernet for weight-sharing NAS (Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a). Please see Appendix A for a brief introduction to weight-sharing NAS. Note that one main procedure of weight-sharing NAS is to simultaneously train all sub-networks specified in the search space to convergence. Similar to training slimmable neural networks, this is often achieved by enforcing all sub-networks to learn from the supernet with KL based KD (e.g., Yu et al., 2020).

**Training** Our training recipe follows Wang et al. (2020a), except that we use uniform sampling for simplicity. We pursue minimal code modifications to ablate the effectiveness of our KD strategy. We evaluate on the ImageNet dataset (Deng et al., 2009). All training details and the search space we used are discussed in Appendix B. We use the update rule defined in (1) to train the supernet. Following Wang et al. (2020a) and Yu et al. (2020), at each iteration we train the supernet with ground-truth labels, and simultaneously train the smallest sub-network and two random sub-networks with KD. In this way, a total of 4 networks are trained at each iteration.

**Evaluation** We compare the accuracy vs. FLOPs Pareto fronts formed by supernets trained with different KD strategies. To estimate the performance Pareto front, we proceed as follows: 1) we first randomly sample 512 sub-networks from the supernet and estimate their accuracy on the ImageNet validation set; 2) we apply crossover and random mutation to the best-performing 128 sub-networks following Wang et al. (2020a), fixing both the crossover size and the mutation size to 128, which yields 256 new sub-networks whose performance we then evaluate; 3) we repeat the second step 20 times. The total number of sub-networks thus evaluated is 5,376.
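This Pareto-front estimation is a simple evolutionary loop over architecture configurations; a compact sketch is given below. The helpers `sample_arch()`, `evaluate(arch)`, `crossover(a, b)` and `mutate(arch)` are hypothetical stand-ins for sampling a sub-network configuration, measuring its top-1 accuracy with weights inherited from the supernet, and recombining or perturbing configurations.

```python
import random

def estimate_pareto_front(sample_arch, evaluate, crossover, mutate,
                          n_init=512, n_top=128, n_crossover=128,
                          n_mutation=128, n_rounds=20):
    """Evolutionary estimate of the accuracy-vs-FLOPs Pareto front (sketch)."""
    # Step 1: randomly sample sub-networks and score them on the validation set.
    pool = [(evaluate(arch), arch) for arch in
            (sample_arch() for _ in range(n_init))]

    for _ in range(n_rounds):                       # Step 3: repeat step 2
        # Step 2: keep the best-performing sub-networks as parents ...
        pool.sort(key=lambda item: item[0], reverse=True)
        parents = [arch for _, arch in pool[:n_top]]

        # ... and generate new candidates by crossover and random mutation.
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(n_crossover)]
        children += [mutate(random.choice(parents)) for _ in range(n_mutation)]

        pool.extend((evaluate(arch), arch) for arch in children)

    return pool   # all evaluated (accuracy, architecture) pairs
```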
**Results** As we can see from Figure 3 (a), Adaptive-KD achieves a significantly better Pareto frontier compared to the KL-based KD baseline (denoted as w/ KL-KD) and the simple training strategy without KD (denoted as w/o KD). Figures 3 (b) and (c) plot the convergence curves of the smallest sub-network and the supernet, respectively. Our method adaptively optimizes a more difficult KD loss between the supernet and the sub-networks, yielding slightly slower convergence in the early stage of training but better performance towards the end of training.

Figure 3. (a) Comparison of the Pareto-front performance of supernets trained via KL based KD and our adaptive KD, respectively; each dot represents a sub-network evaluated during the evolutionary search step. (b)-(c) Training curves of the smallest sub-network and the largest sub-network (i.e., the supernet).

In Figure 4, we group sub-networks according to their FLOPs and visualize five statistics for each group, including the minimum, the first quartile, the median, the third quartile and the maximum accuracy. Our method learns significantly better sub-networks under all of these statistics.

Figure 4. Top-1 accuracy on ImageNet from weight-sharing NAS with KL-based KD and Adaptive-KD. Each box plot shows the performance of sampled sub-networks within each FLOPs regime. From bottom to top, each horizontal bar represents the minimum accuracy, the first quartile, the median, the third quartile and the maximum accuracy, respectively.

**Improvement on SOTA** As we use the same search space as Wang et al. (2020a), we further evaluate the discovered AttentiveNAS models (from A0 to A6) with the supernet weights learned by our adaptive KD. We refer to our models as AlphaNet models.

Table 3. Performance on the discovered networks of Wang et al. (2020a); each (#M) denotes the FLOPs of the corresponding model. The "w/ KL-KD + Attentive Sampling" row uses additional attentive sampling (Wang et al., 2020a) for training the supernet. We denote our models as AlphaNet models. Here, symmetric KL refers to a combination of the KL and the reverse KL divergence, i.e., KL(q||p) + KL(p||q).

| Method | A0 (203M) | A1 (279M) | A2 (317M) | A3 (357M) | A4 (444M) | A5 (491M) | A6 (709M) |
|---|---|---|---|---|---|---|---|
| w/o KD | 73.8 | 75.4 | 75.6 | 76.0 | 76.8 | 77.1 | 77.9 |
| w/ KL-KD | 77.0 | 78.2 | 78.5 | 78.8 | 79.3 | 79.6 | 80.1 |
| w/ Symmetric KL-KD | 77.0 | 78.4 | 78.5 | 78.7 | 79.3 | 79.5 | 79.9 |
| w/ KL-KD + Attentive Sampling | 77.3 | 78.4 | 78.8 | 79.1 | 79.8 | 80.1 | 80.7 |
| w/ Adaptive-KD (ours, AlphaNet) | 77.8 | 78.9 | 79.1 | 79.4 | 80.0 | 80.3 | 80.8 |
As we can see from Table 3, our Adaptive-KD significantly improves on classic KL based KD, yielding an average improvement of 0.7% in top-1 accuracy from A0 to A6. We also compare with symmetric KL based KD (namely, KL(p||q) + KL(q||p)). The corresponding results are no better than those obtained with standard KL based KD training. This is probably because the two KL terms produce conflicting gradients during training, which may lead to inferior final performance. Additionally, our AlphaNet models outperform all corresponding AttentiveNAS models (Wang et al., 2020a), which require building Pareto-aware sampling distributions with additional computational overhead.

We further compare our AlphaNet against prior-art NAS baselines, including EfficientNet (Tan & Le, 2019), FBNetV3 (Dai et al., 2020), BigNAS (Yu et al., 2020), OFA (Cai et al., 2019a), MobileNetV3 (Howard et al., 2019), FairNAS (Chu et al., 2019) and MnasNet (Tan et al., 2019), in Figure 5. Our method outperforms all the baselines evaluated, establishing new SOTA accuracy vs. FLOPs trade-offs on ImageNet. For example, our model achieves 77.8% top-1 accuracy with only 203M FLOPs; under a similar FLOPs constraint, the corresponding top-1 accuracy is 75.2% with 219M FLOPs for MobileNetV3 and 76.5% with 242M FLOPs for BigNAS. Compared to OFA, our model achieves the same 80.0% top-1 accuracy with 35% fewer FLOPs (444M vs. 595M) and the same 79.1% top-1 accuracy with 26% fewer FLOPs (317M vs. 400M).

Figure 5. Comparison with prior-art NAS approaches on ImageNet. "#75ep" denotes models further finetuned for 75 epochs with weights inherited from the corresponding supernet.

### 4.3. Transfer learning

Here we show that our AlphaNet models are not overfitted to ImageNet and that the knowledge learned on ImageNet can be transferred to other datasets as well. Specifically, we take our AlphaNet-A0 and AlphaNet-A6 models pretrained on ImageNet and fine-tune them on a number of transfer learning benchmarks.
We closely follow the training settings in EfficientNet (Tan & Le, 2019) and GPipe (Huang et al., 2018). We use SGD with momentum of 0.9, label smoothing of 0.1 and dropout of 0.5. All models are fine-tuned for 150 epochs with a batch size of 64. Following Huang et al. (2018), we search for the best learning rate and weight decay on a held-out subset (20%) of the training data.

**Transfer learning results** We evaluate on five transfer learning benchmark datasets, including Oxford Flowers (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), Food-101 (Bossard et al., 2014), Stanford Cars (Krause et al., 2013) and FGVC Aircraft (Maji et al., 2013). As we can see from Table 4, our AlphaNet-A0 and AlphaNet-A6 models lead to significantly better transfer learning accuracy compared to the EfficientNet-B0 and EfficientNet-B1 models.

Table 4. Comparison of transfer learning accuracy. Eff and Alp denote EfficientNet and AlphaNet, respectively. All networks are pretrained on ImageNet and then finetuned on the transfer learning datasets. EfficientNet-B0 and B1 have model sizes of 390 MFLOPs and 700 MFLOPs, respectively; AlphaNet-A0 and A6 use 203 MFLOPs and 709 MFLOPs, respectively.

| Dataset | Eff-B0 | Alp-A0 | Eff-B1 | Alp-A6 |
|---|---|---|---|---|
| Oxford Flowers | 97.2 | 97.7 | 97.8 | 98.7 |
| Oxford-IIIT Pets | 91.2 | 91.5 | 92.4 | 92.9 |
| Food-101 | 87.6 | 88.3 | 89.0 | 89.6 |
| Stanford Cars | 91.0 | 91.5 | 92.2 | 92.6 |
| FGVC Aircraft | 88.1 | 88.5 | 88.7 | 89.1 |

### 4.4. Additional results

**Robustness w.r.t. the clipping factor β** We follow the training and evaluation settings in Section 4.2 and study the effect of β. In Figure 6 (a), we group sub-networks according to their FLOPs and report the improvement of the maximum top-1 accuracy of each FLOPs group over the result of the KL based KD baseline. As shown in Figure 6 (a), our algorithm is robust to the choice of β: it works with a large range of β, from 1 to 10, yielding consistent improvements over the classic KL based KD baseline, and our default setting β = 5 achieves the best performance in all FLOPs regimes evaluated.

Figure 6. Relative accuracy compared to the results of KL based KD. (a) We fix α₋ = −1, α₊ = 1 and study the effect of the clipping factor β. (b) We set β = 5 as the default and study the impact of α₋ and α₊.

**Robustness w.r.t. α** We ablate the impact of both α₋ and α₊ under the same settings as in Section 4.2, fixing β = 5. We present our findings in Figure 6 (b). Firstly, we test α₋ ∈ {−2, −1, 0} with α₊ fixed to 1. A more negative α₋ (e.g., α₋ = −2) defines a more difficult objective that brings optimization challenges, while a larger α₋ (e.g., α₋ = 0) makes the resulting KD loss less discriminative with regard to uncertainty over-estimation. Overall, α₋ = −1 achieves a good balance between optimization difficulty and over-estimation penalization, yielding the best performance. Secondly, we vary α₊ from 0.5 to 2, with α₋ fixed to −1; similarly, we find that α₊ = 1 yields the best performance. Lastly, we set both α₋ = α₊ = −1. In this case, we still achieve better performance than the KL based KD baseline, indicating the importance of penalizing over-estimation when training sub-networks. Also, our adaptive KD, which regularizes both over-estimation and under-estimation, achieves better performance in general.

**Improvement on single network training** To further demonstrate the broader applicability of our method, we apply our Adaptive-KD to train a single neural network with a pretrained teacher model, as in the conventional KD setup (see Appendix C). Specifically, in Table 5 we use a MobileNetV3 1.0 (Howard et al., 2019) as our teacher model and train MobileNetV3 0.5 and 0.75 as our student models. In Table 6, we provide additional comparisons for training ShuffleNets (Ma et al., 2018), MobileNetV2 models (Sandler et al., 2018) and more recent vision transformers (Touvron et al., 2020)¹ with a fixed temperature of 1.0. We summarize the top-1 validation accuracy on ImageNet of the models trained with different KD strategies in Tables 5 and 6. The student models trained via our method yield the best accuracy.

¹ https://github.com/facebookresearch/deit

Table 5. Comparison to KL based KD with a fixed teacher model on ImageNet. Here, T denotes the temperature used in classic KL based KD (see Appendix C). We use a MobileNetV3 1.0 as our teacher model, which yields 75.4% top-1 validation accuracy on ImageNet. All MobileNetV3 student models are trained for 360 epochs with cosine learning rate decay.

| Model | w/o KD | w/ KL-KD (T=1) | T=2 | T=4 | Adaptive-KD (ours) |
|---|---|---|---|---|---|
| MobileNetV3 0.75 | 73.3 | 73.9 | 72.2 | 70.8 | 73.9 |
| MobileNetV3 0.5 | 69.6 | 69.8 | 65.4 | 63.6 | 70.0 |

Table 6. Additional KD results on ImageNet. Our MobileNetV1 and MobileNetV2 teachers have top-1 accuracies of 73.2% and 72.9%, respectively. All ShuffleNet (Ma et al., 2018) and MobileNetV2 student models are trained for 120 epochs with standard random crop and resize data augmentation. For DeiT-tiny (Touvron et al., 2020), we exactly follow the DeiT training settings and use a RegNetY (Radosavovic et al., 2020) as the teacher model.

| Teacher | MobileNetV1 1.0x | MobileNetV1 1.0x | MobileNetV2 1.0x | MobileNetV2 1.0x | RegNetY |
|---|---|---|---|---|---|
| Student | ShuffleNet 0.5x | ShuffleNet 1.0x | MobileNetV2 0.25x | MobileNetV2 0.5x | DeiT-tiny |
| w/ KL-KD (T=1) | 60.3 | 69.3 | 54.4 | 65.3 | 74.6 |
| Adaptive-KD (ours) | 61.1 | 69.5 | 55.0 | 65.7 | 75.2 |
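For reference, the fixed-teacher KL baseline in Tables 5 and 6 follows the classic formulation of Hinton et al. (2015). A minimal sketch with temperature scaling is shown below; the equal weighting of the two terms and the T² factor (which keeps the soft-target gradients on the same scale as the hard-label term) are conventional choices assumed by this sketch, see Appendix C of the paper for the exact setting. The Adaptive-KD rows replace the KL term with the adaptive α-divergence loss of Section 3, using T = 1.

```python
import torch.nn.functional as F

def fixed_teacher_kd_loss(student_logits, teacher_logits, labels,
                          temperature=1.0, kd_weight=1.0):
    """Classic KD against a fixed, pretrained teacher (Hinton et al., 2015) - sketch."""
    # Hard-label cross entropy on the ground truth.
    ce = F.cross_entropy(student_logits, labels)

    # Soft targets: both output distributions are smoothed by the temperature T.
    log_q = F.log_softmax(student_logits / temperature, dim=-1)
    p = F.softmax(teacher_logits / temperature, dim=-1).detach()
    kd = F.kl_div(log_q, p, reduction="batchmean") * temperature ** 2

    return ce + kd_weight * kd
```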
## 5. Related Work

**Neural architecture search (NAS)** NAS offers a powerful tool to automate the design of neural architectures for challenging machine learning tasks (e.g., Fang et al., 2020; Fu et al., 2021; Moons et al., 2020; Li et al., 2020; Peng et al., 2020). Early NAS solutions usually build on black-box optimization, e.g., reinforcement learning (e.g., Zoph & Le, 2017), Bayesian optimisation (e.g., Kandasamy et al., 2018) and evolutionary algorithms (e.g., Real et al., 2019). These methods find good networks but are extremely computationally expensive in practice. More recent NAS approaches have adopted weight-sharing (Pham et al., 2018) to improve search efficiency. Weight-sharing based approaches often frame NAS as a constrained optimization problem and solve it with continuous relaxations (e.g., Liu et al., 2019; Cai et al., 2019b). However, these methods require running NAS for each deployment consideration, e.g., a specific latency constraint for a particular mobile device, so the total search cost grows linearly with the number of deployment considerations (Cai et al., 2019a).

To further alleviate the aforementioned limitations, one-shot supernet-based NAS (e.g., Cai et al., 2019a; Yu et al., 2020; Wang et al., 2020a) proposes to first jointly train all candidate sub-networks specified in the weight-sharing graph such that all sub-networks reach good performance at the end of training; then one can apply typical search algorithms, e.g., genetic search, to find a set of Pareto-optimal networks for various deployment scenarios. Overall, one-shot supernet-based methods provide a highly flexible and efficient NAS framework, yielding state-of-the-art empirical NAS performance on various challenging applications (e.g., Cai et al., 2019a; Wang et al., 2020b).

**Knowledge distillation** Our knowledge distillation forces the student model to mimic the predictions of the teacher model. As shown in the literature, the features in intermediate layers of the teacher model can also be used as knowledge to supervise the training of the student model; notable examples include Romero et al. (2014), Huang & Wang (2017), Ahn et al. (2019), Jang et al. (2019), Passalis & Tefas (2018), and Li et al. (2019). Furthermore, correlations between different training examples (e.g., similarity) learned by the teacher model also provide rich information, which can be distilled into the student model (Park et al., 2019; Yim et al., 2017). In our work, however, KD involves training a large number of sub-networks (students) with different architecture configurations, e.g., different network depths, channel widths, etc. It is less clear how to define a good matching in the latent feature space between the teacher supernet and the student sub-networks in a consistent way. In contrast, our method offers a simple distillation mechanism that is easy to use in practice and, at the same time, leads to significant empirical improvements.

## 6. Conclusion

In this work, we propose a method to improve the training of supernets with α-divergence based knowledge distillation. By adaptively selecting an α-divergence to optimize, our method simultaneously penalizes over-estimation and under-estimation in KD. Applying our method to neural architecture search, the searched AlphaNet models establish new state-of-the-art accuracy vs. FLOPs trade-offs on the ImageNet dataset.

## References
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9163–9171, 2019.

Shun-ichi Amari. Differential-geometrical methods in statistics. Lecture Notes on Statistics, 28:1, 1985.

Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony C Davison, Mathieu Salzmann, and Claudiu Musat. Overcoming multi-model forgetting. In International Conference on Machine Learning, pp. 594–603. PMLR, 2019.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In European Conference on Computer Vision, pp. 446–461. Springer, 2014.

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2019a.

Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. International Conference on Learning Representations, 2019b.

Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845, 2019.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.

Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. FBNetV3: Joint architecture-recipe search using neural acquisition function. arXiv preprint arXiv:2006.02049, 2020.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10628–10637, 2020.

Chaoyou Fu, Yibo Hu, Xiang Wu, Hailin Shi, Tao Mei, and Ran He. CM-NAS: Rethinking cross-modality neural architectures for visible-infrared person re-identification. arXiv e-prints, arXiv 2101, 2021.

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7036–7045, 2019.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 2018.

Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.

Yunhun Jang, Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. Learning what and where to transfer. In International Conference on Machine Learning, pp. 3030–3039. PMLR, 2019.

Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport. Advances in Neural Information Processing Systems, 2018.

Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013.

Changlin Li, Jiefeng Peng, Liuchun Yuan, Guangrun Wang, Xiaodan Liang, Liang Lin, and Xiaojun Chang. Block-wisely supervised neural architecture search with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1989–1998, 2020.

Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based training for non-autoregressive machine translation. Empirical Methods in Natural Language Processing, 2019.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. International Conference on Learning Representations, 2019.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

Tom Minka et al. Divergence measures and message passing. Technical report, Citeseer, 2005.

Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, and Tijmen Blankevoort. Distilling optimal neural networks: Rapid search in diverse spaces. arXiv preprint arXiv:2012.08859, 2020.

Kevin P Murphy. Machine Learning: A Probabilistic Perspective, chapter 21. MIT Press, 2012.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976, 2019.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505. IEEE, 2012.

Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 268–284, 2018.

Houwen Peng, Hao Du, Hongyuan Yu, Qi Li, Jing Liao, and Jianlong Fu. Cream of the crop: Distilling prioritized paths for one-shot neural architecture search. Advances in Neural Information Processing Systems, 33, 2020.
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104. PMLR, 2018.

Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436, 2020.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4780–4789, 2019.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828, 2019.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Dilin Wang, Hao Liu, and Qiang Liu. Variational inference with tail-adaptive f-divergence. arXiv preprint arXiv:1810.11943, 2018.

Dilin Wang, Meng Li, Chengyue Gong, and Vikas Chandra. AttentiveNAS: Improving neural architecture search via attentive sampling. arXiv preprint arXiv:2011.09011, 2020a.

Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187, 2020b.

Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141, 2017.

Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1803–1811, 2019.

Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. International Conference on Learning Representations, 2018.

Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. BigNAS: Scaling up neural architecture search with big single-stage models. European Conference on Computer Vision, 2020.

Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, and Tao Mei. Customizable architecture search for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11641–11650, 2019.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. International Conference on Learning Representations, 2017.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.