# Differentiable Model Scaling using Differentiable Topk

Kai Liu 1, Ruohui Wang 1, Jianfei Gao 1, Kai Chen 1

1 Shanghai AI Laboratory, Shanghai, China. Correspondence to: Kai Liu, Kai Chen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Over the past few years, as large language models have ushered in an era of intelligence emergence, there has been an intensified focus on scaling networks. Although Neural Architecture Search (NAS) methods have been proposed to automate this process, they suffer from low search efficiency. This study introduces Differentiable Model Scaling (DMS), which increases the efficiency of searching for the optimal width and depth of networks. DMS can model both width and depth in a direct and fully differentiable way, making it easy to optimize. We have evaluated our DMS across diverse tasks, ranging from vision tasks to NLP tasks, and various network architectures, including CNNs and Transformers. Results consistently indicate that our DMS can find improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, our DMS improves the top-1 accuracy of EfficientNet-B0 and DeiT-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of Yolo-v8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. Our code is available at https://github.com/LKJacky/Differentiable-Model-Scaling.

1. Introduction

In recent years, large models such as GPTs (Radford et al., 2018) and ViTs (Dosovitskiy et al., 2020) have showcased outstanding performance. Notably, the emergent intelligence of GPT-4 (OpenAI, 2023) has underscored the importance of scaling networks as a critical pathway toward achieving artificial general intelligence (AGI). To support this scaling process, we introduce a general and potent method to determine the optimal width and depth of a network during its scaling.

Currently, the structure design of most networks still relies on human expertise. It typically demands significant resources to tune structural hyperparameters, making it challenging to pinpoint the optimal structure. Meanwhile, Neural Architecture Search (NAS) methods have been introduced to automate network structure design. We classify NAS methods into two categories based on their search strategies: stochastic search methods (Xie et al., 2022; Liu et al., 2022; Tan & Le, 2019) and gradient-based methods (Liu et al., 2018a; Wan et al., 2020; Guo et al., 2021a). Stochastic search methods involve sampling numerous sub-networks to compare performance. However, they are limited by low search efficiency due to the sample-evaluate cycle, leading to reduced performance and increased search costs. Unlike stochastic search methods, gradient-based methods employ gradient descent to optimize structural parameters, enhancing their efficiency and making them more adept at balancing search costs with ultimate performance. However, a significant challenge persists: how to model structural hyperparameters in a direct and differentiable manner. Prior methods have struggled to meet this challenge, resulting in diminished performance and increased costs.
Specifically, we group prior methods into three categories based on their modeling strategies: (1) multiple element selection, (2) single number selection, and (3) gradient estimate topk. When searching for the number of channels in a convolutional layer, multiple element selection methods (Li et al.; Guo et al., 2021b) model the channel number as multiple selections of channels, as shown in Figure 1 (a.1). They introduce a much larger search space of element combinations. Single number selection methods (Wan et al., 2020) model the channel number as a single selection from multiple numbers, as shown in Figure 1 (a.2). This ignores the order relationship among these numbers. Gradient estimate topk approaches (Guo et al., 2021a; Gao et al., 2022; Ning et al., 2020) attempt to model width and depth directly, as shown in Figure 1 (a.3). However, they are not differentiable, necessitating the development of different gradient estimation methods. As a result, these methods lack stability and are difficult to optimize. Regrettably, all the above strategies fall short of modeling structural hyperparameters in a clear-cut and fully differentiable fashion.

To address the aforementioned challenge, we introduce a fully differentiable topk operator, which can seamlessly model depths and widths in a direct and differentiable manner. Notably, each differentiable topk operator has a single learnable parameter, representing either a depth or a width structural hyperparameter. It can be optimized based on guidance from both the task loss and the resource constraint loss. Our method stands out in terms of high optimization efficiency when contrasted with existing gradient-based approaches. Based on our differentiable topk, we develop a Differentiable Model Scaling (DMS) algorithm to search for networks' optimal width and depth.

To validate the efficacy and efficiency of our approach, we rigorously tested it across various tasks, including vision tasks and NLP tasks, and different architectures, including CNNs and Transformers. Thanks to the high search efficiency of our differentiable topk, DMS achieves better performance or much lower search costs than prior SOTA methods. Overall, our contributions are as follows:

- We introduce a differentiable topk operator, which is easy to optimize as it can model structural hyperparameters in a direct and differentiable manner.
- We develop a Differentiable Model Scaling (DMS) algorithm based on our differentiable topk to search for networks' optimal width and depth.
- We evaluate our DMS across various tasks and architectures. For example, DMS outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. DMS also requires only a small fraction, less than one in dozens, of the search cost of one-shot and multi-shot NAS methods while achieving comparable performance. Besides, our method is widely applicable: it improves the mAP of Yolo-v8-n by 2.0% on COCO and improves the zero-shot classification accuracy of pruned Llama-7B.

2. Related Work

The width and depth of networks are critical aspects of model architecture design. A multitude of methodologies have been proposed to automate this process, notably Neural Architecture Search (NAS) (Zoph & Le, 2016; Liu et al., 2018a) and model structure pruning (Li et al., 2020; Li et al.).
NAS algorithms typically aim to design models automatically from scratch, while model structure pruning approaches focus on compressing pretrained models to enhance their efficiency. Despite their contrasting methodologies, both approaches contribute to the search for model structure. These search methods can generally be categorized into two groups based on their search strategies: stochastic search methods (Zoph & Le, 2016; Xie et al., 2022; Liu et al., 2022) and gradient-based methods (Liu et al., 2018a; Guo et al., 2021a). In the following sections, we introduce these methods and compare them with ours.

2.1. Stochastic Search Methods

Stochastic search methods usually operate through a cyclical process of sampling and evaluation. At each step, they sample models with different structures and then evaluate them. This strategy is versatile as it can handle both continuous and discrete search spaces. However, a significant downside is its low search efficiency, leading to high resource consumption and suboptimal performance. Specifically, stochastic search-based methods can be divided into three groups: multi-shot NAS, one-shot NAS, and zero-shot NAS. Multi-shot NAS (Tan & Le, 2019; Liu et al., 2022) requires the training of multiple models, which is time-consuming. For instance, EfficientNet (Tan & Le, 2019) uses over 1714 TPU days for searching. One-shot NAS (Xie et al., 2022; Cai et al., 2019) requires training a large supernet, which is also resource-intensive. For example, ScaleNet (Xie et al., 2022) uses 379 GPU days for training a supernet. Zero-shot NAS (Li et al., 2023; Lin et al., 2021) reduces the cost by eliminating the need to train any model. However, its performance has not yet met the desired standard.

2.2. Gradient-based Methods

Gradient-based structure search methods (Liu et al., 2018a; Guo et al., 2021a) employ gradient descent to explore the structure of models. Generally, these methods are more efficient than their stochastic search counterparts. The critical aspect of gradient-based methods is how to use learnable parameters to model structural hyperparameters and compute their gradients. Ideally, the learnable parameters should directly model structural hyperparameters, and their gradients should be computed in a fully differentiable manner. However, prior methods have struggled to meet these two criteria in modeling the width and depth of networks. We group them into three categories: (1) multiple element selection, (2) single number selection, and (3) gradient estimate topk. The first two categories model structural hyperparameters indirectly, while the third category is not differentiable and requires gradient estimation.

Figure 1. Different Gradient-based Modeling Strategies for Width and Depth. All strategies use learnable parameters to generate an element mask to select width elements or depth elements. Sub-figure (a) illustrates four methods to generate the element mask, while (b) shows how the mask is used to search width and depth. (a.1) Multiple Element Selection: The element count is transformed into a multiple-element selection. (a.2) Single Number Selection: The element count is transformed into a selection from multiple numbers. (a.3) Gradient Estimate Topk: The element count is directly modeled yet non-differentiable.
(a.4) Our Differentiable Topk: The element count is directly modeled and is fully differentiable. "Direct" means that the learnable parameters directly model the structural hyperparameters, while "Differentiable" means that the gradient of the learnable parameters can be computed in a fully differentiable manner.

Multiple element selection methods (Li et al.) model the number of elements as multiple selections from elements (e.g., channel selection), as shown in Figure 1 (a.1). Similarly, single number selection methods (Wan et al., 2020) model the element quantity as a single choice from multiple numbers, as shown in Figure 1 (a.2). Both of them model structural hyperparameters in indirect and inaccurate ways and introduce many more learnable structural parameters, making optimization hard. Naturally, they result in low performance. Gradient estimate topk approaches (Guo et al., 2021a; Gao et al., 2022; Ning et al., 2020) attempt to model width and depth directly, as shown in Figure 1 (a.3). However, they are not differentiable, necessitating the development of different gradient estimation methods. As a result, these methods lack stability and are also difficult to optimize. To improve the optimization efficiency of structure search, we introduce a new differentiable topk that models width and depth directly and is fully differentiable. Thanks to the differentiable topk, our method achieves better performance and higher efficiency than prior methods.

2.3. Differentiable Topk

The topk operator selects the k elements from N elements based on their importance values. Previous differentiable topk operators focus on making the selection of elements differentiable with a fixed k (Xie et al., 2020; Petersen et al., 2022). In contrast, our topk method focuses on making the number of selected elements k differentiable. Our differentiable topk is a novel operator and, to our knowledge, the only one that can be applied to search structural hyperparameters.

3. Method

In this section, we detail our Differentiable Model Scaling (DMS) in two steps. First, we introduce our differentiable topk, which models structural hyperparameters directly in a fully differentiable manner. Second, we explain how to use our differentiable topk to construct our DMS algorithm.

3.1. Differentiable Topk

Suppose there is a structural hyperparameter denoted by k, representing the number of elements, such as k channels in a convolutional layer or k residual blocks in a network stage. k has a maximal value of N. We use $c \in \mathbb{R}^N$ to represent the element importance, where a larger value indicates a higher importance. The objective of our differentiable topk is to output a soft mask $m \in [0, 1]^N$ to indicate the selected elements with the top k importance scores.

Our topk operator uses a learnable parameter $a$ as a threshold to select elements whose importance values are larger than $a$. $a$ is able to model the number of elements k directly, as k can be seen as a function of $a$, where $k = \sum_{i=1}^{N} \mathbb{1}[c_i > a]$. $\mathbb{1}[A]$ is an indicator function, which equals 1 if A is true and 0 otherwise. We use $c_i$ to represent the importance of the i-th element. We denote our topk as a function f as follows:

$$m_i = f(c_i) = \begin{cases} 1 & \text{if } c_i > a \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

In prior methods, f is usually a piecewise function, which is not smooth and not differentiable, and the gradient of $a$ is computed by estimation. We argue that the biggest challenge to employing a fully differentiable f with respect to $a$ is that the channel importance is distributed unevenly.
Specifically, the uneven distribution causes the importance difference between two neighboring elements, ordered by importance value, to vary significantly. Suppose $a$ is updated by a fixed amount in each iteration: when the difference is large, many steps are needed for $a$ to cross these two elements; when the difference is small, $a$ can cross many elements in one step. Therefore, optimizing $a$ in a fully differentiable manner is too hard when element importance is uneven. To address this challenge, we employ an importance normalization process to forcefully convert the unevenly distributed importance values to evenly distributed ones, making the topk function smooth and easy to optimize in a differentiable way. To sum up, our differentiable topk has two steps: importance normalization and soft mask generation.

3.1.1. IMPORTANCE NORMALIZATION

We normalize all element importance values by mapping them to evenly distributed values from 0 to 1, as follows:

$$c'_i = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[c_i > c_j]. \qquad (2)$$

The normalized element importance is denoted by $c'$. $\mathbb{1}[A]$ is the same indicator function as above. Any two elements in c are supposed to be different, which is usually the case in practice. Notably, although $c'$ is evenly distributed from 0 to 1, c can follow any distribution. Intuitively, $c'_i$ indicates the portion of c values smaller than $c_i$. Besides, the learnable threshold $a$ also becomes meaningful, representing the pruning ratio of elements. k can be computed by $k = \lfloor (1 - a) N \rceil$, where $\lfloor \cdot \rceil$ is a round function. $a$ is limited to the range of [0, 1], where $a = 0$ indicates no pruning and $a = 1$ indicates pruning all elements.

3.1.2. SOFT MASK GENERATION

After the normalization, it is easy to generate the soft mask m using a smooth and differentiable function based on the relative size of the pruning ratio $a$ and the normalized element importance $c'$:

$$m_i = f(a) = \mathrm{Sigmoid}(\lambda (c'_i - a)) = \frac{1}{1 + e^{-\lambda (c'_i - a)}}. \qquad (3)$$

We add a hyperparameter $\lambda$ to control the degree of approximation from Equation 3 to a hard mask generation function. When $\lambda$ tends to infinity, Equation (3) approaches a hard mask generation function. We usually set $\lambda$ to N, because when $c'_i > a + 3/N$ or $c'_i < a - 3/N$, we have $|m_i - \hat{m}_i| < 0.05$, where $\hat{m}_i$ is the corresponding hard mask value. This means that except for the six elements whose importance values are around the pruning ratio, the masks of the other elements are close to 0 or 1, where the approximation error is less than 0.05. Therefore, $\lambda = N$ is sufficient to approximate a hard mask generation function for our topk.

The forward and backward graphs of Equation 3 are shown in Figure 2 (a) and Figure 2 (b), respectively. It can be observed that 1) our topk models the number of elements k directly using the learnable pruning ratio $a$, and it generates a polarized soft mask m that simulates the pruned model well during the forward pass; 2) our differentiable topk is fully differentiable and can be optimized stably. The gradient of $m_i$ with respect to $a$ is $\frac{\partial m_i}{\partial a} = -\lambda (1 - m_i) m_i$. Intuitively, our topk collects the gradients of the masks in the fuzzy area with $0.05 < m_i < 0.95$. Note that Figure 2 (b) illustrates the value of $\frac{\partial m_i}{\partial a}$, which is not the total gradient of $a$. The total gradient of $a$ is $\sum_{i=1}^{N} \frac{\partial \mathrm{loss}_{task}}{\partial m_i} \frac{\partial m_i}{\partial a} + \frac{\partial \mathrm{loss}_{resource}}{\partial a}$.
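To make Equations (2) and (3) concrete, the following is a minimal PyTorch-style sketch of the operator described above. It is our own illustration rather than the authors' released implementation; the class name `DifferentiableTopk` and the way importance is stored are assumptions, while the rank-based normalization, the sigmoid soft mask, and the choice $\lambda = N$ follow the text.

```python
import torch
import torch.nn as nn


class DifferentiableTopk(nn.Module):
    """Sketch of the differentiable topk of Section 3.1 (hypothetical implementation)."""

    def __init__(self, num_elements: int):
        super().__init__()
        self.n = num_elements
        # Single learnable pruning ratio a in [0, 1]; a = 0 keeps all elements.
        self.a = nn.Parameter(torch.zeros(1))
        # Element importance c; it is not updated by gradient descent
        # (Section 3.1.3 updates it with a moving average instead).
        self.register_buffer("c", torch.rand(num_elements))
        self.lam = float(num_elements)  # lambda = N, Equation (3)

    def forward(self) -> torch.Tensor:
        # Equation (2): c'_i = (1/N) * #{j : c_i > c_j}, which maps arbitrary
        # importance values to evenly spaced values in [0, 1).
        c_norm = (self.c.unsqueeze(1) > self.c.unsqueeze(0)).float().mean(dim=1)
        # Equation (3): soft mask m_i = sigmoid(lambda * (c'_i - a)).
        # The ranking above is non-differentiable, but gradients only need to
        # reach the pruning ratio a, and they flow to it through the sigmoid.
        return torch.sigmoid(self.lam * (c_norm - self.a.clamp(0.0, 1.0)))
```

During the search stage, the returned mask is multiplied onto channels or residual branches, so $a$ receives gradients from both the task loss and the resource constraint loss; after searching, $k = \lfloor (1-a)N \rceil$ elements with the highest importance are kept.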
Figure 2. Forward and Backward Graphs of Our Differentiable Topk. We set the maximal element number N = λ = 100 and the pruning ratio a ∈ {0.25, 0.5, 0.75}. The x-axis represents the normalized element importance $c'_i$. (a) shows the forward process, where the y-axis is the soft mask $m_i$; (b) shows the backward process, where the y-axis is the gradient $\frac{\partial m_i}{\partial a}$.

3.1.3. ELEMENT EVALUATION

As we do not limit the distribution of element importance, element importance can be quantified through various methods, such as the L1-norm (Li et al., 2016), among others. In our approach, we implement Taylor importance (Molchanov et al., 2019) in a moving-average manner as follows:

$$c_i^{t+1} = c_i^t \cdot \mathrm{decay} + (m_i^t \cdot g_i)^2 \cdot (1 - \mathrm{decay}). \qquad (4)$$

Here, t represents the training step, $g_i$ is the gradient of the training loss with respect to $m_i$, and decay refers to the decay rate. The initial value of $c_i^0$ is set to zero, and the decay rate is set to 0.99. Note that the importance of elements is not updated by gradient descent. By leveraging Taylor importance, we can efficiently and stably estimate the importance of elements. We conduct an ablation study on importance evaluation methods in Section 5.1, which shows that Taylor importance is enough to achieve good performance.
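As a companion to the sketch above, Equation (4) amounts to a one-line update. The function below is our own hedged illustration (the names are ours, not from the paper); `grad` stands for the gradient of the task loss with respect to the mask, which in practice would be gathered with a backward hook.

```python
import torch


@torch.no_grad()
def update_taylor_importance(c: torch.Tensor, mask: torch.Tensor,
                             grad: torch.Tensor, decay: float = 0.99) -> torch.Tensor:
    # Equation (4): c^{t+1}_i = c^t_i * decay + (m^t_i * g_i)^2 * (1 - decay),
    # where g_i is the gradient of the task loss w.r.t. the soft mask entry m_i.
    return c * decay + (mask * grad) ** 2 * (1.0 - decay)
```

Calling this once per training step and writing the result back into the importance buffer of the sketch above reproduces the moving-average behavior described in the text, with the importance initialized to zeros and decay set to 0.99.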
3.2. Differentiable Model Scaling

Relying on our differentiable topk, we develop Differentiable Model Scaling (DMS) to optimize the width and depth of networks. Our DMS has three pipeline variants, derived from the standard training-based model pruning pipeline, as shown in Table 1.

Table 1. Three Pipelines We Used in Our Paper.

| Pipeline | Pretrain | Search | Retrain |
| --- | --- | --- | --- |
| DMS_np | N | All weights | Y |
| DMS_p | Y | All weights | Y |
| DMS_p- | Y | Only structural parameters | N |

DMS_p is the standard training-based model pruning pipeline; it consists of a pretrain stage, a search stage, and a retrain stage. The pretrain stage is used to pretrain a supernet and usually costs a lot of time and resources. In the search stage, we search for the optimal width and depth of the supernet under a specific resource constraint. Due to the high search efficiency of our method, the search stage only uses about 1/10 or fewer of the epochs of retraining. In the retrain stage, we retrain the searched model. We use this pipeline when comparing with SOTA pruning methods.

DMS_np is our default and most-used pipeline in this paper. The high cost of the pretrain stage, which may make up most of the total cost, is a big obstacle to the practical application of NAS and pruning methods. To overcome this problem, we discard the pretrain stage from DMS_p and directly search from a randomly initialized supernet. According to our ablation study in Appendix A.3.2, DMS_np surpasses DMS_p in both performance and efficiency by increasing the supernet size. DMS_np keeps our method high-performing and more efficient than other NAS methods.

DMS_p- is used to compare different search methods extremely quickly. Compared with DMS_p, it only optimizes the structural parameters and does not retrain the searched models. Taking advantage of existing pretrained supernets, it still outputs reasonable results. Besides, it only takes hundreds of iterations, costing less than 10 minutes on a single RTX 3090, to search for a model.

Search Space: As shown in Figure 1 (b), our search space encompasses both the width and depth of networks, which are the most critical structural hyperparameters for model scaling. To represent these dimensions, we use our differentiable topk. The width in networks typically covers the channel dimension in convolutional layers, the feature dimension in fully connected layers, and so on. Regarding depth, we focus on networks with residual connections and search the number of blocks in each stage. Specifically, we incorporate the soft masks of the differentiable topk into residual connections, allowing each block to be represented as $x_{i+1} = x_i + f(x_i) \cdot m_i$. Besides, for a structural hyperparameter x, we search it in the range of $[1, x_{max}]$ with a step of 1, while most prior NAS methods (Cai et al., 2019; Chen et al., 2021) search it in the range of $[x_{min}, x_{max}]$ with a large step, such as 32. Their search spaces, designed by human experts, are already well-tuned sub-spaces of ours. However, our search space is more general and costs the least human effort. More details about our search space are provided in Appendix A.1.1.

Resource Constraint Loss: To ensure that a network adheres to specific resource constraints, we incorporate an additional component into the optimization process, termed the resource constraint loss. Consequently, the aggregate loss function is:

$$\mathrm{loss} = \mathrm{loss}_{task} + \lambda_{resource} \cdot \mathrm{loss}_{resource}, \qquad (5)$$

$$\mathrm{loss}_{resource} = \begin{cases} \log(r_c / r_t) & \text{if } r_c > r_t \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

Here, $\mathrm{loss}_{task}$ denotes the task loss, $\mathrm{loss}_{resource}$ represents the additional resource constraint loss, and $\lambda_{resource}$ acts as its weighting factor. $r_c$ symbolizes the current level of resource consumption, which is calculated based on the learnable parameters of the differentiable topk operators. $r_t$ denotes the targeted level of resource consumption and is user-specified. As our topk is fully differentiable, the learnable structural parameters can be optimized under the guidance of both the task loss and the resource constraint loss. Our method is compatible with various resource constraints, including a MAC constraint, a latency constraint, and a number-of-parameters constraint. More details about our resource constraint loss are provided in Appendix A.1.2.

4. Experiment

We applied our method in rigorous evaluations across various tasks, including vision and NLP tasks, and architectures, including CNNs and Transformers. We focus on search costs and performance. The search cost associated with a model is divided into two distinct components: the public cost, which is paid once per method for all searched models (e.g., supernet pretraining), and the private cost, which is paid for each searched model. Notably, our method consistently outperforms both baseline models and prior NAS methods, highlighting its superior performance and adaptability.

4.1. Comparison with Different Search Methods

To demonstrate the higher search efficiency of our method, we first compare it with different search methods in the same setting. For simplicity and fairness, we utilize the DMS_p- pipeline, which loads pretrained weights, for all methods in this section. As all methods load pretrained weights, we do not count pretrain costs in search costs in this section. All experiments in this section are conducted on ImageNet (Deng et al., 2009).

4.1.1. COMPARISON WITH GRADIENT-BASED METHODS

Firstly, we compare our method with other gradient-based methods, including multiple element selection (MES), single number selection (SNS), and gradient estimate topk (GET). We use ResNet-50 as our supernet and search for models with about 3G MACs. The search stage costs 800 iterations for all methods. The results are shown in Table 2.
Our method achieves the best performance among all gradient-based methods by a large margin. Compared with multiple element selection and single number selection, our method uses far fewer (less than 1/250) learnable parameters, making it easier to optimize. Compared with gradient estimate topk, our parameters receive more accurate gradients, achieving much better performance. These results demonstrate that our direct and differentiable modeling is much more efficient than that of other gradient-based methods.

Table 2. Comparison with other Gradient-based Methods. MES means multiple element selection, SNS means single number selection, and GET means gradient estimate topk. N means the number of learnable structural parameters. We use the DMS_p- pipeline for all methods in this table.

| Method | Top-1 (%) | MACs (G) | N |
| --- | --- | --- | --- |
| MES (Herrmann et al., 2020) | 55.5 | 3.2 | 11468 |
| SNS (Wan et al., 2020) | 58.6 | 3.4 | 11468 |
| GET (Yao et al., 2021) | 61.6 | 3 | 41 |
| DMS_p- (ours) | 70.7 | 3 | 41 |

Table 3. Comparison with Evolutionary Algorithm. EA means evolutionary algorithm. Only EA + predictor needs a public search cost, which is used to train an accuracy predictor. The unit for search cost is GPU hours. We use the DMS_p- pipeline in this table.

| Method | Top-1 (%) | MACs (G) | Cost (Public + Private) |
| --- | --- | --- | --- |
| EA + predictor (Cai et al., 2019) | 78.2 | 0.45 | 40 + 0.01 |
| EA (Cai et al., 2019) | 78.2 | 0.45 | 0 + 1.1 |
| DMS_p- (ours) | 78.2 | 0.45 | 0 + 0.05 |

4.1.2. COMPARISON WITH EVOLUTIONARY ALGORITHM

We additionally compare our method with an evolutionary algorithm (EA), a typical stochastic search method. We utilize the supernet of OFA (Cai et al., 2019) and search for models with 0.45G MACs. The results are shown in Table 3. Our DMS consumes less than 1/20 of the search cost of the evolutionary algorithm without any performance degradation. This reveals that our method, guided by gradients, is more efficient than the evolutionary algorithm.

4.2. Comparison with SOTA NAS Methods

Then, we compare our method with SOTA NAS methods on ImageNet. We choose EfficientNet models as baselines and search for optimal configurations in terms of their width and depth. EfficientNet (Tan & Le, 2019) is a widely accepted baseline for NAS research (Xie et al., 2022; Liu et al., 2022). We use our default pipeline DMS_np, which does not load pretrained weights, in this section; therefore, the public search cost for our method is zero. The performance of the searched models and their search costs are presented in Table 4. According to the level of their search costs, we divide all compared NAS methods into two groups: low-search-cost methods and high-search-cost methods.

Table 4. Experiments on EfficientNet. We compare our DMS with other NAS methods on EfficientNet variants. We divide all NAS methods into two groups: low-search-cost methods and high-search-cost methods. Our method outperforms low-search-cost methods by a large margin with similar search costs, and it uses much lower search costs than high-search-cost methods while achieving comparable or better performance. The unit of search cost is TPU days for EfficientNet and GPU days for other models. Ratio stands for the ratio of the search cost of the model to that of our corresponding DMS model. * means the model is trained with distillation. How to obtain these search costs is detailed in Appendix A.4.2. Note that we use our default DMS_np pipeline, which does not load pretrained supernets, in this table.
| Model | NAS Type | Top-1 (%) | MACs (G) | Params (M) | Search Cost (Public + Private) | Ratio | Cost Level |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Joint Pruning (Guo et al., 2021a) | Gradient | 77.3 | 0.34 | / | 0 + 8 | 2.5 | Low |
| DMS_np-EN-350 (ours) | Gradient | 78.5 | 0.35 | 5.6 | 0 + 3.2 | 1 | Low |
| Zen-score (Lin et al., 2021) | Zero Shot | 78.0 | 0.41 | 5.7 | 0 + 0.5 | 1.3 | Low |
| ZiCo (Li et al., 2023) | Zero Shot | 78.1 | 0.45 | / | 0 + 0.4 | 1 | Low |
| DMS_np*-EN-450 (ours) | Gradient | 79.4 | 0.45 | 6.5 | 0 + 0.4 | 1 | Low |
| ScaleNet-EN-B0 (Xie et al., 2022) | One Shot | 77.5 | 0.35 | 4.4 | 379 + 1.6 | 119 | High |
| DMS_np-EN-350 (ours) | Gradient | 78.0 | 0.35 | 5.6 | 0 + 3.2 | 1 | Low |
| EfficientNet-B0 (Tan & Le, 2019) | Multi Shot | 77.1 | 0.39 | 5.3 | 1714 + 0 | 536 | High |
| DMS_np-EN-B0 (ours) | Gradient | 78.5 | 0.39 | 6.2 | 0 + 3.2 | 1 | Low |
| EfficientNet-B1 (Tan & Le, 2019) | Multi Shot | 79.1 | 0.69 | 7.8 | 1714 + 0 | 296 | High |
| ScaleNet-EN-B1 (Xie et al., 2022) | One Shot | 79.9 | 0.80 | 7.4 | 379 + 3.7 | 66 | High |
| MA-EN-B1 (Liu et al., 2022) | Multi Shot | 79.9 | 0.68 | 8.8 | > 124 + 131 | > 44 | High |
| DMS_np-EN-B1 (ours) | Gradient | 80.0 | 0.68 | 8.9 | 0 + 5.8 | 1 | Low |
| EfficientNet-B2 (Tan & Le, 2019) | Multi Shot | 80.1 | 1.0 | 9.2 | 1714 + 0 | 245 | High |
| MA-EN-B2 (Liu et al., 2022) | Multi Shot | 80.9 | 1.0 | 9.3 | > 124 + 192 | > 45 | High |
| DMS_np-EN-B2 (ours) | Gradient | 81.1 | 1.1 | 9.6 | 0 + 7.0 | 1 | Low |

Low-search-cost methods usually contain gradient-based and zero-shot NAS methods. This category only needs several GPU days, fewer than training the searched model itself. High-search-cost methods usually contain multi-shot and one-shot NAS methods, which usually cost more than one hundred GPU days (far more than training the searched model itself).

Compared with Low-Search-Cost Methods: Our method also belongs to the low-search-cost methods. Compared with this category, our method achieves better performance with similar or even lower search costs. For example, our method outperforms Joint Pruning (Guo et al., 2021a) by 1.2% with 2/5 of its search cost. It also outperforms the zero-shot NAS methods ZiCo and Zen-score by margins of 1.3% and 1.4%, respectively. This result demonstrates that our method achieves significant performance improvements over previous low-search-cost NAS methods.

Compared with High-Search-Cost Methods: Compared with this category, our method spends much less on searching but still achieves comparable or even better performance. For example, compared with EfficientNet, our searched models DMS_np-EN-B0, B1, and B2 improve performance by 1.4%, 0.9%, and 1.0%, respectively. Remarkably, DMS also achieves over 100 times search cost savings in the search process. Our method outperforms ScaleNet by 0.5% and 0.1% on EfficientNet-B0 and B1, respectively, with less than 1/50 of its search cost. Our DMS also outperforms Model Amplification (MA) by 0.1% and 0.2% on EfficientNet-B1 and B2, respectively, with less than 1/40 of its search cost. This result proves the high search efficiency of our method.

For high-search-cost methods, the search cost usually comes from a huge public cost, such as supernet training. Even if we average their public cost over four variants (the number of variants used by most papers), they still cost over 20 times our total search cost. Beyond the drawback of high search costs, the high public cost makes it impractical to search for a large model. For example, if a target model needs a month to train from scratch, like an LLM, the public search process of high-search-cost methods may take over a year, which is unacceptable for practical applications. In contrast, our method only requires a few days to search for a model, making it feasible to search for a large model.
Through these experiments, we demonstrate that our method is more suitable for real-world applications, as it achieves comparable or even better performance than SOTA NAS methods with low search costs. This makes it easy to insert into practical model development pipelines. We compare our method with more NAS methods in Appendix A.2.1 and draw this table as accuracy-vs-MACs plots and a search-cost-vs-accuracy plot in Figure 3.

Table 5. Comparison with SOTA Pruning Methods. The supernet is ResNet-50 for all methods. We use the DMS_p pipeline, which loads pretrained weights as the other methods do, in this table.

| Method | MACs (G) | Top-1 (%) |
| --- | --- | --- |
| ResNet-50 | 4.1 | 76.5 |
| LFPC (He et al., 2020) | 1.6 | 74.46 |
| GReg2 (Wang et al., 2020) | 1.6 | 74.93 |
| CC (Li et al., 2021) | 1.5 | 74.54 |
| TPP (Wang & Fu, 2022) | 1.6 | 75.12 |
| DMS_p (ours) | 1.6 | 75.53 |

Table 6. Experiment on DeiT. We use the DMS_np pipeline, which does not load pretrained weights, in this table.

| Model | Top-1 (%) | MACs (G) | Params (M) |
| --- | --- | --- | --- |
| DeiT-T (Touvron et al., 2021) | 74.5 | 1.3 | 5.7 |
| DMS-DeiT-T | 75.1 | 1.3 | 6.2 |

4.3. Comparison with SOTA Pruning Methods

Our method can also be applied as a model structure pruning method. We compare our method with SOTA pruning methods in Table 5. Following prior structure pruning methods, we utilize the DMS_p pipeline, which loads pretrained weights, and only prune width. Compared with SOTA pruning methods, our DMS achieves the best performance even though we do not employ complicated importance evaluation methods like the others. This is because of the strong search ability of our differentiable topk.

4.4. Experiment on Transformers

Beyond CNNs, we also apply our method to Transformers, a widely used architecture in vision and language tasks.

4.4.1. EXPERIMENT ON DEIT

We apply our method to DeiT-T, a one-stage vision transformer. We use the DMS_np pipeline, which does not load pretrained weights, in this experiment. Our searched model outperforms the original DeiT-T by 0.6% with similar MACs, as shown in Table 6.

4.4.2. EXPERIMENT ON LLM

Beyond vision tasks, we extend our method to evaluate its applicability on a large language model (LLM), Llama (Touvron et al., 2023), as shown in Table 7. Due to resource constraints, we cannot train an LLM from scratch. Instead, we adopt a prune-and-finetune strategy using the Alpaca dataset (Taori et al., 2023). This is similar to our DMS_p pipeline, but with finetuning rather than full retraining. To mitigate overfitting to the Alpaca dataset, we use the original model to distill the pruned model both during pruning and during the subsequent finetuning process. In alignment with LLM-Pruner (Ma et al., 2023), we limit our pruning to the heads of the self-attentions and the hidden dimensions of the feed-forward networks (FFN) within Llama. The resource constraint is set to 5.47B parameters, the same as LLM-Pruner. After pruning 20% of the parameters from Llama-7B and comparing with LLM-Pruner, our method demonstrates superior performance across various benchmarks. Specifically, we observe reduced perplexity on WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1993), and higher zero-shot classification accuracy on BoolQ (Clark et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-e (Clark et al., 2018), and ARC-c (Clark et al., 2018). These experiments demonstrate that our method is also compatible with Transformers. We also provide more experiments on classic CNNs for image classification and object detection in Appendix A.2.2 and Appendix A.2.3.
Our method consistently outperforms the baseline models. In conclusion, our method is universal and effective across various tasks and architectures.

5. Ablation Study

5.1. Ablation Study on Element Importance Metrics

Here, we compare different element importance metrics, including the index metric, SNIP (Lee et al., 2018), Fisher (Liu et al., 2021), and Taylor (Molchanov et al., 2019). We use the DMS_np pipeline, which does not load pretrained weights, in this section. We search for 1G models on ResNet-50 and compare the performance after retraining. The index metric is the simplest: it assigns importance values statically according to the index of elements. SNIP, Fisher, and Taylor are more sophisticated metrics that use gradients and activations to update the importance of elements with a moving average. The results are shown in Table 8. The index metric works poorly because of its static strategy. SNIP, Fisher, and Taylor work better than the index metric and are comparable with each other for our method. We further compare Taylor importance with and without the moving average. The results are also shown in Table 8. It can be seen that the moving average improves the performance of Taylor importance, because it smooths the importance of elements, making it more stable. Therefore, we simply adopt Taylor importance as our default metric, as it is more widely used in prior works (Humble et al., 2022; Molchanov et al., 2019).

Table 7. Experiment on Llama-7B. We pruned Llama-7B using DMS and compared it with LLM-Pruner. We evaluate the pruned model using perplexity on the WikiText2 and PTB datasets and zero-shot classification accuracy on the BoolQ, WinoGrande, ARC-e, and ARC-c datasets. In our results, ↑ denotes that a larger value is better, while ↓ signifies that a smaller value is preferable. We use our DMS_p pipeline, which loads pretrained weights, in this table, due to resource constraints.

| Model | Params | WikiText2 ↓ | PTB ↓ | BoolQ ↑ | WinoGrande ↑ | ARC-e ↑ | ARC-c ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-7B (Touvron et al., 2023) | 6.74B | 12.62 | 22.14 | 76.5 | 67.01 | 72.8 | 41.38 |
| LLM-Pruner-Llama-7B (Ma et al., 2023) | 5.47B | 17.39 | 30.2 | 66.79 | 64.96 | 64.06 | 37.88 |
| DMS_p-Llama-7B (ours) | 5.47B | 17.13 | 27.98 | 75.23 | 65.35 | 71.46 | 39.59 |

Table 8. Ablation Study on Element Importance Metrics. We use the DMS_np pipeline in this table.

| Element Importance | Top-1 (%) |
| --- | --- |
| Index metric | 72.3 |
| SNIP (Lee et al., 2018) | 73.0 |
| Fisher (Liu et al., 2021) | 73.2 |
| Taylor w/o moving average | 72.5 |
| Taylor (Molchanov et al., 2019) | 73.1 |

More Ablations: There are more ablation studies in Appendix A.3:

(1) Ablation study on importance normalization, in Appendix A.3.1. It demonstrates that without importance normalization, our method drops to an extremely low performance. This is caused by the unevenly distributed importance, which makes our topk function hard to optimize, and it demonstrates the significance of our importance normalization.

(2) Ablation study on pretraining and supernet sizes, in Appendix A.3.2. It shows that our DMS_np, which does not load pretrained weights, can outperform DMS_p, which loads pretrained weights, in both efficiency and efficacy by increasing the supernet size.

(3) Experiments with a latency constraint, in Appendix A.3.3. They show that our method is also compatible with latency constraints.

(4) Ablation study on search time, in Appendix A.3.4. It shows that only 1/10 of the epochs of training from scratch is enough for our method to search for the optimal structure.

(5) Ablation study on our hyperparameters, in Appendix A.3.5.
Our method only has two hyperparameters that need to be tuned for each model: the weight of the resource constraint loss, $\lambda_{resource}$, and the learning rate for structural parameters, $lr_{structure}$. Our hyperparameters are easy to tune and do not require much effort. Under the premise of ensuring that the model can be pruned to a specific resource constraint, a smaller $\lambda_{resource}$ and a larger $lr_{structure}$ are preferred.

(6) Ablation study on search spaces, in Appendix A.3.6. Our method uses a fine-grained search space, which does not set minimum and step values for each structural hyperparameter. It costs the least human effort to design but is harder to search. This ablation study shows that a human-designed coarse-grained search space, as used by prior methods, can further improve the performance of our method.

6. Conclusion

In this paper, we introduce a novel model scaling method termed Differentiable Model Scaling (DMS). Compared with prior NAS methods, our DMS has three advantages: (1) DMS is efficient to search with, which makes it easy to use; (2) DMS achieves high performance, comparable with SOTA NAS methods; (3) DMS is universal and compatible with various tasks and architectures. In conclusion, our DMS is a highly efficient and versatile method for model scaling. In the future, we will apply our method to cutting-edge tasks and models, such as searching for a large language model from scratch.

Acknowledgements

We express our appreciation to our colleague Zhihao Lin for his insightful discussions. We are grateful to the anonymous reviewers for their constructive feedback and recommendations. This project is supported by the National Key R&D Program of China (No. 2022ZD0161600).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Cai, H., Gan, C., Wang, T., Zhang, Z., and Han, S. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.

Chen, M., Peng, H., Fu, J., and Ling, H. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12270–12280, 2021.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Contributors, M. OpenMMLab model compression toolbox and benchmark. https://github.com/open-mmlab/mmrazor, 2021.

Contributors, M. MMYOLO: OpenMMLab YOLO series toolbox and benchmark. https://github.com/open-mmlab/mmyolo, 2022.

Contributors, M. OpenMMLab's pre-training toolbox and benchmark. https://github.com/open-mmlab/mmpretrain, 2023.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.
arXiv preprint arXiv:2010.11929, 2020.

Fang, G., Ma, X., Song, M., Mi, M. B., and Wang, X. DepGraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16091–16101, 2023.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

Gao, S., Huang, F., Zhang, Y., and Huang, H. Disentangled differentiable network pruning. In European Conference on Computer Vision, pp. 328–345. Springer, 2022.

Guo, J., Liu, J., and Xu, D. JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing. IEEE Transactions on Circuits and Systems for Video Technology, 32(6):3659–3672, 2021a.

Guo, Y., Yuan, H., Tan, J., Wang, Z., Yang, S., and Liu, J. GDP: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5239–5250, 2021b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, Y., Ding, Y., Liu, P., Zhu, L., Zhang, H., and Yang, Y. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2009–2018, 2020.

Herrmann, C., Bowen, R. S., and Zabih, R. Channel selection using Gumbel softmax. In European Conference on Computer Vision, pp. 241–257. Springer, 2020.

Humble, R., Shen, M., Latorre, J. A., Darve, E., and Alvarez, J. Soft masking for cost-constrained channel pruning. In European Conference on Computer Vision, pp. 641–657. Springer, 2022.

Jocher, G., Chaurasia, A., and Qiu, J. YOLO by Ultralytics, January 2023. URL https://github.com/ultralytics/ultralytics.

Lee, N., Ajanthan, T., and Torr, P. H. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.

Li, B., Wu, B., Su, J., and Wang, G. EagleEye: Fast sub-net evaluation for efficient neural network pruning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 639–654. Springer, 2020.

Li, G., Yang, Y., Bhardwaj, K., and Marculescu, R. ZiCo: Zero-shot NAS via inverse coefficient of variation on gradients. arXiv preprint arXiv:2301.11300, 2023.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

Li, Y., Zhao, P., Yuan, G., Lin, X., Wang, Y., and Chen, X. Pruning-as-search: Efficient neural architecture search via channel pruning and structural reparameterization.

Li, Y., Lin, S., Liu, J., Ye, Q., Wang, M., Chao, F., Yang, F., Ma, J., Tian, Q., and Ji, R. Towards compact CNNs via collaborative compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6438–6447, 2021.

Lin, M., Wang, P., Sun, Z., Chen, H., Sun, X., Qian, Q., Li, H., and Jin, R. Zen-NAS: A zero-shot NAS for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 347–356, 2021.

Liu, C., Han, K., Xiao, A., Nie, Y., Zhang, W., and Wang, Y. Network amplification with efficient MACs allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1933–1942, 2022.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018a.

Liu, L., Zhang, S., Kuang, Z., Zhou, A., Xue, J.-H., Wang, X., Chen, Y., Yang, W., Liao, Q., and Zhang, W. Group Fisher pruning for practical network compression. In International Conference on Machine Learning, pp. 7021–7032. PMLR, 2021.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018b.

Ma, X., Fang, G., and Wang, X. LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.

Marcus, M., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: The Penn Treebank. 1993.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Molchanov, P., Mallya, A., Tyree, S., Frosio, I., and Kautz, J. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11264–11272, 2019.

Moons, B., Noorzad, P., Skliar, A., Mariani, G., Mehta, D., Lott, C., and Blankevoort, T. Distilling optimal neural networks: Rapid search in diverse spaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12229–12238, 2021.

Ning, X., Zhao, T., Li, W., Lei, P., Wang, Y., and Yang, H. DSA: More efficient budgeted pruning via differentiable sparsity allocation. In European Conference on Computer Vision, pp. 592–607. Springer, 2020.

OpenAI. GPT-4 technical report, 2023.

Petersen, F., Kuehne, H., Borgelt, C., and Deussen, O. Differentiable top-k classification learning. In International Conference on Machine Learning, pp. 17656–17668. PMLR, 2022.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828, 2019.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following Llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wan, A., Dai, X., Zhang, P., He, Z., Tian, Y., Xie, S., Wu, B., Yu, M., Xu, T., Chen, K., et al.
FBNetV2: Differentiable neural architecture search for spatial and channel dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12965–12974, 2020.

Wang, H. and Fu, Y. Trainability preserving neural structured pruning. arXiv preprint arXiv:2207.12534, 2022.

Wang, H., Qin, C., Zhang, Y., and Fu, Y. Neural pruning via growing regularization. arXiv preprint arXiv:2012.09243, 2020.

Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Wightman, R., Touvron, H., and Jégou, H. ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476, 2021.

Xie, J., Su, X., You, S., Ma, Z., Wang, F., and Qian, C. ScaleNet: Searching for the model to scale. In European Conference on Computer Vision, pp. 104–120. Springer, 2022.

Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., and Pfister, T. Differentiable top-k with optimal transport. Advances in Neural Information Processing Systems, 33:20520–20531, 2020.

Yao, Z., Wu, X., Ma, L., Shen, S., Keutzer, K., Mahoney, M. W., and He, Y. LEAP: Learnable pruning for transformer-based models. arXiv preprint arXiv:2105.14636, 2021.

Zhuang, T., Zhang, Z., Huang, Y., Zeng, X., Shuang, K., and Li, X. Neuron-level structured pruning using polarization regularizer. Advances in Neural Information Processing Systems, 33:9865–9877, 2020.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

A. Appendix

A.1. More Details about DMS

A.1.1. SEARCH SPACE

Our search space encompasses both the width and depth of networks, which are the most critical structural hyperparameters for model scaling. The width in networks typically covers the channel dimension in convolutional layers, the feature dimension in fully connected layers, and the qkv dimension and number of heads in attention mechanisms, among others. For convolutional and fully connected layers, we use two distinct differentiable topk operators to model their respective input and output widths, treating each channel or feature as an individual element. For multi-head attention, we employ a single differentiable topk to represent the number of heads, treating each head as a separate element.

Specifically, we apply our differentiable topk to different layers by multiplying the masks output by the differentiable topk operators with the inputs to the layers. For convolutional layers, suppose the input is $X \in \mathbb{R}^{B \times C \times H \times W}$ and the mask is reshaped as $m \in \mathbb{R}^{1 \times C \times 1 \times 1}$; then $X \odot m$ serves as the new input to the layer. For an attention layer, we search the head dimensions of qkv and the number of heads. Suppose our supernet has H heads and D dims in each head. We have a mask for the qk head dim, $m_{qk} \in \mathbb{R}^{1 \times 1 \times 1 \times D}$, a mask for the v head dim, $m_v \in \mathbb{R}^{1 \times 1 \times 1 \times D}$, and a mask for the number of heads, $m_{head} \in \mathbb{R}^{1 \times H \times 1 \times 1}$. Suppose the sequence length is L and the qkv tensors for self-attention are $Q, K, V \in \mathbb{R}^{B \times H \times L \times D}$. We compute the output of the self-attention as $\mathrm{softmax}\left(\frac{Q' K'^T}{\sqrt{D}}\right) V'$, where $Q' = Q \odot m_{qk} \odot m_{head}$, $K' = K \odot m_{qk} \odot m_{head}$, and $V' = V \odot m_v \odot m_{head}$.

It is crucial to highlight that there can be channel or feature dependencies within models (Liu et al., 2021; Fang et al., 2023). Interdependent layers are treated as one group and share the same differentiable topk.
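The mask multiplications above can be written down in a few lines. The snippet below is a hedged sketch of how a channel mask, a head mask, and a depth mask might be applied; the tensor shapes follow the text, but the function names and the plain `torch` broadcasting are our assumptions, not the MMRazor-based implementation the authors describe next.

```python
import torch


def mask_conv_input(x: torch.Tensor, m_channel: torch.Tensor) -> torch.Tensor:
    # x: (B, C, H, W); m_channel: soft mask of length C from a differentiable topk.
    return x * m_channel.view(1, -1, 1, 1)


def masked_self_attention(q, k, v, m_qk, m_v, m_head):
    # q, k, v: (B, H, L, D); m_qk, m_v: masks of length D; m_head: mask of length H.
    m_qk = m_qk.view(1, 1, 1, -1)
    m_v = m_v.view(1, 1, 1, -1)
    m_head = m_head.view(1, -1, 1, 1)
    q = q * m_qk * m_head
    k = k * m_qk * m_head
    v = v * m_v * m_head
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v


def masked_residual_block(x, block, m_width, m_depth):
    # Joint width/depth masking, y = m_C * x + m_L * block(m_C * x), as in A.1.1.
    x = mask_conv_input(x, m_width)
    return x + m_depth * block(x)
```

These helpers only illustrate the broadcasting pattern; in the paper, interdependent layers share one topk operator, which corresponds to reusing the same mask tensor across the grouped layers.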
We implemented this using MMRazor (Contributors, 2021), an open-source model compression toolkit that can build element dependencies automatically.

Regarding depth, we focus on networks with residual connections. In this context, a residual block can be defined as $x_{i+1} = x_i + f(x_i)$, and contiguous residual blocks are viewed as a network stage. The depth in our approach mainly comprises the number of blocks in each stage. We use a single differentiable topk for a network stage, with each block functioning as a distinct element. We incorporate the soft masks of the differentiable topk into residual connections, allowing each block to be represented as $x_{i+1} = x_i + f(x_i) \cdot m_i$. In the context of Transformers, an attention mechanism combined with a feed-forward network (FFN) is considered one block sharing the same soft mask.

The depth and width structural hyperparameters are trained jointly in our approach. For example, given a layer and an input x, we use $m^L_i \in [0, 1]$ to denote the depth mask and $m^C \in [0, 1]^N$ to denote the width mask. The forward process is as follows: $y = m^C \odot x + m^L_i \cdot \mathrm{layer}(m^C \odot x)$. After searching, we prune depth and width according to the depth mask and width mask, respectively. Besides, as some maximal element numbers are small, ranging from several to tens (such as the number of blocks and attention heads), we increase the λ of those differentiable topk operators from N to 4N to better approximate a hard mask generation function.

For a structural hyperparameter x, we search it in a fine-grained manner, in the range of $[1, x_{max}]$ with a step of 1, while most prior NAS methods (Cai et al., 2019; Chen et al., 2021) search it in a coarse-grained manner, in the range of $[x_{min}, x_{max}]$ with a large step, such as 32. Our search method is also compatible with these coarse-grained search spaces. We limit the value of $a$ to $[0, 1 - \frac{x_{min}}{x_{max}}]$ to ensure $x \in [x_{min}, x_{max}]$. We treat contiguous elements within a step as a unit, and each unit shares the same element importance and mask. For example, when searching the number of channels in a layer with a step of 32, every 32 channels share the same element importance and mask.

A.1.2. RESOURCE CONSTRAINT

Our resource constraint loss is defined as:

$$\mathrm{loss}_{resource} = \begin{cases} \log(r_c / r_t) & \text{if } r_c > r_t \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

In this definition, $r_c$ symbolizes the current level of resource consumption, and $r_t$ denotes the targeted level of resource consumption. If $r_c$ exceeds $r_t$, a non-zero $\mathrm{loss}_{resource}$ is used to compress the model. The value of $r_t$ is user-specified. The value of $r_c$ is calculated based on the learnable parameters of the differentiable topk operators. Taking a linear layer as an example, we use $f_{in}$ and $f_{out}$ to represent the number of input and output features, and $a_{in}$ and $a_{out}$ are the learnable parameters of the differentiable topk operators for that layer. We compute the different resource consumptions as follows:

- MAC constraint: $r_c = f_{in}(1 - a_{in}) \cdot f_{out}(1 - a_{out}) \cdot \mathrm{batchsize}$.
- Number-of-parameters constraint: $r_c = f_{in}(1 - a_{in}) \cdot f_{out}(1 - a_{out})$.
- Latency constraint: $r_c = \mathrm{latency}_{max} \cdot F(a_{in}, a_{out})$, where $\mathrm{latency}_{max}$ is the latency of the layer without pruning, and F is a continuous function that maps the pruning ratios $a_{in}, a_{out}$ to the ratio of the latency of the pruned layer to that of the original layer. F can be found by:

$$F = \arg\min_{F} \mathrm{MSE}\big(\mathrm{latency}_{pruned}(a_{in}, a_{out}) - \mathrm{latency}_{max} \cdot F(a_{in}, a_{out})\big), \qquad (9)$$

where $\mathrm{latency}_{pruned}(a_{in}, a_{out})$ is the latency of the pruned layer with $a_{in}, a_{out}$ as the pruning ratios.
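As an illustration of how $r_c$ and the loss in Equation (7) can be computed from the learnable pruning ratios, here is a hedged sketch for a stack of linear layers under a parameter-count constraint. It reflects our own reading of the formulas above (in particular, the kept fraction is taken to be $1 - a$) and is not the authors' implementation.

```python
import torch


def current_param_count(layers):
    # `layers` is a list of (f_in, f_out, a_in, a_out) tuples, where a_in/a_out
    # are the learnable pruning ratios of the corresponding topk operators.
    total = 0.0
    for f_in, f_out, a_in, a_out in layers:
        kept_in = f_in * (1.0 - a_in.clamp(0.0, 1.0))
        kept_out = f_out * (1.0 - a_out.clamp(0.0, 1.0))
        total = total + kept_in * kept_out
    return total


def resource_loss(layers, r_target: float) -> torch.Tensor:
    # Equation (7): log(r_c / r_t) when the current consumption exceeds the
    # target, and zero otherwise, so the penalty vanishes once the budget is met.
    r_c = current_param_count(layers)
    return torch.clamp(torch.log(r_c / r_target), min=0.0)
```

The total loss of Equation (5) would then be `loss_task + lambda_resource * resource_loss(layers, r_t)`, with the target $r_t$ optionally annealed toward its final value as described just below.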
Taking the OFA search space as an example, $a_{out}$ always equals $a_{in}$, and each block has only three width choices; hence, we simply employ a quadratic function as F, and it works well.

To enhance stability during training, we gradually reduce $r_t$ to the final target value $r_t^{final}$ throughout the training process by default. For an epoch-based training procedure, $r_t$ is determined by an exponential decay function:

$$r_t = \left(\frac{r_t^{final}}{r_{supernet}}\right)^{e / e_{max}} \cdot r_{supernet}, \quad \text{where } \frac{r_t^{final}}{r_{supernet}} < 1. \qquad (10)$$

Here, e denotes the current epoch out of $e_{max}$ total epochs, and $r_{supernet}$ is a constant representing the resource demand of the supernet. Furthermore, since resource consumption can fluctuate significantly with respect to depth, we introduce extra epochs dedicated to optimizing width while keeping depth constant. By adopting the strategies outlined above, Differentiable Model Scaling ensures that models adhere to specific resource constraints.

A.2. More Experiments

A.2.1. COMPARISON WITH MORE NAS METHODS

We additionally compare our method with more NAS methods, as shown in Table 9. Note that this is a rough comparison. As it is hard to compute a precise search cost for some methods, we only group them into two categories, high-search-cost methods and low-search-cost methods, according to a rough estimation of their search costs. High-search-cost methods cost over 100 GPU days, while the others belong to the low-search-cost methods. We draw Table 4 and Table 9 as accuracy-vs-MACs plots and a search-cost-vs-accuracy plot, as shown in Figure 3. It can be observed that our DMS outperforms low-search-cost methods significantly, and it achieves much lower search costs than high-search-cost methods while achieving comparable or even higher performance.

A.2.2. MORE IMAGE CLASSIFICATION EXPERIMENTS ON CNNS

We validated our method on ResNet (He et al., 2016) and MobileNetV2 (Sandler et al., 2018). Our searched ResNet surpasses ResNet-50 by 1.1%. Furthermore, when the searched ResNet is trained using an enhanced training setting (referred to as rsb-a1 (Wightman et al., 2021)), it also exceeds the corresponding model by 0.9%. Although MobileNetV2 is lightweight, our searched version outperforms the original model by a margin of 1.0%.

A.2.3. OBJECT DETECTION EXPERIMENTS ON COCO

Thanks to the complete end-to-end search enabled by our differentiable topk, DMS is a general search method that can be applied to various tasks. We also evaluated DMS for object detection on COCO. We chose Yolo-v8-n (Jocher et al., 2023) as the baseline model and searched for its optimal structure. Our searched version outperforms the original model by 2.0% box AP, as shown in Table 11.

Table 9. Rough Comparison with More NAS Methods. As some methods did not report their search cost, we simply use High and Low to represent the search cost level of NAS methods: High for multi-shot NAS methods and one-shot NAS methods, Low for gradient-based NAS methods and zero-shot NAS methods. * means the model is trained with distillation.
We use the pipeline of DMSnp, which does not load pretrained weights, in this table.

Model | NAS Type | Top-1 (%) | MACs (G) | Params (M) | Cost Level
Joint Pruning (Guo et al., 2021a) | Gradient | 77.3 | 0.34 | / | Low
DMSnp-EN-350 (ours) | Gradient | 78.5 | 0.35 | 5.6 | Low
Zen-score (Lin et al., 2021) | Zero Shot | 78.0 | 0.41 | 5.7 | Low
DMSnp-EN-B0 (ours) | Gradient | 79.0 | 0.39 | 6.2 | Low
ZiCo (Li et al., 2023) | Zero Shot | 78.1 | 0.45 | / | Low
DMSnp*-EN-450 (ours) | Gradient | 79.4 | 0.45 | 6.5 | Low
Zen-score (Lin et al., 2021) | Zero Shot | 79.1 | 0.60 | 7.1 | Low
ZiCo (Li et al., 2023) | Zero Shot | 79.4 | 0.60 | / | Low
DMSnp-EN-B1 (ours) | Gradient | 80.7 | 0.68 | 8.9 | Low
ZiCo (Li et al., 2023) | Zero Shot | 80.5 | 1.0 | / | Low
Zen-score (Lin et al., 2021) | Zero Shot | 80.8 | 0.9 | 19.4 | Low
DMSnp-EN-B2 (ours) | Gradient | 81.8 | 1.1 | 9.6 | Low
MnasNet-A2 (Tan et al., 2019) | Multi Shot | 75.6 | 0.34 | 4.8 | High
FBNetV2-L1 (Wan et al., 2020) | Gradient | 77.2 | 0.33 | / | Low
ScaleNet-EN-B0 (Xie et al., 2022) | One Shot | 77.5 | 0.35 | 4.4 | High
DMSnp-EN-350 (ours) | Gradient | 78.0 | 0.35 | 5.6 | Low
MnasNet-A3 (Tan et al., 2019) | Multi Shot | 76.7 | 0.40 | 5.2 | High
EfficientNet-B0 (Tan & Le, 2019) | Multi Shot | 77.1 | 0.39 | 5.3 | High
DMSnp-EN-B0 (ours) | Gradient | 78.5 | 0.39 | 6.2 | Low
DONNA (Moons et al., 2021) | One Shot | 78.0 | 0.50 | / | High
DMSnp*-EN-450 (ours) | Gradient | 78.8 | 0.45 | 6.5 | Low
DMSnp*-EN-450 (ours) | Gradient | 79.4 | 0.45 | 6.5 | Low
EfficientNet-B1 (Tan & Le, 2019) | Multi Shot | 79.1 | 0.69 | 7.8 | High
DMSnp-EN-B1 (ours) | Gradient | 80.0 | 0.68 | 8.9 | Low
ScaleNet-EN-B1 (Xie et al., 2022) | One Shot | 79.9 | 0.80 | 7.4 | High
Model Amplification-EN-B1 (Liu et al., 2022) | Multi Shot | 79.9 | 0.68 | 8.8 | High
EfficientNet-B2 (Tan & Le, 2019) | Multi Shot | 80.1 | 1.0 | 9.2 | High
Model Amplification-EN-B2 (Liu et al., 2022) | Multi Shot | 80.9 | 1.0 | 9.3 | High
BigNAS-XL (Liu et al., 2022) | One Shot | 80.9 | 1.0 | 9.5 | High
DMSnp-EN-B2 (ours) | Gradient | 81.1 | 1.1 | 9.6 | Low
DMSnp-EN-B2 (ours) | Gradient | 81.8 | 1.1 | 9.6 | Low

Table 10. Experiments on ImageNet with classic CNNs. We searched the models' width and depth and compared them with the original models. We use the pipeline of DMSnp, which does not load pretrained weights, in this table.

Model | Top-1 (%) | MACs (G) | Params (M)
ResNet-50 (He et al., 2016) | 76.5 | 4.1 | 25.6
DMS-ResNet | 77.6 | 4.0 | 28.4
ResNet-50-rsb-a1 (He et al., 2016) | 80.1 | 4.1 | 25.6
DMS-ResNet-rsb-a1 | 81.0 | 4.0 | 28.4
MobileNetV2 (Sandler et al., 2018) | 72.0 | 0.3 | 3.4
DMS-MobileNetV2 | 73.0 | 0.3 | 5.3

Table 11. Object detection experiments on COCO. We use the pipeline of DMSnp, which does not load pretrained weights, in this table.

Model | mAP (%) | MACs (G) | Params (M)
Yolo-v8-n (Jocher et al., 2023) | 37.4 | 4.4 | 3.2
DMSnp-Yolo-v8-n (ours) | 39.4 | 4.2 | 2.7

Figure 3. We draw these three plots based on Table 4 and Table 9. Larger dots represent the High search-cost level and smaller dots the Low search-cost level; dashed lines represent models trained with distillation. (a) Performance comparison with low-search-cost methods: our method outperforms these methods significantly. (b) Performance comparison with high-search-cost methods: our method achieves comparable or even better performance, while all high-search-cost methods cost more than dozens of times our total search cost. (c) Search cost comparison with high-search-cost methods: we only plot methods with precise search cost estimates; the search costs of the omitted high-search-cost methods are also larger than 100 GPU days. Search costs are shown on a log scale.

Table 12. Comparison with other gradient-based methods. We use the DMSp pipeline for all methods in this table.
Method | Top-1 (%) | MACs (G)
DMSp w/o normalization | 0.1 | 3
DMSp (ours) | 70.7 | 3

A.3. More Ablation Study

A.3.1. ABLATION STUDY ON IMPORTANCE NORMALIZATION

We compare our method with a variant that disables importance normalization (replacing Eq. 2 with c'_i = c_i), as shown in Table 12. We use the DMSp pipeline for all methods in this table and ResNet-50 as the supernet. Without importance normalization, our method drops to an extremely low performance. This is caused by the unevenly distributed importance, which makes our topk function hard to optimize, and it demonstrates the significance of our importance normalization.

A.3.2. ABLATION STUDY ON PRETRAINING AND SUPERNET SIZES

It is widely accepted that both pretraining and increasing the supernet size can improve the performance of searched models (Liu et al., 2018b; Frankle & Carbin, 2018). However, both also increase resource consumption. Here, we compare the influence of pretraining and supernet size on performance and resource consumption. Note that loading pretrained weights corresponds to the DMSp pipeline, while searching from random initialization corresponds to the DMSnp pipeline. We search for ResNet models of 1G MACs with different supernet sizes and compare their performance and resource consumption. The results are shown in Table 13. Regarding pretraining, searching on a pretrained ResNet-50 surpasses searching on a randomly initialized ResNet-50 by 0.7%, but pretraining also increases resource consumption by about 10 times. Regarding supernet size, searching on a randomly initialized ResNet-152 surpasses searching on a randomly initialized ResNet-50 by 1.5% while only increasing resource consumption by about 3 times. We find that increasing the supernet size is more efficient than using strong pretrained weights for our method. Therefore, our default pipeline DMSnp simply searches on randomly initialized supernets, which makes our method much more efficient than high-search-cost NAS methods and easier to apply in the real world. This benefit originates from the high search efficiency of our method, which lets us spend far fewer epochs searching for a model than pretraining a supernet would require.

Table 13. Ablation study on pretraining and supernet size. We search for 1G-MACs models in these experiments. Cost_pretrain is the cost of pretraining a supernet and Cost_search is the cost of searching a model; the Cost column reports Cost_pretrain + Cost_search in units of GMACs × epochs.

Supernet | Pretrain (Pipeline) | Cost | Top-1 (%)
ResNet-50 | Y (DMSp) | 410 + 41 | 73.8
ResNet-50 | N (DMSnp) | 0 + 41 | 73.1
ResNet-101 | N (DMSnp) | 0 + 79 | 74.2
ResNet-152 | N (DMSnp) | 0 + 116 | 74.6

A.3.3. SEARCH WITH LATENCY CONSTRAINT

Besides the MAC constraint, we also conduct experiments with a latency constraint. We utilize the supernet of OFA (Cai et al., 2019) and search for models under a latency constraint of 3 ms on an RTX 3090. The results are shown in Table 14. Our method incurs much lower search costs while achieving the same performance as the evolutionary algorithm, which demonstrates that our method is applicable under different constraints. The implementation details of our latency constraint are described in Appendix A.1.2.

Table 14. Experiment with latency constraint. We compare our DMS with an evolutionary algorithm (EA) under a latency constraint of 3 ms on an RTX 3090. Only EA + predictor incurs a public cost, which is used to train an accuracy predictor. The unit of search cost is GPU hours, reported as public + private. We use the DMSp pipeline in this table.

Method | Top-1 (%) | Latency (ms) | Search Cost (GPU hours)
EA + predictor (Cai et al., 2019) | 78.3 | 3 | 40 + 0.01
DMSp (ours) | 78.3 | 3 | 0 + 0.05
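As an illustration of how the latency mapping F from Appendix A.1.2 (Eq. 9) can be fitted in practice, the sketch below performs a quadratic least-squares fit, assuming (as in the OFA search space) that a_out equals a_in. The measured latency ratios are placeholder values; in a real setup they would come from profiling the pruned layer on the target device.

```python
import numpy as np

def fit_latency_mapping(kept_ratios, latency_ratios, degree=2):
    """Fit F(a) ~= latency_pruned(a) / latency_max by least squares (Eq. 9),
    using a quadratic polynomial as done for the OFA search space."""
    coeffs = np.polyfit(kept_ratios, latency_ratios, deg=degree)
    return np.poly1d(coeffs)

# Placeholder profiling results: ratio a -> measured latency ratio.
kept_ratios = np.array([0.25, 0.5, 0.75, 1.0])
latency_ratios = np.array([0.40, 0.58, 0.80, 1.00])

F = fit_latency_mapping(kept_ratios, latency_ratios)
latency_max_ms = 3.0                 # latency of the unpruned layer
r_c = latency_max_ms * F(0.5)        # latency-based r_c for a = 0.5
print(r_c)
```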
A.3.4. ABLATION STUDY OF SEARCH TIME

We assessed the relationship between search time and final performance, as detailed in Table 15. Our method achieves the best performance with 10 epochs of searching, which is about one tenth of the epochs required to retrain the searched model. This confirms the high search efficiency of our method: a few epochs are enough to search for a model. Moreover, even when the search time is only 3 epochs, the searched model still outperforms the human-designed ResNet-18 while using fewer MACs.

Table 15. Ablation study on search time. We use the pipeline of DMSnp, which does not load pretrained weights, in this table. Note: training a ResNet-18 from scratch requires 100 epochs.

Search Time | MACs (G) | Top-1 (%)
ResNet-18 | 1.8 | 69.9
3 epochs | 1 | 71.6
5 epochs | 1 | 72.9
10 epochs | 1 | 73.1
20 epochs | 1 | 72.8

A.3.5. ABLATION STUDY OF HYPERPARAMETERS FOR OUR METHOD

We divide the hyperparameters of our method into two categories: fixed hyperparameters, which are the same for all models, and unfixed hyperparameters, which need to be tuned for different models.

The fixed hyperparameters are the decay rate for Taylor importance and the temperature λ of our differentiable topk operator. Taylor importance (Molchanov et al., 2019) is a well-known measure of element importance, and moving-average decay is widely used in the literature, so we directly adopt a decay rate of 0.99 following prior works. The temperature λ is used to polarize (Zhuang et al., 2020) the element masks, so it suffices to select a value that achieves this polarization. Thanks to our importance normalization, the temperature can be computed directly in closed form, as detailed in Section 3.1.2: we set λ to N for width elements and to 4N for depth elements and the number of heads in attention mechanisms, where N is the number of elements in the corresponding dimension. These values work well for all models, so we do not conduct an ablation study on the fixed hyperparameters.

The unfixed hyperparameters are the weight of the resource constraint loss, λ_resource, and the learning rate for structure parameters, lr_structure. They control the update of the structure parameters: the update of a structure parameter is computed as lr_structure · (g_task + λ_resource · g_resource), where g_task and g_resource are the gradients of the structure parameter with respect to the task loss and the resource constraint loss, respectively (a minimal sketch of this step is given at the end of this subsection). Table 16 shows the ablation results.

Table 16. Ablation study on unfixed hyperparameters. / denotes that the model cannot reach our resource target. We use the pipeline of DMSnp, which does not load pretrained weights, in this table.

λ_resource \ lr_structure | 5e-2 | 5e-3 | 5e-4 | 5e-5
0.1 | / | / | / | /
1 | 73.0 | 73.1 | 72.9 | /
10 | 72.5 | 72.2 | 72.6 | 70.9

Two observations stand out: 1) a smaller λ_resource is better, as long as the model can still reach the target resource constraint, because a smaller λ_resource lets the task loss take more control of the structure-parameter updates; 2) when λ_resource is small, the model is not sensitive to the choice of lr_structure, while when λ_resource is large, a relatively large lr_structure is better. This is because reaching the target resource constraint quickly reduces the influence of the resource constraint loss, since that loss becomes zero once the model reaches the target. Therefore, setting λ_resource and lr_structure is not difficult: we first fix lr_structure and tune λ_resource to a small value that still lets the model reach the target resource constraint; we then tune lr_structure to a relatively large value so that the model reaches the target within the first few hundred iterations. Observing the decrease in resource usage during the first epoch is enough to set these two hyperparameters. Compared with other NAS methods, our method uses fewer hyperparameters; for example, Model Amplification (Liu et al., 2022) must tune at least five hyperparameters for different tasks and models.
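To make the structure-parameter update above concrete, here is a minimal, hedged PyTorch-style sketch of a single step; the toy losses and variable names are illustrative and are not the actual implementation.

```python
import torch

def structure_step(a, task_loss, resource_loss,
                   lr_structure=5e-3, lambda_resource=1.0):
    """One update of a structure parameter `a`:
    update = lr_structure * (g_task + lambda_resource * g_resource)."""
    g_task, = torch.autograd.grad(task_loss, a, retain_graph=True)
    g_resource, = torch.autograd.grad(resource_loss, a)
    with torch.no_grad():
        a -= lr_structure * (g_task + lambda_resource * g_resource)

# Toy usage with a single learnable ratio and placeholder losses.
a = torch.tensor(0.8, requires_grad=True)
task_loss = (1.0 - a) ** 2             # placeholder task loss
resource_loss = torch.relu(a - 0.5)    # placeholder resource constraint loss
structure_step(a, task_loss, resource_loss)
print(a.item())
```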
A.3.6. ABLATION STUDY ON SEARCH SPACES

In the experiments above, besides our own search space, we also applied our method to the search space defined by OFA (Cai et al., 2019) in Section 4.1.2 and Appendix A.3.3, and the results show that our method works well on these search spaces. Here, we conduct a further ablation study on search spaces. There are two differences between our search space and those of prior NAS methods. First, on CNN models we only search for width and depth, while prior methods usually also search for resolution and kernel size. We do not search for resolution and kernel size because we aim to build a general search method that can be applied to various tasks and architectures, and searching for resolution and kernel size is unnecessary for Transformers and for tasks other than vision. Second, due to our high search efficiency, our search space is more fine-grained, whereas prior methods usually adopt a coarse-grained search space containing far fewer sub-networks than ours. Coarse-grained search spaces also tend to require more expert knowledge to design, which makes it hard to build a general search method. Specifically, for a structural hyperparameter x, we directly search in the range [1, x_max] with a step of 1, while most prior methods search in the range [x_min, x_max] with a minimal step, such as 32 or 64.

Here, we conduct an ablation study based on the Autoformer (Chen et al., 2021) search space. Autoformer uses a coarse-grained search space, e.g., searching the embedding size with a minimum of 192, a maximum of 240, and a step of 24. We search two models: one with the exact same coarse-grained search space as Autoformer and one with a corresponding fine-grained search space. The results are shown in Table 17.

Table 17. Ablation study on search spaces. We compare with Autoformer using the exact same search space. We use the pipeline of DMSnp, which does not load pretrained weights, in this table.

Method | Search Space | Supernet Size (G) | Top-1 (%) | Search Cost (GPU days)
Autoformer-T (Chen et al., 2021) | Coarse-Grained | 2.2 | 74.7 | > 25
DMSnp (ours) | Fine-Grained | 2.2 | 74.8 | 2
DMSnp (ours) | Fine-Grained | 6.1 | 75.1 | 5.5
DMSnp (ours) | Coarse-Grained | 2.2 | 75.2 | 2

With the same supernet size, our method performs better than Autoformer in both the fine-grained and the coarse-grained search space, with less than one tenth of the search cost, which proves our high search efficiency. Besides, we find that the coarse-grained search space yields better performance than the fine-grained search space for our method.
This is because the human-designed coarse-grained search space is already a well-chosen sub-space of the fine-grained search space: it is the product of a search effectively conducted by human experts, refined across many research projects, and this human search stage likely consumed substantial time and resources. Besides, we find that increasing the supernet size lets the fine-grained search space reach performance comparable to the coarse-grained one, which indicates that a larger supernet contains better subnets that our method is able to find. In this paper, since we aim to build a general search method with minimal human labor, we still use the fine-grained search space as our default.

A.4. Implementation Details

A.4.1. DETAIL OF TRAINING SETTING

In general, given a baseline model and a training setting, we enlarge the baseline model to obtain our supernet and reduce the number of epochs of the training setting to obtain our search setting. The details of our experiment settings are listed below.

EfficientNet: For all DMSnp-EN variants, we pruned the supernets over 30 epochs. For the DMSnp-EN variants with fewer than 0.5G MACs, pruning was conducted from EfficientNet-B4 with an input size of 224, while for DMSnp-EN-B1 and B2 it started from EfficientNet-B7 with input sizes of 256 and 288, respectively. Subsequently, the DMSnp-EN variants were retrained using the corresponding EfficientNet training scripts available in the Timm library (Wightman, 2019).

ResNet: We pruned the ResNet over ten epochs, starting from ResNet-152. After pruning, the model was retrained using the MMPretrain (Contributors, 2023) training settings, covering both the basic setting with a step learning-rate scheduler and the rsb-a1 configuration.

MobileNetV2: To search for the ideal structure of MobileNetV2, we first enlarged it before pruning: all channel numbers were expanded by 1.5 times, and the number of blocks in each stage was doubled. The pruning process spans 30 epochs, after which the architecture was retrained with the MMPretrain training settings.

Deit: We increased the depth of the Deit-small model from 12 to 16 to serve as the supernet. Pruning was conducted over 30 epochs, including 20 warmup epochs. After pruning, we retrained the model using MMPretrain combined with the Swin training setting.

Yolo-v8: We used a Yolo-v8 with a deepen factor of 0.5 and a widen factor of 0.5 as our supernet, whereas the original Yolo-v8-n has a deepen factor of 0.33 and a widen factor of 0.26. We trained the supernet with the Yolo-v8-n training setting and pruned it over 30 epochs. The Yolo-v8 experiments were conducted with MMYolo (Contributors, 2022).

A.4.2. DETAIL OF SEARCH COST ESTIMATION

In this section, we describe how we estimate the search costs of the other NAS methods reported in Table 4. The search cost of a searched model is divided into two parts: a public part, shared by all sub-models, and a private part specific to a particular sub-model.

EfficientNet: EfficientNet searches for common scaling strategies across all variants and thus incurs no private search cost.
The public search cost estimate for EfficientNet is taken directly from the ScaleNet paper (Xie et al., 2022).

ScaleNet (Xie et al., 2022): The ScaleNet paper explicitly reports its search cost: a public cost of 379 GPU days and a private cost of 106 GPU days for several sub-models totaling 21G MACs. We compute the private search cost of a sub-model from the ratio of its MACs to the overall 21G MACs.

Model Amplification (Liu et al., 2022): As a multi-shot NAS method, Model Amplification requires training multiple models. For all sub-models, it uses a public proxy dataset and a proxy training script: approximately 2007 epochs are spent examining the proxy dataset and an additional 2963 epochs on the proxy training script, for a total of 4970 epochs. During the model search phase, for a variant with 390M MACs, Model Amplification trains about 390 models per iteration; assuming a ten-fold iterative search per model, this results in roughly 3000 epochs. By benchmarking the training time of EfficientNet-B0 on an A100, we determine that 100 epochs require about 2.5 GPU days. As a result, the public search cost of Model Amplification is at least 144 GPU days, while the private cost of the 390M-MACs variant is 75 GPU days. We scale the search costs of the other variants linearly with their MACs.

Joint Pruning (Guo et al., 2021a): As a gradient-based pruning method, Joint Pruning presumably employs a supernet and training script analogous to ours, so we deduce its search cost from the number of pruning epochs. The Joint Pruning paper indicates that a quarter of the total training epochs is devoted to model searching, whereas we use at most a tenth of the total epochs for this purpose. Hence, the search cost of Joint Pruning is about 2.5 times ours.

A.5. Visualization of Searched Model Structure

In Figure 4, we visualize the structure of our searched DMSnp-EN-B0 in comparison with EfficientNet-B0. A distinct observation is the depth of DMSnp-EN-B0: it has 8 more inverted residual blocks than its EfficientNet counterpart. Furthermore, looking at the channel distribution across stages, it becomes evident that DMSnp-EN-B0 has undergone significant structural modifications, departing from the traditional blueprint of EfficientNet-B0. Such distinct differences underscore the fine-grained adaptability of our method, emphasizing its capability to recalibrate and refine models so that they are closely tailored to the task.

Figure 4. Visualization of our searched structure. The x-axis represents each layer's width (channels/features), while the y-axis represents the layers. As DMSnp-EN-B0 has more layers than EfficientNet-B0, the widths of the extra layers for EfficientNet-B0 are shown as 0.