# Channel Pruning via Automatic Structure Search

Mingbao Lin¹, Rongrong Ji¹, Yuxin Zhang¹, Baochang Zhang², Yongjian Wu³, Yonghong Tian⁴
¹Media Analytics and Computing Laboratory, Department of Artificial Intelligence, School of Informatics, Xiamen University, China
²School of Automation Science and Electrical Engineering, Beihang University, China
³Tencent Youtu Lab, Tencent Technology (Shanghai) Co., Ltd, China
⁴School of Electronics Engineering and Computer Science, Peking University, Beijing, China
lmbxmu@stu.xmu.edu.cn, rrji@xmu.edu.cn, yxzhangxmu@163.com, bczhang@buaa.edu.cn, littlekenwu@tencent.com, yhtian@pku.edu.cn
(Corresponding Author)

Abstract

Channel pruning is among the predominant approaches to compressing deep neural networks. Most existing pruning methods select channels (filters) by importance, optimization, or regularization based on rule-of-thumb designs, which results in sub-optimal pruning. In this paper, we propose a new channel pruning method based on the artificial bee colony (ABC) algorithm, dubbed ABCPruner, which aims to efficiently find the optimal pruned structure, i.e., the channel number in each layer, rather than selecting important channels as previous works did. To cope with the intractably huge number of pruned-structure combinations for deep networks, we first propose to shrink the combinations by limiting the preserved channels to a specific space, so that the number of candidate pruned structures is significantly reduced. We then formulate the search for the optimal pruned structure as an optimization problem and integrate the ABC algorithm to solve it in an automatic manner, which lessens human interference. ABCPruner has been demonstrated to be more effective, and it also enables the fine-tuning to be conducted efficiently in an end-to-end manner. The source code is available at https://github.com/lmbxmu/ABCPruner.

1 Introduction

The high demands in computing power and memory footprint of deep Convolutional Neural Networks (CNNs) prohibit their practical application on edge computing devices such as smart phones or wearable gadgets. To address this problem, extensive studies have been made on compressing CNNs. Prevalent techniques resort to quantization [Wang et al., 2019a], decomposition [Zhang et al., 2015], and pruning [Singh et al., 2019]. Among them, channel pruning has been recognized as one of the most effective tools for compressing CNNs [Luo et al., 2017; He et al., 2017; He et al., 2018b; He et al., 2018a; Liu et al., 2019a; Wang et al., 2019b].

Channel pruning targets removing entire channels in each layer, which is straightforward but challenging because removing channels in one layer may drastically change the input of the next layer. Most cutting-edge practice implements channel pruning by selecting channels (filters) based on rule-of-thumb designs. Existing works follow two mainstreams. The first aims to identify the most important filter weights in a pre-trained model, which are then inherited by the pruned model as an initialization for the follow-up fine-tuning [Hu et al., 2016; Li et al., 2017; He et al., 2017; He et al., 2018a]. It usually performs layer-wise pruning and fine-tuning, or layer-wise weight reconstruction followed by a data-driven and/or iterative optimization to recover model accuracy, both of which, however, are time-consuming.
The second typically performs channel pruning based on hand-crafted rules that regularize the retraining of a full model, followed by pruning and fine-tuning [Liu et al., 2017; Huang and Wang, 2018; Lin et al., 2019]. It requires human experts to design and decide hyper-parameters such as the sparsity factor and pruning rate, which is not automatic and thus less practical for compressing various CNN models. Besides, rule-of-thumb designs usually produce sub-optimal pruning [He et al., 2018b].

The motivation of our ABCPruner is two-fold. First, [Liu et al., 2019b] showed that the essence of channel pruning lies in finding the optimal pruned structure, i.e., the channel number in each layer, instead of selecting important channels. Second, [He et al., 2018b] proved the feasibility of applying automatic methods for controlling hyper-parameters to channel pruning, which requires less human interference. However, exhaustively searching for the optimal pruned structure is a great challenge. Given a CNN with $L$ layers, the number of pruned-structure combinations is $\prod_{j=1}^{L} c_j$, where $c_j$ is the channel number in the $j$-th layer. This combination overhead is extremely intensive for deep CNNs¹, which is prohibitive for resource-limited scenarios.

In this paper, we introduce ABCPruner towards optimal channel pruning. To solve the above problem, two strategies are adopted in our ABCPruner. First, we propose to shrink the combinations by limiting the number of preserved channels to $\{0.1c_j, 0.2c_j, \ldots, \alpha c_j\}$, where $\alpha \in \{10\%, 20\%, \ldots, 100\%\}$. Hence there are $10\alpha$ feasible solutions for each layer, and at most a fraction $\alpha$ of the channels in each layer will be preserved. This operation is interpretable since channel pruning usually preserves a certain percentage of channels. By doing so, the number of combinations is significantly reduced to $(10\alpha)^L \ll \prod_{j=1}^{L} c_j$, making the search more efficient. Second, we obtain the optimal pruned structure through the artificial bee colony (ABC) algorithm [Karaboga, 2005]. ABC is automatic rather than hand-crafted, which thus lessens the human interference in determining the channel number.

¹For example, the combinations are $2^{104}$ for VGGNet-16, $2^{448}$ for GoogLeNet, and $2^{1182}$ for ResNet-152.

Figure 1: Framework of ABCPruner. (a) A structure set is initialized first, elements of which represent the preserved channel number. (b) The filters of the pre-trained model are randomly assigned to each structure. We train it for given epochs to measure its fitness. (c) Then, the ABC algorithm is introduced to update the structure set and the fitness is recalculated through (b). (b) and (c) continue for some cycles. (d) The optimal pruned structure with the best fitness is picked, and its trained weights are reserved as a warm-up for fine-tuning the pruned network. (Best viewed by zooming in.)

As shown in Fig. 1, we first initialize a structure set, each element of which represents the preserved channel number in each layer. The filter weights of the full model are randomly selected and assigned to initialize each structure. We train it for a given number of epochs to measure its fitness, i.e., its accuracy in this paper. Then, ABC is introduced to update the structure set. Similarly, filter assignment, training, and fitness calculation are conducted for the updated structures. We continue the search for some cycles.
Finally, the structure with the best fitness is taken as the optimal pruned structure, and its trained weights are reserved as a warm-up for fine-tuning. Thus, our ABCPruner can effectively implement channel pruning by automatically searching for the optimal pruned structure.

To our best knowledge, there exists only one prior work [Liu et al., 2019a] which considers the pruned structure in an automatic manner. Our ABCPruner differs from it in two aspects. First, the pruning in [Liu et al., 2019a] is a two-stage scheme where a large PruningNet is trained in advance to predict weights for the pruned model, and then an evolutionary algorithm is applied to generate a new pruned structure; our ABCPruner is one-stage without a particularly designed network, which thus simplifies the pruning process. Second, the combinations in ABCPruner are drastically reduced to $(10\alpha)^L$ with $\alpha \in \{10\%, 20\%, \ldots, 100\%\}$, which makes the channel pruning more efficient.

2 Related Work

Network Pruning. Network pruning can be categorized into either weight pruning or channel pruning. Weight pruning removes individual neurons in the filters or connections across different layers [Han et al., 2015; Guo et al., 2016; Li et al., 2017; Aghasi et al., 2017; Zhu and Gupta, 2017; Frankle and Carbin, 2019]. After pruning, a significant portion of the CNN weights are zero, and thus the memory cost can be reduced by storing the model in a sparse format. However, such a pruning strategy usually leads to an irregular network structure and memory access, which requires customized hardware and software to achieve practical speedup. In contrast, channel pruning is especially advantageous because it directly removes entire redundant filters, which is well supported by general-purpose hardware and BLAS libraries.

Existing methods pursue pruning by channel selection based on rule-of-thumb designs. To that effect, many works aim to inherit channels based on the importance estimation of filter weights, e.g., the $\ell_p$-norm [Li et al., 2017; He et al., 2018a] and the sparsity of activation outputs [Hu et al., 2016]. [Luo et al., 2017; He et al., 2017] formulated channel pruning as an optimization problem, which selects the most representative filters to recover the accuracy of the pruned network with minimal reconstruction error. However, it is still hard to implement in an end-to-end manner without iterative pruning (optimization) and fine-tuning. Another group focuses on regularization-based pruning. For example, [Liu et al., 2017] imposed a sparsity constraint on the scaling factors of batch normalization layers, while [Huang and Wang, 2018; Lin et al., 2019] proposed a sparsity-regularized mask for channel pruning, which is optimized through data-driven selection or generative adversarial learning. However, these methods usually require another round of retraining and manual hyper-parameter analysis. Our work differs from traditional methods in that it is automatic and the corresponding fine-tuning is end-to-end, which has been demonstrated to be feasible by the recent work of [He et al., 2018b].

AutoML. Traditional pruning methods involve human experts in hyper-parameter analysis, which hinders their practical application. Thus, automatic pruning has attracted increasing attention, which can be regarded as a specific AutoML task.
Most prior AutoML-based pruning methods are implemented in a bottom-up and layer-by-layer manner, typically achieving automation through Q-value [Lin et al., 2017], reinforcement learning [He et al., 2018b], or an automatic feedback loop [Yang et al., 2018]. Other lines include, but are not limited to, constraint-aware optimization via an annealing strategy [Chen et al., 2018] and sparsity-constraint regularization via joint training [Luo and Wu, 2018]. Different from most prior AutoML pruning, our work is inspired by the recent work of [Liu et al., 2019b], which reveals that the key of channel pruning lies in the pruned structure, i.e., the channel number in each layer, instead of selecting important channels.

3 The Proposed ABCPruner

Given a CNN model $N$ that contains $L$ convolutional layers and its filter set $W$, we refer to $C = (c_1, c_2, \ldots, c_L)$ as the network structure of $N$, where $c_j$ is the channel number of the $j$-th layer. Channel pruning aims to remove a portion of the filters in $W$ while keeping a comparable or even better accuracy. To that effect, traditional methods focus on selecting channels (filters) based on rule-of-thumb designs, which are usually sub-optimal [He et al., 2018b]. Instead, our ABCPruner differs from these methods by finding the optimal pruned structure, i.e., the channel number in each layer, and then solves the pruning in an automatic manner by integrating the off-the-shelf ABC algorithm [Karaboga, 2005]. We illustrate our ABCPruner in Fig. 1.

3.1 Optimal Pruned Structure

Our channel pruning is inspired by the recent study of [Liu et al., 2019b], which reveals that the key step in channel pruning lies in finding the optimal pruned structure, i.e., the channel number in each layer, rather than selecting important channels. For any pruned model $N'$, we denote its structure as $C' = (c'_1, c'_2, \ldots, c'_L)$, where $c'_j \le c_j$ is the channel number of the pruned model in the $j$-th layer. Given the training set $T_{train}$ and test set $T_{test}$, we aim to find the optimal combination $C'$ such that the pruned model $N'$ trained/fine-tuned on $T_{train}$ obtains the best accuracy. To that effect, we formulate our channel pruning problem as

$(C')^* = \arg\max_{C'} \mathrm{acc}\big(N'(C', W'; T_{train}); T_{test}\big), \quad (1)$

where $W'$ denotes the weights of the pruned model trained/fine-tuned on $T_{train}$, and $\mathrm{acc}(\cdot)$ denotes the accuracy on $T_{test}$ for $N'$ with structure $C'$. As seen, our implementation is one-stage, where the pruned weights are updated directly on the pruned model. This differs from [Liu et al., 2019a], where an extra large PruningNet has to be trained to predict weights for $N'$, resulting in high complexity of the channel pruning. However, the optimization of Eq. 1 is almost intractable: the number of potential combinations of the network structure $C'$ is extremely large, and exhaustively searching for Eq. 1 is infeasible. To solve this problem, we further propose to shrink the combinations in Sec. 3.2.

3.2 Combination Shrinkage

Given a network $N$, the number of pruned-structure combinations is $\prod_{i=1}^{L} c_i$, which is extremely large and computationally prohibitive for resource-limited devices. Hence, we further constrain Eq. 1 as:

$(C')^* = \arg\max_{C'} \mathrm{acc}\big(N'(C', W'; T_{train}); T_{test}\big), \quad \mathrm{s.t.} \;\; c'_i \in \{0.1c_i, 0.2c_i, \ldots, \alpha c_i\}, \; i = 1, \ldots, L, \quad (2)$

where $\alpha \in \{10\%, 20\%, \ldots, 100\%\}$ is a pre-given constant whose influence is analyzed in Sec. 4.5. This means that for the $i$-th layer, at most a fraction $\alpha$ of the channels of the pre-trained network $N$ are preserved in $N'$, i.e., $c'_i \le \alpha c_i$, and the value of $c'_i$ is limited to $\{0.1c_i, 0.2c_i, \ldots, \alpha c_i\}$². Note that $\alpha$ is shared across all layers, which greatly relieves the burden of human interference in the parameter analysis. Our motivations for introducing $\alpha$ are two-fold. First, a certain percentage of filters in each layer is typically preserved in channel pruning, and $\alpha$ serves as an upper bound on the preserved filters. Second, the introduced $\alpha$ significantly decreases the number of combinations to $(10\alpha)^L$ ($\alpha \le 1$), which makes solving Eq. 1 more feasible and efficient.

²When $\alpha c_i < 1$, we set $c'_i = 1$ to preserve one channel.
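To make the shrunk search space of Eq. 2 concrete, the following minimal Python sketch builds the per-layer candidate sets $\{0.1c_i, \ldots, \alpha c_i\}$, samples a random pruned structure from them, and compares the shrunk combination count $(10\alpha)^L$ with the original $\prod_i c_i$. It is illustrative only: the function names and the VGGNet-16 channel list are our own example and not taken from the released code.

```python
import random

def candidate_channels(c_i, alpha):
    """Candidate preserved-channel counts {0.1*c_i, ..., alpha*c_i} for one layer.

    Following footnote 2 of the paper, at least one channel is always kept.
    """
    steps = int(round(alpha * 10))          # alpha in {0.1, ..., 1.0} -> 1..10 candidates per layer
    return [max(1, round(k * 0.1 * c_i)) for k in range(1, steps + 1)]

def random_structure(full_structure, alpha):
    """Sample a pruned structure C' = (c'_1, ..., c'_L) satisfying the constraint of Eq. (2)."""
    return [random.choice(candidate_channels(c_i, alpha)) for c_i in full_structure]

if __name__ == "__main__":
    # Channel numbers of the 13 convolutional layers of VGGNet-16 (example input).
    vgg16_channels = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
    alpha = 0.8                              # ABCPruner-80%: keep at most 80% of channels per layer

    print(random_structure(vgg16_channels, alpha))

    # Shrunk search space (10*alpha)^L versus the original prod_i c_i (= 2^104 for VGGNet-16).
    shrunk = (10 * alpha) ** len(vgg16_channels)
    full = 1
    for c in vgg16_channels:
        full *= c
    print(f"shrunk combinations: {shrunk:.3g}, original combinations: {full:.3g}")
```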
Though the introduction of $\alpha$ significantly reduces the combinations of the pruned structure, it is still computationally costly to enumerate the remaining $(10\alpha)^L$ combinations for deep neural networks. One possible solution is to rely on empirical settings, which however requires manual interference. To solve this problem, we further automate the structure search in Sec. 3.3.

3.3 Automatic Structure Search

Instead of resorting to manual structure design or enumeration of all potential structures (channel numbers in each layer), we propose to automatically search for the optimal structure. In particular, we initialize a set of $n$ pruned structures $\{C'_j\}_{j=1}^{n}$, with the $i$-th element $c'_{ji}$ of $C'_j$ randomly sampled from $\{0.1c_i, 0.2c_i, \ldots, \alpha c_i\}$. Accordingly, we obtain a set of pruned models $\{N'_j\}_{j=1}^{n}$ and a set of pruned weights $\{W'_j\}_{j=1}^{n}$. Each pruned structure $C'_j$ represents a potential solution to the optimization problem of Eq. 2. Our goal is to progressively modify the structure set and finally pick the best pruned structure. To that effect, we show that our optimal structure search for channel pruning can be conducted automatically by integrating the off-the-shelf ABC algorithm [Karaboga, 2005]. We first outline our algorithm in Alg. 1; more details are elaborated below, which mainly consist of three steps.

Algorithm 1: ABCPruner
Input: Cycles: $T$; upper bound: $\alpha$; number of pruned structures: $n$; counters: $\{t_j\}_{j=1}^{n} = 0$; max trials: $M$.
Output: Optimal pruned structure $(C')^*$.
 1  Initialize the pruned structure set $\{C'_j\}_{j=1}^{n}$;
 2  for $h = 1 \to T$ do
 3      for $j = 1 \to n$ do                              // employed bees
 4          generate candidate $G'_j$ via Eq. 3;
 5          calculate the fitness of $C'_j$ and $G'_j$ via Eq. 4;
 6          if $fit(C'_j) < fit(G'_j)$ then
 7              $C'_j = G'_j$;
 8              $fit(C'_j) = fit(G'_j)$;
 9              $t_j = 0$;
10          else
11              $t_j = t_j + 1$;
12          end
13      end
14      for $j = 1 \to n$ do                              // onlooker bees
15          calculate the probability $P_j$ via Eq. 5;
16          generate a random number $\epsilon_j \in [0, 1]$;
17          if $\epsilon_j \le P_j$ then
18              generate candidate $G'_j$ via Eq. 3;
19              calculate the fitness of $G'_j$ via Eq. 4;
20              if $fit(C'_j) < fit(G'_j)$ then
21                  $C'_j = G'_j$;
22                  $fit(C'_j) = fit(G'_j)$;
23                  $t_j = 0$;
24              else
25                  $t_j = t_j + 1$;
26              end
27          end
28      end
29      for $j = 1 \to n$ do                              // scout bees
30          if $t_j > M$ then
31              re-initialize $C'_j$;
32          end
33      end
34  end
35  $(C')^* = \arg\max_{C'_j} \mathrm{acc}\big(N'_j(C'_j, W'_j; T_{train}); T_{test}\big)$.

Employed Bee (Lines 3–13). The employed bee generates a new candidate structure $G'_j$ for each pruned structure $C'_j$. The $i$-th element of $G'_j$ is defined as

$g'_{ji} = \big\lfloor c'_{ji} + r \cdot (c'_{ji} - c'_{gi}) \big\rceil, \quad (3)$

where $r$ is a random number within $[-1, +1]$, $g \ne j$ denotes the $g$-th pruned structure, and $\lfloor \cdot \rceil$ returns the value in $\{0.1c_i, 0.2c_i, \ldots, \alpha c_i\}$ closest to its input. Then, the employed bee decides whether the generated candidate $G'_j$ should replace $C'_j$ according to their fitness, which in our implementation is defined as

$fit(C'_j) = \mathrm{acc}\big(N'_j(C'_j, W'_j; T_{train}); T_{test}\big). \quad (4)$

If the fitness of $G'_j$ is better than that of $C'_j$, $C'_j$ is updated as $C'_j = G'_j$; otherwise, $C'_j$ is kept unchanged.

Onlooker Bee (Lines 14–28). The onlooker bee further updates $\{C'_j\}_{j=1}^{n}$ following the rule of Eq. 3. Differently, a pruned structure $C'_j$ is chosen with a probability related to its fitness, defined as

$P_j = 0.9 \cdot \frac{fit(C'_j)}{\max_g fit(C'_g)} + 0.1. \quad (5)$

Therefore, the better the fitness of $C'_j$, the higher the probability that $C'_j$ is selected, which then produces a new and better pruned structure. In this way, ABCPruner automatically and progressively reaches the optimal pruned structure.

Scout Bee (Lines 29–33). If a pruned structure $C'_j$ has not been updated for more than $M$ times, the scout bee re-initializes it to produce a new pruned structure.
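For illustration, below is a minimal, self-contained Python sketch of one search cycle of Alg. 1 covering the employed, onlooker, and scout bees. The fitness function is passed in as a stub, standing for the short training run described next and in Sec. 4.1; all identifiers are our own, and this is a sketch of the procedure rather than the authors' implementation.

```python
import random

def nearest_candidate(value, candidates):
    """The rounding of Eq. (3): snap to the closest value in {0.1*c_i, ..., alpha*c_i}."""
    return min(candidates, key=lambda c: abs(c - value))

def employed_update(structure_j, structure_g, candidate_sets):
    """Eq. (3): perturb structure j relative to another randomly chosen structure g."""
    new_structure = []
    for c_ji, c_gi, cands in zip(structure_j, structure_g, candidate_sets):
        r = random.uniform(-1.0, 1.0)
        new_structure.append(nearest_candidate(c_ji + r * (c_ji - c_gi), cands))
    return new_structure

def selection_probabilities(fitnesses):
    """Eq. (5): P_j = 0.9 * fit_j / max(fit) + 0.1."""
    best = max(fitnesses) or 1.0            # guard against an all-zero fitness set
    return [0.9 * f / best + 0.1 for f in fitnesses]

def abc_cycle(structures, fitnesses, trials, candidate_sets, fitness_fn, max_trials):
    """One cycle of employed, onlooker and scout phases; lists are updated in place."""
    n = len(structures)

    def try_replace(j):
        g = random.choice([k for k in range(n) if k != j])
        cand = employed_update(structures[j], structures[g], candidate_sets)
        f = fitness_fn(cand)
        if f > fitnesses[j]:                # greedy replacement on better fitness
            structures[j], fitnesses[j], trials[j] = cand, f, 0
        else:
            trials[j] += 1

    for j in range(n):                      # employed bees (Lines 3-13)
        try_replace(j)

    probs = selection_probabilities(fitnesses)
    for j in range(n):                      # onlooker bees (Lines 14-28)
        if random.random() <= probs[j]:
            try_replace(j)

    for j in range(n):                      # scout bees (Lines 29-33)
        if trials[j] > max_trials:
            structures[j] = [random.choice(c) for c in candidate_sets]
            fitnesses[j] = fitness_fn(structures[j])
            trials[j] = 0
```

A driver would call `abc_cycle` for $T$ cycles over a structure set initialized as in Sec. 3.3 and then return the structure with the best fitness, mirroring line 35 of Alg. 1.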
However, calculating the fitness in Eq. 4 requires training $N'_j$ on $T_{train}$, which is time-consuming and infeasible, especially when $T_{train}$ is large-scale. To solve this problem, as shown in Fig. 1, given a potential combination $C'$, we first randomly pick $c'_j$ filters from the pre-trained model, which then serve as the initialization for the $j$-th layer of the pruned model $N'$. Then, we train $N'$ for a small number of epochs to obtain its fitness. Lastly, we spend more epochs fine-tuning the pruned model with the best structure $(C')^*$, i.e., the best fitness. More details about the fine-tuning are discussed in Sec. 4.1.
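As a hedged sketch of how a candidate structure could be turned into a fitness value (randomly inheriting $c'_j$ filters from the pre-trained model, then training the smaller network for two epochs as in Sec. 4.1), consider the following PyTorch-style example. It assumes a plain chain of convolutional layers; the cross-layer channel wiring, the optimizer settings, and the data loaders are our illustrative assumptions, since the paper only specifies random filter selection and the short fitness training.

```python
import random
import torch
import torch.nn as nn

@torch.no_grad()
def inherit_filters(pretrained_convs, structure):
    """Randomly pick c'_j filters of each pre-trained layer to initialize the pruned layers.

    `pretrained_convs`: list of nn.Conv2d from the full model (assumed sequential chain).
    `structure`: candidate (c'_1, ..., c'_L). Input channels of layer j simply reuse the
    output indices kept in layer j-1 (an assumption; the paper does not spell this out).
    """
    pruned_convs, prev_idx = [], None
    for conv, kept in zip(pretrained_convs, structure):
        out_idx = sorted(random.sample(range(conv.out_channels), kept))
        in_idx = prev_idx if prev_idx is not None else list(range(conv.in_channels))
        new_conv = nn.Conv2d(len(in_idx), kept, conv.kernel_size,
                             stride=conv.stride, padding=conv.padding,
                             bias=conv.bias is not None)
        new_conv.weight.copy_(conv.weight[out_idx][:, in_idx])   # inherit selected filters
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[out_idx])
        pruned_convs.append(new_conv)
        prev_idx = out_idx
    return pruned_convs

def fitness(pruned_model, train_loader, test_loader, epochs=2):
    """Train the assembled pruned model for a couple of epochs and report test accuracy."""
    opt = torch.optim.SGD(pruned_model.parameters(), lr=0.01, momentum=0.9)  # illustrative hyper-parameters
    loss_fn = nn.CrossEntropyLoss()
    pruned_model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(pruned_model(x), y).backward()
            opt.step()
    pruned_model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (pruned_model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total
```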
4 Experiments

We conduct compression for representative networks, including VGGNet, GoogLeNet and ResNet-56/110 on CIFAR-10 [Krizhevsky et al., 2009], and ResNet-18/34/50/101/152 on ILSVRC-2012 [Russakovsky et al., 2015].

4.1 Implementation Details

Training Strategy. We use Stochastic Gradient Descent (SGD) for fine-tuning with momentum 0.9 and a batch size of 256. On CIFAR-10, the weight decay is set to 5e-3 and we fine-tune the network for 150 epochs with a learning rate of 0.01, which is divided by 10 every 50 training epochs. On ILSVRC-2012, the weight decay is set to 1e-4 and 90 epochs are given for fine-tuning. The learning rate is set to 0.1 and divided by 10 every 30 epochs.

Performance Metric. Channel number, FLOPs (floating-point operations), and parameters are used to measure the network compression. Besides, for CIFAR-10, the top-1 accuracy of the pruned models is provided. For ILSVRC-2012, both top-1 and top-5 accuracies are reported. For each structure, we train the pruned model $N'$ for two epochs to obtain its fitness. We empirically set $T$=2, $n$=3, and $M$=2 in Alg. 1.

4.2 Results on CIFAR-10

We conduct our experiments on CIFAR-10 with three classic deep networks, including VGGNet, GoogLeNet and ResNets. The results are reported in Tab. 1.

VGGNet. For VGGNet, the 16-layer (13-Conv + 3-FC) model is adopted for compression on CIFAR-10. We remove 61.20% of the channels, 73.68% of the FLOPs and 88.68% of the parameters while still keeping the accuracy at 93.08%, which is even slightly better than the baseline model. This greatly facilitates the deployment of VGGNet, a popular backbone for object detection and semantic segmentation, on mobile devices.

GoogLeNet. For GoogLeNet, as can be seen from Tab. 1, we remove 22.19% of the inefficient channels with only a 0.21% drop in accuracy. Besides, nearly 66.56% of the FLOPs are saved and 60.14% of the parameters are reduced. Therefore, ABCPruner can be well applied even to the compactly designed GoogLeNet.

ResNets. For ResNets, we choose two different depths, ResNet-56 and ResNet-110. As seen from Tab. 1, the pruning rates with regard to channels, FLOPs and parameters increase from ResNet-56 to ResNet-110. Specifically, the channel reduction rises from 27.07% to 33.28%, the FLOPs reduction rises from 54.13% to 65.05%, and the parameter reduction rises from 54.20% to 67.41%. To explain, ResNet-110 is deeper and more over-parameterized, and ABCPruner can automatically find the redundancies and remove them. Besides, it retains comparable accuracies against the baseline models, which verifies the efficacy of ABCPruner in compressing residual-designed networks.

| Model | Top-1 acc. | Δ | Channels (pruned) | FLOPs (pruned) | Parameters (pruned) |
|---|---|---|---|---|---|
| VGGNet-16 Base | 93.02% | 0.00% | 4224 (0.00%) | 314.59M (0.00%) | 14.73M (0.00%) |
| VGGNet-16 ABCPruner-80% | 93.08% | +0.06% | 1639 (61.20%) | 82.81M (73.68%) | 1.67M (88.68%) |
| GoogLeNet Base | 95.05% | 0.00% | 7904 (0.00%) | 1534.55M (0.00%) | 6.17M (0.00%) |
| GoogLeNet ABCPruner-30% | 94.84% | -0.21% | 6150 (22.19%) | 513.19M (66.56%) | 2.46M (60.14%) |
| ResNet-56 Base | 93.26% | 0.00% | 2032 (0.00%) | 127.62M (0.00%) | 0.85M (0.00%) |
| ResNet-56 ABCPruner-70% | 93.23% | -0.03% | 1482 (27.07%) | 58.54M (54.13%) | 0.39M (54.20%) |
| ResNet-110 Base | 93.50% | 0.00% | 4048 (0.00%) | 257.09M (0.00%) | 1.73M (0.00%) |
| ResNet-110 ABCPruner-60% | 93.58% | +0.08% | 2701 (33.28%) | 89.87M (65.04%) | 0.56M (67.41%) |

Table 1: Accuracy and pruning ratio on CIFAR-10. We count the pruned channels, parameters and FLOPs for VGGNet-16 [Simonyan and Zisserman, 2015], GoogLeNet [Szegedy et al., 2015] and ResNets with depths of 56 and 110 [He et al., 2016]. ABCPruner-α denotes that at most a fraction α of the channels is preserved in each layer.

| Model | Top-1 acc. | Δ | Top-5 acc. | Δ | Channels (pruned) | FLOPs (pruned) | Parameters (pruned) |
|---|---|---|---|---|---|---|---|
| ResNet-18 Base | 69.66% | 0.00% | 89.08% | 0.00% | 4800 (0.00%) | 1824.52M (0.00%) | 11.69M (0.00%) |
| ResNet-18 ABCPruner-70% | 67.28% | -2.38% | 87.67% | -1.41% | 3894 (18.88%) | 1005.71M (44.88%) | 6.60M (43.55%) |
| ResNet-18 ABCPruner-100% | 67.80% | -1.86% | 88.00% | -1.08% | 4220 (12.08%) | 968.13M (46.94%) | 9.50M (18.72%) |
| ResNet-34 Base | 73.28% | 0.00% | 91.45% | 0.00% | 8512 (0.00%) | 3679.23M (0.00%) | 21.90M (0.00%) |
| ResNet-34 ABCPruner-50% | 70.45% | -2.83% | 89.69% | -1.76% | 6376 (25.09%) | 1509.76M (58.97%) | 10.52M (51.76%) |
| ResNet-34 ABCPruner-90% | 70.98% | -2.30% | 90.05% | -1.40% | 6655 (21.82%) | 2170.77M (41.00%) | 10.12M (53.58%) |
| ResNet-50 Base | 76.01% | 0.00% | 92.96% | 0.00% | 26560 (0.00%) | 4135.70M (0.00%) | 25.56M (0.00%) |
| ResNet-50 ABCPruner-70% | 73.52% | -2.49% | 91.51% | -1.45% | 22348 (15.86%) | 1794.45M (56.61%) | 11.24M (56.01%) |
| ResNet-50 ABCPruner-80% | 73.86% | -2.15% | 91.69% | -1.27% | 22518 (15.22%) | 1890.60M (54.29%) | 11.75M (54.02%) |
| ResNet-101 Base | 77.38% | 0.00% | 93.59% | 0.00% | 52672 (0.00%) | 7868.40M (0.00%) | 44.55M (0.00%) |
| ResNet-101 ABCPruner-50% | 74.76% | -2.62% | 92.08% | -1.51% | 41316 (21.56%) | 1975.61M (74.89%) | 12.94M (70.94%) |
| ResNet-101 ABCPruner-80% | 75.82% | -1.56% | 92.74% | -0.85% | 43168 (17.19%) | 3164.91M (59.78%) | 17.72M (60.21%) |
| ResNet-152 Base | 78.31% | 0.00% | 93.99% | 0.00% | 75712 (0.00%) | 11605.91M (0.00%) | 60.19M (0.00%) |
| ResNet-152 ABCPruner-50% | 76.00% | -2.31% | 92.90% | -1.09% | 58750 (22.40%) | 2719.47M (76.57%) | 15.62M (74.06%) |
| ResNet-152 ABCPruner-70% | 77.12% | -1.19% | 93.48% | -0.51% | 62368 (17.62%) | 4309.52M (62.87%) | 24.07M (60.01%) |

Table 2: Accuracy and pruning ratio on ILSVRC-2012. We count the pruned channels, parameters and FLOPs for ResNets with depths of 18, 34, 50, 101 and 152 [He et al., 2016]. ABCPruner-α denotes that at most a fraction α of the channels is preserved in each layer.

4.3 Results on ILSVRC-2012

We further evaluate our method on the large-scale ILSVRC-2012 for ResNets with different depths, including 18/34/50/101/152.
We present two different pruning rates for each network, and the results are reported in Tab. 2. From Tab. 2, we have two observations. The first is that the accuracy drops on ILSVRC-2012 are larger than those on CIFAR-10. The explanations are two-fold: On the one hand, ResNet itself is a compactly designed network, so there may be fewer redundant parameters. On the other hand, ILSVRC-2012 is a large-scale dataset containing 1,000 categories, which is much more complex than the small-scale CIFAR-10 with only 10 categories. The second observation is that ABCPruner obtains higher pruning rates and smaller accuracy drops as the depth of the network increases. To explain, compared with shallow ResNets, e.g., ResNet-18/34, the deeper ResNets, e.g., ResNet-50/101/152, contain relatively more redundancies, which means more redundant parameters are automatically found and removed by ABCPruner.

In-depth Analysis. Combining Tab. 1 and Tab. 2, we can see that ABCPruner compresses popular CNNs well while keeping a better or at least comparable accuracy against the full model. Its success lies in the automatic search for the optimal pruned structure. For deeper analysis, we display the layer-wise pruning results of ABCPruner-80% for VGGNet-16 in Fig. 2, where ABCPruner-80% denotes that at most 80% of the channels are preserved in each layer. As can be seen, over 20% of the channels are removed in most layers, and the pruning rate differs across layers. Hence, ABCPruner can automatically obtain an optimal pruned structure, which in turn yields good performance.

Figure 2: The pruned percentage of each layer for VGGNet on CIFAR-10 when α = 80%.

4.4 Comparison with Other Methods

In Tab. 3, we compare ABCPruner with traditional methods, including [Luo et al., 2017; He et al., 2017; Huang and Wang, 2018; Lin et al., 2019], and the automatic method [Liu et al., 2019a]³.

| Model | FLOPs | Top-1 acc. | Baseline acc. | Epochs |
|---|---|---|---|---|
| ThiNet-30 | 1.10G | 68.42% | 76.01% | 244 (196 + 48) |
| ABCPruner-30% | 0.94G | 70.29% | 76.01% | 102 (12 + 90) |
| SSS-26 | 2.33G | 71.82% | 76.01% | 100 |
| GAL-0.5 | 2.33G | 71.95% | 76.01% | 150 (90 + 60) |
| GAL-0.5-joint | 1.84G | 71.80% | 76.01% | 150 (90 + 60) |
| ThiNet-50 | 1.71G | 71.01% | 76.01% | 244 (196 + 48) |
| ABCPruner-50% | 1.30G | 72.58% | 76.01% | 102 (12 + 90) |
| SSS-32 | 2.82G | 74.18% | 76.01% | 100 |
| CP | 2.73G | 72.30% | 76.01% | 206 (196 + 10) |
| ABCPruner-100% | 2.56G | 74.84% | 76.01% | 102 (12 + 90) |
| MetaPruning-0.50 | 1.03G | 69.92% | 76.01% | 160 (32 + 128) |
| ABCPruner-30% | 0.94G | 70.29% | 76.01% | 102 (12 + 90) |
| MetaPruning-0.75 | 2.26G | 72.17% | 76.01% | 160 (32 + 128) |
| ABCPruner-50% | 1.30G | 72.58% | 76.01% | 102 (12 + 90) |
| MetaPruning-0.85 | 2.92G | 74.49% | 76.01% | 160 (32 + 128) |
| ABCPruner-100% | 2.56G | 74.84% | 76.01% | 102 (12 + 90) |

Table 3: The FLOPs reduction, top-1 accuracy and training efficiency of ABCPruner and SOTAs, including ThiNet [Luo et al., 2017], CP [He et al., 2017], SSS [Huang and Wang, 2018], GAL [Lin et al., 2019], and MetaPruning [Liu et al., 2019a], for ResNet-50 [He et al., 2016] on ILSVRC-2012.

Effectiveness. The results in Tab. 3 show that ABCPruner obtains better FLOPs reduction and accuracy. Compared with importance-based pruning [Luo et al., 2017; He et al., 2017] and regularization-based pruning [Huang and Wang, 2018; Lin et al., 2019], ABCPruner also has the advantages of end-to-end fine-tuning and automatic structure search. It effectively verifies the finding of [Liu et al., 2019b] that the optimal pruned structure is more important in channel pruning than selecting important channels.
In comparison with the automatic MetaPruning [Liu et al., 2019a], the superiority of ABCPruner can be attributed to its one-stage design, where the fitness of a pruned structure is measured directly on the target network without any additionally introduced network as in [Liu et al., 2019a]. Hence, ABCPruner is more advantageous in finding the optimal pruned structure.

Efficiency. We also display the overall training epochs of all methods in Tab. 3. As seen, ABCPruner requires fewer training epochs (102) than the others: 12 epochs are used to search for the optimal pruned structure and 90 epochs are adopted for fine-tuning the optimal pruned network. We note that the importance-based pruning methods [Luo et al., 2017; He et al., 2017] are extremely inefficient since they require layer-wise pruning or optimization (196 epochs), plus 48 epochs [Luo et al., 2017] and 10 epochs [He et al., 2017] to fine-tune the network. [Huang and Wang, 2018] consumes similar training time to ABCPruner, but suffers a larger accuracy drop and less FLOPs reduction. [Lin et al., 2019] costs 90 epochs for retraining the network and an additional 60 epochs for fine-tuning. Besides, ABCPruner also shows better efficiency than its automatic counterpart [Liu et al., 2019a], which is a two-stage method where 32 epochs are first adopted to train the large PruningNet and 128 epochs are used for fine-tuning. Hence, ABCPruner is more efficient than the SOTAs.

³We have carefully fixed the codes provided by the authors and re-run the experiments using our baseline of ResNet-50.

Figure 3: The influence of α for ResNet-56 on CIFAR-10. Generally, a larger α results in better accuracy but a lower pruning rate.

4.5 The Influence of α

In this section, we take ResNet-56 on CIFAR-10 as an example and analyze the influence of the introduced constant α, which represents the upper bound of the preserved channel percentage in each layer. It is intuitive that a larger α leads to smaller reductions of channels, parameters and FLOPs, but better accuracy. The experimental results in Fig. 3 confirm this assumption. To balance the accuracy and the model complexity reduction, in this paper we set α = 70%, as shown in Tab. 1.

5 Conclusion

In this paper, we introduce a novel channel pruning method, termed ABCPruner, which finds the optimal pruned structure, i.e., the channel number in each layer, in an automatic manner. We first propose to shrink the combinations of pruned structures, leading to an efficient search for the optimal pruned structure. Then, the artificial bee colony (ABC) algorithm is integrated to perform the optimal pruned structure search automatically. Extensive experiments on popular CNNs have demonstrated the efficacy of ABCPruner over traditional channel pruning methods and its automatic counterpart.

Acknowledgements

This work is supported by the Nature Science Foundation of China (No.U1705262, No.61772443, No.61572410, No.61802324 and No.61702136), National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

[Aghasi et al., 2017] Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. Net-trim: Convex pruning of deep neural networks with performance guarantee. In NeurIPS, 2017.
[Chen et al., 2018] Changan Chen, Frederick Tung, Naveen Vedula, and Greg Mori. Constraint-aware deep neural network compression. In ECCV, 2018.
[Frankle and Carbin, 2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
[Guo et al., 2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NeurIPS, 2016.
[Han et al., 2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[He et al., 2018a] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, 2018.
[He et al., 2018b] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In CVPR, 2018.
[Hu et al., 2016] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[Huang and Wang, 2018] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In ECCV, 2018.
[Karaboga, 2005] Dervis Karaboga. An idea based on honey bee swarm for numerical optimization. Technical report, 2005.
[Krizhevsky et al., 2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
[Li et al., 2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
[Lin et al., 2017] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In NeurIPS, 2017.
[Lin et al., 2019] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In CVPR, 2019.
[Liu et al., 2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[Liu et al., 2019a] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. In ICCV, 2019.
[Liu et al., 2019b] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR, 2019.
[Luo and Wu, 2018] Jian-Hao Luo and Jianxin Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
[Luo et al., 2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Singh et al., 2019] Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P. Namboodiri. Play and prune: Adaptive filter pruning for deep model compression. In IJCAI, 2019.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[Wang et al., 2019a] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
[Wang et al., 2019b] Wenxiao Wang, Cong Fu, Jishun Guo, Deng Cai, and Xiaofei He. COP: Customized deep model compression via regularized correlation-based filter-level pruning. In IJCAI, 2019.
[Yang et al., 2018] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018.
[Zhang et al., 2015] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In CVPR, 2015.
[Zhu and Gupta, 2017] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. In ICLR, 2017.