# Dynamic Structure Pruning for Compressing CNNs

Jun-Hyung Park¹, Yeachan Kim², Junho Kim², Joon-Young Choi², Sang Keun Lee¹,²
¹Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
²Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea
{irish07, yeachan, monocrat, johnjames, yalphy}@korea.ac.kr

Structure pruning is an effective method to compress and accelerate neural networks. While filter and channel pruning are preferable to other structure pruning methods in terms of realistic acceleration and hardware compatibility, pruning methods with a finer granularity, such as intra-channel pruning, are expected to be capable of yielding more compact and computationally efficient networks. Typical intra-channel pruning methods utilize a static and hand-crafted pruning granularity due to a large search space, which leaves room for improvement in their pruning performance. In this work, we introduce a novel structure pruning method, termed dynamic structure pruning, to identify optimal pruning granularities for intra-channel pruning. In contrast to existing intra-channel pruning methods, the proposed method automatically optimizes dynamic pruning granularities in each layer while training deep neural networks. To achieve this, we propose a differentiable group learning method designed to efficiently learn a pruning granularity based on gradient-based learning of filter groups. The experimental results show that dynamic structure pruning achieves state-of-the-art pruning performance and better realistic acceleration on a GPU compared with channel pruning. In particular, it reduces the FLOPs of ResNet50 by 71.85% without accuracy degradation on the ImageNet dataset. Our code is available at https://github.com/irishev/DSP.

## Introduction

Network pruning removes redundant weights or neurons from neural networks (Han et al. 2015; Li et al. 2017; He, Zhang, and Sun 2017). Since deep neural networks (DNNs) are typically over-parameterized (Denton et al. 2014; Ba and Caruana 2014), pruning techniques can reduce the network size without a significant accuracy loss by identifying and removing redundant weights. Hence, network pruning can address problems in the deployment of DNNs on low-end devices with limited space and computational capacity.

Most approaches in network pruning are classified into two categories: weight pruning (Han et al. 2015; Guo, Yao, and Chen 2016) and structure pruning (Li et al. 2017; He, Zhang, and Sun 2017; Mao et al. 2017). Weight pruning deletes individual weights in a network. However, because of the irregular structures generated by weight pruning, it is difficult to leverage the high efficiency of pruned models without using special hardware or libraries (Li et al. 2017). In contrast, structure pruning directly discards high-level structures of tensors. Among the structure pruning methods, pruning channels, filters, and layers is commonly more favorable because such pruning can easily achieve acceleration on general hardware and libraries (Li et al. 2017; He, Zhang, and Sun 2017; He et al. 2018a). Nevertheless, pruning with a finer granularity is expected to yield better efficiency, owing to the higher degree of freedom of connectivity (Mao et al. 2017).
Intra-channel pruning (Wen et al. 2016; Yu et al. 2017; Meng et al. 2020) focuses on finding a new pruning granularity that can provide realistic compression and acceleration benefits by decomposing a filter or channel into smaller components. For example, grouped kernel pruning (Zhong et al. 2022) has found a generally accelerable granularity by grouping filters and pruning channels within a group. Although these methods have empirically shown better pruning rates compared with channel and filter pruning, they have several limitations that can potentially degrade pruning results. First, they typically use a static and strictly constrained pruning granularity in a layer, such as a regular structure of adjacent kernels (Wen et al. 2016) or a channel within evenly grouped filters (Zhong et al. 2022). Such granularity limits the degree of freedom, which possibly degrades pruning performance. Second, they utilize hand-crafted heuristics in determining a pruning granularity. In prior methods, a pruning granularity is determined heuristically based on the number of filters (Zhong et al. 2022) or the shape of kernels (Meng et al. 2020). Hence, information on how a set of weights delivers important features for a given task may not be properly identified and utilized.

In this work, we introduce a novel intra-channel pruning method called dynamic structure pruning to address the abovementioned issues. Our method automatically learns dynamic pruning granularities in each layer based on the grouped kernel pruning scheme (Zhong et al. 2022). As shown in Figure 1, the pruning granularity changes dynamically depending on which and how many filters are grouped together. Hence, we aim to identify optimal pruning granularities by learning filter groups for effective pruning. Because identifying optimal filter groups is challenging due to their discrete nature and exponential search space, we propose a differentiable group learning method inspired by research on architecture search (Liu, Simonyan, and Yang 2019; Xie et al. 2019).

Figure 1: Comparison of pruning granularity among filter, channel, and our dynamic structure pruning.

We relax the search space to be continuous and approximate the gradient of filter groups with respect to the objective function of a training task. This facilitates efficient learning of filter groups during the gradient-based training of DNNs without time-consuming discrete explorations and evaluations. Consequently, dynamic structure pruning provides a higher degree of freedom on the structural sparsity than filter and channel pruning.

We validate the effectiveness of dynamic structure pruning through extensive experiments with diverse network architectures on the CIFAR-10 (Krizhevsky, Hinton et al. 2009) and ImageNet (Deng et al. 2009) datasets. The results reveal that the proposed method outperforms state-of-the-art structure pruning methods in terms of both pruned FLOPs and accuracy. Notably, it reduces the FLOPs of ResNet50 by 71.85% without accuracy degradation on ImageNet inference. Additionally, it exhibits a better accuracy-latency tradeoff on general hardware and libraries than the baselines.

To the best of our knowledge, this work is the first that explores dynamic pruning granularities for intra-channel pruning and provides deep insights. By automatically learning effective pruning granularities, dynamic structure pruning significantly improves pruning results.
In addition, it is readily applicable to convolutional and fully-connected layers and generates efficient structured networks, which ensures wide compatibility with modern hardware, libraries, and neural network architectures.

The main contributions of this work are summarized as follows:

- We introduce a novel intra-channel pruning method, called dynamic structure pruning, that automatically learns dynamic pruning granularities for effective pruning.
- We propose a differentiable group learning method that can efficiently optimize filter groups using gradient-based methods during training.
- We verify that dynamic structure pruning establishes new state-of-the-art pruning results while maintaining accuracy across diverse networks on several popular datasets.

## Related Work

### Weight Pruning

A large subset of modern pruning algorithms has attempted to prune the individual weights of networks. To induce weight-level sparsity in a network, each weight with a magnitude less than a certain threshold is removed (Han et al. 2015; Guo, Yao, and Chen 2016). These approaches inevitably generate unstructured models that require specialized hardware and libraries to realize the compression effect (He et al. 2018a).

### Structure Pruning

**Channel/filter pruning.** Pruning a network into a regular structure is particularly appealing as a hardware-friendly approach. Typical structure pruning methods (Li et al. 2017; He, Zhang, and Sun 2017; He et al. 2018a) prune channels and filters, because doing so reduces memory footprints and inference time on general hardware. Data-free methods (Li et al. 2017; He et al. 2018a; Lin et al. 2020b) mostly use the $\ell_p$-norm of weights to evaluate the importance of a structure. Data-dependent methods empirically evaluate importance based on gradients or activations by utilizing training data (Molchanov et al. 2019; Gao et al. 2019; Lin et al. 2020b). Several structure pruning methods adopt different criteria for evaluating the importance of structures, such as their distance from the geometric median (He et al. 2019). Another branch of structure pruning directly induces sparsity in networks during training by utilizing regularization techniques (Wen et al. 2016; Yang, Wen, and Li 2020; Wang et al. 2019; Zhuang et al. 2020; Wang et al. 2021a). Moreover, machine learning techniques can be applied to automatically search for compact networks (He et al. 2018b; Liu et al. 2019; Ye et al. 2020; Gao et al. 2021). These studies have focused predominantly on reconsidering the importance criteria for finding redundant filters and channels more accurately. In contrast, our method considers a finer granularity than a filter or channel to generate more compact and easily accelerable networks.

**Intra-channel pruning.** Another line of work exploits intra-channel-level sparsity to achieve better efficiency from a finer granularity. Wen et al. (2016) introduce group-wise pruning, which induces shape-level structured sparsity in neural networks. Mao et al. (2017) have further explored a wide range of pruning granularities and evaluated their effects on prediction accuracy and efficiency. Meng et al. (2020) have proposed stripe-wise pruning to learn the shape of filters. In most cases, intra-channel-level sparsity requires a specialized library to realize its efficiency, which limits its applicability. Grouped kernel pruning (Yu et al. 2017; Zhang et al. 2022; Zhong et al. 2022) has attracted attention due to its compatibility. These methods prune a constant number of channels within evenly grouped filters and consequently yield pruned networks that can be executed using a grouped convolution, which is widely supported by modern hardware and libraries.
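To make the compatibility argument concrete, the following minimal PyTorch sketch (ours, not code from any of the cited papers) shows how a layer whose filters are split into evenly sized groups, each keeping the same number of input channels, can be executed with one channel gather followed by a single grouped convolution; the class name `GroupPrunedConv` and its arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupPrunedConv(nn.Module):
    """Execute a group-channel-pruned layer as a single grouped convolution.

    kept_in_idx_per_group: one LongTensor of surviving input-channel indices
    per filter group; a plain grouped convolution requires every group to
    keep the same number of channels.
    """
    def __init__(self, kept_in_idx_per_group, out_per_group, kernel_size=3):
        super().__init__()
        counts = {len(idx) for idx in kept_in_idx_per_group}
        assert len(counts) == 1, "align groups to a common channel count"
        self.register_buffer("gather_idx", torch.cat(list(kept_in_idx_per_group)))
        n_groups = len(kept_in_idx_per_group)
        self.conv = nn.Conv2d(
            in_channels=self.gather_idx.numel(),
            out_channels=out_per_group * n_groups,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=n_groups)

    def forward(self, x):
        # Gather each group's surviving input channels (groups become
        # contiguous chunks), then run one grouped convolution.
        x = x.index_select(dim=1, index=self.gather_idx)
        return self.conv(x)

# Usage: two groups over a 16-channel input, each keeping 6 channels.
layer = GroupPrunedConv(
    [torch.tensor([0, 2, 3, 7, 9, 15]), torch.tensor([1, 2, 4, 5, 9, 11])],
    out_per_group=8)
out = layer(torch.randn(1, 16, 32, 32))  # -> [1, 16, 32, 32]
```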
Intra-channel pruning methods typically leverage a static and hand-crafted pruning granularity with strict constraints. In contrast, we explore a dynamic pruning granularity based on gradient-based learning to maximize efficiency with intra-channel-level sparsity.

**Other approaches.** Other structure pruning approaches combine structure pruning with different network compression methods such as matrix decomposition (Li et al. 2020a), factorization (Li et al. 2019a), and skip calculation (Tang et al. 2021), which are only distantly related to the proposed method.

## Dynamic Structure Pruning

In this section, we formulate dynamic structure pruning and propose a differentiable group learning method to optimize filter groups. Then, we illustrate our group channel pruning method. The entire process is described in Algorithm 1.

### Problem Formulation

We provide a formal description of dynamic structure pruning. Given the weight tensor of a convolutional layer, we aim to search for $N$ filter groups. Let $L$ be the number of convolutional layers; the weight tensor of the $i$-th convolutional layer can be represented by $\mathbf{W}^i \in \mathbb{R}^{C^i_{out} \times C^i_{in} \times K^i_h \times K^i_w}$, $i = 1, \ldots, L$, where $C^i_{out}$, $C^i_{in}$, $K^i_h$, and $K^i_w$ are the number of output channels (i.e., the number of filters), input channels, kernel height, and kernel width, respectively. A filter group of the layer is then defined as

$$\mathcal{G}^i_p = \{\mathbf{W}^i_k \mid \alpha^i_{k,p} = 1,\ k \in \{1, \ldots, C^i_{out}\}\}$$
$$\text{s.t.}\quad p \in \{1, \ldots, N\},\quad \bigcup_{q=1}^{N} \mathcal{G}^i_q = \mathcal{G}^i,\quad (\forall m, n \in \{1, \ldots, N\})\ m \neq n \Rightarrow \mathcal{G}^i_m \cap \mathcal{G}^i_n = \emptyset, \tag{1}$$

where $\alpha^i_{k,p} \in \{0, 1\}$ is a binary group parameter that indicates whether the $k$-th filter $\mathbf{W}^i_k$ belongs to the $p$-th group $\mathcal{G}^i_p$; $N$ is the pre-defined number of groups; and $\mathcal{G}^i = \{\mathbf{W}^i_1, \ldots, \mathbf{W}^i_{C^i_{out}}\}$ is the set of all filters of the $i$-th layer. Following our definition, every filter belongs to one and only one group. Subsequently, we separately prune the input channels of each filter group, termed group channels; essentially, this implies that filter groups determine pruning granularities. Therefore, we optimize the set of all group parameters $\alpha$, which determines filter groups, along with the set of all weight tensors $\mathbf{W}$ to minimize the training loss. Let $\mathcal{L}(\cdot)$ and $\delta(\cdot)$ denote the training loss function and the pruning operator, respectively. We solve a bilevel optimization problem with $\alpha$ as the upper-level variable and $\mathbf{W}$ as the lower-level variable:

$$\min_{\alpha} \; \mathcal{L}(\delta(\mathbf{W}^*(\alpha), \alpha)) \quad \text{s.t.} \quad \mathbf{W}^*(\alpha) = \arg\min_{\mathbf{W}} \mathcal{L}(\delta(\mathbf{W}, \alpha)). \tag{2}$$

This formulation is also found in the work on differentiable architecture search (Liu, Simonyan, and Yang 2019), which can be explained by the fact that filter groups can be considered a special type of architecture.

```
Algorithm 1: Dynamic Structure Pruning

// Differentiable group learning
while not converged do
    α ← α − ε₁ ∇_α L(δ(W(α), α))
    W ← W − ε₂ ∇_W L(δ(W(α), α))
end

// Group channel pruning
for i = 1 to L do
    αⁱ ← Discretize(αⁱ)
    for p = 1 to N do
        Iⁱ_p ← { ‖αⁱ_{:,p} Wⁱ_{:,1,:,:}‖²₂ , ... , ‖αⁱ_{:,p} Wⁱ_{:,Cⁱ_in,:,:}‖²₂ }
        Cⁱ_p ← FindRedundantChannel(Iⁱ_p, Wⁱ)
    end
    Wⁱ_prune ← PruneGroupChannel(Wⁱ, αⁱ, Cⁱ)
end

// Fine-tuning
while not converged do
    W_prune ← W_prune − ε ∇_{W_prune} L(W_prune)
end
return W_prune
```
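As a concrete illustration of the group-channel importance $\mathcal{I}^i_p$ used in Algorithm 1, the following PyTorch sketch (our own; the function name and tensor layout are assumptions, not the authors' released code) computes, for each group $p$ and input channel $m$, the squared $\ell_2$ norm of the group-masked weights $\alpha^i_{:,p} \mathbf{W}^i_{:,m,:,:}$.

```python
import torch

def group_channel_importance(weight: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Squared L2 importance of every (group, input-channel) pair.

    weight: [C_out, C_in, K_h, K_w] convolution weights of one layer.
    alpha:  [C_out, N] one-hot (or soft) filter-to-group assignments.
    Returns I with shape [N, C_in], where I[p, m] = || alpha[:, p] * W[:, m] ||_2^2.
    """
    # Per-filter, per-input-channel squared kernel norms: [C_out, C_in].
    sq_norm = weight.pow(2).sum(dim=(2, 3))
    # Scale each filter's contribution by its squared group membership and
    # sum over the filters of each group: [N, C_in].
    return torch.einsum("kp,km->pm", alpha.pow(2), sq_norm)

# Toy usage: 8 filters, 4 input channels, 3x3 kernels, 2 groups.
W = torch.randn(8, 4, 3, 3)
alpha = torch.zeros(8, 2)
alpha[:5, 0] = 1.0  # filters 0-4 form group 0
alpha[5:, 1] = 1.0  # filters 5-7 form group 1
importance = group_channel_importance(W, alpha)  # shape [2, 4]
```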
### Differentiable Group Learning

Next, we propose a novel differentiable method to jointly learn the group parameters $\alpha$ and weights $\mathbf{W}$. The optimization problem in Eq. 2 has non-differentiable factors, which prevent the application of standard gradient-based methods, such as back-propagation and stochastic gradient descent. Therefore, we introduce several approximation techniques to facilitate gradient calculation.

**Continuous relaxation of $\alpha$.** Because $\alpha$ is a discrete variable, the back-propagation algorithm cannot be applied to compute its gradients. Hence, we use the Gumbel-softmax reparameterization trick (Jang, Gu, and Poole 2017) to perform a continuous relaxation of $\alpha$. We generate $\alpha^i_{k,p} \in \alpha$ from learnable parameters $\pi^i_{k,1}, \ldots, \pi^i_{k,N}$ as follows:

$$\alpha^i_{k,p} = \frac{\exp\big((\log(\pi^i_{k,p}) + g^i_{k,p})/\tau\big)}{\sum_{j=1}^{N} \exp\big((\log(\pi^i_{k,j}) + g^i_{k,j})/\tau\big)}, \tag{3}$$

where $g^i_{k,1}, \ldots, g^i_{k,N}$ are i.i.d. samples drawn from the Gumbel distribution $\mathrm{Gumbel}(0, 1)$, and $\tau$ is a temperature parameter that controls sharpness. When we need discrete values after training, we calculate a one-hot vector $a^i_k$ with $a^i_{k,p^*} = 1$, where $p^* = \arg\max_{j=1,\ldots,N} \pi^i_{k,j}$.

**Regularization-based pruning.** Among various approaches to the pruning operator $\delta(\cdot)$, we use a regularization-based approach with a group-wise regularizer $R(\cdot)$ due to its differentiability and smoothness:

$$\min_{\alpha} \big(\mathcal{L}(\mathbf{W}^*(\alpha)) + \lambda R(\mathbf{W}^*(\alpha), \alpha)\big) \quad \text{s.t.} \quad \mathbf{W}^*(\alpha) = \arg\min_{\mathbf{W}} \big(\mathcal{L}(\mathbf{W}) + \lambda R(\mathbf{W}, \alpha)\big),$$
$$R(\mathbf{W}, \alpha) = \sum_{i \in I} \sum_{p=1}^{N} \sum_{m=1}^{C^i_{in}} |\alpha^i_{:,p}|^{0.5} \sqrt{\sum_{k=1}^{C^i_{out}} \|\alpha^i_{k,p} \mathbf{W}^i_{k,m}\|^2_2}, \tag{4}$$

where $\lambda$ is a regularization hyper-parameter and $I$ is the set of indices of layers in which the filters are grouped. This regularizer induces sparsity at the group channels of each $\mathbf{W}^i$. Note that we adaptively scale $\lambda$ for each layer.

**One-step unrolling.** In Eq. 4, calculating the gradient of $\alpha$ can be prohibitive due to the costly optimization of $\mathbf{W}^*(\alpha)$. We approximate $\mathbf{W}^*(\alpha)$ by using filters adapted with a single training step:

$$\mathbf{W}' = \mathbf{W} - \epsilon \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}) - \epsilon\lambda \nabla_{\mathbf{W}} R(\mathbf{W}, \alpha), \tag{5}$$

where $\epsilon$ is a learning rate. The one-step adapted filters $\mathbf{W}'$ can be obtained using group-wise regularization and the pre-calculated gradient of the filters, as shown in Figure 2. From Eqs. 4 and 5, we derive the gradient with respect to $\alpha$ by applying the chain rule:

$$\lambda \nabla_{\alpha} R(\mathbf{W}', \alpha) - \epsilon\lambda \nabla^2_{\alpha,\mathbf{W}} R(\mathbf{W}, \alpha)\, \nabla_{\mathbf{W}'}\big(\mathcal{L}(\mathbf{W}') + \lambda R(\mathbf{W}', \alpha)\big). \tag{6}$$

Note that we assume that $\alpha^i_{k,p}$ is independent of variables other than $\mathbf{W}^i_k$ for efficiency, because this assumption enables us to analytically calculate the second derivative from Eq. 4.

Figure 2: Illustration of obtaining one-step adapted filters $\mathbf{W}'$ for differentiable group learning.

### Group Channel Pruning

After optimizing the group parameters, we prune group channels. The majority of existing pruning methods (You et al. 2019; Molchanov et al. 2019) conduct iterative optimization procedures, which require numerous evaluations during prune-retrain cycles. These incur a considerable computational cost, particularly when pruning at an extreme sparsity level. Therefore, we introduce a simple yet effective one-shot pruning method. Given a set of weights, we adaptively prune each group with the maximum pruning rate that satisfies the following condition:

$$\frac{\|\mathcal{Q}^i_p\|^2_2}{\|\mathcal{G}^i_p\|^2_2} < \beta, \tag{7}$$

where $\mathcal{Q}^i_p$ denotes the weights connected to the pruned channels of the $p$-th group in the $i$-th layer, and $\beta$ is a pre-defined hyper-parameter. Intuitively, this constrains the relative amount of information to be removed from the group channels.
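To make the relaxation in Eq. 3 and the one-shot rule in Eq. 7 concrete, here is a small PyTorch sketch written by us under stated assumptions (module and function names such as `FilterGrouper` and `max_prunable_channels` are illustrative, not the authors' implementation). It samples soft group assignments with `torch.nn.functional.gumbel_softmax` and then selects, per group, the largest set of least-important input channels whose removed squared norm stays below a β fraction of the group's total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterGrouper(nn.Module):
    """Learnable soft assignment of C_out filters to N groups (cf. Eq. 3)."""

    def __init__(self, c_out: int, n_groups: int, tau: float = 0.5):
        super().__init__()
        # Unnormalized log-probabilities log(pi) for every filter.
        self.logits = nn.Parameter(torch.zeros(c_out, n_groups))
        self.tau = tau

    def forward(self, hard: bool = False) -> torch.Tensor:
        # Gumbel-softmax over groups for every filter: [C_out, N].
        # hard=True returns one-hot assignments (straight-through) for pruning.
        return F.gumbel_softmax(self.logits, tau=self.tau, hard=hard, dim=-1)

def max_prunable_channels(importance: torch.Tensor, beta: float) -> list:
    """Per-group channel selection in the spirit of Eq. 7.

    importance: [N, C_in], squared L2 norm of each group's weights per input
    channel. Prune the least important channels of each group while the
    removed squared norm stays below beta times the group's total.
    """
    pruned = []
    for imp in importance:                      # imp: [C_in] of one group
        order = torch.argsort(imp)              # least important first
        cumulative = torch.cumsum(imp[order], dim=0)
        mask = cumulative < beta * imp.sum()
        pruned.append(order[mask].tolist())     # channel indices to prune
    return pruned

# Usage: one conv layer, 16 filters in 2 learnable groups, beta = 0.05.
conv = nn.Conv2d(8, 16, 3, padding=1)
grouper = FilterGrouper(c_out=16, n_groups=2)
alpha = grouper(hard=True)                      # [16, 2] one-hot after training
imp = torch.einsum("kp,km->pm", alpha.pow(2), conv.weight.pow(2).sum(dim=(2, 3)))
to_prune = max_prunable_channels(imp, beta=0.05)
```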
## Experiments

In this section, we present experimental results that demonstrate the efficacy of dynamic structure pruning. We evaluate dynamic structure pruning on two image classification benchmarks, CIFAR-10 (Krizhevsky, Hinton et al. 2009) and ImageNet (Deng et al. 2009), using the prevalent ResNet (He et al. 2016) following the related work (He et al. 2018a). In addition, to further evaluate its applicability, we extend our experiments to VGG (Simonyan and Zisserman 2014) and MobileNetV2 (Sandler et al. 2018). We primarily compare our proposed method to state-of-the-art channel and filter pruning methods. In addition, we compare our method to state-of-the-art intra-channel pruning methods such as TMI-GKP (Zhong et al. 2022). We evaluate dynamic structure pruning by setting the number of groups to two, three, and four, denoted in the form DSP (g={number of groups}). Note that we omit the results for more than four groups because we have not observed any significant differences from those with four groups. We report the test/validation accuracy of pruned models (P. Acc.) for CIFAR-10/ImageNet, the accuracy difference between the original and pruned models (Δ Acc.), and the pruning rates of parameters (Params ↓) and FLOPs (FLOPs ↓). We report the average results of five runs on a single pre-trained model.

**Experimental Settings.** In the CIFAR-10 experiments, we manually pre-train the original networks using the standard pre-processing and training settings in He et al. (2016). In the ImageNet experiments, we use the pre-trained checkpoints provided by PyTorch¹ (Paszke et al. 2019). We search the hyperparameters for dynamic structure pruning based on empirical analysis, i.e., τ ∈ {0.125, 0.25, 0.5, 1}, λ ∈ {5e-4, 1e-3, 2e-3, 3e-3} for CIFAR-10, and λ ∈ {1e-4, 2e-4, 3e-4, 5e-4} for ImageNet. We use the Adam optimizer with a learning rate of 0.001 and momentum of (0.9, 0.999) to train group parameters. During differentiable group learning, we set the initial learning rate to 0.05 and train models for 120 and 60 epochs in the CIFAR-10 and ImageNet experiments, respectively. Then, pruned models are fine-tuned for 80 epochs with initial learning rates of 0.015 and 0.05 for five and three iterations in the CIFAR-10 and ImageNet experiments, respectively. We use cosine learning rate scheduling with weight decay of 1e-3 and 3e-5 for the CIFAR-10 and ImageNet experiments, respectively, to yield the best results fitted to our additional regularization.
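As a reference point, the training setup described above can be assembled roughly as in the sketch below (ours, with CIFAR-10-style values taken from the text; the SGD momentum for the weight optimizer and other unstated details are assumptions and may differ from the authors' released code).

```python
import torch

def build_group_learning_optimizers(weight_params, group_params, epochs=120):
    # Weights: SGD with initial lr 0.05, cosine annealing, and weight decay 1e-3
    # (CIFAR-10 values from the text). Momentum 0.9 is our assumption.
    weight_opt = torch.optim.SGD(
        weight_params, lr=0.05, momentum=0.9, weight_decay=1e-3)
    weight_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        weight_opt, T_max=epochs)
    # Group parameters: Adam with lr 0.001 and momentum (0.9, 0.999).
    group_opt = torch.optim.Adam(group_params, lr=0.001, betas=(0.9, 0.999))
    return weight_opt, weight_sched, group_opt
```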
Note that we further prune filters of the final convolutional layer in each residual block using β to maximize pruning rates. The experiments are implemented using PyTorch and conducted on a Linux machine with an Intel i9-10980XE CPU and 4 NVIDIA RTX A5000 GPUs.

¹https://github.com/pytorch/examples/tree/master/imagenet

### Experimental Results

**Results on CIFAR-10.** We evaluate the pruning performance on the CIFAR-10 dataset with ResNet20, ResNet56, and VGG16. The pruning results shown in Table 1 indicate that dynamic structure pruning generates more compact and computationally efficient models than baseline pruning methods. Without sacrificing accuracy, dynamic structure pruning achieves significantly higher pruning rates than the baseline methods for all the given networks in this experiment. In particular, dynamic structure pruning shows competitive accuracy while using about 75% of the FLOPs of a state-of-the-art pruning method (CHIP) with ResNet56. These results confirm that dynamic structure pruning can effectively learn dynamic pruning granularities and eliminate redundancy in a wide variety of networks. Moreover, dynamic structure pruning achieves greater improvements in pruning results compared with the baselines as the model complexity increases. In addition, we observe that dynamic structure pruning with four groups outperforms that with fewer groups, which shows that more groups tend to be beneficial in terms of accuracy at commensurate pruning rates.

| Network | Method | P. Acc. (%) | Δ Acc. (%) | Params ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|
| ResNet20 (91.90%) | TMI-GKP (Zhong et al. 2022) | 92.01 | -0.34 | 43.35 | 42.87 |
| | Rethinking (Ye et al. 2018) | 90.90 | -1.34 | 37.22 | 47.40 |
| | DHP (Li et al. 2020b) | 91.53 | -1.00 | 43.87 | 48.20 |
| | DSP (g=2) (ours) | 91.91 | +0.01 | 50.67 | 55.76 |
| | DSP (g=3) (ours) | 92.00 | +0.10 | 53.22 | 57.91 |
| | DSP (g=4) (ours) | 92.16 | +0.26 | 53.86 | 58.81 |
| ResNet56 (93.26%) | HRank (Lin et al. 2020a) | 93.52 | +0.26 | 16.80 | 29.30 |
| | NISP (Yu et al. 2018) | 93.01 | -0.03 | 42.40 | 35.50 |
| | GAL (Lin et al. 2019) | 93.38 | +0.22 | 11.80 | 37.60 |
| | TMI-GKP (Zhong et al. 2022) | 94.00 | +0.22 | 43.49 | 43.23 |
| | CHIP (Sui et al. 2021) | 94.16 | +0.90 | 42.80 | 47.40 |
| | KSE (Li et al. 2019b) | 93.23 | +0.20 | 45.27 | 48.00 |
| | Random (Li et al. 2022) | 93.48 | +0.22 | 44.92 | 48.97 |
| | DHP (Li et al. 2020b) | 93.58 | +0.63 | 41.58 | 49.04 |
| | TDP (Wang et al. 2021b) | 93.76 | +0.07 | 40.00 | 50.00 |
| | Hinge (Li et al. 2020a) | 93.69 | +0.74 | 48.73 | 50.00 |
| | DSP (g=2) (ours) | 93.92 | +0.66 | 57.33 | 58.56 |
| | DSP (g=3) (ours) | 94.08 | +0.82 | 58.06 | 59.89 |
| | DSP (g=4) (ours) | 94.19 | +0.93 | 59.50 | 60.56 |
| VGG16 (93.88%) | GAL (Lin et al. 2019) | 93.77 | -0.19 | 77.60 | 39.60 |
| | HRank (Lin et al. 2020a) | 93.43 | -0.53 | 82.90 | 53.50 |
| | CHIP (Sui et al. 2021) | 93.86 | -0.10 | 81.60 | 58.10 |
| | Hinge (Li et al. 2020a) | 93.59 | -0.43 | 19.95 | 60.93 |
| | DSP (g=2) (ours) | 93.88 | +0.00 | 74.51 | 75.58 |
| | DSP (g=3) (ours) | 93.91 | +0.03 | 76.65 | 77.80 |
| | DSP (g=4) (ours) | 93.91 | +0.03 | 76.93 | 80.51 |

Table 1: Pruning results on CIFAR-10. Our baseline accuracy is reported in parentheses next to the network name.

**Results on ImageNet.** To verify the scalability and real-world applicability of dynamic structure pruning, we perform further evaluation on the ILSVRC-2012 dataset (Deng et al. 2009) with ResNet18, ResNet50, and MobileNetV2. The results presented in Table 2 confirm that dynamic structure pruning can consistently generate more efficient and accurate models than the baselines on this large-scale dataset and these models. Similar to the results on the CIFAR-10 dataset, dynamic structure pruning reduces more parameters and FLOPs while maintaining the original accuracy, whereas most of the baseline methods show significant accuracy degradation. Particularly, dynamic structure pruning with ResNet50 achieves better accuracy while using about half of the FLOPs, compared with CHIP.
| Network | Method | Top-1 P. Acc. (%) | Top-1 Δ Acc. (%) | Top-5 P. Acc. (%) | Top-5 Δ Acc. (%) | Params ↓ (%) | FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|
| ResNet18 (69.76% / 89.08%) | PFP (Liebenwein et al. 2020) | 67.38 | -2.36 | 87.91 | -1.16 | 43.80 | 29.30 |
| | SCOP (Tang et al. 2020) | 68.62 | -1.14 | 88.45 | -0.63 | 43.50 | 45.00 |
| | DSP (g=2) (ours) | 69.28 | -0.48 | 88.59 | -0.49 | 54.14 | 60.43 |
| | DSP (g=3) (ours) | 69.30 | -0.46 | 88.61 | -0.47 | 54.87 | 61.21 |
| | DSP (g=4) (ours) | 69.38 | -0.38 | 88.77 | -0.31 | 54.99 | 61.33 |
| ResNet50 (76.13% / 92.86%) | TMI-GKP (Zhong et al. 2022) | 75.33 | -0.62 | - | - | 33.21 | 33.74 |
| | ThiNet (Luo, Wu, and Lin 2017) | 72.04 | -0.84 | 91.14 | -0.47 | 33.70 | 36.80 |
| | GAL (Lin et al. 2019) | 71.95 | -4.20 | 90.94 | -1.93 | 16.90 | 43.00 |
| | HRank (Lin et al. 2020a) | 74.98 | -1.17 | 92.33 | -0.54 | 36.60 | 43.70 |
| | SCOP (Tang et al. 2020) | 75.95 | -0.20 | 92.79 | -0.08 | 42.80 | 45.30 |
| | CHIP (Sui et al. 2021) | 76.15 | +0.00 | 92.91 | +0.04 | 44.20 | 48.70 |
| | TDP (Wang et al. 2021b) | 75.90 | -0.25 | - | - | 53.00 | 50.00 |
| | Random (Li et al. 2022) | 75.13 | -0.20 | 92.52 | -0.08 | 45.88 | 51.01 |
| | DCP (Zhuang et al. 2018) | 74.95 | -1.06 | 92.32 | -0.61 | 51.56 | 55.50 |
| | DSP (g=2) (ours) | 76.16 | +0.03 | 92.89 | +0.03 | 60.88 | 67.78 |
| | DSP (g=3) (ours) | 76.20 | +0.07 | 92.90 | +0.04 | 62.23 | 70.98 |
| | DSP (g=4) (ours) | 76.22 | +0.09 | 92.92 | +0.06 | 63.45 | 71.85 |
| MobileNetV2 (71.88% / 90.29%) | MP (Liu et al. 2019) | 71.20 | +0.00 | - | - | - | 27.67 |
| | Random (Li et al. 2022) | 70.90 | -0.30 | - | - | - | 29.13 |
| | AMC (He et al. 2018b) | 70.80 | -0.40 | - | - | - | 30.00 |
| | DSP (g=2) (ours) | 71.60 | -0.28 | 90.14 | -0.15 | 31.80 | 37.01 |
| | DSP (g=3) (ours) | 71.60 | -0.28 | 90.15 | -0.14 | 31.87 | 37.10 |
| | DSP (g=4) (ours) | 71.61 | -0.27 | 90.17 | -0.12 | 31.95 | 37.24 |

Table 2: Pruning results on ImageNet. Our baseline top-1 / top-5 accuracy is reported in parentheses next to the network name.

Figure 3: ImageNet top-1 accuracy vs. acceleration on a GPU.

**Realistic acceleration.** Because implementing dynamic structure pruning introduces sparse operations for gathering input channels, we consider the possibility of a gap between the theoretical acceleration (i.e., pruned FLOPs) and the realistic acceleration. To evaluate the realistic acceleration in general environments, we measure the realistic acceleration of dynamic structure pruning on ImageNet inference with a batch size of 128 using an NVIDIA RTX A5000 GPU, and compare it with that of a state-of-the-art channel pruning method (SCOP) that has reported realistic acceleration. As shown in Figure 3, dynamic structure pruning with two groups achieves a significant improvement in the accuracy-latency tradeoff compared with SCOP. This result indicates that the sparsity in dynamic structure pruning can be effectively exploited by general hardware and libraries. In addition, as can be seen in Figure 3, dynamic structure pruning with three groups typically shows a worse accuracy-latency tradeoff than with two groups despite better pruning performance. This is because a realistic speedup only arises when the reduced computation time outweighs the added time for gathering channels for additional groups. In this experiment, we find that two groups are generally the best option in terms of the accuracy-latency tradeoff, and more groups may be better at an extreme sparsity level.
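For reference, the latency side of this comparison can be measured with a simple CUDA-event timing loop like the sketch below (ours; the warm-up and iteration counts are assumptions, and only the batch size of 128 comes from the text).

```python
import torch

@torch.no_grad()
def gpu_latency_ms(model, batch_size=128, resolution=224, warmup=20, iters=100):
    """Average forward-pass time in milliseconds on the current GPU."""
    device = torch.device("cuda")
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):                 # warm up kernels, clocks, caches
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per batch
```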
**Pruned group structure.** To further understand the improved pruning performance, we inspect the grouping and pruning results produced by dynamic structure pruning. Figure 4 illustrates the structure of the second block in ResNet20 pruned by dynamic structure pruning. We find several patterns in the block structure that may be advantageous to pruning performance. First, each group comprises and utilizes different filters and input channels, which suggests that dynamic structure pruning can adaptively identify the number of filters and channels of each group for optimal performance. Second, dynamic structure pruning can adjust the number of groups by pruning redundant filter groups, as shown in Figure 4(c). These observations show that dynamic structure pruning indeed provides a higher degree of freedom to pruned structures than standard channel pruning, leading to improvements in pruning performance.

Figure 4: Pruned group structure in the ResNet20 block. Dynamic structure pruning allows input channels to be used repeatedly by multiple groups, while output channels are unique for each group.

**Empirical analysis of differentiable group learning.** Although we are currently unaware of theoretical guarantees that our differentiable group learning finds optimal groups, we have empirically observed that it yields acceptable groups that perform well over a wide range of pruning rates. We compare the groups learned by differentiable group learning with the best, worst, and average cases found by brute force. We evaluate the methods on the MNIST dataset (Deng 2012) with a toy network with two convolutional layers that output 8 channels, followed by batch normalization and ReLU activation layers. We set the initial learning rate, number of epochs, and λ to 0.1, 40, and 1e-3, respectively. We use SGD with a momentum factor of 0.9 and an exponential learning rate schedule with a decay factor of 0.9. Note that we report the validation accuracy of a pruned network with two groups without fine-tuning. We apply dynamic structure pruning to the second convolutional layer and learn the filter groups from scratch. The results are reported in Tables 3 and 4. Table 3 shows that differentiable group learning effectively finds group compositions that exhibit better pruned accuracy than the average case. Our method achieves an accuracy close to the best with low and moderate pruning rates (i.e., 0.25 and 0.5), while still achieving better accuracy than average with a high pruning rate (i.e., 0.75). Though differentiable group learning has not found the best case obtained via brute force in this experiment, it has found groups more compatible with various pruning rates. As shown in Table 4, the best case of the brute-force method exhibits groups overfitted to a specific pruning rate, leading to low accuracy at different pruning rates. In contrast, differentiable group learning achieves the second-best accuracy at all pruning rates in this experiment, which confirms its compatibility.

| Pruning rate | Best: Groups | Best: Acc. | Worst: Groups | Worst: Acc. | Average: Acc. | Ours: Groups | Ours: Acc. |
|---|---|---|---|---|---|---|---|
| 0.25 | [0,1,4,5,6] [2,3,7] | 97.03 | [0,2,4,7] [1,3,5,6] | 59.07 | 92.85 | [0,1,2,5,6,7] [3,4] | 96.05 |
| 0.5 | [0,1,2] [3,4,5,6,7] | 88.41 | [0,3,4,7] [1,2,5,6] | 25.74 | 66.33 | | 81.77 |
| 0.75 | [0,3,6,7] [1,2,4,5] | 71.28 | [0,3,4,5,7] [1,2,6] | 10.16 | 41.63 | | 45.12 |

Table 3: Comparison of filter groups learned by differentiable group learning with the best, worst, and average cases found by brute force. Underline denotes the second-best accuracy.

| Method | Groups | Acc. @ 0.25 | Acc. @ 0.5 | Acc. @ 0.75 |
|---|---|---|---|---|
| Brute force | [0,1,4,5,6] [2,3,7] | 97.03 | 66.35 | 43.78 |
| Brute force | [0,1,2] [3,4,5,6,7] | 95.29 | 88.41 | 19.97 |
| Brute force | [0,3,6,7] [1,2,4,5] | 95.29 | 69.37 | 71.28 |
| Ours | [0,1,2,5,6,7] [3,4] | 96.05 | 81.77 | 45.12 |

Table 4: Compatibility analysis of the filter groups in Table 3 (accuracy at each pruning rate).

## Conclusion

In this work, we have introduced dynamic structure pruning, which automatically learns dynamic pruning granularities for intra-channel pruning. The results of our empirical evaluations on popular network architectures and datasets demonstrate the efficacy of dynamic structure pruning in obtaining compact and efficient models. In particular, we have found that dynamic structure pruning consistently outperforms state-of-the-art structure pruning methods in terms of pruned FLOPs, while also exhibiting better accuracy. Moreover, our results reveal that dynamic structure pruning generates models that are better accelerated on general hardware and libraries than those produced by conventional channel pruning. We plan to investigate the efficacy of dynamic structure pruning on other DNN architectures, such as Transformers (Vaswani et al. 2017), in the future.

## Acknowledgements

This work was supported by the Basic Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2020R1A4A1018309), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2021R1A2C3010430), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)).
## References

Ba, J.; and Caruana, R. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, 2654–2662.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255.
Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6): 141–142.
Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 1269–1277.
Gao, S.; Huang, F.; Cai, W.; and Huang, H. 2021. Network Pruning via Performance Maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9270–9280.
Gao, X.; Zhao, Y.; Dudziak, Ł.; Mullins, R.; and Xu, C.-z. 2019. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations.
Guo, Y.; Yao, A.; and Chen, Y. 2016. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 1379–1387.
Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018a. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2234–2240.
He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.-J.; and Han, S. 2018b. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision, 784–800.
He, Y.; Liu, P.; Wang, Z.; Hu, Z.; and Yang, Y. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4340–4349.
He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1389–1397.
Jang, E.; Gu, S.; and Poole, B. 2017. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2017. Pruning filters for efficient convnets. In International Conference on Learning Representations.
Li, T.; Wu, B.; Yang, Y.; Fan, Y.; Zhang, Y.; and Liu, W. 2019a. Compressing convolutional neural networks via factorized convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3977–3986.
Li, Y.; Adamczewski, K.; Li, W.; Gu, S.; Timofte, R.; and Van Gool, L. 2022. Revisiting Random Channel Pruning for Neural Network Compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 191–201.
Li, Y.; Gu, S.; Mayer, C.; Gool, L. V.; and Timofte, R. 2020a. Group sparsity: The hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8018–8027.
Li, Y.; Gu, S.; Zhang, K.; Van Gool, L.; and Timofte, R. 2020b. DHP: Differentiable Meta Pruning via HyperNetworks. In Proceedings of the European Conference on Computer Vision, 608–624.
Li, Y.; Lin, S.; Zhang, B.; Liu, J.; Doermann, D.; Wu, Y.; Huang, F.; and Ji, R. 2019b. Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2800–2809.
Liebenwein, L.; Baykal, C.; Lang, H.; Feldman, D.; and Rus, D. 2020. Provable filter pruning for efficient neural networks. In International Conference on Learning Representations.
Lin, M.; Ji, R.; Wang, Y.; Zhang, Y.; Zhang, B.; Tian, Y.; and Shao, L. 2020a. HRank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1529–1538.
Lin, S.; Ji, R.; Yan, C.; Zhang, B.; Cao, L.; Ye, Q.; Huang, F.; and Doermann, D. 2019. Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2790–2799.
Lin, T.; Stich, S. U.; Barba, L.; Dmitriev, D.; and Jaggi, M. 2020b. Dynamic model pruning with feedback. In International Conference on Learning Representations.
Liu, H.; Simonyan, K.; and Yang, Y. 2019. DARTS: Differentiable architecture search. In International Conference on Learning Representations.
Liu, Z.; Mu, H.; Zhang, X.; Guo, Z.; Yang, X.; Cheng, K.-T.; and Sun, J. 2019. MetaPruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, 3296–3305.
Luo, J.-H.; Wu, J.; and Lin, W. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 5058–5066.
Mao, H.; Han, S.; Pool, J.; Li, W.; Liu, X.; Wang, Y.; and Dally, W. J. 2017. Exploring the granularity of sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 13–20.
Meng, F.; Cheng, H.; Li, K.; Luo, H.; Guo, X.; Lu, G.; and Sun, X. 2020. Pruning filter in filter. In Advances in Neural Information Processing Systems, volume 33, 17629–17640.
Molchanov, P.; Mallya, A.; Tyree, S.; Frosio, I.; and Kautz, J. 2019. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11264–11272.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sui, Y.; Yin, M.; Xie, Y.; Phan, H.; Aliari Zonouz, S.; and Yuan, B. 2021. CHIP: CHannel independence-based pruning for compact neural networks. In Advances in Neural Information Processing Systems, volume 34, 24604–24616.
Tang, Y.; Wang, Y.; Xu, Y.; Deng, Y.; Xu, C.; Tao, D.; and Xu, C. 2021. Manifold regularized dynamic network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5018–5028.
Tang, Y.; Wang, Y.; Xu, Y.; Tao, D.; Xu, C.; Xu, C.; and Xu, C. 2020. SCOP: Scientific control for reliable neural network pruning. In Advances in Neural Information Processing Systems, volume 33, 10936–10947.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 5998–6008.
Wang, H.; Qin, C.; Zhang, Y.; and Fu, Y. 2021a. Neural pruning via growing regularization. In International Conference on Learning Representations.
Wang, H.; Zhang, Q.; Wang, Y.; Yu, L.; and Hu, H. 2019. Structured pruning for efficient convnets via incremental regularization. In 2019 International Joint Conference on Neural Networks (IJCNN), 1–8.
Wang, W.; Chen, M.; Zhao, S.; Chen, L.; Hu, J.; Liu, H.; Cai, D.; He, X.; and Liu, W. 2021b. Accelerate CNNs from three dimensions: A comprehensive pruning framework. In International Conference on Machine Learning, 10717–10726.
Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2074–2082.
Xie, S.; Zheng, H.; Liu, C.; and Lin, L. 2019. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations.
Yang, H.; Wen, W.; and Li, H. 2020. DeepHoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures. In International Conference on Learning Representations.
Ye, J.; Lu, X.; Lin, Z.; and Wang, J. Z. 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations.
Ye, M.; Gong, C.; Nie, L.; Zhou, D.; Klivans, A.; and Liu, Q. 2020. Good subnetworks provably exist: Pruning via greedy forward selection. In International Conference on Machine Learning, 10820–10830.
You, Z.; Yan, K.; Ye, J.; Ma, M.; and Wang, P. 2019. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 32, 2133–2144.
Yu, N.; Qiu, S.; Hu, X.; and Li, J. 2017. Accelerating convolutional neural networks by group-wise 2D-filter pruning. In 2017 International Joint Conference on Neural Networks (IJCNN), 2502–2509.
Yu, R.; Li, A.; Chen, C.-F.; Lai, J.-H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.-Y.; and Davis, L. S. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9194–9203.
Zhang, G.; Xu, S.; Li, J.; and Guo, A. J. 2022. Group-based network pruning via nonlinear relationship between convolution filters. Applied Intelligence, 52(8): 9274–9288.
Zhong, S.; Zhang, G.; Huang, N.; and Xu, S. 2022. Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions. In International Conference on Learning Representations.
Zhuang, T.; Zhang, Z.; Huang, Y.; Zeng, X.; Shuang, K.; and Li, X. 2020. Neuron-level Structured Pruning using Polarization Regularizer. In Advances in Neural Information Processing Systems, volume 33, 9865–9877.
Zhuang, Z.; Tan, M.; Zhuang, B.; Liu, J.; Guo, Y.; Wu, Q.; Huang, J.; and Zhu, J. 2018. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, volume 31, 875–886.