# Prior Gradient Mask Guided Pruning-Aware Fine-Tuning

Linhang Cai¹,², Zhulin An¹*, Chuanguang Yang¹,², Yangchun Yan³, Yongjun Xu¹
¹Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
²University of Chinese Academy of Sciences, Beijing, China
³Horizon Robotics, Beijing, China
Email: {cailinhang19g, anzhulin, yangchuanguang, xyj}@ict.ac.cn, yangchun.yan@horizon.ai
*Corresponding author.

## Abstract

We propose a Prior Gradient Mask Guided Pruning-Aware Fine-Tuning (PGMPF) framework to accelerate deep Convolutional Neural Networks (CNNs). In detail, the proposed PGMPF selectively suppresses the gradients of unimportant parameters via a prior gradient mask generated by the pruning criterion during fine-tuning. PGMPF has three appealing characteristics over previous works: (1) Pruning-aware network fine-tuning. A typical pruning pipeline consists of training, pruning and fine-tuning, which are relatively independent, while PGMPF utilizes a variant of the pruning mask as a prior gradient mask to guide fine-tuning, without complicated pruning criteria. (2) An excellent trade-off between large model capacity during fine-tuning and stable convergence speed to obtain the final compact model. Previous works preserve more training information of pruned parameters during fine-tuning to pursue better performance, which incurs catastrophic non-convergence of the pruned model at relatively large pruning rates, while our PGMPF greatly stabilizes the fine-tuning phase by gradually constraining the learning rate of those unimportant parameters. (3) Channel-wise random dropout of the prior gradient mask, which imposes some gradient noise on fine-tuning to further improve the robustness of the final compact model. Experimental results on three image classification benchmarks, CIFAR-10/100 and ILSVRC-2012, demonstrate the effectiveness of our method for various CNN architectures, datasets and pruning rates. Notably, on ILSVRC-2012, PGMPF reduces 53.5% of the FLOPs of ResNet-50 with only a 0.90% top-1 and 0.52% top-5 accuracy drop, advancing the state-of-the-art with negligible extra computational cost.

## Introduction

Despite the superior performance of deep Convolutional Neural Networks on various tasks, e.g., image classification (He et al. 2016; Xu et al. 2021), object detection (Bochkovskiy, Wang, and Liao 2020), image retrieval (Hu et al. 2020) and semantic segmentation (He et al. 2017), deploying CNNs on resource-limited mobile devices poses great challenges. Network pruning is a powerful method to compress a model with little performance loss; based on granularity, it can be divided into two categories: weight pruning and filter pruning (Zhu and Gupta 2018; Liu et al. 2019c; Frankle and Carbin 2019). Weight pruning methods remove unimportant connections or weights in the network, inducing unstructured sparsity in filters and thus requiring specialized libraries for real acceleration. In contrast, filter pruning structurally removes unimportant filters, compressing both the model size and the computational burden. Hence we focus on filter pruning.
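To make the granularity distinction concrete, the following is a small PyTorch sketch (our illustration, not from the paper): unstructured weight pruning only zeroes individual entries and keeps the dense tensor shape, whereas filter pruning physically removes whole filters; the tensor shape and the 50% rate are arbitrary choices.

```python
import torch

w = torch.randn(64, 32, 3, 3)  # a hypothetical Conv2d weight: (out_channels, in_channels, k, k)

# Weight pruning: zero the 50% smallest-magnitude weights. The tensor keeps its
# shape, so real speedups require specialized sparse libraries.
thresh = w.abs().flatten().kthvalue(w.numel() // 2).values
unstructured = torch.where(w.abs() > thresh, w, torch.zeros_like(w))
print(unstructured.shape)  # torch.Size([64, 32, 3, 3]) -- shape unchanged

# Filter pruning: drop the 32 filters with the smallest l2-norm. The tensor
# (and the layer's output) shrinks, reducing both model size and FLOPs.
keep = w.flatten(1).norm(p=2, dim=1).argsort(descending=True)[:32]
structured = w[keep]
print(structured.shape)    # torch.Size([32, 32, 3, 3]) -- half the filters remain
```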
A three-step filter pruning pipeline consists of: training a network, evaluating the importance of every filter to generate a pruning mask that masks out unimportant filters, and then fine-tuning the pruned network to compensate for the performance degradation. The pruning and fine-tuning phases can be applied iteratively to compress the model greedily. These phases are relatively independent, as the pruning operation is non-differentiable, while our Prior Gradient Mask Guided Pruning-Aware Fine-Tuning (PGMPF) utilizes a modified version of the pruning mask generated by the pruning stage as a prior gradient mask to guide fine-tuning, as shown in Figure 1.

Figure 1: Comparison of three kinds of pruning pipelines: (a) Hard Filter Pruning, (b) Asymptotic SofteR Filter Pruning, (c) our PGMPF. Our PGMPF devises a prior gradient mask, generated from the Boolean pruning mask at the pruning stage, to guide the next fine-tuning stage, making the fine-tuning stage pruning-aware. The weight decay mask, proposed in Asymptotic SofteR Filter Pruning (ASRFP), is a variant of the Boolean pruning mask that softens the pruning operation to retain training information inside the pruned filters.

Previous Soft Filter Pruning (SFP) based methods, e.g., Asymptotic Soft Filter Pruning (ASFP) and Asymptotic SofteR Filter Pruning (ASRFP) (He et al. 2018, 2019a; Cai et al. 2021b), also allow pruned filters to update their parameters in order to maintain a large model capacity during fine-tuning and pursue better performance, as shown in Figure 1(b), where the weight decay mask is a modified version of the Boolean pruning mask that smoothly softens the pruning operation so as to retain more training information inside the filters chosen to be pruned. However, these methods suffer from catastrophic non-convergence of the pruned model at relatively large pruning rates: the test accuracy drop before and after pruning becomes very large, where a drop of 0 would denote that pruning incurs no evident accuracy loss. Note that soft pruning based methods allow all filters to update their parameters without constraint during fine-tuning, ignoring the uneven importance of filters.

Unlike Hard Filter Pruning (HFP), which disables the update of pruned filters and thus gradually reduces the model capacity, or SFP based methods, which encounter catastrophic non-convergence of the pruned model, our PGMPF allows the update of pruned filters via a prior gradient mask generated by the pruning criterion, striking an excellent trade-off between large model capacity during fine-tuning and stable convergence speed to obtain the final compact model.
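As a concrete illustration of the mask-generation stage of this pipeline, here is a minimal PyTorch-style sketch (ours, not the authors' code) that ranks filters layer-wise by their ℓ2-norm, the simple criterion PGMPF adopts later in the paper, and turns the ranking into a Boolean pruning mask; the toy layer and pruning rate are placeholders.

```python
import torch
import torch.nn as nn

def l2_pruning_mask(conv: nn.Conv2d, prune_rate: float) -> torch.Tensor:
    """Per-filter Boolean mask: 1.0 = keep, 0.0 = prune (layer-wise l2-norm ranking)."""
    scores = conv.weight.detach().flatten(1).norm(p=2, dim=1)  # one score per output filter
    n_prune = int(prune_rate * scores.numel())
    mask = torch.ones_like(scores)
    if n_prune > 0:
        mask[scores.argsort()[:n_prune]] = 0.0                 # mask out the smallest-norm filters
    return mask

# Toy usage: a 32-filter layer with a 50% pruning rate marks 16 filters as pruned.
conv = nn.Conv2d(16, 32, kernel_size=3)
mask = l2_pruning_mask(conv, prune_rate=0.5)
print(int(mask.sum()))  # 16 filters kept
```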
Our contributions are as follows:
1. We propose a novel Prior Gradient Mask Guided Pruning-Aware Fine-Tuning (PGMPF) method to compress and accelerate deep models, which provides state-of-the-art performance without complicated handcrafted or learnt pruning criteria.
2. Our PGMPF greatly stabilizes the fine-tuning phase by gradually constraining the learning rate of those unimportant parameters, achieving an excellent trade-off between large model capacity during fine-tuning and stable convergence speed to obtain the final compact model.
3. We propose channel-wise random dropout of the prior gradient mask, which imposes some gradient noise on fine-tuning to further improve the robustness of the final compact model.

## Related Works

Prevalent works on compressing and accelerating CNN models mainly consist of network pruning, knowledge distillation, model quantization, low-rank approximation and efficient network module design. Network pruning focuses on compressing the model without incurring obvious performance loss. Recently, much attention has been paid to filter pruning, since it is much friendlier to hardware and compresses both the model size and the computational cost. To date, diverse filter pruning methods have been proposed.

Pruning can further be categorized into static pruning and dynamic pruning. Static pruning removes unimportant filters statically, eventually obtaining a fixed, compact model invariant to different inputs. In contrast, given a particular input, dynamic pruning uses channel-wise or spatial-wise attention modules to adaptively predict the importance of each channel and skip the computation of unimportant channels and locations, or replace the computation with a low-precision version (Hua et al. 2019; Gao et al. 2019; Liu et al. 2020). Even though dynamic pruning surpasses static pruning by learning instance-level network activation paths, its drawbacks are that the model size is not compressed and the actual inference speed is hindered by the cost of re-indexing the dynamic network structure for each input (Chen et al. 2019; Liu et al. 2019a).

Pruning Criteria. Existing criteria for evaluating the importance of a filter include the ℓ1-norm, ℓ2-norm, weight similarity, feature redundancy, scaling factors in Batch Normalization layers, the rank of the feature map, cross-layer weight dependency and so on (Li et al. 2017; Liu et al. 2017; Ayinde and Zurada 2018; Wang et al. 2019; Lin et al. 2020). Some approaches compare the importance of each filter layer-wise, while others compare importance across the whole network. A disadvantage of global pruning is the difficulty of designing a global filter importance criterion, as the magnitudes of filters vary from layer to layer. Recently, the Channel Pruning via Multi-Criteria (CPMC) method takes three aspects into account, i.e., cross-layer filter dependency, the number of parameters and the FLOPs of each filter, and then normalizes these criteria to obtain a global multi-criteria importance measure (Yan et al. 2021). The Filter Pruning via Geometric Median (FPGM) approach prunes filters via the geometric median, arguing that the prevalent smaller-norm-less-important criterion requires both a large deviation of filter norms and near-zero norms for unimportant filters (He et al. 2019b). AutoPruner proposes to use a channel-wise attention module and a scaled sigmoid function to gradually scale each channel and find unimportant filters automatically during training (Luo and Wu 2020); however, this evidently increases training-time computational cost and requires heavy tuning of the scaled sigmoid parameters for each network and dataset.
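The layer-wise versus global distinction can be made concrete with a toy sketch (ours; it does not reproduce CPMC's actual normalization): when raw filter norms are compared across layers, the layer with the smallest magnitudes absorbs essentially the whole pruning budget.

```python
import torch

# Two layers whose filter norms live on very different scales.
norms_a = torch.linspace(1.0, 10.0, 8)    # layer A: large-magnitude filters
norms_b = torch.linspace(0.01, 0.10, 8)   # layer B: small-magnitude filters

# Layer-wise pruning: each layer is ranked on its own, so scales never interfere.
prune_a = norms_a.argsort()[:4]           # the 4 smallest filters of layer A
prune_b = norms_b.argsort()[:4]           # the 4 smallest filters of layer B

# Naive global pruning on raw norms: every pruned filter comes from layer B,
# which is why global criteria must first normalize per-layer statistics.
all_norms = torch.cat([norms_a, norms_b])
global_prune = all_norms.argsort()[:8]
print((global_prune >= 8).all())          # tensor(True): layer B absorbs the whole budget
```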
Inspired by Differentiable Architecture Search (DARTS) (Liu, Simonyan, and Yang 2019), Learning Filter Pruning Criteria (LFPC) proposes a Differentiable Criteria Sampler (DCS) to learn layer-wise importance criteria (He et al. 2020b), which is computationally expensive and time-consuming. MetaPruning adopts meta-learning and an evolutionary algorithm for automatic channel pruning, whose training cost is very high (Liu et al. 2019b). Likewise, EagleEye also relies on an evolutionary algorithm, together with adaptive batch normalization, to search for an optimal structure (Li et al. 2020). In short, how to design a pruning criterion is still an open issue. In contrast, our proposed PGMPF does not rely on complicated handcrafted or learnt pruning criteria. For simplicity, we adopt the simple ℓ2-norm criterion. We utilize a modified version of the pruning mask generated by the pruning stage as a prior gradient mask to guide fine-tuning. Unlike conventional HFP based methods, which disable the update of pruned filters and gradually reduce the model capacity, our PGMPF allows the update of pruned filters via a prior gradient mask generated by the pruning criterion, balancing well between a large search space during fine-tuning and stable convergence speed to obtain the pruned model. Gradually Hard Filter Pruning (GHFP) (Cai et al. 2021a) alleviates the issue of catastrophic non-convergence of the pruned model via a monotonically increasing parameter that controls the proportion of soft and hard pruning, balancing performance against convergence speed. While GHFP still suffers at relatively large pruning rates, our PGMPF greatly stabilizes the fine-tuning phase by gradually constraining the learning rate of those unimportant parameters. Moreover, our PGMPF is fully pruning-aware, meaning that the pruning phase intimately affects the fine-tuning phase via our prior gradient mask, whereas in most previous methods pruning and fine-tuning are relatively independent, as shown in Figure 2.

Low-rank approximation of convolutional filters reduces model size and computation by decomposing large matrices into small ones, but achieves relatively small speedups on small convolutional kernels (Jaderberg, Vedaldi, and Zisserman 2014; Alvarez and Salzmann 2017). Model quantization quantizes weights and activations into fewer bits to reduce model size and computational budgets (Hubara et al. 2016; Han et al. 2020). Efficient network module design aims at designing more lightweight modules, e.g., MobileNet, CondConv, ACNet and HCGNet (Howard et al. 2017; Yang et al. 2019; Ding et al. 2019; Yang et al. 2020). Knowledge distillation (KD) methods define various kinds of knowledge, e.g., the activation or attention map (Hinton, Vinyals, and Dean 2015; Yuan et al. 2020; Yang, An, and Xu 2021), and then transfer the knowledge from a large teacher model to a small student model, which can be regarded as a kind of instance-level label smoothing. Recently, self-supervised learning (SSL) (Chen et al. 2020; He et al. 2020a; Yang et al. 2021) has been used as a form of knowledge to improve performance (Yin et al. 2020), introducing auxiliary tasks, e.g., rotation or jigsaw prediction, to push the model to learn more generalized or task-specific representations. These approaches can be combined with PGMPF for further improvement.
## Methods

### Formulation

For a network with $L$ convolutional layers, the weight of the $l$-th convolutional layer is denoted by $W_l \in \mathbb{R}^{n \times m \times s \times s}$, where $1 \le l \le L$. Here $s$ denotes the kernel size, and $m$ and $n$ are the numbers of input and output channels, respectively. We denote $I_l$ and $O_l$ as the input and output feature maps of the $l$-th layer. The input tensor $I_l$ and the output tensor $O_l = W_l * I_l$ have shapes $m \times h_l \times w_l$ and $n \times h_{l+1} \times w_{l+1}$ respectively, represented as

$$O_{l,j} = W_{l,j} * I_l \quad \text{for } 1 \le j \le n, \tag{1}$$

where $O_{l,j} \in \mathbb{R}^{h_{l+1} \times w_{l+1}}$ and $W_{l,j} \in \mathbb{R}^{m \times s \times s}$ denote the $j$-th output channel and the $j$-th filter of the $l$-th layer, respectively. If the filter pruning rate for the $l$-th layer is $P_l$, then $n \cdot P_l$ filters in the $l$-th layer are removed. After pruning, the pruned output tensor $\hat{O}_l$ has shape $n(1 - P_l) \times h_{l+1} \times w_{l+1}$.

Pruning Mask. During pruning, given the weight tensor $W_l$ and the pruning rate $P_l$, we adopt a simple $\ell_2$-norm filter importance criterion to generate a Boolean pruning mask $M_{l,j}$. Specifically, $M_{l,j} = 0$ if $W_{l,j}$ is pruned; otherwise, $M_{l,j} = 1$ means that the filter $W_{l,j}$ is not pruned. Following ASRFP, the pruned weights of the $l$-th layer are gradually zeroized, given by

$$\hat{W}_{l,j} = W_{l,j} \odot M_{l,j} + \alpha\, W_{l,j} \odot (1 - M_{l,j}) \quad \text{for } 1 \le j \le n, \tag{2}$$

where $\odot$ denotes element-wise multiplication and $\alpha$ is a monotonically decreasing parameter that controls the decay speed of pruned filters so as to better utilize their trained information. ASRFP exponentially decays $\alpha$ from 1 towards 0 as the pruning and fine-tuning procedure goes on.

### Prior Gradient Mask Guided Pruning-Aware Fine-Tuning

In Figure 2, the prior gradient mask categorizes filters into important ones and unimportant ones, based on the $\ell_2$-norm of each filter. The closer a filter gets to the center of the concentric circles, the less important the filter is. The weight decay mask at the pruning stage pushes unimportant filters towards the center via the monotonically decreasing parameter $\alpha$. During fine-tuning, both unconstrained fine-tuning (UFT) and PGMPF compute the gradient as normal, and the intended direction of the gradient update is denoted by a solid arrow. UFT simply moves each filter to the intended position, treating every filter as equally important during fine-tuning, which incurs catastrophic non-convergence of the pruned model at relatively large pruning rates. Our PGMPF is pruning-aware, gradually scaling down the learning rate of unimportant filters. The guidance of a prior gradient mask obtained in the last pruning stage lasts for one epoch; after each fine-tuning epoch, the roles of important and unimportant filters may change, and a new prior gradient mask is obtained.

After obtaining the Boolean pruning mask $M_{l,j}$, we define a modified asymptotic variant $\hat{M}_{l,j}$, named the prior gradient mask, given by

$$\hat{M}_{l,j} = M_{l,j} + \beta\,(1 - M_{l,j}) \quad \text{for } 1 \le j \le n, \tag{3}$$

where $\beta$ constrains the learning rate of the pruned parameters, decreasing from 1 to 0 according to

$$\beta(t) = \left(\frac{t_{\max} - 1 - t}{t_{\max} - 1}\right)^{3} \quad \text{for } 0 \le t < t_{\max}, \tag{4}$$

where $t_{\max}$ is the maximal number of pruning and fine-tuning epochs. Once we obtain the prior gradient mask $\hat{M} = \{\hat{M}_{l,j} \mid l \in [1, L],\, j \in [1, n]\}$, we adopt it to guide the next fine-tuning stage. Assume that $g_t$ is the normal gradient computed by regular backpropagation during fine-tuning in the $t$-th epoch. We impose our prior gradient mask to constrain the learning rate of those unimportant parameters determined by the last pruning stage, obtaining a modified gradient $\hat{g}_t$, given by

$$\hat{g}_t = g_t \odot \hat{M} = g_t \odot M + \beta\, g_t \odot (1 - M) \quad \text{for } 0 \le t < t_{\max}.$$
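The following is a minimal PyTorch sketch of Eqs. (3)-(5) (ours, not the authors' implementation): the filter-level Boolean mask is lifted to a prior gradient mask via β(t), and a gradient hook rescales the gradients of pruned filters before each optimizer step. Broadcasting the per-filter mask over the 4-D weight gradient and the use of `register_hook` are our implementation choices.

```python
import torch
import torch.nn as nn

def beta(t: int, t_max: int) -> float:
    """Eq. (4): cubic schedule that decays the pruned filters' gradient scale from 1 to 0."""
    return ((t_max - 1 - t) / (t_max - 1)) ** 3

def prior_gradient_mask(bool_mask: torch.Tensor, t: int, t_max: int) -> torch.Tensor:
    """Eq. (3): kept filters keep scale 1, pruned filters are scaled by beta(t)."""
    return bool_mask + beta(t, t_max) * (1.0 - bool_mask)

def guide_gradients(conv: nn.Conv2d, bool_mask: torch.Tensor, t: int, t_max: int):
    """Eq. (5): rescale each filter's gradient by the prior gradient mask during backprop."""
    m_hat = prior_gradient_mask(bool_mask, t, t_max).view(-1, 1, 1, 1)  # broadcast per filter
    return conv.weight.register_hook(lambda g: g * m_hat)  # keep the handle to remove it later

# Toy usage: the second filter is marked as pruned, so its gradient is scaled by beta(10, 100).
conv = nn.Conv2d(3, 2, kernel_size=3)
handle = guide_gradients(conv, bool_mask=torch.tensor([1.0, 0.0]), t=10, t_max=100)
conv(torch.randn(1, 3, 8, 8)).sum().backward()
print(beta(10, 100))  # ~0.727: pruned filters still learn early on, ever more slowly later
handle.remove()       # a fresh mask (and hook) would be installed for the next epoch
```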
Figure 2: Comparison of UFT in SFP based methods and our PGMPF. (a) Importance Assignment: the prior gradient mask categorizes filters into important ones and unimportant ones, denoted by red solid circles and blue solid triangles respectively. Purple dotted concentric circles indicate the importance of each filter, measured by its ℓ2-norm. (b) UFT: during fine-tuning, both UFT and PGMPF compute the gradient as normal, and the intended direction of the gradient update is denoted by a solid arrow; the new position of each filter is denoted by a blue filled triangle or a red filled circle. UFT simply moves each filter to the intended position, ignoring the pruning objective. (c) PGMPF: PGMPF scales down the learning rate of unimportant filters, denoted by dashed arrows. PGMPF is pruning-aware, gradually constraining the learning rate of those unimportant parameters.
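Contribution (3) above mentions channel-wise random dropout of the prior gradient mask; the exact scheme is not detailed in this excerpt, so the sketch below is only one plausible reading (ours): each filter's entry of the prior gradient mask is independently zeroed with some probability, so that filter receives no gradient for the step, injecting gradient noise into fine-tuning. The dropout probability is an assumption.

```python
import torch

def channelwise_mask_dropout(m_hat: torch.Tensor, drop_prob: float = 0.1) -> torch.Tensor:
    """One plausible reading of channel-wise dropout of the prior gradient mask.

    Each filter's entry of the prior gradient mask is independently zeroed with
    probability drop_prob, so that filter receives no gradient for this step --
    a source of gradient noise during fine-tuning. drop_prob = 0.1 is our
    assumption; the excerpt does not state the value used in the paper.
    """
    keep = (torch.rand_like(m_hat) >= drop_prob).float()
    return m_hat * keep

# e.g. applied on top of the prior gradient mask sketched above:
# m_hat = channelwise_mask_dropout(prior_gradient_mask(bool_mask, t, t_max))
```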